
Adversarial Robustness via Fisher-Rao Regularization

Marine Picot, Francisco Messina, Malik Boudiaf, Fabrice Labeau, 
 Ismail Ben Ayed, and Pablo Piantanida
M. Picot, F. Messina, and F. Labeau are with the Department of Electrical and Computer Engineering, McGill University, QC, Canada. Email: {marine.picot,francisco.messina}@mail.mcgill.ca
Malik Boudiaf and Ismail Ben Ayed are with ETS Montreal. Email: {malik.boudiaf.1,Ismail.BenAyed}@etsmtl.net
M. Picot is also with the Laboratoire des Signaux et Systèmes (L2S), Université Paris-Saclay, CNRS, CentraleSupélec, Gif-sur-Yvette, France. Email: [email protected]
P. Piantanida is with the International Laboratory on Learning Systems (ILLS), McGill - ETS - Mila - CNRS - Université Paris-Saclay - CentraleSupélec. Email: [email protected]
Abstract

Adversarial robustness has become a topic of growing interest in machine learning since it was observed that neural networks tend to be brittle. We propose an information-geometric formulation of adversarial defense and introduce Fire, a new Fisher-Rao regularization for the categorical cross-entropy loss, which is based on the geodesic distance between the softmax outputs corresponding to natural and perturbed input features. Exploiting the information-geometric properties of the class of softmax distributions, we derive an explicit characterization of the Fisher-Rao Distance (FRD) for the binary and multiclass cases, and establish some interesting properties as well as connections with standard regularization metrics. Furthermore, we verify on a simple linear and Gaussian model that all Pareto-optimal points in the accuracy-robustness region can be reached by Fire while other state-of-the-art methods fail. Empirically, we evaluate the performance of various classifiers trained with the proposed loss on standard datasets, showing a simultaneous improvement of up to 1% in clean and robust performance while reducing the training time by 20% over the best-performing methods.

Index Terms:
Safety AI, computer vision, adversarial training, Fisher-Rao distance, information geometry, neural networks, deep learning, adversarial regularization.

1 Introduction

Deep Neural Networks (DNNs) have achieved several breakthroughs in different fields such as computer vision, speech recognition, and Natural Language Processing (NLP). Nevertheless, it is well-known that these systems are extremely sensitive to small perturbations of the inputs [1], known as adversarial examples. Formally, an adversarial example is a corrupted input, obtained by adding a bounded optimal perturbation to the original vector, designed to fool a given neural network. Adversarial examples have already proven to be a threat in several domains, including vision and NLP [2], hence leading to the emergence of the rich area of adversarial machine learning [3]. The effectiveness of adversarial examples has been attributed to the linear regime of DNNs [4] and the geometrical structure of the data manifold itself [5], among other hypotheses. More recently, it has been related to the existence of features that are valuable for classification but meaningless for humans [6].

In this paper, we focus on the so-called white-box attacks, for which the attacker has full access to the model. However, it should be noted that black-box attacks, in which the attacker can only query predictions from the model without access to further information, are also feasible [7]. The literature on adversarial machine learning is extensive and can be divided into three overlapping groups, studying the generation, detection, and defense aspects. The simplest method to generate adversarial examples is the Fast Gradient Sign Method (FGSM) [4], whose iterative variant is known as Projected Gradient Descent (PGD) [8]. Although widely used, PGD has a few issues that can lead to overestimating the robustness of a model. AutoAttack [9] has been recently developed to overcome those problems, enabling an effective way to test and compare the different defensive schemes.

A simple approach to cope with corrupted examples is to detect and discard them before classification. For instance, [10], [11], and [12] present different methods to detect corrupted inputs. Although these ideas can be useful to ensure robustness to outliers (i.e., inputs with large deviations with respect to clean examples), they do not seem to be satisfactory solutions for mild adversarial perturbations. In addition, adversarial detection can generally be bypassed by sophisticated attack methods [13].

Recently, several works focused on improving the robustness of neural networks by investigating various defense mechanisms. For instance, certified defense mechanisms address the need for guarantees on the task performance beyond standard evaluation metrics [14, 15, 16, 17, 18, 19, 20, 21]. These methods aim at training classifiers whose predictions remain constant within a neighborhood of the original input. However, these algorithms do not achieve state-of-the-art performance yet. Also, some approaches rely on convex relaxations of the original problem [22, 23], since directly solving the adversarial problem is not tractable. Although these solutions are promising, it is still not possible to scale them to high-dimensional datasets. Finally, we could mention distillation, initially introduced in [24] and further studied in [25]. The idea of distillation is to use a large DNN (the teacher) to train a smaller one (the student), which can perform with similar accuracy while utilizing a temperature parameter to reduce sensitivity to input variations. The resulting defense strategy may be efficient against some attacks but can be defeated with the standard Carlini-Wagner attack.

In this work, we will focus on the most popular strategy for enhancing robustness, which is based on adversarial training, i.e., learning with an augmented training set containing adversarial examples [4].

1.1 Summary of contributions

Our work investigates the problem of optimizing the trade-off between accuracy and robustness and advances state-of-the-art methods in several ways.

  • We derive an explicit characterization of the Fisher-Rao Distance (FRD) based on the information-geometric properties of the soft-predictions of the neural classifier. This leads to closed-form expressions of the FRD for the binary and multiclass cases (Theorems 1 and 2, respectively). We further relate them to well-known regularization metrics (Proposition 1).

  • We propose a new formulation of adversarial defense, called FIsher-rao REgularizer (Fire). It consists of optimizing a regularized loss, which encourages the predictions of natural and perturbed samples to be close to each other, according to the manifold of the softmax distributions induced by the neural network. Our loss in Eq. (7) consists of two terms: the categorical cross-entropy, which favors natural accuracy, and a Fisher-Rao regularization term, which increases adversarial robustness. Furthermore, we prove for a simple logistic regression and Gaussian model that all Pareto-optimal points in the accuracy-robustness region can be reached by Fire, while state-of-the-art methods fail (cf. Section 4 and Proposition 2).

  • Experimentally, on standard benchmarks, we found that FIRE provides an improvement of up to roughly 2% in robust accuracy compared to the widely used Kullback-Leibler regularizer [26]. We also observed significant improvements over other state-of-the-art methods. In addition, our method typically requires, on average, less computation time (measured by the training runtime on the same GPU cluster) than state-of-the-art methods.

1.2 Related work

Adversarial training. Adversarial training (AT) [4] is one of the few defenses that has not been broken so far, and different variations of this method have been proposed. It is based on an attack-defense scheme in which the attacker's goal is to create perturbed inputs that fool the classifier by maximizing a loss, while the defender's goal is to correctly classify those attacked inputs.

Inner attack generation. The design of the inner attacker is vital to AT since it has to create meaningful attacks. One of the most popular algorithms to generate adversarial examples is Projected Gradient Descent (PGD) [4, 8], which is an iterative attack: at each step, the perturbed input is updated by adding the sign of the loss gradient scaled by a fixed step size. The loss maximized in PGD is often the same loss that is minimized for the defense.

Robust defense loss. An essential component of the defense mechanism is the robust loss used to attack and defend the network. Initially, [4] used the adversarial cross-entropy. However, it was shown that one way to improve adversarial training is through the choice of this loss. TRADES [26] introduces a robustness regularizer based on the Kullback-Leibler divergence. MART [27] uses a robustness regularizer that emphasizes misclassified inputs through boosted losses.

Additional improvements. AT, TRADES, and MART have all been improved in recent years. These improvements rely on pretraining [28], early stopping [29], curriculum learning [30], adaptive models [31], unlabeled data to improve generalization [32, 33], or additional perturbations of the model weights [34].

It should be noted that the main disadvantage of adversarial training-based methods remains their computational cost. Nevertheless, as will be shown in Section 5, Fire can significantly reduce it.

2 Background

We consider a standard supervised learning framework where $\mathbf{x}\in\mathcal{X}\subseteq\mathbb{R}^{n}$ denotes the input vector on the space $\mathcal{X}$ and $y\in\mathcal{Y}$ the class variable, where $\mathcal{Y}\coloneqq\{1,\ldots,M\}$. The unknown data distribution is denoted by $p(\mathbf{x},y)=p(\mathbf{x})\,p(y|\mathbf{x})$. We define a classifier to be a parametric soft-probability model of $p(y|\mathbf{x})$, denoted as $q_{\boldsymbol{\theta}}(y|\mathbf{x})$, where $\boldsymbol{\theta}\in\Theta$ are the parameters. This readily induces a hard decision $f_{\boldsymbol{\theta}}:\mathcal{X}\to\mathcal{Y}$ with $f_{\boldsymbol{\theta}}(\mathbf{x})\coloneqq\arg\max_{y\in\mathcal{Y}} q_{\boldsymbol{\theta}}(y|\mathbf{x})$. Adversarial examples are denoted as $\mathbf{x}'=\mathbf{x}+\boldsymbol{\delta}\in\mathcal{X}$, where $\|\boldsymbol{\delta}\|\leq\varepsilon$ for an arbitrary norm $\|\cdot\|$. Loss functions are denoted as $\ell(\mathbf{x},\mathbf{x}',y,\boldsymbol{\theta})$ and the corresponding risk functions by $L(\boldsymbol{\theta})$. We also define the natural misclassification probability as $P_e(\boldsymbol{\theta})\doteq\mathbb{P}(f_{\boldsymbol{\theta}}(\mathbf{X})\neq Y)$ and the adversarial misclassification probability as $P_e'(\boldsymbol{\theta})\doteq\mathbb{P}(f_{\boldsymbol{\theta}}(\mathbf{X}')\neq Y)$.

2.1 Adversarial learning

We provide some background on adversarial learning, focusing on the most popular loss functions proposed for adversarial defense. Adversarial examples are slightly modified inputs that can fool a target classifier. Concretely, Szegedy et al. [1] define the adversarial generation problem as:

$$\mathbf{x}' = \underset{\mathbf{x}'\,:\,\lVert\mathbf{x}'-\mathbf{x}\rVert_p<\varepsilon}{\arg\min}\ \lVert\mathbf{x}'-\mathbf{x}\rVert \quad \text{s.t.} \quad f_{\boldsymbol{\theta}}(\mathbf{x}')\neq y, \qquad (1)$$

where $y$ is the true label (supervision) associated with the sample $\mathbf{x}$. This formulation shows that the vulnerable points of a classifier are the ones close to its decision boundaries. Since this problem is difficult to tackle, it is commonly relaxed as follows [35]:

$$\mathbf{x}' = \underset{\mathbf{x}'\,:\,\lVert\mathbf{x}'-\mathbf{x}\rVert_p<\varepsilon}{\arg\max}\ \ell(\mathbf{x},\mathbf{x}',y,\boldsymbol{\theta}). \qquad (2)$$

Once adversarial examples are obtained, they can be used to learn a robust classifier as discussed next.

The adversarial problem has been presented in [8] as follows:

$$\min_{\boldsymbol{\theta}}\ \mathbb{E}_{p(\mathbf{x},y)}\left[\max_{\mathbf{x}':\|\mathbf{x}'-\mathbf{x}\|_p\leq\varepsilon}\ \ell(\mathbf{x},\mathbf{x}',y,\boldsymbol{\theta})\right], \qquad (3)$$

where $\varepsilon$ denotes the maximal distortion allowed in the adversarial examples according to the $\ell_p$-norm. Since the exact solution to the above inner maximization problem is generally intractable, it is relaxed by generating an adversarial example with an iterative algorithm such as PGD.
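For concreteness, the following is a minimal PyTorch sketch of the PGD relaxation of the inner maximization in (3) under an $\ell_\infty$ constraint; the model, loss function, step size, and iteration count are illustrative assumptions rather than the exact setup of any specific method.

```python
import torch

def pgd_attack(model, loss_fn, x, y, eps=8/255, step=2/255, n_steps=10):
    """Approximate the inner maximization of Eq. (3) with l_inf-PGD."""
    x_adv = x + eps * (2 * torch.rand_like(x) - 1)  # random start in the ball
    x_adv = x_adv.clamp(0, 1).detach()
    for _ in range(n_steps):
        x_adv.requires_grad_(True)
        loss = loss_fn(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        # ascend along the sign of the gradient, then project back onto the ball
        x_adv = x_adv.detach() + step * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv.detach()
```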

2.1.1 Adversarial Cross-Entropy (ACE)

If we take the loss to be the Cross-Entropy (CE), i.e., $\ell(\mathbf{x},\mathbf{x}',y,\boldsymbol{\theta}) = -\log q_{\boldsymbol{\theta}}(y|\mathbf{x}')$, we obtain the ACE risk:

$$L_{\text{ACE}}(\boldsymbol{\theta}) \doteq \mathbb{E}_{p(\mathbf{x},y)}\left[\max_{\mathbf{x}':\|\mathbf{x}'-\mathbf{x}\|_p\leq\varepsilon} -\log q_{\boldsymbol{\theta}}(y|\mathbf{x}')\right]. \qquad (4)$$

2.1.2 TRADES

Later, [26] defined a new risk based on a trade-off between natural and adversarial performance, controlled through a hyperparameter $\lambda$. The resulting risk is the sum of the natural cross-entropy and the Kullback-Leibler (KL) divergence between the natural and adversarial probability distributions:

$$L_{\text{TRADES}}(\boldsymbol{\theta}) \doteq \mathbb{E}_{p(\mathbf{x},y)}\Big[\max_{\mathbf{x}':\|\mathbf{x}'-\mathbf{x}\|_p\leq\varepsilon} -\log q_{\boldsymbol{\theta}}(y|\mathbf{x}) + \lambda\,\text{KL}\big(q_{\boldsymbol{\theta}}(\cdot|\mathbf{x})\,\|\,q_{\boldsymbol{\theta}}(\cdot|\mathbf{x}')\big)\Big], \qquad (5)$$

where

$$\text{KL}\big(q_{\boldsymbol{\theta}}(\cdot|\mathbf{x})\,\|\,q_{\boldsymbol{\theta}}(\cdot|\mathbf{x}')\big) \doteq \mathbb{E}_{q_{\boldsymbol{\theta}}(y|\mathbf{x})}\left[\log\frac{q_{\boldsymbol{\theta}}(y|\mathbf{x})}{q_{\boldsymbol{\theta}}(y|\mathbf{x}')}\right]. \qquad (6)$$

3 Adversarial Robustness with Fisher-Rao Regularization

3.1 Information Geometry and Statistical Manifold

Statistics on manifolds and information geometry are two different ways in which differential geometry meets statistics. A statistical manifold can be defined as a parameterized family of probability distributions (or density functions) of interest. It is worth mentioning that the concept of statistics on manifolds is very different from manifold learning, a branch of machine learning whose goal is to learn a latent manifold from data. In this paper, we are interested in the statistical manifold obtained when fixing the parameters $\boldsymbol{\theta}$ of a DNN and changing its input feature. We consider the following statistical manifold: $\mathcal{C} \doteq \{q_{\boldsymbol{\theta}}(\cdot|\mathbf{x}) : \mathbf{x}\in\mathcal{X}\}$. In particular, the focus is on changes in a neighborhood of a particular sample in an adversarial manner (i.e., considering a worst-case perturbation). Notice that the statistical manifold is different from the loss landscape. The loss landscape describes the changes of the risk function with respect to changes in the model parameters (i.e., $L(\boldsymbol{\theta})$ vs. $\boldsymbol{\theta}$), while the statistical manifold refers to the changes of the soft-probabilities of the classifier with respect to changes in the input (i.e., $q_{\boldsymbol{\theta}}(y|\mathbf{x})$ vs. $\mathbf{x}$). In order to understand the effect of a perturbation on the input, we first need to capture the distance over the statistical manifold between different probability distributions, i.e., between two different input features. That is precisely what the FRD computes, as illustrated by the red curve in Fig. 1. It is worth mentioning that the FRD can be very different from the Euclidean distance, since the latter does not depend on the shape of the manifold. For a formal mathematical definition of the FRD and a short review of basic concepts in information geometry, we refer the reader to Section 6.1.

Figure 1: Illustration of the FRD between two distributions $q_{\boldsymbol{\theta}} = q_{\boldsymbol{\theta}}(\cdot|\mathbf{x})$ and $q_{\boldsymbol{\theta}}' = q_{\boldsymbol{\theta}}(\cdot|\mathbf{x}')$ over the statistical manifold $\mathcal{C}$.

As discussed in Section 2, the robustness of a classifier is related to the distance between natural examples and the decision boundaries (i.e., points $\mathbf{x}$ such that $q_{\boldsymbol{\theta}}(y|\mathbf{x}) \approx q_{\boldsymbol{\theta}}(y'|\mathbf{x})$ for $y \neq y'$). In fact, if a natural example is far from the decision boundaries, a norm-constrained attack will clearly fail (in this case, the optimization problem (1) is infeasible). Since the decision boundaries are given by the soft-probabilities $q_{\boldsymbol{\theta}}(y|\mathbf{x})$, this can be equivalently studied by analyzing the shape of the statistical manifold $\mathcal{C}$ (which should not be confused with the loss landscape). In fact, if $q_{\boldsymbol{\theta}}(y|\mathbf{x})$ is relatively flat (i.e., does not change much) with respect to perturbations of $\mathbf{x}$ around $\mathbf{x}_0$, it is clear that adversarial perturbations will not modify the classifier decision at this point. In contrast, if $q_{\boldsymbol{\theta}}(y|\mathbf{x})$ changes sharply with perturbations of $\mathbf{x}$ around $\mathbf{x}_0$, an adversary can easily leverage this vulnerability to fool the classifier. This notion of robustness is related to the Lipschitz constant of the network, as discussed in various works (e.g., [36]). To illustrate these ideas, let us consider the logistic regression model $q_{\boldsymbol{\theta}}(y|\mathbf{x}) = 1/[1+\exp(-y\,\boldsymbol{\theta}^\intercal\mathbf{x})]$, where $n=2$ and $\mathcal{Y}=\{-1,1\}$, as a simple example. One way to visualize the statistical manifold $\mathcal{C}$ is to plot $q_{\boldsymbol{\theta}}(1|\mathbf{x})$ as a function of $\mathbf{x}$ (since $q_{\boldsymbol{\theta}}(-1|\mathbf{x}) = 1 - q_{\boldsymbol{\theta}}(1|\mathbf{x})$, this completely characterizes the manifold). This is shown in Fig. 2(a) for the value of $\boldsymbol{\theta}$ which minimizes the natural misclassification probability $P_e$ under a conditional Gaussian model for the input $\mathbf{x}$ (see Section 4 for details). As can be seen, the manifold is quite sharp around a particular region of $\mathcal{X}$. This region corresponds to the neighborhood of the points for which $\boldsymbol{\theta}^\intercal\mathbf{x} \approx 0$ as $\mathbf{x}$ is perturbed in the direction of $\boldsymbol{\theta}$. Therefore, this model is clearly non-robust, as its output can be significantly changed by small perturbations of the input. Consider now the same model but with the values of $\boldsymbol{\theta}$ obtained by minimizing the adversarial misclassification probability $P_e'$. As can be seen in Fig. 2(b), the statistical manifold is much flatter than in Fig. 2(a), which means that the model is less sensitive to adversarial perturbations of the input. Therefore, it is more robust.

Figure 2: Visualization of the statistical manifold $\mathcal{C}$ defined by the model $q_{\boldsymbol{\theta}}(y|\mathbf{x}) = 1/[1+\exp(-y\,\boldsymbol{\theta}^\intercal\mathbf{x})]$ with different values of $\boldsymbol{\theta}$: (a) $\boldsymbol{\theta} = [-6.4290, 25.7487]^\intercal$, the parameters minimizing the natural misclassification error probability $P_e$; (b) $\boldsymbol{\theta} = [-0.9364, 0.3509]^\intercal$, the parameters minimizing the adversarial misclassification error probability $P_e'$.

Let us now consider the FRD of the two models around the point $\mathbf{x} = \mathbf{0}$, which lies on the decision boundary, by letting $\boldsymbol{\delta} = \mathbf{x}' - \mathbf{x}$ vary in the $\ell_\infty$ ball $\mathcal{B}_{\infty,\varepsilon} = \{\boldsymbol{\delta} : \|\boldsymbol{\delta}\|_\infty \leq \varepsilon\}$, with $\varepsilon = 0.1$. Fig. 3(a) displays the FRD for the parameters $\boldsymbol{\theta}$ which minimize the misclassification error probability $P_e$, and Fig. 3(b) shows the FRD for the parameters $\boldsymbol{\theta}$ which minimize the adversarial misclassification error probability $P_e'$. Clearly, the abrupt transition of $q_{\boldsymbol{\theta}}(1|\mathbf{x})$ in Fig. 2(a) corresponds to a sharp increase of the FRD as $\|\boldsymbol{\delta}\|_\infty$ increases. On the contrary, for a flatter manifold as in Fig. 2(b), the FRD increases much more slowly as $\|\boldsymbol{\delta}\|_\infty$ increases. This example shows how the FRD reflects the shape of the statistical manifold $\mathcal{C}$.

Our goal in this work is to use the FRD to control the shape of the statistical manifold by regularizing the misclassification risk.

Figure 3: FRD between the distributions $q_{\boldsymbol{\theta}}(\cdot|\mathbf{x})$ and $q_{\boldsymbol{\theta}}(\cdot|\mathbf{x}')$ as a function of $\boldsymbol{\delta}$ using the logistic model with different values of $\boldsymbol{\theta}$: (a) $\boldsymbol{\theta} = [-6.4290, 25.7487]^\intercal$, the parameters minimizing the natural misclassification error probability $P_e$; (b) $\boldsymbol{\theta} = [-0.9364, 0.3509]^\intercal$, the parameters minimizing the adversarial misclassification error probability $P_e'$.

The rest of this section is organized as follows. We begin by introducing the FIRE risk function, which is our main theoretical proposal to improve the robustness of neural networks. We continue with the evaluation of the FRD given by expression (26) for the binary and multiclass classification frameworks and provide some exciting properties and connections with other standard distances and well-known information divergences.

3.2 The Fire risk function

The main proposal of this paper is the Fire risk function, defined as follows:

$$L_{\textsc{Fire}}(\boldsymbol{\theta}) \doteq \mathbb{E}_{p(\mathbf{x},y)}\Big[\max_{\mathbf{x}':\|\mathbf{x}'-\mathbf{x}\|_p\leq\varepsilon} -\log q_{\boldsymbol{\theta}}(y|\mathbf{x}) + \lambda\cdot d_{R,\mathcal{C}}^2\big(q_{\boldsymbol{\theta}}(\cdot|\mathbf{x}),\, q_{\boldsymbol{\theta}}(\cdot|\mathbf{x}')\big)\Big], \qquad (7)$$

where $\lambda > 0$ controls the trade-off between natural accuracy and robustness to the adversary.
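As a concrete illustration, here is a minimal PyTorch sketch of an empirical version of (7) for a mini-batch. It assumes the closed-form multiclass FRD of Eq. (15), derived in Section 3.4, and an adversarial batch x_adv produced by the inner maximization (in practice only the FRD term depends on $\mathbf{x}'$, so x_adv can be generated by PGD on the FRD, as described in Section 5.1); the function names and the default value of lam are illustrative.

```python
import torch
import torch.nn.functional as F

def fisher_rao_distance(logits_nat, logits_adv):
    """Closed-form FRD of Eq. (15): 2*arccos of the Bhattacharyya coefficient."""
    p = F.softmax(logits_nat, dim=1)
    q = F.softmax(logits_adv, dim=1)
    bc = (p * q).sqrt().sum(dim=1).clamp(-1 + 1e-7, 1 - 1e-7)
    return 2.0 * torch.acos(bc)

def fire_loss(model, x, x_adv, y, lam=12.0):
    """Empirical version of Eq. (7): natural cross-entropy + lambda * squared FRD."""
    logits_nat = model(x)
    logits_adv = model(x_adv)
    ce = F.cross_entropy(logits_nat, y)
    frd = fisher_rao_distance(logits_nat, logits_adv)
    return ce + lam * (frd ** 2).mean()
```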

In Fig. 4, we show the shape of the statistical manifold $\mathcal{C}$ as $\lambda$ is varied for the logistic regression model discussed in Section 3.1 (see also Section 4). Notice that when no FRD regularization is used, the manifold in Fig. 4(a) is very similar to the one in Fig. 2(a). As the value of $\lambda$ increases, the weight of the FRD regularization term also increases. As a consequence, the statistical manifold is flattened, as expected and as illustrated in Fig. 4(b). However, as shown in Fig. 4(c), setting $\lambda$ to a very high value causes the statistical manifold to become extremely flat. This means that the model is basically independent of the input, and the classification performance will be poor. Notice the similarities between Fig. 2 and Fig. 4.

Figure 4: Visualization of the statistical manifold $\mathcal{C}$ defined by the model $q_{\boldsymbol{\theta}}(y|\mathbf{x}) = 1/[1+\exp(-y\,\boldsymbol{\theta}^\intercal\mathbf{x})]$ when minimizing the Fire risk function for different values of $\lambda$: (a) $\lambda = 0$ (no adversarial FRD regularization), (b) $\lambda = 3.8957$ (medium adversarial FRD regularization), (c) $\lambda = 98.1730$ (high adversarial FRD regularization).

In what follows, we derive closed-form expressions of the FRD for general classification problems. However, for the sake of clarity, we begin with the binary case.

3.3 FRD for the case of binary classification

Let us first consider the binary classification setting, in which $\mathcal{X}\subseteq\mathbb{R}^n$ and $\mathcal{Y}=\{-1,1\}$ are the input and label spaces, respectively. Consider an arbitrary given model:

$$q_{\boldsymbol{\theta}}(y|\mathbf{x}) = \frac{1}{1+e^{-h_{\boldsymbol{\theta}}(\mathbf{x})\,y}}. \qquad (8)$$

Here, $h_{\boldsymbol{\theta}}(\mathbf{x})$ represents an arbitrary parametric representation or latent code of the input $\mathbf{x}$. As a matter of fact, we only need to assume that $h_{\boldsymbol{\theta}}$ is a smooth function. The FRD for this model can be computed in closed form, as shown in the following result. The proof is relegated to Section 6.2.

Theorem 1 (FRD for binary classifier).

The FRD between soft-predictions $q_{\boldsymbol{\theta}} \equiv q_{\boldsymbol{\theta}}(\cdot|\mathbf{x})$ and $q_{\boldsymbol{\theta}}' \equiv q_{\boldsymbol{\theta}}(\cdot|\mathbf{x}')$, according to (8) and corresponding to inputs $\mathbf{x}$ and $\mathbf{x}'$, is given by

$$d_{R,\mathcal{C}}(q_{\boldsymbol{\theta}}, q_{\boldsymbol{\theta}}') = 2\,\Big|\arctan\big(e^{h_{\boldsymbol{\theta}}(\mathbf{x}')/2}\big) - \arctan\big(e^{h_{\boldsymbol{\theta}}(\mathbf{x})/2}\big)\Big|. \qquad (9)$$
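A one-line implementation of (9) makes its saturating behavior easy to probe numerically; the following sketch (plain NumPy, with illustrative latent codes) shows that the same gap in the latent space contributes much less to the FRD once $|h_{\boldsymbol{\theta}}(\mathbf{x})|$ is large.

```python
import numpy as np

def frd_binary(h, h_prime):
    """Eq. (9): FRD between the sigmoid outputs induced by latent codes h and h'."""
    return 2.0 * np.abs(np.arctan(np.exp(h_prime / 2)) - np.arctan(np.exp(h / 2)))

# The same latent-space gap of 4 yields very different FRDs depending on
# where it occurs: near the decision boundary (h = 0) or far from it (h = 4).
print(frd_binary(0.0, 4.0))  # ~1.30
print(frd_binary(4.0, 8.0))  # ~0.23
```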

For illustration purposes, Fig. 5(a) shows the behavior of the FRD with respect to changes in the latent code, compared with the Euclidean distance. It can be observed that the resulting FRD is rather sensitive to variations in the latent space when $h_{\boldsymbol{\theta}}(\mathbf{x}) \approx 0$, while being close to zero in the region where $|h_{\boldsymbol{\theta}}(\mathbf{x})|$ is large and $|h_{\boldsymbol{\theta}}(\mathbf{x}')| \gg |h_{\boldsymbol{\theta}}(\mathbf{x})|$. This asymmetric behavior is in sharp contrast with that of the Euclidean distance. However, these quantities are related, as shown by the next proposition.

Figure 5: Comparison between (a) the Fisher-Rao distance and (b) the Euclidean distance.
Proposition 1 (FRD vs. Euclidean distance).

The Fisher-Rao distance can be bounded as follows:

$$d_{R,\mathcal{C}}(q_{\boldsymbol{\theta}}, q_{\boldsymbol{\theta}}') \leq \frac{1}{2}\,\big|h_{\boldsymbol{\theta}}(\mathbf{x}') - h_{\boldsymbol{\theta}}(\mathbf{x})\big|, \qquad (10)$$

for any pair of inputs $\mathbf{x}, \mathbf{x}' \in \mathcal{X}$.

The proof of this proposition is relegated to Section C.1.

3.3.1 Logistic regression:

A particular case of significant importance is that of logistic regression: $h_{\boldsymbol{\theta}}(\mathbf{x}) = \boldsymbol{\theta}^\intercal\mathbf{x}$. In this case, the FRD reduces to:

$$d_{R,\mathcal{C}}(q_{\boldsymbol{\theta}}, q_{\boldsymbol{\theta}}') = 2\left|\arctan\big(e^{\boldsymbol{\theta}^\intercal\mathbf{x}'/2}\big) - \arctan\big(e^{\boldsymbol{\theta}^\intercal\mathbf{x}/2}\big)\right|. \qquad (11)$$

A first-order Taylor approximation in the variable $\boldsymbol{\delta} = \mathbf{x}' - \mathbf{x}$, followed by a maximization over $\boldsymbol{\delta}$ such that $\|\boldsymbol{\delta}\| \leq \varepsilon$, yields

$$d_{R,\mathcal{C}}(q_{\boldsymbol{\theta}}, q_{\boldsymbol{\theta}}') \approx \frac{\varepsilon\,\|\boldsymbol{\theta}\|_*}{2\cosh(\boldsymbol{\theta}^\intercal\mathbf{x}/2)}, \qquad (12)$$

where $\|\cdot\|_*$ is the dual norm of $\|\cdot\|$, defined as $\|\mathbf{z}\|_* \doteq \sup\{|\mathbf{z}^\intercal\mathbf{w}| : \|\mathbf{w}\| \leq 1\}$. Therefore, in this case, we obtain a weighted dual-norm regularization on $\boldsymbol{\theta}$, with the weighting being large when $\boldsymbol{\theta}^\intercal\mathbf{x}$ is close to zero (i.e., points with large uncertainty in the class assignment) and small when $|\boldsymbol{\theta}^\intercal\mathbf{x}|$ is large (i.e., points with low uncertainty in the class assignment). An even more direct connection between the FRD and dual-norm regularization on $\boldsymbol{\theta}$ can be obtained from Proposition 1, which leads to

$$d_{R,\mathcal{C}}(q_{\boldsymbol{\theta}}, q_{\boldsymbol{\theta}}') \leq \frac{1}{2}\,|\boldsymbol{\delta}^\intercal\boldsymbol{\theta}| \leq \frac{1}{2}\,\varepsilon\,\|\boldsymbol{\theta}\|_*. \qquad (13)$$

Eq. (13) formalizes our intuitive idea that $\ell_p$-regularized classifiers tend to be more robust. This is in agreement with other results, e.g., [37].
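The first-order expression (12) is easy to check against the exact FRD (11). The sketch below (NumPy, 2-norm adversary, with illustrative values of theta, x, and eps) compares both for a worst-case perturbation, for which the two agree to first order in $\varepsilon$.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, x = rng.normal(size=2), rng.normal(size=2)
eps = 0.01  # small radius, so the first-order expansion is accurate

def frd_logistic(theta, x, x_prime):
    """Exact FRD of Eq. (11) for the logistic model."""
    return 2 * np.abs(np.arctan(np.exp(theta @ x_prime / 2))
                      - np.arctan(np.exp(theta @ x / 2)))

# For the 2-norm, the dual norm is again the 2-norm and the worst-case
# perturbation is aligned with theta.
delta = eps * theta / np.linalg.norm(theta)
exact = frd_logistic(theta, x, x + delta)
approx = eps * np.linalg.norm(theta) / (2 * np.cosh(theta @ x / 2))
print(exact, approx)  # the two values agree to first order in eps
```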

3.4 FRD for the case of multiclass classification

Consider the general $M$-class classification problem in which $\mathcal{Y} = \{1,\dots,M\}$, and let

$$q_{\boldsymbol{\theta}}(y|\mathbf{x}) = \frac{e^{h_y(\mathbf{x},\boldsymbol{\theta})}}{\sum_{y'\in\mathcal{Y}} e^{h_{y'}(\mathbf{x},\boldsymbol{\theta})}}, \qquad (14)$$

be a standard softmax output, where $\mathbf{h} : \mathcal{X}\times\Theta \to \mathbb{R}^M$ is a parametric representation function and $z_y$ denotes the $y$-th component of a vector $\mathbf{z}$. The FRD for this model can also be obtained in closed form, as summarized below. The proof is given in Section 6.3.

Theorem 2 (FRD multiclass classifier).

The FRD between soft-predictions $q_{\boldsymbol{\theta}} \equiv q_{\boldsymbol{\theta}}(\cdot|\mathbf{x})$ and $q_{\boldsymbol{\theta}}' \equiv q_{\boldsymbol{\theta}}(\cdot|\mathbf{x}')$, according to (14) and corresponding to inputs $\mathbf{x}$ and $\mathbf{x}'$, is given by

$$d_{R,\mathcal{C}}(q_{\boldsymbol{\theta}}, q_{\boldsymbol{\theta}}') = 2\arccos\left(\sum_{y\in\mathcal{Y}} \sqrt{q_{\boldsymbol{\theta}}(y|\mathbf{x})\, q_{\boldsymbol{\theta}}(y|\mathbf{x}')}\right). \qquad (15)$$

Remark. Although not obvious, the FRD for the multiclass case (Eq. (15)) is indeed consistent with the FRD for the binary case (Eq. (9)), i.e., they are equal for the case $M = 2$ (for further details, the reader is referred to Section 6.3).
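This consistency is easy to verify numerically; in the NumPy sketch below (with arbitrary illustrative latent codes), Eq. (15) evaluated on the two-class softmax distributions coincides with Eq. (9):

```python
import numpy as np

def frd_multiclass(p, q):
    """Eq. (15): 2*arccos of the Bhattacharyya coefficient of two softmax outputs."""
    bc = np.sum(np.sqrt(p * q))
    return 2 * np.arccos(np.clip(bc, -1.0, 1.0))

def frd_binary(h, h_prime):
    """Eq. (9), for comparison."""
    return 2 * np.abs(np.arctan(np.exp(h_prime / 2)) - np.arctan(np.exp(h / 2)))

h, h_prime = 0.7, -1.2  # illustrative latent codes
p = np.array([1.0, np.exp(-h)]) / (1.0 + np.exp(-h))              # [q(1|x), q(-1|x)]
q = np.array([1.0, np.exp(-h_prime)]) / (1.0 + np.exp(-h_prime))  # [q(1|x'), q(-1|x')]
print(frd_multiclass(p, q), frd_binary(h, h_prime))  # both ~0.91
```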

3.5 Comparison between FRD and KL divergence

The Fisher-Rao distance (15) has some interesting connections with other distances and information divergences. We are particularly interested in its relation with the KL divergence, which is the adversarial regularization mechanism used in the TRADES method [26]. The next theorem summarizes the mathematical connection between these quantities. The proof of this theorem is relegated to Section C.2.

Theorem 3 (Relation between FRD and KL divergence).

The FRD between soft-predictions $q_{\boldsymbol{\theta}} = q_{\boldsymbol{\theta}}(\cdot|\mathbf{x})$ and $q_{\boldsymbol{\theta}}' = q_{\boldsymbol{\theta}}(\cdot|\mathbf{x}')$ given by (15) is related to the KL divergence through the inequality:

$$1 - \cos\left(\frac{d_{R,\mathcal{C}}(q_{\boldsymbol{\theta}}, q_{\boldsymbol{\theta}}')}{2}\right) \leq \frac{1}{2}\,\text{KL}(q_{\boldsymbol{\theta}} \| q_{\boldsymbol{\theta}}'), \qquad (16)$$

which means that the KL divergence is a surrogate of the FRD. In addition, it can also be shown that the KL divergence is a second-order approximation of the FRD, i.e.,

$$\text{KL}(q_{\boldsymbol{\theta}} \| q_{\boldsymbol{\theta}}') = \frac{1}{2}\, d_{R,\mathcal{C}}^2(q_{\boldsymbol{\theta}}, q_{\boldsymbol{\theta}}') + \mathcal{O}\big(d_{R,\mathcal{C}}^3(q_{\boldsymbol{\theta}}, q_{\boldsymbol{\theta}}')\big), \qquad (17)$$

where $\mathcal{O}(\cdot)$ denotes big-O notation.

The above result shows that the KL divergence is a weak approximation of the FRD, in the sense that it gives an upper bound and a second-order approximation of the geodesic distance. However, in general, we are interested in distances between arbitrarily distinct softmax distributions, and the KL divergence and the FRD can behave very differently in this regime. In fact, only the latter measures the actual distance on the statistical manifold $\mathcal{C} \doteq \{q_{\boldsymbol{\theta}}(\cdot|\mathbf{x}) : \mathbf{x}\in\mathcal{X}\}$ (for further details, the reader is referred to Section 6.1). In the next section, we show that this has an important consequence on the set of solutions obtained by minimizing the respective empirical risks while varying $\lambda$.
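Both claims of Theorem 3 can be probed numerically. The sketch below (NumPy, with random Dirichlet-distributed softmax vectors as illustrative inputs) checks inequality (16) and shows that the ratio $\text{KL}/(\frac{1}{2}d_{R,\mathcal{C}}^2)$ approaches 1 only for nearby distributions:

```python
import numpy as np

rng = np.random.default_rng(1)

def frd(p, q):
    """Eq. (15)."""
    return 2 * np.arccos(np.clip(np.sum(np.sqrt(p * q)), -1.0, 1.0))

def kl(p, q):
    """KL divergence between two explicit probability vectors."""
    return np.sum(p * np.log(p / q))

p = rng.dirichlet(np.ones(5))
q_near = p + 1e-3 * (rng.dirichlet(np.ones(5)) - p)  # small perturbation of p
q_far = rng.dirichlet(np.ones(5))                    # unrelated distribution

for q in (q_near, q_far):
    d = frd(p, q)
    print(1 - np.cos(d / 2) <= 0.5 * kl(p, q))  # inequality (16): always True
    print(kl(p, q) / (0.5 * d ** 2))            # ~1 only in the 'near' regime
```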

Figure 6: Plot of all the possible points $(1 - P_e(\boldsymbol{\theta}), 1 - P_e'(\boldsymbol{\theta}))$ for the Gaussian model with $\varepsilon = 0.1$, $\boldsymbol{\mu} = [-0.0218; 0.0425]$ and $\boldsymbol{\Sigma} = [0.0212, 0.0036; 0.0036, 0.0042]$, shown in blue. (a) In red, the Pareto-optimal points. (b) In black, the solutions obtained by minimizing the risk $L_{\text{TRADES}}(\boldsymbol{\theta})$ in equation (5). (c) In black, the solutions obtained by minimizing the risk $L_{\textsc{Fire}}(\boldsymbol{\theta})$ in (7).

4 Accuracy-Robustness Trade-offs and Learning in the Gaussian Model

To illustrate the Fire loss and the role of the Fisher-Rao distance in encouraging robustness, we study the natural-adversarial accuracy trade-off for a simple logistic regression and Gaussian model, and compare the performance of the predictor trained on the Fire loss with those of the ACE and TRADES losses, in equations (4) and (5), respectively.

4.1 Accuracy-robustness trade-offs

Consider a binary setting with a simplified logistic regression model: throughout this section, we assume that $\mathcal{Y} = \{-1,1\}$ and that the softmax probability is given by

$$q_{\boldsymbol{\theta}}(y|\mathbf{x}) = \frac{1}{1+\exp(-y\,\boldsymbol{\theta}^\intercal\mathbf{x})}. \qquad (18)$$

We choose the standard adversary obtained by maximizing the cross-entropy loss, i.e.,

$$\mathbf{x}'^* = \underset{\mathbf{x}':\|\mathbf{x}'-\mathbf{x}\|\leq\varepsilon}{\arg\max}\; -\log q_{\boldsymbol{\theta}}(y|\mathbf{x}'). \qquad (19)$$

For simplicity, in this section we assume that the adversary uses the 2-norm (i.e., $\|\cdot\| = \|\cdot\|_2$); the analysis can be extended to an arbitrary norm, but it is somewhat simplified in the 2-norm case. Then, $\mathbf{x}'^*$ can be written as $\mathbf{x}'^* = \mathbf{x} - \varepsilon\, y\, \boldsymbol{\theta}/\|\boldsymbol{\theta}\|_2$.

We also assume that the classes are equally likely and that the conditional inputs given the class are Gaussian, with the particular form $\mathbf{x}|y \sim \mathcal{N}(y\boldsymbol{\mu}, \boldsymbol{\Sigma})$. In this case, we can write the natural and adversarial misclassification probabilities as:

$$P_e(\boldsymbol{\theta}) \doteq \mathbb{P}(f_{\boldsymbol{\theta}}(\mathbf{X}) \neq Y) = \Phi\left(\frac{-\boldsymbol{\theta}^\intercal\boldsymbol{\mu}}{\sqrt{\boldsymbol{\theta}^\intercal\boldsymbol{\Sigma}\boldsymbol{\theta}}}\right), \qquad (20)$$
$$P_e'(\boldsymbol{\theta}) \doteq \mathbb{P}(f_{\boldsymbol{\theta}}(\mathbf{X}'^*) \neq Y) = \Phi\left(\frac{\varepsilon\|\boldsymbol{\theta}\|_2 - \boldsymbol{\theta}^\intercal\boldsymbol{\mu}}{\sqrt{\boldsymbol{\theta}^\intercal\boldsymbol{\Sigma}\boldsymbol{\theta}}}\right), \qquad (21)$$

where $\Phi$ denotes the cumulative distribution function of the standard normal random variable. The following result provides lower and upper bounds for $P_e'(\boldsymbol{\theta})$ in terms of $\varepsilon$ and the eigenvalues of $\boldsymbol{\Sigma}$. Notice that the bounds get sharper as $\varepsilon$ or the spread of $\boldsymbol{\Sigma}$ decreases, and are tight if $\boldsymbol{\Sigma} = \sigma^2\mathbf{I}$.
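Expressions (20) and (21) translate directly into code; the following sketch (NumPy/SciPy, with the function name nat_adv_error being an illustrative choice) computes both probabilities for a given $\boldsymbol{\theta}$:

```python
import numpy as np
from scipy.stats import norm

def nat_adv_error(theta, mu, Sigma, eps):
    """Eqs. (20)-(21): natural and adversarial misclassification probabilities."""
    s = np.sqrt(theta @ Sigma @ theta)
    p_e = norm.cdf(-theta @ mu / s)
    p_e_adv = norm.cdf((eps * np.linalg.norm(theta) - theta @ mu) / s)
    return p_e, p_e_adv
```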

Proposition 2 (Accuracy-robustness trade-offs).

The adversarial misclassification probability $P_e'(\boldsymbol{\theta})$ satisfies the inequalities:

$$\Phi\left(\frac{\varepsilon}{\lambda_{\max}^{1/2}(\boldsymbol{\Sigma})} + \Phi^{-1}(P_e(\boldsymbol{\theta}))\right) \leq P_e'(\boldsymbol{\theta}), \qquad (22)$$
$$\Phi\left(\frac{\varepsilon}{\lambda_{\min}^{1/2}(\boldsymbol{\Sigma})} + \Phi^{-1}(P_e(\boldsymbol{\theta}))\right) \geq P_e'(\boldsymbol{\theta}). \qquad (23)$$

The proof of this proposition is relegated to Section 6.4.

4.2 Learning

Let us consider the 2-dimensional case, i.e., $n = 2$. From expressions (20) and (21), it can be noticed that both $P_e(\boldsymbol{\theta})$ and $P_e'(\boldsymbol{\theta})$ are independent of the 2-norm of $\boldsymbol{\theta}$. Therefore, we can parameterize $\boldsymbol{\theta}$ as $\boldsymbol{\theta} = [\cos(\alpha), \sin(\alpha)]^\intercal$ with $\alpha \in [0, 2\pi)$ without any loss of generality. Thus, $(1 - P_e(\boldsymbol{\theta}), 1 - P_e'(\boldsymbol{\theta}))$ for all values of $\alpha$ gives a curve that represents all the possible values of the natural and adversarial accuracies for this setting. Fig. 6(a) shows this curve for a particular choice of $\varepsilon$, $\boldsymbol{\mu}$, and $\boldsymbol{\Sigma}$ (the components of $\boldsymbol{\mu}$ were obtained by sampling $\mathcal{N}(0, 1/400)$; a matrix $\mathbf{A}$ was filled with samples of the same distribution and $\boldsymbol{\Sigma}$ constructed as $\boldsymbol{\Sigma} = \mathbf{A}\mathbf{A}^\intercal$). As can be observed, the solution which maximizes the natural accuracy gives poor adversarial accuracy and vice versa: the (normalized) value of $\boldsymbol{\theta}$ that maximizes $1 - P_e(\boldsymbol{\theta})$ is $\boldsymbol{\theta}_{\text{nat}}^* = \boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}/\|\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}\|_2$ (see, for instance, [38]), corresponding to $\alpha_{\text{nat}}^* \approx 1.814$ and giving $1 - P_e(\boldsymbol{\theta}) \approx 0.784$ and $1 - P_e'(\boldsymbol{\theta}) \approx 0.183$, while the value of $\boldsymbol{\theta}$ that maximizes $1 - P_e'(\boldsymbol{\theta})$ corresponds to $\alpha_{\text{adv}}^* \approx 2.783$, giving $1 - P_e(\boldsymbol{\theta}) \approx 0.608$ and $1 - P_e'(\boldsymbol{\theta}) \approx 0.309$. The set of Pareto-optimal points, i.e., the set of points for which there is no possible improvement in terms of both natural and adversarial accuracy (equivalently, the set $\{\max_{\boldsymbol{\theta}}\ \beta(1 - P_e(\boldsymbol{\theta})) + (1 - \beta)(1 - P_e'(\boldsymbol{\theta})) : 0 \leq \beta \leq 1\}$), is also shown in Fig. 6(a). In particular, this set contains the Maximum Average Accuracy (MAA), given by

$$\text{MAA} \doteq \max_{\boldsymbol{\theta}\in\Theta}\;\left(1 - \frac{P_e(\boldsymbol{\theta}) + P_e'(\boldsymbol{\theta})}{2}\right). \qquad (24)$$

This metric is of particular importance, as it combines natural and adversarial accuracies with equal weights.

In addition, we present the solution of the (local) Empirical Risk Minimization (ERM) for the TRADES risk function as defined in (5) for different values of $\lambda$ in Fig. 6(b). (For these experiments, we used $10^4$ samples per class and a BFGS Quasi-Newton method for optimization; the initial value of $\boldsymbol{\theta}$ is zero for both TRADES and FIRE. We do not report results using the ACE risk in (4) because in this setting we would obtain the trivial solution $\boldsymbol{\theta} = \mathbf{0}$, which is a minimizer of $L_{\text{ACE}}(\boldsymbol{\theta})$.) As can be seen, the curve obtained in the $(1 - P_e(\boldsymbol{\theta}), 1 - P_e'(\boldsymbol{\theta}))$ space covers a large part of the Pareto-optimal points, except for a segment over which the solution does not behave well. Finally, in Fig. 6(c) we present the result for the proposed Fire risk function in (7) for different values of $\lambda$. In this case, we observed that the curve of solutions in the $(1 - P_e(\boldsymbol{\theta}), 1 - P_e'(\boldsymbol{\theta}))$ space covers all the Pareto-optimal points. Moreover, we observed that, for some $\lambda$, the $\boldsymbol{\theta}$ which minimizes the Fire risk matches the $\boldsymbol{\theta}$ which achieves the MAA defined in (24), while the TRADES method fails to achieve this particularly relevant point. It should be added that none of the methods covers exactly the set of Pareto-optimal points, which is expected, since all loss functions can be considered surrogates for the quantity $\beta P_e(\boldsymbol{\theta}) + (1-\beta) P_e'(\boldsymbol{\theta})$, where $0 \leq \beta \leq 1$. We performed a similar comparison between FRD and KL using standard datasets; the results are reported in Appendix A.
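Since $P_e$ and $P_e'$ depend only on the direction of $\boldsymbol{\theta}$, the accuracy region of Fig. 6 can be traced with a simple sweep over $\alpha$. The sketch below reuses the nat_adv_error helper from Section 4.1 together with the constants reported in the caption of Fig. 6, and evaluates the MAA of Eq. (24) over the sweep.

```python
import numpy as np

mu = np.array([-0.0218, 0.0425])
Sigma = np.array([[0.0212, 0.0036],
                  [0.0036, 0.0042]])
eps = 0.1

# Sweep theta = [cos(alpha), sin(alpha)] to trace the accuracy region.
alphas = np.linspace(0.0, 2 * np.pi, 2000, endpoint=False)
accs = np.array([
    [1 - p for p in nat_adv_error(np.array([np.cos(a), np.sin(a)]), mu, Sigma, eps)]
    for a in alphas
])

maa = np.max(accs.mean(axis=1))  # empirical MAA of Eq. (24) over the sweep
print(f"MAA ~ {maa:.3f}")
```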

TABLE I: Comparison between the KL- and Fisher-Rao-based regularizers under the white-box $\ell_\infty$ threat model. Note that we do not use the same hyperparameters as presented in [26] for the TRADES method.
| Defense | Dataset | $\varepsilon$ | Structure | Natural | AutoAttack | Avg. Acc. | Runtime |
|---|---|---|---|---|---|---|---|
| TRADES | MNIST | 0.3 | CNN | 99.27 ± 0.03 | 94.27 ± 0.18 | 96.77 ± 0.09 | 2h22 |
| FIRE | MNIST | 0.3 | CNN | 99.22 ± 0.02 | 94.44 ± 0.14 | 96.83 ± 0.10 | 2h06 |
| TRADES | CIFAR-10 | 8/255 | WRN-34-10 | 85.84 ± 0.31 | 50.47 ± 0.36 | 68.15 ± 0.23 | 13h49 |
| FIRE | CIFAR-10 | 8/255 | WRN-34-10 | 85.98 ± 0.09 | 51.45 ± 0.32 | 68.72 ± 0.22 | 11h00 |
| TRADES | CIFAR-100 | 8/255 | WRN-34-10 | 59.62 ± 0.42 | 25.89 ± 0.26 | 42.76 ± 0.26 | 13h49 |
| FIRE | CIFAR-100 | 8/255 | WRN-34-10 | 61.03 ± 0.21 | 26.42 ± 0.21 | 43.73 ± 0.12 | 11h10 |
TABLE II: Test robustness on different datasets under white-box $\ell_\infty$ attack. We ran all methods on 5 different tries and report the mean and standard deviation. The codes for UAT and Atzmon et al. are not publicly available. Note that retraining the SOTA methods slightly modifies the experimental results.
| Defense | Dataset | $\varepsilon$ | Structure | Natural | AutoAttack | Avg. Acc. | Runtime |
|---|---|---|---|---|---|---|---|
| Without Additional Data | | | | | | | |
| Madry et al. [8] | MNIST | 0.3 | CNN | 98.53 ± 0.06 | 88.62 ± 0.23 | 93.58 ± 0.14 | 2h03 |
| Atzmon et al. [30] | MNIST | 0.3 | CNN | 99.35 | 90.85 | 95.10 | - |
| TRADES [26] | MNIST | 0.3 | CNN | 99.27 ± 0.03 | 94.27 ± 0.18 | 96.77 ± 0.09 | 2h22 |
| FIRE | MNIST | 0.3 | CNN | 99.22 ± 0.02 | 94.44 ± 0.14 | 96.83 ± 0.10 | 2h06 |
| Madry et al. [8] | CIFAR-10 | 8/255 | WRN-34-10 | 87.56 ± 0.09 | 44.07 ± 0.27 | 65.82 ± 0.15 | 10h51 |
| Self Adaptive [31] | CIFAR-10 | 8/255 | WRN-34-10 | 83.39 ± 0.19 | 53.11 ± 0.29 | 68.25 ± 0.14 | 13h57 |
| TRADES [26] | CIFAR-10 | 8/255 | WRN-34-10 | 84.79 ± 0.24 | 52.12 ± 0.28 | 68.45 ± 0.12 | 17h49 |
| FIRE + Self Adaptive | CIFAR-10 | 8/255 | WRN-34-10 | 83.70 ± 0.36 | 53.26 ± 0.19 | 68.48 ± 0.13 | 11h12 |
| Overfitting [29] | CIFAR-10 | 8/255 | WRN-34-10 | 85.64 ± 0.55 | 51.72 ± 0.56 | 68.68 ± 0.44 | 42h01 |
| FIRE | CIFAR-10 | 8/255 | WRN-34-10 | 85.98 ± 0.09 | 51.45 ± 0.32 | 68.72 ± 0.22 | 11h00 |
| Overfitting [29] | CIFAR-100 | 8/255 | RN-18 | 53.83 | 18.95 | 36.39 | - |
| Overfitting [29] | CIFAR-100 | 8/255 | WRN-34-10 | 59.22 ± 0.61 | 25.99 ± 0.51 | 42.61 ± 0.28 | 42h08 |
| FIRE | CIFAR-100 | 8/255 | WRN-34-10 | 61.03 ± 0.21 | 26.42 ± 0.21 | 43.73 ± 0.12 | 11h10 |
| With Additional Data Using 80M-TI | | | | | | | |
| Pre-training [28] | CIFAR-10 | 8/255 | WRN-28-10 | 86.93 ± 0.79 | 53.35 ± 0.81 | 70.14 ± 0.54 | 40h00 + 0h20 |
| UAT [33] | CIFAR-10 | 8/255 | WRN-106-8 | 86.46 | 56.03 | 71.24 | - |
| MART [27] | CIFAR-10 | 8/255 | WRN-28-10 | 87.39 ± 0.12 | 56.69 ± 0.28 | 72.04 ± 0.15 | 13h53 |
| RST-adv [32] | CIFAR-10 | 8/255 | WRN-28-10 | 89.49 ± 0.41 | 59.69 ± 0.26 | 74.59 ± 0.27 | 22h12 |
| FIRE | CIFAR-10 | 8/255 | WRN-28-10 | 89.73 ± 0.04 | 59.97 ± 0.11 | 74.86 ± 0.05 | 18h30 |

5 Experimental Results

In this section, we assess the effectiveness of our proposed Fire loss in improving the robustness of neural networks. Codes are available on GitHub at https://github.com/MarinePICOT/Adversarial-Robustness-via-Fisher-Rao-Regularization.

5.1 Setup

Datasets: We resort to standard benchmarks. First, we use MNIST [39], composed of 70,000 black-and-white images of size 28×28 (60,000 for training and 10,000 for testing) divided into 10 different classes. Then, we test on CIFAR-10 and CIFAR-100 [40], each composed of 60,000 color images of size 32×32×3 (50,000 for training and 10,000 for testing) divided into 10 and 100 classes, respectively. Finally, we also test on CIFAR-10 with additional data from 80 Million Tiny Images [41], an experiment that will be detailed later.

Architectures: In order to provide fair comparisons, we use standard model architectures. For the MNIST simulations, we use the 7-layer CNN as in [26]. For CIFAR-10 and CIFAR-100, we use a WideResNet (WRN) with 34 layers and a widening factor of 10 (shortened as WRN-34-10), as in [26, 8, 34]. For the simulations with additional data, we use a WRN-28-10, as in [32].

Training procedure: For all standard experiments (without additional data), we use a Stochastic Gradient Descent (SGD) optimizer with a momentum of 0.9, a weight decay of $5\cdot 10^{-4}$, and Nesterov momentum. We train our models for 100 epochs with a batch size of 256. The initial learning rate is set to 0.01 for MNIST and 0.1 for CIFAR-10 and CIFAR-100. Following [26], the learning rate is divided by 10 at epochs 75 and 90. For our experiments with additional data, we follow the protocol introduced by [32], using a cosine decay [42] and training for 200 epochs.

Generation of adversarial samples: For all experiments, we use PGD [8] to generate the adversarial examples during training. The loss maximized during the PGD algorithm is the Fisher-Rao distance (FRD) for our experiments, and the Kullback-Leibler divergence for the TRADES method. We use 40 steps for MNIST and 10 steps for the rest. The step size is set to 0.01 for MNIST and 0.007 for the others. The maximal $\ell_\infty$ distortion $\varepsilon$ allowed is 0.031 for CIFAR-10 and CIFAR-100, and 0.3 for MNIST. This setting is used in most methods, such as [4, 32].

Additional data: To improve performance, [32] propose the use of additional data when training on CIFAR-10. Specifically, [32] use 500k additional images from 80M-TI (available at https://github.com/yaircarmon/semisup-adv). Those images were selected such that their $\ell_2$-distance to images from CIFAR-10 is below a threshold.

Hyperparameters: The Fisher-Rao regularizer used to improve the robustness of neural networks introduces a hyperparameter $\lambda$ to balance natural accuracy and adversarial robustness. We select $\lambda = 12$ based on the CIFAR-10 simulations and use this value for all the other datasets. A further study of the effect of $\lambda$ is provided in Section 5.2.3.

Test metrics: First, we provide the accuracy of the model on clean samples after adversarial training (Natural). Second, we test our models using the recently introduced AutoAttack [9], keeping the same hyperparameters as the ones used in the original paper and code (available at https://github.com/fra31/auto-attack). AutoAttack tests the model under a comprehensive series of attacks and provides a more reliable assessment of robustness than the traditionally used PGD-based evaluation. Given that we care equally about natural and adversarial accuracies, we also compute their average, i.e., the Average Accuracy (Avg. Acc.). This is an empirical version of the Maximum Average Accuracy (MAA) defined in (24). Finally, we report the runtime of each method as the time required to complete the adversarial training. To provide fair comparisons between runtimes, we run the official code of each method on the same 4 Nvidia V100 GPUs (for further details, see Section 5.2.2).

5.2 Experimental results

5.2.1 Kullback-Leibler versus Fisher-Rao regularizer:

To disentangle the influence of different regularizers, we compare the Fisher-Rao-based regularizer to the Kullback-Leibler-based regularizer used in TRADES, with the exact same model and hyperparameters (as detailed previously), on CIFAR-10, CIFAR-100, and MNIST. We used $\lambda = 6$ to train the TRADES method, since it is the value presented in the original paper [26]. The results are averaged over 5 tries and summarized in Table I. They confirm the superiority of the proposed regularizer. Specifically, the natural and adversarial accuracies increase by up to 1% each under AutoAttack, improving the trade-off by up to 1%.

5.2.2 Comparison with state-of-the-art

We compare Fire with the state-of-the-art methods using adversarial training under $\ell_\infty$-norm attacks. Due to its effectiveness on adversarial performance, we use the self-adaptive scheme from [31] along with the FIRE loss to smooth the one-hot labels on CIFAR-10, and report the results under FIRE + Self Adaptive. We do not include the Adversarial Weight Perturbation (AWP) [34] method since it leverages two networks to increase the robustness of the model. We trained all state-of-the-art methods on 5 tries and report the mean and standard deviation of the results in Table II. Note that the codes for UAT and Atzmon et al. are not publicly available; we therefore report the results available on RobustBench [43].

Average Accuracy (Avg. Acc.): Overall, Fire exhibits the best Avg. Acc. among the compared methods in all settings. On CIFAR-10, in an equivalent setting, our Fire method outperforms [31]: the gain is close to 0.3% on natural examples and 0.15% in adversarial performance, giving an improvement of 0.23% in Avg. Acc. Moreover, our Fire method performs slightly better (by 0.04%) than [29]; given that their method requires four times Fire's runtime to complete its training, the gain of our method appears more significant. Besides, Fire outperforms [29] on the more challenging CIFAR-100 both in Avg. Acc., with a more significant gain (1.12%), and in runtime. Taking all metrics into account, Fire appears to be the best overall method.

Runtimes: Interestingly, our method exhibits a significant advantage over previous state-of-the-art methods using similar backbones: it is roughly 20% faster on average than methods with similar performance. We presume that this difference comes from the fact that our proposed Fire loss comprises 5 different operations, whereas the KL divergence proposed in [26] is composed of 6 operations.

5.2.3 Ablation studies

Influence of $\lambda$: We study the influence of the hyperparameter $\lambda$ on the performance of the Fire method. Figure 7 clearly shows the trade-off between natural accuracy and adversarial accuracy under AutoAttack [9]. When increasing $\lambda$, we emphasize the robust regularizer, consequently decreasing the performance on clean samples. This phenomenon aligns with intuition and is also observed in [26]. Even though several values of $\lambda$ lead to reasonable performance, we chose $\lambda = 12$ to obtain a natural accuracy close to that of the TRADES method on CIFAR-10. This ensures a fair comparison.

Figure 7: Influence of the hyperparameter $\lambda$ on the natural and adversarial accuracies for the Fire regularizer on CIFAR-10.

6 Proofs of Theorems and Propositions

6.1 Review of Fisher-Rao Distance (FRD)

Consider the family of probability distributions over the classes $\mathcal{C} \doteq \{q_{\boldsymbol{\theta}}(\cdot|\mathbf{x}) : \mathbf{x}\in\mathcal{X}\}$. (Since we are interested in the dependence of $q_{\boldsymbol{\theta}}(\cdot|\mathbf{x})$ on changes in the input $\mathbf{x}$ and, particularly, in its robustness to adversarial perturbations, we consider $\mathbf{x}$ as the "parameters" of the model over which the regularity conditions must be imposed.) Assume that the following regularity conditions hold [44]:

(i) $\nabla_{\mathbf{x}}\, q_{\boldsymbol{\theta}}(y|\mathbf{x})$ exists for all $\mathbf{x}$, $y$ and $\boldsymbol{\theta}\in\Theta$;

(ii) $\sum_{y\in\mathcal{Y}} \nabla_{\mathbf{x}}\, q_{\boldsymbol{\theta}}(y|\mathbf{x}) = 0$ for all $\mathbf{x}$ and $\boldsymbol{\theta}\in\Theta$;

(iii) $\mathbf{G}(\mathbf{x}) = \mathbb{E}_{Y\sim q_{\boldsymbol{\theta}}(\cdot|\mathbf{x})}\big[\nabla_{\mathbf{x}}\log q_{\boldsymbol{\theta}}(Y|\mathbf{x})\,\nabla_{\mathbf{x}}^{\intercal}\log q_{\boldsymbol{\theta}}(Y|\mathbf{x})\big]$ is positive definite for any $\mathbf{x}$ and $\boldsymbol{\theta}\in\Theta$.

Notice that if (i) holds, then (ii) also holds immediately for discrete distributions over finite spaces (assuming $\sum_{y\in\mathcal{Y}}$ and $\nabla_{\mathbf{x}}$ are interchangeable operations), as in our case. When (i)-(iii) are met, the variance of the differential form $\nabla_{\mathbf{x}}^{\intercal}\log q_{\boldsymbol{\theta}}(Y|\mathbf{x})\,d\mathbf{x}$ can be interpreted as the square of a differential arc length $ds^2$ in the space $\mathcal{C}$, which yields

ds^2 = \langle d\mathbf{x},\, d\mathbf{x}\rangle_{\mathbf{G}(\mathbf{x})} = d\mathbf{x}^{\intercal}\,\mathbf{G}(\mathbf{x})\,d\mathbf{x}.   (25)

Thus $\mathbf{G}$, which is the Fisher Information Matrix (FIM), can be adopted as a metric tensor. We now consider a curve $\boldsymbol{\gamma}:[0,1]\to\mathcal{X}$ in the input space connecting two arbitrary points $\mathbf{x}$ and $\mathbf{x}'$, i.e., such that $\boldsymbol{\gamma}(0)=\mathbf{x}$ and $\boldsymbol{\gamma}(1)=\mathbf{x}'$. Notice that this curve induces the following curve in the space $\mathcal{C}$: $q_{\boldsymbol{\theta}}(\cdot|\boldsymbol{\gamma}(t))$ for $t\in[0,1]$. The Fisher-Rao distance between the distributions $q_{\boldsymbol{\theta}} = q_{\boldsymbol{\theta}}(\cdot|\mathbf{x})$ and $q'_{\boldsymbol{\theta}} = q_{\boldsymbol{\theta}}(\cdot|\mathbf{x}')$ will be denoted by $d_{R,\mathcal{C}}(q_{\boldsymbol{\theta}}, q'_{\boldsymbol{\theta}})$ and is formally defined by

d_{R,\mathcal{C}}(q_{\boldsymbol{\theta}}, q'_{\boldsymbol{\theta}}) \doteq \inf_{\boldsymbol{\gamma}} \int_{0}^{1} \sqrt{\frac{d\boldsymbol{\gamma}^{\intercal}(t)}{dt}\,\mathbf{G}(\boldsymbol{\gamma}(t))\,\frac{d\boldsymbol{\gamma}(t)}{dt}}\;dt,   (26)

where the infimum is taken over all piecewise smooth curves. This means that the FRD is the length of the geodesic between points 𝐱\boldsymbol{\mathbf{x}} and 𝐱\boldsymbol{\mathbf{x}}^{\prime} using the FIM as the metric tensor. In general, the minimization of the functional in (26) is a problem that can be solved using the well-known Euler-Lagrange differential equations. Several examples for simple families of distributions can be found in [44].

6.2 Proof of Theorem 1

To compute the FRD, we first need the FIM of the family $\mathcal{C}$ with $q_{\boldsymbol{\theta}}(y|\mathbf{x})$ given in expression (8). A direct calculation gives:

\mathbf{G}(\mathbf{x}) = \frac{e^{h_{\boldsymbol{\theta}}(\mathbf{x})}}{\big(1+e^{h_{\boldsymbol{\theta}}(\mathbf{x})}\big)^{2}}\,\nabla_{\mathbf{x}} h_{\boldsymbol{\theta}}(\mathbf{x})\,\nabla_{\mathbf{x}}^{\intercal} h_{\boldsymbol{\theta}}(\mathbf{x}).   (27)

It is clear from this expression that the FIM of this model has rank one and is therefore singular. This matches the fact that $q_{\boldsymbol{\theta}}(y|\mathbf{x})$ has a single degree of freedom given by $h_{\boldsymbol{\theta}}(\mathbf{x})$. Therefore, the statistical manifold $\mathcal{C}$ has dimension 1.

To proceed, we consider the following model:

q_{\boldsymbol{\theta}}(y|u) = \frac{1}{1+e^{-uy}},   (28)

where we have defined $u = h_{\boldsymbol{\theta}}(\mathbf{x})$. This effectively removes the model ambiguities because $q(y|u)\neq q(y|u')$ if and only if $u\neq u'$. Note that $q_{\boldsymbol{\theta}}(y|u)$, for fixed $\boldsymbol{\theta}$, can be interpreted as a one-dimensional parametric model with parameter $u$. Its FIM is a scalar that can be readily obtained, yielding:

G(u) = \frac{e^{u}}{(1+e^{u})^{2}}.   (29)

Clearly, the FIM $G(u)$ is non-singular for any $u$ (i.e., $G(u)>0$), as required. Let $\mathcal{D} = \{q_{\boldsymbol{\theta}}(\cdot|u) : u\in\mathcal{U}\}$, where $\mathcal{U} = h_{\boldsymbol{\theta}}(\mathcal{X})$, and consider two distributions in $\mathcal{D}$: $q_{\boldsymbol{\theta}} = q_{\boldsymbol{\theta}}(\cdot|u)$ and $q'_{\boldsymbol{\theta}} = q_{\boldsymbol{\theta}}(\cdot|u')$. Then, the FRD can be evaluated directly as follows [44, Eq. (3.13)]:

d_{R,\mathcal{D}}(q_{\boldsymbol{\theta}}, q'_{\boldsymbol{\theta}}) = \left|\int_{u}^{u'} G^{1/2}(v)\,dv\right| = 2\,\Big|\arctan\big(e^{u'/2}\big) - \arctan\big(e^{u/2}\big)\Big|.   (30)

Therefore, the FRD between the distributions $q_{\boldsymbol{\theta}} = q_{\boldsymbol{\theta}}(\cdot|\mathbf{x})$ and $q'_{\boldsymbol{\theta}} = q_{\boldsymbol{\theta}}(\cdot|\mathbf{x}')$ can be directly obtained by substituting $u = h_{\boldsymbol{\theta}}(\mathbf{x})$ and $u' = h_{\boldsymbol{\theta}}(\mathbf{x}')$ into (30), yielding the final result in expression (9).
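As a sanity check, the following sketch (assuming only NumPy/SciPy; the logits are arbitrary illustrative values) verifies numerically that the closed form (30) agrees with the arc length obtained by integrating $G^{1/2}(v)$ from Eq. (29):

```python
import numpy as np
from scipy.integrate import quad

def frd_binary(u, u_prime):
    # Closed-form binary FRD of Eq. (30).
    return 2.0 * abs(np.arctan(np.exp(u_prime / 2)) - np.arctan(np.exp(u / 2)))

def frd_binary_numeric(u, u_prime):
    # Arc length: integral of sqrt(G(v)) with G(v) = e^v / (1 + e^v)^2, Eq. (29).
    sqrt_g = lambda v: np.exp(v / 2) / (1.0 + np.exp(v))
    length, _ = quad(sqrt_g, u, u_prime)
    return abs(length)

u, u_prime = -1.3, 2.7  # arbitrary logits h_theta(x) and h_theta(x')
print(frd_binary(u, u_prime))          # closed form
print(frd_binary_numeric(u, u_prime))  # should agree to numerical precision
```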

6.3 Proof of Theorem 2

As in the binary case developed in Section 3.3, the FIM of the family $\mathcal{C}$ with $q_{\boldsymbol{\theta}}(y|\mathbf{x})$ given in expression (14) is singular. To show this, we first notice that $q_{\boldsymbol{\theta}}(y|\mathbf{x})$ can be written as

q_{\boldsymbol{\theta}}(y|\mathbf{x}) = \frac{e^{g_{y}(\mathbf{x},\boldsymbol{\theta})}}{\sum_{y'\in\mathcal{Y}} e^{g_{y'}(\mathbf{x},\boldsymbol{\theta})}},   (31)

where $g_{y}(\mathbf{x},\boldsymbol{\theta}) = h_{y}(\mathbf{x},\boldsymbol{\theta}) - h_{1}(\mathbf{x},\boldsymbol{\theta})$. Since $g_{1}(\mathbf{x},\boldsymbol{\theta}) = 0$, this shows that $\mathcal{C}$ has $M-1$ degrees of freedom: $g_{2}(\mathbf{x},\boldsymbol{\theta}),\ldots,g_{M}(\mathbf{x},\boldsymbol{\theta})$. A direct calculation of the FIM gives

\mathbf{G}(\mathbf{x}) = \sum_{y=2}^{M} q_{\boldsymbol{\theta}}(y|\mathbf{x})\,\nabla_{\mathbf{x}} g_{y}(\mathbf{x},\boldsymbol{\theta})\,\nabla_{\mathbf{x}}^{\intercal} g_{y}(\mathbf{x},\boldsymbol{\theta})
\;-\; \sum_{y,y'=2}^{M} q_{\boldsymbol{\theta}}(y|\mathbf{x})\, q_{\boldsymbol{\theta}}(y'|\mathbf{x})\,\nabla_{\mathbf{x}} g_{y}(\mathbf{x},\boldsymbol{\theta})\,\nabla_{\mathbf{x}}^{\intercal} g_{y'}(\mathbf{x},\boldsymbol{\theta}).   (32)

Let $\mathbf{v}\in\mathcal{X}$ be an arbitrary vector and define $\beta_{y} \doteq \nabla_{\mathbf{x}}^{\intercal} g_{y}(\mathbf{x},\boldsymbol{\theta})\,\mathbf{v}$ for $y\in[2:M]$. Notice that

\mathbf{G}(\mathbf{x})\,\mathbf{v} = \sum_{y=2}^{M} q_{\boldsymbol{\theta}}(y|\mathbf{x})\,\beta_{y}\,\nabla_{\mathbf{x}} g_{y}(\mathbf{x},\boldsymbol{\theta})
\;-\; \sum_{y=2}^{M}\left(\sum_{y'=2}^{M} q_{\boldsymbol{\theta}}(y'|\mathbf{x})\,\beta_{y'}\right) q_{\boldsymbol{\theta}}(y|\mathbf{x})\,\nabla_{\mathbf{x}} g_{y}(\mathbf{x},\boldsymbol{\theta}).   (33)

Therefore, the range of $\mathbf{G}(\mathbf{x})$ is a subset of the span of the set $\{\nabla_{\mathbf{x}} g_{2}(\mathbf{x},\boldsymbol{\theta}),\ldots,\nabla_{\mathbf{x}} g_{M}(\mathbf{x},\boldsymbol{\theta})\}$. Thus, it follows that $\operatorname{rank}(\mathbf{G}(\mathbf{x}))\leq M-1$, which implies that it is singular.
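The rank deficiency is easy to observe numerically; the toy sketch below (assuming PyTorch; the linear model $h(\mathbf{x}) = \mathbf{W}\mathbf{x} + \mathbf{b}$ is an arbitrary illustration, not the architecture used in our experiments) builds the FIM of condition (iii) by automatic differentiation and checks that its rank does not exceed $M-1$:

```python
import torch

torch.manual_seed(0)
M, d = 4, 10                                # classes and input dimension
W, b = torch.randn(M, d), torch.randn(M)    # toy linear logits h(x) = W x + b
x = torch.randn(d, requires_grad=True)

log_q = torch.log_softmax(W @ x + b, dim=0)
q = log_q.exp().detach()

# Per-class score gradients nabla_x log q(y|x).
grads = torch.stack([
    torch.autograd.grad(log_q[y], x, retain_graph=True)[0] for y in range(M)
])
# FIM of condition (iii): E_Y[ grad grad^T ] under q(.|x).
G = (q[:, None, None] * grads[:, :, None] * grads[:, None, :]).sum(0)

print(torch.linalg.matrix_rank(G).item())   # at most M - 1 (here: 3)
```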

The singularity issue can be overcome by embedding $\mathcal{C}$ into the probability simplex $\mathcal{P}$ defined as follows:

\mathcal{P} = \Big\{q:\mathcal{Y}\to[0,1] : \sum_{y\in\mathcal{Y}} q(y) = 1\Big\}.   (34)

To proceed, we follow [45, Section 2.8] and consider the following parameterization for any distribution $q\in\mathcal{P}$:

q(y|\mathbf{z}) = \frac{z_{y}^{2}}{4},\quad y\in\{1,\ldots,M\}.   (35)

We then consider the following statistical manifold:

\mathcal{D} = \Big\{q(\cdot|\mathbf{z}) : \|\mathbf{z}\|^{2} = 4,\; z_{y}\geq 0\ \ \forall y\in\mathcal{Y}\Big\}.   (36)

Notice that the parameter vector $\mathbf{z}$ belongs to the positive portion of a sphere of radius 2 centered at the origin in $\mathbb{R}^{M}$. As a consequence, the FIM is given by

\mathbf{G}(\mathbf{z}) = \sum_{y\in\mathcal{Y}} \frac{z_{y}^{2}}{4}\left(\frac{2}{z_{y}}\,\mathbf{e}_{y}\right)\left(\frac{2}{z_{y}}\,\mathbf{e}_{y}^{\intercal}\right) = \mathbf{I},   (37)

where $\{\mathbf{e}_{y}\}$ are the canonical basis vectors of $\mathbb{R}^{M}$ and $\mathbf{I}$ is the identity matrix. From expression (37), we conclude that the Fisher metric is equal to the Euclidean metric. Since the parameter vector lies on a sphere, the FRD between the distributions $q = q(\cdot|\mathbf{z})$ and $q' = q(\cdot|\mathbf{z}')$ can be written as the radius of the sphere times the angle between the vectors $\mathbf{z}$ and $\mathbf{z}'$. This leads to

d_{R,\mathcal{D}}(q,q') = 2\arccos\left(\frac{\mathbf{z}^{\intercal}\mathbf{z}'}{4}\right) = 2\arccos\left(\sum_{y\in\mathcal{Y}}\sqrt{q(y|\mathbf{z})\,q(y|\mathbf{z}')}\right).   (38)

Finally, we can compute the FRD for distributions in $\mathcal{C}$ using:

d_{R,\mathcal{C}}(q_{\boldsymbol{\theta}}, q'_{\boldsymbol{\theta}}) = 2\arccos\left(\sum_{y\in\mathcal{Y}}\sqrt{q_{\boldsymbol{\theta}}(y|\mathbf{x})\,q_{\boldsymbol{\theta}}(y|\mathbf{x}')}\right).   (39)

Notice that $0\leq d_{R,\mathcal{C}}(q_{\boldsymbol{\theta}}, q'_{\boldsymbol{\theta}})\leq\pi$ for all $\mathbf{x},\mathbf{x}'\in\mathcal{X}$; the distance is zero if $q_{\boldsymbol{\theta}}(\cdot|\mathbf{x}) = q_{\boldsymbol{\theta}}(\cdot|\mathbf{x}')$ and equals $\pi$ when $[q_{\boldsymbol{\theta}}(1|\mathbf{x}),\ldots,q_{\boldsymbol{\theta}}(M|\mathbf{x})]$ and $[q_{\boldsymbol{\theta}}(1|\mathbf{x}'),\ldots,q_{\boldsymbol{\theta}}(M|\mathbf{x}')]$ are orthogonal vectors.
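In code, Eq. (39) reduces to $2\arccos$ of the Bhattacharyya coefficient of the two softmax outputs; a minimal sketch, assuming only NumPy and two illustrative distributions, follows:

```python
import numpy as np

def frd_multiclass(p, q):
    # Eq. (39): 2 * arccos of the Bhattacharyya coefficient; the clip guards
    # against round-off pushing the argument slightly outside [-1, 1].
    bc = np.sum(np.sqrt(p * q))
    return 2.0 * np.arccos(np.clip(bc, -1.0, 1.0))

p = np.array([0.7, 0.2, 0.1])  # e.g. softmax output on a natural input
q = np.array([0.4, 0.4, 0.2])  # e.g. softmax output on its perturbed version
print(frd_multiclass(p, q))    # lies in [0, pi]; zero iff p == q
```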

Proof of the consistency between Theorem 1 and Theorem 2. This consists in showing the equivalence between the FRD for the binary case, given by Eq. (9), and for the multiclass case, given by Eq. (15), when $M=2$. First, notice that the models (8) and (14) coincide under the correspondence $h_{\boldsymbol{\theta}}(\mathbf{x}) = h_{2}(\mathbf{x},\boldsymbol{\theta}) - h_{1}(\mathbf{x},\boldsymbol{\theta})$, $y=1 \leftrightarrow y=-1$ and $y=2 \leftrightarrow y=1$. Then, using standard trigonometric identities, we rewrite the FRD for the multiclass case:

d_{R,\mathcal{C}}(q_{\boldsymbol{\theta}}, q'_{\boldsymbol{\theta}})
= 2\arccos\left(\sqrt{\tfrac{1}{1+e^{h_{\boldsymbol{\theta}}(\mathbf{x})}}}\sqrt{\tfrac{1}{1+e^{h_{\boldsymbol{\theta}}(\mathbf{x}')}}} + \sqrt{\tfrac{1}{1+e^{-h_{\boldsymbol{\theta}}(\mathbf{x})}}}\sqrt{\tfrac{1}{1+e^{-h_{\boldsymbol{\theta}}(\mathbf{x}')}}}\right)
= 2\arccos\big[\cos(\arctan(e^{h_{\boldsymbol{\theta}}(\mathbf{x})/2}))\cos(\arctan(e^{h_{\boldsymbol{\theta}}(\mathbf{x}')/2})) + \sin(\arctan(e^{h_{\boldsymbol{\theta}}(\mathbf{x})/2}))\sin(\arctan(e^{h_{\boldsymbol{\theta}}(\mathbf{x}')/2}))\big]
= 2\arccos\big[\cos\big(\arctan(e^{h_{\boldsymbol{\theta}}(\mathbf{x})/2}) - \arctan(e^{h_{\boldsymbol{\theta}}(\mathbf{x}')/2})\big)\big]
= 2\big|\arctan(e^{h_{\boldsymbol{\theta}}(\mathbf{x}')/2}) - \arctan(e^{h_{\boldsymbol{\theta}}(\mathbf{x})/2})\big|,   (40)

where the last step follows from $|\arctan(\alpha)|\leq\pi/2$ for all $\alpha\in\mathbb{R}$, so the argument of the cosine function belongs to $[-\pi,\pi]$, and from $\arccos(\cos(\alpha)) = |\alpha|$ for $|\alpha|\leq\pi$, which completes the proof.
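The equivalence can also be checked numerically; in the sketch below (NumPy only, with arbitrary logits), the binary closed form (9) and the multiclass form (15) with $M=2$ agree to round-off:

```python
import numpy as np

h, h_prime = -0.8, 1.9  # arbitrary logits h_theta(x) and h_theta(x')

# Binary closed form, Eq. (9).
d_bin = 2.0 * abs(np.arctan(np.exp(h_prime / 2)) - np.arctan(np.exp(h / 2)))

# Multiclass form, Eq. (15), with the two-class softmax [1/(1+e^h), 1/(1+e^-h)].
p = np.array([1.0 / (1.0 + np.exp(h)), 1.0 / (1.0 + np.exp(-h))])
q = np.array([1.0 / (1.0 + np.exp(h_prime)), 1.0 / (1.0 + np.exp(-h_prime))])
d_multi = 2.0 * np.arccos(np.sum(np.sqrt(p * q)))

print(d_bin, d_multi)  # the two values should match up to round-off
```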

6.4 Proof of Proposition 2

For completeness, we first present the derivation of the misclassification probabilities (20) and (21):

P_{e}(\boldsymbol{\theta}) = \mathbb{P}(f_{\boldsymbol{\theta}}(\mathbf{X})\neq Y) = \Phi\left(-\frac{\boldsymbol{\theta}^{\intercal}\boldsymbol{\mu}}{\sqrt{\boldsymbol{\theta}^{\intercal}\boldsymbol{\Sigma}\boldsymbol{\theta}}}\right),   (41)
P'_{e}(\boldsymbol{\theta}) = \mathbb{P}(f_{\boldsymbol{\theta}}(\mathbf{X}'^{*})\neq Y) = \Phi\left(\frac{\varepsilon\|\boldsymbol{\theta}\|_{2} - \boldsymbol{\theta}^{\intercal}\boldsymbol{\mu}}{\sqrt{\boldsymbol{\theta}^{\intercal}\boldsymbol{\Sigma}\boldsymbol{\theta}}}\right).   (42)

Notice that $\Phi$, the cumulative distribution function of the standard normal random variable, is monotonically increasing and $\varepsilon\|\boldsymbol{\theta}\|_{2}/\sqrt{\boldsymbol{\theta}^{\intercal}\boldsymbol{\Sigma}\boldsymbol{\theta}}\geq 0$, so we have $P'_{e}(\boldsymbol{\theta})\geq P_{e}(\boldsymbol{\theta})$, as expected. Furthermore, since $\Phi$ is invertible, we can write $P'_{e}(\boldsymbol{\theta})$ explicitly as a function of $P_{e}(\boldsymbol{\theta})$:

P'_{e}(\boldsymbol{\theta}) = \Phi\left(\frac{\varepsilon\|\boldsymbol{\theta}\|_{2}}{\sqrt{\boldsymbol{\theta}^{\intercal}\boldsymbol{\Sigma}\boldsymbol{\theta}}} + \Phi^{-1}\big(P_{e}(\boldsymbol{\theta})\big)\right).   (43)

We now proceed to the proof of the proposition. Notice that, by the Rayleigh theorem [46, Theorem 4.2.2], we have

\lambda_{\min}(\boldsymbol{\Sigma})\,\|\boldsymbol{\theta}\|_{2}^{2} \;\leq\; \boldsymbol{\theta}^{\intercal}\boldsymbol{\Sigma}\boldsymbol{\theta} \;\leq\; \lambda_{\max}(\boldsymbol{\Sigma})\,\|\boldsymbol{\theta}\|_{2}^{2},   (44)

where $\lambda_{\min}(\boldsymbol{\Sigma})$ and $\lambda_{\max}(\boldsymbol{\Sigma})$ are the minimum and maximum eigenvalues of $\boldsymbol{\Sigma}$, respectively. Therefore, we can bound $\varepsilon\|\boldsymbol{\theta}\|_{2}/\sqrt{\boldsymbol{\theta}^{\intercal}\boldsymbol{\Sigma}\boldsymbol{\theta}}$ as follows:

\frac{\varepsilon}{\lambda_{\max}^{1/2}(\boldsymbol{\Sigma})} \;\leq\; \frac{\varepsilon\|\boldsymbol{\theta}\|_{2}}{\sqrt{\boldsymbol{\theta}^{\intercal}\boldsymbol{\Sigma}\boldsymbol{\theta}}} \;\leq\; \frac{\varepsilon}{\lambda_{\min}^{1/2}(\boldsymbol{\Sigma})}.   (45)

Using this fact together with the monotonicity of $\Phi$ in expression (43), we obtain the desired result.
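As an illustration, the sketch below (assuming NumPy/SciPy) evaluates Eqs. (41)-(43) for the Gaussian model; $\boldsymbol{\mu}$, $\boldsymbol{\Sigma}$ and $\varepsilon$ are taken from the caption of Fig. 9, while $\boldsymbol{\theta}$ is an arbitrary choice for demonstration:

```python
import numpy as np
from scipy.stats import norm

mu = np.array([-0.0218, 0.0425])                        # from the caption of Fig. 9
Sigma = np.array([[0.0212, 0.0036], [0.0036, 0.0042]])  # idem
eps = 0.1
theta = np.array([0.3, 1.0])                            # arbitrary classifier weights

sd = np.sqrt(theta @ Sigma @ theta)
P_e = norm.cdf(-(theta @ mu) / sd)                                   # Eq. (41)
P_e_adv = norm.cdf((eps * np.linalg.norm(theta) - theta @ mu) / sd)  # Eq. (42)

# Eq. (43): the adversarial error recovered from the natural one.
assert np.isclose(P_e_adv,
                  norm.cdf(eps * np.linalg.norm(theta) / sd + norm.ppf(P_e)))
print(P_e, P_e_adv)  # the adversarial error is never below the natural error
```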

7 Summary and Concluding Remarks

We introduced Fire, a new defense built on a robustness regularizer that measures the geodesic distance between softmax probabilities, using concepts from information geometry. The main innovation is to employ the Fisher-Rao Distance (FRD) to encourage invariant softmax probabilities for natural and adversarial examples while maintaining high performance on natural samples. Our empirical results showed that Fire consistently enhances the robustness of neural networks across various architectures, settings, and datasets. Compared to state-of-the-art adversarial defenses, Fire increases the Average Accuracy (Avg. Acc.) while reducing training time by 20%.

Interestingly, the FRD has rich connections with the Hellinger distance, the Kullback-Leibler divergence, and other standard regularization terms. Moreover, as illustrated by our simple logistic regression and Gaussian model, the optimization based on Fire is well-behaved and reaches all the desired Pareto-optimal points in the natural-adversarial region. This observation contrasts with the results of other state-of-the-art adversarial learning approaches. A further theoretical explanation of this change in behaviour is worth exploring in future work.

Acknowledgment

The authors wish to thank the Associate Editor and the reviewers for their suggestions, which significantly improved the manuscript. This work was supported by the NSERC, and McGill University in the framework of the NSERC/Hydro-Quebec Industrial Research Chair in Interactive Information Infrastructure for the Power Grid (IRCPJ406021-14). This work was performed using HPC resources from GENCI-IDRIS (Grant 2021-AD011012277 and AD011013109). This project has received funding from the European Union’s Horizon 2020 research and innovation program under the Marie Skłodowska-Curie grant agreement No 792464.

References

  • [1] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, “Intriguing properties of neural networks,” International Conference on Learning Representations, 2014. [Online]. Available: http://arxiv.org/abs/1312.6199
  • [2] M. Alzantot, Y. Sharma, A. Elgohary, B.-J. Ho, M. Srivastava, and K.-W. Chang, “Generating natural language adversarial examples,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.   Brussels, Belgium: Association for Computational Linguistics, Oct.-Nov. 2018, pp. 2890–2896. [Online]. Available: https://www.aclweb.org/anthology/D18-1316
  • [3] Y. Vorobeychik, M. Kantarcioglu, R. Brachman, P. Stone, and F. Rossi, Adversarial Machine Learning, ser. Synthesis Lectures on Artificial Intelligence and Machine Learning.   Morgan & Claypool Publishers, 2018. [Online]. Available: https://books.google.ca/books?id=Lw5pDwAAQBAJ
  • [4] I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” International Conference on Learning Representations, 2015.
  • [5] J. Gilmer, L. Metz, F. Faghri, S. S. Schoenholz, M. Raghu, M. Wattenberg, and I. J. Goodfellow, “Adversarial spheres,” 2018. [Online]. Available: https://openreview.net/forum?id=SkthlLkPf
  • [6] A. Ilyas, S. Santurkar, D. Tsipras, L. Engstrom, B. Tran, and A. Madry, “Adversarial examples are not bugs, they are features,” in Advances in Neural Information Processing Systems, 2019, pp. 125–136.
  • [7] N. Papernot, P. McDaniel, I. Goodfellow, S. Jha, Z. B. Celik, and A. Swami, “Practical black-box attacks against machine learning,” in Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, ser. ASIA CCS ’17.   New York, NY, USA: ACM, 2017, pp. 506–519. [Online]. Available: http://doi.acm.org/10.1145/3052973.3053009
  • [8] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, “Towards deep learning models resistant to adversarial attacks,” in International Conference on Learning Representations, 2018. [Online]. Available: https://openreview.net/forum?id=rJzIBfZAb
  • [9] F. Croce and M. Hein, “Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks,” in International Conference on Machine Learning.   PMLR, 2020, pp. 2206–2216.
  • [10] R. Feinman, R. R. Curtin, S. Shintre, and A. B. Gardner, “Detecting adversarial samples from artifacts,” CoRR, vol. abs/1703.00410, 2017. [Online]. Available: http://arxiv.org/abs/1703.00410
  • [11] Z. Zheng and P. Hong, “Robust detection of adversarial attacks by modeling the intrinsic properties of deep neural networks,” in Neural Information Processing Systems, 2018, pp. 7924–7933.
  • [12] K. Grosse, P. Manoharan, N. Papernot, M. Backes, and P. D. McDaniel, “On the (statistical) detection of adversarial examples,” CoRR, vol. abs/1702.06280, 2017.
  • [13] N. Carlini and D. Wagner, “Adversarial examples are not easily detected: Bypassing ten detection methods,” in Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, ser. AISec ’17.   New York, NY, USA: ACM, 2017, pp. 3–14. [Online]. Available: http://doi.acm.org/10.1145/3128572.3140444
  • [14] B. Li, C. Chen, W. Wang, and L. Carin, “Certified adversarial robustness with additive noise,” in Advances in Neural Information Processing Systems, 2019, pp. 9464–9474.
  • [15] M. Lecuyer, V. Atlidakis, R. Geambasu, D. Hsu, and S. Jana, “Certified robustness to adversarial examples with differential privacy,” in 2019 IEEE Symposium on Security and Privacy (SP).   IEEE, 2019, pp. 656–672.
  • [16] J. Cohen, E. Rosenfeld, and Z. Kolter, “Certified adversarial robustness via randomized smoothing,” in International Conference on Machine Learning.   PMLR, 2019, pp. 1310–1320.
  • [17] F. Croce, M. Andriushchenko, and M. Hein, “Provable robustness of relu networks via maximization of linear regions,” in the 22nd International Conference on Artificial Intelligence and Statistics.   PMLR, 2019, pp. 2057–2066.
  • [18] E. Wong, F. Schmidt, J. H. Metzen, and J. Z. Kolter, “Scaling provable adversarial defenses,” in Advances in Neural Information Processing Systems, 2018, pp. 8400–8409.
  • [19] H. Zhang, H. Chen, C. Xiao, S. Gowal, R. Stanforth, B. Li, D. Boning, and C.-J. Hsieh, “Towards stable and efficient training of verifiably robust neural networks,” International Conference on Learning Representation, 2020.
  • [20] S. Gowal, K. D. Dvijotham, R. Stanforth, R. Bunel, C. Qin, J. Uesato, R. Arandjelovic, T. Mann, and P. Kohli, “Scalable verified training for provably robust image classification,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 4842–4851.
  • [21] M. Mirman, T. Gehr, and M. Vechev, “Differentiable abstract interpretation for provably robust neural networks,” in International Conference on Machine Learning, 2018, pp. 3578–3586.
  • [22] A. Raghunathan, J. Steinhardt, and P. Liang, “Certified defenses against adversarial examples,” in International Conference on Learning Representations, 2018. [Online]. Available: https://openreview.net/forum?id=Bys4ob-Rb
  • [23] E. Wong and J. Z. Kolter, “Provable defenses against adversarial examples via the convex outer adversarial polytope,” in ICML, ser. JMLR Workshop and Conference Proceedings, vol. 80.   JMLR.org, 2018, pp. 5283–5292.
  • [24] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” in NIPS Deep Learning and Representation Learning Workshop, 2015. [Online]. Available: http://arxiv.org/abs/1503.02531
  • [25] N. Papernot, P. D. McDaniel, and I. J. Goodfellow, “Transferability in machine learning: from phenomena to black-box attacks using adversarial samples,” CoRR, vol. abs/1605.07277, 2016. [Online]. Available: http://arxiv.org/abs/1605.07277
  • [26] H. Zhang, Y. Yu, J. Jiao, E. P. Xing, L. E. Ghaoui, and M. I. Jordan, “Theoretically principled trade-off between robustness and accuracy,” in International Conference on Machine Learning, 2019, pp. 1–11.
  • [27] Y. Wang, D. Zou, J. Yi, J. Bailey, X. Ma, and Q. Gu, “Improving adversarial robustness requires revisiting misclassified examples,” in International Conference on Learning Representations, 2019.
  • [28] D. Hendrycks, K. Lee, and M. Mazeika, “Using pre-training can improve model robustness and uncertainty,” in International Conference on Machine Learning.   PMLR, 2019, pp. 2712–2721.
  • [29] L. Rice, E. Wong, and Z. Kolter, “Overfitting in adversarially robust deep learning,” in International Conference on Machine Learning.   PMLR, 2020, pp. 8093–8104.
  • [30] M. Atzmon, N. Haim, L. Yariv, O. Israelov, H. Maron, and Y. Lipman, “Controlling neural level sets,” in Advances in Neural Information Processing Systems, 2019, pp. 2034–2043.
  • [31] L. Huang, C. Zhang, and H. Zhang, “Self-adaptive training: beyond empirical risk minimization,” Advances in Neural Information Processing Systems, vol. 33, 2020.
  • [32] Y. Carmon, A. Raghunathan, L. Schmidt, J. C. Duchi, and P. S. Liang, “Unlabeled data improves adversarial robustness,” in Advances in Neural Information Processing Systems, 2019, pp. 11 192–11 203.
  • [33] J.-B. Alayrac, J. Uesato, P.-S. Huang, A. Fawzi, R. Stanforth, and P. Kohli, “Are labels required for improving adversarial robustness?” in Advances in Neural Information Processing Systems, 2019, pp. 12 214–12 223.
  • [34] D. Wu, S.-T. Xia, and Y. Wang, “Adversarial weight perturbation helps robust generalization,” Advances in Neural Information Processing Systems, vol. 33, 2020.
  • [35] N. Carlini and D. Wagner, “Towards evaluating the robustness of neural networks,” in 2017 IEEE Symposium on Security and Privacy (SP), May 2017, pp. 39–57.
  • [36] M. Cisse, P. Bojanowski, E. Grave, Y. Dauphin, and N. Usunier, “Parseval networks: Improving robustness to adversarial examples,” in Proceedings of the 34th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, D. Precup and Y. W. Teh, Eds., vol. 70.   PMLR, 06–11 Aug 2017, pp. 854–863. [Online]. Available: https://proceedings.mlr.press/v70/cisse17a.html
  • [37] M. A. Torkamani and D. Lowd, “On robustness and regularization of structural support vector machines,” in Proceedings of the 31st International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, E. P. Xing and T. Jebara, Eds., vol. 32, no. 2.   Bejing, China: PMLR, 22–24 Jun 2014, pp. 577–585. [Online]. Available: http://proceedings.mlr.press/v32/torkamani14.html
  • [38] C. M. Bishop, Pattern Recognition and Machine Learning.   Springer, 2006.
  • [39] Y. LeCun, C. Cortes, and C. Burges, “Mnist handwritten digit database,” ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, vol. 2, 2010.
  • [40] A. Krizhevsky, “Learning multiple layers of features from tiny images,” Tech. Rep., 2009.
  • [41] A. Torralba, R. Fergus, and W. T. Freeman, “80 million tiny images: A large data set for nonparametric object and scene recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 11, pp. 1958–1970, 2008.
  • [42] I. Loshchilov and F. Hutter, “Sgdr: Stochastic gradient descent with warm restarts,” International Conference on Learning Representations, 2017. [Online]. Available: https://openreview.net/pdf/f1d82a22d77c21787940a33db6ce95a245c55eeb.pdf
  • [43] F. Croce, M. Andriushchenko, V. Sehwag, E. Debenedetti, N. Flammarion, M. Chiang, P. Mittal, and M. Hein, “Robustbench: a standardized adversarial robustness benchmark,” arXiv preprint arXiv:2010.09670, 2020.
  • [44] C. Atkinson and A. F. S. Mitchell, “Rao’s distance measure,” Sankhyā: The Indian Journal of Statistics, Series A (1961-2002), vol. 43, no. 3, pp. 345–365, 1981. [Online]. Available: http://www.jstor.org/stable/25050283
  • [45] O. Calin and C. Udrişte, Geometric modeling in probability and statistics.   Springer, 2014.
  • [46] R. Horn and C. Johnson, Matrix Analysis, ser. Matrix Analysis.   Cambridge University Press, 2013. [Online]. Available: https://books.google.de/books?id=5I5AYeeh0JUC
  • [47] A. B. Tsybakov, Introduction to nonparametric estimation.   Springer Science & Business Media, 2008.

Appendix A Comparison between the Fisher-Rao distance and the KL divergence on real data

Figure 8: Plots of all the possible points $(1-P_e(\boldsymbol{\theta}),\,1-P'_e(\boldsymbol{\theta}))$ for the ResNet-18 model on CIFAR-10 (left: full curve; right: zoom on the region of interest).

In order to confirm the difference between the Kullback-Leibler divergence and the Fisher-Rao distance on real data, we reproduce the simulation presented in Fig. 6 on the CIFAR-10 dataset. We trained multiple classifiers with both the TRADES and FIRE methods, varying $\lambda\in[0,50]$, using a ResNet-18 trained for 100 epochs. The optimizer is Stochastic Gradient Descent (SGD) with a learning rate of 0.01, decayed by 0.1 at epochs 75, 90, and 100, a weight decay of $5\cdot 10^{-4}$, and a momentum of 0.9. The results are averaged over 2 runs.

We present the results in Fig. 8: the left plot shows the entire curve, and the right plot zooms in on the region of interest. The Fisher-Rao distance presents a better trade-off between natural and adversarial accuracies than the Kullback-Leibler divergence on real data, confirming the improvements shown in Fig. 6. For natural accuracies below 80%, the two regularizers behave quite similarly. For natural accuracies above 80%, which is the region of interest, the improvement brought by the Fisher-Rao distance is consistent across all training points. At fixed adversarial accuracy, the Fisher-Rao distance can increase natural accuracy by up to 1%. Note that, in this simulation, the Kullback-Leibler divergence can achieve a better adversarial accuracy than the Fisher-Rao distance (44.21% compared to 43.57%), but at a cost of slightly more than 2.5% in natural accuracy (78.69% compared to 81.23%). The Fisher-Rao distance therefore achieves a better trade-off between natural and adversarial accuracies than the Kullback-Leibler divergence.

Figure 9: Plots of all the possible points $(1-P_e(\boldsymbol{\theta}),\,1-P'_e(\boldsymbol{\theta}))$ for the Gaussian model with $\varepsilon=0.1$, $\boldsymbol{\mu}=[-0.0218;\,0.0425]$ and $\boldsymbol{\Sigma}=[0.0212,\,0.0036;\,0.0036,\,0.0042]$, shown in blue. In red, we show the Pareto-optimal points (panel (a)). In black, we show the solutions obtained by minimizing the risk $L_{\text{Hel}}(\boldsymbol{\theta})$ in (48) (panel (b)) and the risk $L_{\text{FIRE}}(\boldsymbol{\theta})$ in (7) (panel (c)).
TABLE III: Comparison between Hellinger and Fisher-Rao based regularizers under the white-box $\ell_\infty$ threat model.

Defense   | Dataset   | $\varepsilon$ | Structure | Natural      | AutoAttack   | Avg. Acc.    | RunTime
----------|-----------|---------------|-----------|--------------|--------------|--------------|--------
Hellinger | MNIST     | 0.3           | CNN       | 99.31 ± 0.03 | 94.03 ± 0.24 | 96.67 ± 0.07 | 2h06
FIRE      | MNIST     | 0.3           | CNN       | 99.22 ± 0.02 | 94.44 ± 0.14 | 96.83 ± 0.10 | 2h06
Hellinger | CIFAR-10  | 8/255         | WRN-34-10 | 85.96 ± 0.28 | 50.52 ± 0.31 | 68.24 ± 0.15 | 11h02
FIRE      | CIFAR-10  | 8/255         | WRN-34-10 | 85.98 ± 0.09 | 51.45 ± 0.32 | 68.72 ± 0.22 | 11h00
Hellinger | CIFAR-100 | 8/255         | WRN-34-10 | 60.79 ± 0.88 | 25.58 ± 0.27 | 43.18 ± 0.54 | 11h12
FIRE      | CIFAR-100 | 8/255         | WRN-34-10 | 61.03 ± 0.21 | 26.42 ± 0.21 | 43.73 ± 0.12 | 11h10

Appendix B Comparison between Fisher-Rao and Hellinger distances

B.1 Theoretical relation

The Hellinger distance between two distributions $q$ and $q'$ on $\mathcal{Y}$ is defined as follows:

\text{H}(q,q') \doteq \sqrt{2}\left(1 - \sum_{y\in\mathcal{Y}}\sqrt{q(y)\,q'(y)}\right)^{1/2}.   (46)

Using (15), we readily obtain the following relation between the Hellinger distance and the Fisher-Rao distance.

Theorem 4 (Relation between FRD and Hellinger distance).

The FRD between the soft-predictions $q_{\boldsymbol{\theta}} = q_{\boldsymbol{\theta}}(\cdot|\mathbf{x})$ and $q'_{\boldsymbol{\theta}} = q_{\boldsymbol{\theta}}(\cdot|\mathbf{x}')$, given by (15), is related to the Hellinger distance through the relation

\text{H}^{2}(q_{\boldsymbol{\theta}}, q'_{\boldsymbol{\theta}}) = 2\left[1 - \cos\left(\frac{d_{R,\mathcal{C}}(q_{\boldsymbol{\theta}}, q'_{\boldsymbol{\theta}})}{2}\right)\right].   (47)

Since $0\leq d_{R,\mathcal{C}}(q_{\boldsymbol{\theta}}, q'_{\boldsymbol{\theta}})\leq\pi$, it is clear that $\text{H}^{2}(q_{\boldsymbol{\theta}}, q'_{\boldsymbol{\theta}})$ is a monotonically increasing function of $d_{R,\mathcal{C}}(q_{\boldsymbol{\theta}}, q'_{\boldsymbol{\theta}})$.

We conclude from this result that the FRD and the Hellinger distance are theoretically equivalent regularization mechanisms. However, it is clear that the empirical optimization of these distances may be different. This is further explored and confirmed in the following.
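The relation (47) is easy to verify numerically; a minimal sketch, assuming only NumPy and two arbitrary distributions, follows:

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])  # two arbitrary distributions over 3 classes
q = np.array([0.2, 0.5, 0.3])

bc = np.sum(np.sqrt(p * q))    # Bhattacharyya coefficient
d_fr = 2.0 * np.arccos(bc)     # FRD, Eq. (15)
h2 = 2.0 * (1.0 - bc)          # squared Hellinger distance, Eq. (46)

assert np.isclose(h2, 2.0 * (1.0 - np.cos(d_fr / 2.0)))  # relation (47)
print(d_fr, h2)
```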

B.2 Experimental comparison

Since the Hellinger distance and the Fisher-Rao distance are theoretically equivalent metrics (see Section B.1), we investigate their empirical difference. First, we perform the same simulations as those presented in Fig. 6, but using the Hellinger distance as the robust regularizer. Then, we compare Hellinger and FRD on real datasets (similarly to the simulations given in Section 5.2.1).

B.2.1 Accuracy-Robustness trade-offs in the Gaussian case

Just as we defined the FIRE risk, we can define the Hellinger risk as:

L_{\text{Hel}}(\boldsymbol{\theta}) \doteq \mathbb{E}_{p(\mathbf{x},y)}\Big[\max_{\mathbf{x}':\|\mathbf{x}'-\mathbf{x}\|_{p}\leq\varepsilon} -\log q_{\boldsymbol{\theta}}(y|\mathbf{x}) + \lambda\,\text{H}^{2}\big(q_{\boldsymbol{\theta}}(\cdot|\mathbf{x}),\, q_{\boldsymbol{\theta}}(\cdot|\mathbf{x}')\big)\Big],   (48)

where $\text{H}(q_{\boldsymbol{\theta}}, q'_{\boldsymbol{\theta}})$ is defined in equation (46). In Fig. 9(b), we present the solutions of the (local) Empirical Risk Minimization (ERM) for the Hellinger risk (48) for different values of $\lambda$. As can be observed, the resulting curve of pairs $(1-P_e(\boldsymbol{\theta}),\,1-P'_e(\boldsymbol{\theta}))$ covers about half of the Pareto-optimal points, while the curve of solutions corresponding to the FIRE risk (see (7) and Fig. 9(c)) covers all of them. This simulation shows that even if the Hellinger and Fisher-Rao distances are theoretically equivalent, their training dynamics are quite different. In addition, FRD seems better suited for training in the Gaussian case.

B.2.2 Comparison based on real data

We now study the empirical difference between these two distances on real datasets. We therefore perform the same simulations as those provided in Section 5.2.1, using the squared Hellinger distance between the natural and the adversarial predictions as the robustness regularizer. The results are summarized in Table III. Even though the Hellinger distance performs better as a robust regularizer than the Kullback-Leibler divergence, it still performs worse than the Fisher-Rao distance. On real data, FRD thus appears to be better suited for training than the Hellinger distance.

In conclusion, FRD provides a better regularization objective than the Hellinger distance.

Appendix C Proof of Proposition 1 and Theorem 3

C.1 Proof of Proposition 1

Notice that the FRD in (9) can be written as follows:

d_{R,\mathcal{C}}(q_{\boldsymbol{\theta}}, q'_{\boldsymbol{\theta}}) = 2\big|f(h_{\boldsymbol{\theta}}(\mathbf{x}')) - f(h_{\boldsymbol{\theta}}(\mathbf{x}))\big|,   (49)

where we have defined the function $f:\mathbb{R}\to\mathbb{R}$ as $f(z)\doteq\arctan(e^{z/2})$. Notice that $f$ is a smooth function. Using the mean value theorem, we have

d_{R,\mathcal{C}}(q_{\boldsymbol{\theta}}, q'_{\boldsymbol{\theta}}) = 2|f'(c)|\cdot\big|h_{\boldsymbol{\theta}}(\mathbf{x}') - h_{\boldsymbol{\theta}}(\mathbf{x})\big|,   (50)

where $f'(z) = e^{z/2}/(2e^{z}+2)$ is the derivative of $f$ and $c$ is a point in the open interval with endpoints $h_{\boldsymbol{\theta}}(\mathbf{x})$ and $h_{\boldsymbol{\theta}}(\mathbf{x}')$. Notice that $0\leq f'(z)\leq 1/4$ for any $z$. Then, we have

d_{R,\mathcal{C}}(q_{\boldsymbol{\theta}}, q'_{\boldsymbol{\theta}}) \leq \sup_{c}\, 2|f'(c)|\cdot\big|h_{\boldsymbol{\theta}}(\mathbf{x}') - h_{\boldsymbol{\theta}}(\mathbf{x})\big| = \frac{1}{2}\big|h_{\boldsymbol{\theta}}(\mathbf{x}') - h_{\boldsymbol{\theta}}(\mathbf{x})\big|.   (51)-(52)
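A quick numerical check of the bound (51)-(52), assuming only NumPy and random logit pairs, reads:

```python
import numpy as np

rng = np.random.default_rng(0)
h, h_prime = rng.normal(size=1000), rng.normal(size=1000)  # random logit pairs

frd = 2.0 * np.abs(np.arctan(np.exp(h_prime / 2)) - np.arctan(np.exp(h / 2)))
bound = 0.5 * np.abs(h_prime - h)

assert np.all(frd <= bound + 1e-12)  # the bound (51)-(52) holds on all pairs
print(float(frd.max()), float(bound.max()))
```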

C.2 Proof of Theorem 3

Consider first the Hellinger distance between two distributions $q$ and $q'$ over $\mathcal{Y}$, defined as follows:

\text{H}(q,q') \doteq \sqrt{2}\left(1 - \sum_{y\in\mathcal{Y}}\sqrt{q(y)\,q'(y)}\right)^{1/2}.

Using (15), we readily obtain the following relation:

\text{H}(q,q') = \sqrt{2}\left[1 - \cos\left(\frac{d_{R,\mathcal{C}}(q,q')}{2}\right)\right]^{1/2}.   (53)

We now use the following inequality relating the Hellinger distance and the KL divergence [47, Lemma 2.4]:

\text{H}^{2}(q,q') \leq \text{KL}(q,q').   (54)

Using the relation (53), we obtain the desired bound (16). The second-order approximation (17) follows directly from [45, Theorem 4.4.5].
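The key inequality (54) can likewise be checked numerically; the sketch below (NumPy only, with random Dirichlet-sampled distribution pairs) illustrates it:

```python
import numpy as np

rng = np.random.default_rng(1)
for _ in range(1000):
    p, q = rng.dirichlet(np.ones(5)), rng.dirichlet(np.ones(5))
    h2 = 2.0 * (1.0 - np.sum(np.sqrt(p * q)))  # squared Hellinger distance
    kl = np.sum(p * np.log(p / q))             # KL divergence
    assert h2 <= kl + 1e-12                    # Lemma 2.4 of [47]
print("H^2 <= KL held on all sampled pairs")
```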