
Existence and Minimax Theorems for Adversarial Surrogate Risks in Binary Classification

Natalie S. Frank ([email protected])
Courant Institute
New York University
New York, NY 10012, USA

Jonathan Niles-Weed ([email protected])
Courant Institute and Center for Data Science
New York University
New York, NY 10012, USA
Abstract

We prove existence, minimax, and complementary slackness theorems for adversarial surrogate risks in binary classification. These results extend recent work of Pydi and Jog (2021), who established analogous minimax and existence theorems for the adversarial classification risk. We show that their conclusions continue to hold for a very general class of surrogate losses; moreover, we remove some of the technical restrictions present in prior work. Our results provide an explanation for the phenomenon of transfer attacks and inform new directions in algorithm development.

Keywords: Adversarial Learning, Minimax Theorems, Optimal Transport, Adversarial Bayes risk, Convex Relaxation

1 Introduction

Neural networks are state-of-the-art methods for a variety of machine learning tasks, including image classification and speech recognition. However, a concerning problem with these models is their susceptibility to adversarial attacks: small perturbations to inputs can cause incorrect classification by the network (Szegedy et al., 2013; Biggio et al., 2013). This issue has security implications; for instance, Gu et al. (2017) show that a yellow sticker can cause a neural net to misclassify a stop sign. Furthermore, one can find adversarial examples that generalize to other neural nets; such attacks are called transfer attacks. In other words, an adversarial example generated for one neural net will sometimes be an adversarial example for a different neural net trained for the same classification problem (Tramèr et al., 2017; Demontis et al., 2018; Kurakin et al., 2017; Rozsa et al., 2016; Papernot et al., 2016). This phenomenon shows that access to a specific neural net is not necessary for generating adversarial examples. One method for defending against such adversarial perturbations is adversarial training, in which a neural net is trained on adversarially perturbed data points (Kurakin et al., 2017; Madry et al., 2019; Wang et al., 2021). However, adversarial training is not well understood from a theoretical perspective.

From a theoretical standpoint, the most fundamental question is whether it is possible to design models which are robust to such attacks, and what the properties of such robust models might be. In contrast to the classical, non-adversarial setting, much is still unknown about the basic properties of optimal robust models. For instance, in the context of binary classification, several prior works study properties of the adversarial classification risk, i.e., the expected number of classification errors under adversarial perturbations. A minimizer of the adversarial risk is optimally robust against adversarial perturbations of the data; however, it is not clear under what conditions such a minimizer exists. Recently, Awasthi et al. (2021b), Bungert et al. (2021), and Pydi and Jog (2021) all showed existence of a minimizer to the adversarial classification risk under suitable assumptions, and characterized some of its properties. A crucial observation, emphasized by Pydi and Jog (2021), is that minimizing the adversarial classification risk is equivalent to a dual robust classification problem involving uncertainty sets with respect to the $\infty$-Wasserstein metric. This observation gives rise to a game-theoretic interpretation of robustness, which takes the form of a zero-sum game between a classifier and an adversary who is allowed to perturb the data by a certain amount. As noted by Pydi and Jog (2021), this interpretation has implications for algorithm design by suggesting that robust classifiers can be constructed by jointly optimizing over classification rules and adversarial perturbations.

This recent progress on adversarial binary classification lays the groundwork for a theoretical understanding of adversarial robustness, but it is limited insofar as it focuses only on minimizers of the adversarial classification risk. In practice, minimizing the empirical adversarial classification risk is computationally intractable; as a consequence, the adversarial training procedure typically minimizes an objective based on a surrogate risk, which is chosen to be easier to optimize. In the non-adversarial setting, the properties of surrogate risks are well known (see, e.g., Bartlett et al., 2006), but in the adversarial scenario, existing results for the adversarial classification risk fail to carry over to surrogate risks. In particular, the existence and minimax results described above are not known to hold. We close this gap by developing an analogous theory for adversarial surrogate risks. Our main theorems (Theorems 7–9) establish that strong duality holds for the adversarial surrogate risk minimization problem, that solutions to the primal and dual problems exist, and that these optimizers satisfy a complementary slackness condition.

These results suggest explanations for empirical observations, such as the existence of transfer attacks. Specifically, our analysis suggests that adversarial examples are a property of the data distribution rather than a specific model. In fact, our complementary slackness theorem states that certain attacks are the strongest possible adversary against any function that minimizes the adversarial surrogate risk, which might explain why adversarial examples tend to transfer between trained neural nets. Furthermore, our theorems suggest that a training algorithm should optimize over neural nets and adversarial perturbations simultaneously. Adversarial training, the current state-of-the-art method for finding adversarially robust networks, does not follow this procedure. The adversarial training algorithm tracks an estimate of the optimal function $\tilde f$ and updates $\tilde f$ through gradient descent. To update $\tilde f$, the algorithm first finds optimal adversarial examples at the current estimate $\tilde f$. See the papers (Kurakin et al., 2017; Madry et al., 2019; Goodfellow et al., 2014) for more details on adversarial training. Finding these adversarial examples is a computationally intensive procedure. On the other hand, algorithms for optimizing minimax problems in the finite-dimensional setting alternate between primal and dual steps (Mokhtari et al., 2019). This observation suggests that designing an algorithm that optimizes over model parameters and adversarial perturbations simultaneously is a promising research direction. Trillos and Trillos (2023), Wang and Chizat (2023), and Domingo-Enrich et al. (2021) adopt this approach, and one can view the minimax results of this paper as a mathematical justification for the use of surrogate losses in such algorithms.

Lastly, our theorems are an important first step in understanding the statistical properties of surrogate losses. Recall that one minimizes a surrogate risk because minimizing the original risk is computationally intractable. If a sequence of functions which minimizes the surrogate risk also minimizes the classification risk, then that surrogate is referred to as a consistent risk. Similarly, if a sequence of functions which minimizes the adversarial surrogate risk also minimizes the adversarial classification risk, then that surrogate is referred to as an adversarially consistent risk. Much prior work studies the consistency of surrogate risks (Bartlett et al., 2006; Lin, 2004; Steinwart, 2007; Long, 2013; Zhang, 2020; Zhang, 2004). Alarmingly, Meunier et al. (2022) show that a family of surrogates used in applications is not adversarially consistent. In follow-up work, we show that our results can be used to characterize adversarially consistent supremum-based risks for binary classification (Frank and Niles-Weed, 2023), strengthening results on calibration in the adversarial setting (Bao et al., 2021; Meunier et al., 2022; Awasthi et al., 2021a).

2 Related Works

This paper extends prior work on the adversarial Bayes classifier. Pydi and Jog (2021) first proved multiple minimax theorems for the adversarial classification risk using optimal transport and Choquet capacities, showing an intimate connection between adversarial learning and optimal transport. Later, follow-up work used optimal transport minimax reformulations of the adversarial learning problem to derive new algorithms for adversarial learning. Trillos et al. (2022) reformulate adversarial learning in terms of a multi-marginal optimal transport problem and then apply existing techniques from optimal transport to find a new algorithm. Trillos and Trillos (2023); Wang and Chizat (2023); Domingo-Enrich et al. (2021) propose ascent-descent algorithms based on optimal transport and use mean-field dynamics to analyze convergence. These approaches leverage the minimax view of the adversarial training problem to optimize over model parameters and optimal attacks simultaneously. Gao et al. (2022) use an optimal transport reformulation to find regularizers that encourage robustness. Wong et al. (2019); Wu et al. (2020) use Wasserstein metrics to formulate adversarial attacks on neural networks.

Further work analyzes properties of the adversarial Bayes classifier. Awasthi et al. (2021b), Bhagoji et al. (2019), and Bungert et al. (2021) all prove the existence of the adversarial Bayes classifier using different techniques. Yang et al. (2020) studied the adversarial Bayes classifier in the context of non-parametric methods. Pydi and Jog (2019) and Bhagoji et al. (2019) further introduced methods from optimal transport to study adversarial learning. Lastly, Trillos and Murray (2020) give necessary and sufficient conditions describing the boundary of the adversarial Bayes classifier. Simultaneous work (Li and Telgarsky, 2023) also proves the existence of minimizers to adversarial surrogate risks using prior results on the adversarial Bayes classifier.

The adversarial training algorithm is also well studied from an empirical perspective. Demontis et al. (2018) discuss an explanation of transfer attacks on neural nets trained using standard methods, but do not extend their analysis to the adversarial training setting. Wang et al. (2021), Kurakin et al. (2017), and Madry et al. (2019) study the adversarial training algorithm and ways to improve its runtime. Two particularly popular attacks used in adversarial training are the FGSM attack (Goodfellow et al., 2014) and the PGD attack (Madry et al., 2019). More recent popular variants of this algorithm include (Shafahi et al., 2019; Xie et al., 2018; Kannan et al., 2018; Wong et al., 2020).

3 Background and Notation

3.1 Adversarial Classification

This paper studies binary classification on $\mathbb{R}^d$ with two classes encoded as $-1$ and $+1$. Data is distributed according to a distribution $\mathcal{D}$ on $\mathbb{R}^d\times\{-1,+1\}$. We denote the marginals according to the class label by ${\mathbb{P}}_0(S)=\mathcal{D}(S\times\{-1\})$ and ${\mathbb{P}}_1(S)=\mathcal{D}(S\times\{+1\})$. Throughout the paper, we assume ${\mathbb{P}}_0(\mathbb{R}^d)$ and ${\mathbb{P}}_1(\mathbb{R}^d)$ are finite, but not necessarily that ${\mathbb{P}}_0(\mathbb{R}^d)+{\mathbb{P}}_1(\mathbb{R}^d)=1$.

To classify points in $\mathbb{R}^d$, algorithms typically learn a real-valued function $f$ and then classify points ${\mathbf{x}}$ according to the sign of $f$ (arbitrarily assigning the sign of 0 to be $-1$). The classification error, also known as the classification risk, of a function $f$ is

R(f)=\int {\mathbf{1}}_{f({\mathbf{x}})\leq 0}\,d{\mathbb{P}}_1+\int {\mathbf{1}}_{f({\mathbf{x}})>0}\,d{\mathbb{P}}_0. (1)

Notice that finding minimizers to $R$ is straightforward: define the measure ${\mathbb{P}}={\mathbb{P}}_0+{\mathbb{P}}_1$ and let $\eta=d{\mathbb{P}}_1/d{\mathbb{P}}$. Then the risk $R$ can be re-written as

R(f)=\int \eta({\mathbf{x}}){\mathbf{1}}_{f({\mathbf{x}})\leq 0}+(1-\eta({\mathbf{x}})){\mathbf{1}}_{f({\mathbf{x}})>0}\,d{\mathbb{P}}.

Hence a minimizer of $R$ must minimize the function $C(\eta({\mathbf{x}}),\alpha)=\eta({\mathbf{x}}){\mathbf{1}}_{\alpha\leq 0}+(1-\eta({\mathbf{x}})){\mathbf{1}}_{\alpha>0}$ pointwise at ${\mathbb{P}}$-a.e. ${\mathbf{x}}$. The optimal Bayes risk is then

\inf_f R(f)=\int C^*(\eta)\,d{\mathbb{P}}

where $C^*(\eta)=\inf_\alpha C(\eta,\alpha)=\min(\eta,1-\eta)$.
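For intuition, this pointwise minimization is easy to carry out numerically. The following is a minimal sketch with hypothetical values of $\eta$ and ${\mathbb{P}}$-masses (illustrative numbers, not data from the paper):

```python
import numpy as np

# Minimal sketch (hypothetical values, not from the paper): the pointwise
# Bayes rule. C(eta, alpha) is minimized by predicting the more likely class,
# so the pointwise optimal value is C*(eta) = min(eta, 1 - eta).
eta = np.array([0.1, 0.4, 0.5, 0.9])      # values of eta(x) at four points
weights = np.full(4, 0.25)                # P-masses at those points

f_star = np.where(eta > 0.5, 1.0, -1.0)   # a sign pattern achieving C*(eta)
bayes_risk = np.sum(weights * np.minimum(eta, 1 - eta))
print(f_star, bayes_risk)                 # [-1. -1. -1.  1.] 0.275
```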

This paper analyzes the evasion attack, in which an adversary knows both the function $f$ and the true label of the data point, and can perturb each input before it is evaluated by $f$. To constrain the adversary, we assume that perturbations are bounded by $\epsilon$ in a norm $\|\cdot\|$. Thus a point ${\mathbf{x}}$ with label $+1$ is misclassified if there is a perturbation ${\mathbf{h}}$ with $\|{\mathbf{h}}\|\leq\epsilon$ for which $f({\mathbf{x}}+{\mathbf{h}})\leq 0$, and a point ${\mathbf{x}}$ with label $-1$ is misclassified if there is a perturbation ${\mathbf{h}}$ with $\|{\mathbf{h}}\|\leq\epsilon$ for which $f({\mathbf{x}}+{\mathbf{h}})>0$. Therefore, the adversarial classification risk is

R^\epsilon(f)=\int\sup_{\|{\mathbf{h}}\|\leq\epsilon}{\mathbf{1}}_{f({\mathbf{x}}+{\mathbf{h}})\leq 0}\,d{\mathbb{P}}_1+\int\sup_{\|{\mathbf{h}}\|\leq\epsilon}{\mathbf{1}}_{f({\mathbf{x}}+{\mathbf{h}})>0}\,d{\mathbb{P}}_0 (2)

which is the expected proportion of errors subject to the adversarial evasion attack. The expression $\sup_{\|{\mathbf{h}}\|\leq\epsilon}{\mathbf{1}}_{f({\mathbf{x}}+{\mathbf{h}})\leq 0}$ evaluates to 1 at a point ${\mathbf{x}}$ iff ${\mathbf{x}}$ can be moved into the set $[f\leq 0]$ by a perturbation of size at most $\epsilon$; the set of such points is the Minkowski sum $\oplus$ of $[f\leq 0]$ and $\overline{B_\epsilon({\mathbf{0}})}$. For any set $A$, let $A^\epsilon$ denote

A^\epsilon=\{{\mathbf{x}}\colon\exists{\mathbf{h}}\text{ with }\|{\mathbf{h}}\|\leq\epsilon\text{ and }{\mathbf{x}}+{\mathbf{h}}\in A\}=A\oplus\overline{B_\epsilon({\mathbf{0}})}=\bigcup_{{\mathbf{a}}\in A}\overline{B_\epsilon({\mathbf{a}})}. (3)

This operation 'thickens' the boundary of a set by $\epsilon$. With this notation, (2) can be expressed as $R^\epsilon(f)=\int{\mathbf{1}}_{\{f\leq 0\}^\epsilon}\,d{\mathbb{P}}_1+\int{\mathbf{1}}_{\{f>0\}^\epsilon}\,d{\mathbb{P}}_0$.
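The thickening operation (3) is straightforward to approximate on a grid. Below is a minimal sketch in one dimension; the set $A$, the grid, and $\epsilon$ are illustrative assumptions:

```python
import numpy as np

# Minimal sketch: the eps-thickening A^eps of a set A on a 1-D grid,
# following (3): x is in A^eps iff some point of A is within distance eps.
xs = np.linspace(-2.0, 2.0, 401)          # grid with spacing 0.01
A = np.abs(xs - 0.5) <= 0.25              # A = [0.25, 0.75] (illustrative)
eps = 0.1

dist_to_A = np.min(np.abs(xs[:, None] - xs[None, A]), axis=1)
A_eps = dist_to_A <= eps                  # indicator of A^eps on the grid

print(xs[A_eps].min(), xs[A_eps].max())   # approximately 0.15 and 0.85
```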

Unlike the classification risk $R$, finding minimizers to $R^\epsilon$ is nontrivial. One can re-write $R^\epsilon$ in terms of ${\mathbb{P}}$ and $\eta$, but the resulting integrand cannot be minimized in a pointwise fashion. Nevertheless, it can be shown that minimizers of $R^\epsilon$ exist (Awasthi et al., 2021b; Bungert et al., 2021; Pydi and Jog, 2021; Frank and Niles-Weed, 2023).

3.2 Surrogate Risks

As minimizing the empirical version of the risk in (1) is computationally intractable, typical machine learning algorithms minimize a proxy to the classification risk called a surrogate risk. In fact, Ben-David et al. (2003) show that minimizing the empirical classification risk is NP-hard in general. One popular surrogate is

R_\phi(f)=\int\phi(f)\,d{\mathbb{P}}_1+\int\phi(-f)\,d{\mathbb{P}}_0 (4)

where $\phi$ is a decreasing function. (Notice that due to the asymmetry of the sign function at 0 in (1), $R_\phi$ is not quite a generalization of $R$.) To define a classifier, one then thresholds $f$ at zero. There are many reasonable choices for $\phi$: one typically chooses an upper bound on the zero-one loss which is easy to optimize. We make the following assumption on $\phi$:

Assumption 1

The loss $\phi$ is non-increasing, non-negative, lower semi-continuous, and satisfies $\lim_{\alpha\to\infty}\phi(\alpha)=0$.

A particularly important example, which plays a large role in our proofs, is the exponential loss $\psi(\alpha)=e^{-\alpha}$, which will be denoted by $\psi$ in the remainder of this paper. Assumption 1 includes many but not all surrogate risks used in practice. Notably, some multiclass surrogate risks with two classes are of a somewhat different form; see (Tewari and Bartlett, 2007) for more details.
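For example, the exponential loss satisfies Assumption 1, and its conditional risk can be minimized in closed form (a standard computation, included here only as an illustration): for $\eta\in(0,1)$,

C_\psi(\eta,\alpha)=\eta e^{-\alpha}+(1-\eta)e^{\alpha},\qquad\frac{\partial}{\partial\alpha}C_\psi(\eta,\alpha)=0\iff e^{2\alpha}=\frac{\eta}{1-\eta},

so the unique minimizer is $\alpha=\frac{1}{2}\log\frac{\eta}{1-\eta}$ and $C_\psi^*(\eta)=2\sqrt{\eta(1-\eta)}$; when $\eta=1$ (resp. $\eta=0$) the infimum is approached only as $\alpha\to+\infty$ (resp. $\alpha\to-\infty$).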

In order to find minimizers of $R_\phi$, we rewrite the risk in terms of ${\mathbb{P}}$ and $\eta$ as

R_\phi(f)=\int\eta({\mathbf{x}})\phi(f({\mathbf{x}}))+(1-\eta({\mathbf{x}}))\phi(-f({\mathbf{x}}))\,d{\mathbb{P}}. (5)

Hence the minimizer of $R_\phi$ must minimize $C_\phi(\eta,\cdot)$ pointwise ${\mathbb{P}}$-a.e., where

C_\phi(\eta,\alpha)=\eta\phi(\alpha)+(1-\eta)\phi(-\alpha).

In other words, if one defines $C_\phi^*(\eta)=\inf_\alpha C_\phi(\eta,\alpha)$, then a function $f^*$ is optimal if and only if

\eta({\mathbf{x}})\phi(f^*({\mathbf{x}}))+(1-\eta({\mathbf{x}}))\phi(-f^*({\mathbf{x}}))=C_\phi^*(\eta({\mathbf{x}}))\quad{\mathbb{P}}\text{-a.e.} (6)

Thus one can write the minimum value of $R_\phi$ as

\inf_f R_\phi(f)=\int C_\phi^*(\eta)\,d{\mathbb{P}}. (7)

To guarantee the existence of a function satisfying (6), we must allow our functions to take values in the extended real numbers $\overline{\mathbb{R}}=\mathbb{R}\cup\{-\infty,+\infty\}$. Allowing the value $\alpha=+\infty$ is necessary, for instance, for the exponential loss $\psi(\alpha)=e^{-\alpha}$: when $\eta=1$, the minimum of $C_\psi(1,\alpha)=e^{-\alpha}$ is achieved at $\alpha=+\infty$. In fact, one can express a minimizer as a function of the conditional probability $\eta({\mathbf{x}})$ using (6). For a loss $\phi$, define $\alpha_\phi:[0,1]\to\overline{\mathbb{R}}$ by

\alpha_\phi(\eta)=\inf\{\alpha:\alpha\text{ is a minimizer of }C_\phi(\eta,\cdot)\}. (8)

Lemma 25 in Appendix C shows that the function $\alpha_\phi$ is monotonic and that $\alpha_\phi(\eta)$ is in fact a minimizer of $C_\phi(\eta,\cdot)$. Thus the function

f^*({\mathbf{x}})=\alpha_\phi(\eta({\mathbf{x}})) (9)

is measurable and satisfies (6). Therefore, $f^*$ must be a minimizer of the risk $R_\phi$.
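As a sanity check, $\alpha_\phi$ in (8) can be approximated by a grid search and compared with the closed form for the exponential loss. A minimal sketch follows; the grid and the test values of $\eta$ are arbitrary choices, not part of the paper:

```python
import numpy as np

# Minimal sketch: approximate alpha_phi(eta) from (8) by a grid search over
# alpha, and compare with the closed form for the exponential loss,
# alpha_psi(eta) = 0.5 * log(eta / (1 - eta)).
def alpha_phi(eta, phi, alphas):
    vals = eta * phi(alphas) + (1 - eta) * phi(-alphas)  # C_phi(eta, alpha)
    return alphas[np.argmin(vals)]        # first (smallest) grid minimizer

alphas = np.linspace(-10, 10, 200001)
psi = lambda a: np.exp(-a)
for eta in [0.2, 0.5, 0.9]:
    print(alpha_phi(eta, psi, alphas), 0.5 * np.log(eta / (1 - eta)))
```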

Similarly, rather than directly minimizing the adversarial classification risk, practical algorithms minimize an adversarial surrogate. The adversarial counterpart to (4) is

R_\phi^\epsilon(f)=\int\sup_{\|{\mathbf{h}}\|\leq\epsilon}\phi(f({\mathbf{x}}+{\mathbf{h}}))\,d{\mathbb{P}}_1+\int\sup_{\|{\mathbf{h}}\|\leq\epsilon}\phi(-f({\mathbf{x}}+{\mathbf{h}}))\,d{\mathbb{P}}_0. (10)

Due to the definitions of the adversarial risks (2) and (10), the operation of taking the supremum of a function over $\epsilon$-balls is central to our subsequent analysis. For a function $g$, we define

S_\epsilon(g)({\mathbf{x}})=\sup_{\|{\mathbf{h}}\|\leq\epsilon}g({\mathbf{x}}+{\mathbf{h}}). (11)

Using this notation, one can re-write the risk $R_\phi^\epsilon$ as

R_\phi^\epsilon(f)=\int S_\epsilon(\phi\circ f)\,d{\mathbb{P}}_1+\int S_\epsilon(\phi\circ -f)\,d{\mathbb{P}}_0.

By analogy to (5), we equivalently write the risk $R_\phi^\epsilon$ in terms of ${\mathbb{P}}$ and $\eta$:

R_\phi^\epsilon(f)=\int\eta({\mathbf{x}})S_\epsilon(\phi\circ f)({\mathbf{x}})+(1-\eta({\mathbf{x}}))S_\epsilon(\phi\circ -f)({\mathbf{x}})\,d{\mathbb{P}}. (12)

However, unlike (5), the integrand of $R_\phi^\epsilon$ cannot be minimized in a pointwise manner, so proving the existence of minimizers to $R_\phi^\epsilon$ is non-trivial. In fact, unlike for the adversarial classification risk $R^\epsilon$, there is little theoretical understanding of the properties of minimizers of $R_\phi^\epsilon$.
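To make (10) concrete, the following minimal sketch approximates $S_\epsilon$ by a grid search over perturbations and evaluates the empirical adversarial surrogate risk on a toy one-dimensional sample; the data, loss, and score function are illustrative assumptions:

```python
import numpy as np

# Minimal sketch: S_eps(g)(x) = sup_{|h| <= eps} g(x + h), approximated on a
# grid of perturbations, and the empirical version of the risk (10).
def S_eps(g, x, eps, n_grid=101):
    hs = np.linspace(-eps, eps, n_grid)
    return np.max(g(x[:, None] + hs[None, :]), axis=1)

phi = lambda a: np.exp(-a)                # exponential surrogate loss
f = lambda x: x                           # illustrative score function
x1 = np.array([0.5, 1.0, 2.0])            # samples with label +1
x0 = np.array([-1.0, -0.3])               # samples with label -1
eps = 0.1

risk = (S_eps(lambda x: phi(f(x)), x1, eps).mean()
        + S_eps(lambda x: phi(-f(x)), x0, eps).mean())
print(risk)
```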

3.3 Measurability

In order to define the adversarial risks $R^\epsilon$ and $R_\phi^\epsilon$, one must show that $S_\epsilon({\mathbf{1}}_A)$ and $S_\epsilon(\phi\circ f)$ are measurable. To illustrate this concern, Pydi and Jog (2021) show that for every $\epsilon>0$ and $d>1$, there is a Borel set $C$ for which the function $S_\epsilon({\mathbf{1}}_C)({\mathbf{x}})$ is not Borel measurable. However, if $g$ is Borel, then $S_\epsilon(g)$ is always measurable with respect to a larger $\sigma$-algebra called the universal $\sigma$-algebra ${\mathscr{U}}(\mathbb{R}^d)$. Such a function is called universally measurable. We prove the following theorem and formally define the universal $\sigma$-algebra in Appendix A.

Theorem 1

If $f$ is universally measurable, then $S_\epsilon(f)$ is also universally measurable.

In fact, in Appendix A we show that a function defined by a supremum over a compact set is universally measurable, a result of independent interest. The universal $\sigma$-algebra is smaller than the completion of ${\mathcal{B}}(\mathbb{R}^d)$ with respect to any Borel measure. Thus, in the remainder of the paper, unless otherwise noted, all measures will be Borel measures and the expression $\int S_\epsilon(f)\,d{\mathbb{Q}}$ will be interpreted as the integral of $S_\epsilon(f)$ with respect to the completion of ${\mathbb{Q}}$.

3.4 The $W_\infty$ metric

In this section, we explain how the integral of a supremum, $\int S_\epsilon(f)\,d{\mathbb{Q}}$, can be expressed as a supremum of integrals. We start by defining the Wasserstein-$\infty$ metric.

Definition 2

Let ${\mathbb{P}},{\mathbb{Q}}$ be two finite measures with ${\mathbb{P}}(\mathbb{R}^d)={\mathbb{Q}}(\mathbb{R}^d)$. A coupling is a positive measure on the product space $\mathbb{R}^d\times\mathbb{R}^d$ with marginals ${\mathbb{P}},{\mathbb{Q}}$. We denote the set of all couplings with marginals ${\mathbb{P}},{\mathbb{Q}}$ by $\Pi({\mathbb{P}},{\mathbb{Q}})$. The $\infty$-Wasserstein distance with respect to a norm $\|\cdot\|$ is defined as

W_\infty({\mathbb{P}},{\mathbb{Q}})=\inf_{\gamma\in\Pi({\mathbb{P}},{\mathbb{Q}})}\operatorname*{ess\,sup}_{({\mathbf{x}},{\mathbf{x}}')\sim\gamma}\|{\mathbf{x}}-{\mathbf{x}}'\|

In other words, ${\mathbb{P}},{\mathbb{Q}}$ are within a Wasserstein-$\infty$ distance of $\epsilon$ if there is a coupling $\gamma$ of ${\mathbb{P}}$ and ${\mathbb{Q}}$ for which $\operatorname{supp}\gamma$ is contained in the set $\Delta_\epsilon=\{({\mathbf{x}},{\mathbf{x}}')\colon\|{\mathbf{x}}-{\mathbf{x}}'\|\leq\epsilon\}$.
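For empirical measures, $W_\infty$ can be computed directly. In the special case of two uniform empirical measures on $n$ atoms each, an optimal coupling can be taken along a permutation, so $W_\infty$ reduces to a bottleneck matching problem. The brute-force sketch below (illustrative 1-D atoms, absolute-value norm) is exponential in $n$ but exact for tiny examples:

```python
import numpy as np
from itertools import permutations

# Minimal sketch: W_infty between uniform empirical measures on n atoms each,
# computed as a bottleneck matching (minimize, over permutations, the largest
# distance any atom must travel).
def w_inf(P_atoms, Q_atoms):
    n = len(P_atoms)
    D = np.abs(P_atoms[:, None] - Q_atoms[None, :])   # pairwise distances
    return min(max(D[i, p[i]] for i in range(n)) for p in permutations(range(n)))

P = np.array([0.0, 1.0, 2.0])
Q = np.array([0.1, 0.9, 2.3])
print(w_inf(P, Q))   # approximately 0.3: the identity matching is optimal here
```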

The $\infty$-Wasserstein metric is closely related to the operation $S_\epsilon$. First, we show that the integral of $S_\epsilon(f)$ can be expressed as a supremum of integrals over a Wasserstein-$\infty$ ball. For a measure ${\mathbb{Q}}$, we write

{\mathcal{B}}^\infty_\epsilon({\mathbb{Q}})=\{{\mathbb{Q}}'\text{ Borel}:W_\infty({\mathbb{Q}},{\mathbb{Q}}')\leq\epsilon\}.
Lemma 3

Let ${\mathbb{Q}}$ be a finite positive Borel measure and let $f\colon\mathbb{R}^d\to\mathbb{R}\cup\{\infty\}$ be a Borel measurable function. Then

\int S_\epsilon(f)\,d{\mathbb{Q}}=\sup_{{\mathbb{Q}}'\in{\mathcal{B}}^\infty_\epsilon({\mathbb{Q}})}\int f\,d{\mathbb{Q}}' (13)

Lemma 5.1 of Pydi and Jog (2021) proves an analogous statement for sets, namely that ${\mathbb{Q}}(A^\epsilon)=\sup_{{\mathbb{Q}}'\in{\mathcal{B}}^\infty_\epsilon({\mathbb{Q}})}{\mathbb{Q}}'(A)$, under suitable assumptions on ${\mathbb{Q}}$ and ${\mathbb{Q}}'$.
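For an empirical measure ${\mathbb{Q}}$, both sides of (13) can be evaluated directly: the supremum is attained by a ${\mathbb{Q}}'$ that transports each atom to an argmax of $f$ over its $\epsilon$-ball. A minimal numerical sketch, with illustrative $f$, atoms, and weights:

```python
import numpy as np

# Minimal sketch of Lemma 3 for an empirical Q: moving each atom x_i to an
# (approximate) argmax of f over the closed eps-ball around x_i gives a
# feasible Q' in the W_infty ball, and both sides of (13) agree.
f = lambda x: np.sin(3 * x) + 0.5 * x
atoms = np.array([-1.0, 0.2, 1.5])
w = np.array([0.5, 0.3, 0.2])             # atom masses of Q
eps = 0.25

hs = np.linspace(-eps, eps, 2001)
vals = f(atoms[:, None] + hs[None, :])    # f evaluated on each atom's eps-ball
lhs = np.sum(w * np.max(vals, axis=1))    # int S_eps(f) dQ

Q_prime = atoms + hs[np.argmax(vals, axis=1)]   # the transported atoms
rhs = np.sum(w * f(Q_prime))              # int f dQ' at the maximizing Q'
print(lhs, rhs)   # identical here; both approximate the sup up to the grid
```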

Conversely, the $W_\infty$ distance between two probability measures can be expressed in terms of the integrals of $f$ and $S_\epsilon(f)$. Let $C_b(X)$ be the set of continuous bounded functions on the topological space $X$.

Lemma 4

Let ${\mathbb{P}},{\mathbb{Q}}$ be two finite positive Borel measures with ${\mathbb{P}}(\mathbb{R}^d)={\mathbb{Q}}(\mathbb{R}^d)$. Then

W_\infty({\mathbb{P}},{\mathbb{Q}})=\inf\left\{\epsilon\geq 0\colon\int h\,d{\mathbb{Q}}\leq\int S_\epsilon(h)\,d{\mathbb{P}}\quad\forall h\in C_b(\mathbb{R}^d)\right\}

This observation will be central to proving a duality result. See Appendix B for proofs of Lemmas 3 and 4.

4 Main Results and Outline of the Paper

4.1 Summary of Main Results

Our goal in this paper is to understand properties of the surrogate risk minimization problem $\inf_f R_\phi^\epsilon$. The starting point for our results is Lemma 3, which implies that $\inf_f R_\phi^\epsilon$ actually involves an inf followed by a sup:

\inf_{f\text{ Borel}}R_\phi^\epsilon(f)=\inf_{f\text{ Borel}}\sup_{{\mathbb{P}}_0'\in{\mathcal{B}}^\infty_\epsilon({\mathbb{P}}_0),\,{\mathbb{P}}_1'\in{\mathcal{B}}^\infty_\epsilon({\mathbb{P}}_1)}\int\phi\circ f\,d{\mathbb{P}}_1'+\int\phi\circ -f\,d{\mathbb{P}}_0'.

We therefore obtain a lower bound on $\inf_f R_\phi^\epsilon$ by swapping the sup and inf and recalling the definition $C_\phi^*(\eta)=\inf_\alpha C_\phi(\eta,\alpha)$:

\inf_{f\text{ Borel}}R_\phi^\epsilon(f)\geq\sup_{{\mathbb{P}}_0'\in{\mathcal{B}}^\infty_\epsilon({\mathbb{P}}_0),\,{\mathbb{P}}_1'\in{\mathcal{B}}^\infty_\epsilon({\mathbb{P}}_1)}\inf_{f\text{ Borel}}\int\phi\circ f\,d{\mathbb{P}}_1'+\int\phi\circ -f\,d{\mathbb{P}}_0'
=\sup_{{\mathbb{P}}_0'\in{\mathcal{B}}^\infty_\epsilon({\mathbb{P}}_0),\,{\mathbb{P}}_1'\in{\mathcal{B}}^\infty_\epsilon({\mathbb{P}}_1)}\inf_{f\text{ Borel}}\int\frac{d{\mathbb{P}}_1'}{d({\mathbb{P}}_0'+{\mathbb{P}}_1')}\phi(f)+\left(1-\frac{d{\mathbb{P}}_1'}{d({\mathbb{P}}_0'+{\mathbb{P}}_1')}\right)\phi(-f)\,d({\mathbb{P}}_0'+{\mathbb{P}}_1')
\geq\sup_{{\mathbb{P}}_0'\in{\mathcal{B}}^\infty_\epsilon({\mathbb{P}}_0),\,{\mathbb{P}}_1'\in{\mathcal{B}}^\infty_\epsilon({\mathbb{P}}_1)}\int C_\phi^*\left(\frac{d{\mathbb{P}}_1'}{d({\mathbb{P}}_0'+{\mathbb{P}}_1')}\right)d({\mathbb{P}}_0'+{\mathbb{P}}_1') (14)

If we define

\bar{R}_\phi({\mathbb{P}}_0',{\mathbb{P}}_1')=\int C_\phi^*\left(\frac{d{\mathbb{P}}_1'}{d({\mathbb{P}}_0'+{\mathbb{P}}_1')}\right)d({\mathbb{P}}_0'+{\mathbb{P}}_1'), (15)

then we have shown

\inf_{f\text{ Borel}}R_\phi^\epsilon(f)\geq\sup_{{\mathbb{P}}_0'\in{\mathcal{B}}^\infty_\epsilon({\mathbb{P}}_0),\,{\mathbb{P}}_1'\in{\mathcal{B}}^\infty_\epsilon({\mathbb{P}}_1)}\bar{R}_\phi({\mathbb{P}}_0',{\mathbb{P}}_1'). (16)

This statement is a form of weak duality.

When the surrogate adversarial risk is replaced by the standard adversarial classification risk, Pydi and Jog (2021) proved that the analogue of (16) is actually an equality, so that strong duality holds for the adversarial classification problem. Concretely, by analogy to (15), consider

\bar{R}({\mathbb{P}}_0',{\mathbb{P}}_1')=\int C^*\left(\frac{d{\mathbb{P}}_1'}{d({\mathbb{P}}_0'+{\mathbb{P}}_1')}\right)d({\mathbb{P}}_0'+{\mathbb{P}}_1').

Let $\mu$ be the Lebesgue measure and let $\mathcal{L}_\mu(\mathbb{R}^d)$ be the Lebesgue $\sigma$-algebra. Then define

\tilde{{\mathcal{B}}}^\infty_\epsilon({\mathbb{Q}})=\{{\mathbb{Q}}'\colon W_\infty({\mathbb{Q}},{\mathbb{Q}}')\leq\epsilon,\ {\mathbb{Q}}'\text{ a measure on }(\mathbb{R}^d,\mathcal{L}_\mu(\mathbb{R}^d))\}. (17)

Pydi and Jog (2021) show the following.

Theorem 5 (Pydi and Jog, 2021, Theorem 7.1)

Assume that ${\mathbb{P}}_0,{\mathbb{P}}_1$ are absolutely continuous with respect to the Lebesgue measure $\mu$. Then

\inf_{f\text{ Lebesgue}}R^\epsilon(f)=\sup_{{\mathbb{P}}_0'\in\tilde{{\mathcal{B}}}^\infty_\epsilon({\mathbb{P}}_0),\,{\mathbb{P}}_1'\in\tilde{{\mathcal{B}}}^\infty_\epsilon({\mathbb{P}}_1)}\bar{R}({\mathbb{P}}_0',{\mathbb{P}}_1') (18)

and furthermore equality is attained at some Lebesgue measurable $\hat f$ and $\hat{\mathbb{P}}_1,\hat{\mathbb{P}}_0$.

Additionally, $\hat{\mathbb{P}}_i={\mathbb{P}}_i\circ\varphi_i^{-1}$ for some universally measurable $\varphi_i$ with $\|\varphi_i({\mathbf{x}})-{\mathbf{x}}\|\leq\epsilon$, $\sup_{\|{\mathbf{y}}-{\mathbf{x}}\|\leq\epsilon}{\mathbf{1}}_{\hat f({\mathbf{y}})\leq 0}={\mathbf{1}}_{\hat f(\varphi_1({\mathbf{x}}))\leq 0}$ ${\mathbb{P}}_1$-a.e., and $\sup_{\|{\mathbf{y}}-{\mathbf{x}}\|\leq\epsilon}{\mathbf{1}}_{\hat f({\mathbf{y}})>0}={\mathbf{1}}_{\hat f(\varphi_0({\mathbf{x}}))>0}$ ${\mathbb{P}}_0$-a.e.

This is a foundational result in the theory of adversarial classification, but it leaves open a question crucial for applications: does the strong duality relation extend to surrogate risks and to general measures? In this work, we answer this question in the affirmative.

We start by proving the following:

Theorem 6 (Strong Duality)

Let ${\mathbb{P}}_0,{\mathbb{P}}_1$ be finite Borel measures. Then

\inf_{f\text{ Borel}}R_\phi^\epsilon(f)=\sup_{{\mathbb{P}}_0'\in{\mathcal{B}}^\infty_\epsilon({\mathbb{P}}_0),\,{\mathbb{P}}_1'\in{\mathcal{B}}^\infty_\epsilon({\mathbb{P}}_1)}\bar{R}_\phi({\mathbb{P}}_0',{\mathbb{P}}_1'). (19)

When $\epsilon=0$, we recover the fundamental characterization of the minimum risk for standard (non-adversarial) classification in (7). Theorem 6 can be rephrased as

\inf_{f\text{ Borel}}\sup_{{\mathbb{P}}_0'\in{\mathcal{B}}^\infty_\epsilon({\mathbb{P}}_0),\,{\mathbb{P}}_1'\in{\mathcal{B}}^\infty_\epsilon({\mathbb{P}}_1)}\hat{R}_\phi(f,{\mathbb{P}}_0',{\mathbb{P}}_1')=\sup_{{\mathbb{P}}_0'\in{\mathcal{B}}^\infty_\epsilon({\mathbb{P}}_0),\,{\mathbb{P}}_1'\in{\mathcal{B}}^\infty_\epsilon({\mathbb{P}}_1)}\inf_{f\text{ Borel}}\hat{R}_\phi(f,{\mathbb{P}}_0',{\mathbb{P}}_1') (20)

where

\hat{R}_\phi(f,{\mathbb{P}}_0',{\mathbb{P}}_1')=\int\phi(f)\,d{\mathbb{P}}_1'+\int\phi(-f)\,d{\mathbb{P}}_0'.

As discussed in Pydi and Jog (2021), this result has an appealing game-theoretic interpretation: adversarial learning with a surrogate risk can be thought of as a zero-sum game between the learner, who selects a function $f$, and the adversary, who chooses perturbations through ${\mathbb{P}}_0'$ and ${\mathbb{P}}_1'$. Furthermore, the player to pick first does not have an advantage.

Additionally, (20) suggests that training adversarially robust classifiers could be accomplished by optimizing over primal functions $f$ and dual distributions ${\mathbb{P}}_0',{\mathbb{P}}_1'$ simultaneously.
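As a schematic illustration of this idea (not an algorithm from the paper), the sketch below runs simultaneous gradient steps for the two players of (20) in a toy one-dimensional problem: the learner updates the weight $w$ of a linear score $f(x)=wx$ by descent, while the adversary updates per-sample perturbations, projected onto $[-\epsilon,\epsilon]$, by ascent. All constants are illustrative assumptions:

```python
import numpy as np

# Schematic sketch of simultaneous primal-dual optimization for (20) with the
# exponential loss phi(a) = exp(-a).
rng = np.random.default_rng(0)
x1, x0 = rng.normal(1, 1, 50), rng.normal(-1, 1, 50)  # labels +1 / -1
h1, h0 = np.zeros(50), np.zeros(50)       # adversarial perturbations
w, eps, lr = 0.0, 0.3, 0.1

for _ in range(500):
    z1, z0 = x1 + h1, x0 + h0
    # risk = mean(exp(-w*z1)) + mean(exp(w*z0)); its gradients follow.
    g_w = np.mean(-z1 * np.exp(-w * z1)) + np.mean(z0 * np.exp(w * z0))
    g_h1 = -w * np.exp(-w * z1)           # ascent direction for the adversary
    g_h0 = w * np.exp(w * z0)
    w -= lr * g_w                         # learner: gradient descent
    h1 = np.clip(h1 + lr * g_h1, -eps, eps)   # adversary: projected ascent
    h0 = np.clip(h0 + lr * g_h0, -eps, eps)
print(w, h1[:3], h0[:3])
```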

A consequence of Theorem 6 is the following set of complementary slackness conditions for optimizers $f^*,{\mathbb{P}}_0^*,{\mathbb{P}}_1^*$:

Theorem 7 (Complementary Slackness)

The function $f^*$ is a minimizer of $R_\phi^\epsilon$ and $({\mathbb{P}}_0^*,{\mathbb{P}}_1^*)$ is a maximizer of $\bar{R}_\phi$ over the $W_\infty$ balls around ${\mathbb{P}}_0$ and ${\mathbb{P}}_1$ iff the following hold:

  1) \int\phi\circ f^*\,d{\mathbb{P}}_1^*=\int S_\epsilon(\phi(f^*))\,d{\mathbb{P}}_1\quad\text{and}\quad\int\phi\circ -f^*\,d{\mathbb{P}}_0^*=\int S_\epsilon(\phi(-f^*))\,d{\mathbb{P}}_0 (21)

  2) If we define ${\mathbb{P}}^*={\mathbb{P}}_0^*+{\mathbb{P}}_1^*$ and $\eta^*=d{\mathbb{P}}_1^*/d{\mathbb{P}}^*$, then

     \eta^*({\mathbf{x}})\phi(f^*({\mathbf{x}}))+(1-\eta^*({\mathbf{x}}))\phi(-f^*({\mathbf{x}}))=C_\phi^*(\eta^*({\mathbf{x}}))\quad{\mathbb{P}}^*\text{-a.e.} (22)

This theorem implies that every minimizer $f^*$ of $R_\phi^\epsilon$ and every maximizer $({\mathbb{P}}_0^*,{\mathbb{P}}_1^*)$ of $\bar{R}_\phi$ form a primal-dual pair. The condition (21) states that every maximizer of $\bar{R}_\phi$ is an optimal adversarial attack on $f^*$, while the condition (22) states that every minimizer $f^*$ of $R_\phi^\epsilon$ also minimizes the conditional risk $C_\phi(\eta^*,\cdot)$ under the distribution of optimal adversarial attacks. In other words, $f^*$ minimizes the standard surrogate risk with respect to the optimally adversarially perturbed distributions. Explicitly: (22) implies that every minimizer $f^*$ minimizes the loss $\hat{R}_\phi(f,{\mathbb{P}}_0^*,{\mathbb{P}}_1^*)=\int C_\phi(\eta^*({\mathbf{x}}),f({\mathbf{x}}))\,d{\mathbb{P}}^*$ in a pointwise manner ${\mathbb{P}}^*$-a.e. This fact is the relation (6) with respect to the measures ${\mathbb{P}}_0^*,{\mathbb{P}}_1^*$ that maximize the dual $\bar{R}_\phi$.

This interpretation of Theorems 6 and 7 sheds light on the phenomenon of transfer attacks. These theorems suggest that adversarial examples are a property of the data distribution rather than a specific model. Later results in the paper even show that there are maximizers of $\bar{R}_\phi$ that are independent of the choice of loss function $\phi$ (see Lemma 26). Theorem 7 specifically states that every maximizer of $\bar{R}_\phi$ is actually an optimal adversarial attack on every minimizer of $R_\phi^\epsilon$. Notably, this statement is independent of the choice of minimizer of $R_\phi^\epsilon$. Because neural networks are highly expressive model classes, one would hope that some neural net could achieve adversarial error close to $\inf_f R_\phi^\epsilon(f)$. If $f^*$ is a minimizer of $R_\phi^\epsilon$ and $g$ is a neural net with $R_\phi^\epsilon(g)\approx R_\phi^\epsilon(f^*)$, one would expect that an optimal adversarial attack against $f^*$ would be a successful attack on $g$ as well. Notice that in this discussion, the adversarial attack is independent of the neural net $g$. A method for calculating these optimal adversarial attacks is an open problem.

Finally, to demonstrate that Theorem 7 and the preceding discussion are non-vacuous, we prove the existence of primal and dual optimizers, along with results that elaborate on their structure.

Theorem 8

Let $\phi$ be a lower semi-continuous loss function. Then there exists a maximizer $({\mathbb{P}}_0^*,{\mathbb{P}}_1^*)$ of $\bar{R}_\phi$ over the set ${\mathcal{B}}^\infty_\epsilon({\mathbb{P}}_0)\times{\mathcal{B}}^\infty_\epsilon({\mathbb{P}}_1)$.

Theorem 3.5 of (Jylhä, 2014) implies that when the norm $\|\cdot\|$ is strictly convex and ${\mathbb{P}}_0,{\mathbb{P}}_1$ are absolutely continuous with respect to the Lebesgue measure, the optimal ${\mathbb{P}}_0^*,{\mathbb{P}}_1^*$ of Theorem 8 are induced by a transport map. Corollary 3.11 of (Jylhä, 2014) further implies that these transport maps are continuous a.e. with respect to the Lebesgue measure $\mu$. As the $\ell_\infty$ metric is commonly used in practice, whether there exist maximizers of the dual of this type for non-strictly convex norms remains an attractive open problem.

In analogy with (6) and (9), one would hope that, due to the complementary slackness condition (22), one could define a minimizer in terms of the conditional probability $\eta^*({\mathbf{x}})$. Notice, however, that as this quantity is only defined ${\mathbb{P}}^*$-a.e., verifying the other complementary slackness condition (21) would be a challenge. To circumvent this issue, we construct a function $\hat\eta:\mathbb{R}^d\to[0,1]$, defined on all of $\mathbb{R}^d$, to which we can apply (9). Concretely, we show that $\alpha_\phi(\hat\eta({\mathbf{x}}))$ is always a minimizer of $R_\phi^\epsilon$, with $\alpha_\phi$ as defined in (8).

Theorem 9

There exists a Borel function $\hat\eta:(\operatorname{supp}{\mathbb{P}})^\epsilon\to[0,1]$ for which $f^*({\mathbf{x}})=\alpha_\phi(\hat\eta({\mathbf{x}}))$ is a minimizer of $R_\phi^\epsilon$ for any $\phi$, with $\alpha_\phi$ as defined in (8). In particular, there exists a Borel minimizer of $R_\phi^\epsilon$.

In fact, we show that $\hat\eta$ is a version of the conditional derivative $d{\mathbb{P}}_1^*/d{\mathbb{P}}^*$, where ${\mathbb{P}}_0^*,{\mathbb{P}}_1^*$ are measures which maximize $\bar{R}_\phi$ independently of the choice of $\phi$ (see Lemma 24), as described in the discussion preceding Theorem 8. The fact that the function $\hat\eta$ is independent of the choice of loss $\phi$ suggests that the minimizer of $R_\phi^\epsilon$ encodes some fundamental quality of the distributions ${\mathbb{P}}_0,{\mathbb{P}}_1$.

Simultaneous work (Li and Telgarsky, 2023) also proves the existence of a minimizer to the primal $R_\phi^\epsilon$, along with a statement on the structure of this minimizer. Their approach leverages prior results on the adversarial Bayes classifier to construct a minimizer to the adversarial surrogate risk.

4.2 Outline of Main Argument

The central proof strategy of this paper is to apply the Fenchel-Rockafellar duality theorem. This classical result relates the infimum of a convex functional to the supremum of a concave functional. One can argue that $\bar{R}_\phi$ is concave (Lemma 12 below); however, the primal $R_\phi^\epsilon$ is not convex for non-convex $\phi$. Thus the Fenchel-Rockafellar theorem is applied to a convex relaxation $\Theta$ of the primal $R_\phi^\epsilon$.

The remainder of the paper then argues that minimizing $\Theta$ is equivalent to minimizing $R_\phi^\epsilon$. Notice that the Fenchel-Rockafellar theorem actually implies the existence of dual maximizers. We show that the dual maximizers of $\bar{R}_\psi$ for $\psi(\alpha)=e^{-\alpha}$ satisfy certain nice properties that are independent of the loss $\psi$. These properties then allow us to construct the function $\hat\eta$ appearing in Theorem 9, in addition to minimizers of $\Theta$, from the dual maximizers of $\bar{R}_\psi$, for any loss $\phi$. The construction of these minimizers guarantees that they minimize $R_\phi^\epsilon$ in addition to $\Theta$.

4.3 Paper Outline

Section 5 proves strong duality and complementary slackness theorems for $\bar{R}_\phi$ and $\Theta$, the convex relaxation of $R_\phi^\epsilon$. Next, in Section 6, a version of the complementary slackness result is used to prove the existence of minimizers to $\Theta$. Subsequently, Section 7 shows the equivalence between $\Theta$ and $R_\phi^\epsilon$.

Appendix A proves Theorem 1 and further discusses universal measurability. Next, Appendix B proves all the results about the $W_\infty$ metric used in this paper. Appendix C then defines the function $\alpha_\phi$, which is later used in the proof of several results. Appendices D, E, F, and G.3 contain technical deferred proofs.

5 A Duality Result for $\Theta$ and $\bar{R}_\phi$

5.1 Strong Duality

The fundamental duality relation of this paper follows from employing the Fenchel-Rockafellar theorem in conjunction with the Riesz representation theorem, stated below for reference. See e.g. (Villani, 2003) for more on this result.

Theorem 10 (Fenchel-Rockafellar Duality Theorem)

Let $E$ be a normed vector space, $E^*$ its topological dual, and $\Theta,\Xi$ two convex functionals on $E$ with values in $\mathbb{R}\cup\{\infty\}$. Let $\Theta^*,\Xi^*$ be the Legendre-Fenchel transforms of $\Theta,\Xi$ respectively. Assume that there exists $z_0\in E$ such that

\Theta(z_0)<\infty,\quad\Xi(z_0)<\infty

and that $\Theta$ is continuous at $z_0$. Then

\inf_{z\in E}[\Theta(z)+\Xi(z)]=\sup_{z^*\in E^*}[-\Theta^*(z^*)-\Xi^*(-z^*)] (23)

and furthermore, the supremum on the right hand side is attained.
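As a one-dimensional sanity check of Theorem 10 (an illustration we include here, not an example from the paper), take $E=\mathbb{R}$, $\Theta(z)=|z|$, and $\Xi(z)=0$ if $z=c$ and $+\infty$ otherwise. Then $\Theta^*(z^*)=0$ if $|z^*|\leq 1$ and $+\infty$ otherwise, while $\Xi^*(z^*)=cz^*$, so

\inf_{z\in E}[\Theta(z)+\Xi(z)]=|c|=\sup_{|z^*|\leq 1}cz^*=\sup_{z^*\in E^*}[-\Theta^*(z^*)-\Xi^*(-z^*)],

with the supremum attained at $z^*=\operatorname{sign}(c)$.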

Let $\mathcal{M}(X)$ be the set of finite signed Borel measures on a space $X$, and recall that $C_b(X)$ is the set of bounded continuous functions on the space $X$.

Theorem 11 (Riesz Representation Theorem)

Let $K$ be any compact subset of $\mathbb{R}^d$. Then the dual of $C_b(K)$ is $\mathcal{M}(K)$.

See Theorem 1.9 of (Villani, 2003) and result 7.17 of (Folland, 1999) for more details.

Notice that in the Fenchel-Rockafellar theorem, the left-hand side of (23) is convex while the right-hand side is concave. However, when $\phi$ is non-convex, $R_\phi^\epsilon$ is not convex. In order to apply the Fenchel-Rockafellar theorem, we relax the primal $R_\phi^\epsilon$ to a convex functional $\Theta$.

We define $\Theta$ as

\Theta(h_0,h_1)=\int S_\epsilon(h_1)\,d{\mathbb{P}}_1+\int S_\epsilon(h_0)\,d{\mathbb{P}}_0 (24)

which is convex in $h_0,h_1$ due to the sub-additivity of the supremum operation. Notice that one obtains $\Theta$ from $R_\phi^\epsilon$ by replacing $\phi\circ f$ with $h_1$ and $\phi\circ -f$ with $h_0$.

The functional $\Xi$ will be chosen to restrict $h_0,h_1$ in the hope that at the optimal value, $h_1=\phi(f)$ and $h_0=\phi(-f)$ for some $f$. Notice that if $h_1=\phi(f)$ and $h_0=\phi(-f)$, then for all $\eta\in[0,1]$,

\eta h_1({\mathbf{x}})+(1-\eta)h_0({\mathbf{x}})=\eta\phi(f({\mathbf{x}}))+(1-\eta)\phi(-f({\mathbf{x}}))\geq C_\phi^*(\eta).

Thus we will optimize $\Theta$ over the set of functions $S_\phi$ defined by

S_\phi=\left\{(h_0,h_1)\colon h_0,h_1\colon K^\epsilon\to\overline{\mathbb{R}}\text{ Borel},\ h_0,h_1\geq 0,\text{ and for all }{\mathbf{x}}\in K^\epsilon,\ \eta\in[0,1],\ \eta h_1({\mathbf{x}})+(1-\eta)h_0({\mathbf{x}})\geq C_\phi^*(\eta)\right\} (25)

where $K=\operatorname{supp}({\mathbb{P}}_0+{\mathbb{P}}_1)$. (Notice that the definition of $S_\epsilon(g)$ in (11) requires that the domain of $g$ include $\overline{B_\epsilon({\mathbf{x}})}$. Thus, in order to define the integral $\int S_\epsilon(h)\,d{\mathbb{Q}}$, the domain of $h$ must include $(\operatorname{supp}{\mathbb{Q}})^\epsilon$.) Thus we define $\Xi$ as

\Xi(h_0,h_1)=\begin{cases}0&\text{if }(h_0,h_1)\in S_\phi\\ +\infty&\text{otherwise}\end{cases} (26)

The following result expresses $\bar{R}_\phi$ as an infimum of linear functionals continuous with respect to the weak topology on probability measures. This lemma will assist in the computation of $\Xi^*$. In the remainder of this section, $\mathcal{M}_+(S)$ will denote the set of positive finite Borel measures on a set $S$.

Lemma 12

Let $K\subset\mathbb{R}^d$ be compact, $E=C_b(K^\epsilon)\times C_b(K^\epsilon)$, and ${\mathbb{P}}_0',{\mathbb{P}}_1'\in\mathcal{M}_+(K^\epsilon)$. Then

\inf_{(h_0,h_1)\in S_\phi\cap E}\int h_1\,d{\mathbb{P}}_1'+\int h_0\,d{\mathbb{P}}_0'=\bar{R}_\phi({\mathbb{P}}_0',{\mathbb{P}}_1') (27)

Therefore, $\bar{R}_\phi$ is concave and upper semi-continuous on $\mathcal{M}_+(K^\epsilon)\times\mathcal{M}_+(K^\epsilon)$ with respect to the weak topology on probability measures.

We sketch the proof and formally fill in the details in Appendix D. Let ${\mathbb{P}}'={\mathbb{P}}_0'+{\mathbb{P}}_1'$ and $\eta'=d{\mathbb{P}}_1'/d{\mathbb{P}}'$. Then

\int h_1\,d{\mathbb{P}}_1'+\int h_0\,d{\mathbb{P}}_0'=\int\eta'h_1+(1-\eta')h_0\,d{\mathbb{P}}'

Clearly, the inequality $\geq$ holds in (27) because $\eta'h_1+(1-\eta')h_0\geq C_\phi^*(\eta')$ for all $(h_0,h_1)\in S_\phi$. Equality is achieved at $h_1=\phi(\alpha_\phi(\eta'))$, $h_0=\phi(-\alpha_\phi(\eta'))$, with $\alpha_\phi$ as in (8). However, these functions may not be continuous. In Appendix D, we show that $h_0,h_1$ can be approximated arbitrarily well by elements of $S_\phi\cap E$.

Lemma 13

Let $\phi$ be a non-increasing, lower semi-continuous loss function and let ${\mathbb{P}}_0,{\mathbb{P}}_1$ be compactly supported finite Borel measures. Let $S_\phi$ be as in (25).

Then

\inf_{(h_0,h_1)\in S_\phi}\Theta(h_0,h_1)=\sup_{{\mathbb{P}}_0'\in{\mathcal{B}}^\infty_\epsilon({\mathbb{P}}_0),\,{\mathbb{P}}_1'\in{\mathcal{B}}^\infty_\epsilon({\mathbb{P}}_1)}\bar{R}_\phi({\mathbb{P}}_0',{\mathbb{P}}_1') (28)

Furthermore, there exist ${\mathbb{P}}_0^*,{\mathbb{P}}_1^*$ which attain the supremum.

Proof  We will show a version of (28) with the infimum taken over $S_\phi\cap E$, and then argue that the same claim holds when the infimum is taken over $S_\phi$.

Notice that if $h_0,h_1$ are continuous, then $S_\epsilon(h_0),S_\epsilon(h_1)$ are also continuous, and $\int S_\epsilon(h_0)\,d{\mathbb{Q}}$, $\int S_\epsilon(h_1)\,d{\mathbb{Q}}$ are well-defined for every Borel ${\mathbb{Q}}$. Hence we may work with the Borel measures ${\mathbb{P}}_0,{\mathbb{P}}_1$ rather than their completions.

Let $K=\operatorname{supp}({\mathbb{P}}_0+{\mathbb{P}}_1)$. We will apply the Fenchel-Rockafellar duality theorem to the functionals given by (24) and (26) on the vector space $E=C_b(K^\epsilon)\times C_b(K^\epsilon)$ equipped with the sup norm. By the Riesz representation theorem, the dual of the space $E$ is $E^*=\mathcal{M}(K^\epsilon)\times\mathcal{M}(K^\epsilon)$.

To start, we argue that the Fenchel-Rockafellar duality theorem applies to these functionals. First, notice that if $(h_0,h_1)\in E$, then both $h_0,h_1$ are bounded, so $\Theta(h_0,h_1)<\infty$. Furthermore, $\Theta$ is convex due to the subadditivity of the supremum, and $\Xi$ is convex because the constraint $\eta h_1({\mathbf{x}})+(1-\eta)h_0({\mathbf{x}})\geq C_\phi^*(\eta)$ is linear in $h_0$ and $h_1$. Furthermore, $\Theta$ is continuous on $E$ because $\Theta$ is convex and bounded and $E$ is open; see Theorem 2.14 of (Barbu and Precupanu, 2012).

Because the constant function $(C_\phi^*(1/2),C_\phi^*(1/2))$ is in $S_\phi$, $\Xi$ is not identically $\infty$, and therefore the Fenchel-Rockafellar theorem applies.

Clearly $\inf_E\Theta(h_0,h_1)+\Xi(h_0,h_1)$ reduces to the left-hand side of (28).

We now compute the dual of $\Xi$. Lemma 12 implies that

-\Xi^*(-{\mathbb{P}}_0',-{\mathbb{P}}_1')=-\sup_{(h_0,h_1)\in S_\phi\cap E}\left(-\int h_0\,d{\mathbb{P}}_0'-\int h_1\,d{\mathbb{P}}_1'\right)=\begin{cases}\bar{R}_\phi({\mathbb{P}}_0',{\mathbb{P}}_1')&\text{if }{\mathbb{P}}_i'\geq 0\\ -\infty&\text{otherwise}\end{cases} (29)

This computation implies that the term $-\Xi^*(-{\mathbb{P}}_0',-{\mathbb{P}}_1')$ appearing in the Fenchel-Rockafellar theorem is not $-\infty$ iff ${\mathbb{P}}_0',{\mathbb{P}}_1'$ are positive measures. Next, notice that because $\Theta(h_0,h_1)<+\infty$ for all $(h_0,h_1)\in E$, $-\Theta^*({\mathbb{P}}_0',{\mathbb{P}}_1')$ is never $+\infty$. Therefore, it suffices to compute $\Theta^*$ for positive measures ${\mathbb{P}}_0',{\mathbb{P}}_1'$. Lemma 4 implies that for positive measures ${\mathbb{P}}_0',{\mathbb{P}}_1'$,

\Theta^*({\mathbb{P}}_0',{\mathbb{P}}_1')=\sup_{h_0,h_1\in C_b(K^\epsilon)}\int h_1\,d{\mathbb{P}}_1'+\int h_0\,d{\mathbb{P}}_0'-\left(\int S_\epsilon(h_0)\,d{\mathbb{P}}_0+\int S_\epsilon(h_1)\,d{\mathbb{P}}_1\right)
=\sup_{h_1\in C_b(K^\epsilon)}\left(\int h_1\,d{\mathbb{P}}_1'-\int S_\epsilon(h_1)\,d{\mathbb{P}}_1\right)+\sup_{h_0\in C_b(K^\epsilon)}\left(\int h_0\,d{\mathbb{P}}_0'-\int S_\epsilon(h_0)\,d{\mathbb{P}}_0\right)
=\begin{cases}0&\text{if }W_\infty({\mathbb{P}}_0',{\mathbb{P}}_0)\leq\epsilon\text{ and }W_\infty({\mathbb{P}}_1',{\mathbb{P}}_1)\leq\epsilon\\ +\infty&\text{if }W_\infty({\mathbb{P}}_0',{\mathbb{P}}_0)>\epsilon\text{ or }W_\infty({\mathbb{P}}_1',{\mathbb{P}}_1)>\epsilon\end{cases}

Therefore

\sup_{{\mathbb{P}}_0',{\mathbb{P}}_1'\in\mathcal{M}(K^\epsilon)}-\Theta^*({\mathbb{P}}_0',{\mathbb{P}}_1')-\Xi^*(-{\mathbb{P}}_0',-{\mathbb{P}}_1')=\sup_{{\mathbb{P}}_0'\in{\mathcal{B}}^\infty_\epsilon({\mathbb{P}}_0),\,{\mathbb{P}}_1'\in{\mathcal{B}}^\infty_\epsilon({\mathbb{P}}_1)}\bar{R}_\phi({\mathbb{P}}_0',{\mathbb{P}}_1')

and furthermore there exist measures ${\mathbb{P}}_0^*,{\mathbb{P}}_1^*$ maximizing the dual problem. Therefore the Fenchel-Rockafellar theorem implies that

\inf_{(h_0,h_1)\in S_\phi}\Theta(h_0,h_1)\leq\inf_{(h_0,h_1)\in S_\phi\cap E}\Theta(h_0,h_1)=\sup_{{\mathbb{P}}_0'\in{\mathcal{B}}^\infty_\epsilon({\mathbb{P}}_0),\,{\mathbb{P}}_1'\in{\mathcal{B}}^\infty_\epsilon({\mathbb{P}}_1)}\bar{R}_\phi({\mathbb{P}}_0',{\mathbb{P}}_1')

The opposite inequality follows from the weak duality argument presented in (16) in Section 4.1. See Lemma 45 of Appendix E for a full proof.  
Note that this proof does not easily extend to an unbounded domain $X$: for a non-compact space, the dual of $C_b(X)$ is much larger than $\mathcal{M}(X)$, and thus a naive application of the Fenchel-Rockafellar theorem would result in a different right-hand side than (28). On the other hand, the Riesz representation theorem for an unbounded domain $X$ states that the dual of $C_0(X)$ is $\mathcal{M}(X)$, where $C_0(X)$ is the set of continuous functions vanishing at $\infty$. At the same time, if $h_0,h_1\in C_0(X)$, then $\eta h_1({\mathbf{x}})+(1-\eta)h_0({\mathbf{x}})$ becomes arbitrarily small for large ${\mathbf{x}}$, so the constraint $\eta h_1({\mathbf{x}})+(1-\eta)h_0({\mathbf{x}})\geq C_\phi^*(\eta)$ cannot hold for all $\eta$. Thus if $K$ is unbounded, $S_\phi\cap C_0(X)=\emptyset$ and the functional $\Xi$ would be $+\infty$ everywhere on $C_0(X)$, precluding an application of the Fenchel-Rockafellar theorem.

However, Lemma 13 can be extended to distributions with arbitrary support via a simple approximation argument. By Lemma 13, the strong duality result holds for the distributions defined by ${\mathbb{P}}_0^n={\mathbb{P}}_0\big|_{\overline{B_n({\mathbf{0}})}}$ and ${\mathbb{P}}_1^n={\mathbb{P}}_1\big|_{\overline{B_n({\mathbf{0}})}}$. One then shows strong duality by computing the limit of the primal and dual problems as $n\to\infty$. We therefore obtain the following lemma, which is proved formally in Appendix E.

Lemma 14

Let $\phi$ be a non-increasing, lower semi-continuous loss function and let ${\mathbb{P}}_0,{\mathbb{P}}_1$ be finite Borel measures supported on $\mathbb{R}^d$. Let $S_\phi$ be as in (25). Then

\inf_{(h_0,h_1)\in S_\phi}\Theta(h_0,h_1)=\sup_{{\mathbb{P}}_0'\in{\mathcal{B}}^\infty_\epsilon({\mathbb{P}}_0),\,{\mathbb{P}}_1'\in{\mathcal{B}}^\infty_\epsilon({\mathbb{P}}_1)}\bar{R}_\phi({\mathbb{P}}_0',{\mathbb{P}}_1')

Furthermore, there exist 0,1{\mathbb{P}}_{0}^{*},{\mathbb{P}}_{1}^{*} which attain the supremum.

5.2 Complementary Slackness

Using a standard argument, strong duality (Lemma 14) allows us to prove a version of the complementary slackness theorem.

Lemma 15

Assume that ${\mathbb{P}}_0,{\mathbb{P}}_1$ are compactly supported. The functions $h_0^*,h_1^*$ minimize $\Theta$ over $S_\phi$ and $({\mathbb{P}}_0^*,{\mathbb{P}}_1^*)$ maximizes $\bar{R}_\phi$ over ${\mathcal{B}}^\infty_\epsilon({\mathbb{P}}_0)\times{\mathcal{B}}^\infty_\epsilon({\mathbb{P}}_1)$ iff the following hold:

  1) \int h_1^*\,d{\mathbb{P}}_1^*=\int S_\epsilon(h_1^*)\,d{\mathbb{P}}_1\quad\text{and}\quad\int h_0^*\,d{\mathbb{P}}_0^*=\int S_\epsilon(h_0^*)\,d{\mathbb{P}}_0 (30)

  2) If we define ${\mathbb{P}}^*={\mathbb{P}}_0^*+{\mathbb{P}}_1^*$ and $\eta^*=d{\mathbb{P}}_1^*/d{\mathbb{P}}^*$, then

     \eta^*({\mathbf{x}})h_1^*({\mathbf{x}})+(1-\eta^*({\mathbf{x}}))h_0^*({\mathbf{x}})=C_\phi^*(\eta^*({\mathbf{x}}))\quad{\mathbb{P}}^*\text{-a.e.} (31)

This lemma is proved in Appendix F. Theorem 7 will later follow from this result.

To show that Lemma 15 is non-vacuous, one must prove that there exist minimizers of $\Theta$ over $S_\phi$, which we delay to Sections 6 and 7. Notice that the application of the Fenchel-Rockafellar theorem in Lemma 13 actually implies the existence of dual maximizers in the case of compactly supported ${\mathbb{P}}_0,{\mathbb{P}}_1$.

In fact, the complementary slackness conditions hold approximately for any maximizer of $\bar{R}_\phi$ and any minimizing sequence for $\Theta$. This result is essential for proving the existence of minimizers to $\Theta$.

Lemma 16

Let (h0n,h1n)(h_{0}^{n},h_{1}^{n}) be a minimizing sequence for Θ\Theta over SϕS_{\phi}: limnΘ(h0n,h1n)=inf(h0,h1)SϕΘ(h0,h1)\lim_{n\to\infty}\Theta(h_{0}^{n},h_{1}^{n})=\inf_{(h_{0},h_{1})\in S_{\phi}}\Theta(h_{0},h_{1}). Then for any maximizer of the dual problem (0,1)({\mathbb{P}}_{0}^{*},{\mathbb{P}}_{1}^{*}), the following hold:

  1)
    limnSϵ(h0n)𝑑0h0n𝑑0=0,limnSϵ(h1n)𝑑1h1n𝑑1=0\lim_{n\to\infty}\int S_{\epsilon}(h_{0}^{n})d{\mathbb{P}}_{0}-\int h_{0}^{n}d{\mathbb{P}}_{0}^{*}=0,\quad\lim_{n\to\infty}\int S_{\epsilon}(h_{1}^{n})d{\mathbb{P}}_{1}-\int h_{1}^{n}d{\mathbb{P}}_{1}^{*}=0 (32)
  2)

    If we define ℙ^* = ℙ_0^* + ℙ_1^* and η^* = dℙ_1^*/dℙ^*, then

    limnηh1n+(1η)h0nCϕ(η)d=0\lim_{n\to\infty}\int\eta^{*}h_{1}^{n}+(1-\eta^{*})h_{0}^{n}-C_{\phi}^{*}(\eta^{*})d{\mathbb{P}}^{*}=0 (33)

Proof  Let

m=inf(h0,h1)SϕΘ(h0,h1).m=\inf_{(h_{0},h_{1})\in S_{\phi}}\Theta(h_{0},h_{1}).

Then the fact that (h_0^n, h_1^n) ∈ S_ϕ together with the duality result (Lemma 14) implies

h1n𝑑1+h0n𝑑0=ηh1n+(1η)h0ndCϕ(η)𝑑=m\int h_{1}^{n}d{\mathbb{P}}_{1}^{*}+\int h_{0}^{n}d{\mathbb{P}}_{0}^{*}=\int\eta^{*}h_{1}^{n}+(1-\eta^{*})h_{0}^{n}d{\mathbb{P}}^{*}\geq\int C_{\phi}^{*}(\eta^{*})d{\mathbb{P}}^{*}=m (34)

Now pick δ>0\delta>0 and an NN for which nNn\geq N implies that Θ(h0n,h1n)m+δ\Theta(h_{0}^{n},h_{1}^{n})\leq m+\delta. Then

m+δSϵ(h1n)𝑑1+Sϵ(h0n)𝑑0ηh1n+(1η)h0ndm.m+\delta\geq\int S_{\epsilon}(h_{1}^{n})d{\mathbb{P}}_{1}+\int S_{\epsilon}(h_{0}^{n})d{\mathbb{P}}_{0}\geq\int\eta^{*}h_{1}^{n}+(1-\eta^{*})h_{0}^{n}d{\mathbb{P}}^{*}\geq m.

Subtracting m=Cϕ(η)𝑑m=\int C_{\phi}^{*}(\eta^{*})d{\mathbb{P}}^{*} from this inequality results in

δηh1n+(1η)h0ndCϕ(η)𝑑0\delta\geq\int\eta^{*}h_{1}^{n}+(1-\eta^{*})h_{0}^{n}d{\mathbb{P}}^{*}-\int C_{\phi}^{*}(\eta^{*})d{\mathbb{P}}^{*}\geq 0 (35)

which implies (33). Next, (34) further implies

m-\left(\int h_{1}^{n}d{\mathbb{P}}_{1}^{*}+\int h_{0}^{n}d{\mathbb{P}}_{0}^{*}\right)\leq 0 (36)

Now this inequality implies

\delta\geq\delta+m-\left(\int h_{1}^{n}d{\mathbb{P}}_{1}^{*}+\int h_{0}^{n}d{\mathbb{P}}_{0}^{*}\right)\geq\Theta(h_{0}^{n},h_{1}^{n})-\left(\int h_{1}^{n}d{\mathbb{P}}_{1}^{*}+\int h_{0}^{n}d{\mathbb{P}}_{0}^{*}\right)
\geq\left(\int S_{\epsilon}(h_{1}^{n})d{\mathbb{P}}_{1}+\int S_{\epsilon}(h_{0}^{n})d{\mathbb{P}}_{0}\right)-\left(\int h_{1}^{n}d{\mathbb{P}}_{1}^{*}+\int h_{0}^{n}d{\mathbb{P}}_{0}^{*}\right)

However, Lemma 3 implies that both ∫S_ϵ(h_1^n)dℙ_1 − ∫h_1^n dℙ_1^* and ∫S_ϵ(h_0^n)dℙ_0 − ∫h_0^n dℙ_0^* are nonnegative. Therefore, we have shown that for any δ > 0, there is an N for which n ≥ N implies that

δ>Sϵ(h1n)𝑑1h1n𝑑10andδ>Sϵ(h0n)𝑑0h0n𝑑00\delta>\int S_{\epsilon}(h_{1}^{n})d{\mathbb{P}}_{1}-\int h_{1}^{n}d{\mathbb{P}}_{1}^{*}\geq 0\quad\text{and}\quad\delta>\int S_{\epsilon}(h_{0}^{n})d{\mathbb{P}}_{0}-\int h_{0}^{n}d{\mathbb{P}}_{0}^{*}\geq 0

which implies (32).

 
An analogous approximate complementary slackness result typically holds in other applications of the Fenchel-Rockafellar theorem. Consider a convex optimization problem that can be written as inf_x Θ(x) + Ξ(x) in such a way that the Fenchel-Rockafellar theorem applies and for which Ξ and Θ^* are indicator functions of the convex sets C_P, C_D respectively. Then the Fenchel-Rockafellar Theorem states that

\inf_{x\in C_{P}}\Theta(x)=\inf_{x\in C_{P}}\sup_{y\in C_{D}}\langle y,x\rangle=\sup_{y\in C_{D}}\inf_{x\in C_{P}}\langle y,x\rangle=\sup_{y\in C_{D}}\Xi^{*}(y) (37)

Let yy^{*} be a maximizer of the dual problem and let mm be the minimal value of Θ\Theta over CPC_{P}. If xkx_{k} is a minimizing sequence of Θ\Theta, then for δ>0\delta>0 and sufficiently large kk, δ+m>Θ(xk)\delta+m>\Theta(x_{k}) and thus by (37),

m+\delta>\Theta(x_{k})=\sup_{y\in C_{D}}\langle y,x_{k}\rangle\geq\langle y^{*},x_{k}\rangle\geq\inf_{x\in C_{P}}\langle y^{*},x\rangle=\Xi^{*}(y^{*})=m (38)

and therefore lim_{k→∞}⟨y^*, x_k⟩ = m. Condition (33) is this statement adapted to the adversarial learning problem. Furthermore, subtracting Θ(x_k) from (38) and taking the limit k → ∞ results in lim_{k→∞} Θ(x_k) − ⟨y^*, x_k⟩ = 0. In our adversarial learning scenario, this statement is equivalent to the conditions in (32) due to Lemma 3. Furthermore, just like the standard complementary slackness theorems, the relations lim_{k→∞}⟨y^*, x_k⟩ = m and lim_{k→∞} Θ(x_k) − ⟨y^*, x_k⟩ = 0 imply that x_k is a minimizing sequence for Θ.
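To make this abstract discussion concrete, consider the following toy finite-dimensional sketch (our illustration only; the sets C_P, C_D and the sequence x_k below are hypothetical choices, not taken from this paper). It instantiates Θ(x) = sup_{y∈C_D}⟨y, x⟩ with C_D the Euclidean unit ball and C_P the box [1,2]^d, and numerically checks the two limits surrounding (38).

import numpy as np

d = 3
# C_P = [1,2]^d (a box), C_D = the Euclidean unit ball, so Theta(x) = ||x||_2.
# The primal minimum over C_P is attained at x* = (1,...,1) with value m = sqrt(d);
# y* = x*/sqrt(d) attains the dual supremum since inf_{x in C_P} <y*, x> = sqrt(d).
x_star = np.ones(d)
m = np.sqrt(d)
y_star = x_star / m

rng = np.random.default_rng(1)
for k in [1, 10, 100, 1000]:
    # a minimizing sequence: small random perturbations of x*, kept inside C_P
    x_k = np.clip(x_star + rng.uniform(0.0, 1.0 / k, size=d), 1.0, 2.0)
    # <y*, x_k> -> m and Theta(x_k) - <y*, x_k> -> 0, mirroring (38)
    print(k, np.dot(y_star, x_k) - m, np.linalg.norm(x_k) - np.dot(y_star, x_k))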

In the classical proof of the Kantorovich duality, one can choose Θ, Ξ of a form similar to the discussion above; see for instance Theorem 1.3 of Villani (2003). Using an argument similar to (38), one can prove approximate complementary slackness for the Kantorovich problem, called the quantitative Knott-Smith criteria; see Theorems 2.15 and 2.16 of Villani (2003) for further discussion.

6 Existence of minimizers of Θ\Theta over SψS_{\psi}

After proving the existence of maximizers to the dual problem, we can now use the approximate complimentary slackness conditions to prove the existence of minimizers to the primal. The exponential loss ψ\psi has certain properties that make it particularly easy to study:

Lemma 17

Let ψ(α) = e^{−α}. Then C_ψ^*(η) = 2\sqrt{η(1−η)} and α_ψ(η) = \tfrac{1}{2}\log(η/(1−η)) is the unique minimizer of C_ψ(η, ·), with α_ψ(0), α_ψ(1) interpreted as −∞, +∞ respectively. Furthermore, ∂C_ψ^*(η) is the singleton ∂C_ψ^*(η) = {ψ(α_ψ(η)) − ψ(−α_ψ(η))}.
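Although the formal proof is deferred to Appendix G.1, the formulas in Lemma 17 follow from elementary calculus; the following worked computation (for 0 < η < 1) is included only as a sanity check.

\begin{align*}
C_\psi(\eta,\alpha) &= \eta e^{-\alpha}+(1-\eta)e^{\alpha},\qquad \partial_\alpha C_\psi(\eta,\alpha)=-\eta e^{-\alpha}+(1-\eta)e^{\alpha}=0 \iff \alpha=\tfrac{1}{2}\log\tfrac{\eta}{1-\eta},\\
C_\psi^*(\eta) &= \eta\sqrt{\tfrac{1-\eta}{\eta}}+(1-\eta)\sqrt{\tfrac{\eta}{1-\eta}}=2\sqrt{\eta(1-\eta)},\\
\psi(\alpha_\psi(\eta))-\psi(-\alpha_\psi(\eta)) &= \sqrt{\tfrac{1-\eta}{\eta}}-\sqrt{\tfrac{\eta}{1-\eta}}=\frac{1-2\eta}{\sqrt{\eta(1-\eta)}}=(C_\psi^*)'(\eta).
\end{align*}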

See Appendix G.1 for a formal proof. The existence of minimizers of Θ for the exponential loss then follows from properties of C_ψ. Let (h_0^n, h_1^n) be a minimizing sequence of Θ. Because the function C_ψ^* is strictly concave, one can use the condition (33) to show that there is a subsequence {n_k} along which h_0^{n_k}(x′), h_1^{n_k}(x′) converge ℙ_0^*, ℙ_1^*-a.e. respectively. Due to (32), S_ϵ(h_0^{n_k})(x), S_ϵ(h_1^{n_k})(x) also converge ℙ_0, ℙ_1-a.e. respectively along this subsequence. This observation suffices to show the existence of functions that minimize Θ over S_ψ.

The first step of this proof is to formalize this argument for sequences in ¯\overline{\mathbb{R}}.

Lemma 18

Let (an,bn)(a_{n},b_{n}) be a sequence for which an,bn0a_{n},b_{n}\geq 0 and

ηan+(1η)bnCψ(η) for all η[0,1]\eta a_{n}+(1-\eta)b_{n}\geq C_{\psi}^{*}(\eta)\text{ for all }\eta\in[0,1] (39)

and

limnη0an+(1η0)bn=Cψ(η0)\lim_{n\to\infty}\eta_{0}a_{n}+(1-\eta_{0})b_{n}=C_{\psi}^{*}(\eta_{0}) (40)

for some η0\eta_{0}. Then limnan=ψ(αψ(η0))\lim_{n\to\infty}a_{n}=\psi(\alpha_{\psi}(\eta_{0})) and limnbn=ψ(αψ(η0))\lim_{n\to\infty}b_{n}=\psi(-\alpha_{\psi}(\eta_{0})).

Notice that if ηa + (1−η)b ≥ C_ψ^*(η) for all η and η_0 a + (1−η_0)b = C_ψ^*(η_0), then this lemma (applied to constant sequences) implies that a = ψ(α_ψ(η_0)) and b = ψ(−α_ψ(η_0)).

To prove Lemma 18, we show that all convergent subsequences of {a_n} and {b_n} must converge to limits a and b that satisfy η_0 a + (1−η_0)b = C_ψ^*(η_0) and a − b ∈ ∂C_ψ^*(η_0). As the set ∂C_ψ^*(η_0) is a singleton, the values a = ψ(α_ψ(η_0)) and b = ψ(−α_ψ(η_0)) uniquely solve these equations for a and b. Therefore the sequences {a_n} and {b_n} must converge to a and b as well. See Appendix G.2 for a formal proof. This result applied to a minimizing sequence of Θ allows one to find a subsequence with certain convergence properties.

Lemma 19

Let (h0n,h1n)(h_{0}^{n},h_{1}^{n}) be a minimizing sequence of Θ\Theta over SψS_{\psi}. Then there exists a subsequence nkn_{k} for which Sϵ(h1nk)S_{\epsilon}(h_{1}^{n_{k}}), Sϵ(h0nk)S_{\epsilon}(h_{0}^{n_{k}}) converge 1{\mathbb{P}}_{1}, 0{\mathbb{P}}_{0}-a.e. respectively.

Proof  Let ℙ_0^*, ℙ_1^* be maximizers of the dual problem. Let γ_i be a coupling between ℙ_i, ℙ_i^* with supp γ_i ⊂ Δ_ϵ.

Then (33) of Lemma 16 implies that

\lim_{n\to\infty}\int\eta^{*}({\mathbf{x}}^{\prime})h_{1}^{n}({\mathbf{x}}^{\prime})+(1-\eta^{*}({\mathbf{x}}^{\prime}))h_{0}^{n}({\mathbf{x}}^{\prime})-C_{\psi}^{*}(\eta^{*}({\mathbf{x}}^{\prime}))d(\gamma_{1}+\gamma_{0})({\mathbf{x}},{\mathbf{x}}^{\prime})=0

and (32) implies that

limnSϵ(h1n)(𝐱)h1n(𝐱)dγ1(𝐱,𝐱)=0,limnSϵ(h0n)(𝐱)h0n(𝐱)dγ0(𝐱,𝐱)=0\lim_{n\to\infty}\int S_{\epsilon}(h_{1}^{n})({\mathbf{x}})-h_{1}^{n}({\mathbf{x}}^{\prime})d\gamma_{1}({\mathbf{x}},{\mathbf{x}}^{\prime})=0,\quad\lim_{n\to\infty}\int S_{\epsilon}(h_{0}^{n})({\mathbf{x}})-h_{0}^{n}({\mathbf{x}}^{\prime})d\gamma_{0}({\mathbf{x}},{\mathbf{x}}^{\prime})=0

Recall that on a finite measure space, L^1 convergence implies a.e. convergence along a subsequence (see Corollary 2.32 of (Folland, 1999)). Thus one can pick a subsequence n_k along which

\lim_{k\to\infty}\eta^{*}({\mathbf{x}}^{\prime})h_{1}^{n_{k}}({\mathbf{x}}^{\prime})+(1-\eta^{*}({\mathbf{x}}^{\prime}))h_{0}^{n_{k}}({\mathbf{x}}^{\prime})-C_{\psi}^{*}(\eta^{*}({\mathbf{x}}^{\prime}))=0 (41)

γ1+γ0\gamma_{1}+\gamma_{0}-a.e. and

limkSϵ(h1nk)(𝐱)h1nk(𝐱)=0,limkSϵ(h0nk)(𝐱)h0nk(𝐱)=0\lim_{k\to\infty}S_{\epsilon}(h_{1}^{n_{k}})({\mathbf{x}})-h_{1}^{n_{k}}({\mathbf{x}}^{\prime})=0,\quad\lim_{k\to\infty}S_{\epsilon}(h_{0}^{n_{k}})({\mathbf{x}})-h_{0}^{n_{k}}({\mathbf{x}}^{\prime})=0 (42)

γ1\gamma_{1}, γ0\gamma_{0}-a.e. respectively.

Furthermore, ηh_1^n + (1−η)h_0^n ≥ C_ψ^*(η) for all η ∈ [0,1]. Thus (41) and Lemma 18 imply that h_1^{n_k}(x′) converges to ψ(α_ψ(η^*(x′))) and h_0^{n_k}(x′) converges to ψ(−α_ψ(η^*(x′))) γ_0 + γ_1-a.e. Equation 42 then implies that S_ϵ(h_1^{n_k})(x), S_ϵ(h_0^{n_k})(x) converge γ_1, γ_0-a.e. respectively. Because ℙ_1, ℙ_0 are the first marginals of γ_1, γ_0, this statement implies the result.  

The existence of a minimizer then follows from the fact that Sϵ(h1nk)S_{\epsilon}(h_{1}^{n_{k}}) converges. The next lemma describes how the SϵS_{\epsilon} operation interacts with lim inf\liminfs and lim sup\limsups.

Lemma 20

Let hnh_{n} be any sequence of functions. Then the sequence hnh_{n} satisfies

lim infnSϵ(hn)Sϵ(lim infnhn)\liminf_{n\to\infty}S_{\epsilon}(h_{n})\geq S_{\epsilon}(\liminf_{n\to\infty}h_{n}) (43)

and

lim supnSϵ(hn)Sϵ(lim supnhn)\limsup_{n\to\infty}S_{\epsilon}(h_{n})\geq S_{\epsilon}(\limsup_{n\to\infty}h_{n}) (44)

See Appendix G.3 for a proof.
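The inequalities in Lemma 20 can be strict. The following small numerical sketch (our illustration; the alternating sequence h_n below is a hypothetical example chosen so that liminf_n h_n is a pointwise minimum, and the helper name S_eps is ours) checks (43) on a grid and exhibits a strict gap.

import numpy as np

xs = np.linspace(-3.0, 3.0, 601)
eps = 0.5
k = int(round(eps / (xs[1] - xs[0])))

def S_eps(g):
    # discrete sup of g over the closed eps-ball around each grid point
    p = np.pad(g, k, mode="edge")
    return np.array([p[i:i + 2 * k + 1].max() for i in range(len(xs))])

# h_n alternates between two bumps, so liminf_n h_n is their pointwise minimum
h_even = np.exp(-(xs - 1.0) ** 2)
h_odd = np.exp(-(xs + 1.0) ** 2)
liminf_h = np.minimum(h_even, h_odd)
liminf_S = np.minimum(S_eps(h_even), S_eps(h_odd))

print((liminf_S >= S_eps(liminf_h) - 1e-12).all())  # True: inequality (43)
print(np.max(liminf_S - S_eps(liminf_h)))           # > 0: the inequality can be strict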

Finally, we prove that there exists a minimizer to Θ\Theta over SψS_{\psi}.

Lemma 21

There exists a minimizer (h0,h1)(h_{0}^{*},h_{1}^{*}) to Θ\Theta over the set SψS_{\psi}.

Proof  Let (h0n,h1n)(h_{0}^{n},h_{1}^{n}) be a sequence minimizing Θ\Theta over SψS_{\psi}.

Lemma 19 implies that there is a subsequence {nk}\{n_{k}\} for which limkSϵ(h0nk)\lim_{k\to\infty}S_{\epsilon}(h_{0}^{n_{k}}) exists 0{\mathbb{P}}_{0}-a.e.

Thus

lim supkSϵ(h0nk)=lim infkSϵ(h0nk)0-a.e.\limsup_{k\to\infty}S_{\epsilon}(h_{0}^{n_{k}})=\liminf_{k\to\infty}S_{\epsilon}(h_{0}^{n_{k}})\quad{\mathbb{P}}_{0}\text{-a.e.} (45)

Next, we will argue that the pair (lim supkh0nk,lim infkh1nk)(\limsup_{k}h_{0}^{n_{k}},\liminf_{k}h_{1}^{n_{k}}) is in SψS_{\psi}. Because

Cψ(η)ηh1nk+(1η)h0nk,C_{\psi}^{*}(\eta)\leq\eta h_{1}^{n_{k}}+(1-\eta)h_{0}^{n_{k}},

one can conclude that

C_{\psi}^{*}(\eta)\leq\liminf_{k\to\infty}\left(\eta h_{1}^{n_{k}}+(1-\eta)h_{0}^{n_{k}}\right)\leq\eta\liminf_{k\to\infty}h_{1}^{n_{k}}+(1-\eta)\limsup_{k\to\infty}h_{0}^{n_{k}}.

Define

h1=lim infkh1nk,h0=lim supkh0nkh_{1}^{*}=\liminf_{k}h_{1}^{n_{k}},\quad h_{0}^{*}=\limsup_{k}h_{0}^{n_{k}}

Now Fatou’s lemma, Lemma 20, and Equation 45 imply that

limkΘ(h0nk,h1nk)lim infkSϵ(h1nk)d1+lim infkSϵ(h0nk)d0\displaystyle\lim_{k\to\infty}\Theta(h_{0}^{n_{k}},h_{1}^{n_{k}})\geq\int\liminf_{k\to\infty}S_{\epsilon}(h_{1}^{n_{k}})d{\mathbb{P}}_{1}+\int\liminf_{k\to\infty}S_{\epsilon}(h_{0}^{n_{k}})d{\mathbb{P}}_{0} (Fatou’s Lemma)
=lim infkSϵ(h1nk)d1+lim supkSϵ(h0nk)d0\displaystyle=\int\liminf_{k\to\infty}S_{\epsilon}(h_{1}^{n_{k}})d{\mathbb{P}}_{1}+\int\limsup_{k\to\infty}S_{\epsilon}(h_{0}^{n_{k}})d{\mathbb{P}}_{0} (Equation 45)
Sϵ(lim infkh1nk)𝑑1+Sϵ(lim supkh0nk)𝑑0\displaystyle\geq\int S_{\epsilon}(\liminf_{k\to\infty}h_{1}^{n_{k}})d{\mathbb{P}}_{1}+\int S_{\epsilon}(\limsup_{k\to\infty}h_{0}^{n_{k}})d{\mathbb{P}}_{0} (Lemma 20)
=Sϵ(h1)𝑑1+Sϵ(h0)𝑑0\displaystyle=\int S_{\epsilon}(h_{1}^{*})d{\mathbb{P}}_{1}+\int S_{\epsilon}(h_{0}^{*})d{\mathbb{P}}_{0}

Therefore, (h0,h1)(h_{0}^{*},h_{1}^{*}) must be a minimizer.

 

7 Reducing Θ\Theta to RϕϵR_{\phi}^{\epsilon}

Using the properties of C_ψ^*(η), we showed in the previous section that there exist minimizers of Θ over the set S_ψ. The inequality ηh_1^* + (1−η)h_0^* ≥ C_ψ^*(η) together with (31) imply that h_1^*(x) − h_0^*(x) is a supergradient of C_ψ^* at η^*(x), and thus h_1^* − h_0^* = (C_ψ^*)′(η^*). This relation together with (31) provides two equations in two variables that can be uniquely solved for h_0^*, h_1^*, resulting in h_0^* = ψ∘−α_ψ(η^*), h_1^* = ψ∘α_ψ(η^*).

Next, primal minimizers of Θ over S_ϕ for any ϕ will be constructed from the dual maximizers ℙ_0^*, ℙ_1^* of R̄_ψ. Because α_ψ(η) = \tfrac{1}{2}\log(η/(1−η)) is a strictly increasing function, the compositions ψ∘α_ψ, ψ∘−α_ψ are strictly monotonic. Thus the complementary slackness condition (30) applied to h_1^* = ψ(α_ψ(η^*)), h_0^* = ψ(−α_ψ(η^*)) implies that supp ℙ_1^* is contained in the set of points x′ at which η^* assumes its infimum over some ϵ-ball, and supp ℙ_0^* is contained in the set of points x′ at which η^* assumes its supremum over some ϵ-ball. Thus, the functions ϕ∘α_ϕ(η^*), ϕ∘−α_ϕ(η^*) satisfy (30) for the loss ϕ. The definition of α_ϕ further implies they satisfy (31). Therefore, Lemma 15 implies that for any ϕ, h_1^* = ϕ∘α_ϕ(η^*), h_0^* = ϕ∘−α_ϕ(η^*) are primal optimal and ℙ_0^*, ℙ_1^* are dual optimal!

This reasoning about η^* is technically wrong but correct in spirit. Because η^* is a Radon-Nikodym derivative, it is only defined ℙ^*-a.e. As a result, the supremum over an ϵ-ball of the function ϕ(α_ψ(η^*)) is not well-defined. The solution is to replace η^* in the discussion above by a function η̂ that is defined everywhere and is in fact a version of the Radon-Nikodym derivative dℙ_1^*/dℙ^*. The next two lemmas describe how one constructs this function η̂.

These lemmas are the analog of the c-transform for the Kantorovich problem in optimal transport (see for instance Chapter 1 of (Santambrogio, 2015) or Section 2.5 of (Villani, 2003)).

Lemma 22

Assume that h_0, h_1 ≥ 0 and (h_0(x), h_1(x)) satisfy ηh_1 + (1−η)h_0 ≥ C_ϕ^*(η) for all η. Then if we define the transform h_1^{C_ϕ^*} via

h_{1}^{C_{\phi}^{*}}=\sup_{\eta\in[0,1)}\frac{C_{\phi}^{*}(\eta)-\eta h_{1}}{1-\eta} (46)

then h_1^{C_ϕ^*} ≤ h_0 and ηh_1 + (1−η)h_1^{C_ϕ^*} ≥ C_ϕ^*(η) for all η; in other words, h_1^{C_ϕ^*} is the smallest function f for which (f, h_1) ∈ S_ϕ. Furthermore, the function h_1^{C_ϕ^*} is Borel and there exists a function η̄: ℝ^d → [0,1] for which η̄(x)h_1(x) + (1−η̄(x))h_1^{C_ϕ^*}(x) = C_ϕ^*(η̄(x)).

Proof  For convenience, set h~0=h1Cϕ\tilde{h}_{0}=h_{1}^{C_{\phi}^{*}}. Notice that h~00\tilde{h}_{0}\geq 0 because the right-hand side of (46) evaluates to 0 at η=0\eta=0. We will show that h~0\tilde{h}_{0} is Borel and that (h~0,h1)(\tilde{h}_{0},h_{1}) is a feasible pair.

Notice that the map

G(η,α)={Cϕ(η)ηα1ηif η<1limη1Cϕ(η)ηα1ηif η=1G(\eta,\alpha)=\begin{cases}\frac{C_{\phi}^{*}(\eta)-\eta\alpha}{1-\eta}&\text{if }\eta<1\\ \lim_{\eta\to 1}\frac{C_{\phi}^{*}(\eta)-\eta\alpha}{1-\eta}&\text{if }\eta=1\end{cases} (47)

is continuous in η. Thus, the supremum in (46) can be computed over the countable set ℚ∩[0,1), and hence the function h̃_0(x) = sup_{η∈ℚ∩[0,1)} G(η, h_1(x)) is Borel measurable. Because G(η, h_1(x)) is continuous in η for each fixed x, G(·, h_1(x)) assumes its maximum on η ∈ [0,1] for each fixed x. Thus there exists a function η̄(x) that maps x to a maximizer of G(·, h_1(x)). For this function η̄(x), one can conclude that h̃_0(x) = G(η̄(x), h_1(x)) and hence

η¯(𝐱)h1(𝐱)+(1η¯(𝐱))h~0(𝐱)=Cϕ(η¯(𝐱)).\bar{\eta}({\mathbf{x}})h_{1}({\mathbf{x}})+(1-\bar{\eta}({\mathbf{x}}))\tilde{h}_{0}({\mathbf{x}})=C_{\phi}^{*}(\bar{\eta}({\mathbf{x}})). (48)

Equation 48 implies that if f(x) < h̃_0(x) at any point x, then η̄(x)h_1(x) + (1−η̄(x))f(x) < C_ϕ^*(η̄(x)), so (f, h_1) is not in the feasible set S_ϕ. Therefore, h̃_0 is the smallest function f for which (f, h_1) ∈ S_ϕ.  
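To make the transform concrete, the following numerical sketch (our illustration, not part of the formal development; the helper names C_star and c_transform are ours) evaluates (46) on a finite grid for the exponential loss, for which C_ψ^*(η) = 2√(η(1−η)); it checks non-negativity, feasibility, and that the transform of h_1 = e^{−f} recovers e^{f}.

import numpy as np

def C_star(eta):
    # C_psi^*(eta) = 2 sqrt(eta(1 - eta)) for the exponential loss (Lemma 17)
    return 2.0 * np.sqrt(eta * (1.0 - eta))

def c_transform(h1, etas):
    # evaluate the supremum in (46) over a finite grid of eta values in [0, 1)
    vals = (C_star(etas)[None, :] - etas[None, :] * h1[:, None]) / (1.0 - etas[None, :])
    return vals.max(axis=1)

etas = np.linspace(0.0, 0.999, 4000)
f = np.linspace(-2.0, 2.0, 101)  # a score function evaluated on a grid of points
h1 = np.exp(-f)                  # h1 = psi(f)
h0 = c_transform(h1, etas)

assert (h0 >= -1e-9).all()       # eta = 0 contributes the value 0, so h0 >= 0
# feasibility: eta*h1 + (1-eta)*h0 >= C*(eta) on the grid, up to rounding
gap = etas[None, :] * h1[:, None] + (1.0 - etas[None, :]) * h0[:, None] - C_star(etas)[None, :]
print(gap.min())                       # >= 0 up to floating point error
print(np.max(np.abs(h0 - np.exp(f))))  # ~ 0: for psi, the transform of e^{-f} is e^{f}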

Next we use this result to define an extension of η\eta^{*} to all of d\mathbb{R}^{d}.

Lemma 23

There exists a Borel minimizer (h_0^*, h_1^*) to Θ over S_ψ for which

η^(𝐱)h1(𝐱)+(1η^(𝐱))h0(𝐱)=Cψ(η^(𝐱))\hat{\eta}({\mathbf{x}})h_{1}^{*}({\mathbf{x}})+(1-\hat{\eta}({\mathbf{x}}))h_{0}^{*}({\mathbf{x}})=C_{\psi}^{*}(\hat{\eta}({\mathbf{x}})) (49)

for all 𝐱{\mathbf{x}} and some Borel measurable function η^:(supp)ϵ[0,1]\hat{\eta}\colon(\operatorname{supp}{\mathbb{P}})^{\epsilon}\to[0,1].

Proof  Let (h_0, h_1) be an arbitrary Borel minimizer of the primal (Lemma 21 implies that such a minimizer exists). Set h_1^* = h_1 and h_0^* = h_1^{C_ψ^*}. Then Lemma 22 implies that h_0^* ≤ h_0, so (h_0^*, h_1^*) is also optimal and ηh_1^* + (1−η)h_0^* ≥ C_ψ^*(η) for all η. Furthermore, Lemma 22 implies that there is a function η̂ for which η̂(x)h_1^*(x) + (1−η̂(x))h_0^*(x) = C_ψ^*(η̂(x)).

It remains to show that η̂ is Borel measurable. We will express η̂(x) in terms of h_0^*(x), and because h_0^*(x) is Borel measurable, it will follow that η̂ is Borel measurable as well. Because ηh_1^*(x) + (1−η)h_0^*(x) ≥ C_ψ^*(η) with equality at η = η̂(x), it follows that h_1^*(x) − h_0^*(x) is a supergradient of C_ψ^* at η = η̂(x). Thus Lemma 17 implies that h_1^* − h_0^* = (1−2η̂)/\sqrt{η̂(1−η̂)}, or equivalently h_1^* = h_0^* + (1−2η̂)/\sqrt{η̂(1−η̂)}. Plugging this expression and the formula C_ψ^*(η) = 2\sqrt{η(1−η)} into the relation η̂h_1^* + (1−η̂)h_0^* = C_ψ^*(η̂) results in the equation h_0^* + η̂(1−2η̂)/\sqrt{η̂(1−η̂)} = 2\sqrt{η̂(1−η̂)}. Solving for η̂ then results in η̂ = (h_0^*)^2/(1+(h_0^*)^2). Because h_0^* is Borel measurable, η̂ is measurable as well.  
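For the exponential loss this construction is completely explicit. The short sketch below (purely illustrative; the array eta_hat is an arbitrary stand-in for the function η̂) confirms numerically that h_1^* = ψ(α_ψ(η̂)), h_0^* = ψ(−α_ψ(η̂)) satisfy the identity (49), and that the inversion η̂ = (h_0^*)²/(1+(h_0^*)²) from the proof holds.

import numpy as np

eta_hat = np.linspace(0.01, 0.99, 99)          # stand-in values for eta-hat
alpha = 0.5 * np.log(eta_hat / (1 - eta_hat))  # alpha_psi(eta-hat), Lemma 17
h1 = np.exp(-alpha)                            # psi(alpha_psi(eta-hat))
h0 = np.exp(alpha)                             # psi(-alpha_psi(eta-hat))
lhs = eta_hat * h1 + (1 - eta_hat) * h0
rhs = 2 * np.sqrt(eta_hat * (1 - eta_hat))     # C_psi^*(eta-hat)
print(np.max(np.abs(lhs - rhs)))               # ~ 0: (49) holds pointwise
print(np.max(np.abs(eta_hat - h0**2 / (1 + h0**2))))  # ~ 0: the inversion formula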

Notice that this result together with Lemma 18 immediately implies that h_1^* = ψ(α_ψ(η̂)) and h_0^* = ψ(−α_ψ(η̂)), proving that minimizing Θ over S_ψ is equivalent to minimizing R_ψ^ϵ. Next, this observation is extended to arbitrary losses using properties of η̂. Because both ψ and α_ψ are strictly monotonic, η̂ interacts in a particularly nice way with maximizers of the dual problem:

Lemma 24

Let 0,1{\mathbb{P}}_{0}^{*},{\mathbb{P}}_{1}^{*} be any maximizer of R¯ψ\bar{R}_{\psi} over ϵ(0)×ϵ(1){{\mathcal{B}}^{\infty}_{\epsilon}}({\mathbb{P}}_{0})\times{{\mathcal{B}}^{\infty}_{\epsilon}}({\mathbb{P}}_{1}). Set =0+1{\mathbb{P}}^{*}={\mathbb{P}}_{0}^{*}+{\mathbb{P}}_{1}^{*}, η=d1/d\eta^{*}=d{\mathbb{P}}_{1}^{*}/d{\mathbb{P}}^{*}. Let η^\hat{\eta} be defined as in Lemma 23. Then η^=η\hat{\eta}=\eta^{*} {\mathbb{P}}^{*}-a.e.

Furthermore, let γi\gamma_{i} be a coupling between i,i{\mathbb{P}}_{i},{\mathbb{P}}_{i}^{*} with suppγiΔϵ\operatorname{supp}\gamma_{i}\subset\Delta_{\epsilon}. Then

suppγ1{(𝐱,𝐱):inf𝐲𝐱ϵη^(𝐲)=η^(𝐱)}\operatorname{supp}\gamma_{1}\subset\{({\mathbf{x}},{\mathbf{x}}^{\prime})\colon\inf_{\|{\mathbf{y}}-{\mathbf{x}}\|\leq\epsilon}\hat{\eta}({\mathbf{y}})=\hat{\eta}({\mathbf{x}}^{\prime})\} (50)
suppγ0{(𝐱,𝐱):sup𝐲𝐱ϵη^(𝐲)=η^(𝐱)}\operatorname{supp}\gamma_{0}\subset\{({\mathbf{x}},{\mathbf{x}}^{\prime})\colon\sup_{\begin{subarray}{c}\|{\mathbf{y}}-{\mathbf{x}}\|\leq\epsilon\end{subarray}}\hat{\eta}({\mathbf{y}})=\hat{\eta}({\mathbf{x}}^{\prime})\} (51)

The statement η̂ = η^* ℙ^*-a.e. implies that η̂ is in fact a version of the Radon-Nikodym derivative dℙ_1^*/dℙ^*.

For convenience, in this proof, we introduce the notation

Iϵ(f)(𝐱)=inf𝐲𝐱ϵf(𝐲).I_{\epsilon}(f)({\mathbf{x}})=\inf_{\|{\mathbf{y}}-{\mathbf{x}}\|\leq\epsilon}f({\mathbf{y}}).

Proof  Let h0,h1h_{0}^{*},h_{1}^{*} be the minimizer described by Lemma 23. Then Lemma 18 implies that h1=ψ(αψ(η^))h_{1}^{*}=\psi(\alpha_{\psi}(\hat{\eta})) and h0=ψ(αψ(η^))h_{0}^{*}=\psi(-\alpha_{\psi}(\hat{\eta})).

The complementary slackness condition (31) implies that η^* h_1^* + (1−η^*)h_0^* = C_ψ^*(η^*) ℙ^*-a.e., and thus Lemma 18 implies that h_1^* = ψ(α_ψ(η^*)) and h_0^* = ψ(−α_ψ(η^*)) ℙ^*-a.e. Therefore, ψ(α_ψ(η^*)) = ψ(α_ψ(η̂)) ℙ^*-a.e. Now because the functions ψ, α_ψ are strictly monotonic, they are invertible. Thus it follows that η̂ = η^* ℙ^*-a.e.

The complementary slackness condition (30), rewritten in terms of the couplings γ_i, states that

\int S_{\epsilon}(h_{i}^{*})({\mathbf{x}})-h_{i}^{*}({\mathbf{x}}^{\prime})d\gamma_{i}=0.

Because supp γ_i ⊂ Δ_ϵ, the integrand is non-negative, and hence it vanishes γ_i-a.e.

Therefore,

S_{\epsilon}(\psi(\alpha_{\psi}(\hat{\eta})))({\mathbf{x}})=\psi(\alpha_{\psi}(\hat{\eta}({\mathbf{x}}^{\prime})))\quad\gamma_{1}\text{-a.e.}\quad\text{and}\quad S_{\epsilon}(\psi(-\alpha_{\psi}(\hat{\eta})))({\mathbf{x}})=\psi(-\alpha_{\psi}(\hat{\eta}({\mathbf{x}}^{\prime})))\quad\gamma_{0}\text{-a.e.}

which implies

\psi(\alpha_{\psi}(I_{\epsilon}(\hat{\eta})({\mathbf{x}})))=\psi(\alpha_{\psi}(\hat{\eta}({\mathbf{x}}^{\prime})))\quad\gamma_{1}\text{-a.e.}\quad\text{and}\quad\psi(-\alpha_{\psi}(S_{\epsilon}(\hat{\eta})({\mathbf{x}})))=\psi(-\alpha_{\psi}(\hat{\eta}({\mathbf{x}}^{\prime})))\quad\gamma_{0}\text{-a.e.}

Now ψ,αψ\psi,\alpha_{\psi} are both strictly monotonic and thus invertible. Therefore

Iϵ(η^)(𝐱)=η^(𝐱)γ1-a.e.andSϵ(η^)(𝐱)=η^(𝐱)γ0-a.e.I_{\epsilon}(\hat{\eta})({\mathbf{x}})=\hat{\eta}({\mathbf{x}}^{\prime})\quad\gamma_{1}\text{-a.e.}\quad\text{and}\quad S_{\epsilon}(\hat{\eta})({\mathbf{x}})=\hat{\eta}({\mathbf{x}}^{\prime})\quad\gamma_{0}\text{-a.e.}

 

Next, the relation (49) suggests that h_1^* = ψ∘f^*, h_0^* = ψ∘−f^*, where f^* is a function satisfying C_ψ(η̂(x), f^*(x)) = C_ψ^*(η̂(x)). In fact, Lemma 24 implies that the analogous relation holds for every loss function, not just the exponential loss ψ. To formalize this idea, we prove the following result about minimizers of C_ϕ(η, ·) in Appendix C:

Lemma 25

Fix a loss function ϕ and let α_ϕ(η) be as in (8). Then α_ϕ maps η to the smallest minimizer of C_ϕ(η, ·). Furthermore, the function α_ϕ(η) is non-decreasing in η.

Finally, we use the complementary slackness conditions of Lemma 15 to construct a minimizer (h_0^*, h_1^*) of Θ over S_ϕ for which h_1^* = ϕ∘f^*, h_0^* = ϕ∘−f^* for some function f^*.

Lemma 26

Let ψ=eα\psi=e^{-\alpha} be the exponential loss and let ϕ\phi be any arbitrary loss. Let 0,1{\mathbb{P}}_{0}^{*},{\mathbb{P}}_{1}^{*} be any maximizer of R¯ψ\bar{R}_{\psi} over ϵ(0)×ϵ(1){{\mathcal{B}}^{\infty}_{\epsilon}}({\mathbb{P}}_{0})\times{{\mathcal{B}}^{\infty}_{\epsilon}}({\mathbb{P}}_{1}). Define =0+1{\mathbb{P}}^{*}={\mathbb{P}}_{0}^{*}+{\mathbb{P}}_{1}^{*} and η=d1/d\eta^{*}=d{\mathbb{P}}_{1}^{*}/d{\mathbb{P}}^{*}. Let η^\hat{\eta} be defined as in Lemma 23.

Then h0=ϕ(αϕ(η^))h_{0}^{*}=\phi(-\alpha_{\phi}(\hat{\eta})), h1=ϕ(αϕ(η^))h_{1}^{*}=\phi(\alpha_{\phi}(\hat{\eta})) minimize Θ\Theta over SϕS_{\phi} and (0,1)({\mathbb{P}}_{0}^{*},{\mathbb{P}}_{1}^{*}) maximize R¯ϕ\bar{R}_{\phi} over ϵ(0)×ϵ(1){{\mathcal{B}}^{\infty}_{\epsilon}}({\mathbb{P}}_{0})\times{{\mathcal{B}}^{\infty}_{\epsilon}}({\mathbb{P}}_{1}).

Thus there exists a Borel minimizer to RϕϵR_{\phi}^{\epsilon} and inffRϕϵ(f)=inf(h0,h1)SϕΘ(h0,h1)\inf_{f}R_{\phi}^{\epsilon}(f)=\inf_{(h_{0},h_{1})\in S_{\phi}}\Theta(h_{0},h_{1}).

Proof  We will verify the complementary slackness conditions of Lemma 15.

Lemma 24 implies that η^=η\hat{\eta}=\eta^{*} {\mathbb{P}}^{*}-a.e. Therefore, {\mathbb{P}}^{*}-a.e.,

C_{\phi}^{*}(\eta^{*})=C_{\phi}^{*}(\hat{\eta})=\hat{\eta}h_{1}^{*}+(1-\hat{\eta})h_{0}^{*}=\eta^{*}h_{1}^{*}+(1-\eta^{*})h_{0}^{*}

This calculation verifies the complementary slackness condition (31).

We next verify the other complementary slackness condition (30). Let γ_i be a coupling between ℙ_i, ℙ_i^* with supp γ_i ⊂ Δ_ϵ. Because ϕ∘α_ϕ and ϕ∘−α_ϕ are monotonic, the conditions (50) and (51) imply that

ϕ(αϕ(η^))𝑑1=ϕ(αϕ(η^(𝐱)))𝑑γ1(𝐱,𝐱)=Sϵ(ϕ(αϕ(η^)))(𝐱)𝑑γ1(𝐱,𝐱)=Sϵ(ϕ(αϕ(η^)))𝑑1\int\phi(\alpha_{\phi}(\hat{\eta}))d{\mathbb{P}}_{1}^{*}=\int\phi(\alpha_{\phi}(\hat{\eta}({\mathbf{x}}^{\prime})))d\gamma_{1}({\mathbf{x}},{\mathbf{x}}^{\prime})=\int S_{\epsilon}(\phi(\alpha_{\phi}(\hat{\eta})))({\mathbf{x}})d\gamma_{1}({\mathbf{x}},{\mathbf{x}}^{\prime})=\int S_{\epsilon}(\phi(\alpha_{\phi}(\hat{\eta})))d{\mathbb{P}}_{1}
ϕ(αϕ(η^))𝑑0=ϕ(αϕ(η^(𝐱)))𝑑γ0(𝐱,𝐱)=Sϵ(ϕ(αϕ(η^)))(𝐱)𝑑γ0(𝐱,𝐱)=Sϵ(ϕ(αϕ(η^)))𝑑0\int\phi(-\alpha_{\phi}(\hat{\eta}))d{\mathbb{P}}_{0}^{*}=\int\phi(-\alpha_{\phi}(\hat{\eta}({\mathbf{x}}^{\prime})))d\gamma_{0}({\mathbf{x}},{\mathbf{x}}^{\prime})=\int S_{\epsilon}(\phi(-\alpha_{\phi}(\hat{\eta})))({\mathbf{x}})d\gamma_{0}({\mathbf{x}},{\mathbf{x}}^{\prime})=\int S_{\epsilon}(\phi(-\alpha_{\phi}(\hat{\eta})))d{\mathbb{P}}_{0}

This calculation verifies the complementary slackness condition (30).  

Theorems 6 and 9 immediately follow from Lemmas 14 and 26.

8 Conclusion

We initiated the study of minimizers and minimax relations for adversarial surrogate risks. Specifically, we proved that there always exists a minimizer to the adversarial surrogate risk when perturbing in a closed ϵ-ball, as well as a maximizer of the dual problem. Just like the results of Pydi and Jog (2021), our minimax theorem provides an interpretation of the adversarial learning problem as a game between two players. This theory helps explain the phenomenon of transfer attacks. We hope the insights gained from this research will assist in the development of algorithms for training classifiers robust to adversarial perturbations.


Acknowledgments

Natalie Frank was supported in part by the Research Training Group in Modeling and Simulation funded by the National Science Foundation via grant RTG/DMS – 1646339. Jonathan Niles-Weed was supported in part by a Sloan Research Fellowship.

Appendix A The Universal σ\sigma-Algebra and a Generalization of Theorem 1

A.1 Definition of the Universal σ\sigma-Algebra and Main Measurability Result

In this Appendix, we prove results for suprema over an arbitrary compact set, not necessarily an ϵ-ball. For a function g: ℝ^d → ℝ, we will abuse notation and denote the supremum of g over the compact set B by

SB(g)(𝐱)=sup𝐡Bg(𝐱+𝐡).S_{B}(g)({\mathbf{x}})=\sup_{{\mathbf{h}}\in B}g({\mathbf{x}}+{\mathbf{h}}).
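As a concrete illustration of this operator (a minimal sketch on a one-dimensional grid with B a closed interval; the helper sup_over_B and its parameters are our own), S_B(g) can be computed by a brute-force sliding-window maximum.

import numpy as np

def sup_over_B(g, xs, radius):
    # S_B(g)(x) = sup_{h in B} g(x + h) with B = [-radius, radius],
    # approximated on the uniform grid xs by a window maximum
    dx = xs[1] - xs[0]
    k = int(round(radius / dx))
    p = np.pad(g, k, mode="edge")  # extend g past the boundary by its edge values
    return np.array([p[i:i + 2 * k + 1].max() for i in range(len(xs))])

xs = np.linspace(-3.0, 3.0, 601)
g = np.exp(-xs ** 2)                # a bump centered at the origin
Sg = sup_over_B(g, xs, 0.5)
print(Sg[np.abs(xs) <= 0.5].min())  # ~ 1.0: S_B(g) = 1 wherever the ball reaches 0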

Let X be a separable metric space and let ℬ(X) be the Borel σ-algebra on X. Denote the completion of ℬ(X) with respect to a Borel measure ν by L_ν(X). Let ℳ_+(X) be the set of all finite positive Borel measures on X. (Alternatively, one could compute the intersection in (52) over all σ-finite measures; these two approaches are equivalent because for every σ-finite measure λ and compact set K, the restriction λ|_K is a finite measure with L_{λ|_K}(X) ⊃ L_λ(X).) Then the universal σ-algebra on X, 𝒰(X), is

{\mathscr{U}}(X)=\bigcap_{\nu\in\mathcal{M}_{+}(X)}\mathcal{L}_{\nu}(X). (52)

In other words, the universal σ-algebra is the σ-algebra of sets which are measurable with respect to the completion of every finite Borel measure. Thus 𝒰(X) is contained in L_ν(X) for every Borel measure ν.

Theorem 27

If ff is universally measurable and BB is a compact set, then SB(f)S_{B}(f) is universally measurable.

Thus, if ℙ_0, ℙ_1 are Borel, integrals of the form ∫S_ϵ(g)dℙ_i in (10) can be interpreted as the integral of S_ϵ(g) with respect to the completion of ℙ_i.

A.2 Proof Outline

To prove Theorem 27, we analyze the level sets of SB(g)S_{B}(g). One can compute the level set [SB(g)(𝐱)>a][S_{B}(g)({\mathbf{x}})>a] using a direct sum.

Lemma 28

Let g: ℝ^d → ℝ be any function. For a set B, define −B = {−b: b ∈ B}. Then

[SB(g)>a]=[g>a]B[S_{B}(g)>a]=[g>a]\oplus-B

Proof  To start, notice that SB(g)(𝐱)>aS_{B}(g)({\mathbf{x}})>a iff there is some 𝐡B{\mathbf{h}}\in B for which g(𝐱+𝐡)>ag({\mathbf{x}}+{\mathbf{h}})>a. Thus

𝐱[SB(g)>a]𝐱+𝐡[g>a] for some 𝐡B𝐱[g>a]B{\mathbf{x}}\in[S_{B}(g)>a]\Leftrightarrow{\mathbf{x}}+{\mathbf{h}}\in[g>a]\text{ for some }{\mathbf{h}}\in B\Leftrightarrow{\mathbf{x}}\in[g>a]\oplus-B

 
Therefore, to show that S_B(g) is universally measurable for universally measurable g, it suffices to show that the direct sum of a universally measurable set and a compact set is universally measurable. Thus, to prove Theorem 27, it suffices to demonstrate the following result:

Theorem 29

Let A𝒰(d)A\in{\mathscr{U}}(\mathbb{R}^{d}) and let BB be a compact set. Then AB𝒰(d)A\oplus B\in{\mathscr{U}}(\mathbb{R}^{d}).

The proof of Theorem 29 follows from fundamental concepts of measure theory. A classical measure theory result states that if f: X → Y is a continuous function, then f^{-1} maps Borel sets in Y to Borel sets in X. Consider now the function w: B×ℝ^d → B×ℝ^d given by w(h, x) = (h, x−h). Then w is invertible and its inverse is w^{-1}(h, x) = (h, x+h). Furthermore, w^{-1} maps the set B×A to B×A⊕B. Therefore, if A ∈ ℬ(ℝ^d), then B×A⊕B is Borel in ℬ(B×ℝ^d). However, from this statement, one cannot conclude that A⊕B is Borel in ℝ^d! On the other hand, one can use regularity of measures to conclude that A⊕B is in 𝒰(ℝ^d). Therefore, to prove Theorem 29, we prove the following two results:

Lemma 30

Let BdB\subset\mathbb{R}^{d} be a compact set. Then B×A𝒰(B×d)B\times A\in{\mathscr{U}}(B\times\mathbb{R}^{d}) iff A𝒰(d)A\in{\mathscr{U}}(\mathbb{R}^{d}).

In this document, we say a function f:XYf\colon X\to Y is universally measurable if f1(E)𝒰(X)f^{-1}(E)\in{\mathscr{U}}(X) whenever E𝒰(Y)E\in{\mathscr{U}}(Y).

Lemma 31

Let f:XYf:X\to Y be a Borel measurable function. Then ff is universally measurable as well.

This result is stated on page 171 of (Bertsekas and Shreve, 1996), but we include a proof below for completeness.

Lemma 31 applied to ww implies that the set B×ABB\times A\oplus B is universally measurable while Lemma 30 implies that ABA\oplus B is universally measurable.

A.3 Proof of Theorem 29

We begin by proving Lemma 31.

Proof [Proof of Lemma 31] Let A be a Borel set in Y. We will show that for any finite measure ν on X, f^{-1}(A) ∈ L_ν(X). As ν is arbitrary, this statement will imply that f^{-1}(A) ∈ 𝒰(X).

Consider the pushforward measure μ=fν\mu=f\sharp\nu. This measure is a finite measure on YY, so by the definition of 𝒰(Y){\mathscr{U}}(Y), Aμ(Y)A\in\mathcal{L}_{\mu}(Y). Therefore, there are Borel sets B1AB2B_{1}\subset A\subset B_{2} in YY for which μ(B1)=μ(B2)\mu(B_{1})=\mu(B_{2}). Thus, f1(B1),f1(B2)f^{-1}(B_{1}),f^{-1}(B_{2}) are Borel sets in XX for which f1(B1)f1(A)f1(B2)f^{-1}(B_{1})\subset f^{-1}(A)\subset f^{-1}(B_{2}) and ν(f1(B1))=ν(f1(B2))\nu(f^{-1}(B_{1}))=\nu(f^{-1}(B_{2})). Therefore, f1(A)ν(X)f^{-1}(A)\in\mathcal{L}_{\nu}(X).  

On the other hand, the proof of Lemma 30 relies on the definition of a regular space XX:

Definition 32

A measure ν\nu is inner regular if for every Borel set AA,

ν(A)=supK compactKAν(K).\nu(A)=\sup_{\begin{subarray}{c}K\text{ compact}\\ K\subset A\end{subarray}}\nu(K).

The topological space XX is regular if every finite Borel measure on XX is inner regular.

The following result implies that most topological spaces encountered in applications are regular.

Theorem 33

A σ\sigma-compact locally compact Hausdorff space is regular.

This theorem is a consequence of Theorem 7.8 of (Folland, 1999).

The notion of regularity extends to complete measures.

Lemma 34

Let ν̄ be the completion of a measure ν on a regular space X. Then for any A ∈ L_ν(X),

ν¯(A)=supK compactKAν(K).\overline{\nu}(A)=\sup_{\begin{subarray}{c}K\text{ compact}\\ K\subset A\end{subarray}}\nu(K).

The proof of this result is left as an exercise to the reader.

Now using the concept of regularity, we prove Lemma 30.

Proof [Proof of Lemma 30] We first prove the forward direction. Consider the projection function Π2:B×dd\Pi_{2}\colon B\times\mathbb{R}^{d}\to\mathbb{R}^{d} given by Π2(𝐱,𝐲)=𝐲\Pi_{2}({\mathbf{x}},{\mathbf{y}})={\mathbf{y}}. Then Π2\Pi_{2} is a continuous function and Π21(A)=B×A\Pi_{2}^{-1}(A)=B\times A. Therefore Lemma 31 implies that if AA is universally measurable in d\mathbb{R}^{d}, then B×AB\times A is universally measurable in B×dB\times\mathbb{R}^{d}.

To prove the other direction, assume that B×AB\times A is universally measurable in B×dB\times\mathbb{R}^{d}. Let ν\nu be any finite Borel measure on d\mathbb{R}^{d}. We will find Borel sets B1,B2B_{1},B_{2} with B1AB2B_{1}\subset A\subset B_{2} for which ν(B1)=ν(B2)\nu(B_{1})=\nu(B_{2}), and thus Aν(d)A\in\mathcal{L}_{\nu}(\mathbb{R}^{d}). As ν\nu was arbitrary, it follows that AA is universally measurable.

Theorem 33 implies that B×dB\times\mathbb{R}^{d} is a regular space. Fix a Borel probability measure λ\lambda on BB. Then λ×ν\lambda\times\nu is a finite Borel measure on B×dB\times\mathbb{R}^{d}, so it is inner regular. Let λ×ν¯\overline{\lambda\times\nu} be the completion of λ×ν\lambda\times\nu. Then by Lemma 34,

λ×ν¯(B×A)=supK compactKB×Aλ×ν(K)\overline{\lambda\times\nu}(B\times A)=\sup_{\begin{subarray}{c}K\text{ compact}\\ K\subset B\times A\end{subarray}}\lambda\times\nu(K)

We will now argue that

supK compactKB×Aλ×ν(K)=supK compactKAν(K)\sup_{\begin{subarray}{c}K\text{ compact}\\ K\subset B\times A\end{subarray}}\lambda\times\nu(K)=\sup_{\begin{subarray}{c}K\text{ compact}\\ K\subset A\end{subarray}}\nu(K) (53)

Let K ⊂ B×A be compact and let Π_2 be the projection onto the second coordinate. Because the continuous image of a compact set is compact, K′ = Π_2(K) is compact and contained in A. Thus B×A ⊃ B×K′ ⊃ K, which implies (53). Now (53) applied to A^C implies that

\overline{\lambda\times\nu}(B\times A)=\inf_{\begin{subarray}{c}U^{C}\text{ compact}\\ U\supset B\times A\end{subarray}}\lambda\times\nu(U)=\inf_{\begin{subarray}{c}U^{C}\text{ compact}\\ U\supset A\end{subarray}}\nu(U).

Thus

supK compactKAν(K)=infUC compactUAν(U):=m\sup_{\begin{subarray}{c}K\text{ compact}\\ K\subset A\end{subarray}}\nu(K)=\inf_{\begin{subarray}{c}U^{C}\text{ compact}\\ U\supset A\end{subarray}}\nu(U):=m

Let KnK_{n} be a sequence of compact sets contained in AA for which limnν(Kn)=m\lim_{n\to\infty}\nu(K_{n})=m and UnU_{n} a sequence of sets containing AA for which UnCU_{n}^{C} is compact and limnν(Un)=m\lim_{n\to\infty}\nu(U_{n})=m. Because a finite union of compact sets is compact, one can choose such sequences that satisfy Kn+1KnK_{n+1}\supset K_{n} and Un+1UnU_{n+1}\subset U_{n}. Then B1=KnB_{1}=\bigcup K_{n}, B2=UnB_{2}=\bigcap U_{n} are Borel sets that satisfy B1AB2B_{1}\subset A\subset B_{2} and ν(B1)=ν(B2)\nu(B_{1})=\nu(B_{2}) so Aν(d)A\in\mathcal{L}_{\nu}(\mathbb{R}^{d}).

 

Lastly, we formally prove Theorem 29.

Proof [Proof of Theorem 29] Consider the function w: B×ℝ^d → B×ℝ^d given by w(h, x) = (h, x−h). Then w is continuous, invertible, and w^{-1}(h, x) = (h, x+h).

Now let A ∈ 𝒰(ℝ^d). Then Lemma 30 implies that B×A is universally measurable in B×ℝ^d. Lemma 31 then implies that w^{-1}(B×A) = B×A⊕B is universally measurable as well. Finally, Lemma 30 implies that A⊕B ∈ 𝒰(ℝ^d).  

Appendix B Alternative Characterizations of the WW_{\infty} metric

We start by proving Lemma 3 using a measurable selection theorem.

Theorem 35

Let X,YX,Y be Borel sets and assume that DX×YD\subset X\times Y is also Borel. Let DxD_{x} denote

Dx={y:(x,y)D}D_{x}=\{y\colon(x,y)\in D\}

and

\operatorname{Proj}_{X}(D):=\{x\colon(x,y)\in D\text{ for some }y\in Y\}

Let f:D¯f\colon D\to\overline{\mathbb{R}} be a Borel function mapping DD to ¯\overline{\mathbb{R}} and define

f(x)=infyDxf(x,y)f^{*}(x)=\inf_{y\in D_{x}}f(x,y)

Assume that f(x)>f^{*}(x)>-\infty for all xx. Then for any δ>0\delta>0, there is a universally measurable φ:ProjX(D)Y\varphi\colon\operatorname{Proj}_{X}(D)\to Y for which

f(x,\varphi(x))\leq f^{*}(x)+\delta

This statement is a consequence of Proposition 7.50 from (Bertsekas and Shreve, 1996).

We use the following results about universally measurable functions, see Lemma 7.27 of (Bertsekas and Shreve, 1996).

Lemma 36

Let g:dg:\mathbb{R}^{d}\to\mathbb{R} be a universally measurable function and let {\mathbb{Q}} be a Borel measure. Then there is a Borel measurable function φ\varphi for which φ=g\varphi=g {\mathbb{Q}}-a.e.

This result can be extended to d\mathbb{R}^{d}-valued functions:

Lemma 37

Let g:ddg:\mathbb{R}^{d}\to\mathbb{R}^{d} be a universally measurable function and let {\mathbb{Q}} be a Borel measure. Then there is a Borel measurable function φ\varphi for which φ=g\varphi=g {\mathbb{Q}}-a.e.

Proof  Let 𝐞i\mathbf{e}_{i} denote the iith basis vector. Then gi:=𝐞igg_{i}:=\mathbf{e}_{i}\cdot g is a universally measurable function from d\mathbb{R}^{d} to \mathbb{R}, so by Lemma 36, there is a Borel function φi\varphi_{i} for which φi=gi\varphi_{i}=g_{i} {\mathbb{Q}}-a.e. Then if we define φ=(φ1,φ2,,φd)\varphi=(\varphi_{1},\varphi_{2},\ldots,\varphi_{d}), this function is equal to gg {\mathbb{Q}}-a.e.  

Finally, we prove Lemma 3. Due to Lemmas 36 and 37, this lemma relies heavily on the fact that the domain of our functions is ℝ^d rather than an arbitrary metric space.

Lemma 3 (restated)

Recall that this paper defines the left-hand side of (13) as the integral of S_ϵ(f) with respect to the completion of ℚ.

Proof 

To start, let {\mathbb{Q}}^{\prime} be a Borel measure satisfying W(,)ϵW_{\infty}({\mathbb{Q}}^{\prime},{\mathbb{Q}})\leq\epsilon.

Let γ\gamma be a coupling with marginals {\mathbb{Q}} and {\mathbb{Q}}^{\prime} supported on Δϵ\Delta_{\epsilon}. Then

f𝑑=f(𝐱)𝑑γ(𝐱,𝐱)=f(𝐱)𝟏𝐱𝐱ϵ𝑑γ(𝐱,𝐱)\displaystyle\int fd{\mathbb{Q}}^{\prime}=\int f({\mathbf{x}}^{\prime})d\gamma({\mathbf{x}},{\mathbf{x}}^{\prime})=\int f({\mathbf{x}}^{\prime}){\mathbf{1}}_{\|{\mathbf{x}}^{\prime}-{\mathbf{x}}\|\leq\epsilon}d\gamma({\mathbf{x}},{\mathbf{x}}^{\prime})
Sϵ(f)(𝐱)𝟏𝐱𝐱ϵ𝑑γ(𝐱,𝐱)=Sϵ(f)(𝐱)𝑑γ(𝐱,𝐱)=Sϵ(f)𝑑\displaystyle\leq\int S_{\epsilon}(f)({\mathbf{x}}){\mathbf{1}}_{\|{\mathbf{x}}^{\prime}-{\mathbf{x}}\|\leq\epsilon}d\gamma({\mathbf{x}},{\mathbf{x}}^{\prime})=\int S_{\epsilon}(f)({\mathbf{x}})d\gamma({\mathbf{x}},{\mathbf{x}}^{\prime})=\int S_{\epsilon}(f)d{\mathbb{Q}}

Therefore, we can conclude that

supϵ()f𝑑Sϵ(f)𝑑.\sup_{{\mathbb{Q}}^{\prime}\in{{\mathcal{B}}^{\infty}_{\epsilon}}({\mathbb{Q}})}\int fd{\mathbb{Q}}^{\prime}\leq\int S_{\epsilon}(f)d{\mathbb{Q}}.

We will show the opposite inequality by applying the measurable selection theorem. Theorem 35 implies that for each δ > 0, one can find a universally measurable function φ: ℝ^d → ℝ^d with φ(x) ∈ \overline{B_ϵ(x)} for which f(φ(x)) + δ ≥ S_ϵ(f)(x). By Lemma 37, one can find a Borel measurable function T for which T = φ ℚ-a.e.

Let =T1{\mathbb{Q}}^{\prime}={\mathbb{Q}}\circ T^{-1}. Because TT is Borel measurable, {\mathbb{Q}}^{\prime} and fTf\circ T are Borel. We will now argue that f𝑑+δSϵ(f)𝑑\int fd{\mathbb{Q}}^{\prime}+\delta\geq\int S_{\epsilon}(f)d{\mathbb{Q}}. Recall that φ\varphi is always measurable with respect to the completion of {\mathbb{Q}}, and by convention g𝑑\int gd{\mathbb{Q}} means integration with respect to the completion of {\mathbb{Q}}. Then if we define M=(d)M={\mathbb{Q}}(\mathbb{R}^{d}),

f𝑑=f𝑑T1=f(T(𝐱))𝑑=f(φ(𝐱))𝑑Sϵ(f)δd=Sϵ(f)𝑑δM\int fd{\mathbb{Q}}^{\prime}=\int fd{\mathbb{Q}}\circ T^{-1}=\int f(T({\mathbf{x}}))d{\mathbb{Q}}=\int f(\varphi({\mathbf{x}}))d{\mathbb{Q}}\geq\int S_{\epsilon}(f)-\delta d{\mathbb{Q}}=\int S_{\epsilon}(f)d{\mathbb{Q}}-\delta M

Because δ>0\delta>0 was arbitrary and ϵ(){\mathbb{Q}}^{\prime}\in{{\mathcal{B}}^{\infty}_{\epsilon}}({\mathbb{Q}}),

Sϵ(f)𝑑supϵ()f𝑑\int S_{\epsilon}(f)d{\mathbb{Q}}\leq\sup_{{\mathbb{Q}}^{\prime}\in{{\mathcal{B}}^{\infty}_{\epsilon}}({\mathbb{Q}})}\int fd{\mathbb{Q}}^{\prime}

It remains to show that W_∞(ℚ, ℚ′) ≤ ϵ. Define a function G: ℝ^d → ℝ^d×ℝ^d by G(x) = (x, T(x)) and a coupling γ by γ = G♯ℚ. Then γ(Δ_ϵ) = G♯ℚ(Δ_ϵ) = ℚ(G^{-1}(Δ_ϵ)) = ℚ(ℝ^d), so supp(γ) ⊆ Δ_ϵ.  
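The pushforward construction in this proof is easy to imitate numerically: take T(x) to be an approximate maximizer of f over the closed ϵ-ball around x, set ℚ′ = T♯ℚ, and compare ∫f dℚ′ with ∫S_ϵ(f) dℚ. The sketch below (our illustration; the grid argmax plays the role of the selection φ, and the test function is an arbitrary choice) does this for an empirical measure on ℝ.

import numpy as np

rng = np.random.default_rng(0)
eps = 0.3
f = lambda x: np.sin(3.0 * x) + 0.5 * x  # an arbitrary continuous test function
X = rng.normal(size=5000)                # samples representing Q

hs = np.linspace(-eps, eps, 201)         # a grid on the closed eps-ball
vals = f(X[:, None] + hs[None, :])       # f evaluated over each sample's ball
T = X + hs[vals.argmax(axis=1)]          # T(x): approximate argmax, the role of phi
S_eps_f = vals.max(axis=1)               # S_eps(f) evaluated on the sample

print(np.mean(S_eps_f) - np.mean(f(T)))  # 0 by construction: int f dQ' = int S_eps(f) dQ
print(np.max(np.abs(T - X)) <= eps)      # True, so W_inf(T#Q, Q) <= eps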

Next we prove Lemma 4. We begin by presenting Strassen's theorem; see Corollary 1.28 of (Villani, 2003) for more details.

Theorem 38 (Strassen’s Theorem)

Let ℙ, ℚ be positive finite measures with the same mass and let ϵ ≥ 0. Let Π(ℙ, ℚ) denote the set of couplings of ℙ and ℚ. Then

\inf_{\pi\in\Pi({\mathbb{P}},{\mathbb{Q}})}\pi(\{\|{\mathbf{x}}-{\mathbf{y}}\|>\epsilon\})=\sup_{A\text{ closed}}{\mathbb{Q}}(A)-{\mathbb{P}}(A^{\epsilon}) (54)

Strassen's theorem is usually stated with A^ϵ in (54) replaced by A^{ϵ]} = {x: dist(x, A) ≤ ϵ}; however, for closed sets A^{ϵ]} = A^ϵ. Strassen's theorem together with Urysohn's lemma then immediately proves Lemma 4.

Lemma 39 (Urysohn’s Lemma)

Let A and B be two closed and disjoint subsets of ℝ^d. Then there exists a continuous function f: ℝ^d → [0,1] for which f = 0 on A and f = 1 on B.

See for instance result 4.15 of (Folland, 1999).

Lemma 4 (restated)

Proof  First, notice that Lemma 3 implies that if ϵ(){\mathbb{Q}}\in{{\mathcal{B}}^{\infty}_{\epsilon}}({\mathbb{P}}), then Sϵ(h)𝑑h𝑑\int S_{\epsilon}(h)d{\mathbb{P}}\geq\int hd{\mathbb{Q}} for all hCb(d)h\in C_{b}(\mathbb{R}^{d}), proving the inequality \geq in the statement of the lemma.

We will now argue the other inequality: specifically, we will show that

supA closed(A)(Aϵ)suphCb(d)h𝑑Sϵ(h)𝑑\sup_{A\text{ closed}}{\mathbb{Q}}(A)-{\mathbb{P}}(A^{\epsilon})\leq\sup_{h\in C_{b}(\mathbb{R}^{d})}\int hd{\mathbb{Q}}-\int S_{\epsilon}(h)d{\mathbb{P}} (55)

Strassen's theorem will then imply that W_∞(ℙ, ℚ) ≤ ϵ. Let δ > 0 be arbitrary and let A be a closed set that satisfies sup_{A′ closed} ℚ(A′) − ℙ(A′^ϵ) ≤ ℚ(A) − ℙ(A^ϵ) + δ. Now because A is closed, A_n = A ⊕ B_{1/n}(0) is a sequence of open sets decreasing to A, and A_n^ϵ = A^ϵ ⊕ B_{1/n}(0) is a sequence of open sets decreasing to A^ϵ. Thus pick n sufficiently large so that ℙ(A_n^ϵ) − ℙ(A^ϵ) ≤ δ. By Urysohn's lemma, one can choose a continuous function h which is 1 on A, 0 on A_n^C, and between 0 and 1 on A_n ∖ A. Then S_ϵ(h) is 1 on A^ϵ, 0 on (A_n^ϵ)^C, and between 0 and 1 on A_n^ϵ ∖ A^ϵ. Then ∫h dℚ − ℚ(A) ≥ 0 and thus

(h𝑑Sϵ(h)𝑑)((A)(Aϵ))(Aϵ)(Anϵ)δ.\left(\int hd{\mathbb{Q}}-\int S_{\epsilon}(h)d{\mathbb{P}}\right)-\left({\mathbb{Q}}(A)-{\mathbb{P}}(A^{\epsilon})\right)\geq{\mathbb{P}}(A^{\epsilon})-{\mathbb{P}}(A_{n}^{\epsilon})\geq-\delta.

Because δ\delta was arbitrary, (55) follows.  

Appendix C Minimizers of Cϕ(η,)C_{\phi}(\eta,\cdot): Proof of Lemma 25

Lemma 25 (restated)

Proof  To start, we will show that α_ϕ(η) as defined in (8) is a minimizer of C_ϕ(η, ·). Let S be the set of minimizers of C_ϕ(η, ·), which is non-empty due to the lower semi-continuity of ϕ. Let a = inf S = α_ϕ(η) and let s_i ∈ S be a sequence converging to a. Then because ϕ is lower semi-continuous,

Cϕ(η)=lim infiηϕ(si)+(1η)ϕ(si)ηϕ(a)+(1η)ϕ(a)C_{\phi}^{*}(\eta)=\liminf_{i\to\infty}\eta\phi(s_{i})+(1-\eta)\phi(-s_{i})\geq\eta\phi(a)+(1-\eta)\phi(-a)

Then a is in fact a minimizer of C_ϕ(η, ·), so it is the smallest minimizer of C_ϕ(η, ·).

We will now show that the function α_ϕ is non-decreasing. Fix η_1, η_2 ∈ [0,1] with η_1 ≤ η_2.

One can write

Cϕ(η2,α)\displaystyle C_{\phi}(\eta_{2},\alpha) =η2ϕ(α)+(1η2)ϕ(α)\displaystyle=\eta_{2}\phi(\alpha)+(1-\eta_{2})\phi(-\alpha)
=η1ϕ(α)+(1η1)ϕ(α)+(η2η1)(ϕ(α)ϕ(α))\displaystyle=\eta_{1}\phi(\alpha)+(1-\eta_{1})\phi(-\alpha)+(\eta_{2}-\eta_{1})(\phi(\alpha)-\phi(-\alpha))
=Cϕ(η1,α)+(η2η1)(ϕ(α)ϕ(α))\displaystyle=C_{\phi}(\eta_{1},\alpha)+(\eta_{2}-\eta_{1})(\phi(\alpha)-\phi(-\alpha)) (56)

Notice that the function αϕ(α)ϕ(α)\alpha\mapsto\phi(\alpha)-\phi(-\alpha) is non-increasing. Then because αϕ(η1)\alpha_{\phi}(\eta_{1}) is the smallest minimizer of Cϕ(η1,α)C_{\phi}(\eta_{1},\alpha), if α<αϕ(η1)\alpha<\alpha_{\phi}(\eta_{1}), then Cϕ(η1,α)>Cϕ(η1,αϕ(η1))C_{\phi}(\eta_{1},\alpha)>C_{\phi}(\eta_{1},\alpha_{\phi}(\eta_{1})). Furthermore, ϕ(α)ϕ(α)ϕ(αϕ(η1))ϕ(αϕ(η1))\phi(\alpha)-\phi(-\alpha)\geq\phi(\alpha_{\phi}(\eta_{1}))-\phi(-\alpha_{\phi}(\eta_{1})). Therefore, (56) implies that Cϕ(η2,α)>Cϕ(η2,αϕ(η1))C_{\phi}(\eta_{2},\alpha)>C_{\phi}(\eta_{2},\alpha_{\phi}(\eta_{1})), and thus α\alpha cannot be a minimizer of Cϕ(η2,)C_{\phi}(\eta_{2},\cdot). Therefore, αϕ(η2)αϕ(η1)\alpha_{\phi}(\eta_{2})\geq\alpha_{\phi}(\eta_{1}).

 

Appendix D Continuity properties of R̄_ϕ: Proof of Lemma 12

Recall the function G(η,α)G(\eta,\alpha) defined by (47). With this notation, one can write the CϕC_{\phi}^{*} transform as h1Cϕ=supη[0,1]G(η,h1)h_{1}^{C_{\phi}^{*}}=\sup_{\eta\in[0,1]}G(\eta,h_{1}).

Lemma 40

Let c>0c>0 and consider αc\alpha\geq c. Let a(α)=αCϕa(\alpha)=\alpha^{C_{\phi}^{*}}, where the CϕC_{\phi}^{*} transform is as in Lemma 22. Then there is a constant k<1k<1 for which

a(α)=supη[0,k]Cϕ(η)ηα1ηa(\alpha)=\sup_{\eta\in[0,k]}\frac{C_{\phi}^{*}(\eta)-\eta\alpha}{1-\eta} (57)

The constant k depends only on c.

Proof  Recall that the function G(η, α) is decreasing in α for fixed η and continuous in η on [0,1). Let k = sup{η: G(η, c) > 0}. As c is strictly positive, one can conclude that lim_{η→1} G(η, c) = −∞ and as a result k < 1. Because G is decreasing in α, one can conclude that G(η, α) ≤ 0 for all η > k and α ≥ c. However, sup_{η∈[0,1]} G(η, α) ≥ 0 because G(0, α) = 0 for all α. Thus (57) holds.

 

Lemma 41

Let {fα}\{f_{\alpha}\} be a set of LL-Lipschitz functions. Then supαfα\sup_{\alpha}f_{\alpha} is also LL-Lipschitz.

This statement is proved in Box 1.8 of (Santambrogio, 2015).

Lemma 42

Let ℚ be any finite measure and assume that g is a non-negative function in L^1(ℚ). Let δ > 0. Then there is a lower semi-continuous function g̃ for which ∫|g − g̃| dℚ < δ and g̃ ≥ 0.

See Proposition 7.14 of (Folland, 1999).

Lemma 43

Let gg be a lower semi-continuous function bounded from below. Then there is a sequence of Lipschitz functions that approaches gg from below.

This statement appears in Box 1.5 of (Santambrogio, 2015).

Corollary 44

Let hh be an L1()L^{1}({\mathbb{Q}}) function with h0h\geq 0. Then for any δ\delta, there exists a Lipschitz h~\tilde{h} for which |hh~|𝑑<δ\int|h-\tilde{h}|d{\mathbb{Q}}<\delta.

Proof  By Lemma 42, one can pick a lower semi-continuous g~\tilde{g} for which g~0\tilde{g}\geq 0 and |hg~|𝑑<δ/2\int|h-\tilde{g}|d{\mathbb{Q}}<\delta/2. Next, by Lemma 43, one can pick a Lipschitz h~\tilde{h} for which |g~h~|𝑑δ/2\int|\tilde{g}-\tilde{h}|d{\mathbb{Q}}\leq\delta/2. Thus |hh~|𝑑<δ\int|h-\tilde{h}|d{\mathbb{Q}}<\delta.  
Lemma 12 (restated)

Proof  Let =0+1{\mathbb{P}}^{\prime}={\mathbb{P}}_{0}^{\prime}+{\mathbb{P}}_{1}^{\prime} and η=d1/d\eta^{\prime}=d{\mathbb{P}}_{1}^{\prime}/d{\mathbb{P}}^{\prime}. Then for any (h0,h1)SϕE(h_{0},h_{1})\in S_{\phi}\cap E,

h1𝑑1+h0𝑑0=ηh1+(1η)h0dCϕ(η)𝑑=R¯ϕ(0,1).\int h_{1}d{\mathbb{P}}_{1}^{\prime}+\int h_{0}d{\mathbb{P}}_{0}^{\prime}=\int\eta^{\prime}h_{1}+(1-\eta^{\prime})h_{0}d{\mathbb{P}}^{\prime}\geq\int C_{\phi}^{*}(\eta^{\prime})d{\mathbb{P}}^{\prime}=\bar{R}_{\phi}({\mathbb{P}}_{0}^{\prime},{\mathbb{P}}_{1}^{\prime}).

We will now focus on showing the other inequality. Define a function ff by

f(𝐱)={αϕ(η(𝐱))𝐱supp0𝐱suppf({\mathbf{x}})=\begin{cases}\alpha_{\phi}(\eta^{\prime}({\mathbf{x}}))&{\mathbf{x}}\in\operatorname{supp}{\mathbb{P}}^{\prime}\\ 0&{\mathbf{x}}\not\in\operatorname{supp}{\mathbb{P}}^{\prime}\end{cases}

Let h_1 = ϕ∘f, h_0 = ϕ∘−f. Then h_1, h_0 satisfy the inequality ηh_1 + (1−η)h_0 ≥ C_ϕ^*(η) for all η, while on supp ℙ′, η′(x)h_1(x) + (1−η′(x))h_0(x) = C_ϕ^*(η′(x)), and therefore

h1𝑑1+h0𝑑0=ηh1+(1η)h0d=Cϕ(η)𝑑.\int h_{1}d{\mathbb{P}}_{1}^{\prime}+\int h_{0}d{\mathbb{P}}_{0}^{\prime}=\int\eta^{\prime}h_{1}+(1-\eta^{\prime})h_{0}d{\mathbb{P}}^{\prime}=\int C_{\phi}^{*}(\eta^{\prime})d{\mathbb{P}}^{\prime}.

However, (h_0, h_1) ∉ E. We will now approximate h_0, h_1 by bounded continuous functions contained in S_ϕ. Let δ > 0 be arbitrary. Pick a constant c > 0 for which ∫c dℙ′ < δ and set h̃_1 = max(h_1, c). The pair (h_0, h̃_1) is feasible for the set S_ϕ, and thus

Cϕ(η)ηh~1(1η)h00C_{\phi}^{*}(\eta)-\eta\tilde{h}_{1}-(1-\eta)h_{0}\leq 0 (58)

Furthermore,

h~1𝑑1+h0𝑑0<R¯ϕ(0,1)+δ.\int\tilde{h}_{1}d{\mathbb{P}}_{1}^{\prime}+\int h_{0}d{\mathbb{P}}_{0}^{\prime}<\bar{R}_{\phi}({\mathbb{P}}_{0}^{\prime},{\mathbb{P}}_{1}^{\prime})+\delta. (59)

Let k be the constant described by Lemma 40 corresponding to c. Now by Corollary 44, there is a Lipschitz function g for which ∫|h̃_1 − g| dℙ′ < min((1−k)/k, 1)δ. Let ĥ_1 = max(g, c). Then Lemma 41 implies that ĥ_1 has the same Lipschitz constant as g, and the fact that h̃_1 ≥ c implies that

|h~1h^1|𝑑|h~1g|𝑑<min(1kk,1)δ\int|\tilde{h}_{1}-\hat{h}_{1}|d{\mathbb{P}}^{\prime}\leq\int|\tilde{h}_{1}-g|d{\mathbb{P}}^{\prime}<\min\left(\frac{1-k}{k},1\right)\delta (60)

Now let $\hat{h}_{0}=\hat{h}_{1}^{C_{\phi}^{*}}$. By Lemma 40, the supremum in the $C_{\phi}^{*}$-transform defining $\hat{h}_{0}$ can be taken over $[0,k]$. Therefore, if $L$ is the Lipschitz constant of $\hat{h}_{1}$, Lemma 41 implies that the Lipschitz constant of $\hat{h}_{0}$ is at most $kL/(1-k)$. Furthermore, $\hat{h}_{0},\hat{h}_{1}$ are bounded on $K^{\epsilon}$ because Lipschitz functions are bounded on compact sets. Thus $(\hat{h}_{0},\hat{h}_{1})\in S_{\phi}\cap E$. Next, we will show that $\int\hat{h}_{0}d{\mathbb{P}}_{0}^{\prime}$ is close to $\int h_{0}d{\mathbb{P}}_{0}^{\prime}$:

\int\hat{h}_{0}-h_{0}\,d{\mathbb{P}}_{0}^{\prime}=\int\sup_{\eta\in[0,k]}\frac{C_{\phi}^{*}(\eta)-\eta\hat{h}_{1}}{1-\eta}-h_{0}\,d{\mathbb{P}}_{0}^{\prime}=\int\sup_{\eta\in[0,k]}\frac{C_{\phi}^{*}(\eta)-\eta\hat{h}_{1}-(1-\eta)h_{0}}{1-\eta}d{\mathbb{P}}_{0}^{\prime}
=\int\sup_{\eta\in[0,k]}\left(\frac{C_{\phi}^{*}(\eta)-\eta\tilde{h}_{1}-(1-\eta)h_{0}}{1-\eta}+\frac{\eta}{1-\eta}(\tilde{h}_{1}-\hat{h}_{1})\right)d{\mathbb{P}}_{0}^{\prime}
\leq\int\sup_{\eta\in[0,k]}\frac{C_{\phi}^{*}(\eta)-\eta\tilde{h}_{1}-(1-\eta)h_{0}}{1-\eta}+\sup_{\eta\in[0,k]}\frac{\eta}{1-\eta}(\tilde{h}_{1}-\hat{h}_{1})\,d{\mathbb{P}}_{0}^{\prime}\leq\int\sup_{\eta\in[0,k]}\frac{\eta}{1-\eta}(\tilde{h}_{1}-\hat{h}_{1})\,d{\mathbb{P}}_{0}^{\prime} (Equation 58)
\leq\frac{k}{1-k}\int|\tilde{h}_{1}-\hat{h}_{1}|\,d{\mathbb{P}}_{0}^{\prime}\leq\delta. (Equation 60)

Therefore, by (59), (60), and the computation above,

\int\hat{h}_{1}d{\mathbb{P}}_{1}^{\prime}+\int\hat{h}_{0}d{\mathbb{P}}_{0}^{\prime}\leq\bar{R}_{\phi}({\mathbb{P}}_{0}^{\prime},{\mathbb{P}}_{1}^{\prime})+3\delta.

As $\delta>0$ is arbitrary, this inequality implies (27). Because $K^{\epsilon}$ is compact, the upper semi-continuity and concavity of $\bar{R}_{\phi}$ then follow from (27) together with the Riesz representation theorem.  

Appendix E Duality for Distributions with Arbitrary Support – Proof of Lemma 14

We begin with the simple observation that weak duality holds for measures supported on $\mathbb{R}^{d}$. The argument essentially amounts to swapping the order of an infimum and a supremum, as presented in Section 4.1.
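Concretely, the swap rests on the elementary max–min inequality, a standard fact we record here for convenience: for any function $F$ of two variables,

\sup_{y}\inf_{x}F(x,y)\leq\inf_{x}\sup_{y}F(x,y),

applied below with $x=(h_{0},h_{1})$ and $y=({\mathbb{P}}_{0}^{\prime},{\mathbb{P}}_{1}^{\prime})$.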

Lemma 45 (Weak Duality)

Let $\phi$ be a non-increasing and lower semi-continuous loss function. Let $S_{\phi}$ be the set of pairs of functions defined in (25) for $K=\mathbb{R}^{d}$.

Then

\inf_{(h_{0},h_{1})\in S_{\phi}}\Theta(h_{0},h_{1})\geq\sup_{\begin{subarray}{c}{\mathbb{P}}_{0}^{\prime}\in{{\mathcal{B}}^{\infty}_{\epsilon}}({\mathbb{P}}_{0})\\ {\mathbb{P}}_{1}^{\prime}\in{{\mathcal{B}}^{\infty}_{\epsilon}}({\mathbb{P}}_{1})\end{subarray}}\bar{R}_{\phi}({\mathbb{P}}_{0}^{\prime},{\mathbb{P}}_{1}^{\prime}).

Proof  By Lemma 3,

\inf_{(h_{0},h_{1})\in S_{\phi}}\int S_{\epsilon}(h_{0})d{\mathbb{P}}_{0}+\int S_{\epsilon}(h_{1})d{\mathbb{P}}_{1}=\inf_{(h_{0},h_{1})\in S_{\phi}}\sup_{\begin{subarray}{c}{\mathbb{P}}_{0}^{\prime}\in{{\mathcal{B}}^{\infty}_{\epsilon}}({\mathbb{P}}_{0})\\ {\mathbb{P}}_{1}^{\prime}\in{{\mathcal{B}}^{\infty}_{\epsilon}}({\mathbb{P}}_{1})\end{subarray}}\int h_{0}d{\mathbb{P}}_{0}^{\prime}+\int h_{1}d{\mathbb{P}}_{1}^{\prime}.

Thus by swapping the $\inf$ and the $\sup$,

\inf_{(h_{0},h_{1})\in S_{\phi}}\int S_{\epsilon}(h_{0})d{\mathbb{P}}_{0}+\int S_{\epsilon}(h_{1})d{\mathbb{P}}_{1}\geq\sup_{\begin{subarray}{c}{\mathbb{P}}_{0}^{\prime}\in{{\mathcal{B}}^{\infty}_{\epsilon}}({\mathbb{P}}_{0})\\ {\mathbb{P}}_{1}^{\prime}\in{{\mathcal{B}}^{\infty}_{\epsilon}}({\mathbb{P}}_{1})\end{subarray}}\inf_{(h_{0},h_{1})\in S_{\phi}}\int h_{0}d{\mathbb{P}}_{0}^{\prime}+\int h_{1}d{\mathbb{P}}_{1}^{\prime}
=\sup_{\begin{subarray}{c}{\mathbb{P}}_{0}^{\prime}\in{{\mathcal{B}}^{\infty}_{\epsilon}}({\mathbb{P}}_{0})\\ {\mathbb{P}}_{1}^{\prime}\in{{\mathcal{B}}^{\infty}_{\epsilon}}({\mathbb{P}}_{1})\end{subarray}}\inf_{(h_{0},h_{1})\in S_{\phi}}\int\frac{d{\mathbb{P}}_{1}^{\prime}}{d({\mathbb{P}}_{0}^{\prime}+{\mathbb{P}}_{1}^{\prime})}h_{1}+\left(1-\frac{d{\mathbb{P}}_{1}^{\prime}}{d({\mathbb{P}}_{0}^{\prime}+{\mathbb{P}}_{1}^{\prime})}\right)h_{0}\,d({\mathbb{P}}_{0}^{\prime}+{\mathbb{P}}_{1}^{\prime})\geq\sup_{\begin{subarray}{c}{\mathbb{P}}_{0}^{\prime}\in{{\mathcal{B}}^{\infty}_{\epsilon}}({\mathbb{P}}_{0})\\ {\mathbb{P}}_{1}^{\prime}\in{{\mathcal{B}}^{\infty}_{\epsilon}}({\mathbb{P}}_{1})\end{subarray}}\bar{R}_{\phi}({\mathbb{P}}_{0}^{\prime},{\mathbb{P}}_{1}^{\prime}),

where the final inequality holds because every $(h_{0},h_{1})\in S_{\phi}$ satisfies $\eta h_{1}+(1-\eta)h_{0}\geq C_{\phi}^{*}(\eta)$ pointwise for all $\eta\in[0,1]$.

 

The main strategy in this section is to approximate measures with unbounded support by measures with bounded support. To this end, we define the restriction of a measure ${\mathbb{P}}$ to a set $K$ by ${\mathbb{P}}|_{K}(A)={\mathbb{P}}(K\cap A)$.

The Portmanteau theorem then allows us to draw some conclusions about weakly convergent sequences of measures.

Theorem 46 (Portmanteau Theorem)

Let ${\mathbb{Q}}^{n},{\mathbb{Q}}\in\mathcal{M}_{+}(\mathbb{R}^{d})$. The following are equivalent:

  1) The sequence ${\mathbb{Q}}^{n}$ converges weakly to ${\mathbb{Q}}$.

  2) For all closed sets $C$, $\limsup_{n\to\infty}{\mathbb{Q}}^{n}(C)\leq{\mathbb{Q}}(C)$ and $\lim_{n\to\infty}{\mathbb{Q}}^{n}(\mathbb{R}^{d})={\mathbb{Q}}(\mathbb{R}^{d})$.

  3) For all open sets $U$, $\liminf_{n\to\infty}{\mathbb{Q}}^{n}(U)\geq{\mathbb{Q}}(U)$ and $\lim_{n\to\infty}{\mathbb{Q}}^{n}(\mathbb{R}^{d})={\mathbb{Q}}(\mathbb{R}^{d})$.

See Theorem 8.2.3 of (Bogachev, 2007). This result allows us to draw conclusions about restrictions of weakly convergent sequences.
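As a quick illustration (an example we add for intuition, not part of the original argument), the inequalities in 2) and 3) can be strict: take ${\mathbb{Q}}^{n}=\delta_{1/n}$, which converges weakly to ${\mathbb{Q}}=\delta_{0}$. For the closed set $C=\{0\}$,

\limsup_{n\to\infty}{\mathbb{Q}}^{n}(\{0\})=0<1={\mathbb{Q}}(\{0\}).

This is precisely why Lemma 47 below requires ${\mathbb{Q}}(\partial K)=0$.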

Lemma 47

Let ${\mathbb{Q}}^{n},{\mathbb{Q}}\in\mathcal{M}_{+}(\mathbb{R}^{d})$ and assume that ${\mathbb{Q}}^{n}$ converges weakly to ${\mathbb{Q}}$. Let $K$ be a compact set with ${\mathbb{Q}}(\partial K)=0$. Then ${\mathbb{Q}}^{n}|_{K}$ converges weakly to ${\mathbb{Q}}|_{K}$.

Proof  We will verify 2) of Theorem 46 for the measures ${\mathbb{Q}}^{n}|_{K}$, ${\mathbb{Q}}|_{K}$.

First, because ${\mathbb{Q}}(K)={\mathbb{Q}}(\operatorname{int}K)$, Theorem 46 implies that

\limsup_{n\to\infty}{\mathbb{Q}}^{n}(K)\leq{\mathbb{Q}}(K)={\mathbb{Q}}(\operatorname{int}K)\leq\liminf_{n\to\infty}{\mathbb{Q}}^{n}(\operatorname{int}K)\leq\liminf_{n\to\infty}{\mathbb{Q}}^{n}(K).

Therefore, $\lim_{n\to\infty}{\mathbb{Q}}^{n}|_{K}(\mathbb{R}^{d})=\lim_{n\to\infty}{\mathbb{Q}}^{n}(K)={\mathbb{Q}}(K)={\mathbb{Q}}|_{K}(\mathbb{R}^{d})$. Next, for any closed set $C$, the set $C\cap K$ is also closed, so the fact that ${\mathbb{Q}}^{n}$ weakly converges to ${\mathbb{Q}}$ implies that

\limsup_{n\to\infty}{\mathbb{Q}}^{n}|_{K}(C)=\limsup_{n\to\infty}{\mathbb{Q}}^{n}(K\cap C)\leq{\mathbb{Q}}(K\cap C)={\mathbb{Q}}|_{K}(C).

 
Next, Prokhorov’s theorem allows us to identify weakly convergent subsequences.

Theorem 48

Let ${\mathbb{Q}}^{n}$ be a sequence of measures for which $\sup_{n}{\mathbb{Q}}^{n}(\mathbb{R}^{d})<\infty$ and for all $\delta>0$, there exists a compact set $K$ for which ${\mathbb{Q}}^{n}(K^{C})<\delta$ for all $n$. Then ${\mathbb{Q}}^{n}$ has a weakly convergent subsequence.

See Theorem 8.6.2 of (Bogachev, 2007). These results imply that $\bar{R}_{\phi}$ is upper semi-continuous on $\mathcal{M}_{+}(\mathbb{R}^{d})\times\mathcal{M}_{+}(\mathbb{R}^{d})$.

Lemma 49

The functional $\bar{R}_{\phi}$ is upper semi-continuous on $\mathcal{M}_{+}(\mathbb{R}^{d})\times\mathcal{M}_{+}(\mathbb{R}^{d})$ with respect to the weak topology (in duality with $C_{0}(\mathbb{R}^{d})$).

Notice that Lemma 12 implies that $\bar{R}_{\phi}$ is upper semi-continuous on the space $\mathcal{M}_{+}(K^{\epsilon})\times\mathcal{M}_{+}(K^{\epsilon})$ for a compact set $K$. However, on $\mathbb{R}^{d}$, weak convergence of measures is defined with respect to the dual of $C_{0}(\mathbb{R}^{d})$, the set of continuous functions vanishing at $\infty$. This set is strictly smaller than $C_{b}(\mathbb{R}^{d})$, and thus the relation (27) does not immediately imply the upper semi-continuity of $\bar{R}_{\phi}$.
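To see the gap between the two topologies, consider the illustrative sequence ${\mathbb{Q}}^{n}=\delta_{n}$ (our own example): $\int f\,d{\mathbb{Q}}^{n}\to 0$ for every $f\in C_{0}(\mathbb{R}^{d})$, yet $\int 1\,d{\mathbb{Q}}^{n}=1$ for all $n$. Testing against $C_{b}(\mathbb{R}^{d})$ detects this escaping mass, while testing against $C_{0}(\mathbb{R}^{d})$ does not; the proof below must rule out exactly this phenomenon.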

Proof  Let ${\mathbb{Q}}_{0}^{n},{\mathbb{Q}}_{1}^{n}$ be sequences of measures converging weakly to ${\mathbb{Q}}_{0},{\mathbb{Q}}_{1}$ respectively. Set ${\mathbb{Q}}={\mathbb{Q}}_{0}+{\mathbb{Q}}_{1}$.

Define the function $F(R)={\mathbb{Q}}(\overline{B_{R}({\mathbf{0}})}^{C})$. Because $F$ is non-increasing, it has at most countably many points of discontinuity.

Let $\delta>0$ be arbitrary and choose $R$ large enough so that $F(R)<\delta/C_{\phi}^{*}(1/2)$ and $F$ is continuous at $R$. Then ${\mathbb{Q}}(\partial B_{R}({\mathbf{0}}))=0$, and thus one can apply Lemma 47 with the set $\overline{B_{R}({\mathbf{0}})}$.

Now let $\nu_{0},\nu_{1}$ be arbitrary finite measures and define $\nu_{i}^{R}=\nu_{i}|_{\overline{B_{R}({\mathbf{0}})}}$. Set $\nu=\nu_{0}+\nu_{1}$, $\eta=d\nu_{1}/d\nu$, $\nu^{R}=\nu_{0}^{R}+\nu_{1}^{R}$, and $\eta^{R}=d\nu_{1}^{R}/d\nu^{R}$. Then $\eta^{R}=\eta$ $\nu$-a.e. on $\overline{B_{R}({\mathbf{0}})}$. Thus

|\bar{R}_{\phi}(\nu_{0}^{R},\nu_{1}^{R})-\bar{R}_{\phi}(\nu_{0},\nu_{1})|=\left|\int C_{\phi}^{*}(\eta){\mathbf{1}}_{\overline{B_{R}({\mathbf{0}})}}d\nu-\int C_{\phi}^{*}(\eta)d\nu\right|\leq C_{\phi}^{*}\left(\frac{1}{2}\right)\nu(\overline{B_{R}({\mathbf{0}})}^{C}), (61)

where the last inequality uses $C_{\phi}^{*}(\eta)\leq C_{\phi}^{*}(1/2)$.

If we define ${\mathbb{Q}}_{i,R}={\mathbb{Q}}_{i}|_{\overline{B_{R}({\mathbf{0}})}}$ and ${\mathbb{Q}}_{i,R}^{n}={\mathbb{Q}}_{i}^{n}|_{\overline{B_{R}({\mathbf{0}})}}$, Lemma 47 implies that ${\mathbb{Q}}_{i,R}^{n}$ converges weakly to ${\mathbb{Q}}_{i,R}$ and $\lim_{n\to\infty}{\mathbb{Q}}^{n}(\overline{B_{R}({\mathbf{0}})}^{C})={\mathbb{Q}}(\overline{B_{R}({\mathbf{0}})}^{C})<\delta/C_{\phi}^{*}(1/2)$. Therefore, for sufficiently large $n$, ${\mathbb{Q}}^{n}(\overline{B_{R}({\mathbf{0}})}^{C})<2\delta/C_{\phi}^{*}(1/2)$. By Lemma 12 and (61),

\limsup_{n\to\infty}\bar{R}_{\phi}({\mathbb{Q}}_{0}^{n},{\mathbb{Q}}_{1}^{n})\leq\limsup_{n\to\infty}\bar{R}_{\phi}({\mathbb{Q}}_{0,R}^{n},{\mathbb{Q}}_{1,R}^{n})+2\delta\leq\bar{R}_{\phi}({\mathbb{Q}}_{0,R},{\mathbb{Q}}_{1,R})+2\delta\leq\bar{R}_{\phi}({\mathbb{Q}}_{0},{\mathbb{Q}}_{1})+3\delta.

Because $\delta$ was arbitrary, the result follows.

 

Next we consider an approximation of ${\mathbb{P}}_{0}$, ${\mathbb{P}}_{1}$ by compactly supported measures.

Lemma 50

Let ${\mathbb{P}}_{0},{\mathbb{P}}_{1}$ be finite measures. Define ${\mathbb{P}}_{i}^{n}={\mathbb{P}}_{i}|_{\overline{B_{n}({\mathbf{0}})}}$ for $n\in\mathbb{N}$. Then ${\mathbb{P}}_{0}^{n},{\mathbb{P}}_{1}^{n}$ converge weakly to ${\mathbb{P}}_{0}$, ${\mathbb{P}}_{1}$ respectively. Furthermore, there are measures ${\mathbb{P}}_{0}^{*}\in{{\mathcal{B}}^{\infty}_{\epsilon}}({\mathbb{P}}_{0}),{\mathbb{P}}_{1}^{*}\in{{\mathcal{B}}^{\infty}_{\epsilon}}({\mathbb{P}}_{1})$ for which

\limsup_{n\to\infty}\sup_{\begin{subarray}{c}{\mathbb{P}}_{0}^{\prime}\in{{\mathcal{B}}^{\infty}_{\epsilon}}({\mathbb{P}}_{0}^{n})\\ {\mathbb{P}}_{1}^{\prime}\in{{\mathcal{B}}^{\infty}_{\epsilon}}({\mathbb{P}}_{1}^{n})\end{subarray}}\bar{R}_{\phi}({\mathbb{P}}_{0}^{\prime},{\mathbb{P}}_{1}^{\prime})\leq\bar{R}_{\phi}({\mathbb{P}}_{0}^{*},{\mathbb{P}}_{1}^{*}). (62)

Proof  Set ${\mathbb{P}}={\mathbb{P}}_{0}+{\mathbb{P}}_{1}$ and ${\mathbb{P}}^{n}={\mathbb{P}}_{0}^{n}+{\mathbb{P}}_{1}^{n}$. Notice that 2) of Theorem 46 implies that ${\mathbb{P}}_{i}^{n}$ converges weakly to ${\mathbb{P}}_{i}$. Let ${\mathbb{P}}_{0}^{n,*},{\mathbb{P}}_{1}^{n,*}$ be maximizers of $\bar{R}_{\phi}$ over ${{\mathcal{B}}^{\infty}_{\epsilon}}({\mathbb{P}}_{0}^{n})\times{{\mathcal{B}}^{\infty}_{\epsilon}}({\mathbb{P}}_{1}^{n})$. Next, by Strassen's theorem (Theorem 38), ${\mathbb{P}}_{i}^{n}(\overline{B_{r}({\mathbf{0}})})\leq{\mathbb{P}}_{i}^{n,*}(\overline{B_{r+\epsilon}({\mathbf{0}})})$, and thus ${\mathbb{P}}_{i}(\overline{B_{r}({\mathbf{0}})}^{C})\geq{\mathbb{P}}_{i}^{n}(\overline{B_{r}({\mathbf{0}})}^{C})\geq{\mathbb{P}}_{i}^{n,*}(\overline{B_{r+\epsilon}({\mathbf{0}})}^{C})$. Therefore, one can apply Prokhorov's theorem (Theorem 48) to conclude that ${\mathbb{P}}_{0}^{n,*},{\mathbb{P}}_{1}^{n,*}$ have subsequences ${\mathbb{P}}_{0}^{n_{k},*},{\mathbb{P}}_{1}^{n_{k},*}$ that converge weakly to measures ${\mathbb{P}}_{0}^{*},{\mathbb{P}}_{1}^{*}$ respectively. The upper semi-continuity of $\bar{R}_{\phi}$ (Lemma 49) then implies that ${\mathbb{P}}_{0}^{*},{\mathbb{P}}_{1}^{*}$ satisfy (62).

It remains to show that ${\mathbb{P}}_{i}^{*}\in{{\mathcal{B}}^{\infty}_{\epsilon}}({\mathbb{P}}_{i})$. We will apply Lemma 4. Because ${\mathbb{P}}_{i}^{n_{k},*}\in{{\mathcal{B}}^{\infty}_{\epsilon}}({\mathbb{P}}_{i}^{n_{k}})$ for all $n_{k}$, Lemma 4 implies that for every $f\in C_{b}(\mathbb{R}^{d})$, $\int S_{\epsilon}(f)d{\mathbb{P}}_{i}^{n_{k}}\geq\int fd{\mathbb{P}}_{i}^{n_{k},*}$. Because ${\mathbb{P}}_{i}^{n_{k}}$ converges weakly to ${\mathbb{P}}_{i}$ and ${\mathbb{P}}_{i}^{n_{k},*}$ converges weakly to ${\mathbb{P}}_{i}^{*}$, one can take the limit $k\to\infty$ to conclude $\int S_{\epsilon}(f)d{\mathbb{P}}_{i}\geq\int fd{\mathbb{P}}_{i}^{*}$ for all $f\in C_{b}(\mathbb{R}^{d})$. Lemma 4 then implies ${\mathbb{P}}_{i}^{*}\in{{\mathcal{B}}^{\infty}_{\epsilon}}({\mathbb{P}}_{i})$.

 

See 14

Proof  Let ${\mathbb{P}}_{0}^{n}$, ${\mathbb{P}}_{1}^{n}$, ${\mathbb{P}}_{0}^{*}$, ${\mathbb{P}}_{1}^{*}$ be the measures described in Lemma 50. Notice that because ${\mathbb{P}}_{0}^{n}$, ${\mathbb{P}}_{1}^{n}$ are compactly supported, Lemma 13 applies. Define

\Theta^{n}(h_{0},h_{1})=\int S_{\epsilon}(h_{1})d{\mathbb{P}}_{1}^{n}+\int S_{\epsilon}(h_{0})d{\mathbb{P}}_{0}^{n}.

Then Lemmas 13 and 50 imply that

\limsup_{n\to\infty}\inf_{(h_{0},h_{1})\in S_{\phi}}\Theta^{n}(h_{0},h_{1})=\limsup_{n\to\infty}\sup_{\begin{subarray}{c}{\mathbb{P}}_{0}^{\prime}\in{{\mathcal{B}}^{\infty}_{\epsilon}}({\mathbb{P}}_{0}^{n})\\ {\mathbb{P}}_{1}^{\prime}\in{{\mathcal{B}}^{\infty}_{\epsilon}}({\mathbb{P}}_{1}^{n})\end{subarray}}\bar{R}_{\phi}({\mathbb{P}}_{0}^{\prime},{\mathbb{P}}_{1}^{\prime})\leq\bar{R}_{\phi}({\mathbb{P}}_{0}^{*},{\mathbb{P}}_{1}^{*})\leq\sup_{\begin{subarray}{c}{\mathbb{P}}_{0}^{\prime}\in{{\mathcal{B}}^{\infty}_{\epsilon}}({\mathbb{P}}_{0})\\ {\mathbb{P}}_{1}^{\prime}\in{{\mathcal{B}}^{\infty}_{\epsilon}}({\mathbb{P}}_{1})\end{subarray}}\bar{R}_{\phi}({\mathbb{P}}_{0}^{\prime},{\mathbb{P}}_{1}^{\prime}). (63)

We will show that

\inf_{(h_{0},h_{1})\in S_{\phi}}\Theta(h_{0},h_{1})\leq\limsup_{n\to\infty}\inf_{(h_{0},h_{1})\in S_{\phi}}\Theta^{n}(h_{0},h_{1}). (64)

Equations 63 and 64 imply that

\inf_{(h_{0},h_{1})\in S_{\phi}}\Theta(h_{0},h_{1})\leq\bar{R}_{\phi}({\mathbb{P}}_{0}^{*},{\mathbb{P}}_{1}^{*})\leq\sup_{\begin{subarray}{c}{\mathbb{P}}_{0}^{\prime}\in{{\mathcal{B}}^{\infty}_{\epsilon}}({\mathbb{P}}_{0})\\ {\mathbb{P}}_{1}^{\prime}\in{{\mathcal{B}}^{\infty}_{\epsilon}}({\mathbb{P}}_{1})\end{subarray}}\bar{R}_{\phi}({\mathbb{P}}_{0}^{\prime},{\mathbb{P}}_{1}^{\prime}). (65)

This relation together with weak duality (Lemma 45) implies that the inequalities in (65) are in fact equalities. Therefore strong duality holds and $({\mathbb{P}}_{0}^{*},{\mathbb{P}}_{1}^{*})$ maximizes the dual problem.

It remains to prove the inequality (64). Let $\delta>0$ be arbitrary and choose an $n\in\mathbb{N}$ for which $n>2\epsilon$ and

{\mathbb{P}}_{1}(\overline{B_{n-2\epsilon}({\mathbf{0}})}^{C})+{\mathbb{P}}_{0}(\overline{B_{n-2\epsilon}({\mathbf{0}})}^{C})\leq\delta. (66)

Let $(h_{0}^{n},h_{1}^{n})\in S_{\phi}$ be functions for which

\Theta^{n}(h_{0}^{n},h_{1}^{n})\leq\inf_{(h_{0},h_{1})\in S_{\phi}}\Theta^{n}(h_{0},h_{1})+\delta. (67)

Define

\tilde{h}_{0}^{n}=\begin{cases}h_{0}^{n}&{\mathbf{x}}\in\overline{B_{n-\epsilon}({\mathbf{0}})}\\ C_{\phi}^{*}\left(\frac{1}{2}\right)&{\mathbf{x}}\not\in\overline{B_{n-\epsilon}({\mathbf{0}})}\end{cases}\quad\tilde{h}_{1}^{n}=\begin{cases}h_{1}^{n}&{\mathbf{x}}\in\overline{B_{n-\epsilon}({\mathbf{0}})}\\ C_{\phi}^{*}\left(\frac{1}{2}\right)&{\mathbf{x}}\not\in\overline{B_{n-\epsilon}({\mathbf{0}})}.\end{cases}

Because $\eta h_{1}^{n}+(1-\eta)h_{0}^{n}\geq C_{\phi}^{*}(\eta)$ for all $\eta\in[0,1]$ on $\overline{B_{n-\epsilon}({\mathbf{0}})}$, and $(C_{\phi}^{*}(1/2),C_{\phi}^{*}(1/2))\in S_{\phi}$ (since $C_{\phi}^{*}$ attains its maximum at $1/2$), one can conclude that $(\tilde{h}_{0}^{n},\tilde{h}_{1}^{n})\in S_{\phi}$.

Now because $n>2\epsilon$, the regions $\overline{B_{n-\epsilon}({\mathbf{0}})}$ and $\overline{B_{n-2\epsilon}({\mathbf{0}})}$ are non-empty. One can bound $S_{\epsilon}(\tilde{h}_{i}^{n})$ in terms of $S_{\epsilon}(h_{i}^{n})$ and $C_{\phi}^{*}(1/2)$:

S_{\epsilon}(\tilde{h}_{i}^{n})({\mathbf{x}})=S_{\epsilon}(h_{i}^{n})({\mathbf{x}})\quad\text{for }{\mathbf{x}}\in\overline{B_{n-2\epsilon}({\mathbf{0}})}
S_{\epsilon}(\tilde{h}_{i}^{n})({\mathbf{x}})\leq\max(S_{\epsilon}(h_{i}^{n})({\mathbf{x}}),C_{\phi}^{*}(1/2))\leq S_{\epsilon}(h_{i}^{n})({\mathbf{x}})+C_{\phi}^{*}(1/2)\quad\text{for }{\mathbf{x}}\in\overline{B_{n}({\mathbf{0}})}
S_{\epsilon}(\tilde{h}_{i}^{n})({\mathbf{x}})=C_{\phi}^{*}(1/2)\quad\text{for }{\mathbf{x}}\in\overline{B_{n}({\mathbf{0}})}^{C}

Now for each $i$, these bounds imply that

\int S_{\epsilon}(\tilde{h}_{i}^{n})d{\mathbb{P}}_{i}\leq\int_{\overline{B_{n-2\epsilon}({\mathbf{0}})}}S_{\epsilon}(h_{i}^{n})d{\mathbb{P}}_{i}+\int_{\overline{B_{n}({\mathbf{0}})}\setminus\overline{B_{n-2\epsilon}({\mathbf{0}})}}S_{\epsilon}(h_{i}^{n})+C_{\phi}^{*}\left(\frac{1}{2}\right)d{\mathbb{P}}_{i}+\int_{\overline{B_{n}({\mathbf{0}})}^{C}}C_{\phi}^{*}\left(\frac{1}{2}\right)d{\mathbb{P}}_{i}
=\int_{\overline{B_{n}({\mathbf{0}})}}S_{\epsilon}(h_{i}^{n})d{\mathbb{P}}_{i}+\int_{\overline{B_{n-2\epsilon}({\mathbf{0}})}^{C}}C_{\phi}^{*}\left(\frac{1}{2}\right)d{\mathbb{P}}_{i}.

Then, applying this bound for each $i$,

\Theta(\tilde{h}_{0}^{n},\tilde{h}_{1}^{n})=\int S_{\epsilon}(\tilde{h}_{1}^{n})d{\mathbb{P}}_{1}+\int S_{\epsilon}(\tilde{h}_{0}^{n})d{\mathbb{P}}_{0}
\leq\left(\int_{\overline{B_{n}({\mathbf{0}})}}S_{\epsilon}(h_{1}^{n})d{\mathbb{P}}_{1}+\int_{\overline{B_{n}({\mathbf{0}})}}S_{\epsilon}(h_{0}^{n})d{\mathbb{P}}_{0}\right)+\left(\int_{\overline{B_{n-2\epsilon}({\mathbf{0}})}^{C}}C_{\phi}^{*}\left(\frac{1}{2}\right)d{\mathbb{P}}_{1}+\int_{\overline{B_{n-2\epsilon}({\mathbf{0}})}^{C}}C_{\phi}^{*}\left(\frac{1}{2}\right)d{\mathbb{P}}_{0}\right)
=\Theta^{n}(h_{0}^{n},h_{1}^{n})+C_{\phi}^{*}\left(\frac{1}{2}\right)\left({\mathbb{P}}_{0}(\overline{B_{n-2\epsilon}({\mathbf{0}})}^{C})+{\mathbb{P}}_{1}(\overline{B_{n-2\epsilon}({\mathbf{0}})}^{C})\right)\leq\left(\inf_{(h_{0},h_{1})\in S_{\phi}}\Theta^{n}(h_{0},h_{1})+\delta\right)+\delta C_{\phi}^{*}\left(\frac{1}{2}\right).

The last inequality follows from Equations 66 and 67. Because $\delta$ is arbitrary, (64) holds.

 

Appendix F Complementary Slackness

See 15

Notice that the forward direction of this lemma is in fact a consequence of the approximate complementary slackness result in Lemma 16, but we provide a separate self-contained proof below.

Proof

First assume that $({\mathbb{P}}_{0}^{*},{\mathbb{P}}_{1}^{*})$ maximizes $\bar{R}_{\phi}$ over ${{\mathcal{B}}^{\infty}_{\epsilon}}({\mathbb{P}}_{0})\times{{\mathcal{B}}^{\infty}_{\epsilon}}({\mathbb{P}}_{1})$ and $(h_{0}^{*},h_{1}^{*})$ minimizes $\Theta$ over $S_{\phi}$. Write ${\mathbb{P}}^{*}={\mathbb{P}}_{0}^{*}+{\mathbb{P}}_{1}^{*}$ and $\eta^{*}=d{\mathbb{P}}_{1}^{*}/d{\mathbb{P}}^{*}$. Because ${\mathbb{P}}^{*}_{i}\in{{\mathcal{B}}^{\infty}_{\epsilon}}({\mathbb{P}}_{i})$ and $(h_{0}^{*},h_{1}^{*})\in S_{\phi}$, Lemma 3 implies

\Theta(h_{0}^{*},h_{1}^{*})=\int S_{\epsilon}(h_{1}^{*})d{\mathbb{P}}_{1}+\int S_{\epsilon}(h_{0}^{*})d{\mathbb{P}}_{0}\geq\int h_{1}^{*}d{\mathbb{P}}_{1}^{*}+\int h_{0}^{*}d{\mathbb{P}}_{0}^{*} (68)
=\int\eta^{*}h_{1}^{*}+\left(1-\eta^{*}\right)h_{0}^{*}d{\mathbb{P}}^{*}\geq\int C_{\phi}^{*}(\eta^{*})d{\mathbb{P}}^{*}=\bar{R}_{\phi}({\mathbb{P}}_{0}^{*},{\mathbb{P}}_{1}^{*}). (69)

By Lemma 14, the first expression in (68) is equal to the last expression in (69). Thus all of the inequalities above must be equalities, which implies (31). Next, the resulting equality

\int S_{\epsilon}(h_{1}^{*})d{\mathbb{P}}_{1}+\int S_{\epsilon}(h_{0}^{*})d{\mathbb{P}}_{0}=\int h_{1}^{*}d{\mathbb{P}}_{1}^{*}+\int h_{0}^{*}d{\mathbb{P}}_{0}^{*}

together with the inequalities $\int S_{\epsilon}(h_{0}^{*})d{\mathbb{P}}_{0}\geq\int h_{0}^{*}d{\mathbb{P}}_{0}^{*}$ and $\int S_{\epsilon}(h_{1}^{*})d{\mathbb{P}}_{1}\geq\int h_{1}^{*}d{\mathbb{P}}_{1}^{*}$ from Lemma 3 allows us to conclude (30).

We will now show the opposite implication. Assume that $h_{0}^{*},h_{1}^{*},{\mathbb{P}}_{0}^{*},{\mathbb{P}}_{1}^{*}$ satisfy (30) and (31). Then

\Theta(h_{0}^{*},h_{1}^{*})=\int S_{\epsilon}(h_{1}^{*})d{\mathbb{P}}_{1}+\int S_{\epsilon}(h_{0}^{*})d{\mathbb{P}}_{0}
=\int h_{1}^{*}d{\mathbb{P}}_{1}^{*}+\int h_{0}^{*}d{\mathbb{P}}_{0}^{*} (Equation 30)
=\int\eta^{*}h_{1}^{*}+(1-\eta^{*})h_{0}^{*}d{\mathbb{P}}^{*}=\int C_{\phi}^{*}(\eta^{*})d{\mathbb{P}}^{*} (Equation 31)
=\bar{R}_{\phi}({\mathbb{P}}_{0}^{*},{\mathbb{P}}_{1}^{*}).

However, Lemma 14 implies that $\Theta(h_{0},h_{1})\geq\bar{R}_{\phi}({\mathbb{P}}_{0}^{\prime},{\mathbb{P}}_{1}^{\prime})$ for any $(h_{0},h_{1})\in S_{\phi}$ and any ${\mathbb{P}}_{0}^{\prime}\in{{\mathcal{B}}^{\infty}_{\epsilon}}({\mathbb{P}}_{0})$, ${\mathbb{P}}_{1}^{\prime}\in{{\mathcal{B}}^{\infty}_{\epsilon}}({\mathbb{P}}_{1})$. Therefore, $(h_{0}^{*},h_{1}^{*})$ must be optimal for $\Theta$ and $({\mathbb{P}}_{0}^{*},{\mathbb{P}}_{1}^{*})$ must be optimal for $\bar{R}_{\phi}$.

 
Notably, a similar strategy shows that if $(h_{0}^{n},h_{1}^{n})\in S_{\phi}$ is a sequence that satisfies 1) and 2) of Lemma 16, then $(h_{0}^{n},h_{1}^{n})$ must be a minimizing sequence for $\Theta$.

Appendix G Technical Lemmas from Section 6

G.1 Proof of Lemma 17

See 17

Proof  First, one can verify that $\alpha=-\infty$ minimizes $C_{\psi}(0,\alpha)$ and $\alpha=\infty$ minimizes $C_{\psi}(1,\alpha)$, and that $C_{\psi}^{*}(0)=C_{\psi}^{*}(1)=0$. To find minimizers of $C_{\psi}(\eta,\alpha)$ for $\eta\in(0,1)$, we solve $\partial_{\alpha}C_{\psi}(\eta,\alpha)=-\eta e^{-\alpha}+(1-\eta)e^{\alpha}=0$, resulting in $\alpha_{\psi}(\eta)=\frac{1}{2}\log\left(\frac{\eta}{1-\eta}\right)$. This formula allows for the computation of $C_{\psi}^{*}(\eta)$ via $C_{\psi}^{*}(\eta)=C_{\psi}(\eta,\alpha_{\psi}(\eta))$.
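For concreteness, we record the resulting evaluation (a routine computation added here; the derivative above identifies $\psi$ as the exponential loss $\psi(\alpha)=e^{-\alpha}$): substituting $e^{\alpha_{\psi}(\eta)}=\sqrt{\eta/(1-\eta)}$ gives

C_{\psi}^{*}(\eta)=\eta e^{-\alpha_{\psi}(\eta)}+(1-\eta)e^{\alpha_{\psi}(\eta)}=\eta\sqrt{\frac{1-\eta}{\eta}}+(1-\eta)\sqrt{\frac{\eta}{1-\eta}}=2\sqrt{\eta(1-\eta)}.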

Next, by definition,

\eta\psi(\alpha_{\psi}(\eta))+(1-\eta)\psi(-\alpha_{\psi}(\eta))=C_{\psi}^{*}(\eta)\quad\text{ and }\quad s\psi(\alpha_{\psi}(\eta))+(1-s)\psi(-\alpha_{\psi}(\eta))\geq C_{\psi}^{*}(s)

for all $s\in[0,1]$. Therefore, $\psi(\alpha_{\psi}(\eta))-\psi(-\alpha_{\psi}(\eta))$ is a supergradient of $C_{\psi}^{*}$ at $\eta$.

The function $C_{\psi}^{*}$ is differentiable on $(0,1)$, and thus the superdifferential is a singleton on this set. To show that $\partial C_{\psi}^{*}(0)$ and $\partial C_{\psi}^{*}(1)$ are singletons, it suffices to observe that

\lim_{\eta\to 0}\frac{d}{d\eta}C_{\psi}^{*}(\eta)=+\infty,\qquad\lim_{\eta\to 1}\frac{d}{d\eta}C_{\psi}^{*}(\eta)=-\infty.

 

G.2 Proof of Lemma 18

See 18

Proof  Recall that on the extended real number line, every subsequence has a convergent subsequence. We will show that $\lim_{n\to\infty}a_{n}=\psi(\alpha_{\psi}(\eta_{0}))$ and $\lim_{n\to\infty}b_{n}=\psi(-\alpha_{\psi}(\eta_{0}))$ by proving that every convergent subsequence of $\{a_{n}\}$ converges to $\psi(\alpha_{\psi}(\eta_{0}))$ and every convergent subsequence of $\{b_{n}\}$ converges to $\psi(-\alpha_{\psi}(\eta_{0}))$.

Let $a_{n_{k}}$, $b_{n_{k}}$ be convergent subsequences of $\{a_{n}\}$, $\{b_{n}\}$ respectively. (Again, this convergence is in $\overline{\mathbb{R}}$.) Set $a=\lim_{k\to\infty}a_{n_{k}}$ and $b=\lim_{k\to\infty}b_{n_{k}}$.

Then (39) and (40) imply that

\eta a+(1-\eta)b\geq C_{\psi}^{*}(\eta)\text{ for all }\eta\in[0,1]
\eta_{0}a+(1-\eta_{0})b=C_{\psi}^{*}(\eta_{0}). (70)

These equations imply that $a-b\in\partial C_{\psi}^{*}(\eta_{0})$ and thus

a-b=\psi(\alpha_{\psi}(\eta_{0}))-\psi(-\alpha_{\psi}(\eta_{0})), (71)

while (70) is equivalent to

\eta_{0}a+(1-\eta_{0})b=\eta_{0}\psi(\alpha_{\psi}(\eta_{0}))+(1-\eta_{0})\psi(-\alpha_{\psi}(\eta_{0})). (72)

Equations (71) and (72) comprise a system of two equations in the two variables $a$ and $b$ with a unique solution.  
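For completeness, we spell out the solution (a routine check we add here): when $\eta_{0}\in(0,1)$ all quantities are finite, and multiplying (71) by $1-\eta_{0}$ and adding the result to (72) yields

a=\psi(\alpha_{\psi}(\eta_{0})),\qquad b=\psi(-\alpha_{\psi}(\eta_{0})),

while the endpoint cases $\eta_{0}\in\{0,1\}$ can be checked directly. Hence every convergent subsequence has the claimed limit.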

G.3 Proof of Lemma 20

Lastly, we prove Lemma 20. See 20

Proof  We start by showing (43).

\liminf_{n\to\infty}S_{\epsilon}(h_{n})({\mathbf{x}})=\liminf_{n\to\infty}\sup_{\|{\mathbf{h}}\|\leq\epsilon}h_{n}({\mathbf{x}}+{\mathbf{h}})=\sup_{N}\inf_{n\geq N}\sup_{\|{\mathbf{h}}\|\leq\epsilon}h_{n}({\mathbf{x}}+{\mathbf{h}})
\geq\sup_{\|{\mathbf{h}}\|\leq\epsilon}\sup_{N}\inf_{n\geq N}h_{n}({\mathbf{x}}+{\mathbf{h}})=\sup_{\|{\mathbf{h}}\|\leq\epsilon}\liminf_{n\to\infty}h_{n}({\mathbf{x}}+{\mathbf{h}})=S_{\epsilon}(\liminf_{n\to\infty}h_{n})({\mathbf{x}}).

Equation 44 can then be proved by the same argument:

\limsup_{n\to\infty}S_{\epsilon}(h_{n})({\mathbf{x}})=\limsup_{n\to\infty}\sup_{\|{\mathbf{h}}\|\leq\epsilon}h_{n}({\mathbf{x}}+{\mathbf{h}})=\inf_{N}\sup_{n\geq N}\sup_{\|{\mathbf{h}}\|\leq\epsilon}h_{n}({\mathbf{x}}+{\mathbf{h}})
\geq\sup_{\|{\mathbf{h}}\|\leq\epsilon}\inf_{N}\sup_{n\geq N}h_{n}({\mathbf{x}}+{\mathbf{h}})=\sup_{\|{\mathbf{h}}\|\leq\epsilon}\limsup_{n\to\infty}h_{n}({\mathbf{x}}+{\mathbf{h}})=S_{\epsilon}(\limsup_{n\to\infty}h_{n})({\mathbf{x}}).

 
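As a closing remark (our own illustrative example, not from the original argument), the inequality in (43) can be strict: on $\mathbb{R}$, take $h_{n}(x)=(-1)^{n}x$. Then $S_{\epsilon}(h_{n})(0)=\epsilon$ for every $n$, so $\liminf_{n\to\infty}S_{\epsilon}(h_{n})(0)=\epsilon$, while $\liminf_{n\to\infty}h_{n}(x)=-|x|$ and hence

S_{\epsilon}(\liminf_{n\to\infty}h_{n})(0)=\sup_{|h|\leq\epsilon}-|h|=0<\epsilon.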

References

  • Awasthi et al. (2021a) P. Awasthi, N. Frank, A. Mao, M. Mohri, and Y. Zhong. Calibration and consistency of adversarial surrogate losses. NeurIPS, 2021a.
  • Awasthi et al. (2021b) P. Awasthi, N. S. Frank, and M. Mohri. On the existence of the adversarial Bayes classifier (extended version). arXiv, 2021b.
  • Bao et al. (2021) H. Bao, C. Scott, and M. Sugiyama. Calibrated surrogate losses for adversarially robust classification. arXiv, 2021.
  • Barbu and Precupanu (2012) V. Barbu and T. Precupanu. Convexity and Optimization in Banach Spaces. Springer Monographs in Mathematics, 4th edition, 2012.
  • Bartlett et al. (2006) P. L. Bartlett, M. I. Jordan, and J. D. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473), 2006.
  • Ben-David et al. (2003) S. Ben-David, N. Eiron, and P. M. Long. On the difficulty of approximately maximizing agreements. Journal of Computer System Sciences, 2003.
  • Bertsekas and Shreve (1996) D. P. Bertsekas and S. E. Shreve. Stochastic Optimal Control: The Discrete-Time Case. Athena Scientific, 1996.
  • Bhagoji et al. (2019) A. N. Bhagoji, D. Cullina, and P. Mittal. Lower bounds on adversarial robustness from optimal transport. In Advances in Neural Information Processing Systems, pages 7498–7510, 2019.
  • Biggio et al. (2013) B. Biggio, I. Corona, D. Maiorca, B. Nelson, N. Šrndić, P. Laskov, G. Giacinto, and F. Roli. Evasion attacks against machine learning at test time. In Joint European conference on machine learning and knowledge discovery in databases, pages 387–402. Springer, 2013.
  • Bogachev (2007) V. I. Bogachev. Measure Theory, volume II. Springer, 2007.
  • Bungert et al. (2021) L. Bungert, N. G. Trillos, and R. Murray. The geometry of adversarial training in binary classification. arXiv, 2021.
  • Demontis et al. (2018) A. Demontis, M. Melis, M. Pintor, M. Jagielski, B. Biggio, A. Oprea, C. Nita-Rotaru, and F. Roli. Why do adversarial attacks transfer? explaining transferability of evasion and poisoning attacks. CoRR, 2018.
  • Domingo-Enrich et al. (2021) C. Domingo-Enrich, S. Jelassi, A. Mensch, G. Rotskoff, and J. Bruna. A mean-field analysis of two-player zero-sum games, 2021.
  • Folland (1999) G. B. Folland. Real analysis: modern techniques and their applications, volume 40. John Wiley & Sons, 1999.
  • Frank and Niles-Weed (2023) N. S. Frank and J. Niles-Weed. The adversarial consistency of surrogate risks for binary classification. arXiv, 2023.
  • Gao et al. (2022) R. Gao, X. Chen, and A. J. Kleywegt. Wasserstein distributionally robust optimization and variation regularization, 2022.
  • Goodfellow et al. (2014) I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. ICLR, 2014.
  • Gu et al. (2017) T. Gu, B. Dolan-Gavitt, and S. Garg. Badnets: Identifying vulnerabilities in the machine learning model supply chain. CoRR, 2017.
  • Jylhä (2014) H. Jylhä. The $L^{\infty}$ optimal transport: Infinite cyclical monotonicity and the existence of optimal transport maps. Calculus of Variations and Partial Differential Equations, 2014.
  • Kannan et al. (2018) H. Kannan, A. Kurakin, and I. J. Goodfellow. Adversarial logit pairing. CoRR, 2018.
  • Kurakin et al. (2017) A. Kurakin, I. Goodfellow, and S. Bengio. Adversarial machine learning at scale. ICLR, 2017.
  • Li and Telgarsky (2023) J. D. Li and M. Telgarsky. On achieving optimal adversarial test error, 2023.
  • Lin (2004) Y. Lin. A note on margin-based loss functions in classification. Statistics & Probability Letters, 68(1):73–82, 2004.
  • Madry et al. (2019) A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adversarial attacks. ICLR, 2019.
  • Meunier et al. (2022) L. Meunier, R. Ettedgui, R. Pinot, Y. Chevaleyre, and J. Atif. Towards consistency in adversarial classification. arXiv, 2022.
  • Zhang and Agarwal (2020) M. Zhang and S. Agarwal. Consistency vs. H-consistency: The interplay between surrogate loss functions and the scoring function class. NeurIPS, 2020.
  • Mokhtari et al. (2019) A. Mokhtari, A. Ozdaglar, and S. Pattathil. A unified analysis of extra-gradient and optimistic gradient methods for saddle point problems: Proximal point approach, 2019.
  • Papernot et al. (2016) N. Papernot, P. D. McDaniel, I. J. Goodfellow, S. Jha, Z. B. Celik, and A. Swami. Practical black-box attacks against deep learning systems using adversarial examples. CoRR, abs/1602.02697, 2016. URL http://arxiv.org/abs/1602.02697.
  • Long and Servedio (2013) P. M. Long and R. A. Servedio. Consistency versus realizable H-consistency for multiclass classification. ICML, 2013.
  • Pydi and Jog (2019) M. S. Pydi and V. Jog. Adversarial risk via optimal transport and optimal couplings. arXiv preprint arXiv:1912.02794, 2019.
  • Pydi and Jog (2021) M. S. Pydi and V. Jog. The many faces of adversarial risk. Neural Information Processing Systems, 2021.
  • Rozsa et al. (2016) A. Rozsa, M. Günther, and T. E. Boult. Are accuracy and robustness correlated? CoRR, 2016.
  • Santambrogio (2015) F. Santambrogio. Optimal Transport for Applied Mathematicians. Birkhäuser, 1st edition, 2015.
  • Shafahi et al. (2019) A. Shafahi, M. Najibi, A. Ghiasi, Z. Xu, J. P. Dickerson, C. Studer, L. S. Davis, G. Taylor, and T. Goldstein. Adversarial training for free! CoRR, 2019.
  • Steinwart (2007) I. Steinwart. How to compare different loss functions and their risks. Constructive Approximation, 2007.
  • Szegedy et al. (2013) C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
  • Tewari and Bartlett (2007) A. Tewari and P. L. Bartlett. On the consistency of multiclass classification methods. Journal of Machine Learning Research, 8(36), 2007.
  • Tramèr et al. (2017) F. Tramèr, N. Papernot, I. Goodfellow, D. Boneh, and P. McDaniel. The space of transferable adversarial examples. arXiv, 2017.
  • Trillos and Trillos (2023) C. G. Trillos and N. G. Trillos. On adversarial robustness and the use of wasserstein ascent-descent dynamics to enforce it, 2023.
  • Trillos and Murray (2020) N. G. Trillos and R. Murray. Adversarial classification: Necessary conditions and geometric flows. arXiv, 2020.
  • Trillos et al. (2022) N. G. Trillos, M. Jacobs, and J. Kim. The multimarginal optimal transport formulation of adversarial multiclass classification. arXiv, 2022.
  • Villani (2003) C. Villani. Topics in Optimal Transportation. American Mathematical Society, 2nd edition, 2003.
  • Wang and Chizat (2023) G. Wang and L. Chizat. An exponentially converging particle method for the mixed nash equilibrium of continuous games, 2023.
  • Wang et al. (2021) Y. Wang, X. Ma, J. Bailey, J. Yi, B. Zhou, and Q. Gu. On the convergence and robustness of adversarial training. ICML, 2021.
  • Wong et al. (2019) E. Wong, F. Schmidt, and Z. Kolter. Wasserstein adversarial examples via projected Sinkhorn iterations. In Proceedings of the 36th International Conference on Machine Learning, Proceedings of Machine Learning Research. PMLR, 2019.
  • Wong et al. (2020) E. Wong, L. Rice, and J. Z. Kolter. Fast is better than free: Revisiting adversarial training. CoRR, abs/2001.03994, 2020.
  • Wu et al. (2020) K. Wu, A. H. Wang, and Y. Yu. Stronger and faster wasserstein adversarial attacks, 2020.
  • Xie et al. (2018) C. Xie, Y. Wu, L. van der Maaten, A. L. Yuille, and K. He. Feature denoising for improving adversarial robustness. CoRR, 2018.
  • Yang et al. (2020) Y.-Y. Yang, C. Rashtchian, Y. Wang, and K. Chaudhuri. Robustness for non-parametric classification: A generic attack and defense. Proceedings of Machine Learning Research, 2020.
  • Zhang (2004) T. Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization. The Annals of Statistics, 2004.