Adversarially Robust Learning with Tolerance

Hassan Ashtiani McMaster University, [email protected]. Hassan Ashtiani is also a faculty affiliate at Toronto’s Vector Institute and was supported by an NSERC Discovery Grant.    Vinayak Pathak Layer 6 AI, [email protected].    Ruth Urner York University, [email protected]. Ruth Urner is also a faculty affiliate at Toronto’s Vector Institute and was supported by an NSERC Discovery Grant.
Abstract

We initiate the study of tolerant adversarial PAC-learning with respect to metric perturbation sets. In adversarial PAC-learning, an adversary is allowed to replace a test point $x$ with an arbitrary point in a closed ball of radius $r$ centered at $x$. In the tolerant version, the error of the learner is compared with the best achievable error with respect to a slightly larger perturbation radius $(1+\gamma)r$. This simple tweak helps us bridge the gap between theory and practice and obtain the first PAC-type guarantees for algorithmic techniques that are popular in practice.

Our first result concerns the widely-used “perturb-and-smooth” approach for adversarial learning. For perturbation sets with doubling dimension $d$, we show that a variant of these approaches PAC-learns any hypothesis class $\mathcal{H}$ with VC-dimension $v$ in the $\gamma$-tolerant adversarial setting with $O\left(\frac{v(1+1/\gamma)^{O(d)}}{\varepsilon}\right)$ samples. This is in contrast to the traditional (non-tolerant) setting in which, as we show, the perturb-and-smooth approach can provably fail.

Our second result shows that one can PAC-learn the same class using $\widetilde{O}\left(\frac{d\cdot v\log(1+1/\gamma)}{\varepsilon^{2}}\right)$ samples even in the agnostic setting. This result is based on a novel compression-based algorithm, and achieves a linear dependence on the doubling dimension as well as the VC-dimension. This is in contrast to the non-tolerant setting where there is no known sample complexity upper bound that depends polynomially on the VC-dimension.

1 Introduction

Several empirical studies (Szegedy et al., 2014; Goodfellow et al., 2018) have demonstrated that models trained to achieve high accuracy on a data set often have the undesirable property that a small perturbation to an input instance can change the label output by the model. For most domains this does not align with human perception and thus indicates that the learned models are not representing the ground truth despite obtaining good accuracy on test sets.

The theory of PAC-learning characterizes the conditions under which learning is possible. For binary classification, the following conditions are sufficient: a) unseen data should arrive from the same distribution as training data, and b) the class of models should have a low capacity (as measured, for example, by its VC-dimension). If these conditions are met, an Empirical Risk Minimizer (ERM) that simply optimizes model parameters to maximize accuracy on the training set learns successfully.

Recent work has studied test-time adversarial perturbations under the PAC-learning framework. If an adversary is allowed to perturb data during test time then the conditions above do not hold, and we cannot hope for the model to learn to be robust just by running ERM. Thus, the goal here is to bias the learning process towards finding models where label-changing perturbations are rare. This is achieved by defining a loss function that combines both classification error and the probability of seeing label-changing perturbations, and learning models that minimize this loss on unseen data. It has been shown that even though (robust) ERM can fail in this setting, PAC-learning is still possible as long as we know during training the kinds of perturbations we want to guard against at test time (Montasser et al., 2019). This result holds for all perturbation sets. However, the learning algorithm is significantly more complex than robust ERM and requires a large number of samples (with the best known sample complexity bounds potentially being exponential in the VC-dimension).

We study a tolerant version of the adversarially robust learning framework and restrict the perturbations to balls in a general metric space with finite doubling dimension. We show this slight shift in the learning objective yields significantly improved sample complexity bounds through a simpler learning paradigm than what was previously known. In fact, we show that a version of the common “perturb-and-smooth” paradigm successfully PAC-learns any class of bounded VC-dimension in this setting.

Learning in general metric spaces. What kinds of perturbations should a learning algorithm guard against? Any transformation of the input that we believe should not change its label could be a viable perturbation for the adversary to use. The early works in this area considered perturbations contained within a small $\ell_{p}$-ball of the input. More recent work has considered other transformations such as a small rotation, or translation of an input image (Engstrom et al., 2019; Fawzi and Frossard, 2015; Kanbak et al., 2018; Xiao et al., 2018), or even adding small amounts of fog or snow (Kang et al., 2019). It has also been argued that small perturbations in some feature space should be allowed as opposed to the input space (Inkawhich et al., 2019; Sabour et al., 2016; Xu et al., 2020; Song et al., 2018; Hosseini and Poovendran, 2018). This motivates the study of more general perturbations.

We consider a setting where the input comes from a domain that is equipped with a distance metric and allows perturbations to be within a small metric ball around the input. Earlier work on general perturbation sets (for example, Montasser et al. (2019)) considered arbitrary perturbations. In this setting one does not quantify the magnitude of a perturbation and thus cannot talk about small versus large perturbations. Modeling perturbations using a metric space enables us to do that while also keeping the setup general enough to be able to encode a large variety of perturbation sets by choosing appropriate distance functions.

Learning with tolerance. In practice, we often believe that small perturbations of the input should not change its label but we do not know precisely what small means. However, in the PAC-learning framework for adversarially robust classification, we are required to define a precise perturbation set and learn a model that has error arbitrarily close to the smallest error that can be achieved with respect to that perturbation set. In other words, we aim to be arbitrarily close to a target that was picked somewhat arbitrarily to begin with. Due to the uncertainty about the correct perturbation size, it is more meaningful to allow for a wider range of error values. To achieve this, we introduce the concept of tolerance. In the tolerant setting, in addition to specifying a perturbation size $r$, we introduce a tolerance parameter $\gamma$ that encodes our uncertainty about the size of allowed perturbations. Then, for any given $\varepsilon>0$, we aim to learn a model whose error with respect to perturbations of size $r$ is at most $\varepsilon$ more than the smallest error achievable with respect to perturbations of size $r(1+\gamma)$.

2 Our results

In this paper we formalize and initiate the study of the problem of adversarially robust learning in the tolerant setting for general metric spaces and provide two algorithms for the task. Both of our algorithms rely on: 1) modifying the training data by randomly sampling points from the perturbation sets around each data point, and 2) smoothing the output of the model by taking a majority over the labels returned by the model for nearby points.

Our first algorithm starts by modifying the training set by randomly perturbing each training point using a certain distribution (see Section 5 for details). It then trains a (non-robust) PAC-learner (such as ERM) on the perturbed training set to find a hypothesis $h$. Finally, it outputs a smooth version of $h$. The smoothing step replaces $h(x)$ at each point $x$ with the majority label output by $h$ on the points around $x$. We show that for metric spaces of a fixed doubling dimension, this algorithm successfully learns in the tolerant setting assuming tolerant realizability.

Theorem 2.1 (Informal version of Theorem 5.2).

Let $(X,\mathrm{dist})$ be a metric space with doubling dimension $d$ and $\mathcal{H}$ a hypothesis class. Assuming tolerant realizability, $\mathcal{H}$ can be learned tolerantly in the adversarially robust setting using $O\left(\frac{(1+1/\gamma)^{O(d)}\mathrm{VC}(\mathcal{H})}{\varepsilon}\right)$ samples, where $\gamma$ encodes the amount of allowed tolerance, and $\varepsilon$ is the desired accuracy.

An interesting feature of the above result is the linear dependence of the sample complexity on $\mathrm{VC}(\mathcal{H})$. This is in contrast to the best known upper bound for the non-tolerant adversarial setting (Montasser et al., 2019), which depends on the dual VC-dimension of the hypothesis class and in general is exponential in $\mathrm{VC}(\mathcal{H})$. Moreover, this is the first PAC-type guarantee for the general perturb-and-smooth paradigm, indicating that tolerant adversarial learning is the “right” learning model for studying these approaches. While the above method enjoys simplicity and can be computationally efficient, one downside is that its sample complexity grows exponentially with the doubling dimension. For instance, such an algorithm cannot be used on high-dimensional data in the Euclidean space. Another limitation is that the guarantee holds only in the (robustly) realizable setting.

In the second main part of our submission (Section 6) we show that, surprisingly, these limitations can be overcome by incorporating ideas from the tolerant framework and perturb-and-smooth algorithms into a novel compression scheme for robust learning. The resulting algorithm improves the dependence on the doubling dimension, and works in the general agnostic setting.

Theorem 2.2 (Informal version of Corollary 6.5).

Let $(X,\mathrm{dist})$ be a metric space with doubling dimension $d$ and $\mathcal{H}$ a hypothesis class. Then $\mathcal{H}$ can be learned tolerantly in the adversarially robust setting using $\widetilde{O}\left(\frac{O(d)\mathrm{VC}(\mathcal{H})\log(1+1/\gamma)}{\varepsilon^{2}}\right)$ samples, where $\widetilde{O}$ hides logarithmic factors, $\gamma$ encodes the amount of allowed tolerance, and $\varepsilon$ is the desired accuracy.

This algorithm exploits the connection between sample compression and adversarially robust learning (Montasser et al., 2019). However, unlike Montasser et al. (2019), our new compression scheme sidesteps the dependence on the dual VC-dimension (refer to the discussion at the end of Section 6 for more details). As a result, we get an exponential improvement over the best known (non-tolerant) sample complexity in terms of the dependence on VC-dimension.

3 Related work

PAC-learning for adversarially robust classification has been studied extensively in recent years (Cullina et al., 2018; Awasthi et al., 2019; Montasser et al., 2019; Feige et al., 2015; Attias et al., 2019; Montasser et al., 2020a; Ashtiani et al., 2020). These works provide learning algorithms that guarantee low generalization error in the presence of adversarial perturbations in various settings. The most general result is due to Montasser et al. (2019), and is proved for general hypothesis classes and perturbation sets. All of the above results assume that the learner knows the kinds of perturbations allowed for the adversary. Some more recent papers have considered scenarios where the learner does not even need to know the perturbation set. Goldwasser et al. (2020) allow the adversary to perturb test data in unrestricted ways and are still able to provide learning guarantees. The catch is that their approach works only in the transductive setting and only if the learner is allowed to abstain from making a prediction on some test points. Montasser et al. (2021a) consider the case where the learner needs to infer the set of allowed perturbations by observing the actions of the adversary.

Tolerance was introduced by Ashtiani et al. (2020) in the context of certification. They provide examples where certification is not possible unless we allow some tolerance. Montasser et al. (2021b) study transductive adversarial learning and provide a “tolerant” guarantee. Note that unlike our work, the main focus of that paper is on the transductive setting. Moreover, they do not specifically study tolerance with respect to metric perturbation sets. Without a metric, it is not meaningful to expand perturbation sets by a factor $(1+\gamma)$ (as we do in our definition of tolerance). Instead, they expand their perturbation sets by applying two perturbations in succession, which is akin to setting $\gamma=1$. In contrast, our results hold in the more common inductive setting, and capture a more realistic setting where $\gamma$ can be any (small) real number larger than zero.

Subsequent to our work, Bhattacharjee et al. (2022) study adversarially robust learning with tolerance for “regular” VC-classes and show that a simple modification of robust ERM achieves a sample complexity polynomial in both VC-dimension and doubling dimension. In a similar vein, Raman et al. (2022) identify a more general property of hypothesis classes for which robust ERM is sufficient for adversarially robust learning with tolerance.

Like many recent adversarially robust learning algorithms (Feige et al., 2015; Attias et al., 2019), our first algorithm relies on calls to a non-robust PAC-learner. Montasser et al. (2020b) formalize the question of reducing adversarially robust learning to non-robust learning and study finite perturbation sets of size $k$. They show a reduction that makes $O(\log^{2}k)$ calls to the non-robust learner and also prove a lower bound of $\Omega(\log k)$. It will be interesting to see if our algorithms can be used to obtain better bounds for the tolerant setting. Our first algorithm makes one call to the non-robust PAC-learner at training time, but needs to perform potentially expensive smoothing for making actual predictions (see Theorem 5.2).

A related line of work studies the smallest achievable robust loss for various distributions and hypothesis classes. For example, Bubeck and Sellke (2021) show that hypothesis classes with low robust loss must be overparametrized. Yang et al. (2020b) explore real-world datasets and provide evidence that they are separable, and that therefore there must exist locally Lipschitz hypotheses with low robust loss. Note that the existence of such hypotheses does not immediately imply that PAC-learning is possible.

The techniques of randomly perturbing the training data and smoothing the output classifier have been used extensively in practice and have shown good empirical success. Augmenting the training data with randomly perturbed samples was used for handwriting recognition as early as Yaeger et al. (1996). More recently, “stability training” was introduced by Zheng et al. (2016) for state-of-the-art image classifiers, where training data is augmented with Gaussian perturbations. Empirical evidence was provided that the technique improves accuracy against naturally occurring perturbations. Augmentations with non-Gaussian perturbations of a large variety were considered by Hendrycks et al. (2019).

Smoothing the output classifier using random samples around the test point is a popular technique for producing certifiably robust classifiers. A certification, in this context, is a guarantee that given a test point $x$, all points within a certain radius of $x$ receive the same label as $x$. Several papers have provided theoretical analyses to show that smoothing produces certifiably robust classifiers (Cao and Gong, 2017; Cohen et al., 2019; Lecuyer et al., 2019; Li et al., 2019; Liu et al., 2018; Salman et al., 2019; Levine and Feizi, 2020), whereas others have identified cases where smoothing does not work (Yang et al., 2020a; Blum et al., 2020).

However, to the best of our knowledge, a PAC-type guarantee has not been shown for any algorithm that employs training data perturbations or output classifier smoothing, and our paper provides the first such analysis.

4 Notations and setup

We denote by $X$ the input domain and by $Y=\{0,1\}$ the binary label space. We assume that $X$ is equipped with a metric $\mathrm{dist}$. A hypothesis $h:X\to Y$ is a function that assigns a label to each point in the domain. A hypothesis class $\mathcal{H}$ is a set of such hypotheses. For a sample $S=((x_{1},y_{1}),\ldots,(x_{n},y_{n}))\in(X\times Y)^{n}$, we use the notation $S^{X}=\{x_{1},x_{2},\ldots,x_{n}\}$ to denote the collection of domain points $x_{i}$ occurring in $S$. The binary (also called 0-1) loss of $h$ on data point $(x,y)\in X\times Y$ is defined by

\ell^{0/1}(h,x,y)=\mathds{1}\left[h(x)\neq y\right],

where $\mathds{1}\left[\cdot\right]$ is the indicator function. Let $P$ be a probability distribution over $X\times Y$. Then the expected binary loss of $h$ with respect to $P$ is defined by

\mathcal{L}^{0/1}_{P}(h)=\mathbb{E}_{(x,y)\sim P}\left[\ell^{0/1}(h,x,y)\right].

Similarly, the empirical binary loss of $h$ on sample $S=((x_{1},y_{1}),\ldots,(x_{n},y_{n}))\in(X\times Y)^{n}$ is defined as $\mathcal{L}^{0/1}_{S}(h)=\frac{1}{n}\sum_{i=1}^{n}\ell^{0/1}(h,x_{i},y_{i})$. We also define the approximation error of $\mathcal{H}$ with respect to $P$ as $\mathcal{L}^{0/1}_{P}(\mathcal{H})=\inf_{h\in\mathcal{H}}\mathcal{L}^{0/1}_{P}(h)$.

A learner $\mathcal{A}$ is a function that takes in a finite sequence of labeled instances $S=((x_{1},y_{1}),\ldots,(x_{n},y_{n}))$ and outputs a hypothesis $h=\mathcal{A}(S)$. The following definition abstracts the notion of PAC-learning (Vapnik and Chervonenkis, 1971; Valiant, 1984).

Definition 4.1 (PAC-learner).

Let $\mathcal{P}$ be a set of distributions over $X\times Y$ and $\mathcal{H}$ a hypothesis class. We say $\mathcal{A}$ PAC-learns $(\mathcal{H},\mathcal{P})$ with $m_{\mathcal{A}}:(0,1)^{2}\to\mathbb{N}$ samples if the following holds: for every distribution $P\in\mathcal{P}$ over $X\times Y$, and every $\varepsilon,\delta\in(0,1)$, if $S$ is an i.i.d. sample of size at least $m_{\mathcal{A}}(\varepsilon,\delta)$ from $P$, then with probability at least $1-\delta$ (over the randomness of $S$) we have

\mathcal{L}_{P}(\mathcal{A}(S))\leq\mathcal{L}_{P}(\mathcal{H})+\varepsilon.

$\mathcal{A}$ is called an agnostic learner if $\mathcal{P}$ is the set of all distributions on $X\times Y$, and a realizable learner if $\mathcal{P}=\{P:\mathcal{L}_{P}(\mathcal{H})=0\}$.

The smallest function $m:(0,1)^{2}\to\mathbb{N}$ for which there exists a learner $\mathcal{A}$ that satisfies the above definition with $m_{\mathcal{A}}=m$ is referred to as the (realizable or agnostic) sample complexity of learning $\mathcal{H}$.

The existence of sample-efficient PAC-learners for VC classes is a standard result (Vapnik and Chervonenkis, 1971). We state the results formally in Appendix A.

4.1 Tolerant adversarial PAC-learning

Let $\mathcal{U}:X\to 2^{X}$ be a function that maps each point in the domain to the set of its “admissible” perturbations. We call this function the perturbation type. The adversarial loss of $h$ with respect to $\mathcal{U}$ on $(x,y)\in X\times Y$ is defined by

\ell^{\mathcal{U}}(h,x,y)=\max_{z\in\mathcal{U}(x)}\left\{\ell^{0/1}(h,z,y)\right\}.

The expected adversarial loss with respect to $P$ is defined by $\mathcal{L}^{\mathcal{U}}_{P}(h)=\mathbb{E}_{(x,y)\sim P}\,\ell^{\mathcal{U}}(h,x,y)$. The empirical adversarial loss of $h$ on sample $S=((x_{1},y_{1}),\ldots,(x_{n},y_{n}))\in(X\times Y)^{n}$ is defined by $\mathcal{L}^{\mathcal{U}}_{S}(h)=\frac{1}{n}\sum_{i=1}^{n}\ell^{\mathcal{U}}(h,x_{i},y_{i})$. Finally, the adversarial approximation error of $\mathcal{H}$ with respect to $\mathcal{U}$ and $P$ is defined by $\mathcal{L}^{\mathcal{U}}_{P}(\mathcal{H})=\inf_{h\in\mathcal{H}}\mathcal{L}^{\mathcal{U}}_{P}(h)$.
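
Since the maximum over $\mathcal{U}(x)$ is rarely computable exactly, the following minimal Python sketch (our illustration; the callable perturbations is an assumed stand-in returning finitely many candidate points from $\mathcal{U}(x)$) estimates the empirical adversarial loss $\mathcal{L}^{\mathcal{U}}_{S}(h)$. With sampled rather than exhaustive perturbations, the estimate only lower-bounds the true max-based loss.

def empirical_adversarial_loss(h, S, perturbations):
    # Empirical adversarial loss: (x, y) counts as an error if ANY considered
    # perturbation z of x is misclassified by h. With a finite/sampled set of
    # perturbations this is a lower bound on the true max-based loss.
    errors = sum(
        1 for x, y in S if any(h(z) != y for z in perturbations(x))
    )
    return errors / len(S)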

The following definition generalizes the setting of PAC adversarial learning to what we call the tolerant setting, where we consider two perturbation types $\mathcal{U}$ and $\mathcal{V}$. We say $\mathcal{U}$ is contained in $\mathcal{V}$, and write $\mathcal{U}\prec\mathcal{V}$, if $\mathcal{U}(x)\subset\mathcal{V}(x)$ for all $x\in X$.

Definition 4.2 (Tolerant Adversarial PAC-learner).

Let $\mathcal{P}$ be a set of distributions over $X\times Y$, $\mathcal{H}$ a hypothesis class, and $\mathcal{U}\prec\mathcal{V}$ two perturbation types. We say $\mathcal{A}$ tolerantly PAC-learns $(\mathcal{H},\mathcal{P},\mathcal{U},\mathcal{V})$ with $m_{\mathcal{A}}:(0,1)^{2}\to\mathbb{N}$ samples if the following holds: for every distribution $P\in\mathcal{P}$ and every $\varepsilon,\delta\in(0,1)$, if $S$ is an i.i.d. sample of size at least $m_{\mathcal{A}}(\varepsilon,\delta)$ from $P$, then with probability at least $1-\delta$ (over the randomness of $S$) we have

\mathcal{L}^{\mathcal{U}}_{P}(\mathcal{A}(S))\leq\mathcal{L}^{\mathcal{V}}_{P}(\mathcal{H})+\varepsilon.

We say $\mathcal{A}$ is a tolerant PAC-learner in the agnostic setting if $\mathcal{P}$ is the set of all distributions over $X\times Y$, and in the tolerantly realizable setting if $\mathcal{P}=\{P:\mathcal{L}^{\mathcal{V}}_{P}(\mathcal{H})=0\}$.

In the above context, we refer to $\mathcal{U}$ as the actual perturbation type and to $\mathcal{V}$ as the reference perturbation type. The case where $\mathcal{U}(x)=\mathcal{V}(x)$ for all $x\in X$ corresponds to the usual adversarial learning scenario (with no tolerance).

4.2 Tolerant adversarial PAC-learning in metric spaces

If $X$ is equipped with a metric $\mathrm{dist}(\cdot,\cdot)$, then $\mathcal{U}(x)$ can be naturally defined by a ball of radius $r$ around $x$, i.e., $\mathcal{U}(x)=\mathcal{B}_{r}(x)=\{z\in X\mid\mathrm{dist}(x,z)\leq r\}$. To simplify the notation, we sometimes use $\ell^{r}(h,x,y)$ instead of $\ell^{\mathcal{B}_{r}}(h,x,y)$ to denote the adversarial loss with respect to $\mathcal{B}_{r}$.

In the tolerant setting, we consider the perturbation sets $\mathcal{U}(x)=\mathcal{B}_{r}(x)$ and $\mathcal{V}(x)=\mathcal{B}_{(1+\gamma)r}(x)$, where $\gamma>0$ is called the tolerance parameter. Note that $\mathcal{U}\prec\mathcal{V}$. We now define PAC-learning with respect to the metric space.

Definition 4.3 (Tolerant Adversarial Learning in metric spaces).

Let $(X,\mathrm{dist})$ be a metric space, $\mathcal{H}$ a hypothesis class, and $\mathcal{P}$ a set of distributions over $X\times Y$. We say $(\mathcal{H},\mathcal{P},\mathrm{dist})$ is tolerantly PAC-learnable with $m:(0,1)^{3}\to\mathbb{N}$ samples when for every $r,\gamma>0$ there exists a PAC-learner $\mathcal{A}_{r,\gamma}$ for $(\mathcal{H},\mathcal{P},\mathcal{B}_{r},\mathcal{B}_{r(1+\gamma)})$ that uses $m(\varepsilon,\delta,\gamma)$ samples.

Remark 4.4.

In this definition the learner receives $\gamma$ and $r$ as input but its sample complexity does not depend on $r$ (but can depend on $\gamma$). Also, as in Definition 4.2, the tolerantly realizable setting corresponds to $\mathcal{P}=\{P:\mathcal{L}^{r(1+\gamma)}_{P}(\mathcal{H})=0\}$, while in the agnostic setting $\mathcal{P}$ is the set of all distributions over $X\times Y$.

The doubling dimension and the doubling measure of the metric space will play important roles in our analysis. We refer the reader to Appendix B for their definitions.

We will use the following lemma in our analysis, whose proof can be found in Appendix B:

Lemma 4.5.

For any family $\mathcal{M}$ of complete, doubling metric spaces, there exist constants $c_{1},c_{2}>0$ such that for any metric space $(X,\mathrm{dist})\in\mathcal{M}$ with doubling dimension $d$, there exists a measure $\mu$ such that if a ball $\mathcal{B}_{r}$ of radius $r>0$ is completely contained inside a ball $\mathcal{B}_{\alpha r}$ of radius $\alpha r$ (with potentially a different center) for any $\alpha>1$, then $0<\mu(\mathcal{B}_{\alpha r})\leq(c_{1}\alpha)^{c_{2}d}\mu(\mathcal{B}_{r})$. Furthermore, if we have a constant $\alpha_{0}>1$ such that we know that $\alpha\geq\alpha_{0}$, then the bound can be simplified to $0<\mu(\mathcal{B}_{\alpha r})\leq\alpha^{\zeta d}\mu(\mathcal{B}_{r})$, where $\zeta$ depends on $\mathcal{M}$ and $\alpha_{0}$.

Later, we will set $\alpha=1+1/\gamma$, where $\gamma$ is the tolerance parameter. Since we are mostly interested in small values of $\gamma$, suppose we decide on some loose upper bound $\Gamma\gg\gamma$. Then $\alpha=1+1/\gamma\geq 1+1/\Gamma$, so this corresponds to saying that there exists some $\alpha_{0}>1$ (namely $\alpha_{0}=1+1/\Gamma$) such that $\alpha\geq\alpha_{0}$.

It is worth noting that in the special case of Euclidean metric spaces, we can set both $c_{1}$ and $c_{2}$ to be 1. In the rest of the paper, we will assume we have a loose upper bound $\Gamma\gg\gamma$ and use the simpler bound from Lemma B.5 extensively.
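
As a concrete instance (our own illustration, not part of the original analysis): in $\mathbb{R}^{d}$ with the Euclidean metric, the Lebesgue measure $\mu$ is a doubling measure, and the volume of a ball depends only on its radius, so

\mu(\mathcal{B}_{\alpha r})=v_{d}(\alpha r)^{d}=\alpha^{d}\cdot v_{d}r^{d}=\alpha^{d}\mu(\mathcal{B}_{r}),

where $v_{d}$ is the volume of the unit ball. This matches Lemma 4.5 with $c_{1}=c_{2}=1$, and setting $\alpha=1+1/\gamma$ recovers the $(1+1/\gamma)^{d}$ factor that appears in our bounds.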

Given a metric space $(X,\mathrm{dist})$ and a measure $\mu$ defined over it, for any subset $Z\subseteq X$ for which $\mu(Z)$ is non-zero and finite, $\mu$ induces a probability measure $P_{Z}^{\mu}$ over $Z$ as follows. For any set $Z^{\prime}\subseteq Z$ in the $\sigma$-algebra over $Z$, we define $P_{Z}^{\mu}(Z^{\prime})=\mu(Z^{\prime})/\mu(Z)$. With a slight abuse of notation, we write $z\sim Z$ to mean $z\sim P_{Z}^{\mu}$ whenever $\mu$ is clear from the context.

Our learners rely on being able to sample from $P_{Z}^{\mu}$. Thus we define the following oracle, which can be implemented efficiently for $\ell_{p}$ spaces.

Definition 4.6 (Sampling Oracle).

Given a metric space (X,dist)(X,\mathrm{dist}) equipped with a doubling measure μ\mu, a sampling oracle is an algorithm that when queried with a ZXZ\subseteq X such that μ(Z)\mu(Z) is finite, returns a sample drawn from PZμP_{Z}^{\mu}. We will use the notation zZz\sim Z for queries to this oracle.
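
For the Euclidean ($\ell_{2}$) case with the Lebesgue measure playing the role of $\mu$, the oracle can be implemented with the standard direction-and-radius trick. The following minimal Python sketch (our illustration; the function name sample_ball is hypothetical) draws a point uniformly from a ball and is reused in the sketches below.

import numpy as np

def sample_ball(center, radius, rng=None):
    # Draw a point uniformly from the Euclidean ball of the given radius
    # around `center`: uniform direction via a normalized Gaussian, and
    # radius scaled by U^(1/d) so that volume is covered uniformly.
    rng = np.random.default_rng() if rng is None else rng
    center = np.asarray(center, dtype=float)
    d = center.shape[0]
    direction = rng.standard_normal(d)
    direction /= np.linalg.norm(direction)
    return center + radius * rng.random() ** (1.0 / d) * direction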

5 The perturb-and-smooth approach for tolerant adversarial learning

In this section we focus on tolerant adversarial PAC-learning in metric spaces (Definition 4.3), and show that VC classes are tolerantly PAC-learnable in the tolerantly realizable setting. Interestingly, we prove this result using an approach that resembles the “perturb-and-smooth” paradigm which is used in practice (for example by Cohen et al. (2019)). The overall idea is to “perturb” each training point $x$, train a classifier on the “perturbed” points, and “smooth out” the final hypothesis using a certain majority rule.

We employ three perturbation types: $\mathcal{U}$ and $\mathcal{V}$ play the role of the actual and the reference perturbation type, respectively. Additionally, we consider a perturbation type $\mathcal{W}:X\to 2^{X}$, which is used for smoothing. We assume $\mathcal{U}\prec\mathcal{V}$ and $\mathcal{W}\prec\mathcal{V}$. For this section, we will use metric balls for the three types. Specifically, if $\mathcal{U}$ consists of balls of radius $r$ for some $r>0$, then $\mathcal{W}$ will consist of balls of radius $\gamma r$ and $\mathcal{V}$ will consist of balls of radius $(1+\gamma)r$.

Definition 5.1 (Smoothed classifier).

For a hypothesis $h:X\to\{0,1\}$ and perturbation type $\mathcal{W}:X\to 2^{X}$, we let $\bar{h}_{\mathcal{W}}$ denote the classifier resulting from replacing the label $h(x)$ with the average label over $\mathcal{W}(x)$, that is

\bar{h}_{\mathcal{W}}(x)=\mathds{1}\left[\mathbb{E}_{x^{\prime}\sim\mathcal{W}(x)}h(x^{\prime})\geq 1/2\right].

For metric perturbation types, where $\mathcal{W}$ is a ball of some radius $r$, we also use the notation $\bar{h}_{r}$, and when the type $\mathcal{W}$ is clear from context, we may omit the subscript altogether and simply write $\bar{h}$ for the smoothed classifier.
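
In practice the expectation in Definition 5.1 is estimated by sampling. A minimal Python sketch of such a Monte Carlo smoothed predictor (our illustration, building on the hypothetical sample_ball above):

def smoothed_predict(h, x, radius, n_samples=1000, rng=None):
    # Monte Carlo version of the smoothed classifier of Definition 5.1:
    # output the majority label of h over points sampled from the ball
    # of the given radius around x.
    votes = sum(h(sample_ball(x, radius, rng)) for _ in range(n_samples))
    return int(votes >= n_samples / 2)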

The tolerant perturb-and-smooth algorithm

We propose the following learning algorithm, TPaS, for tolerant learning in metric spaces. Let the perturbation radius be $r>0$ for the actual type $\mathcal{U}=\mathcal{B}_{r}$, and let $S=((x_{1},y_{1}),\ldots,(x_{m},y_{m}))$ be the training sample. For each $x_{i}\in S^{X}$, the learner samples a point $x^{\prime}_{i}\sim\mathcal{B}_{(1+\gamma)r}(x_{i})$ (using the sampling oracle) from the expanded reference perturbation set $\mathcal{V}(x_{i})=\mathcal{B}_{(1+\gamma)r}(x_{i})$. Let $S^{\prime}=((x^{\prime}_{1},y_{1}),\ldots,(x^{\prime}_{m},y_{m}))$. TPaS then invokes a (standard, non-robust) PAC-learner $\mathcal{A}_{\mathcal{H}}$ for the hypothesis class $\mathcal{H}$ on the perturbed data $S^{\prime}$. We let $\hat{h}=\mathcal{A}_{\mathcal{H}}(S^{\prime})$ denote the output of this PAC-learner. Finally, TPaS outputs $\bar{h}_{\gamma r}$, the $\mathcal{W}$-smoothed version of $\hat{h}$ for $\mathcal{W}=\mathcal{B}_{\gamma r}$. That is, $\bar{h}_{\gamma r}(x)$ is simply the majority label of $\hat{h}$ in a ball of radius $\gamma r$ around $x$ with respect to the distribution defined by $\mu$; see also Definition 5.1. We will prove below that this $\bar{h}_{\gamma r}$ has a small $\mathcal{U}$-adversarial loss. Algorithm 1 below summarizes our learning procedure.

Algorithm 1 Tolerant Perturb and Smooth (TPaS)
Input: Radius $r$, tolerance parameter $\gamma$, data $S=((x_{1},y_{1}),\ldots,(x_{m},y_{m}))$, access to sampling oracle $\mathcal{O}$ for $\mu$ and PAC-learner $\mathcal{A}_{\mathcal{H}}$.
Initialize $S^{\prime}=\emptyset$
for $i=1$ to $m$ do
   Sample $x^{\prime}_{i}\sim\mathcal{B}_{(1+\gamma)r}(x_{i})$
   Add $(x^{\prime}_{i},y_{i})$ to $S^{\prime}$
end for
Set $\hat{h}=\mathcal{A}_{\mathcal{H}}(S^{\prime})$
Output: $\bar{h}_{\gamma r}$ defined by
    $\bar{h}_{\gamma r}(x)=\mathds{1}\left[\mathbb{E}_{x^{\prime}\sim\mathcal{B}_{\gamma r}(x)}\hat{h}(x^{\prime})\geq 1/2\right]$
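
A compact Python rendering of Algorithm 1 (our sketch, not the authors' implementation; it assumes the hypothetical sample_ball and smoothed_predict helpers above, and pac_learner stands in for $\mathcal{A}_{\mathcal{H}}$):

def tpas_train(S, r, gamma, pac_learner, rng=None):
    # 1) Perturb: replace each x_i with a uniform draw from the expanded
    #    reference ball of radius (1 + gamma) * r around x_i.
    S_prime = [(sample_ball(x, (1 + gamma) * r, rng), y) for x, y in S]
    # 2) Learn: run a standard (non-robust) PAC-learner on the perturbed data.
    h_hat = pac_learner(S_prime)
    # 3) Smooth: predict the majority label of h_hat over the ball of
    #    radius gamma * r around each test point.
    return lambda x: smoothed_predict(h_hat, x, gamma * r, rng=rng)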

The following is the main result of this section.

Theorem 5.2.

Let $(X,\mathrm{dist})$ be any metric space with doubling dimension $d$ and doubling measure $\mu$. Let $\mathcal{O}$ be a sampling oracle for $\mu$. Let $\mathcal{H}$ be a hypothesis class and $\mathcal{P}$ a set of distributions over $X\times Y$. Assume $\mathcal{A}_{\mathcal{H}}$ PAC-learns $\mathcal{H}$ with $m_{\mathcal{H}}(\varepsilon,\delta)$ samples in the realizable setting. Then there exists a learner $\mathcal{A}$, namely TPaS, that

  • Tolerantly PAC-learns $(\mathcal{H},\mathcal{P},\mathrm{dist})$ in the tolerantly realizable setting with sample complexity bounded by $m(\varepsilon,\delta,\gamma)=O\left(m_{\mathcal{H}}(\varepsilon,\delta)\cdot(1+1/\gamma)^{\zeta d}\right)=O\left(\frac{\mathrm{VC}(\mathcal{H})+\log(1/\delta)}{\varepsilon}\cdot(1+1/\gamma)^{\zeta d}\right)$, where $\gamma$ is the tolerance parameter and $d$ is the doubling dimension.

  • Makes only one query to $\mathcal{A}_{\mathcal{H}}$.

  • Makes $m(\varepsilon,\delta,\gamma)$ queries to the sampling oracle $\mathcal{O}$.

The proof of this theorem uses the following key technical lemma (its proof can be found in Appendix C):

Lemma 5.3.

Let $r>0$ be a perturbation radius, $\gamma>0$ a tolerance parameter, and $g:X\to Y$ a classifier. For $x\in X$ and $y\in Y=\{0,1\}$, we define

\Sigma_{g,y}(x)=\mathbb{E}_{z\sim\mathcal{B}_{r(1+\gamma)}(x)}\mathds{1}\left[g(z)\neq y\right]\quad\text{and}\quad\sigma_{g,y}(x)=\mathbb{E}_{z\sim\mathcal{B}_{r\gamma}(x)}\mathds{1}\left[g(z)\neq y\right].

Then $\Sigma_{g,y}(x)\leq\frac{1}{3}\cdot\left(\frac{1+\gamma}{\gamma}\right)^{-\zeta d}$ implies that $\sigma_{g,y}(z)\leq 1/3$ for all $z\in\mathcal{B}_{r}(x)$.

Proof of Theorem 5.2.

Consider some $\epsilon_{0}>0$ and $0<\delta<1$ to be given (we will pick a suitable value of $\epsilon_{0}$ later), and assume the PAC-learner $\mathcal{A}_{\mathcal{H}}$ was invoked on the perturbed sample $S^{\prime}$ of size at least $m_{\mathcal{H}}(\epsilon_{0},\delta)$. According to Definition 4.1, this implies that with probability $1-\delta$, the output $\hat{h}=\mathcal{A}_{\mathcal{H}}(S^{\prime})$ has (binary) loss at most $\epsilon_{0}$ with respect to the data-generating distribution. Note that the relevant distribution here is the two-stage process of the original data-generating distribution $P$ followed by the perturbation sampling according to $\mathcal{V}=\mathcal{B}_{(1+\gamma)r}$. Since $P$ is $\mathcal{V}$-robustly realizable, the two-stage process yields a realizable distribution with respect to the standard $0/1$-loss. Thus, we have

\mathbb{E}_{(x,y)\sim P}\,\mathbb{E}_{z\sim\mathcal{B}_{r(1+\gamma)}(x)}\mathds{1}\left[\hat{h}(z)\neq y\right]\leq\epsilon_{0}.

In the notation of Lemma 5.3, this reads $\mathbb{E}_{(x,y)\sim P}\Sigma_{\hat{h},y}(x)\leq\epsilon_{0}$. For $\lambda>0$, Markov's inequality then yields:

\mathbb{E}_{(x,y)\sim P}\mathds{1}\left[\Sigma_{\hat{h},y}(x)\leq\lambda\right]>1-\epsilon_{0}/\lambda\qquad(1)

Thus, setting $\lambda=\frac{1}{3}\cdot\left(\frac{1+\gamma}{\gamma}\right)^{-\zeta d}$ and plugging the result of Lemma 5.3 into equation (1), we get

\mathbb{E}_{(x,y)\sim P}\mathds{1}\left[\forall z\in\mathcal{B}_{r}(x),\ \sigma_{\hat{h},y}(z)\leq 1/3\right]>1-\epsilon_{0}/\lambda.

Since $\sigma_{\hat{h},y}(z)\leq 1/3$ implies that $\mathds{1}\left[\mathbb{E}_{z^{\prime}\sim\mathcal{B}_{\gamma r}(z)}\hat{h}(z^{\prime})\geq 1/2\right]=y$, using the definition of the smoothed classifier $\bar{h}_{\gamma r}$ we get

\mathbb{E}_{(x,y)\sim P}\mathds{1}\left[\exists z\in\mathcal{B}_{r}(x),\ \bar{h}_{\gamma r}(z)\neq y\right]\leq\epsilon_{0}/\lambda,

which implies $\mathcal{L}^{r}_{P}(\bar{h}_{\gamma r})\leq\epsilon_{0}/\lambda$. Thus, for the robust learning problem, if we are given a desired accuracy $\varepsilon$ and we want $\mathcal{L}^{r}_{P}(\bar{h}_{\gamma r})\leq\varepsilon$, we can pick $\epsilon_{0}=\lambda\varepsilon$. Putting it all together, we get sample complexity $m\leq O\left(\frac{\mathrm{VC}(\mathcal{H})+\log(1/\delta)}{\epsilon_{0}}\right)$ where $\epsilon_{0}=\lambda\varepsilon$ and $\lambda=\frac{1}{3}\cdot\left(\frac{1+\gamma}{\gamma}\right)^{-\zeta d}$. Therefore, $m\leq O\left(\frac{\mathrm{VC}(\mathcal{H})+\log(1/\delta)}{\varepsilon}\cdot(1+1/\gamma)^{\zeta d}\right)$. ∎

Computational complexity of the learner. Assuming we have access to $\mathcal{O}$ and an efficient algorithm for non-robust PAC-learning in the realizable setting, we can compute $\hat{h}$ efficiently. Therefore, the learning can be done efficiently in this case. However, at prediction time, we need to compute $\bar{h}(x)$ on new test points, which requires computing an expectation. We can instead estimate the expectations using random samples from the sampling oracle. For a single test point $x$, if the number of samples we draw is $\Omega(\log(1/\delta))$, then with probability at least $1-\delta$ we get the same result as the exact $\bar{h}(x)$. Using more samples, we can boost this probability to guarantee a similar output to that of $\bar{h}$ on a larger set of test points.
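
The $\Omega(\log(1/\delta))$ sample count follows from a standard Hoeffding bound, using the fact that the proof of Theorem 5.2 guarantees the vote $\mathbb{E}_{x^{\prime}\sim\mathcal{B}_{\gamma r}(x)}\hat{h}(x^{\prime})$ is bounded away from the $1/2$ threshold by a margin of at least $1/6$ (since $\sigma_{\hat{h},y}\leq 1/3$). A back-of-the-envelope Python sketch (our illustration; the function name is hypothetical):

import math

def n_samples_for_confidence(delta, margin=1/6):
    # Hoeffding: if the true vote is at least `margin` away from 1/2, then
    # n >= ln(2/delta) / (2 * margin^2) i.i.d. votes make the empirical
    # majority match the exact smoothed label with probability >= 1 - delta.
    return math.ceil(math.log(2.0 / delta) / (2 * margin ** 2))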

The traditional non-tolerant framework does not justify the use of perturb-and-smooth-type approaches. The introduction of tolerance in the adversarial learning framework is crucial for being able to prove guarantees for perturb-and-smooth-type algorithms. To see why, consider a simple case where the domain is the real line, the perturbation set is an open Euclidean ball of radius 1, and the hypothesis class is the set of all thresholds. Assume that the underlying distribution is supported only on two points: $\mathbb{P}(x=-1,y=1)=\mathbb{P}(x=1,y=0)=0.5$. This distribution is robustly realizable, but the threshold must be set exactly to $x=0$ to get a small error. However, the perturb-and-smooth method will fail because the only way the PAC-learner $\mathcal{A}_{\mathcal{H}}$ sets the threshold to $x=0$ is if it receives a (perturbed) sample exactly at $x=0$, whose probability is 0.

6 Improved tolerant learning guarantees through sample compression

The perturb-and-smooth approach discussed in the previous section offers a general method for tolerant robust learning. However, one shortcoming of this approach is the exponential dependence of its sample complexity on the doubling dimension of the metric space. Furthermore, the tolerant robust guarantee relied on the data-generating distribution being tolerantly realizable. In this section, we propose another approach that addresses both of these issues. The idea is to adopt the perturb-and-smooth approach within a sample compression argument. We introduce the notion of a $(\mathcal{U},\mathcal{V})$-tolerant sample compression scheme and present a learning bound based on such a compression scheme, starting with the realizable case. We then show that this implies learnability in the agnostic case as well. Remarkably, this tolerant compression-based analysis will yield bounds on the sample complexity that avoid the exponential dependence on the doubling dimension.

For a compact representation, we will use the general notation $\mathcal{U},\mathcal{V}$, and $\mathcal{W}$ for the three perturbation types (actual, reference, and smoothing type) in this section and will assume that they satisfy Property 1 below for some parameter $\beta>0$. Lemma 5.3 implies that, in the metric setting, for any radius $r$ and tolerance parameter $\gamma$, the perturbation types $\mathcal{U}=\mathcal{B}_{r}$, $\mathcal{V}=\mathcal{B}_{(1+\gamma)r}$, and $\mathcal{W}=\mathcal{B}_{\gamma r}$ have this property for $\beta=\frac{1}{3}\left(\frac{1+\gamma}{\gamma}\right)^{-\zeta d}$.

Property 1.

For a fixed $0<\beta<1/2$, we assume that the perturbation types $\mathcal{V},\mathcal{U}$, and $\mathcal{W}$ are such that for any classifier $h$, any $x\in X$, and any $y\in\{0,1\}$, if

\mathbb{E}_{z\sim\mathcal{V}(x)}[h(z)=y]\geq 1-\beta

then the $\mathcal{W}$-smoothed classifier $\bar{h}_{\mathcal{W}}$ satisfies $\bar{h}_{\mathcal{W}}(z)=y$ for all $z\in\mathcal{U}(x)$.

A compression scheme of size $k$ is a pair of functions $(\kappa,\rho)$, where the compression function $\kappa:\bigcup_{i=1}^{\infty}(X\times Y)^{i}\to\bigcup_{i=1}^{k}(X\times Y)^{i}$ maps samples $S=((x_{1},y_{1}),(x_{2},y_{2}),\ldots,(x_{m},y_{m}))$ of arbitrary size to sub-samples of $S$ of size at most $k$, and $\rho:\bigcup_{i=1}^{k}(X\times Y)^{i}\to Y^{X}$ is a decompression function that maps samples to classifiers. The pair $(\kappa,\rho)$ is a sample compression scheme for loss $\ell$ and class $\mathcal{H}$ if, for any sample $S$ realizable by $\mathcal{H}$, we recover the correct labels for all $(x,y)\in S$; that is, $\mathcal{L}_{S}(\mathcal{H})=0$ implies that $\mathcal{L}_{S}(\rho(\kappa(S)))=0$.

For tolerant learning, we introduce the following generalization of compression schemes:

Definition 6.1 (Tolerant sample compression scheme).

A sample compression scheme $(\kappa,\rho)$ is a $(\mathcal{U},\mathcal{V})$-tolerant sample compression scheme for class $\mathcal{H}$ if, for any sample $S$ that is $\ell^{\mathcal{V}}$-realizable by $\mathcal{H}$, that is $\mathcal{L}^{\mathcal{V}}_{S}(\mathcal{H})=0$, we have $\mathcal{L}^{\mathcal{U}}_{S}(\rho(\kappa(S)))=0$.
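
The following minimal Python sketch (our illustration; kappa, rho, loss_U, loss_V, and hypothesis_class are assumed stand-in callables, not part of the paper) spells out the correctness condition of Definition 6.1:

def check_tolerant_compression(kappa, rho, S, loss_U, loss_V, hypothesis_class):
    # If S is not l^V-realizable by the (here: finite, iterable) class,
    # the tolerant correctness condition is vacuous.
    if min(sum(loss_V(h, x, y) for x, y in S) for h in hypothesis_class) > 0:
        return True
    # Otherwise, decompressing the compressed sub-sample must yield a
    # classifier with zero empirical loss under the smaller loss l^U.
    g = rho(kappa(S))
    return all(loss_U(g, x, y) == 0 for x, y in S)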

The next lemma establishes that the existence of a sufficiently small tolerant compression scheme for the class $\mathcal{H}$ yields bounds on the sample complexity of tolerantly learning $\mathcal{H}$. The proof of the lemma is based on a modification of a standard compression-based generalization bound; Appendix D provides more details.

Lemma 6.2.

Let $\mathcal{H}$ be a hypothesis class and $\mathcal{U}$ and $\mathcal{V}$ be perturbation types with $\mathcal{U}$ included in $\mathcal{V}$. If the class $\mathcal{H}$ admits a $(\mathcal{U},\mathcal{V})$-tolerant compression scheme of size bounded by $k\ln(m)$ for samples of size $m$, then the class is $(\mathcal{U},\mathcal{V})$-tolerantly learnable in the realizable case with sample complexity bounded by $m(\varepsilon,\delta)=\tilde{O}\left(\frac{k+\ln(1/\delta)}{\varepsilon}\right)$.

We next establish a bound on the tolerant compression size for general VC-classes, which will then immediately yield the improved sample complexity bounds for tolerant learning in the realizable case. The proof is sketched here; the full version can be found in the Appendix.

Lemma 6.3.

Let $\mathcal{H}\subseteq Y^{X}$ be some hypothesis class with finite VC-dimension $\mathrm{VC}(\mathcal{H})<\infty$, and let $\mathcal{U},\mathcal{V},\mathcal{W}$ satisfy the conditions in Property 1 for some $\beta>0$. Then there exists a $(\mathcal{U},\mathcal{V})$-tolerant sample compression scheme for $\mathcal{H}$ of size $\tilde{O}\left(\mathrm{VC}(\mathcal{H})\ln(\frac{m}{\beta})\right)$.

Proof Sketch.

We will employ a boosting-based approach to establish the claimed compression sizes. Let $S=((x_{1},y_{1}),(x_{2},y_{2}),\ldots,(x_{m},y_{m}))$ be a data-set that is $\ell^{\mathcal{V}}$-realizable with respect to $\mathcal{H}$. We let $S_{\mathcal{V}}$ denote an “inflated data-set” that contains all domain points in the $\mathcal{V}$-perturbation sets of the $x_{i}\in S^{X}$, that is, $S_{\mathcal{V}}^{X}:=\bigcup_{i=1}^{m}\mathcal{V}(x_{i})$. Every point $z\in S_{\mathcal{V}}^{X}$ is assigned the label $y=y_{i}$ of the minimally-indexed $(x_{i},y_{i})\in S$ with $z\in\mathcal{V}(x_{i})$, and we set $S_{\mathcal{V}}$ to be the resulting collection of labeled data-points.

We then use the boost-by-majority method to encode a classifier $g$ that (roughly speaking) has error bounded by $\beta/m$ over (a suitable measure over) $S_{\mathcal{V}}$. This boosting method outputs a $T$-majority vote $g(x)=\mathds{1}\left[\frac{1}{T}\sum_{i=1}^{T}h_{i}(x)\geq 1/2\right]$ over weak learners $h_{i}$, which in our case will be hypotheses from $\mathcal{H}$. We prove that this error can be achieved with $T=18\ln(\frac{2m}{\beta})$ rounds of boosting, and that each weak learner used in the boosting procedure can be encoded with $n=\tilde{O}(\mathrm{VC}(\mathcal{H}))$ many sample points from $S$. The resulting compression size is thus $n\cdot T=\tilde{O}\left(\mathrm{VC}(\mathcal{H})\ln(\frac{m}{\beta})\right)$.

Finally, the error bound $\beta/m$ of $g$ over $S_{\mathcal{V}}$ implies that the error in each perturbation set $\mathcal{V}(x_{i})$ of a sample point $(x_{i},y_{i})\in S$ is at most $\beta$. Property 1 then implies $\mathcal{L}^{\mathcal{U}}_{S}(\bar{g}_{\mathcal{W}})=0$ for the $\mathcal{W}$-smoothed classifier $\bar{g}_{\mathcal{W}}$, establishing the $(\mathcal{U},\mathcal{V})$-tolerant correctness of the compression scheme. ∎

This yields the following result.

Theorem 6.4.

Let $\mathcal{H}$ be a hypothesis class of finite VC-dimension and $\mathcal{V},\mathcal{U},\mathcal{W}$ be three perturbation types (actual, reference, and smoothing) satisfying Property 1 for some $\beta>0$. Then the sample complexity (omitting log-factors) of $(\mathcal{U},\mathcal{V})$-tolerantly learning $\mathcal{H}$ is bounded by

m(\varepsilon,\delta)=\tilde{O}\left(\frac{\mathrm{VC}(\mathcal{H})\ln(1/\beta)+\ln(1/\delta)}{\varepsilon}\right)

in the realizable case, and in the agnostic case by

m(\varepsilon,\delta)=\tilde{O}\left(\frac{\mathrm{VC}(\mathcal{H})\ln(1/\beta)+\ln(1/\delta)}{\varepsilon^{2}}\right).

Proof.

The bound for the realizable case follows immediately from Lemma 6.3 and the subsequent discussion (in the Appendix). For the agnostic case, we employ a reduction from agnostic robust learnability to realizable robust learnability (Montasser et al., 2019; Moran and Yehudayoff, 2016). The reduction is analogous to the one presented in Appendix C of Montasser et al. (2019) for usual (non-tolerant) robust learnability, with some minor modifications. Namely, for a sample $S$, we choose the largest subsample $S^{\prime}$ that is $\ell^{\mathcal{V}}$-realizable (this will result in competitiveness with an $\ell^{\mathcal{V}}$-optimal classifier), and we will use the boosting procedure described there for the $\ell^{\mathcal{U}}$ loss. For the sample sizes employed for the weak learners in that procedure, we can use the sample complexity for $\varepsilon=\delta=1/3$ of an optimal $(\mathcal{U},\mathcal{V})$-tolerant learner in the realizable case (note that each learning problem during the boosting procedure is a realizable $(\mathcal{U},\mathcal{V})$-tolerant learning task). These modifications result in the stated sample complexity for agnostic tolerant learnability. ∎

In particular, for the doubling measure scenario (as considered in the previous section), we obtain

Corollary 6.5.

For metric tolerant learning with tolerance parameter $\gamma$ in a metric space of doubling dimension $d$, the sample complexity of adversarially robust learning with tolerance in the realizable case is bounded by $m(\varepsilon,\delta)=\tilde{O}\left(\frac{\mathrm{VC}(\mathcal{H})\zeta d\ln(1+1/\gamma)+\ln(1/\delta)}{\varepsilon}\right)$ and in the agnostic case by $m(\varepsilon,\delta)=\tilde{O}\left(\frac{\mathrm{VC}(\mathcal{H})\zeta d\ln(1+1/\gamma)+\ln(1/\delta)}{\varepsilon^{2}}\right)$.

Discussion of linear dependence on $\mathrm{VC}(\mathcal{H})$

Earlier general compression-based sample complexity bounds for robust learning with arbitrary perturbation sets exhibit a dependence on the dual VC-dimension of the hypothesis class, and therefore potentially an exponential dependence on $\mathrm{VC}(\mathcal{H})$ (Montasser et al., 2019). In our setting, we show that it is possible to avoid the dependence on the dual VC-dimension by exploiting both the metric structure of the domain set and the tolerant framework. In the full proof of Lemma 6.3, we show that if we can encode a classifier with small error (exponentially small with respect to the doubling dimension in the metric case) on the perturbed distribution with respect to the larger perturbation sets, then we can use smoothing to get a classifier that correctly classifies every point in the inner inflated sets. And, as for TPaS, the tolerant perspective is crucial for exploiting a smoothing step in the compression approach (through the guarantee from Property 1 or Lemma 5.3).

More specifically, we define a tolerant compression scheme (Definition 6.1) that naturally extends the classic definition of compression to the tolerant framework. The compression scheme we establish in the proof of Lemma 6.3 then borrows ideas from our perturb-and-smooth algorithm. Within the compression argument, we define the perturbed distribution over the sample that we want to compress with respect to the larger perturbation sets. We then use boosting to build a classifier with very small error with respect to this distribution. The nice property of boosting is that its error decreases exponentially with the number of iterations; as a result, we also get a linear dependence on the doubling dimension. This classifier can be encoded using $\tilde{O}\left(T\,\mathrm{VC}(\mathcal{H})\right)$ samples ($T$ rounds of boosting, and each weak classifier can be encoded using $\tilde{O}(\mathrm{VC}(\mathcal{H}))$ samples, since we can here use simple $\varepsilon$-approximations rather than invoking VC-theory in the dual space). Our decoder receives the description of these weak classifiers, combines them, and performs a final smoothing step. The smoothing step translates the exponentially small error with respect to the perturbed distribution to zero error with respect to the (inner) inflated set, thereby satisfying the requirement of a tolerant compression scheme.

References

  • Ashtiani et al. (2020) Hassan Ashtiani, Vinayak Pathak, and Ruth Urner. Black-box certification and learning under adversarial perturbations. In International Conference on Machine Learning, pages 388–398. PMLR, 2020.
  • Attias et al. (2019) Idan Attias, Aryeh Kontorovich, and Yishay Mansour. Improved generalization bounds for robust learning. In Algorithmic Learning Theory, ALT, pages 162–183, 2019.
  • Awasthi et al. (2019) Pranjal Awasthi, Abhratanu Dutta, and Aravindan Vijayaraghavan. On robustness to adversarial examples and polynomial optimization. In Advances in Neural Information Processing Systems, NeurIPS, pages 13760–13770, 2019.
  • Bhattacharjee et al. (2022) Robi Bhattacharjee, Max Hopkins, Akash Kumar, Hantao Yu, and Kamalika Chaudhuri. Robust empirical risk minimization with tolerance. arXiv preprint arXiv:2210.00635, 2022.
  • Blum et al. (2020) Avrim Blum, Travis Dick, Naren Manoj, and Hongyang Zhang. Random smoothing might be unable to certify $\ell_\infty$ robustness for high-dimensional images. Journal of Machine Learning Research, 21(211), 2020.
  • Blumer et al. (1989) Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K Warmuth. Learnability and the vapnik-chervonenkis dimension. Journal of the ACM (JACM), 36(4):929–965, 1989.
  • Bubeck and Sellke (2021) Sébastien Bubeck and Mark Sellke. A universal law of robustness via isoperimetry. Advances in Neural Information Processing Systems, 34:28811–28822, 2021.
  • Cao and Gong (2017) Xiaoyu Cao and Neil Zhenqiang Gong. Mitigating evasion attacks to deep neural networks via region-based classification. In Proceedings of the 33rd Annual Computer Security Applications Conference, pages 278–287, 2017.
  • Cohen et al. (2019) Jeremy M. Cohen, Elan Rosenfeld, and J. Zico Kolter. Certified adversarial robustness via randomized smoothing. In Proceedings of the 36th International Conference on Machine Learning, ICML, pages 1310–1320, 2019.
  • Cullina et al. (2018) Daniel Cullina, Arjun Nitin Bhagoji, and Prateek Mittal. Pac-learning in the presence of adversaries. In Advances in Neural Information Processing Systems, NeurIPS, pages 230–241, 2018.
  • Engstrom et al. (2019) Logan Engstrom, Brandon Tran, Dimitris Tsipras, Ludwig Schmidt, and Aleksander Madry. Exploring the landscape of spatial robustness. In International Conference on Machine Learning, pages 1802–1811. PMLR, 2019.
  • Fawzi and Frossard (2015) Alhussein Fawzi and Pascal Frossard. Manitest: Are classifiers really invariant? In British Machine Vision Conference (BMVC), number CONF, 2015.
  • Feige et al. (2015) Uriel Feige, Yishay Mansour, and Robert Schapire. Learning and inference in the presence of corrupted inputs. In Conference on Learning Theory, COLT, pages 637–657, 2015.
  • Goldwasser et al. (2020) Shafi Goldwasser, Adam Tauman Kalai, Yael Tauman Kalai, and Omar Montasser. Beyond perturbations: Learning guarantees with arbitrary adversarial test examples. arXiv preprint arXiv:2007.05145, 2020.
  • Goodfellow et al. (2018) Ian J. Goodfellow, Patrick D. McDaniel, and Nicolas Papernot. Making machine learning robust against adversarial inputs. Commun. ACM, 61(7):56–66, 2018.
  • Hanneke (2016) Steve Hanneke. The optimal sample complexity of pac learning. The Journal of Machine Learning Research, 17(1):1319–1333, 2016.
  • Haussler (1992) David Haussler. Decision theoretic generalizations of the pac model for neural net and other learning applications. Information and computation, 100(1):78–150, 1992.
  • Haussler and Welzl (1987) David Haussler and Emo Welzl. epsilon-nets and simplex range queries. Discret. Comput. Geom., 2:127–151, 1987.
  • Hendrycks et al. (2019) Dan Hendrycks, Norman Mu, Ekin Dogus Cubuk, Barret Zoph, Justin Gilmer, and Balaji Lakshminarayanan. Augmix: A simple data processing method to improve robustness and uncertainty. In International Conference on Learning Representations, 2019.
  • Hosseini and Poovendran (2018) Hossein Hosseini and Radha Poovendran. Semantic adversarial examples. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 1614–1619, 2018.
  • Inkawhich et al. (2019) Nathan Inkawhich, Wei Wen, Hai Helen Li, and Yiran Chen. Feature space perturbations yield more transferable adversarial examples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7066–7074, 2019.
  • Kanbak et al. (2018) Can Kanbak, Seyed-Mohsen Moosavi-Dezfooli, and Pascal Frossard. Geometric robustness of deep networks: Analysis and improvement. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4441–4449. IEEE, 2018.
  • Kang et al. (2019) Daniel Kang, Yi Sun, Dan Hendrycks, Tom Brown, and Jacob Steinhardt. Testing robustness against unforeseen adversaries. arXiv preprint arXiv:1908.08016, 2019.
  • Lecuyer et al. (2019) Mathias Lecuyer, Vaggelis Atlidakis, Roxana Geambasu, Daniel Hsu, and Suman Jana. Certified robustness to adversarial examples with differential privacy. In 2019 IEEE Symposium on Security and Privacy (SP), pages 656–672. IEEE, 2019.
  • Levine and Feizi (2020) Alexander Levine and Soheil Feizi. Robustness certificates for sparse adversarial attacks by randomized ablation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 4585–4593, 2020.
  • Li et al. (2019) Bai Li, Changyou Chen, Wenlin Wang, and Lawrence Carin. Certified adversarial robustness with additive noise. Advances in Neural Information Processing Systems, 32:9464–9474, 2019.
  • Liu et al. (2018) Xuanqing Liu, Minhao Cheng, Huan Zhang, and Cho-Jui Hsieh. Towards robust neural networks via random self-ensemble. In Proceedings of the European Conference on Computer Vision (ECCV), pages 369–385, 2018.
  • Luukkainen and Saksman (1998) Jouni Luukkainen and Eero Saksman. Every complete doubling metric space carries a doubling measure. Proceedings of the American Mathematical Society, 126(2):531–534, 1998.
  • Montasser et al. (2019) Omar Montasser, Steve Hanneke, and Nathan Srebro. VC classes are adversarially robustly learnable, but only improperly. In Conference on Learning Theory, COLT, pages 2512–2530, 2019.
  • Montasser et al. (2020a) Omar Montasser, Surbhi Goel, Ilias Diakonikolas, and Nathan Srebro. Efficiently learning adversarially robust halfspaces with noise. arXiv preprint arXiv:2005.07652, 2020a.
  • Montasser et al. (2020b) Omar Montasser, Steve Hanneke, and Nati Srebro. Reducing adversarially robust learning to non-robust pac learning. In NeurIPS, 2020b.
  • Montasser et al. (2021a) Omar Montasser, Steve Hanneke, and Nathan Srebro. Adversarially robust learning with unknown perturbation sets. arXiv preprint arXiv:2102.02145, 2021a.
  • Montasser et al. (2021b) Omar Montasser, Steve Hanneke, and Nathan Srebro. Transductive robust learning guarantees. arXiv preprint arXiv:2110.10602, 2021b.
  • Moran and Yehudayoff (2016) Shay Moran and Amir Yehudayoff. Sample compression schemes for vc classes. Journal of the ACM (JACM), 63(3):1–10, 2016.
  • Raman et al. (2022) Vinod Raman, Unique Subedi, and Ambuj Tewari. Probabilistically robust pac learning. arXiv preprint arXiv:2211.05656, 2022.
  • Sabour et al. (2016) Sara Sabour, Yanshuai Cao, Fartash Faghri, and David J Fleet. Adversarial manipulation of deep representations. In ICLR (Poster), 2016.
  • Salman et al. (2019) Hadi Salman, Jerry Li, Ilya P. Razenshteyn, Pengchuan Zhang, Huan Zhang, Sébastien Bubeck, and Greg Yang. Provably robust deep learning via adversarially trained smoothed classifiers. In Advances in Neural Information Processing Systems 32, NeurIPS, pages 11289–11300, 2019.
  • Schapire and Freund (2013) Robert E Schapire and Yoav Freund. Boosting: Foundations and algorithms. Kybernetes, 2013.
  • Shalev-Shwartz and Ben-David (2014) Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
  • Simon (2015) Hans U Simon. An almost optimal PAC algorithm. In Conference on Learning Theory, pages 1552–1563. PMLR, 2015.
  • Song et al. (2018) Yang Song, Rui Shu, Nate Kushman, and Stefano Ermon. Constructing unrestricted adversarial examples with generative models. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 8322–8333, 2018.
  • Szegedy et al. (2014) Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian J. Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In 2nd International Conference on Learning Representations, ICLR, 2014.
  • Valiant (1984) Leslie G. Valiant. A theory of the learnable. Commun. ACM, 27(11):1134–1142, 1984.
  • Vapnik and Chervonenkis (1971) V. N. Vapnik and A. Ya. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability & Its Applications, 16(2):264–280, 1971.
  • Xiao et al. (2018) Chaowei Xiao, Jun-Yan Zhu, Bo Li, Warren He, Mingyan Liu, and Dawn Song. Spatially transformed adversarial examples. In International Conference on Learning Representations, 2018.
  • Xu et al. (2020) Qiuling Xu, Guanhong Tao, Siyuan Cheng, and Xiangyu Zhang. Towards feature space adversarial attack. arXiv preprint arXiv:2004.12385, 2020.
  • Yaeger et al. (1996) Larry Yaeger, Richard Lyon, and Brandyn Webb. Effective training of a neural network character classifier for word recognition. Advances in Neural Information Processing Systems, 9:807–816, 1996.
  • Yang et al. (2020a) Greg Yang, Tony Duan, J Edward Hu, Hadi Salman, Ilya Razenshteyn, and Jerry Li. Randomized smoothing of all shapes and sizes. In Proceedings of the 37th International Conference on Machine Learning, pages 10693–10705, 2020a.
  • Yang et al. (2020b) Yao-Yuan Yang, Cyrus Rashtchian, Hongyang Zhang, Ruslan Salakhutdinov, and Kamalika Chaudhuri. A closer look at accuracy vs. robustness. In Proceedings of the 34th International Conference on Neural Information Processing Systems, pages 8588–8601, 2020b.
  • Zheng et al. (2016) Stephan Zheng, Yang Song, Thomas Leung, and Ian Goodfellow. Improving the robustness of deep neural networks via stability training. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4480–4488, 2016.

Appendix A Standard results from VC theory

Let X be a domain. For a hypothesis h and B\subseteq X, let h(B)=\left(h(b)\right)_{b\in B}.

Definition A.1 (VC-dimension).

We say \mathcal{H} shatters B\subseteq X if |\{h(B):h\in\mathcal{H}\}|=2^{|B|}. The VC-dimension of \mathcal{H}, denoted by \mathrm{VC}(\mathcal{H}), is defined to be the supremum of the sizes of sets shattered by \mathcal{H}.

Theorem A.2 (Existence of Realizable PAC-learners Hanneke (2016); Simon (2015); Blumer et al. (1989)).

Let \mathcal{H} be a hypothesis class with bounded VC-dimension. Then \mathcal{H} is PAC-learnable in the realizable setting using O\left(\frac{\mathrm{VC}(\mathcal{H})+\log(1/\delta)}{\varepsilon}\right) samples.

Theorem A.3 (Existence of Agnostic PAC-learners Haussler (1992)).

Let \mathcal{H} be a hypothesis class with bounded VC-dimension. Then \mathcal{H} is PAC-learnable in the agnostic setting using O\left(\frac{\mathrm{VC}(\mathcal{H})+\log(1/\delta)}{\varepsilon^{2}}\right) samples.
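To make the contrast between the two regimes concrete, the following sketch (our own illustration) evaluates the dominating terms of both bounds. The O(\cdot) constants are suppressed in the theorems, so we set them to 1 here; the absolute numbers carry no meaning, only the 1/\varepsilon versus 1/\varepsilon^{2} scaling is visible.

import math

# Dominating terms of Theorems A.2 (realizable) and A.3 (agnostic);
# O(.)-constants are set to 1, so only the scaling is meaningful.
def realizable_samples(vc, eps, delta):
    return (vc + math.log(1 / delta)) / eps

def agnostic_samples(vc, eps, delta):
    return (vc + math.log(1 / delta)) / eps ** 2

print(realizable_samples(10, 0.05, 0.01))  # ~292
print(agnostic_samples(10, 0.05, 0.01))    # ~5842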

Appendix B Metric spaces

Definition B.1.

A metric space (X,\mathrm{dist}) is called a doubling metric if there exists a constant M such that every ball of radius r in it can be covered by at most M balls of radius r/2. The quantity \log_{2}M is called the doubling dimension.

Definition B.2.

For a metric space (X,\mathrm{dist}), a measure \mu defined on X is called a doubling measure if there exists a constant C such that for all x\in X and r\in\mathbb{R}^{+}, we have 0<\mu({\mathcal{B}}_{2r}(x))\leq C\cdot\mu({\mathcal{B}}_{r}(x))<\infty. In this case, \mu is called C-doubling.

It can be shown (Luukkainen and Saksman, 1998) that every complete metric space with doubling dimension d carries a C-doubling measure \mu for some C\leq 2^{cd}, where c is a universal constant. For example, Euclidean spaces with an \ell_{p} distance metric are complete, and the Lebesgue measure is a doubling measure on them.

The following lemmas follow straightforwardly from the definitions of doubling metrics and doubling measures.

Lemma B.3.

Let (X,\mathrm{dist}) be a doubling metric equipped with a C-doubling measure \mu. Then for all x\in X, r>0, and \alpha>1, we have \mu({\mathcal{B}}_{\alpha r}(x))\leq C^{\lceil\log_{2}\alpha\rceil}\cdot\mu({\mathcal{B}}_{r}(x)).

Proof.

Since \mu is a measure, B\subseteq B^{\prime} implies \mu(B)\leq\mu(B^{\prime}) for any B,B^{\prime}\subseteq X. Let R=2^{\lceil\log_{2}\alpha\rceil}; clearly R\geq\alpha, and therefore {\mathcal{B}}_{\alpha r}(x)\subseteq{\mathcal{B}}_{Rr}(x). Doubling the radius of {\mathcal{B}}_{r}(x) a total of \lceil\log_{2}\alpha\rceil times yields {\mathcal{B}}_{Rr}(x), which means \mu({\mathcal{B}}_{Rr}(x))\leq C^{\lceil\log_{2}\alpha\rceil}\cdot\mu({\mathcal{B}}_{r}(x)). Since {\mathcal{B}}_{\alpha r}(x)\subseteq{\mathcal{B}}_{Rr}(x), we get the desired result. ∎
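As a quick numeric illustration of Lemma B.3 (our own sanity check, not part of the proof): for the Lebesgue measure on \mathbb{R}^{d} one has \mu({\mathcal{B}}_{\alpha r})/\mu({\mathcal{B}}_{r})=\alpha^{d}, and C=2^{d} is a valid doubling constant, so the claimed bound reads \alpha^{d}\leq 2^{d\lceil\log_{2}\alpha\rceil}.

import math

# Check Lemma B.3 for the Lebesgue measure on R^d, where
# mu(B_{alpha*r}) / mu(B_r) = alpha**d and C = 2**d is a doubling constant.
d = 3
C = 2 ** d
for alpha in [1.5, 3.0, 7.0, 20.0]:
    ratio = alpha ** d                        # mu(B_{alpha r}) / mu(B_r)
    bound = C ** math.ceil(math.log2(alpha))  # C^{ceil(log2 alpha)}
    assert ratio <= bound
    print(f"alpha={alpha}: ratio={ratio:.1f} <= bound={bound}")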

Lemma B.4.

Let (X,\mathrm{dist}) be a doubling metric equipped with a C-doubling measure \mu. Let x,x^{\prime}\in X, r>0, and \alpha>1 be such that {\mathcal{B}}_{r}(x^{\prime})\subseteq{\mathcal{B}}_{\alpha r}(x). Then \mu({\mathcal{B}}_{\alpha r}(x))\leq C^{\lceil\log_{2}(2\alpha)\rceil}\cdot\mu({\mathcal{B}}_{r}(x^{\prime})).

Proof.

By Lemma B.3 (applied with center x^{\prime}, radius r, and factor 2\alpha), all we need to show is that {\mathcal{B}}_{\alpha r}(x)\subseteq{\mathcal{B}}_{2\alpha r}(x^{\prime}). Indeed, let y\in{\mathcal{B}}_{\alpha r}(x) be any point. By the triangle inequality, we have

\mathrm{dist}(x^{\prime},y)\leq\mathrm{dist}(x,x^{\prime})+\mathrm{dist}(x,y)\leq\mathrm{dist}(x,x^{\prime})+\alpha r.

Moreover, since x^{\prime}\in{\mathcal{B}}_{r}(x^{\prime})\subseteq{\mathcal{B}}_{\alpha r}(x), we have \mathrm{dist}(x,x^{\prime})\leq\alpha r. Substituting into the inequality above, we get \mathrm{dist}(x^{\prime},y)\leq 2\alpha r, which means y\in{\mathcal{B}}_{2\alpha r}(x^{\prime}). ∎

Finally, we also get:

Lemma B.5.

For any family \mathcal{M} of complete, doubling metric spaces, there exist constants c_{1},c_{2}>0 such that for any metric space (X,\mathrm{dist})\in\mathcal{M} with doubling dimension d, there exists a measure \mu with the following property: if a ball {\mathcal{B}}_{r} of radius r>0 is completely contained inside a ball {\mathcal{B}}_{\alpha r} of radius \alpha r (with a potentially different center) for some \alpha>1, then 0<\mu({\mathcal{B}}_{\alpha r})\leq(c_{1}\alpha)^{c_{2}d}\mu({\mathcal{B}}_{r}).

Proof.

We prove this when \mathcal{M} is the family of all complete, doubling metric spaces. Let \mu be the C-doubling measure with C\leq 2^{cd} guaranteed by Luukkainen and Saksman (1998), and apply Lemma B.4 to the two balls. We have C^{\lceil\log_{2}(2\alpha)\rceil}\leq(2\alpha)^{2\log_{2}C} (using a^{\log_{2}b}=b^{\log_{2}a} and \lceil\log_{2}(2\alpha)\rceil\leq 2\log_{2}(2\alpha) for \alpha>1). Since \log_{2}C\leq cd, we get (2\alpha)^{2\log_{2}C}\leq(2\alpha)^{2cd}. Thus we can take c_{1}=2 and c_{2}=2c. ∎

Corollary B.6.

Suppose there is a constant \alpha_{0}>1 such that \alpha\geq\alpha_{0} is guaranteed. Then the bound in Lemma B.5 can be further simplified to 0<\mu({\mathcal{B}}_{\alpha r})\leq\alpha^{\zeta d}\mu({\mathcal{B}}_{r}), where \zeta depends only on \mathcal{M} and \alpha_{0}. Furthermore, if c_{1}=1 then we can take \alpha_{0}=1.

Proof.

(c_{1}\alpha)^{c_{2}d}=\alpha^{c_{2}d(1+\log_{\alpha}c_{1})}\leq\alpha^{c_{2}d(1+\log_{\alpha_{0}}c_{1})}=\alpha^{\zeta d} for \zeta=c_{2}(1+\log_{\alpha_{0}}c_{1}). If c_{1}=1, then \zeta=c_{2} works for all \alpha>1. ∎
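The exponent identity in the first step is easy to verify numerically; the snippet below (our own sanity check, with arbitrary sample values) confirms (c_{1}\alpha)^{c_{2}d}=\alpha^{c_{2}d(1+\log_{\alpha}c_{1})}.

import math

# Numeric check of the exponent identity used in the proof of Corollary B.6.
c1, c2, d, alpha = 2.0, 3.0, 4, 1.5
lhs = (c1 * alpha) ** (c2 * d)
rhs = alpha ** (c2 * d * (1 + math.log(c1, alpha)))
assert abs(lhs - rhs) < 1e-9 * lhs
print(lhs, rhs)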

Appendix C Proof of Lemma 5.3

Let X_{\text{err}}=\{z\in{\mathcal{B}}_{r(1+\gamma)}(x)\mid g(z)\neq y\}. Then we have \Sigma_{g,y}(x)=\mathbb{E}_{z\sim{\mathcal{B}}_{r(1+\gamma)}(x)}\mathds{1}\left[g(z)\neq y\right]=\frac{\mu(X_{\text{err}})}{\mu({\mathcal{B}}_{r(1+\gamma)}(x))}. Further, for all z\in{\mathcal{B}}_{r}(x), we have \mathbb{E}_{z^{\prime}\sim{\mathcal{B}}_{r\gamma}(z)}\mathds{1}\left[g(z^{\prime})\neq y\right]=\frac{\mu(X_{\text{err}}\cap{\mathcal{B}}_{r\gamma}(z))}{\mu({\mathcal{B}}_{r\gamma}(z))}.

Let z\in{\mathcal{B}}_{r}(x). Since this implies {\mathcal{B}}_{r\gamma}(z)\subseteq{\mathcal{B}}_{r(1+\gamma)}(x), the worst case occurs when X_{\text{err}}\subseteq{\mathcal{B}}_{r\gamma}(z). Therefore,

\sigma_{g,y}(z)=\mathbb{E}_{z^{\prime}\sim{\mathcal{B}}_{r\gamma}(z)}\mathds{1}\left[g(z^{\prime})\neq y\right]  (3)
=\frac{\mu(X_{\text{err}}\cap{\mathcal{B}}_{r\gamma}(z))}{\mu({\mathcal{B}}_{r\gamma}(z))}
\leq\frac{\mu(X_{\text{err}})}{\mu({\mathcal{B}}_{r\gamma}(z))}
\leq\frac{\Sigma_{g,y}(x)\cdot\mu({\mathcal{B}}_{r(1+\gamma)}(x))}{\mu({\mathcal{B}}_{r\gamma}(z))}
\leq\Sigma_{g,y}(x)\cdot\left(\frac{1+\gamma}{\gamma}\right)^{\zeta d},

where the last inequality is implied by Lemma B.5 (in the simplified form of Corollary B.6) with \alpha=(1+\gamma)/\gamma. Thus, \Sigma_{g,y}(x)\leq\frac{1}{3}\cdot\left(\frac{1+\gamma}{\gamma}\right)^{-\zeta d} implies that \sigma_{g,y}(z)\leq 1/3, as claimed. ∎
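For intuition, both quantities can be estimated by Monte Carlo sampling. The sketch below does so for a toy one-dimensional instance, where balls are intervals under the Lebesgue measure; the threshold classifier g and all numeric values are hypothetical choices of ours, used only to illustrate that \sigma stays within the proven factor of \Sigma.

import random

# Monte Carlo estimates of Sigma_{g,y}(x) and sigma_{g,y}(z) for 1-D
# interval balls; g is a hypothetical threshold classifier.
def g(z):
    return 1 if z > 1.0 else 0

def mc_error(center, radius, y, n=200_000):
    # E_{z ~ B_radius(center)} 1[g(z) != y] under the uniform measure
    return sum(g(random.uniform(center - radius, center + radius)) != y
               for _ in range(n)) / n

x, y, r, gamma = 0.0, 0, 0.9, 0.5
Sigma = mc_error(x, r * (1 + gamma), y)  # error mass in B_{r(1+gamma)}(x)
z = 0.7                                  # a point in B_r(x)
sigma = mc_error(z, r * gamma, y)        # smoothed error around z
# For the 1-D Lebesgue measure the ratio bound is (1+gamma)/gamma (zeta*d = 1).
print(Sigma, sigma, sigma <= Sigma * (1 + gamma) / gamma)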

Appendix D Compression based bounds

D.1 Proof of Lemma 6.2

To prove the generalization bound for tolerant learning, we employ the following lemma, which establishes generalization guarantees for compression schemes under adversarial losses:

Lemma D.1 (Lemma 11 of Montasser et al., 2019).

For any k\in\mathbb{N} and fixed function \rho:\bigcup_{i=1}^{k}(X\times Y)^{i}\to Y^{X}, for any distribution P over X\times Y and any m\in\mathbb{N}, with probability at least (1-\delta) over an i.i.d. sample S=((x_{1},y_{1}),(x_{2},y_{2}),\ldots,(x_{m},y_{m})): if there exist indices i_{1},i_{2},\ldots,i_{k} such that

{\mathcal{L}^{\mathcal{U}}_{S}}(\rho((x_{i_{1}},y_{i_{1}}),(x_{i_{2}},y_{i_{2}}),\ldots,(x_{i_{k}},y_{i_{k}})))=0,

then the robust loss of the decompression with respect to PP is bounded by

{\mathcal{L}^{\mathcal{U}}_{P}}(\rho((x_{i_{1}},y_{i_{1}}),(x_{i_{2}},y_{i_{2}}),\ldots,(x_{i_{k}},y_{i_{k}})))\leq\frac{1}{m-k}\left(k\ln(m)+\ln(1/\delta)\right).

The above lemma implies that if (\kappa,\rho) is a compression scheme for class \mathcal{H} and robust loss \ell^{\mathcal{U}} that compresses data sets of size m to at most k\ln(m) data points, then the sample complexity (omitting logarithmic factors) of robustly learning \mathcal{H} in the realizable case is bounded by

m(\varepsilon,\delta)=\tilde{O}\left(\frac{k+\ln(1/\delta)}{\varepsilon}\right).

For the tolerant setting, since every sample that is realizable with respect to \ell^{\mathcal{V}} is also realizable with respect to \ell^{\mathcal{U}}, if a (\mathcal{U},\mathcal{V})-tolerant compression scheme compresses to at most k\ln(m) data points and decompresses every \ell^{\mathcal{V}}-realizable sample S to a function with \ell^{\mathcal{U}}-loss 0 on S, then the lemma implies the above bound for the (\mathcal{U},\mathcal{V})-tolerant sample complexity of learning \mathcal{H}.
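The bound of Lemma D.1 is easy to evaluate directly. The following helper (a minimal sketch of ours, with arbitrary example numbers) computes the guaranteed robust loss for a given compression size.

import math

# Generalization bound of Lemma D.1: (1/(m-k)) * (k*ln(m) + ln(1/delta)).
def compression_bound(m, k, delta):
    assert m > k
    return (k * math.log(m) + math.log(1 / delta)) / (m - k)

# e.g., a scheme retaining k = 50 of m = 100000 sample points:
print(compression_bound(100_000, 50, 0.01))  # ~0.0058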

D.2 Proof of Lemma 6.3

The proof of this lemma employs the notions of a sample being an \varepsilon-net or an \varepsilon-approximation for a hypothesis class \mathcal{H}. A labeled data set S=((x_{1},y_{1}),(x_{2},y_{2}),\ldots,(x_{m},y_{m})) is an \varepsilon-net for class \mathcal{H} with respect to distribution P over X\times Y if for every hypothesis h\in\mathcal{H} with {\mathcal{L}^{0/1}_{P}}(h)\geq\varepsilon, there exists an index j and (x_{j},y_{j})\in S with h(x_{j})\neq y_{j}. S is an \varepsilon-approximation for class \mathcal{H} with respect to distribution P over X\times Y if for every hypothesis h\in\mathcal{H} we have |{\mathcal{L}^{0/1}_{S}}(h)-{\mathcal{L}^{0/1}_{P}}(h)|\leq\varepsilon. Standard VC-theory tells us that, for classes with bounded VC-dimension, sufficiently large samples from P are \varepsilon-nets or \varepsilon-approximations with high probability (Haussler and Welzl, 1987).
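The two definitions translate directly into code. The checker below (our own illustration for a finite hypothesis class, where true_loss[h] stands in for {\mathcal{L}^{0/1}_{P}}(h)) makes the distinction concrete.

# eps-net: every eps-bad hypothesis is "caught" by some sample point;
# eps-approximation: every empirical loss is within eps of the true loss.
def is_eps_net(sample, H, true_loss, eps):
    return all(any(h(x) != y for (x, y) in sample)
               for h in H if true_loss[h] >= eps)

def is_eps_approximation(sample, H, true_loss, eps):
    m = len(sample)
    return all(abs(sum(h(x) != y for (x, y) in sample) / m - true_loss[h])
               <= eps for h in H)

h = lambda x: 1 if x >= 0.5 else 0
print(is_eps_net([(0.4, 1)], [h], {h: 0.2}, 0.1))  # True: h errs on (0.4, 1)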

Proof.

We will employ a boosting-based approach to establish the claimed compression sizes. Let S=((x_{1},y_{1}),(x_{2},y_{2}),\ldots,(x_{m},y_{m})) be a data set that is \ell^{\mathcal{V}}-realizable with respect to \mathcal{H}. We let S_{\mathcal{V}} denote an “inflated data set” that contains all domain points in the perturbation sets of the x_{i}\in S^{X}, that is,

S_{\mathcal{V}}^{X}:=\bigcup_{i=1}^{m}\mathcal{V}(x_{i}).

Every point z\in S_{\mathcal{V}}^{X} is assigned the label y=y_{i} of the minimally-indexed (x_{i},y_{i})\in S with z\in\mathcal{V}(x_{i}), and we set S_{\mathcal{V}} to be the resulting collection of labeled data points. (Note that since the sample S is assumed to be \ell^{\mathcal{V}}-realizable, assigning z the label of some other corresponding data point, in case z\in\mathcal{V}(x_{i})\cap\mathcal{V}(x_{j}) for x_{i}\neq x_{j}, would not induce any inconsistencies.) Now let D be the probability measure over S_{\mathcal{V}}^{X} defined by first sampling an index j uniformly from [m]=\{1,2,\ldots,m\} and then sampling a domain point z\sim\mathcal{V}(x_{j}) from the \mathcal{V}-perturbation set around the j-th sample point in S. Note that this implies that if D(B)\leq\beta/m for some subset B\subseteq S_{\mathcal{V}}^{X}, then

\mathbb{P}_{z\sim\mathcal{V}(x)}[z\in B]\leq\beta  (4)

for all x\in S^{X}.
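For concreteness, D can be sampled as follows (a minimal sketch of ours, taking \mathcal{V}(x) to be the one-dimensional interval [x-v,x+v] purely for illustration):

import random

# Sampler for the distribution D: pick an index uniformly from [m],
# then a uniform perturbation z ~ V(x_j) with V(x) = [x - v, x + v].
def sample_from_D(S_X, v):
    j = random.randrange(len(S_X))                 # uniform j in [m]
    return random.uniform(S_X[j] - v, S_X[j] + v)  # z ~ V(x_j)

print(sample_from_D([0.0, 2.0, 5.0], 0.3))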

We will now show that, by means of a compression scheme, we can encode a hypothesis g with binary loss

{\mathcal{L}^{0/1}_{D}}(g)\leq\beta/m.  (5)

Property 1 together with Equation 4 then implies that the resulting \mathcal{W}-smoothed function \bar{g} has \mathcal{U}-robust loss 0 on the sample S, that is, {\mathcal{L}^{\mathcal{U}}_{S}}(\bar{g})=0. Since the smoothing is a deterministic operation once g is fixed, this implies the existence of a (\mathcal{U},\mathcal{V})-tolerant compression scheme.

Standard VC-theory tells us that, for a class G of bounded VC-dimension, for any distribution over X\times Y and any \varepsilon,\delta>0, with probability at least (1-\delta) an i.i.d. sample of size \Theta\left(\frac{\mathrm{VC}(G)+\ln(1/\delta)}{\varepsilon^{2}}\right) is an \varepsilon-approximation for the class G (Haussler and Welzl, 1987). In particular, this implies that there exists a finite subset S_{\mathcal{V}}^{f}\subset S_{\mathcal{V}} of size at most \frac{4m^{2}C\cdot\mathrm{VC}(G)}{\beta^{2}} (for some constant C) with the property that any classifier g\in G with empirical (binary) loss at most \beta/2m on S_{\mathcal{V}}^{f} has loss {\mathcal{L}^{0/1}_{D}}(g)\leq\beta/m with respect to the distribution D. We will choose such a set S_{\mathcal{V}}^{f} for the class G of T-majority votes over \mathcal{H} with T=18\ln(\frac{2m}{\beta}). That is,

G=\left\{g\in Y^{X}\;\middle|\;\exists h_{1},h_{2},\ldots,h_{T}\in\mathcal{H}:\ g(x)=\mathds{1}\left[\tfrac{1}{T}\sum_{i=1}^{T}h_{i}(x)\geq 1/2\right]\right\}.

The VC-dimension of G is bounded by (Shalev-Shwartz and Ben-David, 2014)

\mathrm{VC}(G)={\mathcal{O}}(T\cdot\mathrm{VC}(\mathcal{H})\log(T\,\mathrm{VC}(\mathcal{H})))={\mathcal{O}}\left(18\ln\left(\tfrac{2m}{\beta}\right)\mathrm{VC}(\mathcal{H})\log\left(18\ln\left(\tfrac{2m}{\beta}\right)\mathrm{VC}(\mathcal{H})\right)\right).

We will now show how to obtain the classifier g by means of a boosting approach on the finite data set S_{\mathcal{V}}^{f}. More specifically, we will use the boost-by-majority method. This method outputs a T-majority vote g(x)=\mathds{1}\left[\tfrac{1}{T}\sum_{i=1}^{T}h_{i}(x)\geq 1/2\right] over weak learners h_{i}, which in our case will be hypotheses from \mathcal{H}. After T iterations with \gamma-weak learners, the empirical loss over the sample S_{\mathcal{V}}^{f} is bounded by \mathrm{e}^{-2\gamma^{2}T} (see Section 13.1 in Schapire and Freund, 2013). Thus, with \gamma=1/6 and T=18\ln(\frac{2m}{\beta}), we obtain

{\mathcal{L}^{0/1}_{S_{\mathcal{V}}^{f}}}(g)\leq\frac{\beta}{2m},

which, by the choice of S_{\mathcal{V}}^{f}, implies

{\mathcal{L}^{0/1}_{D}}(g)\leq\beta/m,

which is what we needed to show according to Equation 5.
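The parameter choice is exactly what makes the exponent collapse: with \gamma=1/6 and T=18\ln(2m/\beta) we get 2\gamma^{2}T=\ln(2m/\beta), so \mathrm{e}^{-2\gamma^{2}T}=\beta/2m. A quick numeric confirmation (arbitrary example values of ours):

import math

# With gamma = 1/6 and T = 18*ln(2m/beta), the boost-by-majority error
# bound e^{-2 gamma^2 T} equals beta/(2m) exactly.
m, beta, gamma = 10_000, 0.1, 1 / 6
T = 18 * math.log(2 * m / beta)
assert math.isclose(math.exp(-2 * gamma ** 2 * T), beta / (2 * m))
print(T, math.exp(-2 * gamma ** 2 * T), beta / (2 * m))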

It remains to argue that the weak learners employed in the boosting procedure can be encoded by a small number of sample points from the original sample S. For this part, we will employ a technique introduced earlier for robust compression (Montasser et al., 2019). Recall that the set S is \mathcal{V}-robustly realizable, which implies that the set S_{\mathcal{V}}^{f} is (binary-loss) realizable by \mathcal{H}. By standard VC-theory, for every distribution D_{i} over S_{\mathcal{V}}^{f}, there exists an \varepsilon-net of size {\mathcal{O}}(\mathrm{VC}(\mathcal{H})/\varepsilon) (Haussler and Welzl, 1987). Thus, for every distribution D_{i} over S_{\mathcal{V}}^{f} (as may occur during the boosting procedure), there exists a subsample S_{i} of S_{\mathcal{V}}^{f} of size at most n={\mathcal{O}}(3\,\mathrm{VC}(\mathcal{H}))={\mathcal{O}}(\mathrm{VC}(\mathcal{H})) with the property that every hypothesis from \mathcal{H} that is consistent with S_{i} has binary loss at most 1/3 with respect to D_{i} (and can thus serve as a weak learner with margin \gamma=1/6 in the above procedure). Now, for every labeled point (x,y)\in S_{i}, there is a sample point (x_{j},y_{j})\in S in the original sample such that x\in\mathcal{V}(x_{j}) and y=y_{j}. Let S^{\prime}_{i} be the collection of these corresponding original sample points. Note that any hypothesis h\in\mathcal{H} that is \mathcal{V}-robustly consistent with S^{\prime}_{i} is consistent with S_{i}. Therefore we can use the n original data points in S^{\prime}_{i} to encode the weak learner h_{i} (for the decoding, any \mathcal{V}-robust ERM hypothesis can be chosen to obtain h_{i}).

To summarize, we compress the sample S to the sequence S^{\prime}_{1},S^{\prime}_{2},\ldots,S^{\prime}_{T} of n\cdot T={\mathcal{O}}(\mathrm{VC}(\mathcal{H})\ln(\frac{m}{\beta})) sample points from S. To decode, we obtain the function g as a majority vote over the weak learners h_{i} and proceed to obtain the \mathcal{W}-smoothed function \bar{g}. This function \bar{g} satisfies {\mathcal{L}^{\mathcal{U}}_{S}}(\bar{g})=0, and by this we have established the existence of a (\mathcal{U},\mathcal{V})-tolerant compression scheme of size {\mathcal{O}}(\mathrm{VC}(\mathcal{H})\ln(\frac{m}{\beta})), as claimed. ∎