
Learning Losses for Strategic Classification

Tosca Lechner,1 Ruth Urner, 2
Abstract

Strategic classification, i.e. classification under possible strategic manipulations of features, has received a lot of attention from both the machine learning and the game theory community. Most works focus on analysing properties of the optimal decision rule under such manipulations. In our work we take a learning theoretic perspective, focusing on the sample complexity needed to learn a good decision rule which is robust to strategic manipulation. We perform this analysis by introducing a novel loss function, the strategic manipulation loss, which takes into account both the accuracy of the final decision rule and its vulnerability to manipulation. We analyse the sample complexity for a known graph of possible manipulations in terms of the complexity of the function class and the manipulation graph. Additionally, we initiate the study of learning under unknown manipulation capabilities of the involved agents. Using techniques from transfer learning theory, we define a similarity measure for manipulation graphs and show that learning outcomes are robust with respect to small changes in the manipulation graph. Lastly, we analyse the sample complexity of learning the manipulation capabilities of agents with respect to this similarity measure, providing novel guarantees for strategic classification with respect to an unknown manipulation graph.

1 Introduction

In many scenarios where a decision rule is learned from data, the publication of this decision rule has an effect on the distribution of the underlying population that may harm the quality of the rule. For example, applicants for a loan may change details in their bank account to receive a better score, people may join a gym or sports club without ever intending to participate, in order to get a better health insurance policy, or students may employ different strategies such as registering to volunteer, or joining rare clubs (without attending either) to appear better on college applications.

Effects and incentives resulting from strategic behavior in classification scenarios have received substantial attention from both machine learning and game-theoretic perspectives in recent years (Hardt et al. 2016; Milli et al. 2019; Haghtalab et al. 2020; Tsirtsis and Rodriguez 2020; Zhang and Conitzer 2021). Most works study this as a two-player game between an institution that publishes a decision rule and a population of best responding agents to be classified. Given the classifier, these agents may change their feature representations in order to obtain a more favorable classification outcome. To prevent the induced additional classification error the institution will publish a modified predictor, not transparently reflecting the underlying intent and potentially causing additional harm to sub-populations that may be less equipped to perform the required changes to their representations (Hu, Immorlica, and Vaughan 2019).

In this work, we propose a learning theoretic take on this scenario. In machine learning, it is common to model desiderata for a learning outcome in form of a loss function. The goal of the learning process is then to identify a predictor that minimizes this loss in expectation over a data-generating distribution. Thus, we here define a novel loss function for learning under strategic manipulations. The aim of this loss is to induce a combination of two (potentially competing) requirements: achieving low classification error taking into account that individuals being classified may manipulate their features, and discouraging such feature manipulations overall. Prior work has shown that these may be conflicting requirements (Zhang and Conitzer 2021). Our proposed loss function thus aims to induce a balanced combination of these requirements rather than strictly enforcing one, and then only observing the effect on the other (as is implied by frameworks that aim to minimize classification error under best-responding agents (Hardt et al. 2016; Milli et al. 2019) or enforcing incentive-compatibility (Zhang and Conitzer 2021)).

To define our strategic manipulation loss we employ an abstraction of the plausible feature manipulations in form of a manipulation graph (Zhang and Conitzer 2021). An edge $\mathbf{x}\to\mathbf{x}^{\prime}$ in this graph indicates that an individual with feature vector $\mathbf{x}$ may change their features to present as $\mathbf{x}^{\prime}$ if this leads to a positive classification, for example because the utility of the resulting change in classification exceeds the cost of moving between these vectors. We define our strategic loss in dependence of this graph and carefully motivate the proposed loss in terms of requirements and effects from previous literature. We then analyze the sample complexity of learning with this loss function. We identify sufficient conditions for proper learnability that take into account the interplay between a hypothesis class and an underlying manipulation graph. Moreover, we show that every class of finite VC-dimension is learnable with respect to this loss by drawing a connection to results on learning under adversarial perturbations (Montasser, Hanneke, and Srebro 2019). This may be surprising, since it contrasts with learning VC-classes under the sole requirement of minimizing classification error under strategic feature manipulations, which has been shown to render some VC-classes unlearnable (Zhang and Conitzer 2021). Thus, our analysis shows that balancing classification error with disincentivizing feature manipulations can reduce the complexity of the learning problem.

Moreover, we show that the quality of learning outcomes under our loss function is robust to inaccuracies in the manipulation graph. Such a robustness property is important, since an assumed graph might not exactly reflect agents’ responses. In fact, it has recently been argued that the model of best-responding agents is not backed up by empirical observations on agent distributions after strategic responses (Jagadeesan, Mendler-Dünner, and Hardt 2021). Moreover, different sub-populations may have differences in their manipulation graphs (different capabilities to manipulate their features) or a manipulation graph may be inferred from data and therefore exhibit statistical errors. We introduce a novel distance measure between manipulation graphs by drawing connections to learning bounds in transfer learning (Ben-David et al. 2010; Mansour, Mohri, and Rostamizadeh 2009) and show that the strategic loss of a learned predictor when employing a different manipulation graph can be bounded in terms of this distance measure. Finally, we present some initial results on how manipulation graphs may be learned from data.

1.1 Related work

That learning outcomes might be compromised by agents responding to published classification rules with strategic manipulations of their feature vectors was first pointed out over a decade ago (Dalvi et al. 2004; Brückner and Scheffer 2011), and the topic has received substantial interest from the research community in recent years, initiated by a study by Hardt et al. that differentiated the field from the more general context of learning under adversarial perturbations (Hardt et al. 2016). That study considered strategic responses induced by separable cost functions for utility-maximizing agents and studied the resulting decision boundaries for certain classes of classifiers. Recent years have seen a lot of interest in better understanding the interplay of various incentives in settings where a decision rule is published and thereby has an effect on how the entities that are to be classified might present themselves to the decision-maker. In particular, various externalities of this scenario have been analyzed. A general cost to society, formalized as a “social burden” incurred through the costs of enforced feature manipulation, has been shown to occur when institutions anticipate strategic responses (Milli et al. 2019; Jagadeesan, Mendler-Dünner, and Hardt 2021). Further, it has been demonstrated how such a burden may be suffered to differing degrees by subgroups of a population that differ in their capabilities to adapt their features in ways that are favorable to them (Milli et al. 2019; Hu, Immorlica, and Vaughan 2019), raising concerns over fairness in such scenarios.

Recent studies have extended the original game-theoretic model of a classifier-publishing institution and best-responding subjects. For example, a recent work studied how strategic modification can also have a positive effect and how this should be taken into consideration by the institution (Haghtalab et al. 2020). Such a perspective has been connected to underlying causal relations between features and classification outcomes and the resulting strategic recommendations (Miller, Milli, and Hardt 2020; Tsirtsis and Rodriguez 2020). Further, a very recent study has explored how the model of a best-responding agent may be relaxed to better reflect empirically observed phenomena (Jagadeesan, Mendler-Dünner, and Hardt 2021).

Much of the previous work considers the scenario of classification with strategic agents at the population level. A few recent studies have also analyzed how phenomena observed on samples reflect the underlying population events (Haghtalab et al. 2020). Notably, very recent studies provided first analyses of learning with strategically responding agents in a PAC framework (Zhang and Conitzer 2021; Sundaram et al. 2021). The former work studied the sample complexity of learning VC-classes in this setup and analyzed the effects on sample complexity of enforcing incentive compatibility for the learned classification rules. Our work can be viewed as an extension of this analysis. We propose to combine aspects of incentive compatibility and the minimization of negative externalities, such as social burden, in the form of a novel loss function that may serve as a learning objective when strategic responses are to be expected.

Our sample complexity analysis hinges on techniques developed in the context of learning under adversarial perturbations, a learning scenario which has received considerable research attention in recent years (Feige, Mansour, and Schapire 2015; Cullina, Bhagoji, and Mittal 2018; Montasser, Hanneke, and Srebro 2019, 2021). While the learning problems are not identical, we show how strategic behaviour can be modeled as a form of “one-sided adversarial perturbation” and how the resulting learning guarantees can be inherited.

1.2 Overview on contributions

In Section 2 we review our notation and then introduce our new notion of strategic loss and motivate it. Our main contributions can be summarized as follows:

Strategic manipulation loss

We propose a novel loss function for learning in the presence of strategic feature manipulations. We carefully motivate this loss by relating it to concepts of social burden and incentive compatibility (and their potential trade-offs with accuracy) in prior literature.

Sample complexity analysis

We analyze (PAC-type) learnability of VC-classes with the strategic loss. We provide sufficient conditions (and examples of when they are satisfied) for learnability with a proper learner. By drawing connections and adapting results from learning under adversarial perturbations to our setup, we also show that, while proper learnability cannot always be guaranteed, every VC-class is learnable under the strategic loss with an improper learner.

Robustness to inaccurate manipulation information

We investigate the impact of using an approximate manipulation graph to yield a surrogate strategic loss function for cases where the true manipulation graph is not known or not accessible. For this, we introduce a novel similarity measure on graphs and show that if graphs are similar with respect to our notion then they yield reasonable surrogate strategic losses for each other (Theorem 6).

Learning the manipulation graph

We explore the question of whether it is possible to learn a manipulation graph that yields a good surrogate strategic loss. We identify a sufficient condition for a class of graphs $\mathcal{G}$ to be learnable with respect to our previously defined similarity measure for graphs (Theorem 7), which in turn guarantees learning a reasonable surrogate loss.

All proofs can be found in the appendix of the full version.

2 Setup

2.1 Basic Learning Theoretic Notions for Classification

We employ a standard setup of statistical learning theory for classification. We let $\mathcal{X}\subseteq\mathbb{R}^{d}$ denote the domain and $\mathcal{Y}$ (mostly $\mathcal{Y}=\{0,1\}$) a (binary) label space. We model the data generating process as a distribution $P$ over $\mathcal{X}\times\mathcal{Y}$ and let $P_{\mathcal{X}}$ denote the marginal of $P$ over $\mathcal{X}$. We use the notation $(\mathbf{x},y)\sim P$ to indicate that $(\mathbf{x},y)$ is a sample from distribution $P$ and $S\sim P^{n}$ to indicate that the set $S$ is a sequence (for example a training or test data set) of $n$ i.i.d. samples from $P$. Further, we use the notation $\eta_{P}(\mathbf{x})=\mathbb{P}_{(\mathbf{x},y)\sim P}[y=1\mid\mathbf{x}]$ to denote the regression or conditional labeling function of $P$. We say that the distribution has deterministic labels if $\eta_{P}(\mathbf{x})\in\{0,1\}$ for all $\mathbf{x}\in\mathcal{X}$.

A classifier or hypothesis is a function $h:\mathcal{X}\to\mathcal{Y}$. A classifier $h$ can naturally be viewed as a subset of $\mathcal{X}\times\mathcal{Y}$, namely $h=\{(\mathbf{x},y)\in\mathcal{X}\times\mathcal{Y}\mid\mathbf{x}\in\mathcal{X},\ y=h(\mathbf{x})\}$. We let $\mathcal{F}$ denote the set of all Borel measurable functions from $\mathcal{X}$ to $\mathcal{Y}$ (or all functions in case of a countable domain). A hypothesis class is a subset of $\mathcal{F}$, often denoted by $\mathcal{H}\subseteq\mathcal{F}$. For a loss function $\ell:\mathcal{F}\times\mathcal{X}\times\mathcal{Y}\to\mathbb{R}$ we denote the expected loss for a distribution $P$ as $\mathcal{L}_{P}$ and the empirical loss for a sample $S$ as $\mathcal{L}_{S}$. We use standard definitions such as PAC learnability, sample complexity, and approximation error. For further elaboration on these definitions, we refer the reader to the extended definitions section in the appendix or to a textbook (Shalev-Shwartz and Ben-David 2014).

2.2 Strategic Classification

Learning objectives in prior work

The possibilities for strategic manipulations of a feature vector are often modeled in terms of a cost function $\mathrm{cost}:\mathcal{X}\times\mathcal{X}\to\mathbb{R}_{0}^{+}$, so that $\mathrm{cost}(\mathbf{x},\mathbf{x}^{\prime})$ indicates how expensive it is for an individual with feature vector $\mathbf{x}$ to present as $\mathbf{x}^{\prime}$. A natural minimal assumption on a cost function is $\mathrm{cost}(\mathbf{x},\mathbf{x})=0$ for all feature vectors $\mathbf{x}$. It is then typically assumed that individuals best-respond to a published classifier, in that an individual $\mathbf{x}$ would choose to pay the cost of presenting as $\mathbf{x}^{\prime}$ as long as this cost does not exceed the utility gained from the difference in classification outcome. Assuming the benefit for individual $\mathbf{x}$ of receiving classification $1$ over classification $0$ is $\gamma$, the manipulation would happen if $\mathrm{cost}(\mathbf{x},\mathbf{x}^{\prime})\leq\gamma$ and $h(\mathbf{x})=0$ while $h(\mathbf{x}^{\prime})=1$ for a given classifier $h$. That is, we can define the best response of an individual with feature vector $\mathbf{x}$ facing classifier $h$ as

\[\mathrm{br}(\mathbf{x},h)=\mathrm{argmax}_{\mathbf{x}^{\prime}\in\mathcal{X}}\left[\gamma\cdot h(\mathbf{x}^{\prime})-\mathrm{cost}(\mathbf{x},\mathbf{x}^{\prime})\right],\]

with ties broken arbitrarily, except that, if the original feature vector $\mathbf{x}$ is among the maximizers, the individual is assumed to maintain the original features. An often assumed learning goal is then performative optimality (Perdomo et al. 2020; Jagadeesan, Mendler-Dünner, and Hardt 2021), which stipulates that a learner should aim to maximize accuracy on the distribution it induces via the agents' responses. That is, this objective can be phrased as minimizing

\[\mathbb{E}_{(\mathbf{x},y)\sim P}\,\mathds{1}\left[h(\mathrm{br}(\mathbf{x},h))\neq y\right].\]

An alternative view on this setup, if the agent responses are deterministic, is to view the above as minimizing the binary loss of the effective hypothesis $\hat{h}:\mathcal{X}\to\{0,1\}$ that is induced by $h$ and the agents' best responses $\mathrm{br}(\cdot,\cdot)$ (Zhang and Conitzer 2021), defined as

\[\hat{h}(\mathbf{x})=h(\mathrm{br}(\mathbf{x},h)).\qquad(1)\]
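
To make the best-response model and the induced effective hypothesis concrete, the following minimal sketch (illustrative Python; the cost function, the value of $\gamma$, and the threshold classifier are assumptions made for the example, not part of the paper) computes $\mathrm{br}(\mathbf{x},h)$ and $\hat{h}$ on a small finite domain.

```python
# Illustrative sketch (not from the paper): best response and effective
# hypothesis on a finite domain.  `cost`, `gamma`, and `h` are assumptions.

def best_response(x, h, domain, cost, gamma):
    """Return argmax_{x'} gamma * h(x') - cost(x, x'), keeping x on ties."""
    best, best_util = x, gamma * h(x) - cost(x, x)    # cost(x, x) = 0
    for xp in domain:
        util = gamma * h(xp) - cost(x, xp)
        if util > best_util:                          # strict: ties keep x
            best, best_util = xp, util
    return best

def effective_hypothesis(h, domain, cost, gamma):
    """Return the induced classifier x -> h(br(x, h)) of Equation (1)."""
    return lambda x: h(best_response(x, h, domain, cost, gamma))

# Toy usage: one-dimensional features 0..4, unit movement cost, gamma = 1.5.
domain = range(5)
cost = lambda x, xp: abs(x - xp)
h = lambda x: int(x >= 3)                             # threshold classifier
h_hat = effective_hypothesis(h, domain, cost, 1.5)
print([h_hat(x) for x in domain])                     # [0, 0, 1, 1, 1]: x = 2 moves to 3
```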

The goal of performative optimality has been combined with the notion of social burden that is induced by a classifier (Milli et al. 2019; Jagadeesan, Mendler-Dünner, and Hardt 2021). This notion reflects that it is undesirable for a (truly) positive instance to be forced to manipulate its features to obtain a (rightfully) positive classification. This is modeled by considering the burden on a positive individual to be the cost that is incurred by reaching for a positive classification and the social burden incurred by a classifier to be the expectation with respect to the data-generating process over these costs:

\[\mathrm{brd}_{P}(h)=\mathbb{E}_{(\mathbf{x},y)\sim P}\left[\min_{\mathbf{x}^{\prime}\in\mathcal{X}}\{\mathrm{cost}(\mathbf{x},\mathbf{x}^{\prime})\mid h(\mathbf{x}^{\prime})=1\}\;\middle|\;y=1\right]\]

It has been shown that optimizing for performative optimality (under the assumption of deterministic best-responses) also incurs maximal social burden (Jagadeesan, Mendler-Dünner, and Hardt 2021).

A new loss function for learning under strategic manipulations

Arguably, to seek performative optimality (or to minimize the binary loss over the effective hypothesis class), the cost function as well as the value $\gamma$ (or function $\gamma:\mathcal{X}\to\mathbb{R}$) of positive classification need to be known, at least approximately. To take best responses into account, a learner needs to know what these best responses may look like. In that case, we may ignore the details of the cost function and the value $\gamma$, and simply represent the collection of plausible manipulations as a directed graph structure $\mathcal{M}=(\mathcal{X},E)$ over the feature space $\mathcal{X}$ (Zhang and Conitzer 2021). The edge set $E$ consists of all pairs $(\mathbf{x},\mathbf{x}^{\prime})$ with $\mathrm{cost}(\mathbf{x},\mathbf{x}^{\prime})\leq\gamma$. We will also use the notation $\mathbf{x}\to\mathbf{x}^{\prime}$ for $(\mathbf{x},\mathbf{x}^{\prime})\in E$, and write $\mathcal{M}=(\mathcal{X},E)=(\mathcal{X},\to)$. We note that this formalism is valid for both countable (discrete) and uncountable domains.

Given the information in the so obtained manipulation graph $\mathcal{M}=(\mathcal{X},\to)$, we now design a loss function for classification in the presence of strategic manipulation that reflects both classification errors and the goal of disincentivizing manipulated features as much as possible. Our proposed loss function below models that it is undesirable for a classifier to assign $h(\mathbf{x})=0$ and $h(\mathbf{x}^{\prime})=1$ if feature vector $\mathbf{x}$ can present as $\mathbf{x}^{\prime}$. This holds independently of the true label $y$ (say, when $(\mathbf{x},y)$ is sampled from the data generating process). If the true label is $y=0$, the point gets misclassified when $\mathbf{x}$ presents as $\mathbf{x}^{\prime}$. On the other hand, if the true label is $1$, then either a truly positive instance is forced to manipulate its features to obtain a rightly positive outcome (and this contributes to the social burden), or, if it chooses not to manipulate its features, the instance will be misclassified (prior work has also considered models where truly positive instances are “honest” and will not manipulate their features (Dong et al. 2018)). Here, we propose to incorporate both misclassification and contributions to social burden into a single loss function that a learner may aim to minimize.

Definition 1.

We define the strategic loss $\ell^{\to}:\mathcal{F}\times\mathcal{X}\times\mathcal{Y}\to\{0,1\}$ with respect to manipulation graph $(\mathcal{X},\to)$ as follows:

\[\ell^{\to}(h,\mathbf{x},y)=\begin{cases}1&\text{if }h(\mathbf{x})\neq y\\ 1&\text{if }h(\mathbf{x})=0\text{ and }\exists\mathbf{x}^{\prime}\text{ with }\mathbf{x}\to\mathbf{x}^{\prime}\text{ and }h(\mathbf{x}^{\prime})=1\\ 0&\text{else}\end{cases}\]

Note that the first two cases are not mutually exclusive. The above loss function discretizes the social burden by assigning a loss of $1$ whenever a positive individual is required to manipulate features. As with the standard classification loss, the above point-wise definition allows us to define the true strategic loss $\mathcal{L}^{\to}_{P}(h)$ and the empirical strategic loss $\mathcal{L}^{\to}_{S}(h)$ of a classifier with respect to a distribution $P$ or a data sample $S$.
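
The following minimal sketch (illustrative Python; the adjacency dictionary, the sample, and the threshold classifier are assumptions made for the example) evaluates the point-wise strategic loss of Definition 1 and the corresponding empirical strategic loss.

```python
# Illustrative sketch of the strategic loss of Definition 1.  The manipulation
# graph is an adjacency dict: graph[x] = set of x' with x -> x'.

def strategic_loss(h, x, y, graph):
    if h(x) != y:
        return 1                                      # misclassification
    if h(x) == 0 and any(h(xp) == 1 for xp in graph.get(x, ())):
        return 1                                      # manipulation is incentivized
    return 0

def empirical_strategic_loss(h, sample, graph):
    return sum(strategic_loss(h, x, y, graph) for x, y in sample) / len(sample)

# Toy usage: a threshold on {1, 2, 3, 4} with edges n -> n + 1.
graph = {1: {2}, 2: {3}, 3: {4}, 4: set()}
sample = [(1, 0), (2, 0), (3, 1), (4, 1)]
h = lambda x: int(x >= 3)
print(empirical_strategic_loss(h, sample, graph))     # 0.25: point 2 can move to 3
```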

2.3 Comparison with Alternative Formalisms for Strategic Classification

To motivate our proposed loss, we here discuss several scenarios where, we would argue, minimizing the strategic loss leads to a more desirable learning outcome than learning with a binary loss while taking strategic manipulations into account. As discussed above, a common approach to modeling classification in a setting where strategic manipulations may occur is to assume that all agents will best-respond to a published classifier. That is, if $h(\mathbf{x})=0$, $h(\mathbf{x}^{\prime})=1$ and $\mathbf{x}\to\mathbf{x}^{\prime}$, then the agent with initial feature vector $\mathbf{x}$ will effectively receive classification $1$. A natural modeling choice is then to consider the effective hypothesis $\hat{h}$ induced by $h$ (see Equation 1) and aim to minimize the classification error over the effective class $\hat{\mathcal{H}}=\{f\mid f=\hat{h}\text{ for some }h\in\mathcal{H}\}$ (Zhang and Conitzer 2021). However, it has been shown that the VC-dimension of $\hat{\mathcal{H}}$ may be arbitrarily larger than the VC-dimension of $\mathcal{H}$, and may even become infinite (Zhang and Conitzer 2021). When learning this effective class $\hat{\mathcal{H}}$ with respect to the binary loss (which corresponds to aiming for performative optimality), this implies that the class is not learnable. By contrast, we will show below that any class of finite VC-dimension remains learnable with respect to the strategic loss (Theorem 5).

It has also been shown that the negative effects on sample complexity of considering the effective hypothesis class can be avoided by considering only incentive-compatible hypotheses in $\mathcal{H}$, that is, outputting only hypotheses that will not induce any feature manipulations in response to the published classifier (Zhang and Conitzer 2021). While this avoids the growth in VC-dimension, it may prohibitively increase the approximation error of the resulting (pruned) class, as we show in the example below. We would argue that this illustrates how low sample complexity, in itself, is not a sufficient criterion for learning success.

Example 1.

Consider $\mathcal{X}=\mathbb{N}$ and a manipulation graph that includes edges $n\to n+1$ and $n\to n-1$ for all $n\in\mathbb{N}$. This is a reasonable structure if the cost of moving the (one-dimensional) feature by $1$ is worth a positive classification outcome. However, the only two hypotheses that are incentive compatible in this case are the two constant functions $h_{0}:\mathcal{X}\to\{0\}$ and $h_{1}:\mathcal{X}\to\{1\}$. Thus, requiring incentive compatibility forces the learner to assign the same label to all points in the space. This class, in fact, has low sample complexity. However, restricting the learner to such a degree (and suffering the resulting classification error, which will be close to $0.5$ for distributions with balanced classes) is, in most cases, not a reasonable price to pay for disincentivizing feature manipulations.
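
The claim of Example 1 can be checked by brute force on a truncated domain. The sketch below (illustrative Python; the truncation level $N$ is an arbitrary assumption) enumerates all labelings of the path $\{0,\dots,N\}$ and confirms that only the two constant labelings induce no incentive to manipulate.

```python
# Brute-force check of Example 1 on a truncated domain {0, ..., N}: with edges
# n -> n + 1 and n -> n - 1, only the constant labelings are incentive compatible.
from itertools import product

N = 10
domain = list(range(N + 1))
neighbors = {n: {m for m in (n - 1, n + 1) if 0 <= m <= N} for n in domain}

def incentive_compatible(h):
    # Incentive compatible: no point labeled 0 has an out-neighbor labeled 1.
    return not any(h[n] == 0 and any(h[m] == 1 for m in neighbors[n]) for n in domain)

ic = [h for h in product([0, 1], repeat=N + 1) if incentive_compatible(h)]
print(ic)   # [(0, ..., 0), (1, ..., 1)]: only the two constant labelings survive
```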

The following example illustrates how our loss function can be viewed as incorporating the notion of social burden directly into the loss.

Example 2.

Let us again consider the domain $\mathcal{X}=\mathbb{N}$ and a manipulation graph $\mathcal{M}$ with edges $n\to n+1$ for all $n\in\mathbb{N}$. We consider distributions with support $\{(1,0),(2,0),(3,1),(4,1)\}$, so only these four points have positive probability mass, and the hypothesis class of thresholds $\mathcal{H}=\{h_{a}\mid a\in\mathbb{R}\}$ with $h_{a}(\mathbf{x})=\mathds{1}\left[\mathbf{x}\geq a\right]$. The true labeling on these distributions is $\eta(\mathbf{x})=h_{2.5}(\mathbf{x})$. On all distributions where all four points have positive mass, however, the performatively optimal hypothesis (or effective hypothesis of minimal binary loss) is $h_{3.5}$. The social burden incurred then is $\mathrm{brd}_{P}(h_{3.5})=P((3,1))\cdot\mathrm{cost}(3,4)$. It is important to note that the performative optimality of $h_{3.5}$ is independent of the distribution $P$ over the points. A learner that minimizes the strategic loss, on the other hand, will take the distribution $P$ into account and output $h_{2.5}$ if $P((2,0))<P((3,1))$, while outputting $h_{3.5}$ if $P((2,0))>P((3,1))$. If the difference in mass of these points (or of the margin areas in a more general setting) is significant, then minimizing the strategic loss will opt for allowing a small amount of manipulation in exchange for outputting a correct classification rule in case $P((2,0))\ll P((3,1))$; and it will opt for changing the classification rule, accepting a small amount of social burden in exchange for preventing a large amount of manipulation and the resulting classification errors, in case $P((2,0))\gg P((3,1))$. We would argue that this reflects a desirable learning outcome.
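
A small computation (an illustrative Python sketch; the probability weights below are assumptions) makes the trade-off explicit: under this graph, the true strategic loss of $h_{2.5}$ equals $P((2,0))$ while that of $h_{3.5}$ equals $P((3,1))$, so the minimizer follows whichever mass is smaller.

```python
# Worked computation for Example 2: edges n -> n + 1, support {(1,0),(2,0),(3,1),(4,1)}.

def strategic_loss(h, x, y, graph):
    return int(h(x) != y or (h(x) == 0 and any(h(xp) == 1 for xp in graph.get(x, ()))))

def true_strategic_loss(h, dist, graph):
    return sum(p * strategic_loss(h, x, y, graph) for (x, y), p in dist.items())

graph = {1: {2}, 2: {3}, 3: {4}, 4: set()}
h25 = lambda x: int(x >= 2.5)
h35 = lambda x: int(x >= 3.5)

for p2, p3 in [(0.05, 0.45), (0.45, 0.05)]:           # vary P((2,0)) versus P((3,1))
    dist = {(1, 0): 0.25, (2, 0): p2, (3, 1): p3, (4, 1): 0.25}
    print(p2, p3, true_strategic_loss(h25, dist, graph), true_strategic_loss(h35, dist, graph))
# h_{2.5} only pays for the manipulation of (2,0); h_{3.5} only pays for (3,1),
# so the strategic-loss minimizer switches with the relative weights.
```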

3 Learnability with the Strategic Loss

3.1 Warm up: Loss Classes and Learnability

It is well known that a class $\mathcal{H}$ is learnable (with respect to the set of all distributions) if the loss class induced by a $0/1$-valued loss function $\ell$ has finite VC-dimension. In the case of the binary classification loss, this is in fact a characterization of learnability (and the VC-dimension of the loss class is identical to the VC-dimension of the hypothesis class $\mathcal{H}$). In general, bounded VC-dimension of the loss class is a sufficient condition for learnability (the VC-dimension provides an upper bound on the sample complexity), but it is not a necessary condition (it does not, in general, yield a lower bound on the sample complexity of learning a class $\mathcal{H}$ with respect to some loss $\ell$). We start by reviewing these notions for the classification loss and then take a closer look at the loss class induced by the strategic loss.

Let $\ell$ be a loss function and $h$ be a classifier. We define the loss set $h_{\ell}\subseteq\mathcal{X}\times\mathcal{Y}$ as the set of all labeled instances $(\mathbf{x},y)$ on which $h$ suffers loss $1$. The loss class $\mathcal{H}_{\ell}$ is the collection of all loss sets (in the literature, the loss class is often described as the class of indicator functions over these sets). In the case of the binary classification loss $\ell^{0/1}$, the loss set of a classifier $h$ is exactly the complement of $h$ in $\mathcal{X}\times\mathcal{Y}$. That is, in this case the loss set of $h$ is also a binary function over the domain $\mathcal{X}$ (namely the function $\mathbf{x}\mapsto|h(\mathbf{x})-1|$). For the strategic loss, on the other hand, the loss set of a classifier $h$ is not a function, since it can contain both $(\mathbf{x},0)$ and $(\mathbf{x},1)$ for some points $\mathbf{x}\in\mathcal{X}$, namely if $h(\mathbf{x})=0$ and there exists an $\mathbf{x}^{\prime}$ with $\mathbf{x}\to\mathbf{x}^{\prime}$ and $h(\mathbf{x}^{\prime})=1$. For a class $\mathcal{H}$ we let $\mathcal{H}_{\ell^{0/1}}$ denote the loss class with respect to the binary loss and $\mathcal{H}_{\ell^{\to}}$ the loss class with respect to the strategic loss.

Definition 2.

Let $\mathcal{Z}$ be some set and $\mathcal{U}\subseteq 2^{\mathcal{Z}}$ be a collection of subsets of $\mathcal{Z}$. We say that a set $S\subseteq\mathcal{Z}$ is shattered by $\mathcal{U}$ if

\[\{U\cap S\mid U\in\mathcal{U}\}=2^{S},\]

that is, every subset of $S$ can be obtained by intersecting $S$ with some set $U$ from the collection $\mathcal{U}$. The VC-dimension of $\mathcal{U}$ is the largest size of a set that is shattered by $\mathcal{U}$ (or $\infty$ if $\mathcal{U}$ can shatter arbitrarily large sets).

It is easy to verify that for the binary loss, the VC-dimension of $\mathcal{H}$ as a collection of subsets of $\mathcal{X}\times\mathcal{Y}$ is identical to the VC-dimension of $\mathcal{H}_{\ell^{0/1}}$ (and this also coincides with the VC-dimension of $\mathcal{H}$ as a binary function class (Shalev-Shwartz and Ben-David 2014); VC-dimension is often defined for binary functions rather than for collections of subsets, however, this is limiting for cases where the loss class is not a class of functions).
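
For finite set systems, shattering in the sense of Definition 2 can be checked directly. The following sketch (illustrative Python; the toy collection of initial segments is an assumption, standing in for threshold-type loss sets) brute-forces the VC-dimension of a finite collection of subsets.

```python
# Illustrative brute-force check of shattering and VC-dimension (Definition 2).
from itertools import combinations

def is_shattered(S, collection):
    """Check whether {U & S : U in collection} equals the power set of S."""
    traces = {frozenset(U) & frozenset(S) for U in collection}
    return len(traces) == 2 ** len(S)

def vc_dimension(ground_set, collection):
    """Largest size of a subset of the (finite) ground set that is shattered."""
    d = 0
    for k in range(1, len(ground_set) + 1):
        if any(is_shattered(S, collection) for S in combinations(ground_set, k)):
            d = k
    return d

# Toy usage: initial segments {1, ..., a} of {1, ..., 5} (threshold-like sets).
ground = range(1, 6)
collection = [set(range(1, a + 1)) for a in range(0, 6)]
print(vc_dimension(ground, collection))   # 1: no two points can be shattered
```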

We now show that the VC-dimension of a class $\mathcal{H}$ and that of its loss class with respect to the strategic loss can differ by an arbitrarily large amount. Similar results have been shown for the binary loss class of the effective class $\hat{\mathcal{H}}$ induced by a manipulation graph (Zhang and Conitzer 2021). However, the binary loss class of $\hat{\mathcal{H}}$ is different from the strategic loss class of $\mathcal{H}$ and, as we will see, the implications for learnability are also different.

Observation 1.

For any $d\in\mathbb{N}\cup\{\infty\}$ there exists a class $\mathcal{H}$ and a manipulation graph $\mathcal{M}=(\mathcal{X},\to)$ with $\mathrm{VC}(\mathcal{H})=1$ and $\mathrm{VC}(\mathcal{H}_{\ell^{\to}})\geq d$.

On the other hand, we prove that the VC-dimension of the strategic loss class $\mathcal{H}_{\ell^{\to}}$ is always at least as large as the VC-dimension of the original class.

Observation 2.

For any hypothesis class $\mathcal{H}$ and any manipulation graph $\mathcal{M}=(\mathcal{X},\to)$, we have $\mathrm{VC}(\mathcal{H})\leq\mathrm{VC}(\mathcal{H}_{\ell^{\to}})$.

Standard VC-theory tells us that, for the binary classification loss, any learner that acts according to the ERM (Empirical Risk Minimization) principle is a successful learner for classes of bounded VC-dimension $d$. For a brief recap of the underpinnings of this result, we refer the reader to the supplementary material, or for further details to (Shalev-Shwartz and Ben-David 2014). In the case of general loss classes with values in $\{0,1\}$, the VC-dimension does not characterize learnability. In particular, we next show that the VC-dimension of the strategic loss class does not imply a lower bound on the sample complexity.

Theorem 3.

For every $d\in\mathbb{N}\cup\{\infty\}$, there exists a hypothesis class $\mathcal{H}$ with $\mathrm{VC}(\mathcal{H}_{\ell^{\to}})=d$ that is learnable with sample complexity $O(\log(1/\delta)/\epsilon)$ in the realizable case.

3.2 Sufficient Conditions for Strategic Loss Learnability

In the previous section, we have seen that the loss class having finite VC-dimension is a sufficient (but not necessary) condition for learnability with respect to the strategic loss. We have also seen that the VC-dimension of $\mathcal{H}_{\ell^{\to}}$ can be arbitrarily larger than the VC-dimension of $\mathcal{H}$. To start exploring what determines learnability under the strategic loss, we provide a sufficient condition for a class to be properly learnable with respect to the strategic loss.

Note that for a hypothesis $h$, the strategic loss set $h_{\ell^{\to}}$ can be decomposed into the loss set of $h$ with respect to the binary loss and the component that comes from the strategic manipulations. Formally, we can define the strategic component loss.

Definition 3.

We let the strategic component loss with respect to manipulation graph $\to$ be defined as

\[\ell^{\to,\perp}(h,\mathbf{x})=\mathds{1}\left[h(\mathbf{x})=0\land\exists\mathbf{x}^{\prime}:\mathbf{x}\to\mathbf{x}^{\prime},\ h(\mathbf{x}^{\prime})=1\right]\]

We note that $\ell^{\to}(h,\mathbf{x},y)\leq\ell^{0/1}(h,\mathbf{x},y)+\ell^{\to,\perp}(h,\mathbf{x})$. We will denote the true strategic component loss with respect to the marginal distribution $P_{\mathcal{X}}$ as $\mathcal{L}^{\to,\perp}_{P_{\mathcal{X}}}$.
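
By Definitions 1 and 3, $\ell^{\to}$ is in fact the point-wise maximum of $\ell^{0/1}$ and $\ell^{\to,\perp}$, which yields the stated inequality. The short sketch below (illustrative Python; the toy graph and threshold classifier are assumptions) checks this decomposition point-wise.

```python
# Illustrative point-wise check of the decomposition of the strategic loss into
# the binary loss and the strategic component loss.

def loss_01(h, x, y):
    return int(h(x) != y)

def loss_component(h, x, graph):
    return int(h(x) == 0 and any(h(xp) == 1 for xp in graph.get(x, ())))

def loss_strategic(h, x, y, graph):
    return max(loss_01(h, x, y), loss_component(h, x, graph))

graph = {1: {2}, 2: {3}, 3: {4}, 4: set()}
h = lambda x: int(x >= 3)
for x, y in [(1, 0), (2, 0), (2, 1), (3, 1)]:
    l, l01, lp = loss_strategic(h, x, y, graph), loss_01(h, x, y), loss_component(h, x, graph)
    assert l <= l01 + lp                              # the inequality noted above
    print((x, y), l, l01, lp)
```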

For the loss sets, we then get

\[h_{\ell^{0/1}}=\{(\mathbf{x},y)\in\mathcal{X}\times\mathcal{Y}\mid h(\mathbf{x})\neq y\},\]

and

\[h_{\ell^{\to,\perp}}=\{(\mathbf{x},y)\in\mathcal{X}\times\mathcal{Y}\mid h(\mathbf{x})=0\land\exists\mathbf{x}^{\prime}:\mathbf{x}\to\mathbf{x}^{\prime},\ h(\mathbf{x}^{\prime})=1\}.\]

This implies

\[h_{\ell^{\to}}=h_{\ell^{0/1}}\cup h_{\ell^{\to,\perp}}\]

for all classifiers $h\in\mathcal{F}$, and thereby

\[\mathcal{H}_{\ell^{\to}}=\{h_{\ell^{0/1}}\cup h_{\ell^{\to,\perp}}\mid h\in\mathcal{H}\}.\]

By standard counting arguments on the VC-dimension of such unions (see, for example, Chapter 6 of (Shalev-Shwartz and Ben-David 2014) and the exercises in that chapter), this decomposition implies that $\mathrm{VC}(\mathcal{H}_{\ell^{\to}})\leq d\log d$ for $d=\mathrm{VC}(\mathcal{H}_{\ell^{0/1}})+\mathrm{VC}(\mathcal{H}_{\ell^{\to,\perp}})=\mathrm{VC}(\mathcal{H})+\mathrm{VC}(\mathcal{H}_{\ell^{\to,\perp}})$. Thus, if both the class $\mathcal{H}$ itself and the class of strategic components have finite VC-dimension, then $\mathcal{H}$ is properly learnable by any learner that is an ERM for the strategic loss:

Theorem 4.

Let $\mathcal{H}$ be a hypothesis class with $\mathrm{VC}(\mathcal{H})+\mathrm{VC}(\mathcal{H}_{\ell^{\to,\perp}})=d<\infty$. Then $\mathcal{H}$ is properly PAC learnable with respect to the strategic loss (both in the realizable and the agnostic case).

Whether the class of strategic components has finite VC-dimension intrinsically depends on the interplay between the hypothesis class $\mathcal{H}$ and the structure of the manipulation graph. In Observation 1, we have seen that the graph structure can cause the strategic component sets to have much larger complexity than the original class. In Section B of the appendix, we provide a few natural examples where the VC-dimension of the strategic components is finite.

Theorem 4 provides a strong sufficient condition under which any empirical risk minimizer for the strategic loss is a successful agnostic learner for a class of finite VC-dimension. We believe that in many natural situations the conditions of the theorem will hold, and analyzing in more detail which graph structures, combinations of graph structures and hypothesis classes, or classes of cost functions lead to strategic component sets of finite VC-dimension is an intriguing direction for further research.
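
As an illustration of Theorem 4, the sketch below (illustrative Python; the finite threshold class, toy graph and sample are assumptions made only so that the minimization can be done by enumeration) implements a proper empirical risk minimizer for the strategic loss.

```python
# Illustrative proper ERM for the strategic loss over a finite class of thresholds.

def strategic_loss(h, x, y, graph):
    return int(h(x) != y or (h(x) == 0 and any(h(xp) == 1 for xp in graph.get(x, ()))))

def erm_strategic(hypotheses, sample, graph):
    def emp(h):
        return sum(strategic_loss(h, x, y, graph) for x, y in sample) / len(sample)
    return min(hypotheses, key=emp)

graph = {n: {n + 1} for n in range(1, 5)}             # edges n -> n + 1
thresholds = [0.5, 1.5, 2.5, 3.5, 4.5]
hypotheses = [(lambda a: (lambda x: int(x >= a)))(a) for a in thresholds]
sample = [(1, 0), (2, 0), (3, 1), (3, 1), (3, 1), (4, 1)]
best = erm_strategic(hypotheses, sample, graph)
print([best(x) for x in range(1, 5)])                 # [0, 0, 1, 1]: h_{2.5} wins here
```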

We close this section with two results, both stated in Theorem 5, on learnability under the strategic loss in the general case where the VC-dimension of the strategic component sets may be infinite. First, there are classes and manipulation graphs for which no proper learner is (PAC-)successful, even in the realizable case. Second, for any class of finite VC-dimension and any manipulation graph, there exists an improper PAC learner. These results follow by drawing a connection from learning under the strategic loss to learning under an adversarial loss (Montasser, Hanneke, and Srebro 2019). In the general adversarial loss setup, every domain instance $\mathbf{x}$ is assigned a set of potential perturbations $\mathcal{U}(\mathbf{x})$, and the adversarial loss of a hypothesis $h$ is then defined as

\[\ell^{\mathcal{U}}(h,\mathbf{x},y)=\mathds{1}\left[\exists\mathbf{x}^{\prime}\in\mathcal{U}(\mathbf{x}):h(\mathbf{x}^{\prime})\neq y\right].\]

The strategic loss can be viewed as a one-sided version of the adversarial loss, where the perturbation sets differ conditional on the label of a point, with $\mathcal{U}(\mathbf{x},1)=\{\mathbf{x}\}$ and $\mathcal{U}(\mathbf{x},0)=\{\mathbf{x}^{\prime}\in\mathcal{X}\mid\mathbf{x}\to\mathbf{x}^{\prime}\}$. The following results on learnability with the strategic loss then follow by slight modifications of the corresponding proofs for learning under the adversarial loss.

Theorem 5.

(Adaptation of Theorem 1 and Theorem 4 in (Montasser, Hanneke, and Srebro 2019))
There exists a hypothesis class $\mathcal{H}$ with $\mathrm{VC}(\mathcal{H})=1$ that is not learnable with respect to the strategic loss by any proper learner $\mathcal{A}$ for $\mathcal{H}$, even in the realizable case. On the other hand, every class $\mathcal{H}$ of finite VC-dimension is learnable (by some improper learner).
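
The reduction sketched above can be made concrete (illustrative Python; the toy graph and classifier are assumptions). The sketch implements the adversarial loss for label-dependent perturbation sets and checks that, with $\mathcal{U}(\mathbf{x},1)=\{\mathbf{x}\}$ and $\mathcal{U}(\mathbf{x},0)=\{\mathbf{x}\}\cup\{\mathbf{x}^{\prime}\mid\mathbf{x}\to\mathbf{x}^{\prime}\}$ (the point itself is added explicitly because the adjacency representation below does not store the zero-cost self-loop), it coincides with the strategic loss.

```python
# Illustrative sketch: the strategic loss as a one-sided adversarial loss.

def adversarial_loss(h, x, y, U):
    return int(any(h(xp) != y for xp in U(x, y)))

def strategic_as_adversarial(h, x, y, graph):
    def U(x, y):
        return {x} if y == 1 else {x} | set(graph.get(x, ()))
    return adversarial_loss(h, x, y, U)

def strategic_loss(h, x, y, graph):
    return int(h(x) != y or (h(x) == 0 and any(h(xp) == 1 for xp in graph.get(x, ()))))

graph = {1: {2}, 2: {3}, 3: set()}
h = lambda x: int(x >= 3)
for x in (1, 2, 3):
    for y in (0, 1):
        assert strategic_as_adversarial(h, x, y, graph) == strategic_loss(h, x, y, graph)
print("strategic loss matches the one-sided adversarial loss on the toy instance")
```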

4 Strategic loss with respect to an approximate manipulation graph

In many situations one might not have direct access to the true manipulation graph $\mathcal{M}=(V,E)$, but only to some approximate graph $\mathcal{M}^{\prime}=(V,E^{\prime})$. In this section, we will investigate how this change of manipulation graph impacts the corresponding loss function. We define a criterion for measuring the similarity of graphs with respect to a hypothesis class $\mathcal{H}$ and show that similar graphs will yield similar strategic losses. That is, we show an upper bound on the true strategic loss of a hypothesis $h$ (i.e., the strategic loss with respect to the true manipulation graph) in terms of the graph similarity and the surrogate strategic loss of $h$ (i.e., the strategic loss with respect to the approximate graph). We will use $\mathbf{x}\leadsto\mathbf{x}^{\prime}$ to denote $(\mathbf{x},\mathbf{x}^{\prime})\in E^{\prime}$. As the set of vertices $V$ is always equal to $\mathcal{X}$ in our setting, the graphs $\mathcal{M}$ and $\mathcal{M}^{\prime}$ are uniquely defined by $\to$ and $\leadsto$ respectively. We will therefore use $\to$ and $\mathcal{M}$, as well as $\leadsto$ and $\mathcal{M}^{\prime}$, interchangeably.

We now define the distance between graphs with respect to a hypothesis class $\mathcal{H}$ by the impact a change of manipulation graph has on the strategic component loss of elements of $\mathcal{H}$. This definition and its later use are inspired by works on domain adaptation (Ben-David et al. 2010; Mansour, Mohri, and Rostamizadeh 2009).

Definition 4.

For two manipulation graphs, given by $\to$ and $\leadsto$, we let their $\mathcal{H}$-$P_{\mathcal{X}}$-distance be defined as

\[d_{\mathcal{H},P_{\mathcal{X}}}(\to,\leadsto)=\sup_{h\in\mathcal{H}}\mathbb{E}_{\mathbf{x}\sim P_{\mathcal{X}}}\left[|\ell^{\to,\perp}(h,\mathbf{x})-\ell^{\leadsto,\perp}(h,\mathbf{x})|\right]\]

We will now bound the strategic manipulation loss $\mathcal{L}^{\to}_{P}(h)$ with respect to the true graph $\to$ in terms of the strategic manipulation loss $\mathcal{L}^{\leadsto}_{P}(h)$ with respect to the approximate graph $\leadsto$ and the $\mathcal{H}$-$P_{\mathcal{X}}$-distance between $\to$ and $\leadsto$.

Theorem 6.

Let $\mathcal{H}$ be any hypothesis class and $\to,\leadsto$ two manipulation graphs. Then for any distribution $P$ over $\mathcal{X}\times\mathcal{Y}$ and any $h\in\mathcal{H}$ we have

\[\mathcal{L}^{\to}_{P}(h)\leq\mathcal{L}^{0/1}_{P}(h)+\mathcal{L}^{\leadsto,\perp}_{P}(h)+d_{\mathcal{H},P_{\mathcal{X}}}(\to,\leadsto)\leq 2\mathcal{L}^{\leadsto}_{P}(h)+d_{\mathcal{H},P_{\mathcal{X}}}(\to,\leadsto).\]

Furthermore, by rearranging the result, we get

\[\frac{1}{2}\mathcal{L}^{\leadsto}_{P}(h)-d_{\mathcal{H},P_{\mathcal{X}}}(\to,\leadsto)\leq\mathcal{L}^{\to}_{P}(h).\]

We note that the expression $d_{\mathcal{H},P_{\mathcal{X}}}(\to,\leadsto)$ is independent of the labeling and can therefore be estimated using data without any label information. Furthermore, we have seen that a small $d_{\mathcal{H},P_{\mathcal{X}}}(\to,\leadsto)$ tightens the upper as well as the lower bound on $\mathcal{L}^{\to}_{P}(h)$. Therefore, $d_{\mathcal{H},P_{\mathcal{X}}}(\to,\leadsto)$ is a suitable distance measure for approximating the structure of the manipulation graph. In the following section, we will explore learning $\leadsto$ with low $d_{\mathcal{H},P_{\mathcal{X}}}(\to,\leadsto)$ from finite samples.
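
Since $d_{\mathcal{H},P_{\mathcal{X}}}(\to,\leadsto)$ requires no labels, it can be estimated from an unlabeled sample whenever the supremum over $\mathcal{H}$ can be evaluated. The sketch below (illustrative Python; a finite hypothesis class is assumed purely so that the supremum becomes a maximum) computes such an empirical estimate for two toy graphs.

```python
# Illustrative estimate of the H-P_X-distance from unlabeled points.

def component_loss(h, x, graph):
    return int(h(x) == 0 and any(h(xp) == 1 for xp in graph.get(x, ())))

def empirical_graph_distance(hypotheses, points, graph_true, graph_approx):
    def discrepancy(h):
        return sum(abs(component_loss(h, x, graph_true) - component_loss(h, x, graph_approx))
                   for x in points) / len(points)
    return max(discrepancy(h) for h in hypotheses)

# Toy usage: the approximate graph misses the edge 2 -> 3.
graph_true = {1: {2}, 2: {3}, 3: set()}
graph_approx = {1: {2}, 2: set(), 3: set()}
hypotheses = [(lambda a: (lambda x: int(x >= a)))(a) for a in (0.5, 1.5, 2.5, 3.5)]
points = [1, 2, 2, 3, 3, 3]
print(empirical_graph_distance(hypotheses, points, graph_true, graph_approx))
# 1/3: only h_{2.5} distinguishes the two graphs, and only on the point x = 2
```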

5 Learning a manipulation graph

In the last section, we assumed that we are given an approximate manipulation graph which we can use to learn a classifier with low strategic loss. We now want to go one step further and pose the goal of learning a manipulation graph $\leadsto$ from a predefined class of graphs $\mathcal{G}$ such that $\ell^{\leadsto,\perp}$ serves as a good strategic surrogate loss for $\ell^{\to,\perp}$. From Theorem 6 we already know that $\ell^{\leadsto,\perp}$ is a good surrogate loss for $\ell^{\to,\perp}$ if $d_{\mathcal{H},P_{\mathcal{X}}}(\to,\leadsto)$ is small. This section will thus focus on learning an approximate manipulation graph $\leadsto\in\mathcal{G}$ with small $d_{\mathcal{H},P_{\mathcal{X}}}(\to,\leadsto)$.

In order to further specify our learning problem, we will now describe what the input to such a learning procedure will look like. For a manipulation graph $\to$, let $B_{\to}:\mathcal{X}\to 2^{\mathcal{X}}$ be the function that maps an instance $\mathbf{x}$ to its set of children, i.e., $B_{\to}(\mathbf{x})=\{\mathbf{x}^{\prime}\in\mathcal{X}:\mathbf{x}\to\mathbf{x}^{\prime}\}$. We note that a manipulation graph $\to$ is uniquely defined by $B_{\to}$. Thus we will sometimes use $B_{\to}$ and $\to$ interchangeably. The input to our learning procedure will be of the form of samples $S=\{(\mathbf{x}_{1},B_{\to}(\mathbf{x}_{1})),\dots,(\mathbf{x}_{n},B_{\to}(\mathbf{x}_{n}))\}$ from the true manipulation graph $\to$.

As a next step in formulating our learning problem, we need to define a loss function. As stated above, our goal is to learn $\leadsto\in\mathcal{G}$ with small $d_{\mathcal{H},P_{\mathcal{X}}}(\to,\leadsto)$. As the definition of $d_{\mathcal{H},P_{\mathcal{X}}}(\to,\leadsto)$ contains a supremum over all $h\in\mathcal{H}$, we cannot use it as a loss directly (as a loss needs to be defined point-wise). However, we can formulate a loss that is closely related and will serve to guarantee low $d_{\mathcal{H},P_{\mathcal{X}}}(\to,\leadsto)$. We define the graph loss for a manipulation graph $\leadsto$, a domain point $\mathbf{x}$, a manipulation set $B\subset\mathcal{X}$ and a hypothesis $h$ as

\[\ell^{\mathrm{gr}}(h,\leadsto,\mathbf{x},B)=\begin{cases}1&\text{ if }h(\mathbf{x})=0\land\exists\mathbf{x}^{\prime}\in B:h(\mathbf{x}^{\prime})=1\land\forall\mathbf{x}^{\prime\prime}:\mathbf{x}\leadsto\mathbf{x}^{\prime\prime}\text{ implies }h(\mathbf{x}^{\prime\prime})=0\\ 1&\text{ if }h(\mathbf{x})=0\land\forall\mathbf{x}^{\prime}\in B:h(\mathbf{x}^{\prime})=0\land\exists\mathbf{x}^{\prime\prime}:\mathbf{x}\leadsto\mathbf{x}^{\prime\prime}\text{ and }h(\mathbf{x}^{\prime\prime})=1\\ 0&\text{ otherwise.}\end{cases}\]

This loss is indeed closely related to the $\mathcal{H}$-$P_{\mathcal{X}}$-distance, as $\ell^{\mathrm{gr}}(h,\leadsto,\mathbf{x},B_{\to}(\mathbf{x}))=|\ell^{\to,\perp}(h,\mathbf{x})-\ell^{\leadsto,\perp}(h,\mathbf{x})|$.
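
A minimal sketch of the graph loss (illustrative Python; the candidate graph and classifier are assumptions) implements it via the identity with $|\ell^{\to,\perp}(h,\mathbf{x})-\ell^{\leadsto,\perp}(h,\mathbf{x})|$ stated above.

```python
# Illustrative sketch of the graph loss: it indicates disagreement between the
# strategic components induced by the observed set B = B_->(x) and by the
# candidate graph ~>.

def component_from_set(h, x, B):
    return int(h(x) == 0 and any(h(xp) == 1 for xp in B))

def graph_loss(h, graph_approx, x, B):
    # Equals |l^{->,perp}(h, x) - l^{~>,perp}(h, x)| when B = B_->(x).
    return abs(component_from_set(h, x, B) - component_from_set(h, x, graph_approx.get(x, ())))

# Toy usage: the candidate graph misses the edge 2 -> 3.
graph_approx = {1: {2}, 2: set(), 3: set()}
h = lambda x: int(x >= 3)
print(graph_loss(h, graph_approx, 2, {3}))   # 1: B reaches a positive point, ~> does not
print(graph_loss(h, graph_approx, 1, {2}))   # 0: both components are 0
```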

The true graph loss with respect to some marginal $P_{\mathcal{X}}$ and true manipulation graph $\to$ is then defined by

\[\mathcal{L}^{\mathrm{gr}}_{(P_{\mathcal{X}},\to)}(h,\leadsto)=\mathbb{E}_{\mathbf{x}\sim P_{\mathcal{X}}}\left[\ell^{\mathrm{gr}}(h,\leadsto,\mathbf{x},B_{\to}(\mathbf{x}))\right].\]

Furthermore, for a sample $S=\{(\mathbf{x}_{1},B_{1}),\dots,(\mathbf{x}_{n},B_{n})\}$ we define the empirical graph loss as

\[\mathcal{L}^{\mathrm{gr}}_{S}(h,\leadsto)=\frac{1}{n}\sum_{(\mathbf{x}_{i},B_{i})\in S}\ell^{\mathrm{gr}}(h,\leadsto,\mathbf{x}_{i},B_{i}).\]

Similarly to previous sections, we now want to define a loss class for $\mathcal{H}\times\mathcal{G}$. We define $g(h,\leadsto)$ to be the set of all pairs $(\mathbf{x},B)\in\mathcal{X}\times 2^{\mathcal{X}}$ on which $\ell^{\mathrm{gr}}(h,\leadsto,\mathbf{x},B)=1$. Then the graph loss class of $\mathcal{H}\times\mathcal{G}$ is defined as

\[(\mathcal{H}\times\mathcal{G})_{\ell^{\mathrm{gr}}}=\{g(h,\leadsto):h\in\mathcal{H}\text{ and }\leadsto\in\mathcal{G}\}.\]

We will now show that if the VC-dimension of the loss class $(\mathcal{H}\times\mathcal{G})_{\ell^{\mathrm{gr}}}$ is finite, we can indeed learn $\mathcal{G}$ with respect to $\ell^{\mathrm{gr}}$. For some examples and more discussion of the VC-dimension of the loss class $(\mathcal{H}\times\mathcal{G})_{\ell^{\mathrm{gr}}}$, we refer the reader to the appendix.

Lemma 1.

Let $\mathrm{VC}((\mathcal{H}\times\mathcal{G})_{\ell^{\mathrm{gr}}})=d$. Then there is $n_{\mathrm{graph}}:(0,1)^{2}\to\mathbb{N}$ such that for any marginal distribution $P_{\mathcal{X}}$, any manipulation graph $\to$, and any sample $S=\{(\mathbf{x}_{1},B_{\to}(\mathbf{x}_{1})),\dots,(\mathbf{x}_{n},B_{\to}(\mathbf{x}_{n}))\}$ of size $n\geq n_{\mathrm{graph}}(\epsilon,\delta)$, we have with probability at least $1-\delta$ over the sample generation $S_{\mathcal{X}}=(\mathbf{x}_{1},\dots,\mathbf{x}_{n})\sim P_{\mathcal{X}}^{n}$, for any $h\in\mathcal{H}$ and any $\leadsto\in\mathcal{G}$,

\[|\mathcal{L}^{\mathrm{gr}}_{(P_{\mathcal{X}},\to)}(h,\leadsto)-\mathcal{L}^{\mathrm{gr}}_{S}(h,\leadsto)|<\epsilon.\]

Furthermore, $n_{\mathrm{graph}}(\epsilon,\delta)\in O\!\left(\frac{d+\log\frac{1}{\delta}}{\epsilon^{2}}\right)$.

We note that the above lemma is agnostic in the sense that it does not require $\to\in\mathcal{G}$. We will now introduce an empirical version of the $\mathcal{H}$-$P_{\mathcal{X}}$-distance. This will allow us to state the main theorem of this section and show that it is indeed possible to learn $\leadsto\in\mathcal{G}$ with low $d_{\mathcal{H},P_{\mathcal{X}}}(\to,\leadsto)$ if $\mathrm{VC}((\mathcal{H}\times\mathcal{G})_{\ell^{\mathrm{gr}}})$ is finite.

Definition 5.

Given a sample $S_{\mathcal{X}}=(\mathbf{x}_{1},\dots,\mathbf{x}_{n})$ of domain elements $\mathbf{x}_{i}$ and two manipulation graphs $\to$ and $\leadsto$, we can define the empirical $\mathcal{H}$-$S_{\mathcal{X}}$-distance as

\[d_{\mathcal{H},S_{\mathcal{X}}}(\to,\leadsto)=\sup_{h\in\mathcal{H}}\frac{1}{n}\sum_{\mathbf{x}_{i}\in S_{\mathcal{X}}}\ell^{\mathrm{gr}}(h,\leadsto,\mathbf{x}_{i},B_{\to}(\mathbf{x}_{i})).\]
Theorem 7.

Let $\mathrm{VC}((\mathcal{H}\times\mathcal{G})_{\ell^{\mathrm{gr}}})=d$. Then there is $n_{\mathrm{dist}}:(0,1)^{2}\to\mathbb{N}$ such that for every marginal distribution $P_{\mathcal{X}}$, every manipulation graph $\to$, and every sample $S=\{(\mathbf{x}_{1},B_{\to}(\mathbf{x}_{1})),\dots,(\mathbf{x}_{n},B_{\to}(\mathbf{x}_{n}))\}$ of size $n\geq n_{\mathrm{dist}}(\epsilon,\delta)$, we have with probability at least $1-\delta$ over the sample generation $S_{\mathcal{X}}=(\mathbf{x}_{1},\dots,\mathbf{x}_{n})\sim P_{\mathcal{X}}^{n}$, for any $\leadsto\in\mathcal{G}$,

\[d_{\mathcal{H},P_{\mathcal{X}}}(\to,\leadsto)<d_{\mathcal{H},S_{\mathcal{X}}}(\to,\leadsto)+\epsilon.\]

Furthermore, $n_{\mathrm{dist}}(\epsilon,\delta)\in O\!\left(\frac{d+\log\frac{1}{\delta}}{\epsilon^{2}}\right)$.

Combining Theorem 7 and Theorem 6, we can thus conclude that it is indeed possible to learn $\leadsto\in\mathcal{G}$ such that using $\ell^{\leadsto}$ as a surrogate loss function guarantees a good approximation of the true strategic loss $\ell^{\to}$.
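
Putting the pieces together, a learner may select the graph in $\mathcal{G}$ that minimizes the empirical $\mathcal{H}$-$S_{\mathcal{X}}$-distance to the observed manipulation sets. The sketch below (illustrative Python; finite $\mathcal{H}$ and $\mathcal{G}$ are assumed so that the suprema and minima can be enumerated) shows this selection rule on a toy instance.

```python
# Illustrative graph selection: pick the candidate graph minimizing the
# empirical H-S_X-distance to the observed manipulation sets B_->(x).

def component_from_set(h, x, B):
    return int(h(x) == 0 and any(h(xp) == 1 for xp in B))

def empirical_distance(hypotheses, sample, graph):
    # sample: pairs (x, B_->(x)) observed from the true graph.
    return max(sum(abs(component_from_set(h, x, B) -
                       component_from_set(h, x, graph.get(x, ())))
                   for x, B in sample) / len(sample)
               for h in hypotheses)

def select_graph(candidates, hypotheses, sample):
    return min(candidates, key=lambda g: empirical_distance(hypotheses, sample, g))

# Toy usage: the true graph has edges n -> n + 1; one candidate omits 2 -> 3.
sample = [(1, {2}), (2, {3}), (3, {4})]
hypotheses = [(lambda a: (lambda x: int(x >= a)))(a) for a in (1.5, 2.5, 3.5)]
candidates = [{1: {2}, 2: set(), 3: {4}}, {1: {2}, 2: {3}, 3: {4}}]
print(select_graph(candidates, hypotheses, sample) == candidates[1])   # True
```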

6 Conclusion

In this paper, we introduced a new strategic loss, which incentivizes correct classification under strategic manipulations. We also incorporate the idea of social burden into our notion of loss. We differentiated this loss from previous formal frameworks designed to mitigate strategic manipulation. In particular, we showed that optimizing for our strategic loss can yield satisfactory classification rules, even if there is no incentive-compatible hypothesis in the class that performs well on the classification task at hand. In addition, the loss formulation yields desirable effects in terms of sample complexity. Our work opens various avenues for further investigations and we hope it will inspire follow-up studies on the connections between a hypothesis class and the underlying manipulation graphs, effects of these connections, as well as learnability of the manipulation graph.

7 Acknowledgements

Ruth Urner is also a faculty affiliate member of Toronto’s Vector Institute. Her research is supported by an NSERC Discovery grant. Tosca Lechner is also a graduate student of the Vector Institute and was supported by a Vector Research Grant.

References

  • Ben-David et al. (2010) Ben-David, S.; Blitzer, J.; Crammer, K.; Kulesza, A.; Pereira, F.; and Vaughan, J. W. 2010. A theory of learning from different domains. Mach. Learn., 79(1-2): 151–175.
  • Blumer et al. (1989) Blumer, A.; Ehrenfeucht, A.; Haussler, D.; and Warmuth, M. K. 1989. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM (JACM), 36(4): 929–965.
  • Brückner and Scheffer (2011) Brückner, M.; and Scheffer, T. 2011. Stackelberg games for adversarial prediction problems. In Proceedings of the 17th ACM International Conference on Knowledge Discovery and Data Mining SIGKDD, 547–555.
  • Cullina, Bhagoji, and Mittal (2018) Cullina, D.; Bhagoji, A. N.; and Mittal, P. 2018. PAC-learning in the presence of adversaries. In Advances in Neural Information Processing Systems, 230–241.
  • Dalvi et al. (2004) Dalvi, N. N.; Domingos, P. M.; Mausam; Sanghai, S. K.; and Verma, D. 2004. Adversarial classification. In Proceedings of the Tenth ACM International Conference on Knowledge Discovery and Data Mining SIGKDD, 99–108.
  • Dong et al. (2018) Dong, J.; Roth, A.; Schutzman, Z.; Waggoner, B.; and Wu, Z. S. 2018. Strategic Classification from Revealed Preferences. In Proceedings of the 2018 ACM Conference on Economics and Computation, EC, 55–70.
  • Feige, Mansour, and Schapire (2015) Feige, U.; Mansour, Y.; and Schapire, R. 2015. Learning and inference in the presence of corrupted inputs. In Conference on Learning Theory, 637–657.
  • Haghtalab et al. (2020) Haghtalab, N.; Immorlica, N.; Lucier, B.; and Wang, J. Z. 2020. Maximizing Welfare with Incentive-Aware Evaluation Mechanisms. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI, 160–166.
  • Hardt et al. (2016) Hardt, M.; Megiddo, N.; Papadimitriou, C. H.; and Wootters, M. 2016. Strategic Classification. In Proceedings of the 2016 ACM Conference on Innovations in Theoretical Computer Science, ITCS, 111–122.
  • Haussler and Welzl (1987) Haussler, D.; and Welzl, E. 1987. epsilon-Nets and Simplex Range Queries. Discret. Comput. Geom., 2: 127–151.
  • Hu, Immorlica, and Vaughan (2019) Hu, L.; Immorlica, N.; and Vaughan, J. W. 2019. The Disparate Effects of Strategic Manipulation. In Proceedings of the Conference on Fairness, Accountability, and Transparency, FAT*, 259–268.
  • Jagadeesan, Mendler-Dünner, and Hardt (2021) Jagadeesan, M.; Mendler-Dünner, C.; and Hardt, M. 2021. Alternative Microfoundations for Strategic Classification. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 4687–4697.
  • Mansour, Mohri, and Rostamizadeh (2009) Mansour, Y.; Mohri, M.; and Rostamizadeh, A. 2009. Domain Adaptation: Learning Bounds and Algorithms. In The 22nd Conference on Learning Theory, COLT.
  • Miller, Milli, and Hardt (2020) Miller, J.; Milli, S.; and Hardt, M. 2020. Strategic Classification is Causal Modeling in Disguise. In Proceedings of the 37th International Conference on Machine Learning, ICML, 6917–6926.
  • Milli et al. (2019) Milli, S.; Miller, J.; Dragan, A. D.; and Hardt, M. 2019. The Social Cost of Strategic Classification. In Proceedings of the Conference on Fairness, Accountability, and Transparency, FAT*, 230–239.
  • Montasser, Hanneke, and Srebro (2019) Montasser, O.; Hanneke, S.; and Srebro, N. 2019. VC Classes are Adversarially Robustly Learnable, but Only Improperly. In Conference on Learning Theory, COLT, 2512–2530.
  • Montasser, Hanneke, and Srebro (2021) Montasser, O.; Hanneke, S.; and Srebro, N. 2021. Adversarially Robust Learning with Unknown Perturbation Sets. In Conference on Learning Theory, COLT 2021, 3452–3482.
  • Moran and Yehudayoff (2016) Moran, S.; and Yehudayoff, A. 2016. Sample compression schemes for VC classes. Journal of the ACM (JACM), 63(3): 1–10.
  • Perdomo et al. (2020) Perdomo, J. C.; Zrnic, T.; Mendler-Dünner, C.; and Hardt, M. 2020. Performative Prediction. In Proceedings of the 37th International Conference on Machine Learning, ICML, 7599–7609.
  • Shalev-Shwartz and Ben-David (2014) Shalev-Shwartz, S.; and Ben-David, S. 2014. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.
  • Sundaram et al. (2021) Sundaram, R.; Vullikanti, A.; Xu, H.; and Yao, F. 2021. PAC-Learning for Strategic Classification. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 9978–9988.
  • Tsirtsis and Rodriguez (2020) Tsirtsis, S.; and Rodriguez, M. G. 2020. Decisions, Counterfactual Explanations and Strategic Behavior. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems NeurIPS.
  • Valiant (1984) Valiant, L. G. 1984. A Theory of the Learnable. Commun. ACM, 27(11): 1134–1142.
  • Vapnik and Chervonenkis (1971) Vapnik, V. N.; and Chervonenkis, A. Y. 1971. On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities. Theory of Probability & Its Applications, 16(2): 264–280.
  • Zhang and Conitzer (2021) Zhang, H.; and Conitzer, V. 2021. Incentive-Aware PAC Learning. In Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI, 5797–5804.

Appendix A Statistical Learning Theory basics

In this section, we detail the standard setup of statistical learning theory for classification which we employ in our work.

A.1 Learning theoretic setup and definitions

We let $\mathcal{X}\subseteq\mathbb{R}^{d}$ denote the domain and $\mathcal{Y}$ (mostly $\mathcal{Y}=\{0,1\}$) a (binary) label space. We model the data generating process as a distribution $P$ over $\mathcal{X}\times\mathcal{Y}$ and let $P_{\mathcal{X}}$ denote the marginal of $P$ over $\mathcal{X}$. We use the notation $(\mathbf{x},y)\sim P$ to indicate that $(\mathbf{x},y)$ is a sample from distribution $P$ and $S\sim P^{n}$ to indicate that the set $S$ is a sequence (for example a training or test data set) of $n$ i.i.d. samples from $P$. Further, we use the notation $\eta_{P}(\mathbf{x})=\mathbb{P}_{(\mathbf{x},y)\sim P}[y=1\mid\mathbf{x}]$ to denote the regression or conditional labeling function of $P$. We say that the distribution has deterministic labels if $\eta_{P}(\mathbf{x})\in\{0,1\}$ for all $\mathbf{x}\in\mathcal{X}$.

A classifier or hypothesis is a function $h:\mathcal{X}\to\mathcal{Y}$. A classifier $h$ can naturally be viewed as a subset of $\mathcal{X}\times\mathcal{Y}$, namely $h=\{(\mathbf{x},y)\in\mathcal{X}\times\mathcal{Y}\mid\mathbf{x}\in\mathcal{X},\ y=h(\mathbf{x})\}$. We let $\mathcal{F}$ denote the set of all Borel measurable functions from $\mathcal{X}$ to $\mathcal{Y}$ (or all functions in the case of a countable domain). A hypothesis class is a subset of $\mathcal{F}$, often denoted by $\mathcal{H}\subseteq\mathcal{F}$.

The quality of prediction of a hypothesis on an input/output pair $(x,y)$ is measured by a loss function $\ell:\mathcal{F}\times\mathcal{X}\times\mathcal{Y}\to\mathbb{R}$. For classification problems, the quality of prediction is typically measured with the binary or classification loss

$$\ell^{0/1}(h,x,y)=\mathds{1}\left[h(x)\neq y\right],$$

where $\mathds{1}\left[\alpha\right]$ denotes the indicator function for predicate $\alpha$.

We denote the expected loss (or true loss) of a hypothesis $h$ with respect to the distribution $P$ and loss function $\ell$ by

$$\mathcal{L}_{P}(h)=\mathbb{E}_{(x,y)\sim P}[\ell(h,x,y)].$$

In particular, we denote the true binary loss of a classifier $h$ by $\mathcal{L}^{0/1}_{P}(h)$. The quality of an output classifier is typically compared with the best possible (binary) loss achievable on distribution $P$ with hypotheses from a fixed class $\mathcal{H}$. This quantity is referred to as the approximation error of $\mathcal{H}$ with respect to $P$ and denoted by

$$\mathrm{opt}^{0/1}_{P}(\mathcal{H})=\inf_{h\in\mathcal{H}}\mathcal{L}^{0/1}_{P}(h).$$

The empirical loss of a hypothesis $h$ with respect to loss function $\ell$ and a sample $S=((x_{1},y_{1}),\ldots,(x_{n},y_{n}))$ is defined as

$$\mathcal{L}_{S}(h)=\frac{1}{n}\sum_{i=1}^{n}\ell(h,x_{i},y_{i}).$$
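As a concrete illustration of these definitions (not part of the original development), the following minimal Python sketch computes the empirical binary loss of a hypothesis and performs empirical risk minimization (ERM, as discussed in Section A.2) over a finite hypothesis class; the function names and the toy threshold class are our own.

```python
from typing import Callable, List, Sequence, Tuple

Hypothesis = Callable[[float], int]
Sample = Sequence[Tuple[float, int]]

def empirical_binary_loss(h: Hypothesis, sample: Sample) -> float:
    """L_S(h) for the binary loss: the fraction of sample points that h mislabels."""
    return sum(1 for x, y in sample if h(x) != y) / len(sample)

def erm(hypotheses: List[Hypothesis], sample: Sample) -> Hypothesis:
    """Empirical Risk Minimization over a finite class: return a minimizer of L_S."""
    return min(hypotheses, key=lambda h: empirical_binary_loss(h, sample))

# Toy usage: threshold classifiers on the real line as a (finite) hypothesis class.
thresholds = [lambda x, t=t: int(x >= t) for t in (-1.0, 0.0, 1.0, 2.0)]
S = [(-0.5, 0), (0.3, 1), (1.7, 1), (-2.0, 0)]
best = erm(thresholds, S)
print(empirical_binary_loss(best, S))  # 0.0 on this toy sample
```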

A learner $\mathcal{A}$ is a function that takes in a finite sequence of labeled instances $S=((x_{1},y_{1}),\ldots,(x_{n},y_{n}))$ and outputs a hypothesis $h=\mathcal{A}(S)$. The following is a standard notion of (PAC-)learnability of a hypothesis class from finite samples (Vapnik and Chervonenkis 1971; Valiant 1984; Blumer et al. 1989; Shalev-Shwartz and Ben-David 2014).

Definition 6 (PAC learnability).

We say that a learner $\mathcal{A}$ PAC learns (or simply learns) a hypothesis class $\mathcal{H}$ with respect to a set of distributions $\mathcal{P}$ and a loss function $\ell$ if, for every $\epsilon,\delta>0$, there is a sample size $n(\epsilon,\delta)$ such that, for every $P\in\mathcal{P}$ and all $n\geq n(\epsilon,\delta)$, we have

$$\mathbb{P}_{S\sim P^{n}}\left[\mathcal{L}_{P}(\mathcal{A}(S))\leq\mathrm{opt}_{P}(\mathcal{H})+\epsilon\right]\geq 1-\delta.$$

We say that $\mathcal{A}$ agnostically learns $\mathcal{H}$ if the above holds with respect to the class of all data-generating distributions (subject to some mild, standard measurability conditions). We say that $\mathcal{A}$ learns $\mathcal{H}$ in the realizable case if the above holds with respect to the class $\mathcal{P}$ of distributions for which there exists a classifier $h^{*}\in\mathcal{H}$ with $\mathcal{L}_{P}(h^{*})=0$.

The smallest function $n:[0,1]^{2}\to\mathbb{N}$ for which there exists a learner $\mathcal{A}$ that learns class $\mathcal{H}$ in the above sense is called the sample complexity of $\mathcal{H}$.

Definition 7 (Proper versus improper learning).

If a learner $\mathcal{A}$ always outputs a function $\mathcal{A}(S)\in\mathcal{H}$ from the hypothesis class $\mathcal{H}$, we call $\mathcal{A}$ a proper learner for $\mathcal{H}$ (and otherwise we call $\mathcal{A}$ an improper learner for $\mathcal{H}$). If Definition 6 for learnability holds with a proper learner for $\mathcal{H}$, we call the class $\mathcal{H}$ properly (PAC) learnable.

It is well known that a binary hypothesis class $\mathcal{H}$ is learnable (with a proper learner; both agnostically and in the realizable case) in the above sense if and only if $\mathcal{H}$ has finite VC-dimension $\mathrm{VC}(\mathcal{H})$ (see Definition 2 in the main part of the paper, and Section A.2 below for more background on loss classes), and that the sample complexity is $\tilde{\Theta}\left(\frac{\mathrm{VC}(\mathcal{H})\log(1/\delta)}{\epsilon}\right)$ in the realizable case and $\Theta\left(\frac{\mathrm{VC}(\mathcal{H})\log(1/\delta)}{\epsilon^{2}}\right)$ in the agnostic case (Shalev-Shwartz and Ben-David 2014).

A.2 The role of VC-dimension of loss classes for learnability

Standard VC-theory tells us that, for the binary classification loss, the sample complexity of learning a hypothesis class is determined by the VC-dimension of the class (and thus by the VC-dimension of the loss class, since these coincide for the binary loss). Any learner that follows the ERM (Empirical Risk Minimization) principle is a successful learner with respect to the binary loss for classes of bounded VC-dimension $d$. We briefly recap the underpinnings of this result. For the realizable case, a sample $S\sim P^{n}$ of size at least $\tilde{O}(d\log(1/\delta)/\epsilon)$ is, with probability at least $1-\delta$, an $\epsilon$-net for the loss class (Haussler and Welzl 1987); thus, for every hypothesis with true loss at least $\epsilon$, the sample contains a point witnessing this. Choosing a hypothesis of zero empirical loss then guarantees true loss at most $\epsilon$. In the agnostic case, a sample of size $O(d\log(1/\delta)/\epsilon^{2})$ is an $\epsilon$-approximation for the loss class, that is, with high probability we have $|\mathcal{L}_{P}(h)-\mathcal{L}_{S}(h)|\leq\epsilon$ for every $h\in\mathcal{H}$, which in turn implies that any ERM learner is a successful agnostic PAC learner. For the binary loss, there are matching lower bounds on the sample complexity of learning VC-classes. However, unlike for the binary loss, the VC-dimension of the strategic loss class does not imply a lower bound on the sample complexity (as we see in Theorem 3).
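For completeness, the step from $\epsilon$-approximation to agnostic ERM success can be spelled out as follows. Suppose $S$ is an $\epsilon$-approximation for the loss class and let $\hat{h}\in\operatorname{argmin}_{h\in\mathcal{H}}\mathcal{L}_{S}(h)$ be any empirical risk minimizer. Then, for every $h^{*}\in\mathcal{H}$,

$$\mathcal{L}_{P}(\hat{h})\leq\mathcal{L}_{S}(\hat{h})+\epsilon\leq\mathcal{L}_{S}(h^{*})+\epsilon\leq\mathcal{L}_{P}(h^{*})+2\epsilon,$$

and taking the infimum over $h^{*}\in\mathcal{H}$ gives $\mathcal{L}_{P}(\hat{h})\leq\mathrm{opt}_{P}(\mathcal{H})+2\epsilon$; rescaling $\epsilon$ to $\epsilon/2$ recovers the stated $O(d\log(1/\delta)/\epsilon^{2})$ sample complexity.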

Appendix B Example for bounded VC dimension of strategic component

To illustrate the conditions in Theorem 4, we here provide a few natural examples where the VC-dimension of the strategic components can be bounded:

Example 8.
  1. $\mathcal{M}=(\mathcal{X},\to)$ being complete implies $\mathrm{VC}(\mathcal{H}_{\ell^{\to,\perp}})=\mathrm{VC}(\mathcal{H})$.

  2. If $\mathcal{M}$ corresponds to a partial order over $\mathcal{X}$ (in particular, $\mathcal{M}$ is acyclic) and the class $\mathcal{H}$ is a subset of complements of initial segments in $\mathcal{M}$ (that is, if for some $h\in\mathcal{H}$ and $\mathbf{x}\in\mathcal{X}$ we have $h(\mathbf{x})=0$, then we also have $h(\mathbf{x}^{\prime})=0$ for all $\mathbf{x}^{\prime}$ that precede $\mathbf{x}$ in the partial order), then the VC-dimension of $\mathcal{H}_{\ell^{\to,\perp}}$ is bounded by the size of a largest antichain in $\mathcal{M}$.

  3. If $\mathcal{X}=\mathbb{R}^{d}$, $\mathcal{H}$ consists of linear classifiers, and the plausible manipulations $\mathbf{x}\to\mathbf{x}^{\prime}$ are determined by $\|\mathbf{x}-\mathbf{x}^{\prime}\|_{p}\leq r$ (two points being close in some standard norm), then the VC-dimension of $\mathcal{H}_{\ell^{\to,\perp}}$ is bounded by $2d+2$.

  4. If $\mathcal{X}=\mathbb{R}^{d}$, $\mathcal{H}$ consists of linear classifiers, and we have a plausible manipulation $\mathbf{x}\to\mathbf{x}^{\prime}$ if and only if $\mathbf{x}$ and $\mathbf{x}^{\prime}$ differ in only one coordinate, then the VC-dimension of the resulting class of strategic components is $d+1$ (since for every $h$ and every point $\mathbf{x}$ with $h(\mathbf{x})=0$, there is an $\mathbf{x}^{\prime}$ such that $\mathbf{x}\to\mathbf{x}^{\prime}$ is an edge and $h(\mathbf{x}^{\prime})=1$; thus, for every half-space $h$, the set of zeros of $h$ that are connected to a $1$ is exactly $\{\mathbf{x}:h(\mathbf{x})=0\}$, and therefore $\mathrm{VC}(\mathcal{H}_{\ell^{\to,\perp}})=\mathrm{VC}(\mathcal{H})$). A small sketch illustrating this case follows the list.
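As an illustration of item 4 (and not part of the original analysis), the following Python sketch checks numerically that, for a homogeneous linear classifier $h(\mathbf{x})=\mathds{1}[\langle\mathbf{w},\mathbf{x}\rangle\geq 0]$ with $\mathbf{w}\neq 0$, every negatively classified point can be manipulated into the positive region by changing a single coordinate. The function names are ours, and the homogeneous form of the classifier is an assumption made for simplicity.

```python
import numpy as np

def linear_predict(w: np.ndarray, x: np.ndarray) -> int:
    """Homogeneous linear classifier: label 1 iff <w, x> >= 0."""
    return int(np.dot(w, x) >= 0)

def one_coordinate_manipulation(w: np.ndarray, x: np.ndarray) -> np.ndarray:
    """For a negatively classified x, return an x' differing from x in a single
    coordinate with <w, x'> >= 0 (possible whenever w has a nonzero entry)."""
    k = int(np.argmax(np.abs(w)))               # a coordinate with w_k != 0
    x_prime = x.copy()
    x_prime[k] = x[k] + (1.0 - np.dot(w, x)) / w[k]   # moves the inner product to 1
    return x_prime

rng = np.random.default_rng(0)
w = rng.normal(size=5)
for _ in range(1000):
    x = rng.normal(size=5)
    if linear_predict(w, x) == 0:
        x_prime = one_coordinate_manipulation(w, x)
        assert linear_predict(w, x_prime) == 1
        assert np.sum(x_prime != x) == 1        # exactly one coordinate was changed
print("every negative point has a one-coordinate manipulation to the positive side")
```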

Appendix C The VC dimension of the graph loss class

In Section 5 we analyse the learnability of an approximate manipulation graph from a predefined set of manipulation graphs $\mathcal{G}$ in $\mathcal{H}$-$P_{\mathcal{X}}$-distance in terms of the VC-dimension of the graph loss class $(\mathcal{H}\times\mathcal{G})_{\ell^{\mathrm{gr}}}$. We now give some examples for when this VC-dimension is indeed finite.

Example 9.
  1. If for every $h\in\mathcal{H}$ and every $\to\in\mathcal{G}$ we have $\mathrm{VC}((\{h\}\times\mathcal{G})_{glo})\leq d_{1}$ and $\mathrm{VC}((\mathcal{H}\times\{\to\})_{\ell^{\mathrm{gr}}})\leq d_{2}$, then $\mathrm{VC}((\mathcal{H}\times\mathcal{G})_{\ell^{\mathrm{gr}}})\leq d_{1}d_{2}$.

  2. Let $\mathcal{H}_{lin}=\{h:\mathbb{R}^{d}\rightarrow\{0,1\}\,:\,\exists w\in\mathbb{R}^{d}:h(x)=1\text{ if and only if }x^{T}w\geq 0\}$ be the class of linear separators, and let $\mathcal{G}_{lin}=\{\to\,:\,\exists w\in\mathbb{R}^{d}:f_{\to}(x)=B_{x^{T}w}(x)\}$ be the class consisting of graphs whose neighborhood sets are balls whose radius may depend linearly on the feature vector. Then the VC-dimension of $\mathcal{H}_{lin}\times\mathcal{G}_{lin}$ is finite.

Appendix D Proofs

Observation 1.

For any $d\in\mathbb{N}\cup\{\infty\}$ there exists a class $\mathcal{H}$ and a manipulation graph $\mathcal{M}=(\mathcal{X},\to)$ with $\mathrm{VC}(\mathcal{H})=1$ and $\mathrm{VC}(\mathcal{H}_{\ell^{\to}})\geq d$.

Proof of Observation 1:   Let $\mathcal{X}$ be some infinite domain. We treat the case $d\in\mathbb{N}$ first. We consider a set $\mathcal{U}=\{\mathbf{x}_{1},\mathbf{x}_{2},\ldots,\mathbf{x}_{d}\}$ of $d$ domain points and a disjoint set $\mathcal{V}=\{\mathbf{z}_{1},\mathbf{z}_{2},\ldots,\mathbf{z}_{2^{d}}\}$ of $2^{d}$ domain points. We associate each subset of $\mathcal{U}$ with exactly one point in $\mathcal{V}$. We design a manipulation graph $\mathcal{M}$ by adding an edge $\mathbf{x}_{i}\to\mathbf{z}_{j}$ if and only if $\mathbf{x}_{i}$ is a member of the subset associated with $\mathbf{z}_{j}$, and include no other edges. Now we consider the class $\mathcal{H}$ of singletons over $\mathcal{V}$, that is, $\mathcal{H}=\{h_{1},h_{2},\ldots,h_{2^{d}}\}$ consists of $2^{d}$ functions with $h_{j}(\mathbf{x})=\mathds{1}\left[\mathbf{x}=\mathbf{z}_{j}\right]$; that is, $h_{j}$ assigns label $0$ to every point in the domain except for $\mathbf{z}_{j}$. Then we have $\mathrm{VC}(\mathcal{H})=1$. However, the loss class $\mathcal{H}_{\ell^{\to}}$ shatters the set $\{(\mathbf{x}_{1},0),(\mathbf{x}_{2},0),\ldots,(\mathbf{x}_{d},0)\}$ (and also the set $\{(\mathbf{x}_{1},1),(\mathbf{x}_{2},1),\ldots,(\mathbf{x}_{d},1)\}$). Thus, $\mathrm{VC}(\mathcal{H}_{\ell^{\to}})\geq d$. For the case $d=\infty$, we use the same construction with $\mathcal{U}=\mathbb{N}$ and $\mathcal{V}$ an uncountable set, again indexed by the power set of $\mathcal{U}$.
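The construction above can be verified computationally for small $d$. The following Python sketch is our own illustration; it assumes, in line with the shattering argument (and only for negatively labeled points), that the strategic loss of $h$ on a pair $(\mathbf{x},0)$ equals $1$ exactly when $\mathbf{x}$ has an edge to a point that $h$ labels $1$.

```python
from itertools import chain, combinations, product

d = 3
U = [f"x{i}" for i in range(d)]
# One point z_S per subset S of U, with an edge x_i -> z_S iff x_i belongs to S.
subsets = list(chain.from_iterable(combinations(range(d), r) for r in range(d + 1)))
Z = {S: f"z{idx}" for idx, S in enumerate(subsets)}
edges = {(U[i], Z[S]) for S in subsets for i in S}

def singleton(z):
    """h_j: assigns label 1 to the point z_j only."""
    return lambda point: int(point == z)

def strategic_loss_on_negative(h, x):
    """Assumed semantics: loss 1 iff x has an edge to a point that h labels 1."""
    return int(any(h(z) == 1 for z in Z.values() if (x, z) in edges))

# Shattering check: every 0/1-labeling of {(x_1,0),...,(x_d,0)} is realized by a singleton.
for target in product([0, 1], repeat=d):
    assert any(
        tuple(strategic_loss_on_negative(singleton(Z[S]), x) for x in U) == target
        for S in subsets
    )
print(f"all {2 ** d} labelings realized; the strategic loss class shatters the set")
```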

Observation 2.

For any hypothesis class $\mathcal{H}$ and any manipulation graph $\mathcal{M}=(\mathcal{X},\to)$, we have $\mathrm{VC}(\mathcal{H})\leq\mathrm{VC}(\mathcal{H}_{\ell^{\to}})$.

Proof of Observation 2:   Let $\{\mathbf{x}_{1},\mathbf{x}_{2},\ldots,\mathbf{x}_{d}\}$ be a set of points shattered by the hypothesis class $\mathcal{H}$. The set $\{(\mathbf{x}_{1},1),(\mathbf{x}_{2},1),\ldots,(\mathbf{x}_{d},1)\}$, with all labels set to $1$, is then shattered by the loss class $\mathcal{H}_{\ell^{\to}}$. To see this, note that for a function $h\in\mathcal{H}$ and a point $\mathbf{x}$, if $h(\mathbf{x})=1$, then $\ell^{\to}(h,\mathbf{x},1)=0$, that is, $(\mathbf{x},1)\notin h_{\ell^{\to}}$; and if $h(\mathbf{x})=0$, then $\ell^{\to}(h,\mathbf{x},1)=1$ (the point is either misclassified or incentivized to manipulate), so $(\mathbf{x},1)\in h_{\ell^{\to}}$. Thus, if the class $\mathcal{H}$ can shatter those points, the loss class shatters them as well.

Theorem 3.

For every $d\in\mathbb{N}\cup\{\infty\}$, there exists a hypothesis class $\mathcal{H}$ and a manipulation graph $\to$ with $\mathrm{VC}(\mathcal{H}_{\ell^{\to}})=d$ such that $\mathcal{H}$ is learnable with respect to the strategic loss with sample complexity $O(\log(1/\delta)/\epsilon)$ in the realizable case.

Proof of Theorem 3:   We consider the class of singletons from the proof of Observation 1. Note that, if the distribution is realizable with respect to the strategic loss, then there exists at most one point $\mathbf{z}_{j}$ such that $(\mathbf{z}_{j},1)$ has positive weight under the distribution. Further, the distribution cannot assign any weight to points $(\mathbf{x}_{i},1)$, and it can only assign positive weight to a point $(\mathbf{x}_{i},0)$ if $\mathbf{x}_{i}$ is not a member of the subset corresponding to $\mathbf{z}_{j}$. Now, if the point $(\mathbf{z}_{j},1)$ is a member of the training sample, the learner can output hypothesis $h_{j}$, and this hypothesis then has true strategic loss $0$. If the training sample does not contain any point with label $1$, the learner can output the hypothesis $h_{0}$ that assigns label $0$ to all points. In that case, we get $\mathcal{L}^{\to}_{P}(h_{0})=P(\mathbf{z}_{j},1)$. If the probability $P(\mathbf{z}_{j},1)$ were larger than $\epsilon$, then, by the standard argument, a sample of the stated size, being an $\epsilon$-net for singletons, would have contained the point with probability at least $1-\delta$.
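The learner described in this proof can be written down directly. The following Python sketch is our own rendering; points are represented by hashable identifiers and hypotheses by callables, as in the earlier sketches.

```python
def singleton_learner(sample):
    """Learner from the proof: if some (z_j, 1) appears in the sample, output the
    corresponding singleton h_j; otherwise output h_0, which labels everything 0."""
    positives = [x for x, y in sample if y == 1]
    if positives:
        z = positives[0]   # under strategic-loss realizability there is at most one such point
        return lambda point: int(point == z)
    return lambda point: 0

# Toy usage with the construction of Observation 1: a sample containing ("z5", 1)
# makes the learner output the singleton on "z5"; an all-negative sample yields h_0.
h = singleton_learner([("x0", 0), ("z5", 1)])
print(h("z5"), h("x0"))  # 1 0
```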

Theorem 5.

(Adaptation of Theorem 1 and Theorem 4 in Montasser, Hanneke, and Srebro (2019))
There exists a hypothesis class $\mathcal{H}$ with $\mathrm{VC}(\mathcal{H})=1$ that is not learnable with respect to the strategic loss by any proper learner $\mathcal{A}$ for $\mathcal{H}$, even in the realizable case. On the other hand, every class $\mathcal{H}$ of finite VC-dimension is learnable (by some improper learner).

Proof of Theorem 5:   The construction proving the first claim can be taken exactly as in the proof of Theorem 1 by Montasser, Hanneke, and Srebro (2019), by mapping the label $+1$ there to $0$ and the label $-1$ there to $1$. The positive result on general learnability with an improper learner is derived by Montasser, Hanneke, and Srebro (2019) by adapting a general compression scheme for VC-classes developed by Moran and Yehudayoff (2016). We can use the same adaptation to show learnability with respect to the strategic loss with the following modification: instead of just compressing a data sample $S$, the compression scheme for the adversarial loss first creates an inflated sample $S_{U}$ which adds the perturbation sets (and then discretizes these using Sauer's lemma to obtain a new finite data set). For the strategic loss, we instead add the label-conditioned perturbation sets (as described above); that is, we inflate the sample by adding points $(\mathbf{x}^{\prime},y)$, where $(\mathbf{x}_{i},y)$ is the data point of smallest index with $\mathbf{x}_{i}\to\mathbf{x}^{\prime}$. The discretization step and the remaining compression and decompression technique can be used identically to the adversarial-loss case.
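The label-conditioned inflation step can be sketched as follows (our own rendering; the discretization via Sauer's lemma and the compression and decompression stages of Montasser, Hanneke, and Srebro (2019) are not shown). The manipulation graph is represented by a hypothetical function `neighbors` that maps a point to the endpoints of its outgoing edges.

```python
def inflate_sample(sample, neighbors):
    """Label-conditioned inflation: add (x', y) for every x' reachable from a sample
    point; if several sample points reach the same x', the one with smallest index
    determines the label."""
    inflated = list(sample)
    claimed = {}                                   # x' -> smallest index that reaches it
    for i, (x, y) in enumerate(sample):
        for x_prime in neighbors(x):
            if x_prime not in claimed:
                claimed[x_prime] = i
                inflated.append((x_prime, y))
    return inflated

# Toy usage: both "a" and "b" can manipulate to "c"; the smallest-index point "a"
# determines the label attached to "c".
graph = {"a": ["c"], "b": ["c"], "c": []}
print(inflate_sample([("a", 1), ("b", 0)], lambda x: graph[x]))
# [('a', 1), ('b', 0), ('c', 1)]
```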

Theorem 6.

Let $\mathcal{H}$ be any hypothesis class and $\to,\leadsto$ two manipulation graphs. Then for any distribution $P$ over $\mathcal{X}\times\mathcal{Y}$ and any $h\in\mathcal{H}$ we have

$$\mathcal{L}^{\to}_{P}(h)\;\leq\;\mathcal{L}^{0/1}_{P}(h)+\mathcal{L}^{\leadsto,\perp}_{P}(h)+d_{\mathcal{H},P_{\mathcal{X}}}(\to,\leadsto)\;\leq\;2\mathcal{L}^{\leadsto}_{P}(h)+d_{\mathcal{H},P_{\mathcal{X}}}(\to,\leadsto).$$

Furthermore, by exchanging the roles of $\to$ and $\leadsto$ and rearranging, we get

$$\frac{1}{2}\mathcal{L}^{\leadsto}_{P}(h)-d_{\mathcal{H},P_{\mathcal{X}}}(\to,\leadsto)\leq\mathcal{L}^{\to}_{P}(h).$$

Proof of Theorem 6:  

$$\begin{aligned}
\mathcal{L}^{\to}_{P}(h) &= \mathcal{L}^{0/1}_{P}(h)+\mathcal{L}^{\to,\perp}_{P}(h)-\mathbb{E}_{(x,y)\sim P}\big[\mathds{1}[h(x)=0,\ y=1,\ \exists x^{\prime}:x\to x^{\prime},\ h(x^{\prime})=1]\big]\\
&\leq \mathcal{L}^{0/1}_{P}(h)+\mathcal{L}^{\leadsto,\perp}_{P}(h)+|\mathcal{L}^{\to,\perp}_{P}(h)-\mathcal{L}^{\leadsto,\perp}_{P}(h)|\\
&\leq \mathcal{L}^{0/1}_{P}(h)+\mathcal{L}^{\leadsto,\perp}_{P}(h)+d_{\mathcal{H},P_{\mathcal{X}}}(\to,\leadsto)\\
&\leq 2\mathcal{L}^{\leadsto}_{P}(h)+d_{\mathcal{H},P_{\mathcal{X}}}(\to,\leadsto).
\end{aligned}$$

By exchanging $\to$ and $\leadsto$ and rearranging, we get the second inequality.
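In more detail: since $d_{\mathcal{H},P_{\mathcal{X}}}(\to,\leadsto)$ is symmetric in its two arguments (it is a supremum of an absolute difference between the two strategic components), the chain above with the roles of $\to$ and $\leadsto$ exchanged gives

$$\mathcal{L}^{\leadsto}_{P}(h)\leq 2\mathcal{L}^{\to}_{P}(h)+d_{\mathcal{H},P_{\mathcal{X}}}(\to,\leadsto),$$

and dividing by two and using $\tfrac{1}{2}d_{\mathcal{H},P_{\mathcal{X}}}(\to,\leadsto)\leq d_{\mathcal{H},P_{\mathcal{X}}}(\to,\leadsto)$ yields the second inequality.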

Lemma 1.

Let $\mathrm{VC}((\mathcal{H}\times\mathcal{G})_{\ell^{\mathrm{gr}}})=d$. Then there is a function $n_{\mathrm{graph}}:(0,1)^{2}\to\mathbb{N}$ such that for any marginal distribution $P_{\mathcal{X}}$, any manipulation graph $\to$, and any sample $S=\{(x_{1},B_{\to}(x_{1})),\dots,(x_{n},B_{\to}(x_{n}))\}$ of size $n\geq n_{\mathrm{graph}}(\epsilon,\delta)$, we have, with probability at least $1-\delta$ over the sample generation $S_{\mathcal{X}}=(x_{1},\dots,x_{n})\sim P_{\mathcal{X}}^{n}$, for every $h\in\mathcal{H}$ and every $\leadsto\in\mathcal{G}$:

$$|\mathcal{L}^{\mathrm{gr}}_{(P_{\mathcal{X}},\to)}(h,\leadsto)-\mathcal{L}^{\mathrm{gr}}_{S}(h,\leadsto)|<\epsilon.$$

Furthermore, $n_{\mathrm{graph}}(\epsilon,\delta)\in O\left(\frac{d+\log\frac{1}{\delta}}{\epsilon^{2}}\right)$.

Proof of Lemma 1:   The result follows directly from the finite VC-dimension of $(\mathcal{H}\times\mathcal{G})_{\ell^{\mathrm{gr}}}$ and the fact that finite VC-dimension of a class yields uniform convergence for the class (Theorems 6.7 and 6.8 in Shalev-Shwartz and Ben-David (2014)).

Theorem 7.

Let $\mathrm{VC}((\mathcal{H}\times\mathcal{G})_{\ell^{\mathrm{gr}}})=d$. Then there is a function $n_{\mathrm{dist}}:(0,1)^{2}\to\mathbb{N}$ such that for any marginal distribution $P_{\mathcal{X}}$, any manipulation graph $\to$, and any sample $S=\{(x_{1},B_{\to}(x_{1})),\dots,(x_{n},B_{\to}(x_{n}))\}$ of size $n\geq n_{\mathrm{dist}}(\epsilon,\delta)$, we have, with probability at least $1-\delta$ over the sample generation $S_{\mathcal{X}}=(x_{1},\dots,x_{n})\sim P_{\mathcal{X}}^{n}$, for every $\leadsto\in\mathcal{G}$:

$$d_{\mathcal{H},P_{\mathcal{X}}}(\to,\leadsto)<d_{\mathcal{H},S_{\mathcal{X}}}(\to,\leadsto)+\epsilon.$$

Furthermore, $n_{\mathrm{dist}}(\epsilon,\delta)\in O\left(\frac{d+\log\frac{1}{\delta}}{\epsilon^{2}}\right)$.

Proof of Theorem 7:   Let $\leadsto\in\mathcal{G}$ and $S_{\mathcal{X}}\sim P_{\mathcal{X}}^{n}$. Then, using Lemma 1, we get with probability at least $1-\delta$:

$$\begin{aligned}
|d_{\mathcal{H},P_{\mathcal{X}}}(\to,\leadsto)-d_{\mathcal{H},S_{\mathcal{X}}}(\to,\leadsto)|
&=\Big|\sup_{h\in\mathcal{H}}\mathbb{E}_{x\sim P_{\mathcal{X}}}\big[|\ell^{\to,\perp}(h,x)-\ell^{\leadsto,\perp}(h,x)|\big]-\sup_{h\in\mathcal{H}}\frac{1}{n}\sum_{x_{i}\in S_{\mathcal{X}}}\ell^{\mathrm{gr}}(h,\leadsto,x_{i},B_{\to}(x_{i}))\Big|\\
&\leq\sup_{h\in\mathcal{H}}\Big|\mathbb{E}_{x\sim P_{\mathcal{X}}}\big[|\ell^{\to,\perp}(h,x)-\ell^{\leadsto,\perp}(h,x)|\big]-\frac{1}{n}\sum_{x_{i}\in S_{\mathcal{X}}}\ell^{\mathrm{gr}}(h,\leadsto,x_{i},B_{\to}(x_{i}))\Big|\\
&\leq\sup_{h\in\mathcal{H}}\big|\mathcal{L}^{\mathrm{gr}}_{(P_{\mathcal{X}},\to)}(h,\leadsto)-\mathcal{L}^{\mathrm{gr}}_{S}(h,\leadsto)\big|<\epsilon,
\end{aligned}$$

where the first inequality uses $|\sup_{h}a_{h}-\sup_{h}b_{h}|\leq\sup_{h}|a_{h}-b_{h}|$ and the final bound is Lemma 1.