Strategic Representation
Abstract
Humans have come to rely on machines for reducing excessive information to manageable representations. But this reliance can be abused—strategic machines might craft representations that manipulate their users. How can a user make good choices based on strategic representations? We formalize this as a learning problem, and pursue algorithms for decision-making that are robust to manipulation. In our main setting of interest, the system represents attributes of an item to the user, who then decides whether or not to consume. We model this interaction through the lens of strategic classification (Hardt et al. 2016), reversed: the user, who learns, plays first; and the system, which responds, plays second. The system must respond with representations that reveal ‘nothing but the truth’ but need not reveal the entire truth. Thus, the user faces the problem of learning set functions under strategic subset selection, which presents distinct algorithmic and statistical challenges. Our main result is a learning algorithm that minimizes error despite strategic representations, and our theoretical analysis sheds light on the trade-off between learning effort and susceptibility to manipulation.
1 Introduction
There is a growing recognition that learning systems deployed in social settings are likely to induce strategic interactions between the system and its users. One promising line of research in this domain is strategic classification (Hardt et al., 2016), which studies learning in settings where users can strategically modify their features in response to a learned classification rule. The primary narrative in strategic classification is one of self-interested users that act to ‘game’ the system, and of systems that should defend against this form of gaming behavior. In this paper, we initiate the study of reversed scenarios, in which it is the system that strategically games its users. We argue that this is quite routine: In settings where users make choices about items (e.g., click, buy, watch), decisions are often made on the basis of only partial information. But the choice of what information to present to users often lies in the hands of the system, which can utilize its representation power to promote its own goals. Here we formalize the problem of strategic representation, and study how learning can now aid users in choosing well despite strategic system behavior.
As a concrete example, consider a user browsing for a hotel in an online booking platform. Hotels on the platform are represented by a small set of images; as there are many images available for each hotel, the system must choose which subset of these to display. Clearly, the choice of representation can have a significant effect on users’ decision-making (whether to click the hotel, and subsequently whether to book it), and in turn, on the system’s profit. The system may well attempt to capitalize on the control it has over what information is presented to the user—at the expense of the user, who may be swayed to make sub-optimal decisions. Given that users rely on system-crafted representations for making decisions, choosing well requires users to account for the strategic nature of the representations they face.
Our goal in this paper is to study when and how learning can aid users in making decisions that are strategically robust. We model users as deciding based on a choice strategy mapping represented items to binary choices, which can be learned from the user’s previous experiences (e.g., hotels stayed in, and whether they were worthwhile). Since in reality full descriptions of items are often too large for humans to process effectively (e.g., hundreds or thousands of attributes), the role of the system is to provide users with compact representations (e.g., a small subset of attributes). We therefore model the system as responding to the user’s choice strategy with a representation mapping, which determines for each item the representation on which the user relies for making choices (i.e., choices are a function of representations). We focus on items that are discrete and composed of a subset of ground elements; accordingly, we consider representation mappings that are lossy but truthful, meaning they reveal a cardinality-constrained subset of the item’s full set of attributes, with the cardinality bound set exogenously.
Given that the system has representation power—how should users choose? The key idea underlying our formulation is that the system and its users have misaligned goals. A user wishes to choose items that are ‘worthwhile’ to her, as determined by her valuation function—in our case, a set function, which can reflect complex tastes (e.g., account for complementarity among the attributes, such as ‘balcony’ and ‘ocean view’ in the hotel example). Meanwhile, the system aims to maximize user engagement by choosing a feasible representation that will incite the user to choose the item. Importantly, while values are based on the true item, choices rely on representations—which the system controls. This causes friction: a representation may be optimal for the system, but may not align with user interests.
The challenge for users (and our goal in learning) is therefore to make good choices on the basis of strategic representations. Note that this is not a simple missing-value problem, since which attributes will be missing depends on how users choose, i.e., on their learned choice function. (Footnote: Consider the following subtlety: since users must commit to a choice strategy, the choice of how to handle missing features—itself a part of the strategy—determines which features will be missing.) Nor is this a problem that can be addressed by re-encoding representations categorically (with classes ‘0’, ‘1’, and ‘unknown’); to see this, note that given a subset representation, it is impossible to know which of the attributes that do not appear in it are withheld—and which are truly missing.
Subset representations.
We focus on subset representations since they provide a natural means to ensure that representations reveal ‘nothing but the truth’ (but not necessarily ‘the whole truth’). This is our primary modeling concern, which stems from realistic restrictions on the system (like consumer expectations, legal obligations, or business practices). Subset representations cleanly capture what we view as a fundamental tension between users and an informationally-advantageous system—the ability to withhold information; examples include retail items described by a handful of attributes; videos presented by several key frames; and movie recommendations described by a select set of reviews, to name a few.
Overview of our results.
We begin by showing that users who choose naïvely can easily be manipulated by a strategic system (Sec. 3). We then proceed to study users who learn (Sec. 4). Our main result is an efficient algorithm for learning (Thm. 4.7), which holds under a certain realizability assumption. The algorithm minimizes the empirical error over a hypothesis class of set-function classifiers, whose complexity is controlled by a parameter, thus allowing users to trade off expressivity and statistical efficiency. The algorithm builds upon several structural properties which we establish for set-function classifiers. Our key result here is that ‘folding’ the system’s strategic response into the hypothesis class results in an induced class having a simple form that makes it amenable to efficient optimization (Thm. 4.6).
Building on this, we continue to study the ‘balance of power’ (Sec. 5), as manifested in the interplay between the user’s hypothesis complexity (which reflects the power of the user) and the system’s representation range (which determines the power of the system). For a fixed range, we analyze how the choice of hypothesis complexity affects the user’s classification error, through its decomposition into estimation and approximation errors. For estimation, we give a generalization bound (Thm. 5.9), obtained by analyzing the VC dimension of the induced function class. For approximation, we give several results (e.g., Thm. 5.4) that link the expressivity of the user’s value function to the complexity of the learned choice function. Together, our results characterize how much is gained versus how much effort is invested in learning as the hypothesis complexity is varied. One conclusion is that even minimal learning can help significantly (whereas no learning can be catastrophic).
From the system’s perspective, and for a fixed user hypothesis complexity, we study how the representation range affects the system. Intuitively, we would expect that increasing the range should be beneficial to the system, as it provides more alternatives to choose from. However, perhaps surprisingly, we find that the system can increase its payoff by ‘tying its hands’ to a lower cardinality bound $k$. This is because $k$ upper-bounds not only the system’s range but also the ‘effective’ hypothesis complexity of the user (who gains nothing from exceeding it), and the lower this effective complexity, the better it is for the system (Lemma 5.10). The choice of the maximal item size turns out to be immaterial against fully strategic users, but highly consequential against users that are not.
1.1 Relation to Strategic Classification
Our formalization of strategic representation shares a deep connection to strategic classification (Hardt et al., 2016). Strategic representation and strategic classification share an underlying structure (a leading learning player who must take into account a strategic responder), but there are important differences. The first is conceptual: in our setting, roles are reversed—it is users who learn (and not the system), and the system strategically responds (and not users). This allows us to pursue questions regarding the susceptibility of users to manipulation, with different emphases and goals, while maintaining the ‘language’ of strategic classification.
The second difference is substantive: we consider inputs that are sets (rather than continuous vectors), and manipulations that hide information (rather than alter it). Technically, switching to discrete inputs can be done by utilizing the cost function to enforce truthfulness constraints as a ‘hard’ penalty on modifications. But this change is not cosmetic: since the system behaves strategically by optimizing over subsets, learning must account for set-relations between different objects in input space; this transforms the problem to one of learning set functions. (Footnote: This is a subtle point: since sets can be expressed as binary membership vectors, it is technically possible to use conventional vector-based approaches to learn. Nonetheless, these approaches cannot account for strategic behavior, since the best response implements a set operation, which is ill-defined for continuous vector inputs.) From a modeling perspective, subsets make our work compatible with classic attribute-based consumer decision theory (Lancaster, 1966).
Overall, we view the close connection to strategic classification as a strength of our formalization, showing that the framework of strategic classification is useful far beyond what was previously considered; and also that a fairly mild variation can lead to a completely different learning problem, with distinct algorithmic and statistical challenges.
1.2 Related Work (see also Appx. B)
Strategic classification.
Strategic classification is a highly active area of research. Recent works in this field include statistical learning characterizations (Zhang and Conitzer, 2021; Sundaram et al., 2021; Ghalme et al., 2021), practical optimization methods (Levanon and Rosenfeld, 2021, 2022), relaxation of key assumptions (Ghalme et al., 2021; Bechavod et al., 2022; Jagadeesan et al., 2021; Levanon and Rosenfeld, 2022; Eilat et al., 2022), relations to causal aspects (Miller et al., 2020; Chen et al., 2020a), and consideration of societal implications (Milli et al., 2019; Hu et al., 2019; Chen et al., 2020b; Levanon and Rosenfeld, 2021).
Two recent works are closely related to ours: Zrnic et al. (2021) consider a dynamic setting of repeated learning that can change the order of play; in contrast, we switch the roles. Krishnaswamy et al. (2021) consider information withholding by users (rather than the system). They aim to learn a truth-eliciting mechanism, which incentivizes the second player to reveal all information (i.e., ‘the whole truth’). Their mechanism ensures that withholding never occurs; in contrast, our goal is to predict despite strategic withholding (i.e., ‘nothing but the truth’).
Bayesian persuasion. In Bayesian persuasion (Kamenica and Gentzkow, 2011), a more-informed player (i.e., the system) uses its information advantage coupled with commitment power to influence the choices of a decision maker (i.e., the user). Works closest to ours are by Dughmi et al. (2015), who upper bound the number of signaled attributes, and by Haghtalab et al. (2021), who study strategic selection of anecdotes. Both works consider the human player as a learner (as do we). However, in our work, the order of play is reversed—the decision-maker (user) moves first and the more-informed player (system) follows. Bayesian persuasion also assumes that the system knows the user’s valuation, and crucially relies on both parties knowing the distribution of items. In contrast, we model the user as having only sample access to the item distribution, and the system as agnostic to it.
2 A Formalization of Strategic Representation
We begin by describing the setting from the perspective of the user, which is the learning entity in our setup. We then present the ‘types’ of users we study, and draw connections between our setting and others found in the literature.
2.1 Learning Setting
In our setup, a user is faced with a stream of items, and must choose which of these to consume. Items are discrete, with each item $x$ described by a subset of ground attributes, $x \subseteq [d]$, where $[d] = \{1, \dots, d\}$. We assume all feasible items have at most $n$ attributes, $|x| \le n$. The value of items for the user is encoded by a value function, $v : 2^{[d]} \to \mathbb{R}$. We say an item is worthwhile to the user if it has positive value, $v(x) > 0$, and use $y \in \{0,1\}$ to denote worthwhileness, i.e., $y = 1$ if $x$ is worthwhile, and $y = 0$ if it is not. Items are presented to the user as samples drawn i.i.d. from some unknown distribution $D$ over items, and for each item, the user must choose whether to consume it (e.g., click, buy, watch) or not.
We assume that the user makes choices regarding items by committing at the onset to a choice function $f$ that governs her choice behavior. In principle, the user is free to choose $f$ from some predefined function class $\mathcal{F}$; learning will consider finding a good $f$, but the implications of the choice of $\mathcal{F}$ itself will play a central role in our analysis. Ideally, the user would like to choose items if and only if they are worthwhile to her; practically, her goal is to find an $f$ for which this holds with large probability over $D$. For this, the user can make use of her knowledge regarding items she has already consumed, and therefore also knows their value; we model this as providing the user access to a labeled sample set $S = \{(x_i, y_i)\}_{i=1}^{m}$ where $x_i \sim D$ and $y_i = \mathbb{1}\{v(x_i) > 0\}$, which she can use for learning $f$.
Strategic representations. The difficulty in learning is that user choices at test time must rely only on item representations, denoted $z$, rather than on full item descriptions. Thus, learning considers choice functions that operate on representations, $f(z)$; the challenge lies in that while choices must be made on the basis of representations $z$, item values are derived from their full descriptions $x$—which representations describe only partially.
The crucial aspect of our setup is that representations are not arbitrary; rather, representations are controlled by the system, which can choose them strategically to promote its own goals. We model the system as acting through a representation mapping, $\rho$, which operates independently on any item $x$, and can be determined in response to the user’s choice of $f$. This mimics a setting in which a fast-acting system can infer and quickly respond to a user’s (relatively fixed) choice patterns.
We assume the system’s goal is to choose a mapping $\rho$ that maximizes expected user engagement:
$$\max_{\rho}\;\; \mathbb{E}_{x \sim D}\big[f(\rho(x))\big] \qquad (1)$$
Nonetheless, representations cannot be arbitrary, and we require $\rho$ to satisfy two properties. First, chosen representations must be truthful, meaning that $\rho(x) \subseteq x$ for all $x$. Second, representations are subject to cardinality constraints, $|\rho(x)| \le k$ for some predetermined $k$. We will henceforth call representations of cardinality at most $k$ feasible. Both requirements stem from realistic considerations: A non-truthful system which intentionally distorts item information is unlikely to be commercially successful in the long run; intuitively, truthfulness gives users some hope of resilience to manipulation. As for $n$ and $k$, we think of these as exogenous parameters of the environment, arising naturally due to physical restrictions (e.g., screen size) or cognitive considerations (e.g., information processing capacity); if $k < n$, we say representations are lossy.
Under these constraints, the system can optimize Eq. (1) by choosing representations via the best-response mapping:
$$\rho_f(x) \in \operatorname*{argmax}_{z \subseteq x,\; |z| \le k} f(z) \qquad (2)$$
Eq. (2) is a best-response since it maximizes Eq. (1) for any given $f$: for every item $x$, $\rho_f$ chooses a feasible representation $z$ that triggers a positive choice event, $f(z) = 1$—if such a $z$ exists. In this way, $n$ and $k$ control how much leeway the system has in revealing only partial truths; as we will show, both parameters play a key role in determining outcomes for both system and user. From now on we overload notation and use $\rho$ to refer to this best-response mapping.
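To make the best response concrete, the following Python sketch enumerates all truthful representations of an item and returns one that triggers a positive choice if any exists. It is a minimal illustration of Eq. (2), not the paper's implementation; the function names, the encoding of items as Python sets of attribute identifiers, and the arbitrary fallback tie-breaking are our own assumptions.

```python
from itertools import combinations

def feasible_reps(x, k):
    """All truthful representations of item x: subsets of x of size at most k."""
    for size in range(min(k, len(x)) + 1):
        for z in combinations(sorted(x), size):
            yield frozenset(z)

def best_response(f, x, k):
    """System's best response to choice function f on item x (Eq. (2)):
    return a feasible representation z with f(z) = 1 if one exists,
    otherwise any feasible representation (the tie-break is arbitrary)."""
    fallback = frozenset()
    for z in feasible_reps(x, k):
        if f(z) == 1:
            return z
        fallback = z
    return fallback
```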
Learning objective. Given a function class $\mathcal{F}$ and a labeled sample set $S$, the user aims to find a choice function $f \in \mathcal{F}$ that correctly identifies worthwhile items given their representation, and in a way that is robust to strategic system manipulation. The user’s objective is therefore to maximize:
$$\mathbb{E}_{(x,y) \sim D}\big[\mathbb{1}\{f(\rho_f(x)) = y\}\big] \qquad (3)$$
where $\rho_f$ is the best-response mapping in Eq. (2).
Note that since $f$ is binary, the argmax in Eq. (2) may not be unique; e.g., if some $z, z' \subseteq x$ both have $f(z) = f(z') = 1$. Nonetheless, the particular choice of maximizer does not matter—from the user’s perspective, the value she obtains from her choice of $f$ is invariant to the system’s choice of best-response (proof in Appx. C):
Observation 2.1.
Every best-response induces the same value in the user’s objective function (Eq. (3)).
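Observation 2.1 can also be checked directly in code: because the 0/1 objective depends on the system’s response only through the binary value of the user’s choice, any tie-breaking rule among maximizers yields the same payoff. The sketch below, which reuses the hypothetical `feasible_reps` and `best_response` helpers from the previous snippet, evaluates the empirical analog of Eq. (3) under two different tie-breaking rules.

```python
def payoff_user(f, sample, k, respond=best_response):
    """Empirical version of Eq. (3): fraction of labeled items on which the
    user's choice, made on the best-responded representation, matches the label."""
    return sum(1 for x, y in sample if f(respond(f, x, k)) == y) / len(sample)

def best_response_reversed(f, x, k):
    """A different (but still valid) best response: scan representations in reverse order."""
    zs = list(feasible_reps(x, k))
    for z in reversed(zs):
        if f(z) == 1:
            return z
    return zs[0]

# Any two best responses induce the same user payoff (Observation 2.1):
#   payoff_user(f, sample, k) == payoff_user(f, sample, k, respond=best_response_reversed)
```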
2.2 User Types
Our main focus throughout the paper will be on users that learn by optimizing Eq. (3). But to understand the potential benefit of learning, we also analyze ‘simpler’ types of user behavior. Overall, we study three user types, varying in their sophistication and the amount of effort they invest in choosing . These include:
-
•
The naïve user: Acts under the (false) belief that representations are chosen in her own best interest. This user type truthfully reports her preferences to the system by setting her worthwhileness function as her choice function, i.e., accepting a representation exactly when the representation itself is worthwhile. (Footnote: Note that while worthwhileness is defined over items, choice functions are defined over representations. Nonetheless, truthfulness implies that representations are themselves subsets of attributes, and so worthwhileness is well-defined as a choice function over representations.)
-
•
The agnostic user: Makes no assumptions about the system. This user employs a simple strategy that relies on basic data statistics and provides minimal but robust guarantees regarding her payoff.
-
•
The strategic user: Knows that the system is strategic, and anticipates it to best-respond. This user is willing to invest effort (in terms of data and compute) in learning a choice function that maximizes her payoff by accounting for the system’s strategic behavior.
Our primary goal is to study the balance of power between users (that choose) and the system (which represents). In particular, we will be interested in exploring the tradeoff between a user’s effort and her susceptibility to manipulation.
2.3 Strategic Representation as a Game
Before proceeding, we give an equivalent characterization of strategic representation as a game. Our setting can be compactly described as a single-step Stackelberg game: the first player is User, which observes samples $S$, and commits to a choice function $f$; the second player is System, which given $f$, chooses a truthful $\rho$ (note how $\rho$ depends on $f$). The payoffs are:
$$u_{\text{User}}(f, \rho) = \mathbb{E}_{(x,y) \sim D}\big[\mathbb{1}\{f(\rho(x)) = y\}\big] \qquad (4)$$
$$u_{\text{System}}(f, \rho) = \mathbb{E}_{x \sim D}\big[f(\rho(x))\big] \qquad (5)$$
Note that payoffs differ only in that User seeks correct choices, whereas System benefits from positive choices. This reveals a clear connection to strategic classification, in which System, who plays first, is interested in accurate predictions, and for this it can learn a classifier; and User, who plays second, can manipulate individual inputs (at some cost) to obtain positive predictions. Thus, strategic representation can be viewed as a variation on strategic classification, but with roles ‘reversed’. Nonetheless, and despite these structural similarities, strategic representation bears key differences: items are discrete (rather than continuous), manipulations are subject to ‘hard’ set constraints (rather than ‘soft’, continuous costs), and learning regards set functions (rather than vector functions). These differences lead to distinct questions and unique challenges in learning.
3 Warm-up: Naïve and Agnostic Users
The naïve user. The naïve user employs a ‘what you see is what you get’ policy: given a representation of an item, this user estimates the item’s value based on the representation alone, acting ‘as if’ it were the item itself. Consequently, the naïve user accepts exactly the representations that are themselves worthwhile, even though worthwhileness is truly a function of the underlying item. The naïve user fails to account for the system’s strategic behavior (let alone the fact that what she sees is a representation of some actual item).
Despite its naivety, there are conditions under which this user’s approach makes sense. Our first result shows that the naïve policy is sensible in settings where the system is benevolent, and promotes user interests instead of its own.
Lemma 3.1.
If the system plays the benevolent strategy—choosing for each item, whenever possible, a feasible representation whose worthwhileness agrees with that of the item—then the naïve approach maximizes user payoff.
Proof in Appx. D. The above lemma is not meant to imply that naïve users assume the system is benevolent; rather, it justifies why users having this belief might act in this way. Real systems, however, are unlikely to be benevolent; our next example shows that a strategic system can easily manipulate naïve users into receiving arbitrarily low payoff.
Example 1.
Let with and . Fix , and let . Note are the feasible representations. The naïve user assigns according to . For , a strategic system plays . The expected payoff to the user is .
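The following toy instance—our own illustrative construction, whose attributes and numbers are not those of Example 1—shows the same effect in code. Attribute 2 is a ‘dealbreaker’ that makes an item not worthwhile, yet with a cardinality bound of one the system can simply hide it, so a naïve user accepts almost every bad item. The snippet reuses the hypothetical `best_response` and `payoff_user` helpers from the earlier sketches.

```python
# Ground attributes: 1 = "ocean view", 2 = "construction noise" (a dealbreaker).
def v(x):
    """Item is worthwhile iff it has attribute 1 but not the dealbreaker 2."""
    return 1.0 if (1 in x and 2 not in x) else -1.0

def naive_user(z):
    """'What you see is what you get': judge the representation as if it were the item."""
    return 1 if v(z) > 0 else 0

k = 1
# 19 of 20 items carry the dealbreaker (label 0); one of 20 does not (label 1).
sample = [(frozenset({1, 2}), 0)] * 19 + [(frozenset({1}), 1)]

# The best-responding system shows every {1, 2} item as just {1}; the naive
# user accepts it and errs, so her payoff is only 1/20.
print(payoff_user(naive_user, sample, k))  # -> 0.05
```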
One reason a naïve user is susceptible to manipulation is that she does not make any use of the data she may have. We next describe a slightly more sophisticated user that uses a simple strategy to ensure a better payoff.
The agnostic user. The agnostic user puts all her faith in data; this user does not make assumptions on, nor is she susceptible to, the type of system she plays against. Her strategy is simple: collect data, compute summary statistics, and choose to either always accept or always reject (or flip a coin). In particular, given a sample set, the agnostic user first computes the fraction of positive examples. Then, for some tolerance, she accepts all items if this fraction is sufficiently above one half, rejects all items if it is sufficiently below one half, and flips a coin otherwise. In Example 1, an agnostic user would choose to always accept when the rate of positive examples is large, guaranteeing (as the sample grows) a payoff of at least that rate. Investing minimal effort, and for an appropriate choice of tolerance, this user’s strategy turns out to be quite robust.
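A minimal sketch of the agnostic rule, assuming—as our reading of the text—that the tolerance is taken around one half and that the coin is flipped once, when the user commits to her choice function:

```python
import random

def agnostic_choice(labels, tol=0.1, rng=random):
    """Agnostic user: commit to always-accept, always-reject, or a single coin
    flip, based only on the empirical fraction of positive examples."""
    p_hat = sum(labels) / len(labels)
    if p_hat >= 0.5 + tol:
        return lambda z: 1              # always accept
    if p_hat <= 0.5 - tol:
        return lambda z: 0              # always reject
    coin = int(rng.random() < 0.5)      # flip once, then commit
    return lambda z: coin

# Usage: f = agnostic_choice([y for _, y in sample]); this rule ignores
# representations entirely, and hence cannot be manipulated by the system.
```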
Theorem 3.2.
(Informal) Let $q$ be the true rate of positive examples. Then as the sample size grows, the agnostic user’s payoff approaches $\max\{q, 1-q\}$, at a rate governed by Hoeffding’s inequality.
Formal statement and proof in Appx. A.1. In essence, the agnostic user guarantees herself the ‘majority’ rate with rudimentary usage of her data, and in a way that does not depend on how the system responds. But this can be far from optimal; we now turn to the more elaborate strategic user, who makes more clever use of the data at her disposal.
4 Strategic Users Who Learn
A strategic agent acknowledges that the system is strategic, and anticipates that representations are chosen to maximize her engagement. Knowing this, the strategic user makes use of her previous experiences, in the form of a labeled data set $S = \{(x_i, y_i)\}_{i=1}^{m}$, to learn a choice function from some function class $\mathcal{F}$ that optimizes her payoff (given that the system is strategic). Cast as a learning problem, this is equivalent to minimizing the expected classification error on strategically-chosen representations:
$$\min_{f \in \mathcal{F}}\;\; \mathbb{E}_{(x,y) \sim D}\big[\mathbb{1}\{f(\rho_f(x)) \neq y\}\big] \qquad (6)$$
Since the distribution is unknown, we follow the conventional approach of empirical risk minimization (ERM) and optimize the empirical analog of Eq. (6):
$$\hat{f} \in \operatorname*{argmin}_{f \in \mathcal{F}}\;\; \frac{1}{m} \sum_{i=1}^{m} \mathbb{1}\{f(\rho_f(x_i)) \neq y_i\} \qquad (7)$$
Importantly, since every representation is a set, $\mathcal{F}$ must include set functions, and any algorithm for optimizing Eq. (7) must take this into account. In Sections 4.1 and 4.2, we characterize the complexity of a user’s choice function and relate it to the complexity of the value function, and in Section 4.3 we give an algorithm that computes $\hat{f}$, the empirical minimizer, for a hypothesis class of a given complexity.
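For concreteness, a sketch of the empirical objective in Eq. (7), again using the hypothetical `best_response` helper from Sec. 2: the key (and non-standard) feature is that the candidate classifier determines not only the predictions but also the representations on which it is evaluated.

```python
def strategic_empirical_error(f, sample, k):
    """Empirical 0/1 error of f on strategically chosen representations (Eq. (7))."""
    mistakes = 0
    for x, y in sample:
        z = best_response(f, x, k)   # the system reacts to the *candidate* f
        mistakes += int(f(z) != y)
    return mistakes / len(sample)
```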
4.1 Complexity Classes of Set Functions
Ideally, a learning algorithm should permit flexibility in choosing the complexity of the class of functions it learns (e.g., the degree of a polynomial kernel, the number of layers in a neural network), as this provides means to trade-off running time with performance and to reduce overfitting. In this section we propose a hierarchy of set-function complexity classes that is appropriate for our problem.
Denote by $Z = \{z \subseteq [d] : |z| \le k\}$ the set of all subsets of $[d]$ having size at most $k$, i.e., the space of feasible representations.
We start by defining $\ell$-order functions over the representation space. These functions are completely determined by weights placed on subsets of size at most $\ell$.
Definition 4.1.
We say a function $f$ over representations is of order $\ell$ if there exist real weights on sets of cardinality at most $\ell$, $\{w_S : S \subseteq [d], |S| \le \ell\}$, such that $f(z)$ thresholds the total weight of the subsets of $z$ of size at most $\ell$, i.e., $f(z) = \mathbb{1}\big\{\sum_{S \subseteq z,\, |S| \le \ell} w_S > 0\big\}$.
Not all functions can necessarily be expressed as an $\ell$-order function (for some $\ell$); nonetheless, in the context of optimizing Eq. (7), we show that working with $\ell$-order functions is sufficiently general, since any set function can be linked to a matching $\ell$-order function (for some $\ell$) through how it operates on strategic inputs.
Lemma 4.2.
For any choice function $f$, there exist an $\ell$ and a corresponding $\ell$-order function $f_\ell$ such that $f(\rho_f(x)) = f_\ell(\rho_{f_\ell}(x))$ for every item $x$.
Lem. 4.2 permits us to focus on $\ell$-order functions. The proof is constructive (see Appx. E), and the construction itself turns out to be highly useful. In particular, the proof constructs $f_\ell$ having a particular form of binary basis weights, which we assume from now on are fixed. Hence, every function has a corresponding binary-weighted $\ell$-order function, which motivates the following definition of functions and function classes.
Definition 4.3.
We say an $\ell$-order function with basis weights $\{w_S\}$ is binary-weighted if every weight takes one of two fixed values, w.l.o.g. $w_S \in \{0, 1\}$, and the function assigns the value $0$ to every representation of size larger than $k$.
A binary-weighted $\ell$-order function determines a family of subsets of size at most $\ell$—those whose weight is $1$—such that for any $z$ with $|z| \le k$, $f(z) = 1$ if and only if $z$ contains a subset from the family (for $z$ with $|z| > k$, $f(z) = 0$ always). This is made precise using the notion of lifted functions in Lem. A.4 in Appx. A.2. Next, denote by $\mathcal{F}_\ell$ the class of all binary-weighted $\ell$-order functions.
The classes $\mathcal{F}_\ell$ will serve as complexity classes for our learning algorithm; the user provides $\ell$ as input, and Alg outputs an $f \in \mathcal{F}_\ell$ that minimizes the empirical loss (assuming the minimal empirical error is zero). As we will show, using $\ell$ as a complexity measure provides the user direct control over the tradeoff between estimation and approximation error, as well as over the running time.
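The following sketch shows one concrete way to represent a binary-weighted $\ell$-order choice function, namely by the family of subsets whose weight is one; the class name, the containment test, and the convention that oversized representations are rejected are our own modeling assumptions based on Def. 4.3.

```python
class BinaryWeightedChoice:
    """A binary-weighted choice function, stored as the family of subsets
    (each of size at most ell) whose weight is 1; all other weights are 0."""

    def __init__(self, family, k):
        self.family = {frozenset(S) for S in family}
        self.k = k

    def __call__(self, z):
        z = frozenset(z)
        if len(z) > self.k:      # oversized representations are rejected (assumption)
            return 0
        return int(any(S <= z for S in self.family))

# e.g., BinaryWeightedChoice(family=[{1, 4}, {2}], k=3) accepts any representation
# of size at most 3 that contains both attributes 1 and 4, or contains attribute 2.
```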
Next, we show that the classes are strictly nested. This will be important for our analysis of approximation error, as it will let us reason about the connection between the learned and the target function (proof in Appx. E).
Lemma 4.4.
For all $\ell$, $\mathcal{F}_\ell \subseteq \mathcal{F}_{\ell+1}$ and $\mathcal{F}_\ell \neq \mathcal{F}_{\ell+1}$.
Note that the largest class includes all binary-weighted set functions, but since representations are of size at most $k$, it suffices to consider only $\mathcal{F}_k$. Importantly, $\ell$ can be set lower than $k$; for example, $\mathcal{F}_1$ is the class of threshold modular functions, and $\mathcal{F}_2$ is the class of threshold pairwise functions. The functions we consider are parameterized by their weights, and so any $\ell$-order function has at most $\sum_{i \le \ell} \binom{d}{i}$ weights. In this sense, the choice of $\ell$ is highly meaningful. Now that we have defined our complexity classes, we turn to discussing how they can be optimized over.
4.2 Learning via Reduction to Induced Functions
The simple structure of functions in makes them good candidates for optimization. But the main difficulty in optimizing the empirical error in Eq. (7) is that the choice of does not only determine the error, but also determines the inputs on which errors are measured (indirectly through the dependence of on ). To cope with this challenge, our approach is to work with induced functions that already have the system’s strategic response encoded within, which will prove useful for learning. Additionally, as they operate directly on (and not ), they can easily be compared with , which will become important in Sec. 5.
Definition 4.5.
For a class $\mathcal{F}$, its induced class is $\tilde{\mathcal{F}} = \{\tilde{f} : f \in \mathcal{F}\}$, where $\tilde{f}(x) = f(\rho_f(x))$.
The induced class includes for every $f \in \mathcal{F}$ a corresponding function $\tilde{f}$ that already has the best response $\rho_f$ integrated in it. We use $\tilde{\mathcal{F}}_\ell$ to denote the induced class of $\mathcal{F}_\ell$. For every $f$, we denote its induced function by $\tilde{f}$. Whereas functions $f$ operate on representations, induced functions operate directly on items, with each $\tilde{f}$ accounting internally for how the system strategically responds to $f$ on each item.
Our next theorem provides a key structural result: induced functions inherit the weights of their -order counterparts.
Theorem 4.6.
For any $f \in \mathcal{F}_\ell$ with weights $\{w_S\}$, its induced $\tilde{f}$ can be expressed using the same weights, but with summation over subsets of the item $x$ rather than of the representation $z$.
Thm. 4.6 is the main pillar on which our algorithm stands: it allows us to construct $\hat{f}$ by querying the loss directly—i.e., without explicitly computing the best response—by working with the induced $\tilde{f}$; this is since $\mathbb{1}\{f(\rho_f(x)) \neq y\} = \mathbb{1}\{\tilde{f}(x) \neq y\}$ for every labeled item $(x, y)$.
Thus, through their shared weights, induced functions serve as a bridge between what we optimize, and how.
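A sketch of the induced counterpart under the same assumptions as the previous snippet: it operates on the full item and never enumerates representations, yet—by the logic of Thm. 4.6—it agrees with the composition of the choice function and the system’s best response.

```python
class InducedChoice:
    """Induced version of a BinaryWeightedChoice f: operates on items directly,
    with the system's best response 'folded in' via the shared subset family."""

    def __init__(self, f):
        self.f = f

    def __call__(self, x):
        x = frozenset(x)
        # x admits a positive-response representation iff it contains some
        # family member that itself fits within the cardinality bound.
        return int(any(len(S) <= self.f.k and S <= x for S in self.f.family))

# Sanity check (under the sketch's assumptions), for any item x:
#   InducedChoice(f)(x) == f(best_response(f, x, f.k))
```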
4.3 Learning Algorithm
We now present our learning algorithm, Alg. The algorithm is exact: it takes as input a training set and a parameter $\ell$, and returns an $\hat{f} \in \mathcal{F}_\ell$ that minimizes the empirical loss (Eq. (7)). Correctness holds under the realizability condition that the labeling is induced by the class, i.e., the target is the induced function of some $f \in \mathcal{F}_\ell$. (Footnote: Note that even for standard linear binary classification, finding an empirical minimizer of the 0/1 loss in the agnostic (i.e., non-realizable) case is NP-hard (Shalev-Shwartz and Ben-David, 2014).)
The algorithm constructs $\hat{f}$ by sequentially computing its weights. As per Def. 4.3, only weights of sets of size at most $\ell$ must be learned; moreover, weights are sparse, in the sense that only a small subset of them are assigned the value $1$, while the rest are $0$. Weights can be implemented as a hash table, where $w_S = 1$ if $S$ is in the table, and $w_S = 0$ if it is not. Our next result establishes the correctness of Alg. The proof leverages a property that characterizes the existence of an $f$ having zero empirical error (see Lem. E.1). The proof of Lem. E.1 uses Thm. 4.6, which enables the loss to be computed directly for the induced functions using the shared weight structure.
Theorem 4.7.
For any $\ell$, if the sample is realizable then Alg returns an $\hat{f} \in \mathcal{F}_\ell$ that minimizes the empirical error.
Proof in Appx. E. Note that our algorithm is exact: it returns a true minimizer of the empirical 0/1 loss, assuming the sample is realizable. Additionally, Alg can be used to identify whether there exists an $f \in \mathcal{F}_\ell$ with zero empirical error: if at Step 15 some positive example admits no qualifying subset, then by Lem. E.1 in Appx. E there does not exist an $f \in \mathcal{F}_\ell$ with zero empirical error.
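The sketch below conveys the flavor of Alg rather than its exact pseudocode: under realizability, and assuming $\ell \le k$, each positive example must contain a ‘witness’ subset of size at most $\ell$ that is contained in no negative example; collecting one witness per positive example yields sparse binary weights with zero empirical error, and failure to find a witness certifies (as in Lem. E.1) that no consistent hypothesis of this order exists. The function and class names are our own.

```python
from itertools import combinations

def learn_family(sample, ell, k):
    """Simplified sketch of the learning step: return a BinaryWeightedChoice
    with zero empirical error, or None if no such function of order ell exists."""
    negatives = [frozenset(x) for x, y in sample if y == 0]
    family = set()
    for x, y in sample:
        if y != 1:
            continue
        witness = next(
            (frozenset(S)
             for size in range(1, min(ell, len(x)) + 1)
             for S in combinations(sorted(x), size)
             if not any(frozenset(S) <= x_neg for x_neg in negatives)),
            None,
        )
        if witness is None:
            return None          # no zero-error hypothesis in this class
        family.add(witness)
    return BinaryWeightedChoice(family, k)
```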
Lemma 4.8.
Let $n$ be the maximal size of items, $m$ the number of samples, and $\ell$ the user’s choice of complexity. Then Alg runs in time polynomial in $m$ and $n^{\ell}$.
This runtime is made possible due to several key factors: (i) only weights of sets of size at most $\ell$ need to be learned, (ii) all weights are binary-valued, and (iii) loss queries are efficient in induced space. Nonetheless, when $n$ and $\ell$ are large, runtime may be significant, and so $\ell$ must be chosen with care. Fortunately, our results in Sec. 5.1.1 give encouraging evidence that learning with small $\ell$—even $\ell = 1$—is quite powerful (assuming the sample is realizable).
In the analysis, this runtime is made possible only since weights are sparse, and since Alg operates on a finite sample set of size $m$. Alternatively, if $m$ is large, then the dependence on $m$ can be replaced with a dependence on the number of candidate weight sets. This turns out to be necessary; in Appx. A.3 we show that, in the limit, such a dependence is a lower bound.
5 Balance of Power
Our final section explores the question: what determines the balance of power between the system and its users? We begin with the perspective of the user, who has commitment power, but can only minimize the empirical error. For her, the choice of complexity class is key in balancing approximation error—how well (in principle) functions in the class can approximate her true worthwhileness—and estimation error—how close the empirical payoff of the learned $\hat{f}$ is to its expected value. Our results give insight into how these types of error trade off as $\ell$ is varied (here we do not assume realizability).
For the system, the important factors are $n$ and $k$, since these determine its flexibility in choosing representations. Since more feasible representations mean more flexibility, it would seem plausible that parameter values permitting a larger range should help the system more. However, our results indicate differently: for the system, a smaller $k$ is better, and the choice of $n$ has limited effect on strategic users. The result for $k$ goes through a connection to the user’s choice of $\ell$; surprisingly, a smaller $\ell$ turns out to be, in some sense, better for all.
5.1 User’s Perspective
We begin by studying the effects of $\ell$ on user payoff. Recall that users aim to minimize the expected error (Eq. (6)), but instead minimize the empirical error (Eq. (7)). For reasoning about the expected error of the learned choice function $\hat{f}$, a common approach is to decompose it into two error types—approximation and estimation:
$$\mathrm{err}(\hat{f}) \;=\; \underbrace{\min_{f \in \mathcal{F}_\ell} \mathrm{err}(f)}_{\text{approximation}} \;+\; \underbrace{\Big(\mathrm{err}(\hat{f}) - \min_{f \in \mathcal{F}_\ell} \mathrm{err}(f)\Big)}_{\text{estimation}}$$
Approximation error describes the lowest error obtainable by functions in $\mathcal{F}_\ell$; this measures the ‘expressivity’ of the class, and is independent of the sample. For approximation error, we define a matching complexity structure for value functions, and give several results relating the choice of $\ell$ to the complexity of the value function. Estimation error describes how far the learned $\hat{f}$ is from the optimal function in $\mathcal{F}_\ell$, and depends on the data size, $m$. Here we give a generalization bound based on VC analysis.
5.1.1 User approximation error
To analyze the approximation error, we must be able to relate choice functions (that operate on representations ) to the target value function (which operates on items ). To connect the two, we will again use induced functions, for which we now define a matching complexity structure.
Definition 5.1.
A function $g$ over items has an induced complexity of $\ell$ if there exists a function $f \in \mathcal{F}_\ell$ such that $g = \tilde{f}$, and $\ell$ is minimal (i.e., there is no such function in $\mathcal{F}_{\ell-1}$).
We show in Lem. 5.2 and Cor. 5.3 that the induced complexity of a function indeed captures the minimal $\ell$ such that it is an induced function of some $f \in \mathcal{F}_\ell$.
Lemma 5.2.
Let . Then for every , the induced complexity of the corresponding is .
Corollary 5.3.
Let be the induced function class of , as defined in Def. 4.5. Then:
Proof of Cor. 5.3 is in Appx. F. We now turn to considering the effect of $\ell$ on approximation error. Since the ‘worthwhileness’ function operates on items, we can consider its induced complexity, which we denote by $\ell^*$. The following result shows that if $\ell \ge \ell^*$, then $\mathcal{F}_\ell$ is expressive enough to perfectly recover worthwhileness.
Theorem 5.4.
If $\ell \ge \ell^*$ then the approximation error is zero.
One conclusion from Thm. 5.4 is that if the user knows $\ell^*$, then zero error is, in principle, obtainable; another is that there is no reason to choose $\ell > \ell^*$. In practice, knowing $\ell^*$ can aid the user in tuning $\ell$ according to computational (Sec. 4.3) and statistical considerations (Sec. 5.1.2). Further conclusions relate $\ell^*$ and $k$:
Corollary 5.5.
If $\ell^* \le k$ and the distribution has full support, then $\ell^*$ is the smallest $\ell$ that gives zero approximation error.
Corollary 5.6.
If , then the approximation error weakly increases with , i.e., for all . Furthermore, if the distribution has full support on then no can achieve zero approximation error.

Proofs in Appx. F. In general, Cor. 5.6 guarantees only weak improvement with $\ell$. Next, we show that increasing $\ell$ can exhibit clear diminishing-returns behavior, with most of the gain obtained at very low $\ell$.
Lemma 5.7.
Let the distribution be uniform over items. Then there is a value function for which the approximation error diminishes convexly with $\ell$.
The proof is constructive (see Appx. F): we construct a value function whose approximation error is upper bounded by a quantity that decays convexly with $\ell$. The diminishing-returns property of this upper bound is illustrated in Fig. 1. Although Lem. 5.7 describes a special case, we conjecture that this phenomenon applies more broadly.
Our next result shows that learning -order functions can be as powerful as learning subadditive functions; hence, learning with is highly expressive. Interestingly, the connection between general (-order) functions and subadditive functions is due to the strategic response mapping, .
Lemma 5.8.
Consider threshold-subadditive functions:
Then for every threshold-subadditive , there is an for which .
5.1.2 User estimation error
For estimation error, we give generalization bounds based on VC analysis. The challenge in analyzing functions in $\mathcal{F}_\ell$ is that generalization must hold for the strategic 0/1 loss, i.e., the loss evaluated on best-responded representations, and so standard bounds (which apply to the standard 0/1 loss) do not hold directly. To get around this, our approach relies on directly analyzing the VC dimension of the induced class $\tilde{\mathcal{F}}_\ell$ (a similar approach was taken in Sundaram et al. (2021) for strategic classification). This allows us to employ tools from VC theory, which give the following bound.
Theorem 5.9.
For any and , given a sample set of size sampled from and labeled by some , we have
w.p. at least over , and for a fixed constant . In particular, Alg in Sec. 4.3, assuming is realizable, returns an for which:
w.p. at least over , and for a fixed constant .
The proof relies on Thm. 4.6; since $f$ and $\tilde{f}$ share weights, the induced class can be analyzed as a class of $d$-variate degree-$\ell$ multilinear polynomials. Since induced functions already incorporate the best response, VC analysis for the standard 0/1 loss can be applied. Note that such polynomials have exactly $\sum_{i \le \ell} \binom{d}{i}$ degrees of freedom; hence the corresponding term in the bound.
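For orientation, a standard realizable-case VC bound has the following shape; the constant $C$, the confidence parameter $\delta$, and the use of the parameter count as an upper bound on the VC dimension of the induced class are generic textbook ingredients rather than the exact constants of Thm. 5.9:
$$\mathrm{err}_D(\hat{f}) \;\le\; C \cdot \frac{d_{\mathrm{VC}}(\tilde{\mathcal{F}}_\ell)\,\log m + \log(1/\delta)}{m}, \qquad d_{\mathrm{VC}}(\tilde{\mathcal{F}}_\ell) \;\le\; \sum_{i \le \ell} \binom{d}{i},$$
with probability at least $1 - \delta$ over the draw of the sample.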
5.2 System’s Perspective
The system’s expressive power derives from its flexibility in choosing representations for items. Since $n$ and $k$ determine which representations are feasible, they directly control the system’s power to manipulate; and while the system itself may not have direct control over them (i.e., if they are set by exogenous factors like screen size), their values certainly affect the system’s ability to optimize engagement. Our next result is therefore unintuitive: for the system, a smaller $k$ is better (in the worst case), even though it reduces the set of feasible representations. This result is obtained indirectly, by considering the effect of $k$ on the user’s choice of $\ell$.
Lemma 5.10.
There exists a distribution and a value function such that for all , system has higher payoff against the optimal than against .
The proof is in Appx. F; it uses the uniform distribution, and the value function from Lem. 5.7. Recalling that the induced complexity is controlled by the choice of complexity class (Cor. 5.3), and that users should choose $\ell$ to be no greater than $k$, we can conclude the following (in a worst-case sense):
Corollary 5.11.
For the system, a lower $k$ is better.
Proof in Appx. F. For , it turns out that against strategic users it is inconsequential. This is since payoff to the strategic user is derived entirely from , which is upper-bounded by , but can be set lower than . This invariance is derived immediately from how functions in are defined, namely that for all with (Def. 4.3). This, however, holds when the strategic user chooses to learn over for some . Consider, alternatively, a strategic user that decides to learn subadditive functions instead. In this case, Lem. 5.8 shows that determines the user’s ‘effective’ ; the smaller , the smaller the subset of subadditive functions that can be learned. Hence, for the user, smaller means worse approximation error. This becomes even more pronounced when facing a naïve user; for her, a lower means that the system now has a large set of representations to choose from; if even one of them has , the system can exploit this to increase its gains. In this sense, as decreases, payoff to the system (weakly) improves.
6 Discussion
Our analysis of the balance of power reveals a surprising conclusion: for both parties, in some sense, simple choice functions are better. For the system, a lower complexity improves its payoff through how it relates to the representation bound (Corollary 5.11). For users, a lower complexity is clearly better in terms of runtime (Lemma 4.8) and estimation error (Theorem 5.9), and for approximation error, a lower complexity has certain benefits—as it relates to the induced complexity of their value function (Corollary 5.5), and via diminishing returns (Lemma 5.7). Thus, despite their conflicting interests, to some degree the incentives of the system and its users align.
But the story is more complex. For users, there is no definitive notion of ‘better’; strategic users always face a trade-off, and must choose $\ell$ to balance approximation, estimation, and runtime. In principle, users are free to choose $\ell$ at will; but since there is no use for $\ell > k$, a system controlling $k$ de facto controls $\ell$ as well. This places a concrete restriction on the freedom of users to choose, and inequitably: for small $k$, users whose value function has low induced complexity (i.e., having ‘simple tastes’) are less susceptible to manipulation than users whose value function has high induced complexity (e.g., fringe users with eclectic tastes) (Theorem 5.4, Corollaries 5.5 and 5.6). In this sense, the choice of $k$ also has implications for fairness. We leave the further study of these aspects of strategic representation for future work.
From a purely utilitarian point of view, it is tempting to conclude that systems should always set to be low. But this misses the broader picture: although systems profit from engagement, users engage only if they believe it is worthwhile to them, and dissatisfied users may choose to leave the system entirely (possibly into the hands of another). Thus, the system should not blindly act to maximize engagement; in reality, it, too, faces a tradeoff.
References
- Abboud et al. [1999] Elias Abboud, Nader Agha, Nader H Bshouty, Nizar Radwan, and Fathi Saleh. Learning threshold functions with small weights using membership queries. In Proceedings of the twelfth annual conference on Computational learning theory, pages 318–322, 1999.
- Abraham et al. [2012] Ittai Abraham, Moshe Babaioff, Shaddin Dughmi, and Tim Roughgarden. Combinatorial auctions with restricted complements. In Proceedings of the 13th ACM Conference on Electronic Commerce, pages 3–16, 2012.
- Angluin [1988] Dana Angluin. Queries and concept learning. Machine learning, 1988.
- Balcan and Harvey [2011] Maria-Florina Balcan and Nicholas JA Harvey. Learning submodular functions. In Proceedings of the forty-third annual ACM symposium on Theory of computing, pages 793–802, 2011.
- Bechavod et al. [2022] Yahav Bechavod, Chara Podimata, Zhiwei Steven Wu, and Juba Ziani. Information discrepancy in strategic learning. In Proceedings of the 39th International Conference on Machine Learning (ICML), 2022.
- Chen et al. [2020a] Yatong Chen, Jialu Wang, and Yang Liu. Linear classifiers that encourage constructive adaptation. arXiv preprint arXiv:2011.00355, 2020a.
- Chen et al. [2020b] Yatong Chen, Jialu Wang, and Yang Liu. Strategic recourse in linear classification. arXiv preprint arXiv:2011.00355, 2020b.
- Chevaleyre et al. [2008] Yann Chevaleyre, Ulle Endriss, Sylvia Estivie, and Nicolas Maudet. Multiagent resource allocation in k -additive domains: preference representation and complexity. Ann. Oper. Res., 163(1):49–62, 2008.
- Conitzer et al. [2005] Vincent Conitzer, Tuomas Sandholm, and Paolo Santi. Combinatorial auctions with k-wise dependent valuations. In AAAI, 2005.
- Dughmi et al. [2015] Shaddin Dughmi, Nicole Immorlica, Ryan O’Donnell, and Li-Yang Tan. Algorithmic signaling of features in auction design. In Algorithmic Game Theory - 8th International Symposium, SAGT, pages 150–162. Springer, 2015.
- Eilat et al. [2022] Itay Eilat, Ben Finkelshtein, Chaim Baskin, and Nir Rosenfeld. Strategic classification with graph neural networks. arXiv preprint arXiv:2205.15765, 2022.
- Feige et al. [2015] Uriel Feige, Michal Feldman, Nicole Immorlica, Rani Izsak, Brendan Lucier, and Vasilis Syrgkanis. A unifying hierarchy of valuations with complements and substitutes. In Proceedings of the AAAI Conference on Artificial Intelligence, 2015.
- Feldman [2009] Vitaly Feldman. On the power of membership queries in agnostic learning. The Journal of Machine Learning Research, 10:163–182, 2009.
- Ghalme et al. [2021] Ganesh Ghalme, Vineet Nair, Itay Eilat, Inbal Talgam-Cohen, and Nir Rosenfeld. Strategic classification in the dark. In Proceedings of the 38th International Conference on Machine Learning (ICML), 2021.
- Haghtalab et al. [2021] Nika Haghtalab, Nicole Immorlica, Brendan Lucier, Markus Mobius, and Divyarthi Mohan. Persuading with anecdotes. Technical report, National Bureau of Economic Research, 2021.
- Hardt et al. [2016] Moritz Hardt, Nimrod Megiddo, Christos H. Papadimitriou, and Mary Wootters. Strategic classification. In Proceedings of the 2016 ACM Conference on Innovations in Theoretical Computer Science, 2016.
- Hu et al. [2019] Lily Hu, Nicole Immorlica, and Jennifer Wortman Vaughan. The disparate effects of strategic manipulation. In Proceedings of the Conference on Fairness, Accountability, and Transparency (FAT*), pages 259–268, 2019.
- Jagadeesan et al. [2021] Meena Jagadeesan, Celestine Mendler-Dünner, and Moritz Hardt. Alternative microfoundations for strategic classification. In International Conference on Machine Learning, 2021.
- Kamenica and Gentzkow [2011] Emir Kamenica and Matthew Gentzkow. Bayesian persuasion. American Economic Review, 101(6), 2011.
- Krishnaswamy et al. [2021] Anilesh K Krishnaswamy, Haoming Li, David Rein, Hanrui Zhang, and Vincent Conitzer. Classification with strategically withheld data. In Proceedings of the AAAI Conference on Artificial Intelligence, 2021.
- Lancaster [1966] Kelvin J Lancaster. A new approach to consumer theory. Journal of political economy, 74(2):132–157, 1966.
- Levanon and Rosenfeld [2021] Sagi Levanon and Nir Rosenfeld. Strategic classification made practical. In Proceedings of the 38th International Conference on Machine Learning, ICML, 2021.
- Levanon and Rosenfeld [2022] Sagi Levanon and Nir Rosenfeld. Generalized strategic classification and the case of aligned incentives. In Proceedings of the 39th International Conference on Machine Learning (ICML), 2022.
- Miller et al. [2020] John Miller, Smitha Milli, and Moritz Hardt. Strategic classification is causal modeling in disguise. In International Conference on Machine Learning, pages 6917–6926. PMLR, 2020.
- Milli et al. [2019] Smitha Milli, John Miller, Anca D. Dragan, and Moritz Hardt. The social cost of strategic classification. In Proceedings of the Conference on Fairness, Accountability, and Transparency (FAT*), pages 230–239, 2019.
- Rosenfeld et al. [2020] Nir Rosenfeld, Kojin Oshiba, and Yaron Singer. Predicting choice with set-dependent aggregation. In International Conference on Machine Learning, 2020.
- Shalev-Shwartz and Ben-David [2014] Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014.
- Sundaram et al. [2021] Ravi Sundaram, Anil Vullikanti, Haifeng Xu, and Fan Yao. Pac-learning for strategic classification. In International Conference on Machine Learning, pages 9978–9988, 2021.
- Zhang and Conitzer [2021] Hanrui Zhang and Vincent Conitzer. Incentive-aware pac learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 5797–5804, 2021.
- Zrnic et al. [2021] Tijana Zrnic, Eric Mazumdar, Shankar Sastry, and Michael Jordan. Who leads and who follows in strategic classification? Advances in Neural Information Processing Systems, 34, 2021.
Appendix A Additional Results
A.1 Agnostic User
Theorem A.1 stated below is the formal version of Theorem 3.2 in Section 3. Theorem A.1 shows that given a large enough sample size $m$, an agnostic user’s payoff approaches $\max\{q, 1-q\}$, where $q$ is the true rate of positive examples.
Theorem A.1.
Let and , then agnostic user’s expected payoff guarantee is given by
Before we prove the theorem, we state Hoeffding’s inequality, which is a well known result from probability theory.
Lemma A.2 (Hoeffding).
Let $X = \sum_{i=1}^{m} X_i$ be the sum of $m$ i.i.d. random variables with $X_i \in [0,1]$ and $\mathbb{E}[X_i] = q$ for all $i$; then for every $\varepsilon > 0$,
$$\Pr\big[\,|X/m - q| \ge \varepsilon\,\big] \;\le\; 2e^{-2m\varepsilon^2}.$$
We will use the following equivalent one-sided forms. Let $\hat{q} = X/m$ denote the empirical mean. Then, with probability at least $1 - e^{-2m\varepsilon^2}$, we have
$$\hat{q} \;\le\; q + \varepsilon, \qquad (8)$$
and likewise, with probability at least $1 - e^{-2m\varepsilon^2}$, we have
$$\hat{q} \;\ge\; q - \varepsilon. \qquad (9)$$
Now we are ready to give the proof of Theorem A.1
Proof of Theorem A.1.
We begin with the following supporting lemma.
Lemma A.3.
Let , then
Proof.
The proof follows from following sequence of inequalities,
Let . We have which is an increasing function of , so the maximum is achieved at and is given by . This completes the proof of the lemma. ∎
From Lemma A.3 we have that , hence there is a non-trivial range i.e. where user assigns for all with probability 1. Similarly, when user assigns for all with probability 1. We will consider three cases separately.
Case 1 (): From Hoeffding’s inequality (Eq. 9) we have that with probability at-least ,
Hence, with probability at-least an agnostic user will get a payoff of . Hence, the expected payoff in this case is at-least .
Case 2 (): Similar to Case 1 here we use the tail bound given by Hoeffding’s inequality (Eq. 8) to get with probability at least ,
Hence, . The agnostic user guarantees a payoff of with probability at least () in this case. Hence we have the payoff of in this case.
Case 3 (): Finally, in this case, the agnostic user chooses for all with probability and for all with probability . Hence, the expected payoff is given by irrespective of the true mean of positive samples. ∎
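A quick Monte-Carlo sanity check of the guarantee is sketched below; the tolerance-around-one-half rule and the function name are assumptions consistent with Section 3 rather than part of the formal proof.

```python
import random

def simulate_agnostic_payoff(q, m, tol=0.1, trials=2000, seed=0):
    """Draw m labels from Bernoulli(q), apply the agnostic rule, and report the
    average expected accuracy of the committed constant (or coin-flip) choice."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        p_hat = sum(rng.random() < q for _ in range(m)) / m
        if p_hat >= 0.5 + tol:
            accept_prob = 1.0        # always accept
        elif p_hat <= 0.5 - tol:
            accept_prob = 0.0        # always reject
        else:
            accept_prob = 0.5        # coin flip
        total += accept_prob * q + (1 - accept_prob) * (1 - q)
    return total / trials

# e.g., simulate_agnostic_payoff(q=0.8, m=200) is close to max(q, 1-q) = 0.8,
# while for very small m it can fall noticeably below it.
```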
A.2 Lifted Functions
The relation between choice functions and their induced counterparts passes through an additional type of functions that operate on sets of size exactly . Denote , and note that all feasible representations can be partitioned as . We refer to functions that operate on single-sized sets as restricted functions. Our next result shows that choice functions in can be represented by restricted functions over that are ‘lifted’ to operate on the entire space. This will allow us to work only with sets of size exactly .
Lemma A.4.
For each there exists a corresponding such that , where:
Proof.
Let . Then there is a weight function on sets of size at most such that either or , and
Define such that for a , if and otherwise. It is easy to see from the choice of that . ∎
A.3 A Lower Bound on the Running Time
As stated in Lemma 4.8, the running time of our algorithm is . We argued in Section 4 that is made possible only since weights are sparse, and since the algorithm operates on a finite sample set of size . If is large, then this expression can be replaced with . We now show that, in the limit (or under full information), the dependence on is necessary. The conclusion from Lemma A.5 is that to find the loss minimizer, any algorithm must traverse at least all such ; since there exist such functions, this is a lower bound. This is unsurprising; is tightly related to the class of multilinear polynomials, whose degrees of freedom are exactly .
Lemma A.5.
Consider a subclass of composed of choice functions which have for exactly one with , and otherwise. Then, for every such , there exists a corresponding , such that is a unique minimizer (within this subclass) of the error w.r.t. .
Proof.
Let and be distinct size subsets, and let and . Further, let , be a weight function that assigns to and to all other subsets of size at most . Let and be two functions in defined by the binary weighted functions and respectively. Observe that for the approximation error (see Eq. (6)) of is zero. Hence, to prove the lemma it is sufficient to show that .
Suppose . Since , there exists an such that but . From Theorem 4.6 and the choice of and , this implies but , and hence, a contradiction. ∎
Appendix B Additional Related Work
Learning set functions.
Concept learning refers to learning a binary function over hypercubes [Angluin, 1988] through a query access model. Abboud et al. [1999] provide a lower bound on membership queries to exactly learn a threshold function over sets where each element has small integer-valued weights. Our learning framework admits large weights and has only sample access, in contrast with the query access studied in this literature. Feldman [2009] shows that the problem of learning set functions with sample access is computationally hard. However, we show (see Section 5) that the strategic setting is more nuanced; more complex representations are disadvantageous for both the user and the system. In other words, it is in the best interest of the system to choose smaller (and much simpler) representations. A by-now classic work in learning theory studies the learnability from data of submodular (non-threshold) set functions [Balcan and Harvey, 2011]. Though we consider learning subadditive functions in this work, an extension to submodular valuations is natural. Learning set functions is in general hard, even for certain subclasses such as submodular functions. Rosenfeld et al. [2020] show that it is possible to learn certain parameterized subclasses of submodular functions, when the goal is to use them for optimization. But this refers to learning over approximate proxy losses; whereas in our work, we show that learning is possible directly over the 0/1 loss.
Hierarchies of set functions.
Conitzer et al. [2005] (and independently, Chevaleyre et al. [2008]) suggest a notion of -wise dependent valuations, to which our Definition 4.1 is related. We also allow up to -wise dependencies, but our valuations need not be positive and we focus on their sign (an indication whether an item is acceptable or not). Our set function valuations are also over item attributes rather than multiple items. Despite the differences, the definitions have a shared motivation: Conitzer et al. [2005] believe that this type of valuation is likely to arise in many economic scenarios, especially since due to cognitive limitations, it might be difficult for a player to understand the inter-relationships between a large group of items. Hierarchies of valuations with limited dependencies/synergies have been further studied by Abraham et al. [2012], Feige et al. [2015] under the title ‘hypergraph valuations’. These works focus on monotone valuations that have only positive weights for every subset, and are thus mathematically different than ours.
Appendix C A Missing Proof from Section 2
See 2.1
Proof.
The proof follows from the definition of best response (Eq. 2). Let . Then since consists of only best response, we have either , or . Hence, if and only if for any . ∎
Appendix D A Missing Proof from Section 3 and an Additional Example
See 3.1
Proof.
Since a naïve user plays , for each the payoff of the user is maximized if in response the system plays a such that . Observe that, if there exists a and , such that then and consequently the user’s payoff is maximized for such an . Conversely, if there exists no and such that , then no truthful system can ensure more than zero utility for such an . Hence, a benevolent system maximizes the utility of a naïve user. ∎
We now present an additional example to show how a naïve user’s choice function can be manipulated by the strategic system and, as a consequence, the user may obtain arbitrarily small payoff against a strategic system.
Example 2.
Let with and . Further, let with as representations and a distribution supported over .
A unique truthful representation for this instance is . A strategic agent can manipulate a naïve agent into non-preferred choices by using a representation for . Note here that a naïve agent expected as a representation for since and . However, a strategic agent chooses as under given we have . A naïve user’s payoff in this case is reduced to , which can be arbitrarily small.
Appendix E Missing Proofs from Section 4
See 4.2
Proof.
Define as follows: if for all then , and otherwise
Define as follows: For , ; for
First, we argue that defined as above satisfies for all . Suppose . Then there exists such that and . From the choice of , we may assume without loss of generality that . Further, from the construction of , we have , and hence . Now suppose . Then for all we have . In particular, for all such that we have . This implies for all such that we have . This is because if there exists such that and then from the definition of there exists a such that , and (a contradiction). Additionally, from definition, for all such that we have . Hence, if then .
Now, we show that is a -order function. Let and for all such that and . Further, for all such that , if then let . For all , if then by construction of , we have , and since for all , we have . Hence, for all , if
Similarly, for all , if then by construction of , we have if and only if there exists a , and such that . In particular, if and then there exists a , and such that . Since , , and , we have if and then
Finally, if and then from the definition of there does not exists a , and such that . Since , we have if and then
∎
See 4.4
Proof.
Arbitrarily choose (recall is the ground set) such that , and let .666Here we wish to distinguish between for and and hence we use instead of . Also for all and , let . Let be defined as follows
From the definition of , we have . We show that . First, observe that for all if and only if . Suppose . Then there is a weight function on sets of size at most such that either or , and
Let be such that . This implies . Hence there exists a such that and . Let be such that but . Such a exists because . Further, as , we have . But since , we have from the choice of and
This gives a contradiction. Hence, . ∎
See 4.6
Proof.
Since , satisfies the following two properties (see Definition 4.3):
1. either or ,
2. for all having .
Further, from the definition of , we have . This implies
From the above two properties of the weight function, we have
From the above two equations we conclude that
Finally, the two properties of ensure that
∎
See 4.7
Proof.
Throughout, for ease of notation, we use to denote . Let . Recall is equal to at the beginning of the algorithm. Also, for each , let . The following lemma characterizes the training set for which there exists an with zero empirical error.
Lemma E.1.
There exists an with zero empirical error if and only if for all there exists a and such that for all .
Proof.
Suppose there exists an with zero empirical error. Since , from Lemma A.4 we have there exists a such that . We state the following observation, and its proof follows from the definition of .
Observation E.2.
1. For every such that there is a such that and ,
2. For every such that , it must be that for every , we have .
Further, as the empirical error for is zero, we have the following observation.
Observation E.3.
1. For every there is a and such that ,
2. For every , it must be that for every , we have .
Proof.
For every , since the empirical error is zero, we have . From the definition of , this implies there is a and such that . Similarly, for every , since the empirical error is zero, we have . Again from the definition of , it must be that for every , we have . ∎
Hence, from Observations E.2 and E.3, we have for every there is a , such that . Similarly, for every it must be that for every , we have . Hence, for every there exists a and such that for all .
Conversely, suppose for every there exists a and such that for all . Then define as follows: a) for all such that for an , let , b) for all , such that for an and for an , let , c) for all , such that for any , let . From the supposition, we have that for every , there is a and such that . Now define . To show that the empirical error of is zero, it is sufficient to show that for every , and for every . Let . From the definition of , for every such that , . Hence, from the definition of , we have for every such that , . Now from the definition of best response, we have . Similarly, if from our supposition and the definition of , we have there exists a such that and . Hence, from the definition of , there exists a such that and . Finally, from the definition of best response , we have . ∎
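Before continuing the proof, the following sketch tests the realizability condition of Lemma E.1 under one reading of the (partially garbled) statement: a zero-empirical-error hypothesis exists iff every positively labeled item contains a small 'witness' subset contained in no negatively labeled item. The size bound min(d, k) and all names below are assumptions, not the paper's notation.

```python
from itertools import combinations

def realizable(positives, negatives, d, k):
    """One reading of the condition in Lemma E.1 (the exact size bound is garbled
    in this rendering; min(d, k) is an assumption).

    positives, negatives : iterables of sets of attributes
    d                    : order of the hypothesis class (weights on sets of size <= d)
    k                    : representation budget
    """
    negatives = [frozenset(n) for n in negatives]
    max_size = min(d, k)
    for pos in positives:
        witnesses = (frozenset(t)
                     for r in range(1, max_size + 1)
                     for t in combinations(sorted(pos), r))
        if not any(all(not w <= neg for neg in negatives) for w in witnesses):
            return False   # this positive item has no witness absent from all negatives
    return True
```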
Now, if then there exists an such that the induced function of is equal to , that is, . This implies there exists an which attains zero empirical error on the training set. Since empirical error is always non-negative, such an minimizes the empirical error in this case. Hence, from Lemma E.1, it follows that if is realizable then for all at Step 17 of Alg there is either a and , or and . Now, observe that at the beginning of Step 22, set satisfies the following:
(10)
Further, at Step 22, for a , if . This implies
(11)
Also, from Theorem 4.6, the induced function corresponding to the returned is given as
(12)
To complete the proof of the theorem, we show that for every . Suppose . Then from Equations 10 and 11, for every and we have , and hence from Equation 12 for we have . Similarly, suppose . Then from Equation 11, there exists , such that . Hence, from Equation 12, and noting that and we have . ∎
See 4.8
Proof.
In the first two for loops, for each (or in ) the internal for loop runs for time. Since , this is a total of at most operations. Similarly, Step 22 places weights on at most subsets, and hence runs in time. Hence, Alg runs in time. ∎
Appendix F Missing Proofs from Section 5
See 5.2
Proof.
Let . From Lemma A.4, we know . Further, assume is such that . Now, from the definition of , we have for all , if and only if there exists a such that and . Since , if and only if there exists a such that and . This implies the induced complexity of is . ∎
See 5.3
Proof.
From Lemma 5.2, we know that functions in have induced complexity at most . We show that if has induced complexity at most then there is an such that . Let the induced complexity of be equal to . Then there exists a such that
(13)
Let . First we show that for all . Since is a lift of , if for a then for all such that , we have .
(14)
Hence, from Equations 13 and 14 for all , if and only if there exists such that and . From the definition of induced function, this implies for all .
To show , we construct a weight function on sets of size at most . For and , let . For , let
Now from Equation 14, if and only if there exists such that and . Hence, from the definition of , if and only if there exists such that and . In particular, since and , we have
∎
See 5.4
Proof.
Since the induced complexity of is , there is a function s.t.:
Let and , and define the weight function on sets of size at most as follows: a) if then let , b) if and there exists a such that then , and c) if and there does not exist a such that then . Now define using as follows:
We now show that for each , implying . Suppose for an . Then there exists a such that and . From Theorem 4.6, and the choice of and we have that there exists a , such that . From the construction of this implies there exists a , such that . But from the above definition of this implies . Similarly, we can argue that if then for any . Hence, for each , implying and zero approximation error for . ∎
See 5.5
Proof.
In the proof of Theorem 5.4, we show that for , we have zero approximation error. Hence, to prove the corollary, it is sufficient to show that for a the approximation error is not zero. Suppose there is an such that and . Since the distribution has full support, this implies for all . Hence, the induced complexity of is at most , giving a contradiction. ∎
See 5.6
Proof.
The approximation error weakly decreases because for all . Also, from the proof of Corollary 5.5, it is clear that no can achieve zero approximation error. ∎
See 5.7
Proof.
We construct a such that the approximation error for is as given below
It is easy to see that diminishes convexly with (see Fig. 1). We choose elements (the ground set), and let be the size subset consisting of these elements. For a , let and . We first show that there exists a with the following three properties:
1. if then there exists a such that .
2. For , let . Then , for every .
3. For every , let . Then , where .
We construct such a iteratively. We begin by making the following observation.
Observation F.1.
For each , .
Proof.
Recall consists of size subsets of . For a we wish to choose size subsets of that contain a , . This is equivalent to choosing a fixed size subset of and then choosing the remaining elements from the elements (not part of ) in . For every we can choose size subset of in ways, and for each such choice we can choose the remaining elements in ways. Since this holds for any , we have . ∎
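The counting step above is an instance of a standard identity; the symbols $n$, $k$, $j$, $E$, and $B$ below are ours and stand in for the quantities garbled in this rendering:
\[
\#\{\,A \subseteq E \;:\; |A| = k,\; B \subseteq A\,\} \;=\; \binom{n-j}{k-j},
\qquad |E| = n,\; |B| = j \le k,
\]
since the $j$ elements of $B$ are forced to lie in $A$ and the remaining $k-j$ elements are chosen freely from the $n-j$ elements of $E \setminus B$.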
Constructing v: The idea is to iteratively add elements in to , that is, iteratively determine the such that . In the first round, we arbitrarily choose from and add it to , and the remaining are added to . At round , assume we have constructed a satisfying the above three properties for , that is,
1. if then there exists a such that .
2. , for every .
3. For every , let . Then , where .
Hence, at round , we have elements in which are not yet in or . From these elements in , for every size subset we arbitrarily choose elements containing and add the remaining elements to . Now, observe that satisfies the first two properties for every after this procedure. We argue satisfies the third property for any , such that . The size sets in containing a , such that , can be partitioned into sets containing different size subsets of . In particular, we have the following combinatorial equality
In the above expression, corresponds to the number of size sets that contain only a specific size subset of . Since our iterative procedure ensures that from each such partition at least fraction of is added to , we have that satisfies the third property.
Optimal : From the construction of , it is clear that the optimal for the above constructed , for any satisfies the following: for every , if and only if there exists a , , and . Further as is the uniform distribution, for such an :
∎
See 5.8
Proof.
Let with a corresponding such that for all . Choose an , and . Define a weight function on sets of size at most as follows:
Let be the function defined by the binary weight as defined above. We argue that for every , if . For every , if and only if there is a and such that . Since is sub-additive, we have
(15)
A simple recursive argument implies that if and only if there is a such that and , and hence . Hence, from Theorem 4.6, this implies that if and only if . ∎
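For reference, the notion of sub-additivity used in the proof above is, in its standard form (the paper's exact variant may differ in this rendering):
\[
v(S \cup T) \;\le\; v(S) + v(T) \quad \text{for all } S, T \subseteq E .
\]
One consequence of applying it repeatedly is $v(S) \le \sum_{e \in S} v(\{e\})$, which is the flavor of recursive argument alluded to above.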
See 5.9
Proof.
We first argue that the VC dimension of is at most . Let , index the vectors in by (the ground set), such that . Then each can be represented by a binary vector , with the entry indexed by a being if and only if . Further, let be a binary weighted vector with and as in Def. 4.3. Then from the definition of , for each , there is a such that a) for all , and b) the entry of indexed by a with is . From this we observe that the VC dimension of is at most , since each is decided by the realization of binary weights on entries indexed by the sets. Now the first part of the theorem follows by noting that the first bound is the agnostic PAC generalization guarantee for an algorithm minimizing the empirical error in the standard classification setting with VC dimension at most . To prove the second part, we have , and hence the approximation error is zero, that is, (from Lemma E.1). Further, Alg minimizes the empirical error (Theorem 4.7), and returns an with zero empirical error. ∎
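For completeness, the standard agnostic PAC guarantee invoked in the first part has the following form (constants vary by source); here $m$ denotes the number of samples, $\delta$ the failure probability, and $d_{\mathrm{VC}}$ the VC dimension bound established above:
\[
\operatorname{err}_{\mathcal D}(\hat h) \;\le\; \min_{h \in \mathcal H} \operatorname{err}_{\mathcal D}(h) \;+\; O\!\left(\sqrt{\frac{d_{\mathrm{VC}} + \log(1/\delta)}{m}}\right)
\quad \text{with probability at least } 1 - \delta,
\]
for any empirical risk minimizer $\hat h$ over $\mathcal H$.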
See 5.10
Proof.
See 5.11
Proof.
In Lemma 5.10, we showed that there exists a user with such that for all and , the system has better utility against the optimal choice function in than in . Since the choices the user can make are bounded by , a lower maximizes the worst-case payoff to the system. ∎