
Strategic Representation

Vineet Nair (Technion – Israel Institute of Technology, Haifa, Israel; now at Google Research, India). Ganesh Ghalme (Indian Institute of Technology Hyderabad, India; this work was done while Ganesh was a post-doctoral fellow at Technion). Inbal Talgam-Cohen (Technion – Israel Institute of Technology, Haifa, Israel). Nir Rosenfeld (Technion – Israel Institute of Technology, Haifa, Israel).
Abstract

Humans have come to rely on machines for reducing excessive information to manageable representations. But this reliance can be abused—strategic machines might craft representations that manipulate their users. How can a user make good choices based on strategic representations? We formalize this as a learning problem, and pursue algorithms for decision-making that are robust to manipulation. In our main setting of interest, the system represents attributes of an item to the user, who then decides whether or not to consume. We model this interaction through the lens of strategic classification (Hardt et al. 2016), reversed: the user, who learns, plays first; and the system, which responds, plays second. The system must respond with representations that reveal ‘nothing but the truth’ but need not reveal the entire truth. Thus, the user faces the problem of learning set functions under strategic subset selection, which presents distinct algorithmic and statistical challenges. Our main result is a learning algorithm that minimizes error despite strategic representations, and our theoretical analysis sheds light on the trade-off between learning effort and susceptibility to manipulation.

1 Introduction

There is a growing recognition that learning systems deployed in social settings are likely to induce strategic interactions between the system and its users. One promising line of research in this domain is strategic classification (Hardt et al., 2016), which studies learning in settings where users can strategically modify their features in response to a learned classification rule. The primary narrative in strategic classification is one of self-interested users that act to ‘game’ the system, and of systems that should defend against this form of gaming behavior. In this paper, we initiate the study of reversed scenarios, in which it is the system that strategically games its users. We argue that this is quite routine: in settings where users make choices about items (e.g., click, buy, watch), decisions are often made on the basis of only partial information. But the choice of what information to present to users often lies in the hands of the system, which can utilize its representation power to promote its own goals. Here we formalize the problem of strategic representation, and study how learning can now aid users in choosing well despite strategic system behavior.

As a concrete example, consider a user browsing for a hotel in an online booking platform. Hotels on the platform are represented by a small set of images; as there are many images available for each hotel, the system must choose which subset of these to display. Clearly, the choice of representation can have a significant effect on users’ decision-making (whether to click the hotel, and subsequently whether to book it), and in turn, on the system’s profit. The system may well attempt to capitalize on the control it has over what information is presented to the user—at the expense of the user, who may be swayed to make sub-optimal decisions. Given that users rely on system-crafted representations for making decisions, choosing well requires users to account for the strategic nature of the representations they face.

Our goal in this paper is to study when and how learning can aid users in making decisions that are strategically robust. We model users as deciding based on a choice strategy $h$ mapping represented items to binary choices, which can be learned from the user’s previous experiences (e.g., hotels stayed in, and whether they were worthwhile). Since in reality full descriptions of items are often too large for humans to process effectively (e.g., hundreds or thousands of attributes), the role of the system is to provide users with compact representations (e.g., a small subset of attributes). We therefore model the system as responding to $h$ with a representation mapping $\phi$, which determines for each item $x$ its representation $z=\phi(x)$, on which users rely for making choices (i.e., $h$ is a function of $z$). We focus on items $x$ that are discrete and composed of a subset of ground elements; accordingly, we consider representation mappings $\phi$ that are lossy but truthful, meaning they reveal a cardinality-constrained subset of the item’s full set of attributes: $z\subseteq x$ and $k_{1}\leq|z|\leq k_{2}$ for some exogenously-set $k_{1},k_{2}$.

Given that the system has representation power—how should users choose $h$? The key idea underlying our formulation is that the system and its users have misaligned goals. A user wishes to choose items that are ‘worthwhile’ to her, as determined by her valuation function $v(x)$—in our case, a set function, which can reflect complex tastes (e.g., account for complementarity among the attributes, such as ‘balcony’ and ‘ocean view’ in the hotel example). Meanwhile, the system aims to maximize user engagement by choosing a feasible representation $z=\phi(x)$ that will incite the user to choose the item. Importantly, while values are based on the true $x$, choices rely on representations $z$—which the system controls. This causes friction: a representation may be optimal for the system, but may not align with user interests.

The challenge for users (and our goal in learning) is therefore to make good choices on the basis of strategic representations. Note that this is not a simple missing-value problem, since which attributes will be missing depends on how users choose, i.e., on their learned choice function $h$. (Consider the following subtlety: since users must commit to a choice strategy $h$, the choice of how to handle missing features, which is itself part of the strategy, determines which features will be missing.) Nor is this a problem that can be addressed by mapping attributes to a categorical representation (with classes ‘0’, ‘1’, and ‘unknown’); to see this, note that given a subset representation $z$, it is impossible to know which of the attributes that do not appear in $z$ are withheld—and which are truly missing.

Subset representations.

We focus on subset representations since they provide a natural means to ensure that representations reveal ‘nothing but the truth’ (but not necessarily ‘the whole truth’). This is our primary modeling concern, which stems from realistic restrictions on the system (like consumer expectations, legal obligations, or business practices). Subset representations cleanly capture what we view as a fundamental tension between users and an informationally-advantageous system—the ability to withhold information; examples include retail items described by a handful of attributes; videos presented by several key frames; and movie recommendations described by a select set of reviews, to name a few.

Overview of our results.

We begin by showing that users who choose naïvely can easily be manipulated by a strategic system (Sec. 3). We then proceed to study users who learn (Sec. 4). Our main result is an efficient algorithm for learning $h$ (Thm. 4.7), which holds under a certain realizability assumption on $v$. The algorithm minimizes the empirical error over a hypothesis class of set-function classifiers, whose complexity is controlled by a parameter $k$, thus allowing users to trade off expressivity and statistical efficiency. The algorithm builds upon several structural properties which we establish for set-function classifiers. Our key result here is that ‘folding’ the system’s strategic response into the hypothesis class results in an induced class having a simple form that makes it amenable to efficient optimization (Thm. 4.6).

Building on this, we continue to study the ‘balance of power’ (Sec. 5), as manifested in the interplay between $k$ (hypothesis complexity, which reflects the power of the user) and the range $[k_{1},k_{2}]$ (which determines the power of the system). For fixed $k_{1},k_{2}$, we analyze how the choice of $k$ affects the user’s classification error, through its decomposition into estimation and approximation errors. For estimation, we give a generalization bound (Thm. 5.9), obtained by analyzing the VC dimension of the induced function class (as it depends on $k$). For approximation, we give several results (e.g., Thm. 5.4) that link the expressivity of the user’s value function $v$ to the complexity of the learned $h$ (again, as it depends on $k$). Together, our results characterize how much is gained versus how much effort is invested in learning as $k$ is varied. One conclusion is that even minimal learning can help significantly (whereas no learning can be catastrophic).

From the system’s perspective, and for fixed $k$, we study how the range $[k_{1},k_{2}]$ affects the system. Intuitively, we would expect that increasing the range should be beneficial to the system, as it provides more alternatives to choose $z$ from. However, perhaps surprisingly, we find that the system can increase its payoff by ‘tying its hands’ to a lower $k_{2}$. This is because $k_{2}$ upper-bounds not only the system’s range but also the ‘effective’ $k$ of the user (who gains nothing from choosing $k>k_{2}$), and the lower the $k$, the better it is for the system (Lemma 5.10). The choice of $k_{1}$ turns out to be immaterial against fully strategic users, but highly consequential against users that are not.

1.1 Relation to Strategic Classification

Our formalization of strategic representation shares a deep connection to strategic classification (Hardt et al., 2016). Strategic representation and strategic classification share an underlying structure (a leading learning player who must take into account a strategic responder), but there are important differences. The first is conceptual: in our setting, roles are reversed—it is users who learn (and not the system), and the system strategically responds (and not users). This allows us to pursue questions regarding the susceptibility of users to manipulation, with different emphases and goals, while maintaining the ‘language’ of strategic classification.

The second difference is substantive: we consider inputs that are sets (rather than continuous vectors), and manipulations that hide information (rather than alter it). Technically, switching to discrete inputs can be done by utilizing the cost function to enforce truthfulness constraints as a ‘hard’ penalty on modifications. But this change is not cosmetic: since the system behaves strategically by optimizing over subsets, learning must account for set-relations between different objects in input space; this transforms the problem into one of learning set functions. (This is a subtle point: since sets can be expressed as binary membership vectors, it is technically possible to use conventional vector-based approaches to learn $h$. Nonetheless, these approaches cannot account for strategic behavior, since $\phi$ implements a set operation, which is ill-defined for continuous vector inputs.) From a modeling perspective, subsets make our work compatible with classic attribute-based consumer decision theory (Lancaster, 1966).

Overall, we view the close connection to strategic classification as a strength of our formalization, showing that the framework of strategic classification is useful far beyond what was previously considered; and also that a fairly mild variation can lead to a completely different learning problem, with distinct algorithmic and statistical challenges.

1.2 Related Work (see also Appx. B)

Strategic classification.

Strategic classification is a highly active area of research. Recent works in this field include statistical learning characterizations Zhang and Conitzer (2021); Sundaram et al. (2021); Ghalme et al. (2021), practical optimization methods Levanon and Rosenfeld (2021, 2022), relaxation of key assumptions Ghalme et al. (2021); Bechavod et al. (2022); Jagadeesan et al. (2021); Levanon and Rosenfeld (2022); Eilat et al. (2022), relations to causal aspects Miller et al. (2020); Chen et al. (2020a), and consideration of societal implications Milli et al. (2019); Hu et al. (2019); Chen et al. (2020b); Levanon and Rosenfeld (2021).

Two recent works are closely related to ours: Zrnic et al. (2021) consider a dynamic setting of repeated learning that can change the order of play; in contrast, we switch the roles. Krishnaswamy et al. (2021) consider information withholding by users (rather than the system). They aim to learn a truth-eliciting mechanism, which incentivizes the second player to reveal all information (i.e., ‘the whole truth’). Their mechanism ensures that withholding never occurs; in contrast, our goal is to predict despite strategic withholding (i.e., ‘nothing but the truth’).

Bayesian persuasion. In Bayesian persuasion (Kamenica and Gentzkow, 2011), a more-informed player (i.e., the system) uses its information advantage, coupled with commitment power, to influence the choices of a decision maker (i.e., the user). The works closest to ours are by Dughmi et al. (2015), who upper bound the number of signaled attributes, and by Haghtalab et al. (2021), who study strategic selection of anecdotes. Both works consider the human player as a learner (as do we). However, in our work the order of play is reversed—the decision-maker (user) moves first and the more-informed player (system) follows. Bayesian persuasion also assumes that the system knows the user’s valuation, and crucially relies on both parties knowing the distribution $D$ of items. In contrast, we model the user as having only sample access to $D$, and the system as agnostic to it.

2 A Formalization of Strategic Representation

We begin by describing the setting from the perspective of the user, which is the learning entity in our setup. We then present the ‘types’ of users we study, and draw connections between our setting and others found in the literature.

2.1 Learning Setting

In our setup, a user is faced with a stream of items, and must choose which of these to consume. Items are discrete, with each item $x\in\mathcal{X}\subseteq 2^{E}$ described by a subset of ground attributes $E$, where $|E|=q$. We assume all feasible items have at most $n$ attributes, $|x|\leq n$. The value of items for the user is encoded by a value function, $v:\mathcal{X}\rightarrow\mathbb{R}$. We say an item $x$ is worthwhile to the user if it has positive value, $v(x)>0$, and use $y=Y(x)=\operatorname{sign}(v(x))$ to denote worthwhileness, i.e., $y=1$ if $x$ is worthwhile, and $y=-1$ if it is not. Items are presented to the user as samples drawn i.i.d. from some unknown distribution $D$ over $\mathcal{X}$, and for each item, the user must choose whether to consume it (e.g., click, buy, watch) or not.

We assume that the user makes choices regarding items by committing at the onset to a choice function $h$ that governs her choice behavior. In principle, the user is free to choose $h$ from some predefined function class $H$; learning will consider finding a good $h\in H$, but the implications of the choice of $H$ itself will play a central role in our analysis. Ideally, the user would like to choose items if and only if they are worthwhile to her; practically, her goal is to find an $h$ for which this holds with large probability over $D$. For this, the user can make use of her knowledge regarding items she has already consumed, and whose value she therefore knows; we model this as providing the user access to a labeled sample set $S=\{(x_{i},y_{i})\}_{i=1}^{m}$ where $x_{i}\sim D$ and $y_{i}=\operatorname{sign}(v(x_{i}))$, which she can use for learning $h$.
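To ground the notation, here is a minimal Python sketch of this data setup (the ground set, value function, and distribution below are all hypothetical; the paper does not prescribe any of them):

```python
import random

# Hypothetical ground set E of q attributes; items are subsets of E.
E = [f"a{i}" for i in range(10)]   # q = 10 attributes (illustrative)
n = 4                              # all items satisfy |x| <= n

def v(x: frozenset) -> float:
    """Toy set-valued value function: the pair {a0, a1} is complementary."""
    return 1.0 if {"a0", "a1"} <= x else -1.0

def sample_item(rng: random.Random) -> frozenset:
    """One draw x ~ D; here D is uniform over small subsets (assumption)."""
    return frozenset(rng.sample(E, rng.randint(1, n)))

# Labeled sample S = {(x_i, y_i)} with y_i = sign(v(x_i)).
rng = random.Random(0)
S = [(x, 1 if v(x) > 0 else -1) for x in (sample_item(rng) for _ in range(200))]
```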

Strategic representations. The difficulty in learning $h$ is that user choices at test time must rely only on item representations, denoted $z\in\mathcal{Z}$, rather than on full item descriptions. Thus, learning considers choice functions that operate on representations, $h:\mathcal{Z}\rightarrow\{\pm 1\}$; the challenge lies in that while choices must be made on the basis of representations $z$, item values are derived from their full descriptions $x$—which representations describe only partially.

The crucial aspect of our setup is that representations are not arbitrary; rather, representations are controlled by the system, which can choose them strategically to promote its own goals. We model the system as acting through a representation mapping, $\phi:\mathcal{X}\rightarrow\mathcal{Z}$, which operates independently on any $x$, and can be determined in response to the user’s choice of $h$. This mimics a setting in which a fast-acting system can infer and quickly respond to a user’s (relatively fixed) choice patterns.

We assume the system’s goal is to choose a $\phi_{h}$ that maximizes expected user engagement:

$$\mathbb{E}_{x\sim D}\left[\mathds{1}\{h(\phi_{h}(x))=1\}\right]. \qquad (1)$$

Nonetheless, representations cannot be arbitrary, and we require $\phi_{h}$ to satisfy two properties. First, chosen representations must be truthful, meaning that $z\subseteq x$ for all $x$. Second, representations are subject to cardinality constraints, $k_{1}\leq|z|\leq k_{2}$ for some predetermined $k_{1},k_{2}\in\mathbb{N}$. We will henceforth use $\mathcal{Z}$ to mean representations of feasible cardinality. Both requirements stem from realistic considerations: a non-truthful system which intentionally distorts item information is unlikely to be commercially successful in the long run; intuitively, truthfulness gives users some hope of resilience to manipulation. As for $k_{1},k_{2}$, we think of these as exogenous parameters of the environment, arising naturally due to physical restrictions (e.g., screen size) or cognitive considerations (e.g., information processing capacity); if $k_{2}<n$, we say representations are lossy.

Under these constraints, the system can optimize Eq. (1) by choosing representations via the best-response mapping:

$$\phi_{h}(x)=\operatorname*{argmax}_{z\in\mathcal{Z}}\ h(z)\quad\text{ s.t. }\quad z\subseteq x,\ |z|\in[k_{1},k_{2}] \qquad (2)$$

Eq. (2) is a best-response since it maximizes Eq. (1) for any given $h$: for every $x$, $\phi_{h}$ chooses a feasible $z\subseteq x$ that triggers a positive choice event, $h(z)=1$—if such a $z$ exists. In this way, $k_{1},k_{2}$ control how much leeway the system has in revealing only partial truths; as we will show, both parameters play a key role in determining outcomes for both system and user. From now on we overload notation and by $\phi_{h}(x)$ refer to this best-response mapping.
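For intuition, the best response in Eq. (2) can be computed by brute force whenever items are small. A sketch under the toy setup above (the function name is ours, and we assume $|x|\geq k_{1}$ so that a feasible representation exists):

```python
from itertools import combinations

def best_response(h, x: frozenset, k1: int, k2: int) -> frozenset:
    """phi_h(x): return a feasible z subset of x with h(z) = 1 if one exists;
    otherwise any feasible z (all such z are equally good for the system)."""
    feasible = [frozenset(c)
                for size in range(k1, min(k2, len(x)) + 1)
                for c in combinations(sorted(x), size)]
    for z in feasible:
        if h(z) == 1:
            return z
    return feasible[0]  # no z triggers acceptance; tie-break arbitrarily
```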

Learning objective. Given a function class $H$ and a labeled sample set $S$, the user aims to find a choice function $h\in H$ that correctly identifies worthwhile items given their representation, and in a way that is robust to strategic system manipulation. The user’s objective is therefore to maximize:

$$\mathbb{E}_{x\sim D}\left[\mathds{1}\{h(\phi_{h}(x))=y\}\right] \qquad (3)$$

where $\phi_{h}$ is the best-response mapping in Eq. (2).

Note that since $h$ is binary, the argmax of $\phi_{h}$ may not be unique; e.g., if some $z_{1}\subseteq x$, $z_{2}\subseteq x$ both have $h(z_{1})=h(z_{2})=1$. Nonetheless, the particular choice of $z$ does not matter—from the user’s perspective, her choice of $h$ is invariant to the system’s choice of best-response $z$ (proof in Appx. C):

Observation 2.1.

Every best-response $z\in\phi_{h}(x)$ induces the same value in the user’s objective function (Eq. (3)).

2.2 User Types

Our main focus throughout the paper will be on users that learn $h$ by optimizing Eq. (3). But to understand the potential benefit of learning, we also analyze ‘simpler’ types of user behavior. Overall, we study three user types, varying in their sophistication and the amount of effort they invest in choosing $h$. These include:

  • The naïve user: Acts under the (false) belief that representations are chosen in her own best interest. This user type truthfully reports her preferences to the system by setting $h=v$ as her choice function. (Note that while $h$ is defined over $\mathcal{Z}$, $v$ is defined over $\mathcal{X}$. Nonetheless, truthfulness implies that $\mathcal{Z}\subseteq\mathcal{X}$, and so $v$ is well-defined as a choice function over $\mathcal{Z}$.)

  • The agnostic user: Makes no assumptions about the system. This user employs a simple strategy that relies on basic data statistics and provides minimal but robust guarantees regarding her payoff.

  • The strategic user: Knows that the system is strategic, and anticipates it to best-respond. This user is willing to invest effort (in terms of data and compute) in learning a choice function $h$ that maximizes her payoff by accounting for the system’s strategic behavior.

Our primary goal is to study the balance of power between users (that choose) and the system (which represents). In particular, we will be interested in exploring the tradeoff between a user’s effort and her susceptibility to manipulation.

2.3 Strategic Representation as a Game

Before proceeding, we give an equivalent characterization of strategic representation as a game. Our setting can be compactly described as a single-step Stackelberg game: the first player is User, who observes samples $S=\{(x_{i},y_{i})\}_{i=1}^{m}$ and commits to a choice function $h:\mathcal{Z}\rightarrow\{\pm 1\}$; the second player is System, which, given $h$, chooses a truthful $\phi_{h}:\mathcal{X}\rightarrow\mathcal{Z}$ (note how $\phi_{h}$ depends on $h$). The payoffs are:

User:   $\mathbb{E}_{x\sim D}\left[\mathds{1}\{h(\phi_{h}(x))=y\}\right]$  (4)
System: $\mathbb{E}_{x\sim D}\left[\mathds{1}\{h(\phi_{h}(x))=1\}\right]$  (5)

Note that payoffs differ only in that User seeks correct choices, whereas System benefits from positive choices. This reveals a clear connection to strategic classification, in which System, who plays first, is interested in accurate predictions, and for this it can learn a classifier; and User, who plays second, can manipulate individual inputs (at some cost) to obtain positive predictions. Thus, strategic representation can be viewed as a variation on strategic classification, but with roles ‘reversed’. Nonetheless, and despite these structural similarities, strategic representation bears key differences: items are discrete (rather than continuous), manipulations are subject to ‘hard’ set constraints (rather than ‘soft’, continuous costs), and learning regards set functions (rather than vector functions). These differences lead to distinct questions and unique challenges in learning.

3 Warm-up: Naïve and Agnostic Users

The naïve user. The naïve user employs a ‘what you see is what you get’ policy: given a representation $z$ of an item, this user estimates the item’s value based on $z$ alone, acting ‘as if’ $z$ were the item itself. Consequently, the naïve user sets $h(z)=\operatorname{sign}(v(z))$, even though $v$ is truly a function of $x$. The naïve user fails to account for the system’s strategic behavior (let alone the fact that $z\subseteq x$ for some actual $x$).

Despite its naivety, there are conditions under which this user’s approach makes sense. Our first result shows that the naïve policy is sensible in settings where the system is benevolent, and promotes user interests instead of its own.

Lemma 3.1.

If the system plays the benevolent strategy

$$\phi^{\mathrm{benev}}_{h}(x)=\operatorname*{argmax}_{z\subseteq x,\ |z|\in[k_{1},k_{2}]}\ \mathds{1}\{h(z)=\operatorname{sign}(v(x))\},$$

then the naïve approach maximizes user payoff.

Proof in Appx. D. The above lemma is not meant to imply that naïve users assume the system is benevolent; rather, it justifies why users holding this belief might act in this way. Real systems, however, are unlikely to be benevolent; our next example shows that a strategic system can easily manipulate a naïve user into receiving arbitrarily low payoff.

Example 1.

Let $x_{1}=\{a_{1}\}$, $x_{2}=\{a_{1},a_{2}\}$, $x_{3}=\{a_{2}\}$ with $v(x_{1})=+1$ and $v(x_{2})=v(x_{3})=-1$. Fix $k_{1}=k_{2}=1$, and let $D=(\varepsilon/2,\ 1-\varepsilon,\ \varepsilon/2)$. Note that $\mathcal{Z}=\{a_{1},a_{2}\}$ are the feasible representations. The naïve user assigns $h(a_{1})=+1$, $h(a_{2})=-1$ according to $v$. For $x_{2}$, a strategic system plays $\phi(x_{2})=a_{1}$. The expected payoff to the user is $\varepsilon$.
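Example 1 is easy to verify numerically with the brute-force `best_response` sketched earlier (the payoff computation below weighs each item by its probability under $D$; all names are ours):

```python
eps = 0.1
# (item, (label y = sign(v), probability under D)), as in Example 1.
population = [(frozenset({"a1"}), (+1, eps / 2)),
              (frozenset({"a1", "a2"}), (-1, 1 - eps)),
              (frozenset({"a2"}), (-1, eps / 2))]

def h_naive(z):  # naive user: h(z) = sign(v(z)); here only v({a1}) = +1
    return +1 if z == frozenset({"a1"}) else -1

payoff = sum(p for x, (y, p) in population
             if h_naive(best_response(h_naive, x, 1, 1)) == y)
print(payoff)  # = eps: x2 is represented as {a1} and wrongly accepted
```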

One reason a naïve user is susceptible to manipulation is that she makes no use of the data she may have. We next describe a slightly more sophisticated user who uses a simple strategy to ensure a better payoff.

The agnostic user. The agnostic user puts all her faith in data; this user does not make assumptions on, nor is she susceptible to, the type of system she plays against. Her strategy is simple: collect data, compute summary statistics, and choose to either always accept or always reject (or flip a coin). In particular, given a sample set $S=\{(x_{i},y_{i})\}_{i=1}^{m}$, the agnostic user first computes the fraction of positive examples, $\hat{\mu}:=\frac{1}{m}\sum_{i=1}^{m}\mathds{1}\{y_{i}=1\}$. Then, for some tolerance $\tau$, she sets for all $z$: $h(z)=1$ if $\hat{\mu}\geq 1/2+\tau$, $h(z)=-1$ if $\hat{\mu}\leq 1/2-\tau$, and flips a coin otherwise. In Example 1, an agnostic user would choose $h=(-1,-1)$ when $m$ is large, guaranteeing a payoff of at least $\frac{\sqrt{m}(1-\varepsilon/2)}{2+\sqrt{m}}\rightarrow(1-\varepsilon/2)$ as $m\rightarrow\infty$. Investing minimal effort, for an appropriate choice of $\tau$, this user’s strategy turns out to be quite robust.
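The agnostic rule itself takes only a few lines; a sketch ($\tau$ is the tolerance from the theorem below, and the coin is flipped once, when committing to $h$):

```python
import random

def agnostic_choice(S, tau: float, rng: random.Random):
    """Commit to a constant accept/reject rule using only the positive rate."""
    mu_hat = sum(1 for _, y in S if y == +1) / len(S)
    if mu_hat >= 0.5 + tau:
        return lambda z: +1            # always accept
    if mu_hat <= 0.5 - tau:
        return lambda z: -1            # always reject
    coin = rng.choice([+1, -1])        # flip a coin otherwise
    return lambda z: coin
```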

Theorem 3.2.

(Informal) Let $\mu$ be the true rate of positive examples, $\mu=\Pr_{D}[y=1]$. Then as $m$ increases, the agnostic user’s payoff approaches $\max\{\mu,1-\mu\}$ at rate $1/\sqrt{m}$.

Formal statement and proof in Appx. A.1. In essence, the agnostic user guarantees herself the ‘majority’ rate with rudimentary usage of her data, and in a way that does not depend on how the system responds. But this can be far from optimal; we now turn to the more elaborate strategic user, who makes cleverer use of the data at her disposal.

4 Strategic Users Who Learn

A strategic agent acknowledges that the system is strategic, and anticipates that representations are chosen to maximize her own engagement. Knowing this, the strategic user makes use of her previous experiences, in the form of a labeled data set $S=\{(x_{i},y_{i})\}_{i=1}^{m}$, to learn a choice function $\hat{h}$ from some function class $H$ that optimizes her payoff (given that the system is strategic). Cast as a learning problem, this is equivalent to minimizing the expected classification error on strategically-chosen representations:

$$h^{*}=\operatorname*{argmin}_{h\in H}\ \mathbb{E}_{D}\left[\mathds{1}\{h(\phi_{h}(x))\neq y\}\right]. \qquad (6)$$

Since the distribution $D$ is unknown, we follow the conventional approach of empirical risk minimization (ERM) and optimize the empirical analog of Eq. (6):

$$\hat{h}=\operatorname*{argmin}_{h\in H}\ \frac{1}{m}\sum_{i=1}^{m}\mathds{1}\{h(\phi_{h}(x_{i}))\neq y_{i}\}. \qquad (7)$$

Importantly, since every $z_{i}=\phi_{h}(x_{i})$ is a set, $H$ must include set functions $h:\mathcal{Z}\rightarrow\{\pm 1\}$, and any algorithm for optimizing Eq. (7) must take this into account. In Sections 4.1 and 4.2 we characterize the complexity of a user’s choice function and relate it to that of $v$, and in Section 4.3 we give an algorithm that computes $\hat{h}$, the empirical minimizer, for a hypothesis class of a given complexity.

4.1 Complexity Classes of Set Functions

Ideally, a learning algorithm should permit flexibility in choosing the complexity of the class of functions it learns (e.g., the degree of a polynomial kernel, the number of layers in a neural network), as this provides means to trade-off running time with performance and to reduce overfitting. In this section we propose a hierarchy of set-function complexity classes that is appropriate for our problem.

Denote by $\Gamma_{k}(z)$ the set of all subsets of $z$ having size at most $k$:

$$\Gamma_{k}(z)=\{z^{\prime}\in 2^{E}\,:\,z^{\prime}\subseteq z,\ |z^{\prime}|\leq k\}.$$

We start by defining $k$-order functions over the representation space. These functions are completely determined by weights placed on subsets of size at most $k$.

Definition 4.1.

We say the function $h:\mathcal{Z}\rightarrow\{\pm 1\}$ is of order $k$ if there exist real weights on sets of cardinality at most $k$, $\{w(z^{\prime})\,:\,z^{\prime}\in\Gamma_{k}(z)\}$, such that

$$h(z)=\operatorname{sign}\Big(\sum\nolimits_{z^{\prime}\in\Gamma_{k}(z)}w(z^{\prime})\Big).$$

Not all functions $h(z)$ can necessarily be expressed as a $k$-order function (for some $k$); nonetheless, in the context of optimizing Eq. (7), we show that working with $k$-order functions is sufficiently general, since any set function $h$ can be linked to a matching $k$-order function $h^{\prime}$ (for some $k\leq k_{2}$) through how it operates on strategic inputs.

Lemma 4.2.

For any $h:\mathcal{Z}\rightarrow\{\pm 1\}$, there exist $k\leq k_{2}$ and a corresponding $k$-order function $h^{\prime}$ such that:

$$h(\phi_{h}(x))=h^{\prime}(\phi_{h^{\prime}}(x))\,.$$

Lem. 4.2 permits us to focus on $k$-order functions. The proof is constructive (see Appx. E), and the construction itself turns out to be highly useful. In particular, the proof constructs $h^{\prime}$ having a particular form of binary basis weights, $w(z)\in\{a_{-},a_{+}\}$, which we assume from now on are fixed (for all $k$). Hence, every function $h$ has a corresponding binary-weighted $k$-order function $h^{\prime}$, which motivates the following definition of functions and function classes.

Definition 4.3.

We say a $k$-order function $h$ with basis weights $\bm{w}$ is binary-weighted if:

$$w(z)\begin{cases}\in\{a_{-},a_{+}\}&\forall z\text{ such that }|z|=k\\ =a_{-}&\forall z\text{ such that }|z|<k\end{cases}$$

for some fixed $a_{-}\in(-1,0)$ and $a_{+}>\sum_{i\in[k]}\binom{n}{i}$.

A binary-weighted $k$-order $h$ determines a family of $k$-size subsets (those whose weights are $a_{+}$) such that for any $z\in\mathcal{Z}$ with $|z|\geq k$, $h(z)=1$ if and only if $z$ contains a subset from the family (for $z$ with $|z|<k$, $h(z)=-1$ always). This is made precise using the notion of lifted functions in Lem. A.4 in Appx. A.2. Next, denote:

$$H_{k}=\{h\,:\,h\text{ is a binary-weighted }k\text{-order function}\}.$$

The classes $\{H_{k}\}_{k}$ will serve as complexity classes for our learning algorithm; the user provides $k$ as input, and Alg outputs an $\hat{h}\in H_{k}$ that minimizes the empirical loss (assuming the empirical error is zero). As we will show, using $k$ as a complexity measure provides the user direct control over the tradeoff between estimation and approximation error, as well as over the running time.
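Concretely, a binary-weighted $k$-order function is fully determined by the family of $k$-sets carrying weight $a_{+}$; storing that family in a hash table (as the paper suggests for Alg below) gives the following sketch:

```python
from itertools import combinations

def make_h(positive_ksets: set, k: int):
    """Binary-weighted k-order h (Def. 4.3): since a single a_+ weight
    outweighs the sum of all a_- weights, h(z) = +1 iff |z| >= k and z
    contains a set from the a_+ family; any shorter z gets h(z) = -1."""
    def h(z: frozenset) -> int:
        if len(z) < k:
            return -1
        hit = any(frozenset(c) in positive_ksets
                  for c in combinations(sorted(z), k))
        return +1 if hit else -1
    return h

# Example: accept exactly those z (with |z| >= 2) containing {a0, a1}.
h = make_h({frozenset({"a0", "a1"})}, k=2)
```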

Next, we show that the $\{H_{k}\}_{k}$ classes are strictly nested. This will be important for our analysis of approximation error, as it will let us reason about the connection between the learned $\hat{h}$ and the target function $v$ (proof in Appx. E).

Lemma 4.4.

For all $k$, $H_{k-1}\subseteq H_{k}$ and $H_{k}\setminus H_{k-1}\neq\emptyset$.

Note that $H_{n}$ includes all binary-weighted set functions, but since representations are of size at most $k_{2}$, it suffices to consider only $k\leq k_{2}$. Importantly, $k$ can be set lower than $k_{1}$; for example, $H_{1}$ is the class of threshold modular functions, and $H_{2}$ is the class of threshold pairwise functions. The functions we consider are parameterized by their weights, $\bm{w}$, and so any $k$-order function has at most $|\bm{w}|=\sum_{i=0}^{k}\binom{q}{i}$ weights. In this sense, the choice of $k$ is highly meaningful. Now that we have defined our complexity classes, we turn to discussing how they can be optimized over.

4.2 Learning via Reduction to Induced Functions

The simple structure of functions in $H_{k}$ makes them good candidates for optimization. But the main difficulty in optimizing the empirical error in Eq. (7) is that the choice of $h$ determines not only the error, but also the inputs on which errors are measured (indirectly, through the dependence of $\phi_{h}$ on $h$). To cope with this challenge, our approach is to work with induced functions that already have the system’s strategic response encoded within, which will prove useful for learning. Additionally, as they operate directly on $x$ (and not $z$), they can easily be compared with $v$, which will become important in Sec. 5.

Definition 4.5.

For a class $H$, its induced class is:

$$F_{H}\triangleq\{f:\mathcal{X}\rightarrow\{\pm 1\}\,:\,\exists h\in H\text{ s.t. }f(x)=h(\phi_{h}(x))\}.$$

The induced class $F_{H}$ includes, for every $h\in H$, a corresponding function that already has $\phi_{h}$ integrated into it. We use $F_{k}=F_{H_{k}}$ to denote the induced class of $H_{k}$. For every $h$, we denote its induced function by $f_{h}$. Whereas $h$ functions operate on $z$, induced functions operate directly on $x$, with each $f_{h}$ accounting internally for how the system strategically responds to $h$ on each $x$.

Our next theorem provides a key structural result: induced functions inherit the weights of their $k$-order counterparts.

Theorem 4.6.

For any $h\in H_{k}$ with weights $\bm{w}$:

$$h(z)=\operatorname{sign}\Big(\sum\nolimits_{z^{\prime}\in\Gamma_{k}(z)}w(z^{\prime})\Big),$$

its induced $f_{h}\in F_{k}$ can be expressed using the same weights, $\bm{w}$, but with summation over subsets of $x$, i.e.:

$$f_{h}(x)=\operatorname{sign}\Big(\sum\nolimits_{z\in\Gamma_{k}(x)}w(z)\Big).$$

Thm. 4.6 is the main pillar on which our algorithm stands: it allows us to construct $h$ by querying the loss directly—i.e., without explicitly computing $\phi_{h}$—by working with the induced $f_{h}$; this is since:

$$\mathds{1}\{h(\phi_{h}(x_{i}))\neq y_{i}\}=\mathds{1}\{f_{h}(x_{i})\neq y_{i}\}.$$

Thus, through their shared weights, induced functions serve as a bridge between what we optimize, and how.
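The equivalence can be sanity-checked directly in the toy setup: the induced $f_{h}$ below scans $k$-subsets of $x$ using the same $a_{+}$ family as $h$ (this sketch reuses `S`, `make_h`, `best_response`, and the `combinations` import from the earlier snippets):

```python
def f_induced(positive_ksets: set, k: int, x: frozenset) -> int:
    """Induced f_h(x): same weights as h, but summed over subsets of x."""
    hit = any(frozenset(c) in positive_ksets
              for c in combinations(sorted(x), k))
    return +1 if hit else -1

# 1{h(phi_h(x)) != y} == 1{f_h(x) != y} on every sample, with k1 <= k <= k2.
family, k, k1, k2 = {frozenset({"a0", "a1"})}, 2, 1, 3
h = make_h(family, k)
for x, y in S:
    lhs = h(best_response(h, x, k1, k2)) != y
    rhs = f_induced(family, k, x) != y
    assert lhs == rhs
```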

4.3 Learning Algorithm

We now present our learning algorithm, Alg. The algorithm is exact: it takes as input a training set $S$ and a parameter $k$, and returns an $h\in H_{k}$ that minimizes the empirical loss (Eq. (7)). Correctness holds under the realizability condition $Y\in F_{k}$, i.e., $Y$ is the induced function of some $h\in H_{k}$. (Note that even for standard linear binary classification, finding an empirical minimizer of the 0/1 loss in the agnostic (i.e., non-realizable) case is NP-hard (Shalev-Shwartz and Ben-David, 2014).)

The algorithm constructs $h$ by sequentially computing its weights, $\bm{w}=\{w(z)\}_{|z|\leq k}$. As per Def. 4.3, only $w(z)$ for $z$ with $|z|=k$ must be learned; hence, weights are sparse, in the sense that only a small subset of them are assigned $a_{+}$, while the rest are $a_{-}$. Weights can be implemented as a hash table, where $w(z)=a_{+}$ if $z$ is in the table, and $w(z)=a_{-}$ if it is not. Our next result establishes the correctness of Alg. The proof leverages a property that characterizes the existence of an $h\in H_{k}$ having zero empirical error (see Lem. E.1). The proof of Lem. E.1 uses Thm. 4.6, which enables the loss to be computed directly for the induced functions using the shared weight structure.

Algorithm 1: Alg
1:  Input: $S=\{(x_{i},y_{i})\}_{i\in[m]}$, $k\in[k_{2}]$
2:  Pre-compute:
3:    $S^{+}=\{x\in S:y=+1\}$,
4:    $S^{-}=\{x\in S:y=-1\}$
5:    $Z_{k,S}=\{z\,:\,|z|=k,\ \exists x\in S\ \text{s.t.}\ z\subseteq x\}$
6:    $\hat{p}(x_{i})=\frac{1}{m}\sum_{j\in[m]}\mathds{1}\{x_{i}=x_{j}\}\quad\forall i\in[m]$
7:  Fix $a_{-}\in(-1,0)$ and $a_{+}>\sum_{i\in[1,k]}\binom{n}{i}$
8:  Initialize:
9:    $Z^{+}=\varnothing$, $Z^{-}=\varnothing$, $V=\varnothing$, $S_{z}=\varnothing\quad\forall z\in Z_{k,S}$
10: Run:
11: for $x\in S^{-}$ do
12:   for $z$ s.t. $z\subseteq x$ and $z\in Z_{k,S}$ do
13:     $Z^{-}=Z^{-}\cup\{z\}$,  $Z_{k,S}=Z_{k,S}\setminus\{z\}$
14:     $S_{z}=S_{z}\cup\{x\}$
15:   end for
16: end for
17: for $x\in S^{+}$ do
18:   for $z\subseteq x$ such that $z\in Z_{k,S}$ do
19:     $Z^{+}=Z^{+}\cup\{z\}$
20:   end for
21: end for
22: Set $w(z)=\begin{cases}a_{+}&\text{if }z\in Z^{+}\\ a_{-}&\text{o.w. (implicitly)}\end{cases}$  ▷ implemented as hash table
23: Return $\hat{h}(z)=\operatorname{sign}\big(\sum\nolimits_{z^{\prime}\in\Gamma_{k}(z)}w(z^{\prime})\big)$.
Theorem 4.7.

For any $k\in[k_{2}]$, if $Y$ is realizable then Alg returns an $\hat{h}$ that minimizes the empirical error.

Proof in Appx. E. Note that our algorithm is exact: it returns a true minimizer of the empirical 0/1 loss, assuming $Y$ is realizable. Additionally, Alg can be used to identify whether there exists an $h\in H_{k}$ with zero empirical error: at Step 15, if for some $x\in S^{+}$ there is no $z\in Z_{k,S}$ or $z\in Z^{+}$ with $z\subseteq x$, then by Lem. E.1 in Appx. E there is no $h$ with zero empirical error.
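Under the same toy setup, Algorithm 1 admits a compact transcription (a sketch, not the paper’s exact pseudocode: the bookkeeping sets $S_{z}$ and $\hat{p}$ are dropped since they are not needed to produce $\hat{h}$; `make_h` is the earlier binary-weighted constructor):

```python
from itertools import combinations

def alg(S, k: int):
    """ERM over H_k under realizability: Z^+ collects the k-sets that occur
    in some positive example but in no negative one; they get weight a_+."""
    z_neg = {frozenset(c) for x, y in S if y == -1
             for c in combinations(sorted(x), k)}
    z_pos = {frozenset(c) for x, y in S if y == +1
             for c in combinations(sorted(x), k)}
    z_plus = z_pos - z_neg
    # Realizability check (cf. Lem. E.1): every positive x must be covered.
    for x, y in S:
        if y == +1 and not any(frozenset(c) in z_plus
                               for c in combinations(sorted(x), k)):
            raise ValueError("no h in H_k has zero empirical error")
    return make_h(z_plus, k)   # binary-weighted h as in Def. 4.3

h_hat = alg(S, k=2)  # recovers the toy v: Y is realizable with l* = 2
```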

Lemma 4.8.

Let $n$ be the maximal size of items in $\mathcal{X}$, $m$ the number of samples, and $k\leq k_{2}$ the user’s choice of complexity. Then Alg runs in $O(m\binom{n}{k})$ time.

This runtime is made possible due to several key factors: (i) only $k$-sized weights need to be learned, (ii) all weights are binary-valued, and (iii) loss queries are efficient in induced space. Nonetheless, when $n$ and $k$ are large, the runtime may be significant, and so $k$ must be chosen with care. Fortunately, our results in Sec. 5.1.1 give encouraging evidence that learning with small $k$—even $k=1$, for which the runtime is $O((mn)^{2})$—is quite powerful (assuming $Y$ is realizable).

In the analysis, the $m\binom{n}{k}$ factor is possible only because weights are sparse and Alg operates on a finite sample set of size $m$. Alternatively, if $m$ is large, this factor can be replaced with $\binom{q}{k}$. This turns out to be necessary; in Appx. A.3 we show that, in the limit, $\binom{q}{k}$ is a lower bound.

5 Balance of Power

Our final section explores the question: what determines the balance of power between the system and its users? We begin with the perspective of the user, who has commitment power, but can only minimize the empirical error. For her, the choice of complexity class $k$ is key in balancing approximation error—how well (in principle) functions $h\in H_{k}$ can approximate $v$; and estimation error—how close the empirical payoff of the learned $\hat{h}$ is to its expected value. Our results give insight into how these types of error trade off as $k$ is varied (here we do not assume realizability).

For the system, the important factors are $k_{1}$ and $k_{2}$, since these determine its flexibility in choosing representations. Since more feasible representations mean more flexibility, it would seem plausible that smaller $k_{1}$ and larger $k_{2}$ should help the system more. However, our results indicate differently: for the system, smaller $k_{2}$ is better, and the choice of $k_{1}$ has limited effect on strategic users. The result for $k_{2}$ goes through a connection to the user’s choice of $k$; surprisingly, smaller $k$ turns out to be, in some sense, better for all.

5.1 User’s Perspective

We begin by studying the effects of $k$ on user payoff. Recall that users aim to minimize the expected error (Eq. (6)):

$$\varepsilon(h)=\mathbb{E}_{D}\left[\mathds{1}\{h(\phi_{h}(x))\neq\operatorname{sign}(v(x))\}\right],$$

but instead minimize the empirical error (Eq. (7)). For reasoning about the expected error of the learned choice function $\hat{h}\in H_{k}$, a common approach is to decompose it into two error types—approximation and estimation:

$$\varepsilon(\hat{h})=\underbrace{\varepsilon(h^{*})}_{\text{approx.}}+\underbrace{\varepsilon(\hat{h})-\varepsilon(h^{*})}_{\text{estimation}},\qquad h^{*}=\operatorname*{argmin}_{h^{\prime}\in H_{k}}\varepsilon(h^{\prime}).$$

Approximation error describes the lowest error obtainable by functions in $H_{k}$; this measures the ‘expressivity’ of $H_{k}$, and is independent of $\hat{h}$. For approximation error, we define a matching complexity structure for value functions $v$, and give several results relating the choice of $k$ and the complexity of $v$. Estimation error describes how far the learned $\hat{h}$ is from the optimal $h^{*}\in H_{k}$, and depends on the data size, $m$. Here we give a generalization bound based on VC analysis.

5.1.1 User approximation error

To analyze the approximation error, we must be able to relate choice functions $h$ (which operate on representations $z$) to the target value function $v$ (which operates on items $x$). To connect the two, we will again use induced functions, for which we now define a matching complexity structure.

Definition 5.1.

A function $f:\mathcal{X}\rightarrow\{\pm 1\}$ has an induced complexity of $\ell$ if there exists a function $g:Z_{\ell}\rightarrow\{\pm 1\}$ s.t.:

$$f(x)=\begin{cases}1&\text{if }\ \exists z\subseteq x,\ |z|=\ell\ \text{ and }\ g(z)=1\\ -1&\text{o.w.}\end{cases}$$

and $\ell$ is minimal (i.e., there is no such $g^{\prime}:Z_{\ell-1}\rightarrow\{\pm 1\}$).
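Over a finite item universe, the induced complexity of Def. 5.1 can be found by brute force: for each candidate $\ell$, the maximal feasible $g$ sets $g(z)=1$ exactly on the $\ell$-sets not contained in any negative item, and one checks whether this $g$ covers every positive item. A sketch (`universe` is an assumed list of all items paired with their labels $\operatorname{sign}(v(x))$):

```python
from itertools import combinations

def induced_complexity(universe, n: int):
    """Smallest l for which some g on l-sets realizes f as in Def. 5.1."""
    for l in range(1, n + 1):
        # l-sets that g must map to -1 (they appear inside negative items).
        blocked = {frozenset(c) for x, y in universe if y == -1
                   for c in combinations(sorted(x), l)}
        # Does the maximal g (+1 on every unblocked l-set) cover positives?
        covered = all(any(frozenset(c) not in blocked
                          for c in combinations(sorted(x), l))
                      for x, y in universe if y == +1)
        if covered:
            return l
    return None  # no l <= n realizes f
```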

We show in Lem. 5.2 and Cor. 5.3 that the induced complexity of a function $f$ captures the minimum $k\in[1,n]$ such that $f$ is an induced function of some $h\in H_{k}$.

Lemma 5.2.

Let $k\leq k_{2}$. Then for every $h\in H_{k}$, the induced complexity of the corresponding $f_{h}$ is $\ell\leq k$.

Corollary 5.3.

Let $F_{k}=F_{H_{k}}$ be the induced function class of $H_{k}$, as defined in Def. 4.5. Then:

$$F_{k}=\{f:\mathcal{X}\rightarrow\{\pm 1\}\,:\,f\text{ has induced complexity }\leq k\}.$$

The proof of Cor. 5.3 is in Appx. F. We now turn to considering the effect of $k$ on approximation error. Since the ‘worthwhileness’ function $Y(x)=\operatorname{sign}(v(x))$ operates on $x$, we can consider its induced complexity, which we denote by $\ell^{*}$ (i.e., $Y\in F_{\ell^{*}}$). The following result shows that if $\ell^{*}\leq k$, then $H_{k}$ is expressive enough to perfectly recover $Y$.

Theorem 5.4.

If $\ell^{*}\leq k$ then the approximation error is 0.

One conclusion from Thm. 5.4 is that if the user knows $\ell^{*}$, then zero error is, in principle, obtainable; another is that there is no reason to choose $k>\ell^{*}$. In practice, knowing $\ell^{*}$ can aid the user in tuning $k$ according to computational (Sec. 4.3) and statistical (Sec. 5.1.2) considerations. Further conclusions relate $\ell^{*}$ and $k_{2}$:

Corollary 5.5.

If $\ell^{*}\leq k_{2}$ and the distribution $D$ has full support on $\mathcal{X}$, then $k=\ell^{*}$ is the smallest $k$ that gives zero approximation error.

Corollary 5.6.

If $\ell^{*}>k_{2}$, then the approximation error weakly decreases with $k$, i.e., $\varepsilon(h^{*}_{k})\leq\varepsilon(h^{*}_{k-1})$ for all $k\leq k_{2}$. Furthermore, if the distribution $D$ has full support on $\mathcal{X}$, then no $k$ can achieve zero approximation error.

Figure 1: Upper bound on approximation error showing diminishing returns. Parameters: $q=400$, $n=30$, and $k_{2}=10$.

Proofs in Appx. F. In general, Cor. 5.6 guarantees only weak improvement with $k$. Next, we show that increasing $k$ can exhibit clear diminishing-returns behavior, with most of the gain obtained at very low $k$.

Lemma 5.7.

Let $D$ be the uniform distribution over $\mathcal{X}$. Then there is a value function $v$ for which $\varepsilon(h^{*}_{k})$ diminishes convexly with $k$.

The proof is constructive (see Appx. F). We construct a $v$ for which the approximation error of $h^{*}_{k}\in H_{k}$ is upper bounded by

$$\varepsilon(h^{*}_{k})\leq\frac{1}{4\binom{q}{n}}\sum_{\ell=k}^{k_{2}}\binom{k_{2}}{\ell}\binom{q-k_{2}}{n-\ell}\,.$$

The diminishing-returns property of this upper bound is illustrated in Fig. 1. Although Lem. 5.7 describes a special case, we conjecture that this phenomenon applies more broadly.
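The bound is cheap to evaluate; the following sketch reproduces the diminishing-returns curve of Fig. 1 numerically (same parameters):

```python
from math import comb

q, n, k2 = 400, 30, 10  # parameters of Fig. 1

def approx_bound(k: int) -> float:
    """Upper bound on eps(h*_k) from the constructive proof of Lem. 5.7."""
    total = sum(comb(k2, l) * comb(q - k2, n - l) for l in range(k, k2 + 1))
    return total / (4 * comb(q, n))

for k in range(1, k2 + 1):
    print(k, approx_bound(k))  # successive gaps shrink: diminishing returns
```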

Our next result shows that learning $k_{1}$-order functions can be as powerful as learning subadditive functions; hence, learning with $k=k_{1}$ is highly expressive. Interestingly, the connection between general ($k$-order) functions and subadditive functions is due to the strategic response mapping, $\phi$.

Lemma 5.8.

Consider threshold-subadditive functions:

$$H_{\mathrm{SA}}=\{\operatorname{sign}(g(z))\,:\,g\text{ is subadditive on subsets in }\mathcal{Z}\}.$$

Then for every threshold-subadditive $h_{g}\in H_{\mathrm{SA}}$, there is an $h\in H_{k_{1}}$ for which $h(\phi_{h}(x))=h_{g}(\phi_{h_{g}}(x))$ for all $x\in\mathcal{X}$.

5.1.2 User estimation error

For estimation error, we give generalization bounds based on VC analysis. The challenge in analyzing functions in $H_{k}$ is that generalization applies to the strategic 0/1 loss, i.e., $\mathds{1}\{h(\phi_{h}(x))\neq y\}$, and so standard bounds (which apply to the standard 0/1 loss) do not hold. To get around this, our approach relies on directly analyzing the VC dimension of the induced class, $F_{k}$ (a similar approach was taken in Sundaram et al. (2021) for strategic classification). This allows us to employ tools from VC theory, which give the following bound.

Theorem 5.9.

For any $k$ and $m$, given a sample set $S$ of size $m$ sampled from $D$ and labeled by some $v$, we have

$$\varepsilon(\hat{h})-\varepsilon(h^{*})\leq\sqrt{\frac{C\left(\binom{q}{k}\log\big(\binom{q}{k}/\epsilon\big)+\log(1/\delta)\right)}{m}}$$

w.p. at least $1-\delta$ over $S$, and for a fixed constant $C$. In particular, assuming $Y$ is realizable, Alg from Sec. 4.3 returns an $\hat{h}\in H_{k}$ for which:

$$\varepsilon(\hat{h})\leq\sqrt{\frac{C\left(\binom{q}{k}\log\big(\binom{q}{k}/\epsilon\big)+\log(1/\delta)\right)}{m}}$$

w.p. at least $1-\delta$ over $S$, and for a fixed constant $C$.

The proof relies on Thm. 4.6; since $h$ and $f_{h}$ share weights, the induced class $F_{k}$ can be analyzed as a class of $q$-variate degree-$k$ multilinear polynomials. Since induced functions already incorporate $\phi$, VC analysis for the 0/1 loss can be applied. Note that such polynomials have exactly $\binom{q}{k}$ degrees of freedom; hence the term in the bound.
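To see how quickly the $\binom{q}{k}$ term dominates, the bound of Thm. 5.9 can be evaluated numerically (a sketch; $C$, $\epsilon$, and $\delta$ are set to illustrative values since the theorem leaves $C$ unspecified):

```python
from math import comb, log, sqrt

def estimation_bound(q: int, k: int, m: int,
                     C: float = 1.0, eps: float = 0.05, delta: float = 0.05):
    """Right-hand side of the Thm. 5.9 estimation-error bound."""
    d = comb(q, k)                  # degrees of freedom of F_k
    return sqrt(C * (d * log(d / eps) + log(1 / delta)) / m)

for k in (1, 2, 3):                 # sample demands explode with k
    print(k, estimation_bound(q=40, k=k, m=10**6))
```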

5.2 System’s Perspective

The system’s expressive power derives from its flexibility in choosing representations $z$ for items $x$. Since $k_{1},k_{2}$ determine which representations are feasible, they directly control the system’s power to manipulate; and while the system itself may not have direct control over $k_{1},k_{2}$ (e.g., if they are set by exogenous factors like screen size), their values certainly affect the system’s ability to optimize engagement. Our next result is therefore unintuitive: for the system, a smaller $k_{2}$ is better (in the worst case), even though it reduces the set of feasible representations. This result is obtained indirectly, by considering the effect of $k_{2}$ on the user’s choice of $k$.

Lemma 5.10.

There exist a distribution $D$ and a value function $v$ such that for all $k<k^{\prime}\leq k_{2}$, the system obtains a higher payoff against the optimal $h^{*}_{k}\in H_{k}$ than against $h^{*}_{k^{\prime}}\in H_{k^{\prime}}$.

The proof is in Appx. F; it uses the uniform distribution and the value function from Lem. 5.7. Recalling that the choice of $k$ controls the induced complexity $\ell$ (Cor. 5.3), and that users should choose $k$ no greater than $k_{2}$, we can conclude the following (in a worst-case sense):

Corollary 5.11.

For the system, a lower $k_{2}$ is better.

Proof in Appx. F. For $k_{1}$, it turns out that against strategic users it is inconsequential. This is since payoff to the strategic user is derived entirely from $k$, which is upper-bounded by $k_{2}$, but can be set lower than $k_{1}$. This invariance follows immediately from how functions in $H_{k}$ are defined, namely that $w(z)=a_{-}$ for all $z$ with $|z|<k$ (Def. 4.3). This, however, holds only when the strategic user chooses to learn over $H_{k}$ for some $k$. Consider, alternatively, a strategic user who decides to learn subadditive functions instead. In this case, Lem. 5.8 shows that $k_{1}$ determines the user’s ‘effective’ $k$; the smaller $k_{1}$, the smaller the subset of subadditive functions that can be learned. Hence, for the user, a smaller $k_{1}$ means worse approximation error. This becomes even more pronounced when facing a naïve user; for her, a lower $k_{1}$ means that the system now has a larger set of representations to choose from; if even one of them has $v(z)=1$, the system can exploit this to increase its gains. In this sense, as $k_{1}$ decreases, payoff to the system (weakly) improves.

6 Discussion

Our analysis of the balance of power reveals a surprising conclusion: for both parties, in some sense, simple choice functions are better. For the system, lower $k$ improves its payoff through how it relates to $k_{2}$ (Cor. 5.11). For users, lower $k$ is clearly better in terms of runtime (Lem. 4.8) and estimation error (Thm. 5.9), and for approximation error, lower $k$ has certain benefits—as it relates to $\ell^{*}$ (Cor. 5.5), and via diminishing returns (Lem. 5.7). Thus, despite their conflicting interests, the incentives of the system and its users align to some degree.

But the story is more complex. For users, there is no definitive notion of ‘better’; strategic users always face a trade-off, and must choose $k$ to balance approximation, estimation, and runtime. In principle, users are free to choose $k$ at will; but since there is no use for $k>k_{2}$, a system controlling $k_{2}$ de facto controls $k$ as well. This places a concrete restriction on the freedom of users to choose, and inequitably so: for small $k_{2}$, users whose $v$ has complexity $\leq k_{2}$ (i.e., having ‘simple tastes’) are less susceptible to manipulation than users with $v$ of complexity $>k_{2}$ (e.g., fringe users with eclectic tastes) (Thm. 5.4, Cors. 5.5 and 5.6). In this sense, the choice of $k_{2}$ also has implications for fairness. We leave the further study of these aspects of strategic representation for future work.

From a purely utilitarian point of view, it is tempting to conclude that systems should always set k2k_{2} to be low. But this misses the broader picture: although systems profit from engagement, users engage only if they believe it is worthwhile to them, and dissatisfied users may choose to leave the system entirely (possibly into the hands of another). Thus, the system should not blindly act to maximize engagement; in reality, it, too, faces a tradeoff.

References

  • Abboud et al. [1999] Elias Abboud, Nader Agha, Nader H Bshouty, Nizar Radwan, and Fathi Saleh. Learning threshold functions with small weights using membership queries. In Proceedings of the twelfth annual conference on Computational learning theory, pages 318–322, 1999.
  • Abraham et al. [2012] Ittai Abraham, Moshe Babaioff, Shaddin Dughmi, and Tim Roughgarden. Combinatorial auctions with restricted complements. In Proceedings of the 13th ACM Conference on Electronic Commerce, pages 3–16, 2012.
  • Angluin [1988] Dana Angluin. Queries and concept learning. Machine learning, 1988.
  • Balcan and Harvey [2011] Maria-Florina Balcan and Nicholas JA Harvey. Learning submodular functions. In Proceedings of the forty-third annual ACM symposium on Theory of computing, pages 793–802, 2011.
  • Bechavod et al. [2022] Yahav Bechavod, Chara Podimata, Zhiwei Steven Wu, and Juba Ziani. Information discrepancy in strategic learning. In Proceedings of the 39th International Conference on Machine Learning (ICML), 2022.
  • Chen et al. [2020a] Yatong Chen, Jialu Wang, and Yang Liu. Linear classifiers that encourage constructive adaptation. arXiv preprint arXiv:2011.00355, 2020a.
  • Chen et al. [2020b] Yatong Chen, Jialu Wang, and Yang Liu. Strategic recourse in linear classification. arXiv preprint arXiv:2011.00355, 2020b.
  • Chevaleyre et al. [2008] Yann Chevaleyre, Ulle Endriss, Sylvia Estivie, and Nicolas Maudet. Multiagent resource allocation in k -additive domains: preference representation and complexity. Ann. Oper. Res., 163(1):49–62, 2008.
  • Conitzer et al. [2005] Vincent Conitzer, Tuomas Sandholm, and Paolo Santi. Combinatorial auctions with k-wise dependent valuations. In AAAI, 2005.
  • Dughmi et al. [2015] Shaddin Dughmi, Nicole Immorlica, Ryan O’Donnell, and Li-Yang Tan. Algorithmic signaling of features in auction design. In Algorithmic Game Theory - 8th International Symposium, SAGT, pages 150–162. Springer, 2015.
  • Eilat et al. [2022] Itay Eilat, Ben Finkelshtein, Chaim Baskin, and Nir Rosenfeld. Strategic classification with graph neural networks. arXiv preprint arXiv:2205.15765, 2022.
  • Feige et al. [2015] Uriel Feige, Michal Feldman, Nicole Immorlica, Rani Izsak, Brendan Lucier, and Vasilis Syrgkanis. A unifying hierarchy of valuations with complements and substitutes. In Proceedings of the AAAI Conference on Artificial Intelligence, 2015.
  • Feldman [2009] Vitaly Feldman. On the power of membership queries in agnostic learning. The Journal of Machine Learning Research, 10:163–182, 2009.
  • Ghalme et al. [2021] Ganesh Ghalme, Vineet Nair, Itay Eilat, Inbal Talgam-Cohen, and Nir Rosenfeld. Strategic classification in the dark. In Proceedings of the 38th International Conference on Machine Learning (ICML), 2021.
  • Haghtalab et al. [2021] Nika Haghtalab, Nicole Immorlica, Brendan Lucier, Markus Mobius, and Divyarthi Mohan. Persuading with anecdotes. Technical report, National Bureau of Economic Research, 2021.
  • Hardt et al. [2016] Moritz Hardt, Nimrod Megiddo, Christos H. Papadimitriou, and Mary Wootters. Strategic classification. In Proceedings of the 2016 ACM Conference on Innovations in Theoretical Computer Science, 2016.
  • Hu et al. [2019] Lily Hu, Nicole Immorlica, and Jennifer Wortman Vaughan. The disparate effects of strategic manipulation. In Proceedings of the Conference on Fairness, Accountability, and Transparency (FAT*), pages 259–268, 2019.
  • Jagadeesan et al. [2021] Meena Jagadeesan, Celestine Mendler-Dünner, and Moritz Hardt. Alternative microfoundations for strategic classification. In International Conference on Machine Learning, 2021.
  • Kamenica and Gentzkow [2011] Emir Kamenica and Matthew Gentzkow. Bayesian persuasion. American Economic Review, 101(6), 2011.
  • Krishnaswamy et al. [2021] Anilesh K Krishnaswamy, Haoming Li, David Rein, Hanrui Zhang, and Vincent Conitzer. Classification with strategically withheld data. In Proceedings of the AAAI Conference on Artificial Intelligence, 2021.
  • Lancaster [1966] Kelvin J Lancaster. A new approach to consumer theory. Journal of political economy, 74(2):132–157, 1966.
  • Levanon and Rosenfeld [2021] Sagi Levanon and Nir Rosenfeld. Strategic classification made practical. In Proceedings of the 38th International Conference on Machine Learning, ICML, 2021.
  • Levanon and Rosenfeld [2022] Sagi Levanon and Nir Rosenfeld. Generalized strategic classification and the case of aligned incentives. In Proceedings of the 39th International Conference on Machine Learning (ICML), 2022.
  • Miller et al. [2020] John Miller, Smitha Milli, and Moritz Hardt. Strategic classification is causal modeling in disguise. In International Conference on Machine Learning, pages 6917–6926. PMLR, 2020.
  • Milli et al. [2019] Smitha Milli, John Miller, Anca D. Dragan, and Moritz Hardt. The social cost of strategic classification. In Proceedings of the Conference on Fairness, Accountability, and Transparency (FAT*), pages 230–239, 2019.
  • Rosenfeld et al. [2020] Nir Rosenfeld, Kojin Oshiba, and Yaron Singer. Predicting choice with set-dependent aggregation. In International Conference on Machine Learning, 2020.
  • Shalev-Shwartz and Ben-David [2014] Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge University Press, 2014.
  • Sundaram et al. [2021] Ravi Sundaram, Anil Vullikanti, Haifeng Xu, and Fan Yao. PAC-learning for strategic classification. In International Conference on Machine Learning, pages 9978–9988, 2021.
  • Zhang and Conitzer [2021] Hanrui Zhang and Vincent Conitzer. Incentive-aware pac learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 5797–5804, 2021.
  • Zrnic et al. [2021] Tijana Zrnic, Eric Mazumdar, Shankar Sastry, and Michael Jordan. Who leads and who follows in strategic classification? Advances in Neural Information Processing Systems, 34, 2021.

Appendix A Additional Results

A.1 Agnostic User

Theorem A.1 stated below is the formal version of Theorem 3.2 in Section 3. Theorem A.1 shows that, given a large enough sample size $m$, an agnostic user's payoff approaches $\max\{\mu,1-\mu\}$, where $\mu=\mathbb{E}_{D}[y]$.

Theorem A.1.

Let $\frac{2}{2+\sqrt{m}}\leq\delta<1/8$ and $\tau=\frac{\delta}{2(1-\delta)}+\sqrt{\frac{2\log(1/\delta)}{m}}$. Then the agnostic user's expected payoff is guaranteed to be

$$\begin{cases}\geq(1-\delta)(1-\mu)&\text{if }\widehat{\mu}\leq 1/2-\tau\\ \geq(1-\delta)\mu&\text{if }\widehat{\mu}\geq 1/2+\tau\\ =1/2&\text{otherwise.}\end{cases}$$

Before we prove the theorem, we state Hoeffding's inequality, a well-known result from probability theory.

Lemma A.2 (Hoeffding).

Let $S_{m}=\sum_{i=1}^{m}X_{i}$ be the sum of $m$ i.i.d. random variables with $X_{i}\in[0,1]$ and $\mu=\mathbb{E}[X_{i}]$ for all $i\in[m]$. Then

$$\mathbb{P}\left(\frac{S_{m}}{m}-\mu\geq\varepsilon\right)\leq e^{-2m\varepsilon^{2}}\quad\text{and}\quad\mathbb{P}\left(\frac{S_{m}}{m}-\mu\leq-\varepsilon\right)\leq e^{-2m\varepsilon^{2}}.$$

We will use the following consequence of the above inequality. Fix $\delta\in(0,1)$, let $\varepsilon=\sqrt{\frac{2\log(1/\delta)}{m}}$, and let $\widehat{\mu}=\frac{S_{m}}{m}$. Since $e^{-2m\varepsilon^{2}}=\delta^{4}\leq\delta$, with probability at least $1-\delta$ we have both

$$\mu\leq\widehat{\mu}+\sqrt{\frac{2\log(1/\delta)}{m}}\qquad(8)$$

and

$$\mu\geq\widehat{\mu}-\sqrt{\frac{2\log(1/\delta)}{m}}.\qquad(9)$$
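To make the inversion concrete, the following is a minimal Monte Carlo sanity check of the two-sided bound (Eqs. 8–9); the parameter values are illustrative choices of ours, not from the paper.

```python
# Empirically check that the true mean mu lies within
# sqrt(2*log(1/delta)/m) of the empirical mean with
# probability at least 1 - delta (illustrative parameters).
import math
import random

m, delta, mu = 500, 0.05, 0.3
eps = math.sqrt(2 * math.log(1 / delta) / m)

trials = 10_000
covered = sum(
    abs(sum(random.random() < mu for _ in range(m)) / m - mu) <= eps
    for _ in range(trials)
)
print(f"interval width eps = {eps:.4f}")
print(f"empirical coverage = {covered / trials:.4f} (guarantee: >= {1 - delta})")
```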

We are now ready to give the proof of Theorem A.1.

Proof of Theorem A.1.

We begin with the following supporting lemma.

Lemma A.3.

Let $\frac{2}{2+\sqrt{m}}\leq\delta<1/8$. Then $\tau<1/2$.

Proof.

The claim follows from the following chain of inequalities:

$$\frac{2}{2+\sqrt{m}}<\delta\iff m>4(1/\delta-1)^{2}\implies m>4(1/\delta-1)\log(1/\delta)\iff\frac{\delta}{2(1-\delta)}>\frac{2\log(1/\delta)}{m},$$

where the middle implication uses $1/\delta-1\geq\log(1/\delta)$ for $\delta\in(0,1)$.

Let $\gamma=\frac{\delta}{2(1-\delta)}$. By the above, $\tau=\gamma+\sqrt{\frac{2\log(1/\delta)}{m}}\leq\gamma+\sqrt{\gamma}$, which is an increasing function of $\delta$, so the maximum is attained at $\delta=1/8$ and equals $1/14+1/\sqrt{14}<1/2$. This completes the proof of the lemma. ∎

From Lemma A.3 we have $1/2+\tau<1$; hence there is a non-trivial range, namely $\widehat{\mu}\in[1/2+\tau,1]$, in which the user assigns $h(z)=+1$ for all $z$ with probability 1. Similarly, when $\widehat{\mu}\in[0,1/2-\tau]$ the user assigns $h(z)=-1$ for all $z$ with probability 1. We consider three cases separately.

Case 1 ($\widehat{\mu}\in[1/2+\tau,1]$): From Hoeffding's inequality (Eq. 9), with probability at least $1-\delta$,

$$\mu\geq\widehat{\mu}-\sqrt{\frac{2\log(1/\delta)}{m}}\geq 1/2+\tau-\sqrt{\frac{2\log(1/\delta)}{m}}=1/2+\frac{\delta}{2(1-\delta)}=\frac{1}{2(1-\delta)}.$$

Hence, with probability at least $1-\delta$, the agnostic user obtains a payoff of $\mu$, and the expected payoff in this case is at least $(1-\delta)\mu\geq 1/2\geq(1-\delta)(1-\mu)$.

Case 2 ($\widehat{\mu}\in[0,1/2-\tau]$): As in Case 1, the tail bound of Hoeffding's inequality (Eq. 8) gives, with probability at least $1-\delta$,

$$\mu\leq\widehat{\mu}+\sqrt{\frac{2\log(1/\delta)}{m}}\leq 1/2-\frac{\delta}{2(1-\delta)}=\frac{1-2\delta}{2(1-\delta)}.$$

Hence $1-\mu\geq\frac{1}{2(1-\delta)}$. The agnostic user obtains a payoff of $1-\mu$ with probability at least $1-\delta$, so the expected payoff in this case is at least $(1-\delta)(1-\mu)\geq 1/2\geq(1-\delta)\mu$.

Case 3 ($\widehat{\mu}\in(1/2-\tau,1/2+\tau)$): Finally, in this case the agnostic user chooses $h(z)=1$ for all $z\in\mathcal{Z}$ with probability $1/2$ and $h(z)=-1$ for all $z\in\mathcal{Z}$ with probability $1/2$. Hence the expected payoff is $\frac{1}{2}\mu+\frac{1}{2}(1-\mu)=1/2$, irrespective of the true mean $\mu$ of positive samples. ∎
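To illustrate the guarantee, here is a minimal simulation of the thresholding rule analyzed above, assuming i.i.d. binary labels with mean $\mu$; the function name and parameter values are our own illustrative choices.

```python
# Simulate the agnostic user's rule of Theorem A.1: estimate mu_hat from
# m samples, then always accept, always reject, or flip a fair coin,
# depending on where mu_hat falls relative to [1/2 - tau, 1/2 + tau].
import math
import random

def agnostic_payoff(mu, m, delta, trials=10_000):
    tau = delta / (2 * (1 - delta)) + math.sqrt(2 * math.log(1 / delta) / m)
    total = 0.0
    for _ in range(trials):
        mu_hat = sum(random.random() < mu for _ in range(m)) / m
        if mu_hat >= 0.5 + tau:        # h(z) = +1 for all z: payoff mu
            total += mu
        elif mu_hat <= 0.5 - tau:      # h(z) = -1 for all z: payoff 1 - mu
            total += 1 - mu
        else:                          # randomize: payoff 1/2
            total += 0.5
    return total / trials

# m = 1000 and delta = 0.1 satisfy 2/(2 + sqrt(m)) <= delta < 1/8
for mu in (0.2, 0.5, 0.8):
    print(f"mu={mu}: payoff ~ {agnostic_payoff(mu, 1000, 0.1):.3f}, "
          f"max(mu, 1-mu) = {max(mu, 1 - mu)}")
```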

A.2 Lifted Functions

The relation between choice functions and their induced counterparts passes through an auxiliary type of function that operates on sets of size exactly $\ell$. Denote $\mathcal{Z}_{\ell}=\{z\in 2^{E}:|z|=\ell\}$, and note that the feasible representations can be partitioned as $\mathcal{Z}=\mathcal{Z}_{k_{1}}\uplus\cdots\uplus\mathcal{Z}_{k_{2}}$. We refer to functions that operate on single-sized sets as restricted functions. Our next result shows that choice functions in $H_{k}$ can be represented by restricted functions over $\mathcal{Z}_{k}$ that are 'lifted' to operate on the entire space $\mathcal{Z}$. This will allow us to work only with sets of size exactly $k$.

Lemma A.4.

For each $h\in H_{k}$ there exists a corresponding $g:\mathcal{Z}_{k}\rightarrow\{\pm 1\}$ such that $h=\mathrm{lift}(g)$, where:

$$\mathrm{lift}(g)(z)=\begin{cases}1&\text{if }k\leq|z|\text{ and }\exists z^{\prime}\subseteq z,\ |z^{\prime}|=k,\ \text{s.t. }g(z^{\prime})=1\\ -1&\text{otherwise.}\end{cases}$$
Proof.

Let $h\in H_{k}$. Then there is a weight function $w$ on sets of size at most $k$ such that either $w(z)\in(-1,0)$ or $w(z)>\sum_{i\in[k]}\binom{n}{i}$, and

$$h(z)=\operatorname{sign}\left(\sum\nolimits_{z^{\prime}:z^{\prime}\subseteq z,|z^{\prime}|\leq k}w(z^{\prime})\right).$$

Define $g:\mathcal{Z}_{k}\rightarrow\{-1,1\}$ such that for $z\in\mathcal{Z}_{k}$, $g(z)=1$ if $w(z)>0$ and $g(z)=-1$ otherwise. It is easy to see from the choice of $w(z)$ that $h=\mathrm{lift}(g)$. ∎
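For concreteness, the lift operation admits a direct implementation with sets encoded as frozensets; the following sketch is ours (names and encoding are illustrative), with $g$ given by the collection of its positive size-$k$ sets.

```python
# lift(g) answers +1 on z iff |z| >= k and z contains a size-k subset
# on which g answers +1 (Lemma A.4).
from itertools import combinations

def lift(positive_k_sets, k):
    """Return h = lift(g), where g(z') = 1 iff z' in positive_k_sets."""
    def h(z):
        if len(z) < k:
            return -1
        if any(frozenset(c) in positive_k_sets
               for c in combinations(sorted(z), k)):
            return 1
        return -1
    return h

# g is positive only on {1, 2}; lift(g) is then positive exactly on the
# supersets of {1, 2}.
h = lift({frozenset({1, 2})}, k=2)
print(h(frozenset({1, 2, 3})))  # 1
print(h(frozenset({1, 3})))     # -1
```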

A.3 A Lower Bound on the Running Time

As stated in Lemma 4.8, the running time of our algorithm is $O(m\binom{n}{k})$. We argued in Section 4 that this is made possible only because the weights are sparse and the algorithm operates on a finite sample of size $m$. If $m$ is large, this expression can be replaced with $\binom{q}{k}$. We now show that, in the limit (or under full information), the dependence on $\binom{q}{k}$ is necessary. The conclusion from Lemma A.5 is that to find the loss minimizer, any algorithm must traverse at least all such $h$; since there exist $\binom{q}{k}$ such functions, this is a lower bound. This is unsurprising; $H_{k}$ is tightly related to the class of multilinear polynomials, whose degrees of freedom are exactly $\binom{q}{k}$.

Lemma A.5.

Consider the subclass of $H_{k}$ composed of choice functions $h$ which have $w(z)=a_{+}$ for exactly one $z$ with $|z|=k$, and $w(z)=a_{-}$ otherwise. Then, for every such $h$, there exists a corresponding $v$ such that $h$ is the unique minimizer (within this subclass) of the error w.r.t. $v$.

Proof.

Let $z_{1}$ and $z_{2}$ be distinct size-$k$ subsets, and let $a_{-}\in(-1,0)$ and $a_{+}>\sum_{i\in[1,k]}\binom{n}{i}$. Further, for $i\in\{1,2\}$, let $\bm{w}_{i}$ be the weight function that assigns $a_{+}$ to $z_{i}$ and $a_{-}$ to all other subsets of size at most $k$. Let $h_{1}$ and $h_{2}$ be the two functions in $H_{k}$ defined by the binary weight functions $\bm{w}_{1}$ and $\bm{w}_{2}$, respectively. Observe that for $v_{i}=f_{h_{i}}$ the approximation error (see Eq. (6)) of $h_{i}$ is zero. Hence, to prove the lemma it suffices to show that $f_{h_{1}}\neq f_{h_{2}}$.

Suppose $f_{h_{1}}=f_{h_{2}}$. Since $z_{1}\neq z_{2}$, there exists an $x\in\mathcal{X}$ such that $z_{1}\subseteq x$ but $z_{2}\not\subseteq x$. From Theorem 4.6 and the choice of $a_{+}$ and $a_{-}$, this implies $f_{h_{1}}(x)=1$ but $f_{h_{2}}(x)=-1$, a contradiction. ∎

Appendix B Additional Related Work

Learning set functions.

Concept learning refers to learning a binary function over hypercubes [Angluin, 1988] through a query access model. Abboud et al. [1999] provide a lower bound on the number of membership queries needed to exactly learn a threshold function over sets where each element has a small integer-valued weight. Our learning framework admits large weights and assumes only sample access, in contrast with the query access studied in this literature. Feldman [2009] shows that the problem of learning set functions with sample access is computationally hard. However, we show (see Section 5) that the strategic setting is more nuanced: more complex representations are disadvantageous for both the user and the system. In other words, it is in the best interest of the system to choose smaller (and much simpler) representations. A by-now classic work in learning theory studies the learnability from data of submodular (non-threshold) set functions [Balcan and Harvey, 2011]. Though we consider learning subadditive functions in this work, an extension to submodular valuations is natural. Learning set functions is in general hard, even for certain subclasses such as submodular functions. Rosenfeld et al. [2020] show that it is possible to learn certain parameterized subclasses of submodular functions when the goal is to use them for optimization. But this refers to learning over approximate proxy losses, whereas in our work we show that learning is possible directly over the 0/1 loss.

Hierarchies of set functions.

Conitzer et al. [2005] (and independently, Chevaleyre et al. [2008]) suggest a notion of $k$-wise dependent valuations, to which our Definition 4.1 is related. We also allow up to $k$-wise dependencies, but our valuations need not be positive and we focus on their sign (an indication of whether an item is acceptable or not). Our set-function valuations are also over item attributes rather than over multiple items. Despite the differences, the definitions share a motivation: Conitzer et al. [2005] argue that this type of valuation is likely to arise in many economic scenarios, especially since, due to cognitive limitations, it might be difficult for a player to understand the inter-relationships within a large group of items. Hierarchies of valuations with limited dependencies/synergies have been further studied by Abraham et al. [2012] and Feige et al. [2015] under the title 'hypergraph valuations'. These works focus on monotone valuations that have only positive weights for every subset, and are thus mathematically different from ours.

Appendix C A Missing Proof from Section 2

See 2.1

Proof.

The proof follows from the definition of the best response (Eq. 2). Let $z_{1},z_{2}\in\phi_{h}(x)$. Since $\phi_{h}(x)$ consists only of best responses, we have either $h(z_{1})=h(z_{2})=1$ or $h(z_{1})=h(z_{2})=-1$. Hence, $h(z_{1})=v(x)$ if and only if $h(z_{2})=v(x)$ for any $z_{1},z_{2}\in\phi_{h}(x)$. ∎

Appendix D A Missing Proof from Section 3 and an Additional Example

See 3.1

Proof.

Since a naïve user plays $h(z)=\operatorname{sign}(v(z))$, for each $x\in\mathcal{X}$ the user's payoff is maximized if the system responds with a $z\subseteq x$ such that $\operatorname{sign}(v(z))=\operatorname{sign}(v(x))$. Observe that if there exists a $z\in\mathcal{Z}$ with $z\subseteq x$ such that $\operatorname{sign}(v(z))=\operatorname{sign}(v(x))$, then $z\in\phi^{\mathrm{benev}}_{h}(x)$, and consequently the user's payoff is maximized for such an $x$. Conversely, if no $z\in\mathcal{Z}$ with $z\subseteq x$ satisfies $\operatorname{sign}(v(z))=\operatorname{sign}(v(x))$, then no truthful system can ensure more than zero utility for such an $x$. Hence, a benevolent system maximizes the utility of a naïve user. ∎

We now present an additional example showing how a naïve user's choice function can be manipulated by the strategic system and how, as a consequence, the user may obtain an arbitrarily small payoff.

Example 2.

Let $x_{1}=\{a_{1},a_{2}\}$, $x_{2}=\{a_{1},a_{3}\}$, $x_{3}=\{a_{1},a_{4}\}$, $x_{4}=\{a_{2},a_{3}\}$, $x_{5}=\{a_{3},a_{4}\}$, with $\operatorname{sign}(v(x_{1}))=\operatorname{sign}(v(x_{5}))=\operatorname{sign}(v(a_{2}))=\operatorname{sign}(v(a_{4}))=+1$ and $\operatorname{sign}(v(x_{2}))=\operatorname{sign}(v(x_{3}))=\operatorname{sign}(v(x_{4}))=\operatorname{sign}(v(a_{1}))=\operatorname{sign}(v(a_{3}))=-1$. Further, let $k_{1}=k_{2}=1$, so the representations are the singletons $z_{i}=a_{i}$, and let $D=(\frac{\varepsilon}{4},\frac{\varepsilon}{4},1-\varepsilon,\frac{\varepsilon}{4},\frac{\varepsilon}{4})$ be a distribution supported on $(x_{1},x_{2},x_{3},x_{4},x_{5})$.

The unique truthful choice function for this instance is $h=(-1,+1,-1,+1)$ (over $a_{1},\ldots,a_{4}$). A strategic system can manipulate a naïve user into non-preferred choices by responding with the representations $(a_{2},a_{1},a_{4},a_{2},a_{4})$ for $(x_{1},\ldots,x_{5})$. Note that a naïve user expects $z_{1}$ as the representation of $x_{3}$, since $h(z_{1})=\operatorname{sign}(v(x_{3}))=-1$ while $h(z_{4})=+1\neq\operatorname{sign}(v(x_{3}))$; however, a strategic system shows $a_{4}$, since under the given $h$ we have $h(a_{4})=1$. The naïve user's payoff is thus reduced to at most $\varepsilon$, which can be arbitrarily small.
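The example can also be checked numerically; the following sketch is our own bookkeeping (with $\varepsilon=0.01$ as an illustrative value), where the strategic system shows, for each item, an attribute the naïve user labels $+1$ whenever one exists.

```python
# Walk through Example 2: the naive user labels each singleton by
# sign(v); the strategic system best-responds with a +1-labeled
# attribute when possible; we accumulate the user's expected payoff.
sign_v = {  # sign of v on items (pairs) and on singleton representations
    ("a1", "a2"): +1, ("a1", "a3"): -1, ("a1", "a4"): -1,
    ("a2", "a3"): -1, ("a3", "a4"): +1,
    "a1": -1, "a2": +1, "a3": -1, "a4": +1,
}
eps = 0.01
D = {("a1", "a2"): eps / 4, ("a1", "a3"): eps / 4, ("a1", "a4"): 1 - eps,
     ("a2", "a3"): eps / 4, ("a3", "a4"): eps / 4}

payoff = 0.0
for x, p in D.items():
    shown = next((a for a in x if sign_v[a] == +1), x[0])  # system's choice
    payoff += p * (sign_v[shown] == sign_v[x])  # is the user correct on x?

print(f"naive user's payoff: {payoff:.4f} (at most eps = {eps})")
```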

Appendix E Missing Proofs from Section 4

See 4.2

Proof.

Define $k$ as follows: if $h(z)=-1$ for all $z\in\mathcal{Z}$ then $k=k_{1}$, and otherwise

$$k=\max_{k^{\prime}\in[k_{1},k_{2}]}\left\{k^{\prime}:\exists z\text{ such that }|z|=k^{\prime}\text{ and }h(z)=1,\text{ but }h(z^{\prime})=-1\text{ for all }z^{\prime}\subset z,\ z^{\prime}\in\mathcal{Z}\right\}.$$

Define $h^{\prime}$ as follows: for $|z|<k$, $h^{\prime}(z)=-1$; for $|z|\geq k$,

$$h^{\prime}(z)=\begin{cases}1&\text{if }\exists z^{\prime}:|z^{\prime}|=k,\ z^{\prime}\subseteq z,\text{ and }h(z^{\prime})=1\\ -1&\text{otherwise.}\end{cases}$$

First, we argue that $h^{\prime}$ defined as above satisfies $h^{\prime}(\phi_{h^{\prime}}(x))=h(\phi_{h}(x))$ for all $x\in\mathcal{X}$. Suppose $h(\phi_{h}(x))=1$. Then there exists $z\in\mathcal{Z}$ such that $z\subseteq x$ and $h(z)=1$. From the choice of $k$, we may assume without loss of generality that $|z|=k$. Further, from the construction of $h^{\prime}$, we have $h^{\prime}(z)=1$, and hence $h^{\prime}(\phi_{h^{\prime}}(x))=h(\phi_{h}(x))=1$. Now suppose $h(\phi_{h}(x))=-1$. Then $h(z)=-1$ for all $z\subseteq x$; in particular, $h(z)=-1$ for all $z\subseteq x$ with $|z|=k$. This implies $h^{\prime}(z)=-1$ for all $z\subseteq x$ with $|z|\geq k$: if there existed $z\subseteq x$ with $|z|\geq k$ and $h^{\prime}(z)=1$, then from the definition of $h^{\prime}$ there would exist $z^{\prime}\subseteq z\subseteq x$ with $|z^{\prime}|=k$ and $h(z^{\prime})=1$, a contradiction. Additionally, by definition, $h^{\prime}(z)=-1$ for all $z\subseteq x$ with $|z|<k$. Hence, if $h(\phi_{h}(x))=-1$ then $h^{\prime}(\phi_{h^{\prime}}(x))=-1$.

Now we show that $h^{\prime}$ is a $k$-order function. Let $a_{-}\in(-1,0)$ and set $w(z)=a_{-}$ for all $z$ with $|z|\leq k$ and $h^{\prime}(z)=-1$. Further, for all $z$ with $|z|=k$ and $h^{\prime}(z)=1$, set $w(z)=a_{+}>\sum_{i\in[k]}\binom{n}{i}$. For all $z\in\mathcal{Z}$ with $|z|<k$ we have $h^{\prime}(z)=-1$ by construction, and since $w(z^{\prime})=a_{-}<0$ for all $z^{\prime}\in\Gamma_{k}(z)$, we have $\sum_{z^{\prime}\in\Gamma_{k}(z)}w(z^{\prime})<0$. Hence, for all $z\in\mathcal{Z}$ with $|z|<k$,

$$h^{\prime}(z)=\operatorname{sign}\left(\sum\nolimits_{z^{\prime}\in\Gamma_{k}(z)}w(z^{\prime})\right)=-1.$$

Similarly, for all $z\in\mathcal{Z}$ with $|z|\geq k$, by construction $h^{\prime}(z)=1$ if and only if there exists $z^{\prime}\subseteq z$ with $|z^{\prime}|=k$ such that $h^{\prime}(z^{\prime})=h(z^{\prime})=1$. In particular, if $|z|\geq k$ and $h^{\prime}(z)=1$ then there exists $z^{\prime}\subseteq z$ with $|z^{\prime}|=k$ such that $w(z^{\prime})=a_{+}$. Since $a_{+}>\sum_{i\in[k]}\binom{n}{i}$, $a_{-}\in(-1,0)$, and $k_{2}\leq n$, it follows that if $|z|\geq k$ and $h^{\prime}(z)=1$ then

$$h^{\prime}(z)=\operatorname{sign}\left(\sum\nolimits_{z^{\prime}\in\Gamma_{k}(z)}w(z^{\prime})\right)=1.$$

Finally, if $|z|\geq k$ and $h^{\prime}(z)=-1$, then from the definition of $h^{\prime}$ there does not exist a $z^{\prime}\subseteq z$ with $|z^{\prime}|=k$ such that $w(z^{\prime})=a_{+}$. Since $a_{-}\in(-1,0)$, we have that if $|z|\geq k$ and $h^{\prime}(z)=-1$ then

$$h^{\prime}(z)=\operatorname{sign}\left(\sum\nolimits_{z^{\prime}\in\Gamma_{k}(z)}w(z^{\prime})\right)=-1.$$

∎

See 4.4

Proof.

Arbitrarily choose $u\subset E$ (recall $E$ is the ground set) such that $|u|=k$, and let $w(u)=a_{k,+}>\sum_{i\in[k]}\binom{n}{i}$. (Here we wish to distinguish between $a_{+}$ for $k$ and for $k-1$, and hence write $a_{k,+}$ instead of $a_{+}$.) Also, for all $z\neq u$ with $|z|\leq k$, let $w(z)=a_{-}\in(-1,0)$. Let $h:\mathcal{Z}\rightarrow\{\pm 1\}$ be defined as follows:

$$h(z)=\operatorname{sign}\left(\sum\nolimits_{z^{\prime}\in\Gamma_{k}(z)}w(z^{\prime})\right).$$

From the definition of $H_{k}$, we have $h\in H_{k}$. We show that $h\notin H_{k-1}$. First, observe that for all $z\in\mathcal{Z}$, $h(z)=1$ if and only if $u\subseteq z$. Suppose $h\in H_{k-1}$. Then there is a weight function $w^{\prime}$ on sets of size at most $k-1$ such that either $w^{\prime}(z)=a_{-}\in(-1,0)$ or $w^{\prime}(z)=a_{k-1,+}>\sum_{i\in[k-1]}\binom{n}{i}$, and

$$h(z)=\operatorname{sign}\left(\sum\nolimits_{z^{\prime}\in\Gamma_{k-1}(z)}w^{\prime}(z^{\prime})\right).$$

Let $z\in\mathcal{Z}$ be such that $u\subseteq z$. This implies $h(z)=1$, so there exists $u^{\prime}\subseteq z$ such that $|u^{\prime}|=k-1$ and $w^{\prime}(u^{\prime})=a_{k-1,+}$. Let $\tilde{z}\in\mathcal{Z}$ be such that $u^{\prime}\subseteq\tilde{z}$ but $u\not\subseteq\tilde{z}$; such a $\tilde{z}$ exists because $u\cap u^{\prime}\neq u$. Further, as $u\not\subseteq\tilde{z}$, we have $h(\tilde{z})=-1$. But since $u^{\prime}\subseteq\tilde{z}$, the choice of $a_{k-1,+}$ and $a_{-}$ gives

$$\sum\nolimits_{z^{\prime}\in\Gamma_{k-1}(\tilde{z})}w^{\prime}(z^{\prime})>0\;\Rightarrow\;h(\tilde{z})=\operatorname{sign}\left(\sum\nolimits_{z^{\prime}\in\Gamma_{k-1}(\tilde{z})}w^{\prime}(z^{\prime})\right)=1.$$

This gives a contradiction. Hence, $h\notin H_{k-1}$. ∎

See 4.6

Proof.

Since $h\in H_{k}$, $\bm{w}$ satisfies the following two properties (see Definition 4.3):

  1. either $w(z)=a_{-}\in(-1,0)$ or $w(z)=a_{+}>\sum_{i\in[k]}\binom{n}{i}$;

  2. $w(z)=a_{-}$ for all $z$ with $|z|<k$.

Further, from the definition of $f_{h}$, we have $f_{h}(x)=h(\phi_{h}(x))$. This implies

$$f_{h}(x)=1\Longleftrightarrow\exists z\in\mathcal{Z},\ z\subseteq x\text{ such that }h(z)=1.$$

From the above two properties of the weight function, we have

$$h(z)=1\Longleftrightarrow\exists z^{\prime}\subseteq z,\ |z^{\prime}|=k\text{ such that }w(z^{\prime})=a_{+}>0.$$

From the above two equations we conclude that

$$f_{h}(x)=1\Longleftrightarrow\exists z\subseteq x,\ |z|=k\text{ such that }w(z)=a_{+}>0.$$

Finally, the two properties of $\bm{w}$ ensure that

$$f_{h}(x)=\operatorname{sign}\left(\sum\nolimits_{z\in\Gamma_{k}(x)}w(z)\right).$$

∎
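As a concrete companion to Theorem 4.6, the following is a small sketch (ours; names and values are illustrative) of the induced function computed from a $k$-order weight function, with the positive size-$k$ sets given explicitly.

```python
# f_h(x) is the sign of the total weight over Gamma_k(x), the size-k
# subsets of x -- equivalently, +1 iff x contains a k-set with weight
# a_plus (Theorem 4.6).
from itertools import combinations

def induced_f(positive_k_sets, k, a_plus, a_minus):
    def f(x):
        total = sum(a_plus if frozenset(c) in positive_k_sets else a_minus
                    for c in combinations(sorted(x), k))
        return 1 if total > 0 else -1
    return f

# one positive 2-set {1, 2}; a_plus is chosen large enough to dominate
f = induced_f({frozenset({1, 2})}, k=2, a_plus=100.0, a_minus=-0.5)
print(f({1, 2, 3, 4}))  # 1: contains {1, 2}
print(f({1, 3, 4}))     # -1: contains no positive 2-subset
```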

See 4.7

Proof.

Throughout, for ease of notation, we write $x\in S$ for $x\in\{x_{1},\ldots,x_{m}\}$. Let $Z_{k}=\{z:|z|=k,\ \exists x\in S\text{ with }z\subseteq x\}$. Recall that $Z_{k,S}$ equals $Z_{k}$ at the beginning of the algorithm. Also, for each $z\in Z_{k}$, let $\mathcal{X}_{z}=\{x\in S\mid z\subseteq x\}$. The following lemma characterizes the training sets for which there exists an $h\in H_{k}$ with zero empirical error.

Lemma E.1.

There exists an $h\in H_{k}$ with zero empirical error if and only if for every $x\in S^{+}$ there exists a $z\in Z_{k}$ with $z\subseteq x$ such that $z\not\subseteq x^{\prime}$ for all $x^{\prime}\in S^{-}$.

Proof.

Suppose there exists an $h\in H_{k}$ with zero empirical error. Since $h\in H_{k}$, from Lemma A.4 there exists a $g:Z_{k}\rightarrow\{\pm 1\}$ such that $h=\mathrm{lift}(g)$. We state the following observation, whose proof follows from the definition of $\mathrm{lift}(g)$.

Observation E.2.

  1. For every $z\in\mathcal{Z}$ such that $h(z)=1$, there is a $z^{\prime}\in Z_{k}$ with $z^{\prime}\subseteq z$ and $g(z^{\prime})=1$.

  2. For every $z\in\mathcal{Z}$ such that $h(z)=-1$, it must be that $g(z^{\prime})=-1$ for every $z^{\prime}\in Z_{k}$ with $z^{\prime}\subseteq z$.

Further, as the empirical error of $h$ is zero, we have the following observation.

Observation E.3.

  1. For every $x\in S^{+}$, there is a $z\in\mathcal{Z}$ with $z\subseteq x$ such that $h(z)=1$.

  2. For every $x\in S^{-}$, it must be that $h(z)=-1$ for every $z\in\mathcal{Z}$ with $z\subseteq x$.

Proof.

For every xS+x\in S^{+}, since the empirical error is zero, we have h(ϕh(x))=1h(\phi_{h}(x))=1. From the definition of ϕh\phi_{h}, this implies there is a z𝒵z\in\mathcal{Z} and zxz\subseteq x such that h(z)=1h(z)=1. Similarly, for every xSx\in S^{-}, since the empirical error is zero, we have h(ϕh(x))=1h(\phi_{h}(x))=-1. Again from the definition of ϕh\phi_{h}, it must be that for every z𝒵z\in\mathcal{Z}, zxz\subseteq x we have h(z)=1h(z)=-1. ∎

Hence, from Observations E.2 and E.3, for every $x\in S^{+}$ there is a $z\in Z_{k}$ with $z\subseteq x$ such that $g(z)=1$; and for every $x\in S^{-}$, every $z\in Z_{k}$ with $z\subseteq x$ has $g(z)=-1$. Hence, for every $x\in S^{+}$ there exists a $z\in Z_{k}$ with $z\subseteq x$ such that $z\not\subseteq x^{\prime}$ for all $x^{\prime}\in S^{-}$.

Conversely, suppose that for every $x\in S^{+}$ there exists a $z\in Z_{k}$ with $z\subseteq x$ such that $z\not\subseteq x^{\prime}$ for all $x^{\prime}\in S^{-}$. Define $g:Z_{k}\rightarrow\{\pm 1\}$ as follows: (a) for all $z\in Z_{k}$ such that $z\subseteq x$ for some $x\in S^{-}$, let $g(z)=-1$; (b) for all $z\in Z_{k}$ such that $z\subseteq x$ for some $x\in S^{+}$ and $z\not\subseteq x^{\prime}$ for every $x^{\prime}\in S^{-}$, let $g(z)=1$; (c) for all $z\in Z_{k}$ such that $z\not\subseteq x$ for any $x\in S$, let $g(z)=-1$. From the supposition, for every $x\in S^{+}$ there is a $z\in Z_{k}$ with $z\subseteq x$ such that $g(z)=1$. Now define $h=\mathrm{lift}(g)$. To show that the empirical error of $h$ is zero, it suffices to show that $h(\phi_{h}(x))=-1$ for every $x\in S^{-}$ and $h(\phi_{h}(x))=1$ for every $x\in S^{+}$. Let $x\in S^{-}$. From the definition of $g$, every $z\in Z_{k}$ with $z\subseteq x$ has $g(z)=-1$; hence, from the definition of $\mathrm{lift}$, every $z\in\mathcal{Z}$ with $z\subseteq x$ has $h(z)=-1$, and from the definition of best response, $h(\phi_{h}(x))=-1$. Similarly, if $x\in S^{+}$, from our supposition and the definition of $g$ there exists a $z\in Z_{k}$ with $z\subseteq x$ and $g(z)=1$; hence, from the definition of $\mathrm{lift}$, there exists a $z\in\mathcal{Z}$ with $z\subseteq x$ and $h(z)=1$, and from the definition of best response, $h(\phi_{h}(x))=1$. ∎

Now, if $Y\in F_{k}$ then there exists an $h\in H_{k}$ whose induced function equals $Y$, that is, $f_{h}=Y$. This implies there exists an $h\in H_{k}$ which attains zero empirical error on the training set; since the empirical error is always non-negative, such an $h$ minimizes it. Hence, from Lemma E.1, it follows that if $Y$ is realizable then for every $x\in S^{+}$, at Step 17 of Alg there is either a $z\in Z^{+}$ with $z\subseteq x$, or a $z\in Z_{k,S}$ with $z\subseteq x$. Now, observe that at the beginning of Step 22, the set $Z^{+}$ satisfies

$$z\in Z^{+}\iff\exists\,x\in S^{+}\text{ such that }z\subseteq x\text{ and }\nexists\,x^{\prime}\in S^{-}\text{ such that }z\subseteq x^{\prime}. \qquad(10)$$

Further, at Step 22, for a $z\in Z_{k}$, $w(z)=a_{+}$ if $z\in Z^{+}$. This implies

$$w(z)=a_{+}\iff\exists\,x\in S^{+}\text{ such that }z\subseteq x\text{ and }\nexists\,x^{\prime}\in S^{-}\text{ such that }z\subseteq x^{\prime}. \qquad(11)$$

Also, from Theorem 4.6, the induced function $f_{\hat{h}}$ corresponding to the returned $\hat{h}$ is given by

$$f_{\hat{h}}(x)=\operatorname{sign}\left(\sum\nolimits_{z\in\Gamma_{k}(x)}w(z)\right). \qquad(12)$$

To complete the proof of the theorem, we show that $f_{\hat{h}}(x_{i})=y_{i}$ for every $x_{i}\in S$. Suppose $x\in S^{-}$. Then, from Equations (10) and (11), every $z\subseteq x$ with $|z|\leq k$ has $w(z)=a_{-}<0$, and hence, from Equation (12), $f_{\hat{h}}(x)=y=-1$. Similarly, suppose $x\in S^{+}$. Then, from Equation (11), there exists $z\subseteq x$ with $|z|=k$ such that $w(z)=a_{+}$. Hence, from Equation (12), and noting that $a_{+}>\sum_{i\in[k]}\binom{n}{i}$ and $a_{-}\in(-1,0)$, we have $f_{\hat{h}}(x)=y=1$. ∎
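The proof suggests a compact re-implementation of the learner; the sketch below follows Eqs. (10)–(12) but is not the paper's listing (step numbers and variable names there differ), and all identifiers are ours.

```python
# Collect the k-subsets that appear in some positive sample but in no
# negative sample (Eq. 10), give them weight a_plus (Eq. 11), and
# predict with the induced function (Eq. 12).
from itertools import combinations
from math import comb

def learn(samples, k, a_minus=-0.5):
    """samples: list of (x, y), x an iterable of attributes, y in {-1, +1}."""
    k_subs = lambda x: {frozenset(c) for c in combinations(sorted(x), k)}
    neg = set()
    for x, y in samples:
        if y == -1:
            neg |= k_subs(x)
    z_plus = set()
    for x, y in samples:
        if y == +1:
            z_plus |= k_subs(x) - neg          # Eq. (10)
    # a_plus large enough that one positive k-subset outweighs all the
    # negative weight an item of the observed sizes can accumulate
    max_subs = max((comb(len(set(x)), k) for x, _ in samples), default=1)
    a_plus = 1 + max_subs * abs(a_minus)       # Eq. (11)

    def f_hat(x):                              # Eq. (12)
        total = sum(a_plus if z in z_plus else a_minus for z in k_subs(x))
        return 1 if total > 0 else -1
    return f_hat

S = [({1, 2, 3}, +1), ({2, 3, 4}, -1), ({1, 2, 4}, +1)]
f = learn(S, k=2)
print([f(x) for x, _ in S])  # [1, -1, 1]: zero empirical error
```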

See 4.8

Proof.

In the first two for-loops, for each $x\in S^{+}$ (or in $S^{-}$), the inner for-loop runs in $O(\binom{n}{k})$ time. Since $|S|\leq m$, this totals at most $O(m\binom{n}{k})$ operations. Similarly, Step 22 places weights on at most $m\binom{n}{k}$ subsets, and hence runs in $O(m\binom{n}{k})$ time. Hence, Alg runs in $O(m\binom{n}{k})$ time. ∎

Appendix F Missing Proofs from Section 5

See 5.2

Proof.

Let $\ell=\min_{k^{\prime}\in[1,k]}\{k^{\prime}:\text{there exists a }g:Z_{k^{\prime}}\rightarrow\{\pm 1\}\text{ such that }h=\mathrm{lift}(g)\}$; from Lemma A.4, $\ell\leq k$ is well defined. Let $g:Z_{\ell}\rightarrow\{\pm 1\}$ be such that $h=\mathrm{lift}(g)$. Now, from the definition of $f_{h}$, for all $x\in\mathcal{X}$ we have $f_{h}(x)=1$ if and only if there exists a $z\in\mathcal{Z}$ such that $z\subseteq x$ and $h(z)=1$. Since $h=\mathrm{lift}(g)$, $f_{h}(x)=1$ if and only if there exists a $z\in Z_{\ell}$ such that $z\subseteq x$ and $g(z)=1$. This implies the induced complexity of $f_{h}$ is at most $\ell\leq k$. ∎

See 5.3

Proof.

From Lemma 5.2, we know that functions in $F_{k}$ have induced complexity at most $k$. We now show that if $f$ has induced complexity at most $k$ then there is an $h\in H_{k}$ such that $f=f_{h}$. Let the induced complexity of $f$ be $\ell\leq k$. Then there exists a $g:Z_{\ell}\rightarrow\{\pm 1\}$ such that

$$f(x)=1\iff\exists z\in Z_{\ell}\text{ such that }z\subseteq x\text{ and }g(z)=1. \qquad(13)$$

Let $h=\mathrm{lift}(g)$. First we show that $f(x)=f_{h}(x)$ for all $x\in\mathcal{X}$. Since $h$ is a lift of $g$, we have

$$h(z)=1\iff\exists z^{\prime}\in Z_{\ell}\text{ such that }z^{\prime}\subseteq z\text{ and }g(z^{\prime})=1. \qquad(14)$$

Hence, from Equations (13) and (14), for all $x\in\mathcal{X}$, $f(x)=1$ if and only if there exists a $z\in\mathcal{Z}$ such that $z\subseteq x$ and $h(z)=1$. From the definition of the induced function, this implies $f(x)=f_{h}(x)$ for all $x\in\mathcal{X}$.

To show $h\in H_{k}$, we construct a weight function $\bm{w}$ on sets of size at most $k$. For $z\in 2^{E}$ with $|z|<k$, let $w(z)=a_{-}\in(-1,0)$. For $z\in Z_{k}$, let

$$w(z)=\begin{cases}a_{+}>\sum_{i\in[1,k]}\binom{n}{i}&\text{if }\exists z^{\prime}\subseteq z,\ |z^{\prime}|=\ell,\ \text{and }g(z^{\prime})=1\\ a_{-}&\text{otherwise.}\end{cases}$$

Now, from Equation (14), $h(z)=1$ if and only if there exists a $z^{\prime}\in Z_{\ell}$ such that $z^{\prime}\subseteq z$ and $g(z^{\prime})=1$. Hence, from the definition of $\bm{w}$, $h(z)=1$ if and only if there exists a $z^{\prime}\subseteq z$ with $w(z^{\prime})=a_{+}$. In particular, since $a_{+}>\sum_{i=1}^{k}\binom{n}{i}$ and $a_{-}\in(-1,0)$, we have

$$h(z)=\operatorname{sign}\left(\sum\nolimits_{z^{\prime}:z^{\prime}\subseteq z,|z^{\prime}|\leq k}w(z^{\prime})\right).$$

∎

See 5.4

Proof.

Since the induced complexity of $v$ is $\ell^{*}$, there is a function $g:Z_{\ell^{*}}\rightarrow\{\pm 1\}$ s.t.:

$$v(x)=\begin{cases}1&\text{if }\exists z\subseteq x,\ |z|=\ell^{*},\ \text{and }g(z)=1\\ -1&\text{otherwise.}\end{cases}$$

Let $a_{+}>\sum_{i\in[1,k]}\binom{n}{i}$ and $a_{-}\in(-1,0)$, and define the weight function $\bm{w}$ on sets of size at most $k$ as follows: (a) if $|z|<k$, let $w(z)=a_{-}$; (b) if $|z|=k$ and there exists a $z^{\prime}\subseteq z$ such that $g(z^{\prime})=1$, let $w(z)=a_{+}$; (c) if $|z|=k$ and no such $z^{\prime}$ exists, let $w(z)=a_{-}$. Now define $h$ using $\bm{w}$ as follows:

$$h(z)=\operatorname{sign}\left(\sum\nolimits_{z^{\prime}\in\Gamma_{k}(z)}w(z^{\prime})\right).$$

We now show that for each $x\in\mathcal{X}$, $h(\phi_{h}(x))=f_{h}(x)=v(x)$, implying $h^{*}_{k}=h$. Suppose $f_{h}(x)=1$ for some $x\in\mathcal{X}$. Then there exists a $z\in\mathcal{Z}$ with $z\subseteq x$ and $h(z)=1$. From Theorem 4.6 and the choice of $a_{+}$ and $a_{-}$, there exists a $z\subseteq x$ with $|z|=k$ such that $w(z)=a_{+}$. From the construction of $\bm{w}$, this implies there exists a $z\subseteq x$ with $|z|=\ell^{*}$ such that $g(z)=1$. But from the above definition of $v$, this implies $v(x)=1$. Similarly, we can argue that if $f_{h}(x)=-1$ then $v(x)=-1$, for any $x\in\mathcal{X}$. Hence $h(\phi_{h}(x))=v(x)$ for each $x\in\mathcal{X}$, implying $h^{*}_{k}=h$ and zero approximation error for $h$. ∎

See 5.5

Proof.

In the proof of Theorem 5.4, we showed that for $k=\ell^{*}$ the approximation error is zero. Hence, to prove the corollary it suffices to show that for $k<\ell^{*}$ the approximation error is not zero. Suppose there is an $h\in H_{k}$ such that $\varepsilon(h)=0$ and $k<\ell^{*}$. Since the distribution $D$ has full support, this implies $f_{h}(x)=v(x)$ for all $x\in\mathcal{X}$. Hence the induced complexity of $v$ is at most $k<\ell^{*}$, a contradiction. ∎

See 5.6

Proof.

The approximation error weakly decreases because $H_{k-1}\subseteq H_{k}$ for all $k\leq k_{2}$. Also, from the proof of Corollary 5.5, it is clear that no $k<\ell^{*}$ can achieve zero approximation error. ∎

See 5.7

Proof.

We construct a $v$ such that the approximation error of $h^{*}_{k}\in H_{k}$ is

$$\varepsilon(h^{*}_{k})=\frac{1}{4\binom{q}{n}}\sum_{\ell=k}^{k_{2}}\binom{k_{2}}{\ell}\binom{q-k_{2}}{n-\ell}.$$

It is easy to see that $\varepsilon(h^{*}_{k})$ diminishes convexly with $k$ (see Fig. 1). We choose $k_{2}$ elements $e_{1},e_{2},\ldots,e_{k_{2}}\in E$ (the ground set), and let $z_{e}$ be the size-$k_{2}$ subset consisting of these $k_{2}$ elements. For $v:\mathcal{X}\rightarrow\mathbb{R}$, let $\mathcal{X}_{v}^{+}=\{x\in\mathcal{X}\mid\operatorname{sign}(v(x))=1\}$ and $\mathcal{X}_{v}^{-}=\{x\in\mathcal{X}\mid\operatorname{sign}(v(x))=-1\}$. We first show that there exists a $v$ with the following three properties:

  1. If $x\in\mathcal{X}_{v}^{+}$ then there exists a $z\subseteq z_{e}$ such that $z\subseteq x$.

  2. For $k\in[1,k_{2}]$, let $\mathcal{X}_{k}=\{x\in\mathcal{X}\mid\exists z\subseteq z_{e},\ |z|=k,\text{ and }z\subseteq x\}$. Then $|\mathcal{X}_{v}^{+}\cap\mathcal{X}_{k}|=\frac{3}{4}\sum_{\ell=k}^{k_{2}}\binom{k_{2}}{\ell}\binom{q-k_{2}}{n-\ell}$, for every $k\in[1,k_{2}]$.

  3. For every $z\subseteq z_{e}$, let $\mathcal{X}_{z}=\{x\in\mathcal{X}\mid z\subseteq x\}$. Then $|\mathcal{X}_{v}^{+}\cap\mathcal{X}_{z}|=\frac{3}{4}\binom{q-k}{n-k}$, where $|z|=k$.

We construct such a $v$ iteratively. We begin by making the following observation.

Observation F.1.

For each $k\in[1,k_{2}]$, $|\mathcal{X}_{k}|=\sum_{\ell=k}^{k_{2}}\binom{k_{2}}{\ell}\binom{q-k_{2}}{n-\ell}$.

Proof.

Recall that $\mathcal{X}$ consists of the size-$n$ subsets of $E$. For $k\in[1,k_{2}]$ we wish to count the size-$n$ subsets of $E$ that contain some $z\subseteq z_{e}$ with $|z|=k$. This is equivalent to choosing a subset of $z_{e}$ of some size $\ell\geq k$ (the intersection with $z_{e}$) and then choosing the remaining $n-\ell$ elements from the $q-k_{2}$ elements of $E$ outside $z_{e}$. For every $\ell\geq k$ we can choose the size-$\ell$ subset of $z_{e}$ in $\binom{k_{2}}{\ell}$ ways, and for each such choice we can choose the remaining $n-\ell$ elements in $\binom{q-k_{2}}{n-\ell}$ ways. Summing over $\ell\in[k,k_{2}]$ gives $|\mathcal{X}_{k}|=\sum_{\ell=k}^{k_{2}}\binom{k_{2}}{\ell}\binom{q-k_{2}}{n-\ell}$. ∎

Constructing $v$: The idea is to iteratively add elements of $\mathcal{X}$ to $\mathcal{X}_{v}^{+}$, that is, to iteratively determine the $x\in\mathcal{X}$ with $\operatorname{sign}(v(x))=1$. In the first round, we arbitrarily choose $\frac{3}{4}\binom{q-k_{2}}{n-k_{2}}$ elements of $\mathcal{X}_{k_{2}}$ and add them to $\mathcal{X}_{v}^{+}$; the remaining $\frac{1}{4}\binom{q-k_{2}}{n-k_{2}}$ are added to $\mathcal{X}_{v}^{-}$. At round $k$, assume we have constructed a $v$ satisfying the above three properties for all $k^{\prime}>k$, that is:

  1. If $x\in\mathcal{X}_{v}^{+}$ then there exists a $z\subseteq z_{e}$ such that $z\subseteq x$.

  2. $|\mathcal{X}_{v}^{+}\cap\mathcal{X}_{k^{\prime}}|=\frac{3}{4}\sum_{\ell=k^{\prime}}^{k_{2}}\binom{k_{2}}{\ell}\binom{q-k_{2}}{n-\ell}$, for every $k^{\prime}\in[k+1,k_{2}]$.

  3. For every $z\subseteq z_{e}$ with $|z|=k^{\prime}>k$, $|\mathcal{X}_{v}^{+}\cap\mathcal{X}_{z}|=\frac{3}{4}\binom{q-k^{\prime}}{n-k^{\prime}}$, where $\mathcal{X}_{z}=\{x\in\mathcal{X}\mid z\subseteq x\}$.

Hence, at round $k$, there are $\binom{k_{2}}{k}\binom{q-k_{2}}{n-k}$ elements of $\mathcal{X}_{k}$ that are not yet in $\mathcal{X}_{v}^{+}$ or $\mathcal{X}_{v}^{-}$. From these elements of $\mathcal{X}_{k}$, for every size-$k$ subset $z\subseteq z_{e}$ we arbitrarily choose $\frac{3}{4}\binom{q-k_{2}}{n-k}$ elements containing $z$ and add them to $\mathcal{X}_{v}^{+}$, and add the remaining $\frac{1}{4}\binom{q-k_{2}}{n-k}$ elements to $\mathcal{X}_{v}^{-}$. Now, observe that $v$ satisfies the first two properties for every $k^{\prime}\in[k,k_{2}]$ after this procedure. We argue that $v$ also satisfies the third property for any $z\subseteq z_{e}$ with $|z|=k$. The size-$n$ sets in $\mathcal{X}$ containing a given $z\subseteq z_{e}$ with $|z|=k^{\prime}$ can be partitioned according to their intersection with $z_{e}$; in particular, we have the following combinatorial equality:

$$\binom{q-k^{\prime}}{n-k^{\prime}}=\sum_{\ell=k^{\prime}}^{k_{2}}\binom{k_{2}-k^{\prime}}{\ell-k^{\prime}}\binom{q-k_{2}}{n-\ell}.$$

In the above expression, $\binom{q-k_{2}}{n-\ell}$ corresponds to the number of size-$n$ sets whose intersection with $z_{e}$ is a specific subset of size $\ell\geq k^{\prime}$. Since our iterative procedure ensures that from each such part at least a $\frac{3}{4}$ fraction of the $x$'s is added to $\mathcal{X}_{v}^{+}$, we have that $v$ satisfies the third property.

Optimal $h^{*}\in H_{k}$: From the construction of $v$, it is clear that the optimal $h^{*}\in H_{k}$ for the above-constructed $v$, for any $k\in[1,k_{2}]$, satisfies the following: for every $z\in\mathcal{Z}$, $h^{*}(z)=1$ if and only if there exists a $z^{\prime}\subseteq z_{e}$ with $|z^{\prime}|=k$ and $z^{\prime}\subseteq z$. Further, as $D$ is the uniform distribution, for such an $h^{*}$:

$$\varepsilon(h^{*}_{k})=\frac{1}{4\binom{q}{n}}\sum_{\ell=k}^{k_{2}}\binom{k_{2}}{\ell}\binom{q-k_{2}}{n-\ell}.$$

∎
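Both the combinatorial identity above and the shape of the resulting error curve are easy to verify numerically; the following check uses illustrative parameter values of our own choosing.

```python
# Verify (i) the partition identity used in the proof and (ii) that
# eps(h*_k) decreases with k (cf. Fig. 1), for one illustrative setting.
from math import comb

q, n, k2 = 20, 8, 5

# (i) C(q-k', n-k') = sum_{l=k'}^{k2} C(k2-k', l-k') * C(q-k2, n-l)
for kp in range(1, k2 + 1):
    lhs = comb(q - kp, n - kp)
    rhs = sum(comb(k2 - kp, l - kp) * comb(q - k2, n - l)
              for l in range(kp, k2 + 1))
    assert lhs == rhs, (kp, lhs, rhs)

# (ii) eps(h*_k) = (1 / (4 C(q, n))) * sum_{l=k}^{k2} C(k2, l) C(q-k2, n-l)
for k in range(1, k2 + 1):
    err = sum(comb(k2, l) * comb(q - k2, n - l)
              for l in range(k, k2 + 1)) / (4 * comb(q, n))
    print(f"k={k}: eps(h*_k) = {err:.6f}")
```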

See 5.8

Proof.

Let $h\in H_{\mathrm{SA}}$ with a corresponding $g:\mathcal{Z}\rightarrow\mathbb{R}$ such that $h(z)=\operatorname{sign}(g(z))$ for all $z\in\mathcal{Z}$. Choose an $a_{+}>\sum_{i=1}^{k_{1}}\binom{n}{i}$ and $a_{-}\in(-1,0)$. Define a weight function $\bm{w}$ on sets of size at most $k_{1}$ as follows:

$$w(z)=\begin{cases}a_{+}&\text{if }|z|=k_{1},\ h(z)=1\\ a_{-}&\text{otherwise.}\end{cases}$$

Let $h^{\prime}\in H_{k_{1}}$ be the function defined by the binary weight function $\bm{w}$ above. We argue that $h(\phi_{h}(x))=h^{\prime}(\phi_{h^{\prime}}(x))$ for every $x\in\mathcal{X}$. For every $x\in\mathcal{X}$, $h(\phi_{h}(x))=1$ if and only if there is a $z\in\mathcal{Z}$ with $z\subseteq x$ such that $h(z)=1$. Since $g$ is sub-additive, we have

$$0\leq g(z)\leq\sum_{z^{\prime}\subseteq z,\ z^{\prime}\neq z,\ z^{\prime}\in\mathcal{Z}}g(z^{\prime}). \qquad(15)$$

A simple recursive argument then implies that $h(\phi_{h}(x))=1$ if and only if there is a $z\subseteq x$ with $|z|=k_{1}$ and $h(z)=1$, and hence $w(z)=a_{+}$. From Theorem 4.6, this implies $h(\phi_{h}(x))=1$ if and only if $h^{\prime}(\phi_{h^{\prime}}(x))=1$. ∎

See 5.9

Proof.

We first argue that the VC dimension of $H_{k}$ is at most $\binom{q}{k}$. Let $d=\sum_{i\in[1,k]}\binom{n}{i}$, and index the coordinates of vectors in $\{0,1\}^{d}$ by the sets $z\subseteq E$ (the ground set) with $|z|\leq k$. Then each $z\in\mathcal{Z}$ can be represented by a binary vector $e_{z}\in\{0,1\}^{d}$, whose entry indexed by $z^{\prime}$ is $1$ if and only if $z^{\prime}\subseteq z$. Further, let $w\in\{a_{-},a_{+}\}^{d}$ be a binary weight vector with $a_{-}$ and $a_{+}$ as in Def. 4.3. Then, from the definition of $H_{k}$, for each $h\in H_{k}$ there is a $w_{h}\in\{a_{-},a_{+}\}^{d}$ such that (a) $h(z)=\operatorname{sign}(\langle w_{h},e_{z}\rangle)$ for all $z\in\mathcal{Z}$, and (b) every entry of $w_{h}$ indexed by a $z^{\prime}$ with $|z^{\prime}|<k$ equals $a_{-}$. From this we observe that the VC dimension of $H_{k}$ is at most $\binom{q}{k}$, since each $h\in H_{k}$ is determined by the realization of the binary weights on the entries indexed by the $\binom{q}{k}$ sets of size exactly $k$. The first part of the theorem now follows by noting that the first bound is the agnostic PAC generalization guarantee for an algorithm minimizing the empirical error in the standard classification setting with VC dimension at most $\binom{q}{k}$. To prove the second part, we have $Y\in F_{k}$, and hence the approximation error is zero, that is, $\varepsilon(h^{*})=0$ (from Lemma E.1). Further, Alg minimizes the empirical error (Theorem 4.7) and returns an $\hat{h}$ with zero empirical error. ∎

See 5.10

Proof.

The $v$ is constructed as in the proof of Lemma 5.7, whose notation we reuse: $z_{e}$ is a size-$k_{2}$ subset, and, as argued there, for $k\in[1,k_{2}]$ the optimal $h^{*}_{k}$ is such that for all $z\in\mathcal{Z}$, $h^{*}_{k}(z)=1$ if and only if there exists a size-$k$ subset $z^{\prime}\subseteq z$ which is also a subset of $z_{e}$.

Now let $k,k^{\prime}\in[1,k_{2}]$ with $k<k^{\prime}$. Since $D$ is the uniform distribution, to show that the system's utility is larger for $k$ than for $k^{\prime}$ it suffices to show that

$$\sum_{x\in\mathcal{X}}\mathds{1}\{h^{*}_{k}(\phi_{h^{*}_{k}}(x))=1\}>\sum_{x\in\mathcal{X}}\mathds{1}\{h^{*}_{k^{\prime}}(\phi_{h^{*}_{k^{\prime}}}(x))=1\}.$$

From the proof of Lemma 5.7 and Theorem 4.6, it follows that

$$\sum_{x\in\mathcal{X}}\mathds{1}\{h^{*}_{k}(\phi_{h^{*}_{k}}(x))=1\}=\sum_{x\in\mathcal{X}}\mathds{1}\{f_{h^{*}_{k}}(x)=1\}=\sum_{\ell=k}^{k_{2}}\binom{k_{2}}{\ell}\binom{q-k_{2}}{n-\ell}.$$

Similarly,

$$\sum_{x\in\mathcal{X}}\mathds{1}\{h^{*}_{k^{\prime}}(\phi_{h^{*}_{k^{\prime}}}(x))=1\}=\sum_{\ell=k^{\prime}}^{k_{2}}\binom{k_{2}}{\ell}\binom{q-k_{2}}{n-\ell}.$$

Since $k<k^{\prime}$, we have

$$\sum_{\ell=k}^{k_{2}}\binom{k_{2}}{\ell}\binom{q-k_{2}}{n-\ell}>\sum_{\ell=k^{\prime}}^{k_{2}}\binom{k_{2}}{\ell}\binom{q-k_{2}}{n-\ell},$$

implying that the system's utility is larger for $k$ than for $k^{\prime}$. ∎

See 5.11

Proof.

In Lemma 5.10, we showed that there exists a user with valuation $v$ such that for all $k,k^{\prime}\in[1,k_{2}]$ with $k<k^{\prime}$, the system obtains higher utility against the optimal choice function in $H_{k}$ than in $H_{k^{\prime}}$. Since the $k$ a user can choose is bounded by $k_{2}$, a lower $k_{2}$ maximizes the system's worst-case payoff. ∎