
Strategic Representation

Vineet Nair (Technion – Israel Institute of Technology, Haifa, Israel; now at Google Research, India). Ganesh Ghalme (Indian Institute of Technology Hyderabad, India; this work was done while Ganesh was a post-doctoral fellow at Technion). Inbal Talgam-Cohen (Technion – Israel Institute of Technology, Haifa, Israel). Nir Rosenfeld (Technion – Israel Institute of Technology, Haifa, Israel).
Abstract

Humans have come to rely on machines for reducing excessive information to manageable representations. But this reliance can be abused—strategic machines might craft representations that manipulate their users. How can a user make good choices based on strategic representations? We formalize this as a learning problem, and pursue algorithms for decision-making that are robust to manipulation. In our main setting of interest, the system represents attributes of an item to the user, who then decides whether or not to consume. We model this interaction through the lens of strategic classification (Hardt et al. 2016), reversed: the user, who learns, plays first; and the system, which responds, plays second. The system must respond with representations that reveal ‘nothing but the truth’ but need not reveal the entire truth. Thus, the user faces the problem of learning set functions under strategic subset selection, which presents distinct algorithmic and statistical challenges. Our main result is a learning algorithm that minimizes error despite strategic representations, and our theoretical analysis sheds light on the trade-off between learning effort and susceptibility to manipulation.

1 Introduction

There is a growing recognition that learning systems deployed in social settings are likely to induce strategic interactions between the system and its users. One promising line of research in this domain is strategic classification (Hardt et al., 2016), which studies learning in settings where users can strategically modify their features in response to a learned classification rule. The primary narrative in strategic classification is one of self-interested users that act to ‘game’ the system, and of systems that should defend against this form of gaming behavior. In this paper, we initiate the study of reversed scenarios, in which it is the system that strategically games its users. We argue that this is quite routine: in settings where users make choices about items (e.g., click, buy, watch), decisions are often made on the basis of only partial information. But the choice of what information to present to users often lies in the hands of the system, which can utilize its representation power to promote its own goals. Here we formalize the problem of strategic representation, and study how learning can now aid users in choosing well despite strategic system behavior.

As a concrete example, consider a user browsing for a hotel in an online booking platform. Hotels on the platform are represented by a small set of images; as there are many images available for each hotel, the system must choose which subset of these to display. Clearly, the choice of representation can have a significant effect on users’ decision-making (whether to click the hotel, and subsequently whether to book it), and in turn, on the system’s profit. The system may well attempt to capitalize on the control it has over what information is presented to the user—at the expense of the user, who may be swayed to make sub-optimal decisions. Given that users rely on system-crafted representations for making decisions, choosing well requires users to account for the strategic nature of the representations they face.

Our goal in this paper is to study when and how learning can aid users in making decisions that are strategically robust. We model users as deciding based on a choice strategy $h$ mapping represented items to binary choices, which can be learned from the user’s previous experiences (e.g., hotels stayed in, and whether they were worthwhile). Since in reality full descriptions of items are often too large for humans to process effectively (e.g., hundreds or thousands of attributes), the role of the system is to provide users with compact representations (e.g., a small subset of attributes). We therefore model the system as responding to $h$ with a representation mapping $\phi$, which determines for each item $x$ its representation $z=\phi(x)$, on which users rely for making choices (i.e., $h$ is a function of $z$). We focus on items $x$ that are discrete and composed of a subset of ground elements; accordingly, we consider representation mappings $\phi$ that are lossy but truthful, meaning they reveal a cardinality-constrained subset of the item’s full set of attributes: $z\subseteq x$ and $k_{1}\leq|z|\leq k_{2}$ for some exogenously-set $k_{1},k_{2}$.

Given that the system has representation power—how should users choose $h$? The key idea underlying our formulation is that the system and its users have misaligned goals. A user wishes to choose items that are ‘worthwhile’ to her, as determined by her valuation function $v(x)$—in our case, a set function, which can reflect complex tastes (e.g., account for complementarity among the attributes, such as ‘balcony’ and ‘ocean view’ in the hotel example). Meanwhile, the system aims to maximize user engagement by choosing a feasible representation $z=\phi(x)$ that will incite the user to choose the item. Importantly, while values are based on the true $x$, choices rely on representations $z$—which the system controls. This causes friction: a representation may be optimal for the system, but may not align with user interests.

The challenge for users (and our goal in learning) is therefore to make good choices on the basis of strategic representations. Note that this is not a simple missing-value problem, since which attributes will be missing depends on how users choose, i.e., on their learned choice function $h$. (Consider the following subtlety: since users must commit to a choice strategy $h$, the choice of how to handle missing features, which is itself part of the strategy, determines which features will be missing.) Nor is this a problem that can be addressed by mapping attributes to a categorical representation (with classes ‘0’, ‘1’, and ‘unknown’); to see this, note that given a subset representation $z$, it is impossible to know which of the attributes that do not appear in $z$ are withheld—and which are truly missing.

Subset representations.

We focus on subset representations since they provide a natural means to ensure that representations reveal ‘nothing but the truth’ (but not necessarily ‘the whole truth’). This is our primary modeling concern, which stems from realistic restrictions on the system (like consumer expectations, legal obligations, or business practices). Subset representations cleanly capture what we view as a fundamental tension between users and an informationally-advantageous system—the ability to withhold information; examples include retail items described by a handful of attributes; videos presented by several key frames; and movie recommendations described by a select set of reviews, to name a few.

Overview of our results.

We begin by showing that users who choose naïvely can easily be manipulated by a strategic system (Sec. 3). We then proceed to study users who learn (Sec. 4). Our main result is an efficient algorithm for learning $h$ (Thm. 4.7), which holds under a certain realizability assumption on $v$. The algorithm minimizes the empirical error over a hypothesis class of set-function classifiers, whose complexity is controlled by a parameter $k$, thus allowing users to trade off expressivity and statistical efficiency. The algorithm builds upon several structural properties which we establish for set-function classifiers. Our key result here is that ‘folding’ the system’s strategic response into the hypothesis class results in an induced class having a simple form that makes it amenable to efficient optimization (Thm. 4.6).

Building on this, we continue to study the ‘balance of power’ (Sec. 5), as manifested in the interplay between $k$ (hypothesis complexity, which reflects the power of the user) and the range $[k_{1},k_{2}]$ (which determines the power of the system). For fixed $k_{1},k_{2}$, we analyze how the choice of $k$ affects the user’s classification error, through its decomposition into estimation and approximation errors. For estimation, we give a generalization bound (Thm. 5.9), obtained by analyzing the VC dimension of the induced function class (as it depends on $k$). For approximation, we give several results (e.g., Thm. 5.4) that link the expressivity of the user’s value function $v$ to the complexity of the learned $h$ (again, as it depends on $k$). Together, our results characterize how much is gained versus how much effort is invested in learning as $k$ is varied. One conclusion is that even minimal learning can help significantly (whereas no learning can be catastrophic).

From the system’s perspective, and for fixed $k$, we study how the range $[k_{1},k_{2}]$ affects the system. Intuitively, we would expect that increasing the range should be beneficial to the system, as it provides more alternatives to choose $z$ from. However, perhaps surprisingly, we find that the system can increase its payoff by ‘tying its hands’ to a lower $k_{2}$. This is because $k_{2}$ upper-bounds not only the system’s range but also the ‘effective’ $k$ of the user (who gains nothing from choosing $k>k_{2}$), and the lower the $k$, the better it is for the system (Lemma 5.10). The choice of $k_{1}$ turns out to be immaterial against fully strategic users, but highly consequential against users that are not.

1.1 Relation to Strategic Classification

Our formalization of strategic representation shares a deep connection to strategic classification (Hardt et al., 2016). Strategic representation and strategic classification share an underlying structure (a leading learning player who must take into account a strategic responder), but there are important differences. The first is conceptual: in our setting, roles are reversed—it is users who learn (and not the system), and the system strategically responds (and not users). This allows us to pursue questions regarding the susceptibility of users to manipulation, with different emphases and goals, while maintaining the ‘language’ of strategic classification.

The second difference is substantive: we consider inputs that are sets (rather than continuous vectors), and manipulations that hide information (rather than alter it). Technically, switching to discrete inputs can be done by utilizing the cost function to enforce truthfulness constraints as a ‘hard’ penalty on modifications. But this change is not cosmetic: since the system behaves strategically by optimizing over subsets, learning must account for set-relations between different objects in input space; this transforms the problem into one of learning set functions. (This is a subtle point: since sets can be expressed as binary membership vectors, it is technically possible to use conventional vector-based approaches to learn $h$. Nonetheless, these approaches cannot account for strategic behavior, since $\phi$ implements a set operation, which is ill-defined for continuous vector inputs.) From a modeling perspective, subsets make our work compatible with classic attribute-based consumer decision theory (Lancaster, 1966).

Overall, we view the close connection to strategic classification as a strength of our formalization, showing that the framework of strategic classification is useful far beyond what was previously considered; and also that a fairly mild variation can lead to a completely different learning problem, with distinct algorithmic and statistical challenges.

1.2 Related Work (see also Appx. B)

Strategic classification.

Strategic classification is a highly active area of research. Recent works in this field include statistical learning characterizations Zhang and Conitzer (2021); Sundaram et al. (2021); Ghalme et al. (2021), practical optimization methods Levanon and Rosenfeld (2021, 2022), relaxation of key assumptions Ghalme et al. (2021); Bechavod et al. (2022); Jagadeesan et al. (2021); Levanon and Rosenfeld (2022); Eilat et al. (2022), relations to causal aspects Miller et al. (2020); Chen et al. (2020a), and consideration of societal implications Milli et al. (2019); Hu et al. (2019); Chen et al. (2020b); Levanon and Rosenfeld (2021).

Two recent works are closely related to ours: Zrnic et al. (2021) consider a dynamic setting of repeated learning that can change the order of play; in contrast, we switch the roles. Krishnaswamy et al. (2021) consider information withholding by users (rather than the system). They aim to learn a truth-eliciting mechanism, which incentivizes the second player to reveal all information (i.e., ‘the whole truth’). Their mechanism ensures that withholding never occurs; in contrast, our goal is to predict despite strategic withholding (i.e., ‘nothing but the truth’).

Bayesian persuasion. In Bayesian persuasion (Kamenica and Gentzkow, 2011), a more-informed player (i.e., the system) uses its information advantage, coupled with commitment power, to influence the choices of a decision maker (i.e., the user). The works closest to ours are by Dughmi et al. (2015), who upper bound the number of signaled attributes, and by Haghtalab et al. (2021), who study strategic selection of anecdotes. Both works consider the human player as a learner (as do we). However, in our work the order of play is reversed—the decision-maker (user) moves first and the more-informed player (system) follows. Bayesian persuasion also assumes that the system knows the user’s valuation, and crucially relies on both parties knowing the distribution $D$ of items. In contrast, we model the user as having only sample access to $D$, and the system as agnostic to it.

2 A Formalization of Strategic Representation

We begin by describing the setting from the perspective of the user, which is the learning entity in our setup. We then present the ‘types’ of users we study, and draw connections between our setting and others found in the literature.

2.1 Learning Setting

In our setup, a user is faced with a stream of items, and must choose which of these to consume. Items are discrete, with each item $x\in\mathcal{X}\subseteq 2^{E}$ described by a subset of ground attributes $E$, where $|E|=q$. We assume all feasible items have at most $n$ attributes, $|x|\leq n$. The value of items for the user is encoded by a value function, $v:\mathcal{X}\rightarrow\mathbb{R}$. We say an item $x$ is worthwhile to the user if it has positive value, $v(x)>0$, and use $y=Y(x)=\operatorname{sign}(v(x))$ to denote worthwhileness, i.e., $y=1$ if $x$ is worthwhile, and $y=-1$ if it is not. Items are presented to the user as samples drawn i.i.d. from some unknown distribution $D$ over $\mathcal{X}$, and for each item, the user must choose whether to consume it (e.g., click, buy, watch) or not.

We assume that the user makes choices regarding items by committing at the onset to a choice function $h$ that governs her choice behavior. In principle, the user is free to choose $h$ from some predefined function class $H$; learning will consider finding a good $h\in H$, but the implications of the choice of $H$ itself will play a central role in our analysis. Ideally, the user would like to choose items if and only if they are worthwhile to her; practically, her goal is to find an $h$ for which this holds with large probability over $D$. For this, the user can make use of her knowledge regarding items she has already consumed, and whose value she therefore knows; we model this as providing the user access to a labeled sample set $S=\{(x_{i},y_{i})\}_{i=1}^{m}$ where $x_{i}\sim D$ and $y_{i}=\operatorname{sign}(v(x_{i}))$, which she can use for learning $h$.
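To ground the notation, here is a minimal Python sketch of this data setup (the ground set, value function, and distribution below are all hypothetical; the paper does not prescribe any of them):

```python
import random

# Hypothetical ground set E of q attributes; items are subsets of E.
E = [f"a{i}" for i in range(10)]   # q = 10 attributes (illustrative)
n = 4                              # all items satisfy |x| <= n

def v(x: frozenset) -> float:
    """Toy set-valued value function: the pair {a0, a1} is complementary."""
    return 1.0 if {"a0", "a1"} <= x else -1.0

def sample_item(rng: random.Random) -> frozenset:
    """One draw x ~ D; here D is uniform over small subsets (assumption)."""
    return frozenset(rng.sample(E, rng.randint(1, n)))

# Labeled sample S = {(x_i, y_i)} with y_i = sign(v(x_i)).
rng = random.Random(0)
S = [(x, 1 if v(x) > 0 else -1) for x in (sample_item(rng) for _ in range(200))]
```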

Strategic representations. The difficulty in learning $h$ is that user choices at test time must rely only on item representations, denoted $z\in\mathcal{Z}$, rather than on full item descriptions. Thus, learning considers choice functions that operate on representations, $h:\mathcal{Z}\rightarrow\{\pm 1\}$; the challenge lies in that while choices must be made on the basis of representations $z$, item values are derived from their full descriptions $x$—which representations describe only partially.

The crucial aspect of our setup is that representations are not arbitrary; rather, representations are controlled by the system, which can choose them strategically to promote its own goals. We model the system as acting through a representation mapping, $\phi:\mathcal{X}\rightarrow\mathcal{Z}$, which operates independently on any $x$, and can be determined in response to the user’s choice of $h$. This mimics a setting in which a fast-acting system can infer and quickly respond to a user’s (relatively fixed) choice patterns.

We assume the system’s goal is to choose a $\phi_{h}$ that maximizes expected user engagement:

$$\mathbb{E}_{x\sim D}\left[\mathds{1}\{h(\phi_{h}(x))=1\}\right]. \qquad (1)$$

Nonetheless, representations cannot be arbitrary, and we require $\phi_{h}$ to satisfy two properties. First, chosen representations must be truthful, meaning that $z\subseteq x$ for all $x$. Second, representations are subject to cardinality constraints, $k_{1}\leq|z|\leq k_{2}$ for some predetermined $k_{1},k_{2}\in\mathbb{N}$. We will henceforth use $\mathcal{Z}$ to mean representations of feasible cardinality. Both requirements stem from realistic considerations: a non-truthful system which intentionally distorts item information is unlikely to be commercially successful in the long run; intuitively, truthfulness gives users some hope of resilience to manipulation. As for $k_{1},k_{2}$, we think of these as exogenous parameters of the environment, arising naturally due to physical restrictions (e.g., screen size) or cognitive considerations (e.g., information processing capacity); if $k_{2}<n$, we say representations are lossy.

Under these constraints, the system can optimize Eq. (1) by choosing representations via the best-response mapping:

$$\phi_{h}(x)=\operatorname*{argmax}_{z\in\mathcal{Z}}\ h(z)\quad\text{ s.t. }\quad z\subseteq x,\ |z|\in[k_{1},k_{2}] \qquad (2)$$

Eq. (2) is a best-response since it maximizes Eq. (1) for any given $h$: for every $x$, $\phi_{h}$ chooses a feasible $z\subseteq x$ that triggers a positive choice event, $h(z)=1$—if such a $z$ exists. In this way, $k_{1},k_{2}$ control how much leeway the system has in revealing only partial truths; as we will show, both parameters play a key role in determining outcomes for both system and user. From now on we overload notation and by $\phi_{h}(x)$ refer to this best-response mapping.
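For intuition, the best response in Eq. (2) can be computed by brute force whenever items are small. A sketch under the toy setup above (the function name is ours, and we assume $|x|\geq k_{1}$ so that a feasible representation exists):

```python
from itertools import combinations

def best_response(h, x: frozenset, k1: int, k2: int) -> frozenset:
    """phi_h(x): return a feasible z subset of x with h(z) = 1 if one exists;
    otherwise any feasible z (all such z are equally good for the system)."""
    feasible = [frozenset(c)
                for size in range(k1, min(k2, len(x)) + 1)
                for c in combinations(sorted(x), size)]
    for z in feasible:
        if h(z) == 1:
            return z
    return feasible[0]  # no z triggers acceptance; tie-break arbitrarily
```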

Learning objective. Given a function class $H$ and a labeled sample set $S$, the user aims to find a choice function $h\in H$ that correctly identifies worthwhile items given their representation, and in a way that is robust to strategic system manipulation. The user’s objective is therefore to maximize:

$$\mathbb{E}_{x\sim D}\left[\mathds{1}\{h(\phi_{h}(x))=y\}\right] \qquad (3)$$

where $\phi_{h}$ is the best-response mapping in Eq. (2).

Note that since $h$ is binary, the argmax of $\phi_{h}$ may not be unique; e.g., if some $z_{1}\subseteq x$, $z_{2}\subseteq x$ both have $h(z_{1})=h(z_{2})=1$. Nonetheless, the particular choice of $z$ does not matter—from the user’s perspective, her choice of $h$ is invariant to the system’s choice of best-response $z$ (proof in Appx. C):

Observation 2.1.

Every best-response $z\in\phi_{h}(x)$ induces the same value in the user’s objective function (Eq. (3)).

2.2 User Types

Our main focus throughout the paper will be on users that learn $h$ by optimizing Eq. (3). But to understand the potential benefit of learning, we also analyze ‘simpler’ types of user behavior. Overall, we study three user types, varying in their sophistication and the amount of effort they invest in choosing $h$. These include:

  • The naïve user: Acts under the (false) belief that representations are chosen in her own best interest. This user type truthfully reports her preferences to the system by setting $h=v$ as her choice function. (Note that while $h$ is defined over $\mathcal{Z}$, $v$ is defined over $\mathcal{X}$. Nonetheless, truthfulness implies that $\mathcal{Z}\subseteq\mathcal{X}$, and so $v$ is well-defined as a choice function over $\mathcal{Z}$.)

  • The agnostic user: Makes no assumptions about the system. This user employs a simple strategy that relies on basic data statistics and provides minimal but robust guarantees regarding her payoff.

  • The strategic user: Knows that the system is strategic, and anticipates it to best-respond. This user is willing to invest effort (in terms of data and compute) in learning a choice function $h$ that maximizes her payoff by accounting for the system’s strategic behavior.

Our primary goal is to study the balance of power between users (that choose) and the system (which represents). In particular, we will be interested in exploring the tradeoff between a user’s effort and her susceptibility to manipulation.

2.3 Strategic Representation as a Game

Before proceeding, we give an equivalent characterization of strategic representation as a game. Our setting can be compactly described as a single-step Stackelberg game: the first player is User, who observes samples $S=\{(x_{i},y_{i})\}_{i=1}^{m}$ and commits to a choice function $h:\mathcal{Z}\rightarrow\{\pm 1\}$; the second player is System, which, given $h$, chooses a truthful $\phi_{h}:\mathcal{X}\rightarrow\mathcal{Z}$ (note how $\phi_{h}$ depends on $h$). The payoffs are:

User:   $\mathbb{E}_{x\sim D}\left[\mathds{1}\{h(\phi_{h}(x))=y\}\right]$  (4)
System: $\mathbb{E}_{x\sim D}\left[\mathds{1}\{h(\phi_{h}(x))=1\}\right]$  (5)

Note that payoffs differ only in that User seeks correct choices, whereas System benefits from positive choices. This reveals a clear connection to strategic classification, in which System, who plays first, is interested in accurate predictions, and for this it can learn a classifier; and User, who plays second, can manipulate individual inputs (at some cost) to obtain positive predictions. Thus, strategic representation can be viewed as a variation on strategic classification, but with roles ‘reversed’. Nonetheless, and despite these structural similarities, strategic representation bears key differences: items are discrete (rather than continuous), manipulations are subject to ‘hard’ set constraints (rather than ‘soft’, continuous costs), and learning regards set functions (rather than vector functions). These differences lead to distinct questions and unique challenges in learning.

3 Warm-up: Naïve and Agnostic Users

The naïve user. The naïve user employs a ‘what you see is what you get’ policy: given a representation $z$ of an item, this user estimates the item’s value based on $z$ alone, acting ‘as if’ $z$ were the item itself. Consequently, the naïve user sets $h(z)=\operatorname{sign}(v(z))$, even though $v$ is truly a function of $x$. The naïve user fails to account for the system’s strategic behavior (let alone the fact that $z\subseteq x$ for some actual $x$).

Despite its naivety, there are conditions under which this user’s approach makes sense. Our first result shows that the naïve policy is sensible in settings where the system is benevolent, and promotes user interests instead of its own.

Lemma 3.1.

If the system plays the benevolent strategy

$$\phi^{\mathrm{benev}}_{h}(x)=\operatorname*{argmax}_{z\subseteq x,\ |z|\in[k_{1},k_{2}]}\ \mathds{1}\{h(z)=\operatorname{sign}(v(x))\},$$

then the naïve approach maximizes user payoff.

Proof in Appx. D. The above lemma is not meant to imply that naïve users assume the system is benevolent; rather, it justifies why users holding this belief might act in this way. Real systems, however, are unlikely to be benevolent; our next example shows that a strategic system can easily manipulate a naïve user into receiving arbitrarily low payoff.

Example 1.

Let $x_{1}=\{a_{1}\}$, $x_{2}=\{a_{1},a_{2}\}$, $x_{3}=\{a_{2}\}$ with $v(x_{1})=+1$ and $v(x_{2})=v(x_{3})=-1$. Fix $k_{1}=k_{2}=1$, and let $D=(\varepsilon/2,\ 1-\varepsilon,\ \varepsilon/2)$. Note that $\mathcal{Z}=\{a_{1},a_{2}\}$ are the feasible representations. The naïve user assigns $h(a_{1})=+1$, $h(a_{2})=-1$ according to $v$. For $x_{2}$, a strategic system plays $\phi(x_{2})=a_{1}$. The expected payoff to the user is $\varepsilon$.
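Example 1 is easy to verify numerically with the brute-force `best_response` sketched earlier (the payoff computation below weighs each item by its probability under $D$; all names are ours):

```python
eps = 0.1
# (item, (label y = sign(v), probability under D)), as in Example 1.
population = [(frozenset({"a1"}), (+1, eps / 2)),
              (frozenset({"a1", "a2"}), (-1, 1 - eps)),
              (frozenset({"a2"}), (-1, eps / 2))]

def h_naive(z):  # naive user: h(z) = sign(v(z)); here only v({a1}) = +1
    return +1 if z == frozenset({"a1"}) else -1

payoff = sum(p for x, (y, p) in population
             if h_naive(best_response(h_naive, x, 1, 1)) == y)
print(payoff)  # = eps: x2 is represented as {a1} and wrongly accepted
```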

One reason a naïve user is susceptible to manipulation is that she makes no use of the data she may have. We next describe a slightly more sophisticated user who uses a simple strategy to ensure a better payoff.

The agnostic user. The agnostic user puts all her faith in data; this user does not make assumptions on, nor is she susceptible to, the type of system she plays against. Her strategy is simple: collect data, compute summary statistics, and choose to either always accept or always reject (or flip a coin). In particular, given a sample set $S=\{(x_{i},y_{i})\}_{i=1}^{m}$, the agnostic user first computes the fraction of positive examples, $\hat{\mu}:=\frac{1}{m}\sum_{i=1}^{m}\mathds{1}\{y_{i}=1\}$. Then, for some tolerance $\tau$, she sets for all $z$: $h(z)=1$ if $\hat{\mu}\geq 1/2+\tau$, $h(z)=-1$ if $\hat{\mu}\leq 1/2-\tau$, and flips a coin otherwise. In Example 1, an agnostic user would choose $h=(-1,-1)$ when $m$ is large, guaranteeing a payoff of at least $\frac{\sqrt{m}(1-\varepsilon/2)}{2+\sqrt{m}}\rightarrow(1-\varepsilon/2)$ as $m\rightarrow\infty$. Investing minimal effort, for an appropriate choice of $\tau$, this user’s strategy turns out to be quite robust.
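The agnostic rule itself takes only a few lines; a sketch ($\tau$ is the tolerance from the theorem below, and the coin is flipped once, when committing to $h$):

```python
import random

def agnostic_choice(S, tau: float, rng: random.Random):
    """Commit to a constant accept/reject rule using only the positive rate."""
    mu_hat = sum(1 for _, y in S if y == +1) / len(S)
    if mu_hat >= 0.5 + tau:
        return lambda z: +1            # always accept
    if mu_hat <= 0.5 - tau:
        return lambda z: -1            # always reject
    coin = rng.choice([+1, -1])        # flip a coin otherwise
    return lambda z: coin
```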

Theorem 3.2.

(Informal) Let $\mu$ be the true rate of positive examples, $\mu=\Pr_{D}[y=1]$. Then as $m$ increases, the agnostic user’s payoff approaches $\max\{\mu,1-\mu\}$ at rate $1/\sqrt{m}$.

Formal statement and proof in Appx. A.1. In essence, the agnostic user guarantees herself the ‘majority’ rate with rudimentary usage of her data, and in a way that does not depend on how the system responds. But this can be far from optimal; we now turn to the more elaborate strategic user, who makes cleverer use of the data at her disposal.

4 Strategic Users Who Learn

A strategic agent acknowledges that the system is strategic, and anticipates that representations are chosen to maximize her own engagement. Knowing this, the strategic user makes use of her previous experiences, in the form of a labeled data set $S=\{(x_{i},y_{i})\}_{i=1}^{m}$, to learn a choice function $\hat{h}$ from some function class $H$ that optimizes her payoff (given that the system is strategic). Cast as a learning problem, this is equivalent to minimizing the expected classification error on strategically-chosen representations:

$$h^{*}=\operatorname*{argmin}_{h\in H}\ \mathbb{E}_{D}\left[\mathds{1}\{h(\phi_{h}(x))\neq y\}\right]. \qquad (6)$$

Since the distribution $D$ is unknown, we follow the conventional approach of empirical risk minimization (ERM) and optimize the empirical analog of Eq. (6):

$$\hat{h}=\operatorname*{argmin}_{h\in H}\ \frac{1}{m}\sum_{i=1}^{m}\mathds{1}\{h(\phi_{h}(x_{i}))\neq y_{i}\}. \qquad (7)$$

Importantly, since every $z_{i}=\phi_{h}(x_{i})$ is a set, $H$ must include set functions $h:\mathcal{Z}\rightarrow\{\pm 1\}$, and any algorithm for optimizing Eq. (7) must take this into account. In Sections 4.1 and 4.2 we characterize the complexity of a user’s choice function and relate it to that of $v$, and in Section 4.3 we give an algorithm that computes $\hat{h}$, the empirical minimizer, for a hypothesis class of a given complexity.

4.1 Complexity Classes of Set Functions

Ideally, a learning algorithm should permit flexibility in choosing the complexity of the class of functions it learns (e.g., the degree of a polynomial kernel, the number of layers in a neural network), as this provides means to trade-off running time with performance and to reduce overfitting. In this section we propose a hierarchy of set-function complexity classes that is appropriate for our problem.

Denote by $\Gamma_{k}(z)$ the set of all subsets of $z$ having size at most $k$:

$$\Gamma_{k}(z)=\{z^{\prime}\in 2^{E}\,:\,z^{\prime}\subseteq z,\ |z^{\prime}|\leq k\}.$$

We start by defining $k$-order functions over the representation space. These functions are completely determined by weights placed on subsets of size at most $k$.

Definition 4.1.

We say the function $h:\mathcal{Z}\rightarrow\{\pm 1\}$ is of order $k$ if there exist real weights on sets of cardinality at most $k$, $\{w(z^{\prime})\,:\,z^{\prime}\in\Gamma_{k}(z)\}$, such that

$$h(z)=\operatorname{sign}\Big(\sum\nolimits_{z^{\prime}\in\Gamma_{k}(z)}w(z^{\prime})\Big).$$

Not all functions $h(z)$ can necessarily be expressed as a $k$-order function (for some $k$); nonetheless, in the context of optimizing Eq. (7), we show that working with $k$-order functions is sufficiently general, since any set function $h$ can be linked to a matching $k$-order function $h^{\prime}$ (for some $k\leq k_{2}$) through how it operates on strategic inputs.

Lemma 4.2.

For any $h:\mathcal{Z}\rightarrow\{\pm 1\}$, there exist $k\leq k_{2}$ and a corresponding $k$-order function $h^{\prime}$ such that:

$$h(\phi_{h}(x))=h^{\prime}(\phi_{h^{\prime}}(x))\,.$$

Lem. 4.2 permits us to focus on $k$-order functions. The proof is constructive (see Appx. E), and the construction itself turns out to be highly useful. In particular, the proof constructs $h^{\prime}$ having a particular form of binary basis weights, $w(z)\in\{a_{-},a_{+}\}$, which we assume from now on are fixed (for all $k$). Hence, every function $h$ has a corresponding binary-weighted $k$-order function $h^{\prime}$, which motivates the following definition of functions and function classes.

Definition 4.3.

We say a $k$-order function $h$ with basis weights $\bm{w}$ is binary-weighted if:

$$w(z)\begin{cases}\in\{a_{-},a_{+}\}&\forall z\text{ such that }|z|=k\\ =a_{-}&\forall z\text{ such that }|z|<k\end{cases}$$

for some fixed $a_{-}\in(-1,0)$ and $a_{+}>\sum_{i\in[k]}\binom{n}{i}$.

A binary-weighted $k$-order $h$ determines a family of $k$-size subsets (those whose weights are $a_{+}$) such that for any $z\in\mathcal{Z}$ with $|z|\geq k$, $h(z)=1$ if and only if $z$ contains a subset from the family (for $z$ with $|z|<k$, $h(z)=-1$ always). This is made precise using the notion of lifted functions in Lem. A.4 in Appx. A.2. Next, denote:

$$H_{k}=\{h\,:\,h\text{ is a binary-weighted }k\text{-order function}\}.$$

The classes $\{H_{k}\}_{k}$ will serve as complexity classes for our learning algorithm; the user provides $k$ as input, and Alg outputs an $\hat{h}\in H_{k}$ that minimizes the empirical loss (assuming the empirical error is zero). As we will show, using $k$ as a complexity measure provides the user direct control over the tradeoff between estimation and approximation error, as well as over the running time.
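Concretely, a binary-weighted $k$-order function is fully determined by the family of $k$-sets carrying weight $a_{+}$; storing that family in a hash table (as the paper suggests for Alg below) gives the following sketch:

```python
from itertools import combinations

def make_h(positive_ksets: set, k: int):
    """Binary-weighted k-order h (Def. 4.3): since a single a_+ weight
    outweighs the sum of all a_- weights, h(z) = +1 iff |z| >= k and z
    contains a set from the a_+ family; any shorter z gets h(z) = -1."""
    def h(z: frozenset) -> int:
        if len(z) < k:
            return -1
        hit = any(frozenset(c) in positive_ksets
                  for c in combinations(sorted(z), k))
        return +1 if hit else -1
    return h

# Example: accept exactly those z (with |z| >= 2) containing {a0, a1}.
h = make_h({frozenset({"a0", "a1"})}, k=2)
```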

Next, we show that the $\{H_{k}\}_{k}$ classes are strictly nested. This will be important for our analysis of approximation error, as it will let us reason about the connection between the learned $\hat{h}$ and the target function $v$ (proof in Appx. E).

Lemma 4.4.

For all $k$, $H_{k-1}\subseteq H_{k}$ and $H_{k}\setminus H_{k-1}\neq\emptyset$.

Note that $H_{n}$ includes all binary-weighted set functions, but since representations are of size at most $k_{2}$, it suffices to consider only $k\leq k_{2}$. Importantly, $k$ can be set lower than $k_{1}$; for example, $H_{1}$ is the class of threshold modular functions, and $H_{2}$ is the class of threshold pairwise functions. The functions we consider are parameterized by their weights, $\bm{w}$, and so any $k$-order function has at most $|\bm{w}|=\sum_{i=0}^{k}\binom{q}{i}$ weights. In this sense, the choice of $k$ is highly meaningful. Now that we have defined our complexity classes, we turn to discussing how they can be optimized over.

4.2 Learning via Reduction to Induced Functions

The simple structure of functions in $H_{k}$ makes them good candidates for optimization. But the main difficulty in optimizing the empirical error in Eq. (7) is that the choice of $h$ determines not only the error, but also the inputs on which errors are measured (indirectly, through the dependence of $\phi_{h}$ on $h$). To cope with this challenge, our approach is to work with induced functions that already have the system’s strategic response encoded within, which will prove useful for learning. Additionally, as they operate directly on $x$ (and not $z$), they can easily be compared with $v$, which will become important in Sec. 5.

Definition 4.5.

For a class $H$, its induced class is:

$$F_{H}\triangleq\{f:\mathcal{X}\rightarrow\{\pm 1\}\,:\,\exists h\in H\text{ s.t. }f(x)=h(\phi_{h}(x))\}.$$

The induced class $F_{H}$ includes, for every $h\in H$, a corresponding function that already has $\phi_{h}$ integrated into it. We use $F_{k}=F_{H_{k}}$ to denote the induced class of $H_{k}$. For every $h$, we denote its induced function by $f_{h}$. Whereas $h$ functions operate on $z$, induced functions operate directly on $x$, with each $f_{h}$ accounting internally for how the system strategically responds to $h$ on each $x$.

Our next theorem provides a key structural result: induced functions inherit the weights of their $k$-order counterparts.

Theorem 4.6.

For any $h\in H_{k}$ with weights $\bm{w}$:

$$h(z)=\operatorname{sign}\Big(\sum\nolimits_{z^{\prime}\in\Gamma_{k}(z)}w(z^{\prime})\Big),$$

its induced $f_{h}\in F_{k}$ can be expressed using the same weights, $\bm{w}$, but with summation over subsets of $x$, i.e.:

$$f_{h}(x)=\operatorname{sign}\Big(\sum\nolimits_{z\in\Gamma_{k}(x)}w(z)\Big).$$

Thm. 4.6 is the main pillar on which our algorithm stands: it allows us to construct $h$ by querying the loss directly—i.e., without explicitly computing $\phi_{h}$—by working with the induced $f_{h}$; this is since:

$$\mathds{1}\{h(\phi_{h}(x_{i}))\neq y_{i}\}=\mathds{1}\{f_{h}(x_{i})\neq y_{i}\}.$$

Thus, through their shared weights, induced functions serve as a bridge between what we optimize, and how.
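The equivalence can be sanity-checked directly in the toy setup: the induced $f_{h}$ below scans $k$-subsets of $x$ using the same $a_{+}$ family as $h$ (this sketch reuses `S`, `make_h`, `best_response`, and the `combinations` import from the earlier snippets):

```python
def f_induced(positive_ksets: set, k: int, x: frozenset) -> int:
    """Induced f_h(x): same weights as h, but summed over subsets of x."""
    hit = any(frozenset(c) in positive_ksets
              for c in combinations(sorted(x), k))
    return +1 if hit else -1

# 1{h(phi_h(x)) != y} == 1{f_h(x) != y} on every sample, with k1 <= k <= k2.
family, k, k1, k2 = {frozenset({"a0", "a1"})}, 2, 1, 3
h = make_h(family, k)
for x, y in S:
    lhs = h(best_response(h, x, k1, k2)) != y
    rhs = f_induced(family, k, x) != y
    assert lhs == rhs
```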

4.3 Learning Algorithm

We now present our learning algorithm, Alg. The algorithm is exact: it takes as input a training set $S$ and a parameter $k$, and returns an $h\in H_{k}$ that minimizes the empirical loss (Eq. (7)). Correctness holds under the realizability condition $Y\in F_{k}$, i.e., $Y$ is the induced function of some $h\in H_{k}$. (Note that even for standard linear binary classification, finding an empirical minimizer of the 0/1 loss in the agnostic (i.e., non-realizable) case is NP-hard (Shalev-Shwartz and Ben-David, 2014).)

The algorithm constructs $h$ by sequentially computing its weights, $\bm{w}=\{w(z)\}_{|z|\leq k}$. As per Def. 4.3, only $w(z)$ for $z$ with $|z|=k$ must be learned; hence, weights are sparse, in the sense that only a small subset of them are assigned $a_{+}$, while the rest are $a_{-}$. Weights can be implemented as a hash table, where $w(z)=a_{+}$ if $z$ is in the table, and $w(z)=a_{-}$ if it is not. Our next result establishes the correctness of Alg. The proof leverages a property that characterizes the existence of an $h\in H_{k}$ having zero empirical error (see Lem. E.1). The proof of Lem. E.1 uses Thm. 4.6, which enables the loss to be computed directly for the induced functions using the shared weight structure.

Algorithm 1: Alg
1:  Input: $S=\{(x_{i},y_{i})\}_{i\in[m]}$, $k\in[k_{2}]$
2:  Pre-compute:
3:    $S^{+}=\{x\in S:y=+1\}$,
4:    $S^{-}=\{x\in S:y=-1\}$
5:    $Z_{k,S}=\{z\,:\,|z|=k,\ \exists x\in S\ \text{s.t.}\ z\subseteq x\}$
6:    $\hat{p}(x_{i})=\frac{1}{m}\sum_{j\in[m]}\mathds{1}\{x_{i}=x_{j}\}\quad\forall i\in[m]$
7:  Fix $a_{-}\in(-1,0)$ and $a_{+}>\sum_{i\in[1,k]}\binom{n}{i}$
8:  Initialize:
9:    $Z^{+}=\varnothing$, $Z^{-}=\varnothing$, $V=\varnothing$, $S_{z}=\varnothing\quad\forall z\in Z_{k,S}$
10: Run:
11: for $x\in S^{-}$ do
12:   for $z$ s.t. $z\subseteq x$ and $z\in Z_{k,S}$ do
13:     $Z^{-}=Z^{-}\cup\{z\}$,  $Z_{k,S}=Z_{k,S}\setminus\{z\}$
14:     $S_{z}=S_{z}\cup\{x\}$
15:   end for
16: end for
17: for $x\in S^{+}$ do
18:   for $z\subseteq x$ such that $z\in Z_{k,S}$ do
19:     $Z^{+}=Z^{+}\cup\{z\}$
20:   end for
21: end for
22: Set $w(z)=\begin{cases}a_{+}&\text{if }z\in Z^{+}\\ a_{-}&\text{o.w. (implicitly)}\end{cases}$  ▷ implemented as hash table
23: Return $\hat{h}(z)=\operatorname{sign}\big(\sum\nolimits_{z^{\prime}\in\Gamma_{k}(z)}w(z^{\prime})\big)$.
Theorem 4.7.

For any $k\in[k_{2}]$, if $Y$ is realizable then Alg returns an $\hat{h}$ that minimizes the empirical error.

Proof in Appx. E. Note that our algorithm is exact: it returns a true minimizer of the empirical 0/1 loss, assuming $Y$ is realizable. Additionally, Alg can be used to identify whether there exists an $h\in H_{k}$ with zero empirical error: at Step 15, if for some $x\in S^{+}$ there is no $z\in Z_{k,S}$ or $z\in Z^{+}$ with $z\subseteq x$, then by Lem. E.1 in Appx. E there is no $h$ with zero empirical error.
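Under the same toy setup, Algorithm 1 admits a compact transcription (a sketch, not the paper’s exact pseudocode: the bookkeeping sets $S_{z}$ and $\hat{p}$ are dropped since they are not needed to produce $\hat{h}$; `make_h` is the earlier binary-weighted constructor):

```python
from itertools import combinations

def alg(S, k: int):
    """ERM over H_k under realizability: Z^+ collects the k-sets that occur
    in some positive example but in no negative one; they get weight a_+."""
    z_neg = {frozenset(c) for x, y in S if y == -1
             for c in combinations(sorted(x), k)}
    z_pos = {frozenset(c) for x, y in S if y == +1
             for c in combinations(sorted(x), k)}
    z_plus = z_pos - z_neg
    # Realizability check (cf. Lem. E.1): every positive x must be covered.
    for x, y in S:
        if y == +1 and not any(frozenset(c) in z_plus
                               for c in combinations(sorted(x), k)):
            raise ValueError("no h in H_k has zero empirical error")
    return make_h(z_plus, k)   # binary-weighted h as in Def. 4.3

h_hat = alg(S, k=2)  # recovers the toy v: Y is realizable with l* = 2
```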

Lemma 4.8.

Let $n$ be the maximal size of items in $\mathcal{X}$, $m$ the number of samples, and $k\leq k_{2}$ the user’s choice of complexity. Then Alg runs in $O(m\binom{n}{k})$ time.

This runtime is made possible due to several key factors: (i) only $k$-sized weights need to be learned, (ii) all weights are binary-valued, and (iii) loss queries are efficient in induced space. Nonetheless, when $n$ and $k$ are large, the runtime may be significant, and so $k$ must be chosen with care. Fortunately, our results in Sec. 5.1.1 give encouraging evidence that learning with small $k$—even $k=1$, for which the runtime is $O((mn)^{2})$—is quite powerful (assuming $Y$ is realizable).

In the analysis, the $m\binom{n}{k}$ factor is possible only because weights are sparse and Alg operates on a finite sample set of size $m$. Alternatively, if $m$ is large, this factor can be replaced with $\binom{q}{k}$. This turns out to be necessary; in Appx. A.3 we show that, in the limit, $\binom{q}{k}$ is a lower bound.

5 Balance of Power

Our final section explores the question: what determines the balance of power between the system and its users? We begin with the perspective of the user, who has commitment power, but can only minimize the empirical error. For her, the choice of complexity class $k$ is key in balancing approximation error—how well (in principle) functions $h\in H_{k}$ can approximate $v$; and estimation error—how close the empirical payoff of the learned $\hat{h}$ is to its expected value. Our results give insight into how these types of error trade off as $k$ is varied (here we do not assume realizability).

For the system, the important factors are $k_{1}$ and $k_{2}$, since these determine its flexibility in choosing representations. Since more feasible representations mean more flexibility, it would seem plausible that smaller $k_{1}$ and larger $k_{2}$ should help the system more. However, our results indicate differently: for the system, smaller $k_{2}$ is better, and the choice of $k_{1}$ has limited effect on strategic users. The result for $k_{2}$ goes through a connection to the user’s choice of $k$; surprisingly, smaller $k$ turns out to be, in some sense, better for all.

5.1 User’s Perspective

We begin by studying the effects of $k$ on user payoff. Recall that users aim to minimize the expected error (Eq. (6)):

$$\varepsilon(h)=\mathbb{E}_{D}\left[\mathds{1}\{h(\phi_{h}(x))\neq\operatorname{sign}(v(x))\}\right],$$

but instead minimize the empirical error (Eq. (7)). For reasoning about the expected error of the learned choice function $\hat{h}\in H_{k}$, a common approach is to decompose it into two error types—approximation and estimation:

$$\varepsilon(\hat{h})=\underbrace{\varepsilon(h^{*})}_{\text{approx.}}+\underbrace{\varepsilon(\hat{h})-\varepsilon(h^{*})}_{\text{estimation}},\qquad h^{*}=\operatorname*{argmin}_{h^{\prime}\in H_{k}}\varepsilon(h^{\prime}).$$

Approximation error describes the lowest error obtainable by functions in $H_{k}$; this measures the ‘expressivity’ of $H_{k}$, and is independent of $\hat{h}$. For approximation error, we define a matching complexity structure for value functions $v$, and give several results relating the choice of $k$ and the complexity of $v$. Estimation error describes how far the learned $\hat{h}$ is from the optimal $h^{*}\in H_{k}$, and depends on the data size, $m$. Here we give a generalization bound based on VC analysis.

5.1.1 User approximation error

To analyze the approximation error, we must be able to relate choice functions $h$ (which operate on representations $z$) to the target value function $v$ (which operates on items $x$). To connect the two, we will again use induced functions, for which we now define a matching complexity structure.

Definition 5.1.

A function $f:\mathcal{X}\rightarrow\{\pm 1\}$ has an induced complexity of $\ell$ if there exists a function $g:Z_{\ell}\rightarrow\{\pm 1\}$ s.t.:

$$f(x)=\begin{cases}1&\text{if }\ \exists z\subseteq x,\ |z|=\ell\ \text{ and }\ g(z)=1\\ -1&\text{o.w.}\end{cases}$$

and $\ell$ is minimal (i.e., there is no such $g^{\prime}:Z_{\ell-1}\rightarrow\{\pm 1\}$).
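Over a finite item universe, the induced complexity of Def. 5.1 can be found by brute force: for each candidate $\ell$, the maximal feasible $g$ sets $g(z)=1$ exactly on the $\ell$-sets not contained in any negative item, and one checks whether this $g$ covers every positive item. A sketch (`universe` is an assumed list of all items paired with their labels $\operatorname{sign}(v(x))$):

```python
from itertools import combinations

def induced_complexity(universe, n: int):
    """Smallest l for which some g on l-sets realizes f as in Def. 5.1."""
    for l in range(1, n + 1):
        # l-sets that g must map to -1 (they appear inside negative items).
        blocked = {frozenset(c) for x, y in universe if y == -1
                   for c in combinations(sorted(x), l)}
        # Does the maximal g (+1 on every unblocked l-set) cover positives?
        covered = all(any(frozenset(c) not in blocked
                          for c in combinations(sorted(x), l))
                      for x, y in universe if y == +1)
        if covered:
            return l
    return None  # no l <= n realizes f
```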

We show in Lem. 5.2 and Cor. 5.3 that the induced complexity of a function $f$ captures the minimum $k\in[1,n]$ such that $f$ is an induced function of some $h\in H_{k}$.

Lemma 5.2.

Let $k\leq k_{2}$. Then for every $h\in H_{k}$, the induced complexity of the corresponding $f_{h}$ is $\ell\leq k$.

Corollary 5.3.

Let $F_{k}=F_{H_{k}}$ be the induced function class of $H_{k}$, as defined in Def. 4.5. Then:

$$F_{k}=\{f:\mathcal{X}\rightarrow\{\pm 1\}\,:\,f\text{ has induced complexity }\leq k\}.$$

The proof of Cor. 5.3 is in Appx. F. We now turn to considering the effect of $k$ on approximation error. Since the ‘worthwhileness’ function $Y(x)=\operatorname{sign}(v(x))$ operates on $x$, we can consider its induced complexity, which we denote by $\ell^{*}$ (i.e., $Y\in F_{\ell^{*}}$). The following result shows that if $\ell^{*}\leq k$, then $H_{k}$ is expressive enough to perfectly recover $Y$.

Theorem 5.4.

If $\ell^{*}\leq k$ then the approximation error is 0.

One conclusion from Thm. 5.4 is that if the user knows $\ell^{*}$, then zero error is, in principle, obtainable; another is that there is no reason to choose $k>\ell^{*}$. In practice, knowing $\ell^{*}$ can aid the user in tuning $k$ according to computational (Sec. 4.3) and statistical (Sec. 5.1.2) considerations. Further conclusions relate $\ell^{*}$ and $k_{2}$:

Corollary 5.5.

If $\ell^{*}\leq k_{2}$ and the distribution $D$ has full support on $\mathcal{X}$, then $k=\ell^{*}$ is the smallest $k$ that gives zero approximation error.

Corollary 5.6.

If $\ell^{*}>k_{2}$, then the approximation error weakly decreases with $k$, i.e., $\varepsilon(h^{*}_{k})\leq\varepsilon(h^{*}_{k-1})$ for all $k\leq k_{2}$. Furthermore, if the distribution $D$ has full support on $\mathcal{X}$, then no $k$ can achieve zero approximation error.

Figure 1: Upper bound on approximation error showing diminishing returns. Parameters: $q=400$, $n=30$, and $k_{2}=10$.

Proofs in Appx. F. In general, Cor. 5.6 guarantees only weak improvement with $k$. Next, we show that increasing $k$ can exhibit clear diminishing-returns behavior, with most of the gain obtained at very low $k$.

Lemma 5.7.

Let $D$ be the uniform distribution over $\mathcal{X}$. Then there is a value function $v$ for which $\varepsilon(h^{*}_{k})$ diminishes convexly with $k$.

The proof is constructive (see Appx. F). We construct a $v$ for which the approximation error of $h^{*}_{k}\in H_{k}$ is upper bounded by

$$\varepsilon(h^{*}_{k})\leq\frac{1}{4\binom{q}{n}}\sum_{\ell=k}^{k_{2}}\binom{k_{2}}{\ell}\binom{q-k_{2}}{n-\ell}\,.$$

The diminishing-returns property of this upper bound is illustrated in Fig. 1. Although Lem. 5.7 describes a special case, we conjecture that this phenomenon applies more broadly.
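The bound is cheap to evaluate; the following sketch reproduces the diminishing-returns curve of Fig. 1 numerically (same parameters):

```python
from math import comb

q, n, k2 = 400, 30, 10  # parameters of Fig. 1

def approx_bound(k: int) -> float:
    """Upper bound on eps(h*_k) from the constructive proof of Lem. 5.7."""
    total = sum(comb(k2, l) * comb(q - k2, n - l) for l in range(k, k2 + 1))
    return total / (4 * comb(q, n))

for k in range(1, k2 + 1):
    print(k, approx_bound(k))  # successive gaps shrink: diminishing returns
```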

Our next result shows that learning $k_{1}$-order functions can be as powerful as learning subadditive functions; hence, learning with $k=k_{1}$ is highly expressive. Interestingly, the connection between general ($k$-order) functions and subadditive functions is due to the strategic response mapping, $\phi$.

Lemma 5.8.

Consider threshold-subadditive functions:

$$H_{\mathrm{SA}}=\{\operatorname{sign}(g(z))\,:\,g\text{ is subadditive on subsets in }\mathcal{Z}\}.$$

Then for every threshold-subadditive $h_{g}\in H_{\mathrm{SA}}$, there is an $h\in H_{k_{1}}$ for which $h(\phi_{h}(x))=h_{g}(\phi_{h_{g}}(x))$ for all $x\in\mathcal{X}$.

5.1.2 User estimation error

For estimation error, we give generalization bounds based on VC analysis. The challenge in analyzing functions in $H_{k}$ is that generalization applies to the strategic 0/1 loss, i.e., $\mathds{1}\{h(\phi_{h}(x))\neq y\}$, and so standard bounds (which apply to the standard 0/1 loss) do not hold. To get around this, our approach relies on directly analyzing the VC dimension of the induced class, $F_{k}$ (a similar approach was taken in Sundaram et al. (2021) for strategic classification). This allows us to employ tools from VC theory, which give the following bound.

Theorem 5.9.

For any $k$ and $m$, given a sample set $S$ of size $m$ sampled from $D$ and labeled by some $v$, we have

$$\varepsilon(\hat{h})-\varepsilon(h^{*})\leq\sqrt{\frac{C\left(\binom{q}{k}\log\big(\binom{q}{k}/\epsilon\big)+\log(1/\delta)\right)}{m}}$$

w.p. at least $1-\delta$ over $S$, and for a fixed constant $C$. In particular, assuming $Y$ is realizable, Alg from Sec. 4.3 returns an $\hat{h}\in H_{k}$ for which:

$$\varepsilon(\hat{h})\leq\sqrt{\frac{C\left(\binom{q}{k}\log\big(\binom{q}{k}/\epsilon\big)+\log(1/\delta)\right)}{m}}$$

w.p. at least $1-\delta$ over $S$, and for a fixed constant $C$.

The proof relies on Thm. 4.6; since $h$ and $f_{h}$ share weights, the induced class $F_{k}$ can be analyzed as a class of $q$-variate degree-$k$ multilinear polynomials. Since induced functions already incorporate $\phi$, VC analysis for the 0/1 loss can be applied. Note that such polynomials have exactly $\binom{q}{k}$ degrees of freedom; hence the term in the bound.
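To see how quickly the $\binom{q}{k}$ term dominates, the bound of Thm. 5.9 can be evaluated numerically (a sketch; $C$, $\epsilon$, and $\delta$ are set to illustrative values since the theorem leaves $C$ unspecified):

```python
from math import comb, log, sqrt

def estimation_bound(q: int, k: int, m: int,
                     C: float = 1.0, eps: float = 0.05, delta: float = 0.05):
    """Right-hand side of the Thm. 5.9 estimation-error bound."""
    d = comb(q, k)                  # degrees of freedom of F_k
    return sqrt(C * (d * log(d / eps) + log(1 / delta)) / m)

for k in (1, 2, 3):                 # sample demands explode with k
    print(k, estimation_bound(q=40, k=k, m=10**6))
```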

5.2 System’s Perspective

The system’s expressive power derives from its flexibility in choosing representations $z$ for items $x$. Since $k_{1},k_{2}$ determine which representations are feasible, they directly control the system’s power to manipulate; and while the system itself may not have direct control over $k_{1},k_{2}$ (e.g., if they are set by exogenous factors like screen size), their values certainly affect the system’s ability to optimize engagement. Our next result is therefore unintuitive: for the system, a smaller $k_{2}$ is better (in the worst case), even though it reduces the set of feasible representations. This result is obtained indirectly, by considering the effect of $k_{2}$ on the user’s choice of $k$.

Lemma 5.10.

There exist a distribution $D$ and a value function $v$ such that for all $k<k^{\prime}\leq k_{2}$, the system obtains a higher payoff against the optimal $h^{*}_{k}\in H_{k}$ than against $h^{*}_{k^{\prime}}\in H_{k^{\prime}}$.

The proof is in Appx. F; it uses the uniform distribution and the value function from Lem. 5.7. Recalling that the choice of $k$ controls the induced complexity $\ell$ (Cor. 5.3), and that users should choose $k$ no greater than $k_{2}$, we can conclude the following (in a worst-case sense):

Corollary 5.11.

For the system, a lower $k_{2}$ is better.

Proof in Appx. F. For $k_{1}$, it turns out that against strategic users it is inconsequential. This is since payoff to the strategic user is derived entirely from $k$, which is upper-bounded by $k_{2}$, but can be set lower than $k_{1}$. This invariance follows immediately from how functions in $H_{k}$ are defined, namely that $w(z)=a_{-}$ for all $z$ with $|z|<k$ (Def. 4.3). This, however, holds only when the strategic user chooses to learn over $H_{k}$ for some $k$. Consider, alternatively, a strategic user who decides to learn subadditive functions instead. In this case, Lem. 5.8 shows that $k_{1}$ determines the user’s ‘effective’ $k$; the smaller $k_{1}$, the smaller the subset of subadditive functions that can be learned. Hence, for the user, a smaller $k_{1}$ means worse approximation error. This becomes even more pronounced when facing a naïve user; for her, a lower $k_{1}$ means that the system now has a larger set of representations to choose from; if even one of them has $v(z)=1$, the system can exploit this to increase its gains. In this sense, as $k_{1}$ decreases, payoff to the system (weakly) improves.

6 Discussion

Our analysis of the balance of power reveals a surprising conclusion: for both parties, in some sense, simple choice functions are better. For the system, lower $k$ improves its payoff through how it relates to $k_{2}$ (Cor. 5.11). For users, lower $k$ is clearly better in terms of runtime (Lem. 4.8) and estimation error (Thm. 5.9), and for approximation error, lower $k$ has certain benefits—as it relates to $\ell^{*}$ (Cor. 5.5), and via diminishing returns (Lem. 5.7). Thus, despite their conflicting interests, the incentives of the system and its users align to some degree.

But the story is more complex. For users, there is no definitive notion of ‘better’; strategic users always face a trade-off, and must choose $k$ to balance approximation, estimation, and runtime. In principle, users are free to choose $k$ at will; but since there is no use for $k>k_{2}$, a system controlling $k_{2}$ de facto controls $k$ as well. This places a concrete restriction on the freedom of users to choose, and inequitably so: for small $k_{2}$, users whose $v$ has complexity $\leq k_{2}$ (i.e., having ‘simple tastes’) are less susceptible to manipulation than users with $v$ of complexity $>k_{2}$ (e.g., fringe users with eclectic tastes) (Thm. 5.4, Cors. 5.5 and 5.6). In this sense, the choice of $k_{2}$ also has implications for fairness. We leave the further study of these aspects of strategic representation for future work.

From a purely utilitarian point of view, it is tempting to conclude that systems should always set k2k_{2} to be low. But this misses the broader picture: although systems profit from engagement, users engage only if they believe it is worthwhile to them, and dissatisfied users may choose to leave the system entirely (possibly into the hands of another). Thus, the system should not blindly act to maximize engagement; in reality, it, too, faces a tradeoff.

References

  • Abboud et al. [1999] Elias Abboud, Nader Agha, Nader H Bshouty, Nizar Radwan, and Fathi Saleh. Learning threshold functions with small weights using membership queries. In Proceedings of the twelfth annual conference on Computational learning theory, pages 318–322, 1999.
  • Abraham et al. [2012] Ittai Abraham, Moshe Babaioff, Shaddin Dughmi, and Tim Roughgarden. Combinatorial auctions with restricted complements. In Proceedings of the 13th ACM Conference on Electronic Commerce, pages 3–16, 2012.
  • Angluin [1988] Dana Angluin. Queries and concept learning. Machine learning, 1988.
  • Balcan and Harvey [2011] Maria-Florina Balcan and Nicholas JA Harvey. Learning submodular functions. In Proceedings of the forty-third annual ACM symposium on Theory of computing, pages 793–802, 2011.
  • Bechavod et al. [2022] Yahav Bechavod, Chara Podimata, Zhiwei Steven Wu, and Juba Ziani. Information discrepancy in strategic learning. In Proceedings of the 39th International Conference on Machine Learning (ICML), 2022.
  • Chen et al. [2020a] Yatong Chen, Jialu Wang, and Yang Liu. Linear classifiers that encourage constructive adaptation. arXiv preprint arXiv:2011.00355, 2020a.
  • Chen et al. [2020b] Yatong Chen, Jialu Wang, and Yang Liu. Strategic recourse in linear classification. arXiv preprint arXiv:2011.00355, 2020b.
  • Chevaleyre et al. [2008] Yann Chevaleyre, Ulle Endriss, Sylvia Estivie, and Nicolas Maudet. Multiagent resource allocation in k -additive domains: preference representation and complexity. Ann. Oper. Res., 163(1):49–62, 2008.
  • Conitzer et al. [2005] Vincent Conitzer, Tuomas Sandholm, and Paolo Santi. Combinatorial auctions with k-wise dependent valuations. In AAAI, 2005.
  • Dughmi et al. [2015] Shaddin Dughmi, Nicole Immorlica, Ryan O’Donnell, and Li-Yang Tan. Algorithmic signaling of features in auction design. In Algorithmic Game Theory - 8th International Symposium, SAGT, pages 150–162. Springer, 2015.
  • Eilat et al. [2022] Itay Eilat, Ben Finkelshtein, Chaim Baskin, and Nir Rosenfeld. Strategic classification with graph neural networks. arXiv preprint arXiv:2205.15765, 2022.
  • Feige et al. [2015] Uriel Feige, Michal Feldman, Nicole Immorlica, Rani Izsak, Brendan Lucier, and Vasilis Syrgkanis. A unifying hierarchy of valuations with complements and substitutes. In Proceedings of the AAAI Conference on Artificial Intelligence, 2015.
  • Feldman [2009] Vitaly Feldman. On the power of membership queries in agnostic learning. The Journal of Machine Learning Research, 10:163–182, 2009.
  • Ghalme et al. [2021] Ganesh Ghalme, Vineet Nair, Itay Eilat, Inbal Talgam-Cohen, and Nir Rosenfeld. Strategic classification in the dark. In Proceedings of the 38th International Conference on Machine Learning (ICML), 2021.
  • Haghtalab et al. [2021] Nika Haghtalab, Nicole Immorlica, Brendan Lucier, Markus Mobius, and Divyarthi Mohan. Persuading with anecdotes. Technical report, National Bureau of Economic Research, 2021.
  • Hardt et al. [2016] Moritz Hardt, Nimrod Megiddo, Christos H. Papadimitriou, and Mary Wootters. Strategic classification. In Proceedings of the 2016 ACM Conference on Innovations in Theoretical Computer Science, 2016.
  • Hu et al. [2019] Lily Hu, Nicole Immorlica, and Jennifer Wortman Vaughan. The disparate effects of strategic manipulation. In Proceedings of the Conference on Fairness, Accountability, and Transparency (FAT*), pages 259–268, 2019.
  • Jagadeesan et al. [2021] Meena Jagadeesan, Celestine Mendler-Dünner, and Moritz Hardt. Alternative microfoundations for strategic classification. In International Conference on Machine Learning, 2021.
  • Kamenica and Gentzkow [2011] Emir Kamenica and Matthew Gentzkow. Bayesian persuasion. American Economic Review, 101(6), 2011.
  • Krishnaswamy et al. [2021] Anilesh K Krishnaswamy, Haoming Li, David Rein, Hanrui Zhang, and Vincent Conitzer. Classification with strategically withheld data. In Proceedings of the AAAI Conference on Artificial Intelligence, 2021.
  • Lancaster [1966] Kelvin J Lancaster. A new approach to consumer theory. Journal of political economy, 74(2):132–157, 1966.
  • Levanon and Rosenfeld [2021] Sagi Levanon and Nir Rosenfeld. Strategic classification made practical. In Proceedings of the 38th International Conference on Machine Learning, ICML, 2021.
  • Levanon and Rosenfeld [2022] Sagi Levanon and Nir Rosenfeld. Generalized strategic classification and the case of aligned incentives. In Proceedings of the 39th International Conference on Machine Learning (ICML), 2022.
  • Miller et al. [2020] John Miller, Smitha Milli, and Moritz Hardt. Strategic classification is causal modeling in disguise. In International Conference on Machine Learning, pages 6917–6926. PMLR, 2020.
  • Milli et al. [2019] Smitha Milli, John Miller, Anca D. Dragan, and Moritz Hardt. The social cost of strategic classification. In Proceedings of the Conference on Fairness, Accountability, and Transparency (FAT*), pages 230–239, 2019.
  • Rosenfeld et al. [2020] Nir Rosenfeld, Kojin Oshiba, and Yaron Singer. Predicting choice with set-dependent aggregation. In International Conference on Machine Learning, 2020.
  • Shalev-Shwartz and Ben-David [2014] Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge University Press, 2014.
  • Sundaram et al. [2021] Ravi Sundaram, Anil Vullikanti, Haifeng Xu, and Fan Yao. PAC-learning for strategic classification. In International Conference on Machine Learning, pages 9978–9988, 2021.
  • Zhang and Conitzer [2021] Hanrui Zhang and Vincent Conitzer. Incentive-aware pac learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 5797–5804, 2021.
  • Zrnic et al. [2021] Tijana Zrnic, Eric Mazumdar, Shankar Sastry, and Michael Jordan. Who leads and who follows in strategic classification? Advances in Neural Information Processing Systems, 34, 2021.

Appendix A Additional Results

A.1 Agnostic User

Theorem A.1 stated below is the formal version of Theorem 3.2 in Section 3. Theorem A.1 shows that, given a large enough sample size $m$, an agnostic user's payoff approaches $\max\{\mu,1-\mu\}$, where $\mu=\mathbb{E}_{D}[y]$.

Theorem A.1.

Let $\frac{2}{2+\sqrt{m}}\leq\delta<1/8$ and $\tau=\frac{\delta}{2(1-\delta)}+\sqrt{\frac{2\log(1/\delta)}{m}}$. Then the agnostic user's expected payoff is guaranteed to be

$$\begin{cases}\geq(1-\delta)(1-\mu)&\text{if }\widehat{\mu}\leq 1/2-\tau\\ \geq(1-\delta)\mu&\text{if }\widehat{\mu}\geq 1/2+\tau\\ =1/2&\text{otherwise.}\end{cases}$$

Before we prove the theorem, we state Hoeffding's inequality, a well-known result from probability theory.

Lemma A.2 (Hoeffding).

Let $S_{m}=\sum_{i=1}^{m}X_{i}$ be the sum of $m$ i.i.d. random variables with $X_{i}\in[0,1]$ and $\mu=\mathbb{E}[X_{i}]$ for all $i\in[m]$. Then

$$\mathbb{P}\left(\frac{S_{m}}{m}-\mu\geq\varepsilon\right)\leq e^{-2m\varepsilon^{2}}\quad\text{and}\quad\mathbb{P}\left(\frac{S_{m}}{m}-\mu\leq-\varepsilon\right)\leq e^{-2m\varepsilon^{2}}.$$

We will use the following consequence of the above inequality. Fix $\delta\in(0,1)$, let $\varepsilon=\sqrt{\frac{2\log(1/\delta)}{m}}$, and let $\widehat{\mu}=\frac{S_{m}}{m}$. Since $e^{-2m\varepsilon^{2}}=\delta^{4}\leq\delta$, with probability at least $1-\delta$ we have both

$$\mu\leq\widehat{\mu}+\sqrt{\frac{2\log(1/\delta)}{m}}\qquad(8)$$

and

$$\mu\geq\widehat{\mu}-\sqrt{\frac{2\log(1/\delta)}{m}}.\qquad(9)$$
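To make the inversion concrete, the following is a minimal Monte Carlo sanity check of the two-sided bound (Eqs. 8–9); the parameter values are illustrative choices of ours, not from the paper.

```python
# Empirically check that the true mean mu lies within
# sqrt(2*log(1/delta)/m) of the empirical mean with
# probability at least 1 - delta (illustrative parameters).
import math
import random

m, delta, mu = 500, 0.05, 0.3
eps = math.sqrt(2 * math.log(1 / delta) / m)

trials = 10_000
covered = sum(
    abs(sum(random.random() < mu for _ in range(m)) / m - mu) <= eps
    for _ in range(trials)
)
print(f"interval width eps = {eps:.4f}")
print(f"empirical coverage = {covered / trials:.4f} (guarantee: >= {1 - delta})")
```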

We are now ready to give the proof of Theorem A.1.

Proof of Theorem A.1.

We begin with the following supporting lemma.

Lemma A.3.

Let $\frac{2}{2+\sqrt{m}}\leq\delta<1/8$. Then $\tau<1/2$.

Proof.

The claim follows from the following chain of inequalities:

$$\frac{2}{2+\sqrt{m}}<\delta\iff m>4(1/\delta-1)^{2}\implies m>4(1/\delta-1)\log(1/\delta)\iff\frac{\delta}{2(1-\delta)}>\frac{2\log(1/\delta)}{m},$$

where the middle implication uses $1/\delta-1\geq\log(1/\delta)$ for $\delta\in(0,1)$.

Let $\gamma=\frac{\delta}{2(1-\delta)}$. By the above, $\tau=\gamma+\sqrt{\frac{2\log(1/\delta)}{m}}\leq\gamma+\sqrt{\gamma}$, which is an increasing function of $\delta$, so the maximum is attained at $\delta=1/8$ and equals $1/14+1/\sqrt{14}<1/2$. This completes the proof of the lemma. ∎

From Lemma A.3 we have $1/2+\tau<1$; hence there is a non-trivial range, namely $\widehat{\mu}\in[1/2+\tau,1]$, in which the user assigns $h(z)=+1$ for all $z$ with probability 1. Similarly, when $\widehat{\mu}\in[0,1/2-\tau]$ the user assigns $h(z)=-1$ for all $z$ with probability 1. We consider three cases separately.

Case 1 ($\widehat{\mu}\in[1/2+\tau,1]$): From Hoeffding's inequality (Eq. 9), with probability at least $1-\delta$,

$$\mu\geq\widehat{\mu}-\sqrt{\frac{2\log(1/\delta)}{m}}\geq 1/2+\tau-\sqrt{\frac{2\log(1/\delta)}{m}}=1/2+\frac{\delta}{2(1-\delta)}=\frac{1}{2(1-\delta)}.$$

Hence, with probability at least $1-\delta$, the agnostic user obtains a payoff of $\mu$, and the expected payoff in this case is at least $(1-\delta)\mu\geq 1/2\geq(1-\delta)(1-\mu)$.

Case 2 ($\widehat{\mu}\in[0,1/2-\tau]$): As in Case 1, the tail bound of Hoeffding's inequality (Eq. 8) gives, with probability at least $1-\delta$,

$$\mu\leq\widehat{\mu}+\sqrt{\frac{2\log(1/\delta)}{m}}\leq 1/2-\frac{\delta}{2(1-\delta)}=\frac{1-2\delta}{2(1-\delta)}.$$

Hence $1-\mu\geq\frac{1}{2(1-\delta)}$. The agnostic user obtains a payoff of $1-\mu$ with probability at least $1-\delta$, so the expected payoff in this case is at least $(1-\delta)(1-\mu)\geq 1/2\geq(1-\delta)\mu$.

Case 3 ($\widehat{\mu}\in(1/2-\tau,1/2+\tau)$): Finally, in this case the agnostic user chooses $h(z)=1$ for all $z\in\mathcal{Z}$ with probability $1/2$ and $h(z)=-1$ for all $z\in\mathcal{Z}$ with probability $1/2$. Hence the expected payoff is $\frac{1}{2}\mu+\frac{1}{2}(1-\mu)=1/2$, irrespective of the true mean $\mu$ of positive samples. ∎
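To illustrate the guarantee, here is a minimal simulation of the thresholding rule analyzed above, assuming i.i.d. binary labels with mean $\mu$; the function name and parameter values are our own illustrative choices.

```python
# Simulate the agnostic user's rule of Theorem A.1: estimate mu_hat from
# m samples, then always accept, always reject, or flip a fair coin,
# depending on where mu_hat falls relative to [1/2 - tau, 1/2 + tau].
import math
import random

def agnostic_payoff(mu, m, delta, trials=10_000):
    tau = delta / (2 * (1 - delta)) + math.sqrt(2 * math.log(1 / delta) / m)
    total = 0.0
    for _ in range(trials):
        mu_hat = sum(random.random() < mu for _ in range(m)) / m
        if mu_hat >= 0.5 + tau:        # h(z) = +1 for all z: payoff mu
            total += mu
        elif mu_hat <= 0.5 - tau:      # h(z) = -1 for all z: payoff 1 - mu
            total += 1 - mu
        else:                          # randomize: payoff 1/2
            total += 0.5
    return total / trials

# m = 1000 and delta = 0.1 satisfy 2/(2 + sqrt(m)) <= delta < 1/8
for mu in (0.2, 0.5, 0.8):
    print(f"mu={mu}: payoff ~ {agnostic_payoff(mu, 1000, 0.1):.3f}, "
          f"max(mu, 1-mu) = {max(mu, 1 - mu)}")
```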

A.2 Lifted Functions

The relation between choice functions and their induced counterparts passes through an auxiliary type of function that operates on sets of size exactly $\ell$. Denote $\mathcal{Z}_{\ell}=\{z\in 2^{E}:|z|=\ell\}$, and note that the feasible representations can be partitioned as $\mathcal{Z}=\mathcal{Z}_{k_{1}}\uplus\cdots\uplus\mathcal{Z}_{k_{2}}$. We refer to functions that operate on single-sized sets as restricted functions. Our next result shows that choice functions in $H_{k}$ can be represented by restricted functions over $\mathcal{Z}_{k}$ that are 'lifted' to operate on the entire space $\mathcal{Z}$. This will allow us to work only with sets of size exactly $k$.

Lemma A.4.

For each $h\in H_{k}$ there exists a corresponding $g:\mathcal{Z}_{k}\rightarrow\{\pm 1\}$ such that $h=\mathrm{lift}(g)$, where:

$$\mathrm{lift}(g)(z)=\begin{cases}1&\text{if }k\leq|z|\text{ and }\exists z^{\prime}\subseteq z,\ |z^{\prime}|=k,\ \text{s.t. }g(z^{\prime})=1\\ -1&\text{otherwise.}\end{cases}$$
Proof.

Let $h\in H_{k}$. Then there is a weight function $w$ on sets of size at most $k$ such that either $w(z)\in(-1,0)$ or $w(z)>\sum_{i\in[k]}\binom{n}{i}$, and

$$h(z)=\operatorname{sign}\left(\sum\nolimits_{z^{\prime}:z^{\prime}\subseteq z,|z^{\prime}|\leq k}w(z^{\prime})\right).$$

Define $g:\mathcal{Z}_{k}\rightarrow\{-1,1\}$ such that for $z\in\mathcal{Z}_{k}$, $g(z)=1$ if $w(z)>0$ and $g(z)=-1$ otherwise. It is easy to see from the choice of $w(z)$ that $h=\mathrm{lift}(g)$. ∎
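For concreteness, the lift operation admits a direct implementation with sets encoded as frozensets; the following sketch is ours (names and encoding are illustrative), with $g$ given by the collection of its positive size-$k$ sets.

```python
# lift(g) answers +1 on z iff |z| >= k and z contains a size-k subset
# on which g answers +1 (Lemma A.4).
from itertools import combinations

def lift(positive_k_sets, k):
    """Return h = lift(g), where g(z') = 1 iff z' in positive_k_sets."""
    def h(z):
        if len(z) < k:
            return -1
        if any(frozenset(c) in positive_k_sets
               for c in combinations(sorted(z), k)):
            return 1
        return -1
    return h

# g is positive only on {1, 2}; lift(g) is then positive exactly on the
# supersets of {1, 2}.
h = lift({frozenset({1, 2})}, k=2)
print(h(frozenset({1, 2, 3})))  # 1
print(h(frozenset({1, 3})))     # -1
```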

A.3 A Lower Bound on the Running Time

As stated in Lemma 4.8, the running time of our algorithm is $O(m\binom{n}{k})$. We argued in Section 4 that this is made possible only because the weights are sparse and the algorithm operates on a finite sample of size $m$. If $m$ is large, this expression can be replaced with $\binom{q}{k}$. We now show that, in the limit (or under full information), the dependence on $\binom{q}{k}$ is necessary. The conclusion from Lemma A.5 is that to find the loss minimizer, any algorithm must traverse at least all such $h$; since there exist $\binom{q}{k}$ such functions, this is a lower bound. This is unsurprising; $H_{k}$ is tightly related to the class of multilinear polynomials, whose degrees of freedom are exactly $\binom{q}{k}$.

Lemma A.5.

Consider the subclass of $H_{k}$ composed of choice functions $h$ which have $w(z)=a_{+}$ for exactly one $z$ with $|z|=k$, and $w(z)=a_{-}$ otherwise. Then, for every such $h$, there exists a corresponding $v$ such that $h$ is the unique minimizer (within this subclass) of the error w.r.t. $v$.

Proof.

Let $z_{1}$ and $z_{2}$ be distinct size-$k$ subsets, and let $a_{-}\in(-1,0)$ and $a_{+}>\sum_{i\in[1,k]}\binom{n}{i}$. Further, for $i\in\{1,2\}$, let $\bm{w}_{i}$ be the weight function that assigns $a_{+}$ to $z_{i}$ and $a_{-}$ to all other subsets of size at most $k$. Let $h_{1}$ and $h_{2}$ be the two functions in $H_{k}$ defined by the binary weight functions $\bm{w}_{1}$ and $\bm{w}_{2}$, respectively. Observe that for $v_{i}=f_{h_{i}}$ the approximation error (see Eq. (6)) of $h_{i}$ is zero. Hence, to prove the lemma it suffices to show that $f_{h_{1}}\neq f_{h_{2}}$.

Suppose $f_{h_{1}}=f_{h_{2}}$. Since $z_{1}\neq z_{2}$, there exists an $x\in\mathcal{X}$ such that $z_{1}\subseteq x$ but $z_{2}\not\subseteq x$. From Theorem 4.6 and the choice of $a_{+}$ and $a_{-}$, this implies $f_{h_{1}}(x)=1$ but $f_{h_{2}}(x)=-1$, a contradiction. ∎

Appendix B Additional Related Work

Learning set functions.

Concept learning refers to learning a binary function over hypercubes [Angluin, 1988] through a query access model. Abboud et al. [1999] provide a lower bound on the number of membership queries needed to exactly learn a threshold function over sets where each element has a small integer-valued weight. Our learning framework admits large weights and assumes only sample access, in contrast with the query access studied in this literature. Feldman [2009] shows that the problem of learning set functions with sample access is computationally hard. However, we show (see Section 5) that the strategic setting is more nuanced: more complex representations are disadvantageous for both the user and the system. In other words, it is in the best interest of the system to choose smaller (and much simpler) representations. A by-now classic work in learning theory studies the learnability from data of submodular (non-threshold) set functions [Balcan and Harvey, 2011]. Though we consider learning subadditive functions in this work, an extension to submodular valuations is natural. Learning set functions is in general hard, even for certain subclasses such as submodular functions. Rosenfeld et al. [2020] show that it is possible to learn certain parameterized subclasses of submodular functions when the goal is to use them for optimization. But this refers to learning over approximate proxy losses, whereas in our work we show that learning is possible directly over the 0/1 loss.

Hierarchies of set functions.

Conitzer et al. [2005] (and independently, Chevaleyre et al. [2008]) suggest a notion of $k$-wise dependent valuations, to which our Definition 4.1 is related. We also allow up to $k$-wise dependencies, but our valuations need not be positive and we focus on their sign (an indication of whether an item is acceptable or not). Our set-function valuations are also over item attributes rather than over multiple items. Despite the differences, the definitions share a motivation: Conitzer et al. [2005] argue that this type of valuation is likely to arise in many economic scenarios, especially since, due to cognitive limitations, it might be difficult for a player to understand the inter-relationships within a large group of items. Hierarchies of valuations with limited dependencies/synergies have been further studied by Abraham et al. [2012] and Feige et al. [2015] under the title 'hypergraph valuations'. These works focus on monotone valuations that have only positive weights for every subset, and are thus mathematically different from ours.

Appendix C A Missing Proof from Section 2

See 2.1

Proof.

The proof follows from the definition of the best response (Eq. 2). Let $z_{1},z_{2}\in\phi_{h}(x)$. Since $\phi_{h}(x)$ consists only of best responses, we have either $h(z_{1})=h(z_{2})=1$ or $h(z_{1})=h(z_{2})=-1$. Hence, $h(z_{1})=v(x)$ if and only if $h(z_{2})=v(x)$ for any $z_{1},z_{2}\in\phi_{h}(x)$. ∎

Appendix D A Missing Proof from Section 3 and an Additional Example

See 3.1

Proof.

Since a naïve user plays $h(z)=\operatorname{sign}(v(z))$, for each $x\in\mathcal{X}$ the user's payoff is maximized if the system responds with a $z\subseteq x$ such that $\operatorname{sign}(v(z))=\operatorname{sign}(v(x))$. Observe that if there exists a $z\in\mathcal{Z}$ with $z\subseteq x$ such that $\operatorname{sign}(v(z))=\operatorname{sign}(v(x))$, then $z\in\phi^{\mathrm{benev}}_{h}(x)$, and consequently the user's payoff is maximized for such an $x$. Conversely, if no $z\in\mathcal{Z}$ with $z\subseteq x$ satisfies $\operatorname{sign}(v(z))=\operatorname{sign}(v(x))$, then no truthful system can ensure more than zero utility for such an $x$. Hence, a benevolent system maximizes the utility of a naïve user. ∎

We now present an additional example showing how a naïve user's choice function can be manipulated by the strategic system and how, as a consequence, the user may obtain an arbitrarily small payoff.

Example 2.

Let $x_{1}=\{a_{1},a_{2}\}$, $x_{2}=\{a_{1},a_{3}\}$, $x_{3}=\{a_{1},a_{4}\}$, $x_{4}=\{a_{2},a_{3}\}$, $x_{5}=\{a_{3},a_{4}\}$, with $\operatorname{sign}(v(x_{1}))=\operatorname{sign}(v(x_{5}))=\operatorname{sign}(v(a_{2}))=\operatorname{sign}(v(a_{4}))=+1$ and $\operatorname{sign}(v(x_{2}))=\operatorname{sign}(v(x_{3}))=\operatorname{sign}(v(x_{4}))=\operatorname{sign}(v(a_{1}))=\operatorname{sign}(v(a_{3}))=-1$. Further, let $k_{1}=k_{2}=1$, so the representations are the singletons $z_{i}=a_{i}$, and let $D=(\frac{\varepsilon}{4},\frac{\varepsilon}{4},1-\varepsilon,\frac{\varepsilon}{4},\frac{\varepsilon}{4})$ be a distribution supported on $(x_{1},x_{2},x_{3},x_{4},x_{5})$.

The unique truthful choice function for this instance is $h=(-1,+1,-1,+1)$ (over $a_{1},\ldots,a_{4}$). A strategic system can manipulate a naïve user into non-preferred choices by responding with the representations $(a_{2},a_{1},a_{4},a_{2},a_{4})$ for $(x_{1},\ldots,x_{5})$. Note that a naïve user expects $z_{1}$ as the representation of $x_{3}$, since $h(z_{1})=\operatorname{sign}(v(x_{3}))=-1$ while $h(z_{4})=+1\neq\operatorname{sign}(v(x_{3}))$; however, a strategic system shows $a_{4}$, since under the given $h$ we have $h(a_{4})=1$. The naïve user's payoff is thus reduced to at most $\varepsilon$, which can be arbitrarily small.
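The example can also be checked numerically; the following sketch is our own bookkeeping (with $\varepsilon=0.01$ as an illustrative value), where the strategic system shows, for each item, an attribute the naïve user labels $+1$ whenever one exists.

```python
# Walk through Example 2: the naive user labels each singleton by
# sign(v); the strategic system best-responds with a +1-labeled
# attribute when possible; we accumulate the user's expected payoff.
sign_v = {  # sign of v on items (pairs) and on singleton representations
    ("a1", "a2"): +1, ("a1", "a3"): -1, ("a1", "a4"): -1,
    ("a2", "a3"): -1, ("a3", "a4"): +1,
    "a1": -1, "a2": +1, "a3": -1, "a4": +1,
}
eps = 0.01
D = {("a1", "a2"): eps / 4, ("a1", "a3"): eps / 4, ("a1", "a4"): 1 - eps,
     ("a2", "a3"): eps / 4, ("a3", "a4"): eps / 4}

payoff = 0.0
for x, p in D.items():
    shown = next((a for a in x if sign_v[a] == +1), x[0])  # system's choice
    payoff += p * (sign_v[shown] == sign_v[x])  # is the user correct on x?

print(f"naive user's payoff: {payoff:.4f} (at most eps = {eps})")
```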

Appendix E Missing Proofs from Section 4

See 4.2

Proof.

Define $k$ as follows: if $h(z)=-1$ for all $z\in\mathcal{Z}$ then $k=k_{1}$, and otherwise

$$k=\max_{k^{\prime}\in[k_{1},k_{2}]}\left\{k^{\prime}:\exists z\text{ such that }|z|=k^{\prime}\text{ and }h(z)=1,\text{ but }h(z^{\prime})=-1\text{ for all }z^{\prime}\subset z,\ z^{\prime}\in\mathcal{Z}\right\}.$$

Define $h^{\prime}$ as follows: for $|z|<k$, $h^{\prime}(z)=-1$; for $|z|\geq k$,

$$h^{\prime}(z)=\begin{cases}1&\text{if }\exists z^{\prime}:|z^{\prime}|=k,\ z^{\prime}\subseteq z,\text{ and }h(z^{\prime})=1\\ -1&\text{otherwise.}\end{cases}$$

First, we argue that $h^{\prime}$ defined as above satisfies $h^{\prime}(\phi_{h^{\prime}}(x))=h(\phi_{h}(x))$ for all $x\in\mathcal{X}$. Suppose $h(\phi_{h}(x))=1$. Then there exists $z\in\mathcal{Z}$ such that $z\subseteq x$ and $h(z)=1$. From the choice of $k$, we may assume without loss of generality that $|z|=k$. Further, from the construction of $h^{\prime}$, we have $h^{\prime}(z)=1$, and hence $h^{\prime}(\phi_{h^{\prime}}(x))=h(\phi_{h}(x))=1$. Now suppose $h(\phi_{h}(x))=-1$. Then $h(z)=-1$ for all $z\subseteq x$; in particular, $h(z)=-1$ for all $z\subseteq x$ with $|z|=k$. This implies $h^{\prime}(z)=-1$ for all $z\subseteq x$ with $|z|\geq k$: if there existed $z\subseteq x$ with $|z|\geq k$ and $h^{\prime}(z)=1$, then from the definition of $h^{\prime}$ there would exist $z^{\prime}\subseteq z\subseteq x$ with $|z^{\prime}|=k$ and $h(z^{\prime})=1$, a contradiction. Additionally, by definition, $h^{\prime}(z)=-1$ for all $z\subseteq x$ with $|z|<k$. Hence, if $h(\phi_{h}(x))=-1$ then $h^{\prime}(\phi_{h^{\prime}}(x))=-1$.

Now we show that $h^{\prime}$ is a $k$-order function. Let $a_{-}\in(-1,0)$ and set $w(z)=a_{-}$ for all $z$ with $|z|\leq k$ and $h^{\prime}(z)=-1$. Further, for all $z$ with $|z|=k$ and $h^{\prime}(z)=1$, set $w(z)=a_{+}>\sum_{i\in[k]}\binom{n}{i}$. For all $z\in\mathcal{Z}$ with $|z|<k$ we have $h^{\prime}(z)=-1$ by construction, and since $w(z^{\prime})=a_{-}<0$ for all $z^{\prime}\in\Gamma_{k}(z)$, we have $\sum_{z^{\prime}\in\Gamma_{k}(z)}w(z^{\prime})<0$. Hence, for all $z\in\mathcal{Z}$ with $|z|<k$,

$$h^{\prime}(z)=\operatorname{sign}\left(\sum\nolimits_{z^{\prime}\in\Gamma_{k}(z)}w(z^{\prime})\right)=-1.$$

Similarly, for all $z\in\mathcal{Z}$ with $|z|\geq k$, by construction $h^{\prime}(z)=1$ if and only if there exists $z^{\prime}\subseteq z$ with $|z^{\prime}|=k$ such that $h^{\prime}(z^{\prime})=h(z^{\prime})=1$. In particular, if $|z|\geq k$ and $h^{\prime}(z)=1$ then there exists $z^{\prime}\subseteq z$ with $|z^{\prime}|=k$ such that $w(z^{\prime})=a_{+}$. Since $a_{+}>\sum_{i\in[k]}\binom{n}{i}$, $a_{-}\in(-1,0)$, and $k_{2}\leq n$, it follows that if $|z|\geq k$ and $h^{\prime}(z)=1$ then

$$h^{\prime}(z)=\operatorname{sign}\left(\sum\nolimits_{z^{\prime}\in\Gamma_{k}(z)}w(z^{\prime})\right)=1.$$

Finally, if $|z|\geq k$ and $h^{\prime}(z)=-1$, then from the definition of $h^{\prime}$ there does not exist a $z^{\prime}\subseteq z$ with $|z^{\prime}|=k$ such that $w(z^{\prime})=a_{+}$. Since $a_{-}\in(-1,0)$, we have that if $|z|\geq k$ and $h^{\prime}(z)=-1$ then

$$h^{\prime}(z)=\operatorname{sign}\left(\sum\nolimits_{z^{\prime}\in\Gamma_{k}(z)}w(z^{\prime})\right)=-1.$$

∎

See 4.4

Proof.

Arbitrarily choose $u\subset E$ (recall $E$ is the ground set) such that $|u|=k$, and let $w(u)=a_{k,+}>\sum_{i\in[k]}\binom{n}{i}$. (Here we wish to distinguish between $a_{+}$ for $k$ and for $k-1$, and hence write $a_{k,+}$ instead of $a_{+}$.) Also, for all $z\neq u$ with $|z|\leq k$, let $w(z)=a_{-}\in(-1,0)$. Let $h:\mathcal{Z}\rightarrow\{\pm 1\}$ be defined as follows:

$$h(z)=\operatorname{sign}\left(\sum\nolimits_{z^{\prime}\in\Gamma_{k}(z)}w(z^{\prime})\right).$$

From the definition of $H_{k}$, we have $h\in H_{k}$. We show that $h\notin H_{k-1}$. First, observe that for all $z\in\mathcal{Z}$, $h(z)=1$ if and only if $u\subseteq z$. Suppose $h\in H_{k-1}$. Then there is a weight function $w^{\prime}$ on sets of size at most $k-1$ such that either $w^{\prime}(z)=a_{-}\in(-1,0)$ or $w^{\prime}(z)=a_{k-1,+}>\sum_{i\in[k-1]}\binom{n}{i}$, and

$$h(z)=\operatorname{sign}\left(\sum\nolimits_{z^{\prime}\in\Gamma_{k-1}(z)}w^{\prime}(z^{\prime})\right).$$

Let $z\in\mathcal{Z}$ be such that $u\subseteq z$. This implies $h(z)=1$, so there exists $u^{\prime}\subseteq z$ such that $|u^{\prime}|=k-1$ and $w^{\prime}(u^{\prime})=a_{k-1,+}$. Let $\tilde{z}\in\mathcal{Z}$ be such that $u^{\prime}\subseteq\tilde{z}$ but $u\not\subseteq\tilde{z}$; such a $\tilde{z}$ exists because $u\cap u^{\prime}\neq u$. Further, as $u\not\subseteq\tilde{z}$, we have $h(\tilde{z})=-1$. But since $u^{\prime}\subseteq\tilde{z}$, the choice of $a_{k-1,+}$ and $a_{-}$ gives

$$\sum\nolimits_{z^{\prime}\in\Gamma_{k-1}(\tilde{z})}w^{\prime}(z^{\prime})>0\;\Rightarrow\;h(\tilde{z})=\operatorname{sign}\left(\sum\nolimits_{z^{\prime}\in\Gamma_{k-1}(\tilde{z})}w^{\prime}(z^{\prime})\right)=1.$$

This gives a contradiction. Hence, $h\notin H_{k-1}$. ∎

See 4.6

Proof.

Since $h\in H_{k}$, $\bm{w}$ satisfies the following two properties (see Definition 4.3):

  1. either $w(z)=a_{-}\in(-1,0)$ or $w(z)=a_{+}>\sum_{i\in[k]}\binom{n}{i}$;

  2. $w(z)=a_{-}$ for all $z$ with $|z|<k$.

Further, from the definition of $f_{h}$, we have $f_{h}(x)=h(\phi_{h}(x))$. This implies

$$f_{h}(x)=1\Longleftrightarrow\exists z\in\mathcal{Z},\ z\subseteq x\text{ such that }h(z)=1.$$

From the above two properties of the weight function, we have

$$h(z)=1\Longleftrightarrow\exists z^{\prime}\subseteq z,\ |z^{\prime}|=k\text{ such that }w(z^{\prime})=a_{+}>0.$$

From the above two equations we conclude that

$$f_{h}(x)=1\Longleftrightarrow\exists z\subseteq x,\ |z|=k\text{ such that }w(z)=a_{+}>0.$$

Finally, the two properties of $\bm{w}$ ensure that

$$f_{h}(x)=\operatorname{sign}\left(\sum\nolimits_{z\in\Gamma_{k}(x)}w(z)\right).$$

∎
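As a concrete companion to Theorem 4.6, the following is a small sketch (ours; names and values are illustrative) of the induced function computed from a $k$-order weight function, with the positive size-$k$ sets given explicitly.

```python
# f_h(x) is the sign of the total weight over Gamma_k(x), the size-k
# subsets of x -- equivalently, +1 iff x contains a k-set with weight
# a_plus (Theorem 4.6).
from itertools import combinations

def induced_f(positive_k_sets, k, a_plus, a_minus):
    def f(x):
        total = sum(a_plus if frozenset(c) in positive_k_sets else a_minus
                    for c in combinations(sorted(x), k))
        return 1 if total > 0 else -1
    return f

# one positive 2-set {1, 2}; a_plus is chosen large enough to dominate
f = induced_f({frozenset({1, 2})}, k=2, a_plus=100.0, a_minus=-0.5)
print(f({1, 2, 3, 4}))  # 1: contains {1, 2}
print(f({1, 3, 4}))     # -1: contains no positive 2-subset
```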

See 4.7

Proof.

Throughout, for ease of notation, we write $x\in S$ for $x\in\{x_{1},\ldots,x_{m}\}$. Let $Z_{k}=\{z:|z|=k,\ \exists x\in S\text{ with }z\subseteq x\}$. Recall that $Z_{k,S}$ equals $Z_{k}$ at the beginning of the algorithm. Also, for each $z\in Z_{k}$, let $\mathcal{X}_{z}=\{x\in S\mid z\subseteq x\}$. The following lemma characterizes the training sets for which there exists an $h\in H_{k}$ with zero empirical error.

Lemma E.1.

There exists an $h\in H_{k}$ with zero empirical error if and only if for every $x\in S^{+}$ there exists a $z\in Z_{k}$ with $z\subseteq x$ such that $z\not\subseteq x^{\prime}$ for all $x^{\prime}\in S^{-}$.

Proof.

Suppose there exists an $h\in H_{k}$ with zero empirical error. Since $h\in H_{k}$, from Lemma A.4 there exists a $g:Z_{k}\rightarrow\{\pm 1\}$ such that $h=\mathrm{lift}(g)$. We state the following observation, whose proof follows from the definition of $\mathrm{lift}(g)$.

Observation E.2.

  1. For every $z\in\mathcal{Z}$ such that $h(z)=1$, there is a $z^{\prime}\in Z_{k}$ with $z^{\prime}\subseteq z$ and $g(z^{\prime})=1$.

  2. For every $z\in\mathcal{Z}$ such that $h(z)=-1$, it must be that $g(z^{\prime})=-1$ for every $z^{\prime}\in Z_{k}$ with $z^{\prime}\subseteq z$.

Further, as the empirical error of $h$ is zero, we have the following observation.

Observation E.3.

  1. For every $x\in S^{+}$, there is a $z\in\mathcal{Z}$ with $z\subseteq x$ such that $h(z)=1$.

  2. For every $x\in S^{-}$, it must be that $h(z)=-1$ for every $z\in\mathcal{Z}$ with $z\subseteq x$.

Proof.

For every xS+x\in S^{+}, since the empirical error is zero, we have h(ϕh(x))=1h(\phi_{h}(x))=1. From the definition of ϕh\phi_{h}, this implies there is a z𝒵z\in\mathcal{Z} and zxz\subseteq x such that h(z)=1h(z)=1. Similarly, for every xSx\in S^{-}, since the empirical error is zero, we have h(ϕh(x))=1h(\phi_{h}(x))=-1. Again from the definition of ϕh\phi_{h}, it must be that for every z𝒵z\in\mathcal{Z}, zxz\subseteq x we have h(z)=1h(z)=-1. ∎

Hence, from Observations E.2 and E.3, for every $x\in S^{+}$ there is a $z\in Z_{k}$ with $z\subseteq x$ such that $g(z)=1$; and for every $x\in S^{-}$, every $z\in Z_{k}$ with $z\subseteq x$ has $g(z)=-1$. Hence, for every $x\in S^{+}$ there exists a $z\in Z_{k}$ with $z\subseteq x$ such that $z\not\subseteq x^{\prime}$ for all $x^{\prime}\in S^{-}$.

Conversely, suppose that for every $x\in S^{+}$ there exists a $z\in Z_{k}$ with $z\subseteq x$ such that $z\not\subseteq x^{\prime}$ for all $x^{\prime}\in S^{-}$. Define $g:Z_{k}\rightarrow\{\pm 1\}$ as follows: (a) for all $z\in Z_{k}$ such that $z\subseteq x$ for some $x\in S^{-}$, let $g(z)=-1$; (b) for all $z\in Z_{k}$ such that $z\subseteq x$ for some $x\in S^{+}$ and $z\not\subseteq x^{\prime}$ for every $x^{\prime}\in S^{-}$, let $g(z)=1$; (c) for all $z\in Z_{k}$ such that $z\not\subseteq x$ for any $x\in S$, let $g(z)=-1$. From the supposition, for every $x\in S^{+}$ there is a $z\in Z_{k}$ with $z\subseteq x$ such that $g(z)=1$. Now define $h=\mathrm{lift}(g)$. To show that the empirical error of $h$ is zero, it suffices to show that $h(\phi_{h}(x))=-1$ for every $x\in S^{-}$ and $h(\phi_{h}(x))=1$ for every $x\in S^{+}$. Let $x\in S^{-}$. From the definition of $g$, every $z\in Z_{k}$ with $z\subseteq x$ has $g(z)=-1$; hence, from the definition of $\mathrm{lift}$, every $z\in\mathcal{Z}$ with $z\subseteq x$ has $h(z)=-1$, and from the definition of best response, $h(\phi_{h}(x))=-1$. Similarly, if $x\in S^{+}$, from our supposition and the definition of $g$ there exists a $z\in Z_{k}$ with $z\subseteq x$ and $g(z)=1$; hence, from the definition of $\mathrm{lift}$, there exists a $z\in\mathcal{Z}$ with $z\subseteq x$ and $h(z)=1$, and from the definition of best response, $h(\phi_{h}(x))=1$. ∎

Now, if $Y\in F_{k}$ then there exists an $h\in H_{k}$ whose induced function equals $Y$, that is, $f_{h}=Y$. This implies there exists an $h\in H_{k}$ which attains zero empirical error on the training set; since the empirical error is always non-negative, such an $h$ minimizes it. Hence, from Lemma E.1, it follows that if $Y$ is realizable then for every $x\in S^{+}$, at Step 17 of Alg there is either a $z\in Z^{+}$ with $z\subseteq x$, or a $z\in Z_{k,S}$ with $z\subseteq x$. Now, observe that at the beginning of Step 22, the set $Z^{+}$ satisfies

$$z\in Z^{+}\iff\exists\,x\in S^{+}\text{ such that }z\subseteq x\text{ and }\nexists\,x^{\prime}\in S^{-}\text{ such that }z\subseteq x^{\prime}. \qquad(10)$$

Further, at Step 22, for a $z\in Z_{k}$, $w(z)=a_{+}$ if $z\in Z^{+}$. This implies

$$w(z)=a_{+}\iff\exists\,x\in S^{+}\text{ such that }z\subseteq x\text{ and }\nexists\,x^{\prime}\in S^{-}\text{ such that }z\subseteq x^{\prime}. \qquad(11)$$

Also, from Theorem 4.6, the induced function $f_{\hat{h}}$ corresponding to the returned $\hat{h}$ is given by

$$f_{\hat{h}}(x)=\operatorname{sign}\left(\sum\nolimits_{z\in\Gamma_{k}(x)}w(z)\right). \qquad(12)$$

To complete the proof of the theorem, we show that $f_{\hat{h}}(x_{i})=y_{i}$ for every $x_{i}\in S$. Suppose $x\in S^{-}$. Then, from Equations (10) and (11), every $z\subseteq x$ with $|z|\leq k$ has $w(z)=a_{-}<0$, and hence, from Equation (12), $f_{\hat{h}}(x)=y=-1$. Similarly, suppose $x\in S^{+}$. Then, from Equation (11), there exists $z\subseteq x$ with $|z|=k$ such that $w(z)=a_{+}$. Hence, from Equation (12), and noting that $a_{+}>\sum_{i\in[k]}\binom{n}{i}$ and $a_{-}\in(-1,0)$, we have $f_{\hat{h}}(x)=y=1$. ∎
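The proof suggests a compact re-implementation of the learner; the sketch below follows Eqs. (10)–(12) but is not the paper's listing (step numbers and variable names there differ), and all identifiers are ours.

```python
# Collect the k-subsets that appear in some positive sample but in no
# negative sample (Eq. 10), give them weight a_plus (Eq. 11), and
# predict with the induced function (Eq. 12).
from itertools import combinations
from math import comb

def learn(samples, k, a_minus=-0.5):
    """samples: list of (x, y), x an iterable of attributes, y in {-1, +1}."""
    k_subs = lambda x: {frozenset(c) for c in combinations(sorted(x), k)}
    neg = set()
    for x, y in samples:
        if y == -1:
            neg |= k_subs(x)
    z_plus = set()
    for x, y in samples:
        if y == +1:
            z_plus |= k_subs(x) - neg          # Eq. (10)
    # a_plus large enough that one positive k-subset outweighs all the
    # negative weight an item of the observed sizes can accumulate
    max_subs = max((comb(len(set(x)), k) for x, _ in samples), default=1)
    a_plus = 1 + max_subs * abs(a_minus)       # Eq. (11)

    def f_hat(x):                              # Eq. (12)
        total = sum(a_plus if z in z_plus else a_minus for z in k_subs(x))
        return 1 if total > 0 else -1
    return f_hat

S = [({1, 2, 3}, +1), ({2, 3, 4}, -1), ({1, 2, 4}, +1)]
f = learn(S, k=2)
print([f(x) for x, _ in S])  # [1, -1, 1]: zero empirical error
```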

See 4.8

Proof.

In the first two for-loops, for each $x\in S^{+}$ (or in $S^{-}$), the inner for-loop runs in $O(\binom{n}{k})$ time. Since $|S|\leq m$, this totals at most $O(m\binom{n}{k})$ operations. Similarly, Step 22 places weights on at most $m\binom{n}{k}$ subsets, and hence runs in $O(m\binom{n}{k})$ time. Hence, Alg runs in $O(m\binom{n}{k})$ time. ∎

Appendix F Missing Proofs from Section 5

See 5.2

Proof.

Let $\ell=\min_{k^{\prime}\in[1,k]}\{k^{\prime}:\text{there exists a }g:Z_{k^{\prime}}\rightarrow\{\pm 1\}\text{ such that }h=\mathrm{lift}(g)\}$; from Lemma A.4, $\ell\leq k$ is well defined. Let $g:Z_{\ell}\rightarrow\{\pm 1\}$ be such that $h=\mathrm{lift}(g)$. Now, from the definition of $f_{h}$, for all $x\in\mathcal{X}$ we have $f_{h}(x)=1$ if and only if there exists a $z\in\mathcal{Z}$ such that $z\subseteq x$ and $h(z)=1$. Since $h=\mathrm{lift}(g)$, $f_{h}(x)=1$ if and only if there exists a $z\in Z_{\ell}$ such that $z\subseteq x$ and $g(z)=1$. This implies the induced complexity of $f_{h}$ is at most $\ell\leq k$. ∎

See 5.3

Proof.

From Lemma 5.2, we know that functions in $F_{k}$ have induced complexity at most $k$. We now show that if $f$ has induced complexity at most $k$ then there is an $h\in H_{k}$ such that $f=f_{h}$. Let the induced complexity of $f$ be $\ell\leq k$. Then there exists a $g:Z_{\ell}\rightarrow\{\pm 1\}$ such that

$$f(x)=1\iff\exists z\in Z_{\ell}\text{ such that }z\subseteq x\text{ and }g(z)=1. \qquad(13)$$

Let $h=\mathrm{lift}(g)$. First we show that $f(x)=f_{h}(x)$ for all $x\in\mathcal{X}$. Since $h$ is a lift of $g$, we have

$$h(z)=1\iff\exists z^{\prime}\in Z_{\ell}\text{ such that }z^{\prime}\subseteq z\text{ and }g(z^{\prime})=1. \qquad(14)$$

Hence, from Equations (13) and (14), for all $x\in\mathcal{X}$, $f(x)=1$ if and only if there exists a $z\in\mathcal{Z}$ such that $z\subseteq x$ and $h(z)=1$. From the definition of the induced function, this implies $f(x)=f_{h}(x)$ for all $x\in\mathcal{X}$.

To show $h\in H_{k}$, we construct a weight function $\bm{w}$ on sets of size at most $k$. For $z\in 2^{E}$ with $|z|<k$, let $w(z)=a_{-}\in(-1,0)$. For $z\in Z_{k}$, let

$$w(z)=\begin{cases}a_{+}>\sum_{i\in[1,k]}\binom{n}{i}&\text{if }\exists z^{\prime}\subseteq z,\ |z^{\prime}|=\ell,\ \text{and }g(z^{\prime})=1\\ a_{-}&\text{otherwise.}\end{cases}$$

Now, from Equation (14), $h(z)=1$ if and only if there exists a $z^{\prime}\in Z_{\ell}$ such that $z^{\prime}\subseteq z$ and $g(z^{\prime})=1$. Hence, from the definition of $\bm{w}$, $h(z)=1$ if and only if there exists a $z^{\prime}\subseteq z$ with $w(z^{\prime})=a_{+}$. In particular, since $a_{+}>\sum_{i=1}^{k}\binom{n}{i}$ and $a_{-}\in(-1,0)$, we have

$$h(z)=\operatorname{sign}\left(\sum\nolimits_{z^{\prime}:z^{\prime}\subseteq z,|z^{\prime}|\leq k}w(z^{\prime})\right).$$

∎

See 5.4

Proof.

Since the induced complexity of $v$ is $\ell^{*}$, there is a function $g:Z_{\ell^{*}}\rightarrow\{\pm 1\}$ s.t.:

$$v(x)=\begin{cases}1&\text{if }\exists z\subseteq x,\ |z|=\ell^{*},\ \text{and }g(z)=1\\ -1&\text{otherwise.}\end{cases}$$

Let $a_{+}>\sum_{i\in[1,k]}\binom{n}{i}$ and $a_{-}\in(-1,0)$, and define the weight function $\bm{w}$ on sets of size at most $k$ as follows: (a) if $|z|<k$, let $w(z)=a_{-}$; (b) if $|z|=k$ and there exists a $z^{\prime}\subseteq z$ such that $g(z^{\prime})=1$, let $w(z)=a_{+}$; (c) if $|z|=k$ and no such $z^{\prime}$ exists, let $w(z)=a_{-}$. Now define $h$ using $\bm{w}$ as follows:

$$h(z)=\operatorname{sign}\left(\sum\nolimits_{z^{\prime}\in\Gamma_{k}(z)}w(z^{\prime})\right).$$

We now show that for each $x\in\mathcal{X}$, $h(\phi_{h}(x))=f_{h}(x)=v(x)$, implying $h^{*}_{k}=h$. Suppose $f_{h}(x)=1$ for some $x\in\mathcal{X}$. Then there exists a $z\in\mathcal{Z}$ with $z\subseteq x$ and $h(z)=1$. From Theorem 4.6 and the choice of $a_{+}$ and $a_{-}$, there exists a $z\subseteq x$ with $|z|=k$ such that $w(z)=a_{+}$. From the construction of $\bm{w}$, this implies there exists a $z\subseteq x$ with $|z|=\ell^{*}$ such that $g(z)=1$. But from the above definition of $v$, this implies $v(x)=1$. Similarly, we can argue that if $f_{h}(x)=-1$ then $v(x)=-1$, for any $x\in\mathcal{X}$. Hence $h(\phi_{h}(x))=v(x)$ for each $x\in\mathcal{X}$, implying $h^{*}_{k}=h$ and zero approximation error for $h$. ∎

See 5.5

Proof.

In the proof of Theorem 5.4, we showed that for $k=\ell^{*}$ the approximation error is zero. Hence, to prove the corollary it suffices to show that for $k<\ell^{*}$ the approximation error is not zero. Suppose there is an $h\in H_{k}$ such that $\varepsilon(h)=0$ and $k<\ell^{*}$. Since the distribution $D$ has full support, this implies $f_{h}(x)=v(x)$ for all $x\in\mathcal{X}$. Hence the induced complexity of $v$ is at most $k<\ell^{*}$, a contradiction. ∎

See 5.6

Proof.

The approximation error weakly decreases because $H_{k-1}\subseteq H_{k}$ for all $k\leq k_{2}$. Also, from the proof of Corollary 5.5, it is clear that no $k<\ell^{*}$ can achieve zero approximation error. ∎

See 5.7

Proof.

We construct a $v$ such that the approximation error of $h^{*}_{k}\in H_{k}$ is

$$\varepsilon(h^{*}_{k})=\frac{1}{4\binom{q}{n}}\sum_{\ell=k}^{k_{2}}\binom{k_{2}}{\ell}\binom{q-k_{2}}{n-\ell}.$$

It is easy to see that $\varepsilon(h^{*}_{k})$ diminishes convexly with $k$ (see Fig. 1). We choose $k_{2}$ elements $e_{1},e_{2},\ldots,e_{k_{2}}\in E$ (the ground set), and let $z_{e}$ be the size-$k_{2}$ subset consisting of these $k_{2}$ elements. For $v:\mathcal{X}\rightarrow\mathbb{R}$, let $\mathcal{X}_{v}^{+}=\{x\in\mathcal{X}\mid\operatorname{sign}(v(x))=1\}$ and $\mathcal{X}_{v}^{-}=\{x\in\mathcal{X}\mid\operatorname{sign}(v(x))=-1\}$. We first show that there exists a $v$ with the following three properties:

  1. If $x\in\mathcal{X}_{v}^{+}$ then there exists a $z\subseteq z_{e}$ such that $z\subseteq x$.

  2. For $k\in[1,k_{2}]$, let $\mathcal{X}_{k}=\{x\in\mathcal{X}\mid\exists z\subseteq z_{e},\ |z|=k,\text{ and }z\subseteq x\}$. Then $|\mathcal{X}_{v}^{+}\cap\mathcal{X}_{k}|=\frac{3}{4}\sum_{\ell=k}^{k_{2}}\binom{k_{2}}{\ell}\binom{q-k_{2}}{n-\ell}$, for every $k\in[1,k_{2}]$.

  3. For every $z\subseteq z_{e}$, let $\mathcal{X}_{z}=\{x\in\mathcal{X}\mid z\subseteq x\}$. Then $|\mathcal{X}_{v}^{+}\cap\mathcal{X}_{z}|=\frac{3}{4}\binom{q-k}{n-k}$, where $|z|=k$.

We construct such a $v$ iteratively. We begin by making the following observation.

Observation F.1.

For each $k\in[1,k_{2}]$, $|\mathcal{X}_{k}|=\sum_{\ell=k}^{k_{2}}\binom{k_{2}}{\ell}\binom{q-k_{2}}{n-\ell}$.

Proof.

Recall that $\mathcal{X}$ consists of the size-$n$ subsets of $E$. For $k\in[1,k_{2}]$ we wish to count the size-$n$ subsets of $E$ that contain some $z\subseteq z_{e}$ with $|z|=k$. This is equivalent to choosing a subset of $z_{e}$ of some size $\ell\geq k$ (the intersection with $z_{e}$) and then choosing the remaining $n-\ell$ elements from the $q-k_{2}$ elements of $E$ outside $z_{e}$. For every $\ell\geq k$ we can choose the size-$\ell$ subset of $z_{e}$ in $\binom{k_{2}}{\ell}$ ways, and for each such choice we can choose the remaining $n-\ell$ elements in $\binom{q-k_{2}}{n-\ell}$ ways. Summing over $\ell\in[k,k_{2}]$ gives $|\mathcal{X}_{k}|=\sum_{\ell=k}^{k_{2}}\binom{k_{2}}{\ell}\binom{q-k_{2}}{n-\ell}$. ∎

Constructing $v$: The idea is to iteratively add elements of $\mathcal{X}$ to $\mathcal{X}_{v}^{+}$, that is, to iteratively determine the $x\in\mathcal{X}$ with $\operatorname{sign}(v(x))=1$. In the first round, we arbitrarily choose $\frac{3}{4}\binom{q-k_{2}}{n-k_{2}}$ elements of $\mathcal{X}_{k_{2}}$ and add them to $\mathcal{X}_{v}^{+}$; the remaining $\frac{1}{4}\binom{q-k_{2}}{n-k_{2}}$ are added to $\mathcal{X}_{v}^{-}$. At round $k$, assume we have constructed a $v$ satisfying the above three properties for all $k^{\prime}>k$, that is:

  1. If $x\in\mathcal{X}_{v}^{+}$ then there exists a $z\subseteq z_{e}$ such that $z\subseteq x$.

  2. $|\mathcal{X}_{v}^{+}\cap\mathcal{X}_{k^{\prime}}|=\frac{3}{4}\sum_{\ell=k^{\prime}}^{k_{2}}\binom{k_{2}}{\ell}\binom{q-k_{2}}{n-\ell}$, for every $k^{\prime}\in[k+1,k_{2}]$.

  3. For every $z\subseteq z_{e}$ with $|z|=k^{\prime}>k$, $|\mathcal{X}_{v}^{+}\cap\mathcal{X}_{z}|=\frac{3}{4}\binom{q-k^{\prime}}{n-k^{\prime}}$, where $\mathcal{X}_{z}=\{x\in\mathcal{X}\mid z\subseteq x\}$.

Hence, at round $k$, there are $\binom{k_{2}}{k}\binom{q-k_{2}}{n-k}$ elements of $\mathcal{X}_{k}$ that are not yet in $\mathcal{X}_{v}^{+}$ or $\mathcal{X}_{v}^{-}$. From these elements of $\mathcal{X}_{k}$, for every size-$k$ subset $z\subseteq z_{e}$ we arbitrarily choose $\frac{3}{4}\binom{q-k_{2}}{n-k}$ elements containing $z$ and add them to $\mathcal{X}_{v}^{+}$, and add the remaining $\frac{1}{4}\binom{q-k_{2}}{n-k}$ elements to $\mathcal{X}_{v}^{-}$. Now, observe that $v$ satisfies the first two properties for every $k^{\prime}\in[k,k_{2}]$ after this procedure. We argue that $v$ also satisfies the third property for any $z\subseteq z_{e}$ with $|z|=k$. The size-$n$ sets in $\mathcal{X}$ containing a given $z\subseteq z_{e}$ with $|z|=k^{\prime}$ can be partitioned according to their intersection with $z_{e}$; in particular, we have the following combinatorial equality:

$$\binom{q-k^{\prime}}{n-k^{\prime}}=\sum_{\ell=k^{\prime}}^{k_{2}}\binom{k_{2}-k^{\prime}}{\ell-k^{\prime}}\binom{q-k_{2}}{n-\ell}.$$

In the above expression, $\binom{q-k_{2}}{n-\ell}$ corresponds to the number of size-$n$ sets whose intersection with $z_{e}$ is a specific subset of size $\ell\geq k^{\prime}$. Since our iterative procedure ensures that from each such part at least a $\frac{3}{4}$ fraction of the $x$'s is added to $\mathcal{X}_{v}^{+}$, we have that $v$ satisfies the third property.

Optimal $h^{*}\in H_{k}$: From the construction of $v$, it is clear that the optimal $h^{*}\in H_{k}$ for the above-constructed $v$, for any $k\in[1,k_{2}]$, satisfies the following: for every $z\in\mathcal{Z}$, $h^{*}(z)=1$ if and only if there exists a $z^{\prime}\subseteq z_{e}$ with $|z^{\prime}|=k$ and $z^{\prime}\subseteq z$. Further, as $D$ is the uniform distribution, for such an $h^{*}$:

$$\varepsilon(h^{*}_{k})=\frac{1}{4\binom{q}{n}}\sum_{\ell=k}^{k_{2}}\binom{k_{2}}{\ell}\binom{q-k_{2}}{n-\ell}.$$

∎
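Both the combinatorial identity above and the shape of the resulting error curve are easy to verify numerically; the following check uses illustrative parameter values of our own choosing.

```python
# Verify (i) the partition identity used in the proof and (ii) that
# eps(h*_k) decreases with k (cf. Fig. 1), for one illustrative setting.
from math import comb

q, n, k2 = 20, 8, 5

# (i) C(q-k', n-k') = sum_{l=k'}^{k2} C(k2-k', l-k') * C(q-k2, n-l)
for kp in range(1, k2 + 1):
    lhs = comb(q - kp, n - kp)
    rhs = sum(comb(k2 - kp, l - kp) * comb(q - k2, n - l)
              for l in range(kp, k2 + 1))
    assert lhs == rhs, (kp, lhs, rhs)

# (ii) eps(h*_k) = (1 / (4 C(q, n))) * sum_{l=k}^{k2} C(k2, l) C(q-k2, n-l)
for k in range(1, k2 + 1):
    err = sum(comb(k2, l) * comb(q - k2, n - l)
              for l in range(k, k2 + 1)) / (4 * comb(q, n))
    print(f"k={k}: eps(h*_k) = {err:.6f}")
```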

See 5.8

Proof.

Let $h\in H_{\mathrm{SA}}$ with a corresponding $g:\mathcal{Z}\rightarrow\mathbb{R}$ such that $h(z)=\operatorname{sign}(g(z))$ for all $z\in\mathcal{Z}$. Choose an $a_{+}>\sum_{i=1}^{k_{1}}\binom{n}{i}$ and $a_{-}\in(-1,0)$. Define a weight function $\bm{w}$ on sets of size at most $k_{1}$ as follows:

$$w(z)=\begin{cases}a_{+}&\text{if }|z|=k_{1},\ h(z)=1\\ a_{-}&\text{otherwise.}\end{cases}$$

Let $h^{\prime}\in H_{k_{1}}$ be the function defined by the binary weight function $\bm{w}$ above. We argue that $h(\phi_{h}(x))=h^{\prime}(\phi_{h^{\prime}}(x))$ for every $x\in\mathcal{X}$. For every $x\in\mathcal{X}$, $h(\phi_{h}(x))=1$ if and only if there is a $z\in\mathcal{Z}$ with $z\subseteq x$ such that $h(z)=1$. Since $g$ is sub-additive, we have

$$0\leq g(z)\leq\sum_{z^{\prime}\subseteq z,\ z^{\prime}\neq z,\ z^{\prime}\in\mathcal{Z}}g(z^{\prime}). \qquad(15)$$

A simple recursive argument then implies that $h(\phi_{h}(x))=1$ if and only if there is a $z\subseteq x$ with $|z|=k_{1}$ and $h(z)=1$, and hence $w(z)=a_{+}$. From Theorem 4.6, this implies $h(\phi_{h}(x))=1$ if and only if $h^{\prime}(\phi_{h^{\prime}}(x))=1$. ∎

See 5.9

Proof.

We first argue that the VC dimension of $H_{k}$ is at most $\binom{q}{k}$. Let $d=\sum_{i\in[1,k]}\binom{n}{i}$, and index the coordinates of vectors in $\{0,1\}^{d}$ by the sets $z\subseteq E$ (the ground set) with $|z|\leq k$. Then each $z\in\mathcal{Z}$ can be represented by a binary vector $e_{z}\in\{0,1\}^{d}$, whose entry indexed by $z^{\prime}$ is $1$ if and only if $z^{\prime}\subseteq z$. Further, let $w\in\{a_{-},a_{+}\}^{d}$ be a binary weight vector with $a_{-}$ and $a_{+}$ as in Def. 4.3. Then, from the definition of $H_{k}$, for each $h\in H_{k}$ there is a $w_{h}\in\{a_{-},a_{+}\}^{d}$ such that (a) $h(z)=\operatorname{sign}(\langle w_{h},e_{z}\rangle)$ for all $z\in\mathcal{Z}$, and (b) every entry of $w_{h}$ indexed by a $z^{\prime}$ with $|z^{\prime}|<k$ equals $a_{-}$. From this we observe that the VC dimension of $H_{k}$ is at most $\binom{q}{k}$, since each $h\in H_{k}$ is determined by the realization of the binary weights on the entries indexed by the $\binom{q}{k}$ sets of size exactly $k$. The first part of the theorem now follows by noting that the first bound is the agnostic PAC generalization guarantee for an algorithm minimizing the empirical error in the standard classification setting with VC dimension at most $\binom{q}{k}$. To prove the second part, we have $Y\in F_{k}$, and hence the approximation error is zero, that is, $\varepsilon(h^{*})=0$ (from Lemma E.1). Further, Alg minimizes the empirical error (Theorem 4.7) and returns an $\hat{h}$ with zero empirical error. ∎

See 5.10

Proof.

The $v$ is constructed as in the proof of Lemma 5.7, whose notation we reuse: $z_{e}$ is a size-$k_{2}$ subset, and, as argued there, for $k\in[1,k_{2}]$ the optimal $h^{*}_{k}$ is such that for all $z\in\mathcal{Z}$, $h^{*}_{k}(z)=1$ if and only if there exists a size-$k$ subset $z^{\prime}\subseteq z$ which is also a subset of $z_{e}$.

Now let $k,k^{\prime}\in[1,k_{2}]$ with $k<k^{\prime}$. Since $D$ is the uniform distribution, to show that the system's utility is larger for $k$ than for $k^{\prime}$ it suffices to show that

$$\sum_{x\in\mathcal{X}}\mathds{1}\{h^{*}_{k}(\phi_{h^{*}_{k}}(x))=1\}>\sum_{x\in\mathcal{X}}\mathds{1}\{h^{*}_{k^{\prime}}(\phi_{h^{*}_{k^{\prime}}}(x))=1\}.$$

From the proof of Lemma 5.7 and Theorem 4.6, it follows that

$$\sum_{x\in\mathcal{X}}\mathds{1}\{h^{*}_{k}(\phi_{h^{*}_{k}}(x))=1\}=\sum_{x\in\mathcal{X}}\mathds{1}\{f_{h^{*}_{k}}(x)=1\}=\sum_{\ell=k}^{k_{2}}\binom{k_{2}}{\ell}\binom{q-k_{2}}{n-\ell}.$$

Similarly,

$$\sum_{x\in\mathcal{X}}\mathds{1}\{h^{*}_{k^{\prime}}(\phi_{h^{*}_{k^{\prime}}}(x))=1\}=\sum_{\ell=k^{\prime}}^{k_{2}}\binom{k_{2}}{\ell}\binom{q-k_{2}}{n-\ell}.$$

Since $k<k^{\prime}$, we have

$$\sum_{\ell=k}^{k_{2}}\binom{k_{2}}{\ell}\binom{q-k_{2}}{n-\ell}>\sum_{\ell=k^{\prime}}^{k_{2}}\binom{k_{2}}{\ell}\binom{q-k_{2}}{n-\ell},$$

implying that the system's utility is larger for $k$ than for $k^{\prime}$. ∎

See 5.11

Proof.

In Lemma 5.10, we showed that there exists a user with valuation $v$ such that for all $k,k^{\prime}\in[1,k_{2}]$ with $k<k^{\prime}$, the system obtains higher utility against the optimal choice function in $H_{k}$ than in $H_{k^{\prime}}$. Since the $k$ a user can choose is bounded by $k_{2}$, a lower $k_{2}$ maximizes the system's worst-case payoff. ∎