Machine Learning for Strategic Inference

In-Koo Cho Department of Economics, Emory University, Atlanta, GA 30322 USA [email protected] https://sites.google.com/site/inkoocho and Jonathan Libgober Department of Economics, University of Southern California, Los Angeles, CA 90089 USA [email protected] http://www.jonlib.com/

Abstract.

We study interactions between strategic players and markets whose behavior is guided by an algorithm. Algorithms use data from prior interactions and a limited set of decision rules to prescribe actions. While as-if rational play need not emerge if the algorithm is constrained, it is possible to guide behavior across a rich set of possible environments using limited details. Provided a condition known as weak learnability holds, Adaptive Boosting algorithms can be specified to induce behavior that is (approximately) as-if rational. Our analysis provides a statistical perspective on the study of endogenous model misspecification.

We thank Juan Carrillo, Jason Hartline, Navin Kartik, Roger Moon, Xiaosheng Mu, Guofu Tan, Ashesh Rambachan, Joel Sobel, Erik Strand, and especially Grigory Franguridi, for helpful conversations and comments, and seminar audiences at AMETS, Emory, the Econometric Society World Congress, the NSF/NBER/CEME Conference on Mathematical Economics, Rochester, UC-San Diego, and USC. This project started when the first author was visiting the University of Southern California. We are grateful for hospitality and support from USC. Financial support from the National Science Foundation is gratefully acknowledged.

1. Introduction

The importance of algorithms in guiding economic behavior is already significant, and likely to only be more so in the years to come. But since a number of economic phenomena rely crucially on the presence of rational individuals on both sides of a particular interaction, an open question is whether traditional rational models apply to such situations. Of course, economists recognize that people often fail to act rationally, with certain consistent failures having empirical implications. But the extent to which algorithms are susceptible to errors is a separate issue, and one that should be addressed for economists to speak to the increasing number of applications where algorithm design plays a central role.

This paper introduces a framework to study the question of whether and when algorithms can approximate rational behavior. In our model, a rational, strategic player (who we refer to as a sender) chooses a strategy when interacting with an algorithm that prescribes actions to a stream of short lived actors (who we refer to as receivers).¹¹1The terminology of sender and receiver is to highlight the role of our model’s timing; but many of our applications are beyond where these labels are traditionally applied. A distinguishing feature of our exercise is our focus on the problem of strategic inference: Specifically, we assume that the sender commits to a strategy which maps states into actions, and so a rational receiver would update beliefs about the best reply after observing the sender’s action. A rational receiver would thus make an inference regarding a payoff relevant state using knowledge of the sender’s strategy. On the other hand, the algorithm has access to observations on what transpired in previous interactions. We are interested in comparing the rational receiver’s strategy with the strategy induced by the algorithm, with a focus on determining when rationality can be replicated. We are particularly interested in the case where the algorithm seeks to provide these recommendations without using details of the sender’s objective or the particular setting at hand (for instance, as a sales platform might when designing an algorithm to be used for a variety of different products). In other words, we are interested in finding an algorithm capable of inducing rationality under as wide a set of environments as possible.

In our model, the algorithm produces recommendations using data from relevant interactions in the past (where data consists of sender actions and ex-post payoffs). These recommendations are determined by finding a best fitting decision rule. By best fitting, we mean that there is minimal error, with errors being weighted according to some specific objective. The main assumption here is that the algorithm can determine the best fitting rule from a particular set of decision rules, which we refer to as a hypothesis class (following the machine learning literature). A crucial limitation is that this hypothesis class is restricted and must be specified in advance, so that not every feasible mapping from messages into actions can be fit to the data. Thus, there is no a priori guarantee that finding the best fitting rule within the given set yields the rational reply; whether this property holds will depend upon the sender’s strategy.

Our theoretical question can be phrased as follows: Do these limitations of algorithms inhibit the ability to prescribe actions which are (approximately) rational? We show that, while constraints may be exploited by strategic actors, an algorithm designer with particular capabilities can induce the as-if rational outcome in equilibrium. The answer to this question thus depends on what we assume the algorithm is capable of. Our contribution is to identify what some of these capabilities are.

Constraints on algorithms of the kind in our model are often studied in the machine learning literature, which typically treats the data generating process as exogenous. Our goal, however, is to perform a similar algorithm design exercise, but in a strategic setting. To make sense of the restrictions on classifiers that can be fit to data, it may be instructive to note that in typical machine learning problems, a simple prediction (for instance, a “yes-no” recommendation) is sought for an observation among a very large set of possibilities. Seeking to find the correct recommendation for each one may be intractable or undesireable (given data limitations), and so a simpler set may be used as a baseline. On the other hand, it may still be possible to construct a new decision rule if the algorithm specifies how this should be done in advance. In our model, this takes the form of assuming the algorithm is limited in what can be fit to the data, but is otherwise flexible, in a way we will make precise below.

One interpretation of this limitation is that the algorithm suffers from a form of model misspecification: the true optimal decision rule for a receiver may fall outside of the class of decision rules that can be prescribed by the algorithm. There are two notable differences from a standard model misspecification exercise, however. The first difference is that the algorithm in our framework is concerned explicitly with prescribing behavior, and not with the problem of inference per se. In the (currently very active) literature on model misspecification (see, for instance, \citeasnounEspondaandPouzo14), a decisionmaker is assumed to be potentially incorrect regarding the set of possible parameters, but otherwise uses an optimally chosen decision rule. We, on the other hand, are not (directly) interested in learning the underlying parameters, but rather making an optimal prediction. The second difference is that the extent to which the optimal prediction falls outside of the realm of considered models is endogenous in our setting. Since we allow algorithms to specify decision rules arbitrarily—instead constraining how models can be fit to the data—they are, in principle, able to expand the potential decision rules the receiver could use if it is specified how this should be done. As a result, the extent to which the algorithm is misspecified is endogenous to the constraints of the algorithm design problem.

What should one expect to happen given these limitations of an algorithm? On the one hand, in order for the algorithm to be able to give non-degenerate predictions without using detailed knowledge of the particular parameters of the receiver’s problem, a sufficiently rich set of classifiers should be used. We focus on cases where this criterion suggests using at least the set of single-threshold classifiers, which conditions a recommendation only on which side of a threshold the observable messages lie. However, since our setting requires strategic inference on the part of receivers, this class of hypotheses is susceptible to manipulation by a rational sender. For our purposes, \citeasnounRubinstein93 identified the key economic force, studying a buyer-seller game where the buyer is restricted in the set of decision rules that can be utilized. Specifically, this paper showed that if a rational decisionmaker is restricted to use a single threshold classifier—i.e., one that makes the same decision on a given side of a fixed threshold—then the seller can price discriminate via a particular form of randomization which “fools” these buyers into making a decision which is suboptimal given the realized price.²²2The reasoning behind this result is as follows. First, the optimally chosen classifier chosen can do strictly better than simply randomizing the guess, implying that the seller can exploit the incentives of the buyer in order to manipulate the decision rule. On the other hand, it is impossible for threshold rules to implement the optimal decision with probability 1 when this rational rule is non-monotone in the price. The first point implies the buyer trades off against errors, and the second point implies that the tradeoff falls short of the fully rational response. As a result, the seller can force a different decision than would be rationally optimal for these buyers (with arbitrarily high probability). Our framework nests \citeasnounRubinstein93 as a special case, but considers more general environments as well.

Our analysis elucidates a tension between the ability to fit rich and coarse sets of models. As \citeasnounRubinstein93 shows, if a decisionmaker is limited in the decision rules that can be utilized, then there is a potential for exploitation. In order to combat this temptation, one may seek to add more possible replies to be fit to the data; in other words, to make the hypothesis class richer. Indeed, a decisionmaker could prevent the particular instance of exploitation he highlights by doing so. However, fitting richer decision rules may have other undesirable consequences, and may still fail to prevent a slightly more elaborate strategy from succeeding at exploitation. Above, we mentioned that this view is common in the machine learning literature; finding the best fitting model within a set of models may be computationally demanding if this set is very large. A goal of our paper is to highlight this tradeoff between fitting coarse models—which have attractive statistical properties, but poor behavioral properties—and rich models, for which the sitution is reversed.

Our proposed solution is to use the Adaptive Boosting algorithm (\citeasnounSchapireandFreund12), which specifies exactly how to construct a decision rule as a weighted combination of classifiers, with the weights specified by the algorithm. The algorithm requires (repeatedly) fitting a classifier to some distribution over prices and outcomes, from some set of baseline classifiers.

Returning to the particular question at hand, the requirement on the set of classifiers able to be fit is called weak learnability, and it is significantly less demanding than requiring all possible rational replies to be specified. We seek to highlight that this requirement is necessary and sufficient to overcome the problem of model misspecification mentioned above, i.e., the gap between the set of decision rules that can be fit to the data and those that a rational receiver can utilize. We provide results which show how to check it in several straightforward applications, particularly when resorting to single-threshold classifiers (which typically have natural interpretations).

To summarize, the answer to our theoretical question is that rationality can be ensured with the ability to (a) find a best-fitting decision rule from a class which satisfies weak learnability, and (b) combine such classifiers in a particular way (specified in advance). It is worth emphasizing one technical difference—due to our focus on a strategic inference problem—between our exercise and similar ones considered in computer science or machine learning where these issues have received more attention. In principle, the rational decision in our model is not observed if the sender uses a strategy that does not reveal the state given an observation. In a lemons problem, for instance, it may be that “low quality” is observed at some price, but that “high quality” is in fact more likely and that correspondingly a rational buyer (receiver) would choose a “buying” action. Therefore, in our problem, the payoff-maximizing decision must be inferred and constructed by the algorithm. One of the main results of this paper is that this added difficulty does not change the qualitative desirable properties of the algorithm, which we show using results from large deviations theory—though this does induce some modifications on precisely how good of an approximation the algorithm is able to guarantee.

Returning to the discussion of \citeasnounRubinstein93, we see that the issue with single threshold classifiers is that they are not strong learners (i.e., they cannot ensure the optimal decision is taken with probability 1 following any price), even though they are weak learners (i.e., they can outperform random guesses when chosen optimally). The remarkable property of the Adaptive Boosting algorithm is that weak learnability is sufficient to construct a classifier that yields a similar guarantee as under a model class satisfying strong learnability. It is interesting that part of the intuition for the main result in \citeasnounRubinstein93—which relies upon the buyer being able to strictly improve payoffs beyond a trivial default to induce a particular decision rule—exactly tells us how to overcome the main conclusion, once we have the algorithm in hand.

At first glance, it appears that there is a significant gap between decision rules satisfying weak learnability and those which induce rational replies. Rationality requires, in principle, very rich decision rules to be used, and for the performance of them to leave very little room for error. Weak learnability does not, and only requires a uniform improvement over a random guess. It is therefore perhaps surprising that in our exercise, the turns out to be no gap at all. Due to weak learnability, the apparent gap in rationality caused by the limitation in the decision rules that can be fit to data can be overcome by a clever choice of algorithm. The result is that the algorithm can induce rational behavior without knowing anything beyond the observed data from past interactions. In contrast, strong learnability (i.e., prescribing the optimal action with high probability) will usually require precise knowledge of the sender’s strategy.

We briefly mention that the algorithm design problem we study accommodates a rich possible action space, even with the same restrictions in the decision rules that can be fit to data. In particular, a version of the weak learnability condition in settings with two possible receiver actions also applies to settings with an arbitrary finite number of actions. This is in sharp contrast to many other papers in the large literature on “decisionmakers as statisticians” (reviewed below), which use similar motivation to study departures from rationality. These papers have typically focused on the binary action case. This limitation is very natural—many of the key results from machine learning which arise when there are two possible predictions do not extend easily (or even at all) to the case of multiple actions. However, we can handle this in our problem, suggesting our algorithm is of broader interest. We believe that this extension is important, as it shows our conclusions do not hinge on other artificial limitations on the environment.

Our exercise provides formalism within which machine learning methods can be applied to answer new questions relevant to microeconomic theorists, and visa versa. Our model is deliberately abstract, in order to provide general principles guiding when the problem of model misspecification can be overcome. One key message is that while it is not possible to guarantee that rationality emerges for arbitrarily data generating process, it is possible if the data generating process is endogenous (due to the strategic player) to the statistical algorithm. This argument requires some additional steps using incentives of the actors to demonstrate that the resulting output does in fact correspond to what is traditionally thought of as subgame perfection. This endogeneity issue makes the problem no longer a pure statistical exercise. The modifications our analysis requires extend beyond the initial need to show that it is possible to do better than random guessing in this environment. As our analysis elucidates, AdaBoost is capable of handling a particular kind of unboundedness in the cardinality of the action space. It is thus necessary to discipline the environment further in order to achieve our results.

2. Literature

This paper takes the framework of PAC learnability, familiar from machine learning, and applies it to a strategic setting. Within economics, this agenda is most closely related to the literature on learning in games when behavior depends on a statistical method. The single-agent problem is a particular special case, and this case is the focus of \citeasnounAlNajjar09 and \citeasnounAlNajjarandPai2014. However, since we are focused on a strategic setting, the data the algorithm receives is endogenous in our setting. In contrast, their benchmarks correspond to the case of exogenous data. This problem is also studied in \citeasnounSpiegler2016, who focuses on causality and defines a solution concept for behavior that arises from individuals fitting a directed acyclic graph to past observations. More recently, \citeasnounZhaoetal2020 take a decision-theoretic approach in a single-agent setting with lotteries, showing how a relaxation of the independence axiom leads to a neural-network representation of preferences.

Taking these approaches to games, the literature has still for the most part focused on settings where the interactions between players is static, ruling out the main environments we are interested in here.³³3By itself the distinction may not immediately seem significant—after all, a Nash equilibrium in an extensive form game involves choosing a strategy to best respond to the opponent, and is usually stated as a single (and thus static) choice. However, the additional restriction to binary action or 0-1 prediction problems makes nesting our problem less straightforward. In contrast, our setting is a simple, two-player (and two-move) sequential game. We also note that much (though not all) of this literature focuses on binary prediction problems, whereas we discuss how to specify algorithms in the general finite action cases as well. \citeasnounCherryandSalant2019 discuss a procedure whereby players’ behavior arises from a statistical rule estimated by sampling past actions. This leads to an endogeneity issue similar to the one present in our environment, i.e., an interaction between the data generating process and the statistical method used to evaluate it. \citeasnounEliazandSpiegler2018 study the problem of a statistician estimating a model in order to help an agent take an action, motivated as we are by issues involved with the interaction between rational plays and statistical algorithms. \citeasnounLiang2018 also focuses on games of incomplete information, asking when a class of learning rules leads to rationalizable behavior. Studying model selection in econometrics, \citeasnounOleaOrtolevaPaiPrat2019 consider an auction model and ask which statistical models achieve the highest confidence in results as a function of a particular dataset.⁴⁴4On the question of algorithms in particular, one concern is that the algorithm design problem may be susceptible to bias or induce unwanted discrimination when implemented, relative to rationality. See \citeasnounRambachanetal2020 for an analysis of these issues and how they may be overcome.

On the other hand, the literature on learning in extensive form games has typically assumed that agents experiment optimally, and hence embeds notion of rationality on the part of agents which we dispense with in this paper. Classic contributions include \citeasnounFudenbergandKreps1995, \citeasnounFudenbergandLevine1993 and \citeasnounFudenbergandLevine06. Most of this literature has focused on cases where there is no exogenous uncertainty regarding a player’s type, and asking whether self-confirming behavior emerges as the outcome. An important exception is \citeasnounFudenbergandHe2018, who study the steady-state outcomes from experimentation in a signalling game. While a rational agent in our game would need to form an expectation over an exogenous random variable, signalling issues do not arise because our sender has commitment.

Perhaps closest in motivation is the computer science literature studying how well algorithms perform in strategic situations, as well as how rational actors may respond when facing them. \citeasnounBravermanetal2018 consider optimal pricing of a seller repeatedly selling to a single buyer who repeatedly uses a no-regret learning algorithm. They show that, on the one hand, while a particular class of learning algorithms (i.e., those that are mean-based) are susceptible to exploitation, others would lead to the seller’s optimal strategy simply being to use the Myersonian optimum. \citeasnounDengetal2019 also study strategies against no-regret learners in a broad class of games without uncertainty, and consider whether a strategic player can guarantee a higher payoff than what would be implied by first-mover advantage. \citeasnounBlumetal2008 consider the Price of Anarchy (i.e., the ratio between first-best welfare and worst-case equilibrium welfare), and show in a broad class of games that this quantity is the same whether players use Nash strategies or regret-minimizing ones. \citeasnounNekipelovetal2015 assume players in a repeated auction use a no-regret learning algorithm, making similar behavioral assumptions as we do here. Their interest is in inferring the set of rationalizable actions from data.

While our motivation is very similar—and indeed, we seek to incorporate several aspects of this literature’s conceptual framework—there are three notable differences. First, this literature typically assumes particular algorithms or principal objectives (such as no-regret learning) which differ from traditional Bayesian rationality. In contrast, we maintain a Bayesian rational objective for the seller, and also focus on an algorithm designer seeking to maximize the expected payoffs of agents. Second, we focus on relating the incentives of the rational player and the algorithm’s capabilities, and study the extent to which different assumptions on the algorithm design problem influence the task of approximating rationality. Our main result articulates how different action spaces for the algorithm designer yield different results regarding whether and when the outcome will approximate the rational benchmark. Lastly, our general framework focuses on settings with strategic inference—that is, where the payoffs following a given principal action are state-dependent—and thus covers a set of single-agent applications which extend beyond particular pricing settings, where most (though admittedly not all) of this literature has focused. In particular, the settings discussed in this literature do not cover Lemons markets settings (which \citeasnounRubinstein93 falls under, for example) or Persuasion, which form our primary starting point.⁵⁵5An important exception is \citeasnounCamaraEtAl2020, who study an environment covering many of our same applications such as Bayesian Persuasion. However, they still maintain the other two distinguishing features, focusing on a regret objective for the principal, as well as particular no-regret assumptions for the agent. Still, we emphasize that both our paper and theirs focuses on environments where the principal/sender chooses a state-dependent strategy. This leads to the aforementioned endogeneity between the data generating process (induced by the principal) and the choices of the algorithm/learner—this emerges due to the fact that the same sender action may induce two distinct replies from the algorithm following two distinct sender strategies. In their setting, this endogeneity motivates the use of “policy-regret” as an objective for the principal (due to their reinforcement learning approach to the principal’s problem). While we do not use a regret objective for the principal, see \citeasnounAroraEtAl2012 and \citeasnounAroraEtAl2018 for more on the differences between these notions. As a result, new technical issues (e.g., dealing with residual uncertainty in the correct actions) is not addressed in these papers to our knowledge. Despite these differences, our hope is that this paper inspires further connection between the economics literature on decisionmakers as statisticians and the computer science literature on strategic choices against classes of algorithms. It appears to us that these results from computer science have not yet been fully appreciated in economics.

3. (Sender-Receiver) Stage Games

We first describe the stage game interaction in which the algorithm designer seeks to prescribe actions on behalf of myopic actors (who may be, for instance, receivers, buyers, or agents, depending on the particular setting of interest). The stage game features a strategic actor as well. That said, our exposition in this section addresses neither how the strategic actor chooses her strategy, nor how the algorithm is determined. This is done in Section 4, which describes the interaction which yields these objects and the relevant objective for each actor.

3.1. Actions and Parameters

The stage game is a sender-receiver game in which an informed sender makes the first move. We often call the sender the (informed) principal, and the receiver the agent, as our lead example is built on the informed principal problem of \citeasnounMaskinTirole92. However, our model also describes a sender-receiver game with sender commitment, as in \citeasnounKamenicaGentzkow2011.

Let $\Theta$ be the set of types endowed with a prior distribution $\pi$ which is common knowledge among players. This type is payoff relevant to both the sender and the receiver. Define $\pi(\theta)$ as the probability that type $\theta\in\Theta$ is realized. Throughout the paper, we only consider $\pi$ with finite support. Conditioned on the realized value of $\theta\in\Theta$ , the sender takes an action $p\in{\mathcal{P}}\subset{\mathbb{R}}^{n}$ where ${\mathcal{P}}$ is a compact subset of ${\mathbb{R}}^{n}$ . Our analysis in most of the paper will assume further that $\mathinner{\!\left\lvert\mathcal{P}\right\rvert}<\infty$ , although we discuss how to modify this assumption in Section 5.3.3 (to allow for, for instance, continuous distributions). A strategy of the Sender is:

\sigma\mathrel{\mathop{\mathchar 58\relax}}\Theta\rightarrow\Delta({\mathcal{P}}),

where $\Delta(X)$ denotes the set of probability distributions over a set $X$ . We let $\Sigma$ denote the set of feasible strategies for the sender, and importantly, assume that this strategy is determined (and committed to) in the Algorithm Game described in Section 4. Conditioned on $p$ (but not $\theta$ ), the agent chooses $a\in A$ according to

r\mathrel{\mathop{\mathchar 58\relax}}{\mathcal{P}}\rightarrow\Delta(A).

We assume $\mathinner{\!\left\lvert A\right\rvert}<\infty$ , and in our analysis we treat the case of $\mathinner{\!\left\lvert A\right\rvert}=2$ and $\mathinner{\!\left\lvert A\right\rvert}>2$ separately. The payoffs of the sender and the receiver from $(\theta,p,a)$ are $u(\theta,p,a)$ and $v(\theta,p,a)$ .

The timing of the moves in the stage game is as follows:

$S_{1}.$

An exogenous state $\theta\in\Theta$ is realized according to $\pi$ , with only the sender observing the realized state $\theta$ .
$S_{2}.$

The sender’s action $p\in{\mathcal{P}}$ is realized according to $\sigma(p\mathrel{\mathop{\mathchar 58\relax}}\theta)$ .
$S_{3}.$

The receiver takes action $a\in A$ conditioned on $p$ .
$S_{4}.$

Payoffs are realized according to $u(\theta,p,a)$ and $v(\theta,p,a)$ .

For instance, if we interpret $p=(p_{1},\ldots,p_{n})$ as a contract, and $a\in A=\{-1,1\}$ as “reject” ( $a=-1$ ) or “accept” ( $a=1$ ), the stage game is a model of the informed principal (\citeasnounMaskinTirole92). If $p$ is interpreted as a message sent by a worker, and $a\in A$ as the wage paid by the firm, then the stage game becomes a signaling game (\citeasnounSpence73). For now, we place no further restrictions on $u(\theta,p,a)$ and $v(\theta,p,a)$ , though these are often implicit in the economic problem of interest.

3.2. Payoffs and the Rational Benchmark

Describing the outcomes of the above interactions when the receiver is rational is a familiar exercise. In this case, his optimization problem is

\max_{a\in A}~{}~{}\sum_{\theta\in{\sf supp}\pi}v(\theta,p,a)\pi(\theta\mathrel{\mathop{\mathchar 58\relax}}p)

where $\pi(\theta\mathrel{\mathop{\mathchar 58\relax}}p)$ is the posterior probability assigned to $\theta$ conditioned on $p$ . If $p$ is used with a positive probability by $\sigma$ , then $\pi(\theta\mathrel{\mathop{\mathchar 58\relax}}p)$ is computed by Bayes rule:

\pi(\theta\mathrel{\mathop{\mathchar 58\relax}}p)=\frac{\sigma(p\mathrel{\mathop{\mathchar 58\relax}}\theta)\pi(\theta)}{\sum_{\theta^{\prime}}\sigma(p\mathrel{\mathop{\mathchar 58\relax}}\theta^{\prime})\pi(\theta^{\prime})}.

We define the rational label to denote the receiver’s strategy were they rational. More precisely, the optimal response is a function of the chosen $\sigma\in\Sigma$ and the realized $p\in\mathcal{P}$ :

y^{R}\mathrel{\mathop{\mathchar 58\relax}}\Sigma\times{\mathcal{P}}\rightarrow A

is a solution to the following optimization problem:

\sum_{\theta\in{\sf supp}\pi}v(\theta,p,y^{R}(\sigma,p))\pi(\theta\mathrel{\mathop{\mathchar 58\relax}}p)\geq\sum_{\theta\in{\sf supp}\pi}v(\theta,p,a)\pi(\theta\mathrel{\mathop{\mathchar 58\relax}}p)\qquad\forall a\in A

where $\pi(\theta\mathrel{\mathop{\mathchar 58\relax}}p)$ is computed via Bayes rule whenever $\sum_{\theta}\sigma(p\mathrel{\mathop{\mathchar 58\relax}}\theta)\pi(\theta)>0$ .⁶⁶6For a fixed $\sigma$ , $y^{R}(\sigma,\cdot)\mathrel{\mathop{\mathchar 58\relax}}{\mathcal{P}}\rightarrow A$ is a strategy of the agent, satisfying sequential rationality.

Define $\sigma^{R}$ as a best response of the sender against a Bayesian rational receiver with perfect foresight:

\sum_{\theta,p,a}u(\theta,p,a)\sigma^{R}(p\mathrel{\mathop{\mathchar 58\relax}}\theta)y^{R}(\sigma^{R},p)\pi(\theta)\geq\sum_{\theta,p,a}u(\theta,p,a)\sigma(p\mathrel{\mathop{\mathchar 58\relax}}\theta)y^{R}(\sigma,p)\pi(\theta)\qquad\forall\sigma\in\Sigma.

By the construction, $(\sigma^{R},y^{R})$ constitutes a perfect Bayesian equilibrium in the stage-game with a rational receiver.

3.3. Examples of Stage Games

Before proceeding to the description of the algorithm game, we describe a few of the stage game interactions that are of primary interest. We will return to these later in order to illustrate the incentives for each player.

3.3.1. Insurance

The following is borrowed from \citeasnounMaskinTirole92. Suppose that the principal (sender) is a shipping company seeking to purchase insurance from an insurance company, an agent (receiver) that is seeking to delegate the decision of whether to offer the terms put forth by the shipping company. The principal seeks insurance every period, but faces risk (e.g., due to the location of shipping demand) that is idiosyncratic every period.

In this case, we imagine the principal choose terms within some compact set $\mathcal{P}\subset\mathbb{R}^{2}$ , where $p=(x,q)$ denotes a policy which provides a payment $x$ in the event of a loss, and costs an amount $q$ . If $\theta\in\{L,H\}$ (with $L<H$ ) denotes the probability of a loss, then the principal’s utility is:

u(\theta,p,a)=\begin{cases}(1-\theta)f(I-q)+\theta f(I-q-L+x)&a=1\\ (1-\theta)f(I)+\theta f(I-L)&a=-1\end{cases},

for some concave $f$ . The agent’s utility is:

v(\theta,p,a)=\begin{cases}q-\theta x&a=1\\ 0&a=-1\end{cases}

It is natural to consider $\mathcal{P}$ whereby, against a rational buyer, the principal would seek a high level of insurance when risk is high (i.e., $\theta=H$ ), and avoid insurance when risk is low (i.e., $\theta=L$ ). In contrast, the agent’s payoff may be decreasing in the quantity of insurance when $\theta=H$ , while increasing in the quantity of insurance when $\theta=L$ .

3.3.2. Labor Market Signaling

Our framework is general and can be expanded to cover other settings as well. Let us consider a labor market signaling model. Here, the “receiver” takes the role of the firm and the “sender” takes the role of the worker from the Spence signalling model (as in, for instance, \citeasnounMaskinTirole92). The true state is the productivity of the worker $\theta\in\Theta=\{H,L\}$ where $\pi(H)=\pi(L)=\frac{1}{2}$ : $H>L$ . Conditioned on $\theta$ , a worker chooses $p$ which we interpret as education level. Her strategy is

\sigma\mathrel{\mathop{\mathchar 58\relax}}\Theta\rightarrow{\mathcal{P}}\subset{\mathbb{R}}_{+}.

The payoff function of the sender is

u(\theta,p,a)=a-\frac{p}{\theta+1}

We abstract away the competition among multiple firms in the labor market. Conditioned on $p$ , the labor market wage is determined according to the expected productivity $\mathbf{E}(\theta\mathrel{\mathop{\mathchar 58\relax}}p)$ conditioned on $p$ . The firm has to pay the worker the equal amount of the expected productivity because of (un-modeled) competition among firms. The receiver’s goal is to make an accurate forecast about the expected productivity of the worker. The payoff of the receiver is

v(\theta,p,a)=-(\theta-a)^{2}

If the support of $\sigma(p\mathrel{\mathop{\mathchar 58\relax}}H)$ is disjoint from the support of $\sigma(p\mathrel{\mathop{\mathchar 58\relax}}L)$ , $\sigma$ is a separating strategy. If a separating strategy is an equilibrium strategy, then the equilibrium is called a separating equilibrium. We often focus on the Riley outcome, which maximizes the ex ante expected payoff of the principal among all separating equilibria.

3.3.3. Monopoly Market

In the first two examples, only the sender has private information. We can allow the stage game interaction to feature additional parameter observed only by the receiver, denoting this by $i$ . We denote the sender’s payoff by $u(\theta,p,a,i)$ , and the receiver’s payoff by $v(\theta,p,a,i)$ . For example, the sender may interact with some receivers who are algorithmic, and others who are fully rational. The agent knows whether he is algorithmic or fully rational. The principal does not observe the type of an agent, but only knows the probability distribution. Indeed, the setting of \citeasnounRubinstein93 features such a dichotomy, as we discuss.

While \citeasnounRubinstein93 differs expositionally, we review the key ideas and describe how it falls under our framework. Suppose $\theta\in\Theta=\{L,H\}$ . $v_{\theta}$ is the marginal utility of the good where $v_{H}>v_{L}>0$ . The prior probability distribution is $\pi(H)=\pi(L)=\frac{1}{2}$ . The seller observes $\theta$ . The seller chooses a price $p\in[v_{L},v_{H}]\subset{\mathcal{P}}\subset{\mathbb{R}}$ , conditioned on $\theta\in\Theta$ . The action of a buyer is $a\in A=\{-1,1\}$ . A buyer responds to $p\in{\mathcal{P}}$ by purchasing $(a=1)$ or not purchasing $(a=-1)$ the good at $p$ .

The seller is facing a unit mass of infinitesimal buyers, who can be either type 1 or type 2. The proportion of type 1 buyer is $r\in(0,1)$ . A buyer observes his type, but the seller does not observe the type of a buyer. The buyers differ in terms of the cost of sales. If $\theta=L$ , the product costs $c_{L}$ for the seller regardless of the types of the buyer. If $\theta=H$ , the product costs $c_{i}$ to serve type $i$ buyer $(i\in\{1,2\})$ . We assume

	$\displaystyle c_{1}>v_{H}>c_{2}>v_{L}>c_{L}$		(3.1)
	$\displaystyle rc_{1}+(1-r)c_{2}>v_{H}$		(3.2)

so that the agent is exposed to the lemon’s problem. We focus on $\pi$ supported on parameters which satisfy (3.1) and (3.2).

A buyer generates utility only if he purchases the good, whose payoff function is

v(\theta,p,a,i)=\begin{cases}0&\text{if }\ a=-1\\ v_{\theta}-p&\text{if }\ a=1.\end{cases}

regardless of $i$ . The payoff of the seller is

u(\theta,p,a,i)=\begin{cases}0&\text{if }\ a=-1\\ p-c_{L}&\text{if }\ a=1,\ \theta=L\\ p-c_{i}&\text{if }\ a=1,\ \theta=H.\end{cases}

The unique equilibrium strategy of the seller is

\sigma^{R}(\theta)=\begin{cases}v_{L}&\text{ if }\ \theta=L\\ v_{H}&\text{ if }\ \theta=H.\end{cases}

The buyer’s equilibrium strategy is

y^{R}(\sigma^{R},p)=\begin{cases}1&\text{if }\ p\leq v_{L}\\ -1&\text{if }\ p>v_{L}.\end{cases}

The trading occurs only if $\theta=L$ , and therefore, the equilibrium is inefficient. Note that the construction of $y^{R}$ requires a precise information about $v_{L}$ .

3.4. Introducing Time

Our question of interest is whether the receiver can learn the rational label $y^{R}$ , if the stage game is repeated over time. As an intermediate step toward defining algorithm games, we describe our approach and assumptions involved with this step. In the next section, we discuss the algorithm choice that occurs on top of this.

By expanded stage game, we refer to a repetition of the stage game interaction, played over discrete time $t=1,2,\ldots$ , where the stage game interactions described previously occur at every $t\geq 1$ . Let $(\sigma_{t},r_{t})$ be the pair of strategies by the sender and the receiver in period $t$ .⁷⁷7For now, we are intentionally vague about the strategy space of each player in the expanded stage game, as this is described in the next section. The true state $\theta$ is drawn IID across periods according to $\pi$ and the pair $(p_{t},a_{t})$ of the price and the action in period $t$ is selected by $(\sigma_{t},r_{t})$ . In this case, the expected payoff of the sender is⁸⁸8Our results are most elegantly stated in the undiscounted limit. Prior versions of this paper considered the case where future payoffs were discounted at rate $\delta<1$ ; the main lessons remain valid for $\delta$ sufficiently large, although there are some added technical difficulties in the analysis of Section 5.3.3 this introduces.

\lim_{T\rightarrow\infty}\mathbf{E}\frac{1}{T}\sum_{t=1}^{T}u(\theta_{t},p_{t},a_{t})\pi(\theta_{t})\sigma_{t}(p_{t}\mathrel{\mathop{\mathchar 58\relax}}\theta_{t})r_{t}(a_{t}\mathrel{\mathop{\mathchar 58\relax}}p_{t})

(3.3)

and the expected payoff of the receiver is

\lim_{T\rightarrow\infty}\mathbf{E}\frac{1}{T}\sum_{t=1}^{T}v(\theta_{t},p_{t},a_{t})\pi(\theta_{t})\sigma_{t}(p_{t}\mathrel{\mathop{\mathchar 58\relax}}\theta_{t})r_{t}(a_{t}\mathrel{\mathop{\mathchar 58\relax}}p_{t}).

(3.4)

4. Algorithm game

Having outlined the basic timing of moves, we now describe the “super” game which determines the player’s strategy in the expanded stage game. We refer to this as an algorithm game. Throughout this paper, we assume that the sender (principal) is fully rational, but the strategic choice of the receiver (agent) must be delegated to an algorithm.

4.1. Choices of Algorithms

We will refer to the strategy a receiver uses—which is output by the algorithm at every time—as a classifier, in line with the machine learning and computer science literature:

Definition 4.1.

A classifier is a function

\gamma\mathrel{\mathop{\mathchar 58\relax}}P\rightarrow A.

This may additionally be referred to as either a strategy or a forecasting rule.

In order to construct the classifier, the algorithm faces some computational constraints. More precisely, we assume that there is a fixed set of classifiers $\mathcal{H}$ (referred to as the hypothesis class) for which the algorithm can solve the following problem:

\min_{h\in\mathcal{H}}~{}~{}~{}\sum_{p}\mathbf{1}[h(p)=y(p)]L(p),

(4.5)

for an arbitrary function $L$ and function $y\mathrel{\mathop{\mathchar 58\relax}}\mathcal{P}\rightarrow A$ . We refer to this step as finding the best fitting hypothesis. We can think of $L$ as being the cost of misclassifying a particular observation, which may vary. Note that, since we can add arbitrary constants to $L$ and normalize so that it sums to 1 over all $p$ , it is equivalent to assume the algorithm can solve

\max_{h\in\mathcal{H}}~{}~{}~{}\sum_{p}\mathbf{1}[h(p)=y(p)]d(p),

(4.6)

for a probability distribution $d$ over $p$ . This provides an alternative interpretation, regarding the classifier seeking to make the correct guess with the highest possible probability.

We treat the process of finding the best fitting hypothesis as a black box. The purpose of this paper, however, is to understand how the algorithm designer might utilize from additional capabilities, and across a variety of environments. One question is which kinds of additional capabilities are necessary. The main ones we will discuss are:

•

Constructing labels based on observations,
•

Creating classifiers derived from solutions to the above maximization,
•

Changing observations of $p_{t}$ to $\hat{p}_{t}$ , if the data is generated by a randomized rule.

One hypothesis class is of particular interest. Let ${\sf H}(\lambda,\omega)$ be a hyperplane in ${\mathbb{R}}^{n}$ : $\exists\lambda\in{\mathbb{R}}^{n}$ and $\omega\in{\mathbb{R}}$ such that

{\sf H}(\lambda,\omega)=\left\{p\in{\mathbb{R}}^{n}\ \mathrel{\mathop{\mathchar 58\relax}}\ \lambda p=\omega\right\}.

Define ${\sf H}_{+}(\lambda,\omega)$ as the closed half space above ${\sf H}(\lambda,\omega)$ :

{\sf H}_{+}(\lambda,\omega)=\left\{p\in{\mathbb{R}}^{n}\ \mathrel{\mathop{\mathchar 58\relax}}\ \lambda p\geq\omega\right\}.

Definition 4.2.

A single threshold (linear) classifier is a mapping

h\mathrel{\mathop{\mathchar 58\relax}}{\mathcal{P}}\rightarrow A

where $\exists a_{+},a_{-}\in A$ , $\lambda\in\mathbb{R}^{n}$ and $\omega\in\mathbb{R}$ such that

h(p)=\begin{cases}a_{+}&\text{if }\ p\in{\sf H}_{+}(\lambda,\omega)\\ a_{-}&\text{if }\ p\not\in{\sf H}_{+}(\lambda,\omega).\end{cases}

Definition 4.3.

Let $\Gamma$ be the set of all classifiers, and $\tilde{\Gamma}\subset\Gamma$ denote a subset of classifiers. A statistical procedure or algorithm is an onto function

\tau\mathrel{\mathop{\mathchar 58\relax}}\mathcal{D}\rightarrow\tilde{\Gamma},

where $\mathcal{D}$ is a set of histories, $\mathcal{T}$ is the set of feasible algorithms (i.e., a subset of the set of functions from $\mathcal{D}$ into $\tilde{\Gamma}$ ).

What $\mathcal{D}$ consists of is very much problem specific. In a typical learning model, we assume that the receiver observes the realized outcome $(p_{t},a_{t})$ in period $t$ but also can access some information about the performance of his choice $a_{t}$ to achieve his goal. For example, if the goal of the receiver is to learn the rational label $y^{R}$ , a natural candidate would be a sufficient statistics of the ex-post payoff $v(\theta_{t},p_{t},a_{t})$ .

One specification of $\mathcal{T}$ emerges from not having any restrictions on $\tilde{\Gamma}$ at all. In general, the set $\tilde{\Gamma}$ will be implicit in the description of the algorithm. Our main interest is in understanding which kinds of ${\mathcal{T}}$ enable the receiver to approximate the rational label $y^{R}$ .

4.2. Timing and Objectives

An algorithm game takes the interaction in the stage game as a starting point, and considers the outcome when, instead of having the receiver’s strategy emerging from Bayesian rationality, it instead emerges from fitting a model to past observations.

An algorithm game is a simultaneous move game under asymmetric information between the (rational) sender and the boundedly rational (“algorithmic”) receiver, built on the “expanded” stage game.

$A_{-1}.$

According to some prior distribution, nature selects the distribution $\pi$ of the underlying game from $\Pi$ , where $\Pi$ is a subset of probability distributions over $\Theta$ with finite support.
$A_{0}.$

Conditioned on realized $\pi$ , the sender commits to some strategy $\sigma$ . The receiver (or alternatively, an entity acting on the receiver’s behalf) commits an algorithm $\tau\in\mathcal{T}$ without observing the realized $\pi\in\Pi$ .
$A_{1}.$

The expanded stage game is played, with the receiver’s strategy in each period $t$ being $\tau(D_{t})(p)$ (i.e., the action specified by the algorithm following sender’s action $p$ at time $t$ ), with the algorithm adding the observation (which includes $p$ and ex-post utility following each receiver action) to the dataset at the end of each period.

These actions determine the realized payoffs by each player, as described in the previous section. Notice that we do not necessarily assume that any pair $\pi_{1},\pi_{2}\in\Pi$ have intersecting or even overlapping support (though this is also certainly allowed). Correspondingly, we emphasize we do not assume $\Theta$ is itself finite, even though all $\pi\in\Pi$ have finite support. Additionally, since we only assume the algorithm observes the receiver’s ex-post utility, $\theta$ itself need not ever observed by the algorithm. For instance, $\theta$ may reflect a production cost which only influences the sender’s payoff. In this case, while the algorithm would observe the receiver’s payoff from each action, they would not observe the seller’s cost.

The sender chooses $\sigma$ once and for all, with the action $p_{t}$ drawn i.i.d. over time. On the other hand, the receiver’s strategy in period $t$ is determined by the algorithm $\tau$ and history $D_{t}$ in period $t$ . The expression for the payoffs of the sender and the receiver are (3.3) and (3.4), respectively, where $r_{t}(\cdot\mathrel{\mathop{\mathchar 58\relax}}p)$ is given by $\tau(D_{t})(p)$ .

We consider the objectives of the rational player and the algorithmic player separately. The former is straightforward; given a sequence $(\theta_{t},p_{t},a_{t})$ , the rational player’s payoff (i.e., the sender’s payoff) function is simply the long run average expected payoff:

{\mathcal{U}}_{s}(\sigma,\tau)=\lim_{T\rightarrow\infty}\frac{1}{T}\sum_{t=1}^{T}\mathbf{E}u(\theta_{t},p_{t},a_{t}),

where $(p_{t},a_{t})$ is generated by $(\sigma,\tau)$ in period $t$ and the expectation is otherwise conditioned only on $\pi\in\Pi$ (recalling that $\theta$ is taken to be drawn IID). The objective of the rational sender is to maximize ${\mathcal{U}}_{s}$ by choosing $\sigma$ , conditioned on $\pi\in\Pi$ . The payoff function of the algorithm player is also the long-run average expected payoff:

{\mathcal{U}}_{r}(\sigma,\tau)=\lim_{T\rightarrow\infty}\frac{1}{T}\sum_{t=1}^{T}\mathbf{E}v(\theta_{t},p_{t},a_{t}).

Note that implicitly we the players do not discount future payoffs, we call the algorithm game an algorithm

We are interested in comparing the outcomes induced by the algorithm and the rational label $y^{R}(\sigma,p)$ introduced in the last section. We note that the comparison is potentially unfair because algorithms are more constrained in the decision rules that can be used. We therefore introduce a notion of rationality reflecting these limits:

Definition 4.4.

An algorithm $\tau$ is constrained rational, if $\forall\epsilon,\delta>0$ , $\forall\sigma$ , $\exists T$ such that $\forall t\geq T$ ,

\mathbf{P}\left(\sum_{\theta\in{\sf supp}\pi}v(\theta,p,\tau(D_{t})(p))\sigma(p\mathrel{\mathop{\mathchar 58\relax}}\theta)\pi(\theta)\geq\max_{h\in{\tilde{\Gamma}}}\sum_{\theta\in{\sf supp}\pi}v(\theta,p,h(p))\sigma(p\mathrel{\mathop{\mathchar 58\relax}}\theta)\pi(\theta)-\epsilon\right)\geq 1-\delta,

with the probability referring to uncertainty over $D_{t}$ . An algorithm $\tau$ is fully rational if $\tilde{\Gamma}$ is replaced by the set of all $h\mathrel{\mathop{\mathchar 58\relax}}\mathcal{P}\rightarrow A$ .

The “constrained” qualifier is due to the limits on the strategies that can be chosen by the receiver. A fully rational receiver would choose $y^{R}(\sigma,p)$ in the stage game; a constrained rational algorithm yields actions are as optimally as possible, given that its output must be within the expanded model class $\tilde{\Gamma}$ . We often regard $\gamma\in{\tilde{\Gamma}}$ as a forecasting rule and $\tau$ as a formal procedure to construct a (strong) forecasting rule.

In later sections, we will also discuss an important performance criterion of an algorithm is PAC (Probably Approximately Correct) learnability (\citeasnounShalev-ShwartzandBen-David14).

Definition 4.5.

Algorithm $\tau$ is PAC (Probably Approximately Correct) learnable if $\forall\sigma\in\Sigma$ , $\forall\epsilon>0$ , $\exists T$ such that $\forall t\geq T$

\mathbf{P}\left(\tau(D_{t})(p)\neq y^{R}(\sigma,p)\right)<\epsilon,

with the probability referring to uncertainty over $D_{t}$ and the realized $p$ .

A key difference between Definitions 4.4 and 4.5 is that the latter is a condition on the actions themselves and the decision rule, yet the former is a condition on the utility. In order to learn the equililbrium outcome $y^{R}(\sigma^{R},\cdot)$ , $\sigma^{R}$ must be a best response to the decision rule induced by the algorithm in the long run.

Definition 4.6.

An outcome $({\overline{\sigma}},\tau)$ of the algorithm game emulates $(\sigma^{R},y^{R})$ of the underlying stage game, if ${\overline{\sigma}}=\sigma^{R}$ and $\tau$ is fully rational.

The substance of the definition is that $\sigma^{R}$ is a best response to $\tau$ . Then, along the equilibrium path of the algorithm game, the receiver behaves as if he perfectly foresees $\sigma^{R}$ and responds optimally subject to the feasibility constraint imposed by ${\tilde{\Gamma}}$ .

4.3. Specifying $\mathcal{T}$

Our main interest will be in the case where $\mathcal{T}$ is restricted to emerge as the outcome of an ensemble algorithm.

Definition 4.7.

Classifier $H$ is an ensemble of ${\mathcal{H}}$ if $\exists h_{1},\ldots,h_{K}\in{\mathcal{H}}$ and $\alpha_{1},\ldots,\alpha_{K}\geq 0$ such that

H(\sigma,p)=\arg\max_{a}\sum_{k=1}^{K}\alpha_{k}\mathbf{1}[a=h_{k}(\sigma,p)]

Without loss of generality, we can assume that $\sum_{k=1}^{K}\alpha_{k}=1$ , since if not we can simply divide by this sum and obtain the same classifier. We can interpret $H$ as a weighted majority vote of $h_{1},\ldots,h_{K}$ . An ensemble algorithm constructs a classifier through a linear combination classifiers from $\mathcal{H}$ . Since the final classifier is constructed through a basic arithmetic operation, one can easily construct an elaborate classifier from rudimentary classifiers. Ensemble algorithms have been remarkably successful in real world applications (\citeasnounDietterich00).

The algorithms produce an output ensemble classifier according to a recursive scheme:

•

First, the loss function in (4.5), say $L_{1}$ , or probability distribution in (4.6), say $d_{1}$ , is taken to treat all observed sender actions symmetrically—that is, $L_{1}(p)=d_{1}(p)=1/G$ where $G$ is the number of elements in the support of mixed strategy $\sigma$ .
•

At each stage $k=1,\ldots$ , the best fitting hypothesis is found by solving either (4.5) or (4.6). The best fitting hypothesis is referred to as $h_{k}$ .
•

The term $\alpha_{k}$ is then determined, possibly as a function of the objective of the best fitting hypothesis.
•

Depending on $h_{k}$ and $\alpha_{k}$ , the loss function $L_{k}$ is updated to $L_{k+1}$ (or, in the case of distributions, $d_{k}$ is updated to $d_{k+1}$ ).
•

After repeating this iteration $K$ times, a classifier of the form of Definition 4.7 is output, which is used to determine the final choice of the receiver.

The ability to use an ensemble algorithm allows additional richness in the set of classifiers that can be used. There remain, however, a number of challenges:

•

Clearly, repeatedly solving the same problem will not yield different outcomes, and so to meaningfully expand $\mathcal{H}$ one needs to determine how to change the objective to be fit as well, and
•

Weights must be specified in advance.

Both of these are on top of the need to potentially alter the observed $p_{t}$ and determining the labels $y_{t}(\sigma,p_{t})$ to use for the observations, since the observed utility-maximizing decision need not coincide with the rational one ex-post.

Remark 4.8.

The reader may still wonder why algorithm design is necessary in the first place. For instance, if $y^{R}(\sigma,p)$ is a single threshold rule, it may be surprising that simply fitting the optimal single threshold rule to the data is insufficient to emulate rationality. While it may be sufficient in some cases, it is not in general, and in particular the ability to emulate rationality does not follow from the rational reply being in $\mathcal{H}$ ; the reason is that it is necessary to be able to construct richer rules in order to deter the sender from deviating to exploit limitations in $\mathcal{H}$ , which would prevent the receiver from choosing the rational reply “off-path.” This is articulated in Section 5.3.2, where we also clarify the role of taking a richer set of possible $\Pi$ , to correspondingly justify choosing a sufficiently rich set of $\mathcal{H}$ to begin with.

5. Main Results

We now present our main results, showing the existence of an equilibrium of the algorithm game where the rational reply is emulated. We begin with a preliminary observation, useful for understanding our subsequent analysis: PAC learnability is a sufficient condition for the algorithm game to have a Nash equilibrium emulating $(\sigma^{R},y^{R}(\sigma^{R},p))$ .

Proposition 5.1.

If $\tau$ is PAC learnable, then $(\sigma^{R},\tau)$ is a Nash equilibrium of the algorithm game which emulates $(\sigma^{R},y^{R}(\sigma^{R},p))$ .

Proof.

If $\tau$ is PAC learnable, then the receiver learns $\sigma$ accurately in the long run. Thus, the long run average expected payoff of the sender is

{\mathcal{U}}(\sigma,\tau)=\mathbf{E}_{\theta}u(\theta,\sigma,y^{R}(\sigma,\sigma(\theta)))

By the definition,

\sigma^{R}=\arg\max\mathbf{E}_{\theta}u(\theta,\sigma,y^{R}(\sigma,\sigma(\theta))).

By PAC learnability,

\lim_{t\rightarrow\infty}\mathbf{P}[\tau(D_{t})(p)\neq y^{R}(\sigma^{R},p)]=0,

implying that $\mathbf{E}[v(\theta_{t},p_{t},a_{t})]\rightarrow\mathbf{E}[v(\theta_{t},p_{t},y^{R}(\sigma^{R},p)]$ as $t\rightarrow\infty$ . This implies the long run discounted payoffs are equal to those obtained against a rational player, and hence $(\sigma^{R},\tau)$ constitutes a Nash equilibrium which emulates $(\sigma^{R},y^{R}(\sigma^{R},p))$ . ∎

This observation suggests it suffices to show the PAC-condition holds; in that case, the sender would find it optimal to choose $\sigma^{R}$ , and by definition it would not be possible for the receiver to outperform rationality. However, there are two main difficulties which we seek to emphasize:

(1)

First, it may be that $y^{R}\in\mathcal{H}$ , and yet if $\mathcal{H}$ is limited then the rational outcome cannot be emulated without expanding the set of feasible decision rules, and
(2)

Second, one still needs to specify how the algorithm should use the historical data in inferring the correct decision.

This section addresses each of these issues. We first consider the case where the receiver knows the values of $y^{R}(\sigma,p)$ $\forall(\sigma,p)$ . We then show, in Section 5.1, that the PAC-condition holds for an algorithm:

Proposition 5.2.

If the receiver knows the values of $y^{R}(\sigma,p)$ $\forall(\sigma,p)$ , there exists an algorithm $\tau_{A}$ that is PAC learnable. Thus, $(\sigma^{R},\tau_{A})$ is a Nash equilibrium of the algorithm game, which emulates $(\sigma^{R},y^{R})$ .

We then turn to the case where the algorithm cannot observe $y^{R}(\sigma,p)$ . This yields an algorithm $\tau_{\hat{A}}$ , which coincides with $\tau_{A}$ with the added step of inferring the labels. We show that we obtain an analogous result for this case in Section 5.2:

Proposition 5.3.

Suppose that $y^{R}(\sigma,p)$ is a strict best response $\forall\sigma$ but the receiver does not observe the values of $y^{R}(\sigma,p)$ . Then, there exists an algorithm $\tau_{\hat{A}}$ that is PAC learnable. $(\sigma^{R},\tau_{\hat{A}})$ is a Nash equilibrium of the algorithm game that emulates $(\sigma^{R},y^{R})$ .

In our analysis, the first step is to construct an algorithm that generates an accurate forecast in the long run. The remaining step is to show whether the sender has an incentive to choose $\sigma^{R}$ against $\tau_{\hat{A}}$ in the algorithm game, in Section 6.

5.1. Specifying the Algorithm and Weak Learnability (Proposition 5.2)

5.1.1. Weak Learnability

The sufficient condition which ensures we can approximate an arbitrary decision rule combining single-thresholds is weak learnability. Roughly speaking, weak learnability says that the hypothesis class can outperform someone who had some very minimal knowledge of the truth of the hypothesis. That is, it must be that the hypothesis class can do better than a someone who made a random guess, which would be made correct with some arbitrarily small probability. While this may seem permissive—and indeed, it is certainly less stringent than requiring it can approximate the truth with high probability—the difficulty in achieving it is the fact that this guarantee must be uniform over all possible distributions.

We formally define this as follows:

Definition 5.4.

Let $P(\sigma)$ be the support of $\sigma$ . If $\overline{h}$ solves

\sum_{p\in P(\sigma)}d(p)\mathbf{1}[y(p)=\overline{h}(p)]\geq\sum_{p\in P(\sigma)}d(p)\mathbf{1}[y(p)=h(p)]\qquad\forall h\in\mathcal{H},

$\overline{h}$ is an optimal weak hypothesis.

Definition 5.5.

If $\mathinner{\!\left\lvert A\right\rvert}=2$ , a hypothesis class $\mathcal{H}$ is weakly learnable if, for every distribution $d$ over observations $p\in P(\sigma)$ and labels $y(p)$ , the optimal weak hypothesis satisfies:

\sum_{p\in P(\sigma)}D(p)(\mathbf{1}[y(\sigma,p)\neq h(p)]-\mathbf{1}[y(\sigma,p)=h(p)])\geq\rho.

If $\mathinner{\!\left\lvert A\right\rvert}>2$ , a hypothesis class $\mathcal{H}$ is weakly learnable if, for every distribution $d$ over observations $p\in P(\sigma)$ and labels $y(p)$ , the optimal weak hypothesis satisfies:

\sum_{p\in P(\sigma)}\mathbf{1}[\overline{h}(p)\neq y(p)]d(p)\leq\sum_{p\in P(\sigma)}\mathbf{E}_{\tilde{y}\sim B}[(1-\rho)\mathbf{1}[\tilde{y}\neq y(p)]]d(p),

for some $\rho>0$ and some distribution $B$ over $A$ .

The second condition is a generalization of the first, though the first is perhaps more familiar from the machine learning literature (as most attention has focused on the two-label case). This condition reflects the idea that the classifier randomly guesses the label according to some distribution $B$ , but is “flipped to being correct” with probability $\rho$ . For the $\mathinner{\!\left\lvert A\right\rvert}>2$ case, the right hand side describes the expected error in such a case, and the left hand side describes the error from the optimal weak classifier.

If weak learnability fails, then no recursive ensemble algorithm can be built to approximate $y(p)$ based on $\mathcal{H}$ alone.⁹⁹9For example, imagine $\mathcal{H}$ only consists of trivial classifiers. A corollary of a result in Appendix A.1 is that these classifiers can do equally well as a random guesser. However, it is clear that they cannot do strictly better, as they are restricted to giving the same guess to all possible $p$ , unlike a random guesser who is correct with an added probability $\rho$ . Perhaps more surprising is that it is tight, a fact which we discuss further in Section 5.3.1. For now, we simply mention that if we take $\mathcal{H}$ , the set of single threshold classifiers is weakly learnable.

Proposition 5.6.

The set of single-threshold classifiers satisfies the weak learnability condition of Definition 5.5.

Proof.

See Appendix A. ∎

Our proof uses the important fact: Any hypothesis class that contains all label permutations can at least match the random guess guarantee. The proof of this intermediate lemma uses a duality argument in order to show that no distribution can lead to a lower payoff when this condition is satisfied. Importantly, however, this is true for any hypothesis class, including the trivial one. This observation allows us to show that the added richness of single-threshold classifiers is sufficient to provide the additional gain over random guessing.

5.1.2. From Weak Learnability to Decision Rules

For simplicity, we present the case where

A=\{-1,1\}

leaving the general case to the appendix. The formal description of the algorithm takes two steps. First, we describe an algorithm under the assumption that the receiver knows the value of $y^{R}(\sigma,p)$ $\forall p$ . If $A$ contains two elements, the specification of the algorithm parameters coincides with the Adaptive Boosting algorithm $\tau_{A}$ of \citeasnounSchapireandFreund12. We first outline the parameters and then review, for completeness.

The $k$ th stage (initializing with the uniform distribution if $k=1$ ) starts with probability distribution $d_{k}(p)$ over the support of $\sigma$ . Define

\epsilon_{k}=\mathbf{P}_{d_{k}}\left(h_{k}(p)\neq y^{R}(\sigma,p)\right)

(5.7)

as the probability that the optimal classifier $h_{k}$ at $k$ misclassifies $p$ under $d_{k}$ . If $\epsilon_{k}=0$ , then we stop the training and output $h$ as the forecasting rule, which perfectly forecasts $y^{R}(\sigma,p)$ .

Suppose that $\epsilon_{k}>0$ . Define

\alpha_{k}=\frac{1}{2}\log\frac{1-\epsilon_{k}}{\epsilon_{k}}

(5.8)

The weak learnability of the single threshold rule implies that $\exists\rho>0$ such that

\epsilon_{k}\leq\frac{1}{2}-\rho\qquad\forall k\geq 1.

Define for each $p$ in the support of $\sigma$ , and each pair $(p,y^{R}(\sigma,p))$ ,

d_{k+1}(p)=\frac{d_{k}(p)\exp(-\alpha_{k}y^{R}(\sigma,p)h_{k}(p))}{Z_{k}}

where

Z_{k}=\sum_{p}d_{k}(p)\exp(-\alpha_{k}y^{R}(\sigma,p)h_{k}(p)).

Given $d_{k+1}$ , we can recursively define $h_{k+1}$ and $\epsilon_{k+1}$ , both of which are functions of $d_{k+1}$ as per the above.

The decision of the receiver is based upon

\tau_{A}(D_{k})(p)=\arg\max_{a\in A}\sum_{t=1}^{k}\alpha_{t}{\mathbf{1}}(h_{t}(p)=a)

which is equivalent to

\tau_{A}(D_{k})(p)=\textbf{sgn}\left[\sum_{t=1}^{k}\alpha_{t}h_{t}(p)\right]

if $A=\{-1,1\}$ , where $\textbf{sgn}(x)$ is the sign of real number $x$ .

Following \citeasnounSchapireandFreund12, we can show that

\mathbf{P}\left(\tau_{A}(D_{t})(p)=y^{R}(\sigma,p)\right)\geq 1-e^{-t\rho(G)}

(5.9)

for any mixed strategy $\sigma$ , where $G$ is the number of elements in the support of $\sigma$ .¹⁰¹⁰10A sketch of the proof is in Appendix B.

5.2. Inferring the Rational Label (Proposition 5.3)

Next, we drop the assumption that the receiver can observe $\sigma$ so that he can calculate the expected utility conditioned on $p$ in the support of $\sigma$ :

\sum_{\theta}v(\theta,p,a)\mu(\theta\mathrel{\mathop{\mathchar 58\relax}}p)

(5.10)

where $\mu$ is computed via Bayes rule, and therefore, knows the value of $y^{R}(\sigma,p)$ . If the receiver does now know $y^{R}(\sigma,p)$ , then he cannot calculate $\epsilon_{k}$ in (5.7). We need to construct an estimator $\hat{y}_{t}(p)$ for $y^{R}(\sigma,p)$ from data $D_{t}$ available at the beginning of period $t$ . How we construct estimator $\hat{y}_{t}(p)$ depends upon the specific details of the rule of the game such as the available data and the variable of interest. We require that $\hat{y}_{t}(p)$ satisfies a regularity property.

Definition 5.7.

$\hat{y}_{t}(p)$ is a consistent estimator if $\hat{y}_{t}(p)$ converges to $y^{R}(\sigma,p)$ in probability as $t\rightarrow\infty$ .

We require that $\hat{y}_{t}(p)$ satisfies the large deviation property (LDP), which is a stronger property than consistency.

Definition 5.8.

$\hat{y}_{t}(p)$ satisfies large deviation properties (LDP) if $\exists\lambda>0$ such that, $\forall p$ in the support of $\sigma$ ,

\limsup_{t\rightarrow\infty}-\frac{1}{t}\log\mathbf{P}\left(y^{R}(\sigma,p)\neq\hat{y}_{t}(p)\right)\leq\lambda.

(5.11)

If an estimator satisfies LDP, the tail portion of the forecating error vanishes at the exponential rate, as the sample average of i.i.d. random variables converges to the population mean. If an estimator fails to satisfy LDP, the finite sample property of the estimator tends to be extremely erratic (\citeasnounMeyn07). Most estimators in economics satisfy LDP.

In the three examples illustrated in Section 3.3, the variable of interest is the probability distribution of the underlying valuation conditioned on $p\in{\mathcal{P}}$ . Let $\pi(v\mathrel{\mathop{\mathchar 58\relax}}p)$ be the posterior distribution of $v$ conditioned on $p$ . If $v$ is drawn from a finite set, then $\pi(v\mathrel{\mathop{\mathchar 58\relax}}p)$ is a multinomial distribution. Let ${\hat{\pi}}_{t}(v\mathrel{\mathop{\mathchar 58\relax}}p)$ be the sample average for $\pi(v\mathrel{\mathop{\mathchar 58\relax}}p)$ . We know that the rate function of ${\hat{\pi}}_{t}(v\mathrel{\mathop{\mathchar 58\relax}}p)$ is the relative entropy of ${\hat{\pi}}_{t}$ with respect to $\pi$ (\citeasnounDemboandZeitouni98)

I_{\pi}=\sum_{v}{\hat{\pi}}_{t}(v\mathrel{\mathop{\mathchar 58\relax}}p)\log\frac{{\hat{\pi}}_{t}(v\mathrel{\mathop{\mathchar 58\relax}}p)}{\pi(v\mathrel{\mathop{\mathchar 58\relax}}p)},

from which we derive $\lambda$ in (5.11): $\forall\epsilon>0$ , let $N_{\epsilon}(\pi)$ be the $\epsilon$ neighborhood of $\pi(v\mathrel{\mathop{\mathchar 58\relax}}p)$ , and

\lambda=\inf_{{\hat{\pi}}_{t}\not\in N_{\epsilon}(\pi)}I_{\pi}.

Note that

y^{R}(\sigma,p)\neq\hat{y}_{t}(p)

only if $\pi$ and ${\hat{\pi}}_{t}$ prescribe differen actions. Since ${\hat{\pi}}_{t}$ is a consistent estimator of $\pi$ , the probability of two probability distributions prescribing two different actions vanishes. The large deviation property of ${\hat{\pi}}_{t}$ implies that $\hat{y}_{t}(p)$ satisfies (5.11), if $y^{R}(\sigma,p)$ is a strict best response.

By the concavity of the logarithmic function, $I_{\pi}$ is minimized if $\pi$ is a uniform distribution and

\inf_{\pi}I_{\pi}>0.

If $\mathinner{\!\left\lvert P\right\rvert}<\infty$ and $\mathinner{\!\left\lvert A\right\rvert}<\infty$ , we obtain the uniform version of (5.11) with respect to the true probability distribution. We state the result without proof for later reference.

Lemma 5.9.

Suppose that $\hat{y}_{t}(p)$ is consistent and satisfies (5.11). Then, $\exists\lambda>0$ such that

\limsup_{t\rightarrow\infty}-\frac{1}{t}\log\mathbf{P}\left(y^{R}(\sigma,p)\neq\hat{y}_{t}(p)\ \ \forall p\ \text{in the support of}\ \sigma\right)\leq\lambda.

(5.12)

We construct algorithm $\tau_{\hat{A}}$ by replacing $y(\sigma,p)$ by $\hat{y}_{t}(p)$ in $\tau_{A}$ constructed in the previous section. More precisely, let $f^{y}_{t}(p)$ be the empirical probability that $\hat{y}_{t}(p)=1$ at the beginning of period $t$ . Thus, $\hat{y}_{t}(p)=-1$ with probability $1-f^{y}_{t}(p)$ . Given $\{d_{t}(p),\hat{y}_{t}(p)\}_{p}$ , $h_{t}$ solves

\max_{h\in{\mathcal{H}}}\sum_{p}h(p)d_{t}(p)[1\cdot f^{y}_{t}(p)-1\cdot(1-f^{y}_{t}(p))]

and

{\hat{\epsilon}}_{t}=\sum_{p}d_{t}(p)\left[f^{y}_{t}(p){\mathbf{1}}(h(p)=1)+(1-f^{y}_{t}(p)){\mathbf{1}}(h(p)=-1)\right].

Using weak learnability, we can show that $\exists\rho>0$ such that

{\hat{\epsilon}}_{t}\leq\frac{1}{2}-\rho.

Since $\hat{y}_{t}(p)$ has the full support over $\{-1,1\}$ $\forall t\geq 1$ ,

{\hat{\epsilon}_{t}}>0.

Given an algorithm $\tau_{A}$ with observed labels, we can therefore replace it with $\tau_{\hat{A}}$ which involves inferring the labels $y^{R}(\sigma,\cdot)$ , setting them equal to $\hat{y}_{t}(\cdot)$ , for all $t\geq 1$ .

With $\tau_{\hat{A}}$ , we can construct labels from data, and that for the hypothesis class of interest the weak learnability condition is satisfied. The last step to show the algorithm works, in the case where the set of possible $p$ has finite support, is that the output of the algorithm will indeed converge to the rational reply, as dictated by the labels, provided the weights are specified correctly.

Proposition 5.10.

Suppose that $\hat{y}_{t}$ satisfies uniform LDP and that $y^{R}(\sigma,p)$ is a strict best response $\forall p$ . Then, $\forall\sigma$ that randomizes over $G$ elements of ${\mathcal{P}}$ , $\exists T$ and $\exists\rho(G)>0$ such that

\mathbf{P}\left(\tau_{\hat{A}}(D_{t})(p)=y^{R}(\sigma,p)\ \forall t\geq T\right)\geq 1-e^{-t\rho(G)}.

Proof.

See Appendix B. ∎

The construction of $\hat{y}_{t}$ depends on the specifics of a problem, especially what data $D_{t}$ available in period $t$ contains. In many interesting economic models, the algorithm for $\hat{y}_{t}$ needs the knowledge of $\Theta$ rather than simply the support of $\pi$ , and $D_{t}$ can contain at least the ordinal information about the performance of the decision recommended by $\tau_{\hat{A}}$ .

Let us consider the insurance model illustrated in Section 3.3.1. The critical value is (5.10). Instead, suppose that the receiver can observe the average performance difference of two actions:

\textbf{sgn}\left(\sum_{k=1}^{t-1}v(\theta_{k},p,1)-v(\theta_{k},p,0)\right)

(5.13)

in the past.¹¹¹¹11At the end of each period, the receiver is supposed observe the performance difference. If not, we can devise an experimentation strategy to infer the average performance difference following the idea of exploration and exploitation. That is,

\hat{y}_{t}(p)=\begin{cases}1&\text{if }\ \sum_{t^{\prime}=1}^{t-1}v(\theta,p,1)-v(\theta,p,0).\geq 0\\ -1&\text{if }\ \sum_{t^{\prime}=1}^{t-1}v(\theta,p,1)-v(\theta,p,0).<0.\end{cases}

Given a probability distribution over $y^{R}(\sigma,p)$ , $\hat{y}_{t}(p)$ satisfies LDP: $\exists\lambda>0$ such that

\limsup_{t\rightarrow\infty}-\frac{1}{t}\log\mathbf{P}\left(\hat{y}_{t}(p)\neq y^{R}(\sigma,p)\right)\leq\lambda.

We know that the large deviation rate function over a binominal distribution is uniformly bounded from below (\citeasnounDemboandZeitouni98). Thus, we can choose $\lambda>0$ uniformly over all probability distribution over $y^{R}(\sigma,p)$ .

The ordinal information (5.13) about the average quality is necessary. Without access to (5.13), the algorithm cannot estimate $y(\sigma,p)$ , which is critical for emulating the rational behavior. The information contained in (5.13) is coarse, because the algorithm does not take any cardinal information about the parameters of the underlying game. Without the cardinal information, the receiver cannot implement the equilibrium strategy of the baseline game, which is a single threshold rule. Because the algorithm does not rely on parameter values of the underlying game, the algorithm is robust against specific details of the game, if the algorithm can function as intended by the decision maker.

5.3. Discussion

5.3.1. Accommodating Multiple Actions

We use the Adaptive Boosting algorithm, as introduced by \citeasnounSchapireandFreund12, to specify the $\alpha_{k}$ weights and the updates if $\mathinner{\!\left\lvert A\right\rvert}=2$ . The original Adaptive Boosting algorithm only applies to the case of $\mathinner{\!\left\lvert A\right\rvert}=2$ . To handle the case of $\mathinner{\!\left\lvert A\right\rvert}>2$ , we appeal to a generalization introduced by \citeasnounMukherjeeSchapire2013.

The $\mathinner{\!\left\lvert A\right\rvert}>2$ algorithm works for the $\mathinner{\!\left\lvert A\right\rvert}=2$ case, with one minor drawback, which is that the learnability constant must be computed in advance. While our work shows an algorithm exists, the computation of the learnability constant is more indirect and hence explicitly finding a parameter that works is more difficult. The arguments for these proofs follow from results in the machine learning literature (see \citeasnounSchapireandFreund12), which we can apply to show that this algorithm can yield a response for which the misclassification probability vanishes.

The proof of Proposition 5.10 is stated for the general case. The proof reveals that the rate at which the probability of misclassfication vanishes is determined entirely by the number of sender actions in the support of $\sigma$ . Thus, the algorithm is efficient in that it maintains an exponential rate of convergence (\citeasnounShalev-ShwartzandBen-David14).

5.3.2. On the Necessity of Expanding $\mathcal{H}$

So far, our analysis has assumed that the set of initial classifiers $\mathcal{H}$ contains the set of single-threshold classifiers, we have shown that the rational reply can be guaranteed by an algorithm. We now show that this result requires the ability to construct algorithms to expand the set of possible decision rules, even if $y^{R}(\sigma,p)\in\mathcal{H}$ , given sufficient richness in the set $\Pi$ —recall that we do not necessarily assume $\mathinner{\!\left\lvert\Theta\right\rvert}<\infty$ , so that different $\pi$ with non-overlapping support may be possible; that is, if the designer seeks to provide rational replies in a variety of different environments, one cannot simply find the best fitting hypothesis within $\mathcal{H}$ to emulate rationality.

Indeed, it is straightforward to find conditions on $\Pi$ , the set of possible distributions over $\Theta$ (all of which, we assume, have finite support), such that the optimal decision rule is of the threshold form. In fact, the algorithm designer may improve upon the rational reply given knowledge of $\pi$ . To illustrate, suppose $u(\theta,p,a)$ is independent of $\theta$ , and weakly increasing (coordinatewise) in $p$ , for each $a$ (the latter of which would hold if, for instance, $p$ were a menu of prices). The following simple result shows that in this case, at least (increasing) single threshold classifiers should be included:

Proposition 5.11.

Suppose $u(\theta,p,1)-u(\theta,p,0)$ is constant in $\theta$ and weakly concave in $p$ . Suppose further that $u(\theta,p^{*},1)>u(\theta,p^{*},0)$ . Then there exists a single threshold classifier which the algorithm could commit to using which ensures the strategic player chooses $p^{*}$ with probability 1.

The proof follows immediately from an observation that the set of $p$ at which the consumer chooses $a=1$ is convex under the conditions of the proposition.¹²¹²12See also \citeasnounGilboaSamet1989 for a similar observation on how the use of restricted decision rules can be advantageous.

In order for the algorithm designer to improve upon a degenerate prescription to always choose $a=0$ , Proposition 5.11 suggests including at least single threshold classifiers which are increasing. Against the highlighted $\pi$ , such prescriptions would give the receiver even higher commitment power than the rational benchmark. In order to maximize payoff against richer and richer $\Pi$ , more and more classifiers should therefore be included to $\mathcal{H}$ .

This raises the question of whether adding in these classifiers goes “too far.” Namely, in seeking to maximize payoff against a rich set of possible $\pi$ , does this risk doing worse against others? In fact, it may be that the receiver does worse than the rational benchmark.

Proposition 5.12.

Suppose that ${\tilde{\Gamma}}$ is the set of all single threshold classifiers, and suppose all $\pi\in\Pi$ has binary support. For any $\{\theta_{L},\theta_{H}\}$ supporting $\pi$ , suppose the following is satisfied:

•

The sender’s optimal $p-a$ pair when $\theta=\theta_{L}$ is $(p_{L}^{*},1)$
•

The sender’s optimal $p-a$ pair when $\theta=\theta_{H}$ is $(p_{H}^{*},0)$ , with $p_{L}^{*}<p_{H}^{*}$ .
•

$v(\theta_{H},p_{H}^{*},1)=v(\theta_{H},p_{H}^{*},0)$ ,
•

$v(\theta_{L},p_{L}^{*},1)\geq v(\theta_{L},p_{L}^{*},0)$ , and
•

$v(\theta,p,1)-v(\theta,p,0)$ increasing in $p$ , for all $\theta$ .

Then a policy arbitrarily close to the sender’s optimal $p-a$ pair is implementable, even if this differs from the rational outcome under $\sigma^{R}$ .

A setting where this sender-optimal action strategy differs from the rational outcome was first studied, to the best of our knowledge, in \citeasnounRubinstein93. His setting satisfies the conditions of the proposition. Our proof adapts his arguments to the current setting (i.e., incorporating the statistical aspect of our exercise and beyond the application he considered, described above), and highlights the importance of counterveiling incentivse in driving the result. The reason the sender can profitably deviate in the previous proof is because the new $\sigma$ induces a non-monotone response from the receiver optimally, even though this is not prescribed by $\sigma^{R}$ . In contrast, decision rules with single-threshold classifiers must be monotone.¹³¹³13Even though $y^{R}(\sigma^{R},p)$ is an element of ${\tilde{\Gamma}}$ , $y^{R}(\sigma^{*},p)$ is not an element of ${\tilde{\Gamma}}$ . Since $\sigma^{*}$ is the choice variable of the sender, the sender generates misspecification endogenously.

In other words, we show that the sender can construct a strategy which ensures that the receiver’s utility as a function of $p$ violates single-crossing. Now, if the sender were using the particular $\sigma^{*}$ from the previous proof, then the rational response can be achieved via a double-threshold classifier, since there are only three optimal sender choices. But on the other hand, if the receiver were restricted to using single- or double-threshold classifiers, then one could find another strategy whereby the optimal response would be to use a triple-threshold classifier, via a similar scheme. As long as the number of thresholds used is finite, a similar kind of exploitation would emerge.

5.3.3. Accommodating Richer Principal Action Spaces

While $\tau_{\hat{A}}$ is designed to be robust against parametric details of the underlying problems, the algorithm is still vulnerable to strategic manipulation by the rational sender. The proof of Proposition 5.10 reveals that the rate of convergence is decreasing as the number of sender actions in the support of $\sigma$ increases. The sender can randomize over infinitely many messages to slow down the convergence rate arbitrarily. That said, such manipulation would be short lived, and therefore have limited gains. Nevertheless, in order to ensure that there are only a finite number of observations that the algorithm may observe, it is necessary to augment the observation space so that the distribution facing the receiver can be treated as discrete. This section discusses how this modification can be done.

Approach One: Discretization We describe how to revise $\tau_{\hat{A}}$ accordingly to discretize the observation space. Instead of processing individual actions, we let $\tau_{\hat{A}}$ process a group of actions at a time, treating “close” actions as the same group. In principle, we want to partition $\mathcal{P}$ into a set of half-open rectangles intervals with size $\lambda$ . More precisely, given some arbitrary $\lambda$ , we can partition each dimension of a rectangle containing $\mathcal{P}$ into the collection of half open intervals of size $\lambda>0$ with a possible exception of the last interval:

P_{0}^{j}=[{\underline{p}}_{,}{\underline{p}}+\lambda),\ldots,P_{K_{j}^{\lambda}}^{j}=[{\underline{p}}+(K_{j}^{\lambda}-1)\lambda),{\overline{p}}]

where $K_{j}^{\lambda}$ is the number of elements in the partition and $j\in\{1,\ldots,n\}$ is a partiular dimension.

For each element in the partition, the algorithm receives an ordinal information about the average outcome from the decision, if it contains a sender action in the support of $\sigma$ :

\hat{y}^{\lambda}_{t}(k)=a\text{ if }a=\arg\max\sum_{p\in P_{k}}v(\theta,p,a)

where $p$ in the support of $\sigma$ and $P_{k}$ is the product of partition elements. Let $\tau^{\lambda}_{\hat{A}}$ be the algorithm obtained by replacing $\hat{y}_{t}(p)$ in $\tau_{\hat{A}}$ by $\hat{y}^{\lambda}_{t}(k)$ . Note that as $\lambda\rightarrow 0$ , the size of the individual elements in the partition shrinks and $\tau^{\lambda}_{\hat{A}}$ converges to $\tau_{\hat{A}}$ for a fixed $\sigma$ .

Compared to $\tau_{A}$ and $\tau_{\hat{A}}$ , $\tau^{\lambda}_{\hat{A}}$ takes only coarse information for two important reasons. First, the algorithm cannot differentiate two $p$ s which are very close. This features makes the algorithm robust against strategic manipulation of the sender to slow down the speed of learning. Second, the algorithm cannot detect the precise consequence of its decision, but only the ordinal information of the past decision, aggregated over time. The second feature allows the algorithm to operate with very little information about the details of the parameters of the underlying game.

Approach Two: Smoothing Discretizing the action space as above is one way of ensuring that there are only a finite number of sender actions to worry about in the long run, and given a sufficiently fine discretization, any distinct $p$ is distinguished by the algorithm. However, in principle, close sender actions may still be quite far in terms of payoffs, and only be distinguished in the long run. That is, there is no guarnatee that for a fixed horizon, that the algorithm is not grouping too many $p$ possibilities. The issue is that the discretization approach uses no information about the receiver’s payoff function. Our other alternative describes more explicitly how close to rationality the receiver can achive, given some fixed discretization scheme.

The idea is the following: We add a small amount of noise to each observed $p$ , with the amount of noise tending to 0 as the sample size grows large. Doing so allows us to show that the receiver perceives the sender’s strategy to have the property that $\mathbf{E}_{\theta}[u(a,\theta,p(a)\mathrel{\mathop{\mathchar 58\relax}}p]$ is uniformly equicontinuous (as functions of $p$ ). As a result, if the receiver only seeks to use a strategy that is $\varepsilon-$ optimal against $\sigma$ , uniform equicontinuity implies that their best reply can essentially be collapsed within intervals.

It will additionally be important that the algorithm does not seek to make predictions at $p$ values where the corresponding density would be estimated to be small. Hence a second step will be to determine whether a $p$ realization occur in a region with sufficiently large probability, where the “sufficient” amount will also tend to 0 as the amount of data grows large.

Formally, suppose the algorithm observes data $((p_{1},y),\ldots,(p_{n},y))$ . Let $z_{\eta,i}$ be an independent random vector in the unit ball around 0 distributed according to the PDF:

\phi_{\eta}(z)=\frac{1}{K}\exp\left(-\frac{1}{1-\mathinner{\!\left\lvert z/\eta\right\rvert}^{2}}\right)\frac{1}{\eta^{\mathinner{\!\left\lvert A\right\rvert}-1}},

where $K$ is a constant which ensures $\phi_{\eta}$ integrates to 1. Our first augmentation is the following:

•

Replace the observed $p_{1},\ldots,p_{n}$ with $\hat{p}_{1},\ldots,\hat{p}_{n}$ , where $\hat{p}_{i}=p_{i}+z_{\eta,i}$ , with $z_{\eta,i}$ distributed according to the above.

Second, it turns out that the above smoothing operation only works if the density is sufficiently large. Otherwise, the smoothing noise has too much power.

•

For any $\tilde{p}=(\tilde{p}_{a})_{a\in A\backslash a_{0}}$ drawn, estimate the event that $\tilde{\sigma}_{\eta}(\tilde{p})<\gamma$ by fixing some $\delta$ small and determining whether menu(s) $p$ with $\max_{a\in A\backslash\{a_{0}\}}{\tilde{p}_{a}-p_{a}}<\delta$ occurs with frequency at least $(2\delta)^{\mathinner{\!\left\lvert A\right\rvert}-1}\gamma$ . Recommend action $a_{0}$ for any such $p$ .

As $\delta\rightarrow 0$ , the condition holds if the density is at least $\gamma$ . Together with the previous, we can show that if the receiver instead observes noisy sender actions, the perceived sender’s strategy is sufficiently well-behaved to maintain the appropriate convergence for the algorithm.

Proposition 5.13.

Suppose the sender is restricted to choosing distributions which are either discrete or continuous with bounded density. Consider an algorithm which can ensure that an $\varepsilon$ -rational label is PAC-learnable, for any arbitrary $\varepsilon>0$ given a finite number of possible sender actions. Then there exists a smoothing operation which maintains PAC-learnability of $\varepsilon$ -rationality, for every $\varepsilon>0$ .

The idea of the proposition is to use the smoothing operation to show that the algorithm perceives that the sender uses a $\sigma$ such that $\mathbf{E}[u(a,\theta,p(a))\mathrel{\mathop{\mathchar 58\relax}}p]$ is uniformly equicontinuous. Given that we seek $\varepsilon$ -optimality, uniform equicontinuity allows us to essentially discretize the menu space, transforming the environment into a much simpler one.

There are two important properties of the transformation which allows us to ensure this works. The first is that, defining $\tilde{\sigma}_{\eta}(\cdot\mathrel{\mathop{\mathchar 58\relax}}\theta)$ to be the perceived $p$ distribution of $p_{i}+z_{i}$ , we have:

D^{\alpha}\tilde{\sigma}_{\eta}(p\mathrel{\mathop{\mathchar 58\relax}}\theta)=\int_{P}D^{\alpha}\phi_{\eta}(p-\tilde{p})\sigma(\tilde{p}\mathrel{\mathop{\mathchar 58\relax}}\theta)d\tilde{p},

so that $\tilde{\sigma}_{\eta}$ inherits the smoothness properties of $\phi_{\eta}$ . The second is that, on any compact subset of $P$ , we have $\sigma_{\eta}(\cdot\mathrel{\mathop{\mathchar 58\relax}}\theta)\rightarrow\sigma(\cdot\mathrel{\mathop{\mathchar 58\relax}}\theta)$ uniformly. Now, in order to obtain uniform continuity as $\eta\rightarrow 0$ , it will be important that we can simultaneously ensure that the sender’s strategy does not involve dramatic movements in the conditional probability. For instance, suppose the sender were to use the following strategy:

\sigma(p\mathrel{\mathop{\mathchar 58\relax}}G)=p(\sin\left(\frac{1}{p}\right)+1),\sigma(p\mathrel{\mathop{\mathchar 58\relax}}B)=p(\sin\left(\frac{1}{p}-\pi\right)+1),

defined on an interval $[0,\overline{p}]$ such that both densities integrate to 1. Then $\mathbf{P}[\theta=G\mathrel{\mathop{\mathchar 58\relax}}p]=1$ if $p=\frac{1}{(2k+1/2)\pi}$ for some $k\in\mathbf{N}$ , and 0 if $p=\frac{1}{(2k+1/2)\pi}$ , for some $k\in\mathbf{N}$ . As $k\rightarrow\infty$ (so that $p\rightarrow 0$ ), this oscillates infinitely often.

We handle the problem this example poses by only making non-degenerate predictions if the probability of using such sender actions is sufficiently high. That is, we “ignore” $p$ realizations which only occur with low probability according to an estimated density.¹⁴¹⁴14One may wonder why this trick works; for instance, we do not obtain the result when $\sigma(p\mathrel{\mathop{\mathchar 58\relax}}G)=\sin\left(\frac{1}{p}\right)+1,\sigma(p\mathrel{\mathop{\mathchar 58\relax}}B)=\sin\left(\frac{1}{p}-\pi\right)+1.$ However, unlike the previous example, these will fail the continuity requirement on the sender’s strategy space, which is needed in the proof. Seeking to estimate the probability that all sender actions are within $\delta$ of $p$ in order to estimate the density is just one way of doing this step; for instance, one could estimate the CDF $\tilde{\sigma}_{\eta}(p)$ , and use the estimated density to determine whether the observations should be thrown away. Ultimately, however, given the compact $\mathcal{P}$ , we can minimize the probability that this is done by using sufficiently low thresholds. As a result, it has a vanishing impact on PAC-learnability, as well as the sender’s expected profit.

6. Review of Examples

We now verify the implications of this observation on the sender behavior in our particular examples, showing that this results in the sender-preferred Stackleberg outcome is emerging. This requires us to verify the previously discussed conditions in the context of these applications.

6.1. Informed Principal

The decision problem of the agent is to identify each pair $(x,q)$ of payment $x$ and cost $q$ as an acceptable constract $(a=1)$ or not $(a=-1)$ . Without loss of generality, we can assume that the agent uses the single threshold linear classifier induced by hyperplane

{\sf H}(\lambda_{x},\lambda_{q},\omega)=\{(x,q)\mathrel{\mathop{\mathchar 58\relax}}\lambda_{x}x+\lambda_{q}q=\omega\}

and

h(x,q)=\begin{cases}1&\text{if }\ (x,q)\in{\sf H}^{+}(\lambda_{x},\lambda_{q},\omega)\\ -1&\text{otherwise.}\end{cases}

We can construct $\tau_{\hat{A}}$ by estimating $\mathbf{E}v(\theta,p,a)$ for each $(p,a)$ .

Lemma 6.1.

Suppose that $\sigma$ assigns a positive probability to $(x,q)$ where

\mathbf{E}(q-\theta x\mathrel{\mathop{\mathchar 58\relax}}(q,x))=0

for $x>0$ . Then $\sigma$ is not a best response to $\tau_{\hat{A}}$ .

Proof.

Let $(x,q)$ be some offer such that the agent is indifferent between accepting and rejecting, so that:

q-\mathbf{E}[\theta\mathrel{\mathop{\mathchar 58\relax}}(x,q)]x=0

The principal’s expected payoff is found by taking the expectation of $u(\theta,(x,q),a)$ over all realizations of $x,q$ . By the law of iterated expectations, this occurs if and only if the principal’s payoff is maximized following each realization of $(x,q)$ . We claim the principal is not indifferent between actions following any such $(x,q)$ . Indeed, letting $\mathbf{E}[\theta\mathrel{\mathop{\mathchar 58\relax}}(x,q)]=r$ , indifference implies:

(1-r)f(I-rx)+rf(I-L+(1-r)x)=(1-r)f(I)+rf(I-L).

Note that equality holds if $f(y)=y$ . This implies that both lotteries, whether or not the principal accepts, have the same expected values. However, if $f$ is concave, then since $I>I-rx>I-L+(1-r)x>I-L$ , it must be that the left hand side is strictly greater than the right hand side.

It follows that if indifference holds, the principal strictly prefers the agent accept the offer by slightly reducing $x$ . ∎

Following the same logic as in the previous example, we conclude that if $\sigma$ is a best response to $\tau_{\hat{A}}$ , then

\tau_{\hat{A}}(D_{t})(q,x)=y^{R}(\sigma,(x,q))

with probability 1. A best reply $\sigma$ to $\tau_{\hat{A}}$ emulates $(\sigma^{R},y^{R}(\sigma^{R},p))$ .

6.2. Labor Market Signaling

The firm’s objective function is to forecast the productivity of the worker:

v(\theta,p,a)=-(\theta-a)^{2}

If $A$ is a real line, then

y^{R}(\sigma,p)=\arg\max_{a\in A}\mathbf{E}_{\theta}\left[v(\theta,p,a)\mathrel{\mathop{\mathchar 58\relax}}p,\sigma\right]

where the posterior distribution over $\theta$ is calculated via Bayes rule from $\sigma$ and the prior over $\theta$ . Strict concavity of $v$ implies that $y^{R}(\sigma,p)$ is a strict best response $\forall\sigma,p$ .

Without loss of generality, we consider a single threshold decision rule parameterized by $(a^{+},a^{-},p^{0})$ :

h(p)=\begin{cases}a^{+}&\text{if }\ p\geq p^{0}\\ a^{-}&\text{if }\ p<p^{0}.\\ \end{cases}

Let ${\mathcal{H}}$ be the set of all single threshold decision rules. In each round, $h_{t}$ solves

\max_{h\in{\mathcal{H}}}\mathbf{E}_{\theta}\left[v(\theta,p,a)\mathrel{\mathop{\mathchar 58\relax}}p,\sigma\right]

if the data includes $\sigma$ . We construct $\tau_{A}$ accordingly. If $\sigma$ is not observable by the algorithm, we estimate the posterior distribution of $\sigma$ conditioned on each $p$ to construct $\tau_{\hat{A}}$ . If the agent learns $y^{R}(\sigma,p)$ eventually $\forall\sigma,p$ , then the principal’s choice $\sigma^{R}$

\mathbf{E}[u(\theta,p,y^{R}(\sigma,p))\mathrel{\mathop{\mathchar 58\relax}}\sigma,p]=\sum_{\theta}\sum_{p}u(\theta,p,y^{R}(\sigma,p))\sigma(p\mathrel{\mathop{\mathchar 58\relax}}\theta)\pi(\theta).

If $\sigma^{R}$ entails separation by the high productivity worker, then the Riley outcome is the solution, that generates the largest ex ante expected surplus for the principal among all separating equilibria. In order to satisfy the incentive constraint among different types of the principal, the principal with $\theta=H$ incurs the signaling cost. If the signaling cost outweighs the benefit of separation, then $\sigma^{R}$ is the pooling equilibrium where both types of the workers takes the minimal signal.

The analysis is based upon the assumption that $y^{R}(\sigma,p)$ is a strict best response $\forall\sigma,p$ . As $\mathinner{\!\left\lvert A\right\rvert}=J<\infty$ , $y^{R}(\sigma,p)$ may not be a strict best response for some $\sigma$ and $p$ . Let us assume that

A=\{a_{0}=0,a_{1},\ldots,a_{J}\}

and $a_{i}-a_{i-1}=\Delta>0$ and $a_{F}=1+\sup\Theta>0$ . Although $y^{R}(\sigma,p)$ may not be a strict best response for some $\sigma$ and $p$ , the set of best responses contains at most 2 elements, which differ by $\Delta>0$ . Abusing notation, let $y^{R}(\sigma,p)$ be the set of best responses, if the agent has multiple best responses at $p$ . Applying the convergence result, we have $\exists T$ such that, $\forall t\geq T$ ,

\mathbf{P}\left(\exists y\in y^{R}(\sigma,p),y=\tau_{\hat{A}}(D_{t})(p)\right)<e^{-\rho t}.

For a sufficiently small $\Delta>0$ , $\sigma^{R}$ is either a strategy close to the Riley outcome, or the pooling equilibrium where both types of the principals choose the smallest value of $p$ .

6.2.1. Monopoly Market

In the model of \citeasnounRubinstein93 illustrated in Section 3.3.3, suppose that type 1 buyer is an algorithmic player, while type 2 buyer is a rational player. To simplify the model, we assume that type 2 buyer’s decision is $y^{R}(\sigma,p)$ $\forall\sigma$ . Assume that type 1 buyer uses $\tau_{\hat{A}}$ .

$y^{R}(\sigma,p)$ is not a strict best response, if

\mathbf{E}_{\theta}v(\theta,p,a,i)=0\qquad\forall a\in A=\{1,-1\},\forall i

(6.14)

so that the agent is indifferent between accepting and rejecting $p$ . Thus, $\tau_{\hat{A}}$ is not PAC learnable.

Still, the best response of the monopolistic seller against $\tau_{\hat{A}}$ is $\sigma^{R}$ . The critical step is to show that a rational seller would not use any $\sigma$ which assigns positive probability $p>v_{L}$ satisfying (6.14).

Lemma 6.2.

Fix $\sigma$ which assigns $p>v_{L}$ with positive probability, satisfying

\mathbf{E}_{\theta}v(\theta,p,1)\geq 0.

(6.15)

Then, the ex ante expected profit of the principal against $\tau_{\hat{A}}$ from $\sigma$ is strictly smaller than from $\sigma^{\prime}$ :

{\mathcal{U}}(\sigma^{R},\tau_{\hat{A}})>{\mathcal{U}}(\sigma,\tau_{\hat{A}}).

Proof.

It suffices to show that if $p>v_{L}$ and $\mathbf{E}(v\mathrel{\mathop{\mathchar 58\relax}}p)-p\geq 0$ , then the expected profit from $p$ is strictly less than $\pi_{L}v_{L}$ . We write the proof in \citeasnounRubinstein93 for the later reference. For any price $p$ satisfying

\mathbf{P}(H\mathrel{\mathop{\mathchar 58\relax}}p)v_{H}+\mathbf{P}(L\mathrel{\mathop{\mathchar 58\relax}}p)v_{L}\geq p,

the revenue cannot exceed

\mathbf{P}(H\mathrel{\mathop{\mathchar 58\relax}}p)v_{H}+\mathbf{P}(L\mathrel{\mathop{\mathchar 58\relax}}p)v_{L}

but the cost is

\mathbf{P}(H\mathrel{\mathop{\mathchar 58\relax}}p)(1-r)c_{2}+\mathbf{P}(H\mathrel{\mathop{\mathchar 58\relax}}p)rc_{1}.

Thus, the seller’s expected profit is at most

\mathbf{P}(L\mathrel{\mathop{\mathchar 58\relax}}p)v_{L}+\mathbf{P}(H\mathrel{\mathop{\mathchar 58\relax}}p)((1-r)(v_{H}-c_{2})+r(v_{H}-c_{1}))

Because of the lemon’s problem,

(1-r)(v_{H}-c_{2})+r(v_{H}-c_{1})<0

and

\mathbf{P}(H\mathrel{\mathop{\mathchar 58\relax}}p)>0

to satisfy

\mathbf{P}(H\mathrel{\mathop{\mathchar 58\relax}}p)v_{H}+\mathbf{P}(L\mathrel{\mathop{\mathchar 58\relax}}p)v_{L}\geq p>v_{L}.

Integrating over $p$ , we conclude that the ex ante profit is strictly less than $\pi_{L}v_{L}$ . ∎

Lemma 6.2 implies that again $\tau_{\hat{A}}$ , the principal will not use $\sigma$ which assigns a positive probability to $p$ so that both 1 and -1 are best responses. Thus, if $\sigma$ is a best response to $\tau_{\hat{A}}$ , then $y^{R}(\sigma,p)$ is a strict best response $\forall p>v_{L}$ . We can apply Proposition 5.3.

Proposition 6.3.

In the example of \citeasnounRubinstein93, if $\sigma$ is a best response to $\tau_{\hat{A}}$ , then $(\sigma,\tau_{\hat{A}})$ is a Nash equilibrium of the algorithm game, which emulates $(\sigma^{R},y^{R}(\sigma^{R},p))$ .

7. Conclusion

This paper has applied the framework of PAC learnability to describe the performance of algorithms in a strategic setting. We show that as long as some initial set of classifiers satisfy weak learnability, an algorithm can be specified which ensures the receiver takes an optimal response to the sender’s action. As noted by \citeasnounRubinstein93, this need not be the case when the receiver’s behavior follows from the optimally chosen single-threshold classifier given the sender’s strategy. However, being able to combine classifiers is enough to overcome this limitation, even if it only remains possible to find the “best” classifier from within this limited class.

Our general analysis has focused on settings featuring strategic inference—based on the observed action of the strategic sender, a rational receiver would update beliefs about an underlying state (thus influencing the optimal response). This adds a complicating feature that the ex-post optimal action is only observed with noise. Yet because this noise diminishes with the size of the sample, we are still able to show this presents no added difficulty (thanks to results from large deviations theory). We briefly mention that if the amount of label noise were bounded away from zero, then our approach need not be successful (a well-known issue with Boosting algorithms). While a technical contribution, it is one that is necessary due to the uncertainty inherent in our applications of interest.

We have sought to articulate the following tradeoff in the design of statistical algorithms to mimic rationality: on the one hand, simply fitting a single-threshold classifier to data will fall short of rational play and be exploited. On the other hand, it may not be clear why this is the end of the story. By adding the ability to fit classifiers repeatedly and combining them in particular ways, we show how the rational benchmark can be restored. Here, we have taken as a black box the ability to fit these classifiers. But given this, our algorithm specifies exactly how to put these fitted classifiers together in order to construct one which can mimic rationality arbitrarily well.

We have focused on a simple yet general setting where the comparison to the rational benchmark is most transparent. Still, we believe that many concerns highlighted by the machine learning literature regarding the design of algorithms can speak to issues of interest to economic theorists. Given how productive the machine learning literature has been in terms of designing algorithms for the purposes of classification, we hope that our work will inspire further analysis of how these algorithms behave in strategic settings.

Appendix A Weak Learnability Proofs

The proof of 5.6 uses the following Lemma:

Lemma A.1.

Let $\mathcal{H}$ be an arbitrary hypothesis class with the property that for every $h\in\mathcal{H}$ and every permutation $\pi\mathrel{\mathop{\mathchar 58\relax}}A\rightarrow A$ , the composition $\pi\circ h$ is contained in $\mathcal{H}$ . Then this hypothesis class can do at least as well as a uniform random guesser.

Proof.

Let $\Pi$ be the set of all possible permutations on $A$ , noting that $\mathinner{\!\left\lvert\Pi\right\rvert}=k!$ . Fix an arbitrary classifier $h\in\mathcal{H}$ , and define $h^{\pi}=\pi\circ h$ . Let $c_{j,y}$ be the cost of assigning label $y$ to price $p_{j}$ . Define

\sum_{\pi\in\Pi}c_{j,h^{\pi}(p_{j})}=\overline{c}_{j}.

In particular, note that this is invariant to the true label of $j$ . As a result, the random guesser’s expected payoff on observation $j$ is is $\overline{c}_{j}/k!$ . To see this, note that $h(p_{j})$ gives some fixed guess regarding the label of price $p_{j}$ . Then randomizing over permutations is equivalent to randomizing over labels, as there are an equal number of permutations which flip the label according to $h(p_{j})$ and every other label.

We therefore obtain the following matrix equation, for an arbitrary $\rho\in(0,\infty)$ , where the number of columns is $k!$ and the number of rows is the number of possible prices.

\left(\begin{array}[]{c|cc|c}&&&\\ c_{j,h(p)}&\cdots&\cdots&c_{j,h^{\pi}(p)}\\ -\overline{c}_{j}/k!&&&-\overline{c}_{j}/k!\\ &&&\end{array}\right)\cdot\begin{pmatrix}\frac{\rho}{k!}\\ \vdots\\ \frac{\rho}{k!}\end{pmatrix}=\mathbf{0}

Also note that:

(1/\rho,\cdots,1/\rho)\cdot\begin{pmatrix}\frac{\rho}{k!}\\ \vdots\\ \frac{\rho}{k!}\end{pmatrix}=1

So as long as $\rho>0$ , by the theorem of the alternative, we therefore cannot have that a vector $\mathbf{x}$ exists with:

\left(\begin{array}[]{cccc}&c_{j,h(p)}-\overline{c}_{j}/k!&\\ \hline\cr\vdots&\vdots&\\ \vdots&\vdots&\\ \hline\cr&c_{j,h^{\pi}(p)}-\overline{c}_{j}/k!&\end{array}\right)\cdot\mathbf{x}\geq\begin{pmatrix}\frac{1}{\rho}\\ \vdots\\ \frac{1}{\rho}\end{pmatrix}.

Let $D(p)$ be an arbitrary distribution. Since $\sum_{p\in P}D(p)=1$ , this implies we can find some $\pi$ such that:

\left(\sum_{p_{j}\in P}D(p_{j})(c_{j,h^{\pi}(p_{j})}-\frac{\overline{c}_{j}}{k!})\right)<\frac{1}{\rho}.

Taking $\rho\rightarrow\infty$ and rearranging gives:

\left(\mathbf{E}_{p\sim D}[c_{j,h^{\pi}(p_{j})}]\right)\leq\mathbf{E}_{j\sim D}\left[\frac{\overline{c}_{j}}{k!}\right]

Recalling again that the right hand side of this inequality is the payoff of the random guesser, we have shown that for every possible distribution over prices, we can find some permutation which delivers a cost bounded above by the random guesser. This proves the Lemma. ∎

Proof of Proposition 5.6.

Let $\mathcal{H}$ be the set of hyperplane classifiers. We prove this by contradiction. If there were no universal lower bound on the error, then we would have, for all $\rho$ , a distribution $D_{\rho}$ and cost $c_{j,y}^{\rho}$ (without loss normalized to be on the unit sphere themselves) with the property that:

\max_{h\in\mathcal{H}}\sum_{p\in P}D_{\rho}(p)c_{j,h(p)}^{\rho}<U_{c}^{\rho},

where $U_{c}^{\rho}$ is the payoff of the uniform random guesser who is correct with added probability $\rho$ . Taking $\rho\rightarrow 0$ and passing to a subsequence if necessary, compactness of the unit sphere implies that we can find a distribution $D^{*}$ and cost function $c^{*}$ such that:

\max_{h\in\mathcal{H}}\sum_{p\in P}D^{*}(p)c_{j,h(p)}^{*}=U_{c}^{0},

where we note by Lemma A.1 that at least this bound can be obtained by permutation the labels if necessary. We will arrive at a contradiction by exhibiting a single-hyerplane classifier that achieves a strictly better accuracy, given $D^{*}$ . Note that $\mathcal{H}$ contains the set of “trivial” classifiers, which give all menus the same label. Also note that the only non-trivial case to consider is when there are at least two prices in the support of $D^{*}$ ; if there were only one price, then simply choosing the prediction corresponding to the label on that price would yield a perfect fit. Since, by assumption, no classifier does better than random guessing, it must be the case in particular that each trivial classifier cannot exceed the random-guess bound. On the other hand, by our previous result, we know there does exist a trivial classifier which achieves at least this bound, for any $D$ supported on $P$ .

Let $P=\{p_{1},\ldots,p_{k}\}$ be the set of prices supporting $D^{*}$ , and let $\tilde{p}\in P$ be a price in that is also an extreme point of the convex hull of $P$ . Without loss of generality, assume that $\tilde{p}$ is nontrivial, in the sense that it does not give the same cost to all labels. Note that indeed, this is without loss, since for any such price, the choice of classification is irrelevant.¹⁵¹⁵15If all prices are trivial, then we will achieve a contradiction, because that implies that the classifier does do at least as well as the edge-over-random guesser, since all classifiers achieve the same payoff. Note that $\tilde{p}$ is not in the convex hull of $P\backslash\{\tilde{p}\}$ . Therefore, by the separating hyperplane theorem, we can find an $h\in\mathcal{H}$ which (strictly) separates $\tilde{p}$ from $P\backslash\{\tilde{p}\}$ . Denote such a hyperplane by $h^{*}$ , and note that the set of hyperplane classifiers contains classifiers which assign any two labels (possibly the same label) to prices depending on which side of $h^{*}$ they lie on.

Also note that, again by our previous result, a trivial classifier supported on $P\backslash\{\tilde{p}\}$ can achieve the random guess guaranatee if $p$ is distributed according to the conditional distribution on this set. In other words, our prior lemma implies that there exists $y^{*}\in A$ such that:

\sum_{p_{j}\in P\backslash\tilde{p}}\frac{D^{*}(p_{j})}{\sum_{q\in P\backslash\tilde{P}}D^{*}(q)}c_{j,y^{*}}^{*}=U_{c^{*}}^{0}.

On the other hand, a classifier which separates $p_{\tilde{j}}$ from the other prices can fit $p_{\tilde{j}}$ perfectly. Thus we must have

c_{\tilde{j},y_{\tilde{j}}}<\mathbf{E}_{\hat{y}\sim\text{Unif}}[c_{\tilde{j},\hat{y}}].

So consider the hyperplane classifier which predicts $\tilde{y}$ for $\tilde{p}$ , and $y^{*}$ for $p\in P\backslash\{\tilde{p}\}$ , i.e., depending on which side of $h^{*}$ they are on (acknowledging that this may be a trivial classifier). Denote the resulting classifier by $h$ . For this single-hyperplane classifier, we have

\sum_{p_{j}\in P}D^{*}(p_{j})c_{j,h(p_{j})}=D^{*}(p_{\tilde{j}})c_{\tilde{j},y_{\tilde{j}}}+\left(\sum_{q\in P\backslash\{p_{\tilde{j}}\}}D^{*}(q)\right)\sum_{p_{k}\in P\backslash\{p_{\tilde{j}}\}}\frac{D^{*}(p_{k})}{\sum_{q\in P\backslash\{p_{\tilde{j}}\}}D^{*}(q)}c_{k,y^{*}}>U_{c^{*}}^{0},

where the inequality holds since the single-threshold classifier does strictly better on some non-trivial price, and as well on all other prices. This completes the proof. ∎

Appendix B Specifying the Algorithm Parameters and the Proof of Proposition 5.10.

B.1. Convergence of $\tau_{A}$

B.1.1. The $\mathinner{\!\left\lvert A\right\rvert}=2$ case

We replicate the proof in \citeasnounSchapireandFreund12 for reference. Define

F_{t}(p)=\sum_{k=1}^{t}\alpha_{k}h_{k}(p).

Following the same recursive process described in \citeasnounSchapireandFreund12, we have

d_{t+1}(p)=\frac{d_{1}(p)\exp\left(-y(\sigma,p)\sum_{k=1}^{t}\alpha_{k}h_{k}(p)\right)}{\prod_{k=1}^{t}Z_{k}}=\frac{d_{1}(p)\exp(-y(\sigma,p)F_{t}(p))}{\prod_{k=1}^{t}Z_{k}}.

(B.16)

Following \citeasnounSchapireandFreund12, we can show that

\mathbf{P}\left(H_{t}(p)\neq y(\sigma,p)\right)=\mathbf{E}\sum_{p}d_{1}(p){\mathbf{1}}(H_{t}(p)\neq y(\sigma,p))\leq\mathbf{E}\sum_{p}d_{1}(p)\exp(-y(\sigma,p)F_{t}(p)),

and

\mathbf{P}(H_{t}(p)\neq y(\sigma,p))=\mathbf{E}\prod_{k=1}^{t}Z_{k}.

Note

Z_{k}=\sum_{p}d_{k}(p)\exp\left(-y(\sigma,p)\alpha_{k}h_{k}(p)\right).

The rest of the proof follows from \citeasnounSchapireandFreund12, which we copy here for later reference.

$\displaystyle Z_{t}$	$\displaystyle=$	$\displaystyle\sum_{p}d_{t}(p)\exp\left(-y(\sigma,p)\alpha_{t}h_{t}(p)\right)$
	$\displaystyle=$	$\displaystyle\sum_{y(\sigma,p)h_{t}(p)=1}d_{t}(p)\exp\left(-\alpha_{t}\right)+\sum_{y(\sigma,p)h_{t}(p)=-1}d_{t}(p)\exp\left(-\alpha_{t}\right)$
	$\displaystyle=$	$\displaystyle e^{-\alpha_{t}}(1-\epsilon_{t})+e^{\alpha_{t}}\epsilon_{t}$
	$\displaystyle=$	$\displaystyle e^{-\alpha_{t}}\left(\frac{1}{2}+\gamma_{t}\right)+e^{\alpha_{t}}\left(\frac{1}{2}-\gamma_{t}\right)$
	$\displaystyle=$	$\displaystyle\sqrt{1-4\gamma^{2}_{t}}$

where

\gamma_{t}=\frac{1}{2}-\epsilon_{t}.

By weak learnability, we know that $\gamma_{t}$ is uniformly bounded away from 0: $\exists\gamma>0$ such that

\gamma_{t}\geq\gamma\qquad\forall t\geq 1.

Recall that the maximum number of the elements in the support of $\sigma$ is $N$ . Thus,

d_{t+1}(p)=d_{1}(p)\prod_{k=1}^{t}\sqrt{1-4\gamma^{2}_{t}}\leq\frac{1}{N}\left(1-4\gamma^{2}\right)^{\frac{t}{2}}\leq\frac{1}{N}e^{-2\gamma^{2}t}

where the right hand side converges to 0 at the exponential rate uniformly over $p$ .

B.2. The $\mathinner{\!\left\lvert A\right\rvert}>2$ case

The specification of the algorithm can be found in \citeasnounMukherjeeSchapire2013. The proof provided below fills in some details to show that convergence holds in a self-contained way.

First, initialize $F_{y}^{0}(x_{i})=0$ .

•

From previous stage, take $F_{y}^{t}$ .

•

At stage $t$ , find the $h\in\mathcal{H}$ solving:

\min_{h\in\mathcal{H}}\frac{1}{m}\sum_{i=1}^{m}\mathbf{1}[h_{t}(x_{i})=y_{i}]\left((e^{-\eta}-1)\sum_{\tilde{y}\neq y_{i}}e^{\eta(F_{\tilde{y}}^{t-1}-F_{y_{i}}^{t-1})}\right)+\mathbf{1}[h_{t}(x_{i})\neq y_{i}](e^{\eta}-1)e^{\eta(F_{h_{t}(x_{i})}^{t-1}-F_{y}^{t-1}(x_{i}))}.

•

Define $F_{y}^{t}(x_{i})=\sum_{s=1}^{t}\mathbf{1}[h_{t}(x_{i})=y]$ .

The final prediction is $H_{t}(x_{i})=\arg\max_{\tilde{y}}\sum_{t=1}^{T}\mathbf{1}[h_{t}(x_{i})=\tilde{y}].$

The weak learnability condition says that the hypothesis class can outperform a random guesser that does better than some $\gamma$ , where we allow for a potentially asymmetric cost of making different errors.

We now show convergence to the rational rule:

Step 1: Bounding The Mistakes: This step is as previous. We have

\sum_{i=1}^{m}\mathbf{1}[H_{t}(x_{i})\neq y_{i}]\leq\sum_{i=1}^{m}\sum_{\tilde{y}\neq y_{i}}e^{\eta(F_{\tilde{y}}^{t}(x_{i})-F_{y_{i}}^{t}(x_{i}))}.

Indeed, the exponential is positive, so this inequality holds when $y_{i}$ is labelled correctly, and if the label is incorrect, then that means that some $\tilde{y}_{i}$ satisfies $F_{\tilde{y}_{i}}^{t}(x_{i})>F_{y_{i}}^{t}(x_{i})$ . Since all exponential terms are positive, and furthermore the exponent is positive if $x_{i}$ is labelled incorrectly, meaning the right hand side is greater than 1 if mislabeled.

Step 2: Recursive Formulation of the Loss We now show that the right hand side goes to 0 at an exponential rate. We define the loss function to be:

L_{t}(x_{i})=\sum_{\tilde{y}\neq y_{i}}e^{\eta(F_{\tilde{y}}^{t}(x_{i})-F_{y_{i}}^{t}(x_{i}))},\tilde{L_{t}}=\frac{1}{m}\sum_{i=1}^{m}L_{t}(x_{i}).

We first express $\tilde{L}_{t+1}$ as a function of $\tilde{L}_{t}$ . Note that $F_{\tilde{y}}^{t+1}(x_{i})=F_{\tilde{y}}^{t}(x_{i})$ for all $\tilde{y}\neq h_{t}(x_{i})$ , and $F_{\tilde{y}}^{t+1}(x_{i})=F_{\tilde{y}}^{t}(x_{i})+1$ for $\tilde{y}=h_{t}(x_{i})$ . The loss from a given $x_{i}$ changes depending on whether or not it is correctly classified. For any observation that is classified correctly at the $t+1$ th stage, we multiply that observation’s loss by a factor of $e^{-\eta}$ . On the other hand, for any observation that is classified incorrectly as $\tilde{y}$ , we add the following:

e^{\eta(F_{\tilde{y}}^{t}(x_{i})-F_{y_{i}}^{t}(x_{i}))}(e^{\eta}-1).

So:

\tilde{L}_{t+1}=\frac{1}{m}\left(\sum_{i\mathrel{\mathop{\mathchar 58\relax}}h_{t+1}(x_{i})=y_{i}}e^{-\eta}L_{t}(x_{i})+\sum_{i\mathrel{\mathop{\mathchar 58\relax}}h_{t+1}(x_{i})\neq y_{i}}\left(L_{t}(x_{i})+e^{\eta(F_{h_{t+1}(x_{i})}^{t}(x_{i})-F_{y_{i}}^{t}(x_{i}))}(e^{\eta}-1)\right)\right).

Note that if we subtract $\tilde{L}_{t}$ from both sides, and substitute in for $L_{t}(x_{i})$ above, we obtain:

\tilde{L}_{t+1}-\tilde{L}_{t}=\frac{1}{m}\left(\sum_{i\mathrel{\mathop{\mathchar 58\relax}}h_{t+1}(x_{i})=y_{i}}(e^{-\eta}-1)\sum_{\tilde{y}\neq y_{i}}e^{\eta(F_{\tilde{y}}^{t}(x_{i})-F_{y_{i}}^{t}(x_{i}))}+\sum_{i\mathrel{\mathop{\mathchar 58\relax}}h_{t+1}(x_{i})\neq y_{i}}e^{\eta(F_{h_{t+1}(x_{i})}^{t}(x_{i})-F_{y_{i}}^{t}(x_{i}))}(e^{\eta}-1)\right).

Step 3: Weak Learnability By the above, $h_{t+1}$ is chosen to solve:

\min_{h\in\mathcal{H}}\frac{1}{m}\sum_{i=1}^{m}\mathbf{1}[h(x_{i})=y_{i}]\left((e^{-\eta}-1)\sum_{\tilde{y}\neq y_{i}}e^{\eta(F_{\tilde{y}}^{t}(x_{i})-F_{y_{i}}^{t}(x_{i}))}\right)+\mathbf{1}[h(x_{i})\neq y_{i}](e^{\eta}-1)e^{\eta(F_{h(x_{i})}^{t}(x_{i})-F_{y}^{t}(x_{i}))}.

In fact, using the previous step, we see that this can equivalently be expressed as $\tilde{L}_{t+1}-\tilde{L}_{t}$ . On the other hand, someone who is random guessing, but is correct with extra probability $\gamma$ , will be correct with probability $\frac{1-\gamma}{k}+\gamma$ , and guess an incorrect label $\tilde{y}$ with probability $\frac{1-\gamma}{k}$ . Furthermore, the hypothesis class ensures a weakly lower error (as measured by this cost) than the random guessing. Hence this expression is bounded above by:

\frac{1}{m}\sum_{i=1}^{m}\left((\frac{1-\gamma}{k}+\gamma)(e^{-\eta}-1)L_{t}(x_{i})+\frac{1-\gamma}{k}\sum_{\tilde{y}\neq y_{i}}(e^{\eta}-1)e^{\eta(F_{\tilde{y}}^{t}(x_{i})-F_{y}(x_{i}))}\right)

Again substituting in for $L_{t}(x_{i})$ and rearranging, we obtain:

\left((\frac{1-\gamma}{k}+\gamma)(e^{-\eta}-1)+\frac{1-\gamma}{k}(e^{\eta}-1)\right)\tilde{L}_{t}.

Putting this together, we have this is an upper bound of $\tilde{L}_{t+1}-\tilde{L}_{t}$ , and therefore:

\tilde{L}_{t+1}\leq\left(1+\left((\frac{1-\gamma}{k}+\gamma)(e^{-\eta}-1)+\frac{1-\gamma}{k}(e^{\eta}-1)\right)\right)\tilde{L}_{t}.

Step 4: Specifying $\eta$ We are done if we can ensure $\tilde{L}_{t}\rightarrow 0$ as $t\rightarrow\infty$ , since Step 1 shows that this implies that the number of misclassifications approaches 0 as well. To complete the argument, we must specify an $\eta$ which delivers the exponential convergence. However, first note that if $\eta=0$ , the coefficient on $\tilde{L}_{t}$ in the previous inequality is 1, and the derivative with respect to $\eta$ is $-\gamma$ at 0, so that this expression is less than 1, for some $\eta>0$ . Setting $\eta=\log(1+\gamma)$ , the above coefficient on $\tilde{L}_{t}$ reduces to:

\overbrace{1+\left((\frac{1-\gamma}{k}+\gamma)(\frac{1}{1+\gamma}-1)+\frac{1-\gamma}{k}\gamma\right)}^{z_{k}(\gamma)}.

Note that $z_{k}(\gamma)$ is bounded above by $\tilde{z}(\gamma)=e^{-\gamma^{2}/2}$ . Indeed, this expression is decreasing in $k$ , with $z_{k}(0)=1=\tilde{z}(0)$ , and $z_{2}(\gamma)=1-\frac{\gamma^{2}}{2}<e^{-\gamma^{2}/2}=\tilde{z}(\gamma)$ . Since $\tilde{L}_{0}=(k-1)$ , we therefore have that:

\tilde{L}_{t}\leq(k-1)e^{-\gamma t^{2}/2},

as desired.

B.3. Convergence of $\tau_{\hat{A}}$

Under the assumption that $y^{R}(\sigma,p)$ is a strict best response,

\lim_{t\rightarrow\infty}\hat{y}_{t}(p)=y^{R}(\sigma,p)

almost surely. Since $\hat{y}_{t}(p)$ satisfies the uniform LDP, $\forall\epsilon>0$ , $\exists\rho(\epsilon,\sigma)>0$ and $T(\epsilon,\sigma)$ such that

\mathbf{P}\left(\exists t\geq T(\epsilon,\sigma),\hat{y}_{t}(p)\neq y^{R}(\sigma,p)\right)\leq e^{-t\rho(\epsilon,\sigma)}.

Since the support of $\sigma$ contains a finite number of $p$ , the empirical the multinomial probability distribution over $\theta$ .

Let ${\hat{\pi}}_{t}(\theta\mathrel{\mathop{\mathchar 58\relax}}p)$ be the empirical probability distribution over $\Theta$ following $t$ rounds of observations. By the law of large numbers, ${\hat{\pi}}_{t}(\theta\mathrel{\mathop{\mathchar 58\relax}}p)\rightarrow\pi(\theta\mathrel{\mathop{\mathchar 58\relax}}p)$ computed via Bayes rule from the prior distribution over $\theta$ and $\sigma$ . Write $\Theta=(\theta_{1},\ldots,\theta_{\mathinner{\!\left\lvert\Theta\right\rvert}})$ . Given ${\bf\epsilon}=(\epsilon,\ldots,\epsilon)\in{\mathbb{R}}^{\mathinner{\!\left\lvert\Theta\right\rvert}}$ , the rate function of the multinomial distribution is

\sum_{i=1}^{\mathinner{\!\left\lvert\Theta\right\rvert}}\epsilon\log\frac{\epsilon}{p(\theta)}

where $p(\theta)$ is the probability that $\theta$ is realized. Since $\sum_{\theta}p(\theta)=1$ ,

\sum_{i=1}^{\mathinner{\!\left\lvert\Theta\right\rvert}}\epsilon\log\frac{\epsilon}{p(\theta)}\geq\prod_{i=1}^{\mathinner{\!\left\lvert\Theta\right\rvert}}\epsilon\log\frac{\epsilon}{1/\mathinner{\!\left\lvert\Theta\right\rvert}}=\prod_{i=1}^{\mathinner{\!\left\lvert\Theta\right\rvert}}\epsilon\log\epsilon\mathinner{\!\left\lvert\Theta\right\rvert}>0.

Note that the right hand side is independent of $\sigma$ , which is the rate function of the uniform distribution over $\Theta$ . Thus, we can choose $\rho(\epsilon)\leq\rho(\epsilon,\sigma)$ uniformly over $\sigma$ , which is strictly increasing with respect to $\epsilon>0$ . We choose $T(\epsilon)$ independently of $\sigma$ as well.

Define an event

{\mathcal{L}}=\left\{\hat{y}_{t}(p)=y^{R}(\sigma,p)\qquad\forall t\geq T(\epsilon)\right\}

We know that

\mathbf{P}({\mathcal{L}})\geq 1-e^{-t\rho(\epsilon)}.

Fix $t>T(\epsilon)$ . We have

			$\displaystyle\mathbf{P}\left(\tau_{\hat{A}}(D_{t})(p)\neq y^{R}(\sigma,p)\right)$
		$\displaystyle=$	$\displaystyle\mathbf{P}\left(\tau_{\hat{A}}(D_{t})(p)\neq y^{R}(\sigma,p)\mathrel{\mathop{\mathchar 58\relax}}{\mathcal{L}}\right)\mathbf{P}({\mathcal{L}})+\mathbf{P}\left(\tau_{\hat{A}}(D_{t})(p)\neq y^{R}(\sigma,p)\mathrel{\mathop{\mathchar 58\relax}}{\mathcal{L}}^{c}\right)\mathbf{P}({\mathcal{L}}^{c})$
		$\displaystyle\leq$	$\displaystyle\mathbf{P}\left(\tau_{\hat{A}}(D_{t})(p)\neq y^{R}(\sigma,p)\mathrel{\mathop{\mathchar 58\relax}}{\mathcal{L}}\right)+\mathbf{P}({\mathcal{L}}^{c})$
		$\displaystyle\leq$	$\displaystyle\mathbf{P}\left(\tau_{\hat{A}}(D_{t})(p)\neq y^{R}(\sigma,p)\mathrel{\mathop{\mathchar 58\relax}}{\mathcal{L}}\right)+e^{-t\rho(\epsilon)}.$

Following the same logic as in the proof of Proposition 5.10, we can show that $\exists\gamma(G)>0$ such that

\hat{Z}_{t}\leq 1-\gamma(G)\qquad\forall t\geq 1

(B.17)

under $\tau_{\hat{A}}$ .

Recall that

F_{a}(p)=\sum_{s=1}^{t}\alpha_{s}{\mathbf{1}}(h_{s}(p)=a).

Similarly, we define

{\hat{F}}_{a}(p)=\sum_{s=1}^{t}\hat{\alpha}_{s}{\mathbf{1}}(h_{s}(p)=a).

Following the same logic as in the proof of Proposition 5.10, we know that if $\tau_{\hat{A}}(D_{t})(p)\neq y^{R}(p)$ ,

{\hat{F}}_{y^{R}(\sigma,p)}(p)+\sum_{a\neq y^{R}(\sigma,p)}{\hat{F}}_{a}(p)>0.

Thus,

	$\displaystyle{\mathbf{1}}(\tau_{\hat{A}}(D_{t})(p)\neq y^{R}(\sigma,p))$	$\displaystyle\leq$	$\displaystyle{\mathbf{1}}\left({\hat{F}}_{y^{R}(\sigma,p)}(p)+\sum_{a\neq y^{R}(\sigma,p)}{\hat{F}}_{a}(p)\right)$
		$\displaystyle\leq$	$\displaystyle\exp\left({\hat{F}}_{y^{R}(\sigma,p)}(p)+\sum_{a\neq y^{R}(\sigma,p)}{\hat{F}}_{a}(p)\right).$

Conditioned on event ${\mathcal{L}}$ ,

\hat{y}_{t}(p)=y^{R}(\sigma,p)\qquad\forall t\geq T(\epsilon).

We can write for $t\geq T(\epsilon)$ ,

$\displaystyle d_{t+1}(p)$	$\displaystyle=$	$\displaystyle\frac{{\hat{d}}_{t}(p)\exp(\alpha_{t}({\mathbf{1}}(h_{t}(p)\neq\hat{y}_{t}(p))-{\mathbf{1}}(h_{t}(p)=\hat{y}_{t}(p))))}{\hat{Z}_{t}}$
	$\displaystyle=$	$\displaystyle\frac{{\hat{d}}_{t}(p)\exp(\alpha_{t}({\mathbf{1}}(h_{t}(p)\neq y^{R}(\sigma,p))-{\mathbf{1}}(h_{t}(p)=y^{R}(\sigma,p))))}{\hat{Z}_{t}}$
	$\displaystyle=$	$\displaystyle\frac{d_{T(\epsilon)}(p)\exp(\sum_{s=T(\epsilon)}^{t}\alpha_{s}({\mathbf{1}}(h_{s}(p)\neq y^{R}(\sigma,p))-{\mathbf{1}}(h_{s}(p)=y^{R}(\sigma,p))))}{\prod_{s=T(\epsilon)}^{t}\hat{Z}_{t}}.$

Thus,

			$\displaystyle\prod_{s=T(\epsilon)}^{t}\hat{Z}_{t}$
		$\displaystyle=$	$\displaystyle\sum_{p}d_{T(\epsilon)}(p)\exp\left[\sum_{s=T(\epsilon)}^{t}\alpha_{s}({\mathbf{1}}(h_{s}(p)\neq y^{R}(\sigma,p))-{\mathbf{1}}(h_{s}(p)=y^{R}(\sigma,p)))\right]$
		$\displaystyle\geq$	$\displaystyle\left(\min_{p\in{\mathcal{P}}(\sigma)}d_{T(\epsilon)}(p)\right)\sum_{p}\exp\left[\sum_{s=T(\epsilon)}^{t}\alpha_{s}({\mathbf{1}}(h_{s}(p)\neq y^{R}(\sigma,p))-{\mathbf{1}}(h_{s}(p)=y^{R}(\sigma,p)))\right].$

Since $d_{1}(p)$ is the uniform distribution over ${\mathcal{P}}(\sigma)$ ,

\min_{p\in{\mathcal{P}}(\sigma)}d_{T(\epsilon)}(p)>0.

We can write

			$\displaystyle\prod_{s=1}^{t}\hat{Z}_{t}=\prod_{s=T(\epsilon)}^{t}\hat{Z}_{t}\prod_{s=1}^{T(\epsilon)-1}\hat{Z}_{t}$
		$\displaystyle\geq$	$\displaystyle\left(\min_{p\in{\mathcal{P}}(\sigma)}d_{T(\epsilon)}(p)\right)\sum_{p}\exp(\sum_{s=T(\epsilon)}^{t}\hat{\alpha}_{s}({\mathbf{1}}(h_{s}(p)\neq y^{R}(\sigma,p))-{\mathbf{1}}(h_{s}(p)=y^{R}(\sigma,p))))\prod_{s=1}^{T(\epsilon)-1}\hat{Z}_{t}$
		$\displaystyle=$	$\displaystyle\frac{\left(\min_{p\in{\mathcal{P}}(\sigma)}d_{T(\epsilon)}(p)\right)\prod_{s=1}^{T(\epsilon)-1}\hat{Z}_{t}}{\sum_{p}\exp\left[\sum_{s=1}^{T(\epsilon)-1}\hat{\alpha}_{s}({\mathbf{1}}(h_{s}(p)\neq y^{R}(\sigma,p))-{\mathbf{1}}(h_{s}(p)=y^{R}(\sigma,p)))\right]}$
			$\displaystyle\ \ \ \ \ \ \ \ \ \ \ \ \ \times\sum_{p}\exp\left[\sum_{s=1}^{t}\hat{\alpha}_{s}({\mathbf{1}}(h_{s}(p)\neq y^{R}(\sigma,p))-{\mathbf{1}}(h_{s}(p)=y^{R}(\sigma,p)))\right]$

over ${\mathcal{L}}$ . Define

M(\epsilon)=\frac{\left(\min_{p\in{\mathcal{P}}(\sigma)}d_{T(\epsilon)}(p)\right)\prod_{s=1}^{T(\epsilon)-1}\hat{Z}_{t}}{\sum_{p}\exp(\sum_{s=1}^{T(\epsilon)-1}\hat{\alpha}_{s}({\mathbf{1}}(h_{s}(p)\neq y^{R}(\sigma,p))-{\mathbf{1}}(h_{s}(p)=y^{R}(\sigma,p))))}

which is bounded away from 0.

Recall that

			$\displaystyle\mathbf{P}(\tau_{\hat{A}}(D_{t})(p)\neq y^{R}(\sigma,p))$
		$\displaystyle\leq$	$\displaystyle\sum_{p}d_{1}(p)\exp(\sum_{s=1}^{t}\hat{\alpha}_{s}({\mathbf{1}}(h_{s}(p)\neq y^{R}(\sigma,p))-{\mathbf{1}}(h_{s}(p)=y^{R}(\sigma,p))))$
		$\displaystyle\leq$	$\displaystyle\frac{\prod_{s=1}^{t}\hat{Z}_{t}}{M(\epsilon)}\leq\frac{(1-\gamma(G))^{t}}{M(\epsilon)}\leq\frac{e^{-t\gamma(G)}}{M(\epsilon)}.$

Combining the probabilities over ${\mathcal{L}}$ and ${\mathcal{L}}^{c}$ , we have that $\forall\epsilon$ , $\forall\sigma\in\Sigma^{G}\subset\Sigma$ , $\exists T(\epsilon)$ , $\rho(\epsilon)$ and $\gamma(G)$ such that

\mathbf{P}\left(\exists t\geq T(\epsilon),\ \tau_{\hat{A}}(D_{t})(p)\neq y^{R}(\sigma,p)\right)\leq\frac{e^{-t\gamma(G)}}{M(\epsilon)}+e^{-t\rho(\epsilon)}.

We can choose $T>T(\epsilon)$ and ${\overline{\rho}}$ such that $\forall t\geq T$ ,

\frac{e^{-t\gamma(G)}}{M(\epsilon)}+e^{-t\rho(\epsilon)}\leq e^{-{\overline{\rho}}t}

which proves the proposition.

Appendix C Proofs for Section 5.3.2

Proof of Proposition 5.11.

Concave differences implies that the set

K=\{p\mathrel{\mathop{\mathchar 58\relax}}u(\theta,p,1)\geq u(\theta,p,0)\}

is a convex set; if $u(\theta,p_{i},1)-u(\theta,p_{i},0)\geq 0$ for $i=1,2$ , then the same conclusion holds for $\alpha p_{1}+(1-\alpha)p_{2}$ for all $\alpha\in[0,1]$ . Therefore, given any $p^{*}$ on the boundary, the supporting hyperplane theorem implies that we can find a linear hyperplane $(\lambda,\omega)$ tangent to this set at $p^{*}$ .

Suppose the algorithm designer prescribes that the receiver choose $a=1$ at any menu $p$ such that $\lambda\cdot p\leq\omega$ and $a=0$ otherwise. Note that having the receiver choose $a=1$ therefore requires choosing $p$ where the sender would rather the receiver choose action $a=0$ , by definition of $K$ . Therefore, the strategic player cannot do any better than choosing $\sigma(p\mid\theta)$ which is a point mass at $p^{*}$ . ∎

Proof of Proposition 5.12.

The ideas in this proof are largely borrowed from \citeasnounRubinstein93, accommodating two additional features of our enviroment: (a) need to infer the strategy from observed data and (b) the generalized setting, but we provide the proof for completeness. We construct a strategy $\sigma^{*}$ for the sender that generates higher payoff than the equilibrium strategy $\sigma^{R}$ , thus deriving the contradiction that $\sigma^{R}$ is a best response to $\tau$ in the long run. More precisely, define $(p_{\theta}^{*},a(\theta))$ to be the sender payoff-maximizing strategy. We show that the sender can induce the receiver to choose $a(\theta)\neq y_{R}(p_{\theta}^{*})$ .

Fix $\epsilon>0$ small, and without loss suppose $\mathinner{\!\left\lvert\Theta\right\rvert}=2$ . First suppose $v(\theta_{L},p_{L}^{*},a_{1})=v(\theta_{L},p_{L}^{*},a_{0})$ . Let $\tilde{p}\in(p_{L},p_{H})$ satisfies $v(\theta_{L},\tilde{p},1)<v(\theta_{L},\tilde{p},0)$ . (If $p$ is multidimensional, we can take $\tilde{p}$ tp be on the line segment connecting $p_{L}^{*}$ and $p_{H}^{*}$ ) Set $\eta=v(\theta_{L},\tilde{p},0)-v(\theta_{L},\tilde{p},1)>0.$ We then choose $\epsilon,\epsilon_{H},\epsilon_{L}>0$ to satisfy

\pi(H)\epsilon_{H}<\pi(L)\epsilon_{L},

(C.18)

and such that

\frac{\epsilon_{L}}{\epsilon_{L}+\eta}<\epsilon<\frac{\pi(L)\epsilon_{L}-\pi(H)\epsilon_{H}}{\pi(L)\epsilon_{L}}.

(C.19)

Under the increasing differences assumption, we can find $p_{i}(\varepsilon_{i})$ such that

\varepsilon_{i}=v(\theta_{i},p_{i}(\varepsilon_{i}),1)-v(\theta_{i},p_{i}(\varepsilon_{i}),0).

Consider the following randomized pricing rule $\sigma^{*}$ of the sender: in state $H$ , $\tilde{p}_{H}(\epsilon_{H})$ is chosen with probability 1. In state $L$ , $p_{L}(\epsilon_{L})$ is chosen with probability $1-\epsilon$ and $\tilde{p}$ with probability $\epsilon$ .

Under this strategy, the optimal response following $\tilde{p}$ is 0, and this does not vanish as all other parameters tend to 0. However, the ex-post optimal decisions are 1 for both $\tilde{p}_{L}(\epsilon_{L})$ and $\tilde{p}_{H}(\epsilon_{H})$ . Nevertheless, (C.19) implies first, the decisionmaker prefers to choose $a=1$ if and only if $\tilde{p}_{L}(\epsilon_{L})$ than choose $a=1$ if and only if $\tilde{p}_{H}(\epsilon_{H})$ ; and second, that the loss from choosing $a=1$ following $\tilde{p}$ is larger than the loss from choosing $a=0$ at $\tilde{p}_{L}(\epsilon_{L})$ . Putting this together, and taking $\epsilon,\epsilon_{L},\epsilon_{H}\rightarrow 0$ shows this policy approximates the sender’s optimum, as desired.

The case of $v(\theta_{L},p_{L}^{*},a_{1})>v(\theta_{L},p_{L}^{*},a_{0})$ is even more straightforward, since in this case the gain from choosing $a_{1}$ is non-vanishing, meaning that we can set $\varepsilon_{L}=0$ .

The verification that the optimal rule converges to this threshold when emerging from data is straightforward; any recursive learning algorithm generates $\{\phi_{t}\}$ which converges to $\phi\in\left(v_{L}-\epsilon_{L},\frac{v_{H}+v_{L}}{2}\right)$ to emulate the best response of type 1 buyer against $\sigma$ . Thus, the long run average payoff against such algorithm should be bounded from below by ${\mathcal{U}}_{p}^{*}-\epsilon$ . ∎

Appendix D Proofs for Section 5.3.3

D.1. Proof of Proposition 5.13

The proof of the theorem proceeds in the following steps:

•

Step 1: Show that the expected value conditional on price, in the image of the sender’s possible strategies after applying the augmentation, is uniformly equicontinuous.
•

Step 2: Show that the same label is applied to $\mathbf{E}[v_{\theta}\mid p+z_{i,\eta},\sigma,\phi_{\eta}]$ as would be applied to $\mathbf{E}[v_{\theta}\mid p,\sigma]$ , with high probability.
•

Step 3: Verify that the change in recommendation due to discarding “low density prices” occurs with vanishing probability.

Putting these together shows that the change in the expectation can be made arbitrarily small, as can the probability that small density observations are drawn. The condition that $\sigma$ is either discrete or continuous is stronger than necessary; what is necessary is continuity of the conditional expectation as a function of price, which can be satisfied if the discrete portions and continuous portions are separated, for instance. However, the proposition highlights that we need not restrict the sender’s strategy space at all in order for our algorithm to converge.

The Theorem implies that if the sender were to use an arbitrary strategy $\sigma$ , the receiver could instead focus on finding a rational response to $\tilde{\sigma}_{\eta}$ . Doing so would still lead to PAC learnability of the approximately optimal response to $\sigma$ . On the other hand, we can show that the optimal response to $\tilde{\sigma}_{\eta}$ is PAC learnable (unlike, potentially, the optimal response to $\sigma$ ), and doing the change leads to a negligible impact on the sender’s surplus.

Before presenting the proof, we argue that uniform equicontinuity implies weak learnability. Suppose that $\mathbf{E}[v\mid\sigma,p]-p$ is uniformly equicontinuous (which holds if $\mathbf{E}[v\mid\sigma,p]$ is uniformly equicontinuous). By uniform equicontinuity, we have there exists some $\delta$ such that whenever $\mathinner{\!\left\lvert p-p^{\prime}\right\rvert}<\delta$ , we have that

\mathinner{\!\left\lvert\mathbf{E}[v\mid\sigma,p]-\mathbf{E}[v\mid\sigma,p^{\prime}]\right\rvert}<2\varepsilon,

for any $\sigma$ . Suppose we have some price $p$ such that $\mathbf{E}[v\mid\sigma,p]-p>\varepsilon$ . Then if $\mathbf{E}[v\mid\sigma,p^{\prime}]-p^{\prime}<-\varepsilon$ , it follows that $\mathinner{\!\left\lvert p-p^{\prime}\right\rvert}>\delta$ . It follows that there can only be at most $\frac{v_{H}-v_{L}}{\delta}$ prices such that $y(\sigma,p)=-y(\sigma,p^{\prime})$ , where $p$ and $p^{\prime}$ are adjacent (ignoring all prices where $\mathinner{\!\left\lvert\mathbf{E}[v\mid\sigma,p]-p\right\rvert}<\varepsilon$ , as the classification decision is irrelevant there).

D.1.1. Step One

We first show that $\mathbf{E}[v_{\theta}\mid\tilde{\sigma}_{\eta},p]$ is Lipschitz in $p$ uniformly of $\tilde{\sigma}_{\eta}$ , noting that we are restricting to prices where $\tilde{\sigma}_{\eta}(p)>\gamma$ . Note that:

\tilde{\sigma}_{\eta}^{\prime}(p\mid\theta)=\int\phi_{\eta}^{\prime}(p-\tilde{p})\sigma(\tilde{p}\mid\theta)d\tilde{p}\leq\max\phi_{\eta}^{\prime}\mathrel{\mathop{\mathchar 58\relax}}=\overline{\phi^{\prime}}.

Furthermore, we have:

\frac{d}{dp}\mathbf{P}_{\tilde{\sigma}_{\eta}}[\theta\mid p]=\frac{\tilde{\sigma}_{\eta}^{\prime}(p\mid\theta)\mathbf{P}[\theta]}{\sum_{\tilde{\theta}}\tilde{\sigma}_{\eta}(p\mid\tilde{\theta})\mathbf{P}[\tilde{\theta}]}-\frac{\tilde{\sigma}_{\eta}(p\mid\theta)\mathbf{P}[\theta](\sum_{\tilde{\theta}}\sigma_{\eta}^{\prime}(p\mid\tilde{\theta})\mathbf{P}[\tilde{\theta}])}{{(\sum_{\tilde{\theta}}\tilde{\sigma}_{\eta}(p\mid\tilde{\theta})\mathbf{P}[\tilde{\theta}])^{2}}},

so:

\mathinner{\!\left\lvert\frac{d}{dp}\mathbf{P}_{\tilde{\sigma}_{\eta}}[\theta\mid p]\right\rvert}\leq\overline{\phi^{\prime}}\mathbf{P}[\theta]\cdot\left(\frac{1}{\sum_{\tilde{\theta}}\tilde{\sigma}_{\eta}(p\mid\tilde{\theta})\mathbf{P}[\tilde{\theta}]}\right)+\overline{\phi^{\prime}}\left(\frac{\tilde{\sigma}_{\eta}(p\mid\theta)\mathbf{P}[\theta]}{(\sum_{\tilde{\theta}}\tilde{\sigma}_{\eta}(p\mid\tilde{\theta})\mathbf{P}[\tilde{\theta}])^{2}}\right)\leq\overline{\phi^{\prime}}\mathbf{P}[\theta]\left(\frac{1}{\gamma}+\frac{M(\eta)}{\gamma^{2}}\right),

where $M(\eta)$ is a bound on $\tilde{\sigma}_{\eta}(p\mid\theta)\mathbf{P}[\theta]$ , which exists since $\sigma$ and $\phi_{\eta}$ have bounded densities. Hence we see that for all $p\neq p^{*}$ , the conditional probability is uniformly bounded in $p$ , and is hence Lipschitz continuous. Importantly, the bound only depends on $\eta$ and $\gamma$ (and $\mathbf{P}[\theta]$ ), and is therefore uniform over all strategies in the image of the augmentation. Hence we can ensure that Lipschitz continuity is mainted for all prices in the support of $\tilde{\sigma}_{\eta}$ .

In fact, recall that the Lipschitz constant is equal to the $L^{\infty}$ norm of the derivative. Hence Lipschitz continuity depends only on $\gamma$ , $M(\eta)$ and $\overline{\phi_{\eta}^{\prime}}$ , meaning that the Lipschitz constant holds uniformly over the image of the distributions emerging under the algorithm. It follows that the image is uniformly equicontinuous.

D.1.2. Step Two

Note that since $\mathbf{E}[v_{\theta}\mid\sigma,p]$ is continuous on $S=\cup_{\theta}~{}\text{ Supp }~{}\sigma(\cdot\mid\theta)$ , $\mathbf{E}[v_{\theta}\mid\sigma,p]$ is uniformly continuous on any compact $K\subset S$ . Define:

K_{\gamma}=\{p\mathrel{\mathop{\mathchar 58\relax}}\sum_{\theta}\sigma(p\mid\theta)\mathbf{P}[\theta]\geq\gamma\}.

Using that mollifiers converge uniformly on compact sets, we have that $\tilde{\sigma}_{\eta}\rightarrow\sigma$ uniformly on $K_{\gamma}$ . We therefore have that, for any $\tilde{\varepsilon}$ , we can find some $\overline{\eta}$ such that if $\eta<\overline{\eta}$ and $p\in K_{\gamma}$ , then $\mathinner{\!\left\lvert\tilde{\sigma}_{\eta}(p\mid\theta)-\sigma(p\mid\theta)\right\rvert}<\tilde{\varepsilon}$ for all $\theta$ , and $\mathinner{\!\left\lvert\sum_{\theta}\tilde{\sigma}_{\eta}(p\mid\theta)\mathbf{P}[\theta]-\sum_{\theta}\sigma(p\mid\theta)\mathbf{P}[\theta]\right\rvert}<\tilde{\varepsilon}$ .

Furthermore, since $\sigma$ is uniformly continuous on $K_{\gamma}$ , we have:

\mathinner{\!\left\lvert\sigma(p\mid\theta)-\tilde{\sigma}(p^{\prime}\mid\theta)\right\rvert}=\mathinner{\!\left\lvert\int\phi_{\eta}(p^{\prime}-\tilde{p})(\sigma(p\mid\theta)-\sigma(\tilde{p}\mid\theta))d\tilde{p}\right\rvert}\leq\tilde{\varepsilon},

using the uniform continuity of $\sigma$ on $K_{\gamma}$ .

So for any $p\in K_{\gamma}$ , and $\eta$ sufficiently small, we have (letting $\overline{v}=\max_{\theta}v_{\theta}$ ):

	$\displaystyle\mathinner{\!\left\lvert\mathbf{E}[v_{\theta}\mid\sigma,p]-\mathbf{E}[v_{\theta}\mid\tilde{\sigma}_{\eta},p^{\prime}]\right\rvert}$	$\displaystyle=\mathinner{\!\left\lvert\frac{\sum_{\theta}v_{\theta}\sigma(p\mid\theta)\mathbf{P}[\theta]\sum_{\tilde{\theta}}\tilde{\sigma}_{\eta}(p^{\prime}\mid\tilde{\theta})\mathbf{P}[\tilde{\theta}]-\sum_{\theta}v_{\theta}\tilde{\sigma}_{\eta}(p^{\prime}\mid\theta)\mathbf{P}[\theta]\sum_{\tilde{\theta}}\sigma(p\mid\tilde{\theta})\mathbf{P}[\tilde{\theta}]}{\left(\sum_{\theta}\sigma(p\mid\theta)\mathbf{P}[\theta]\right)\left(\sum_{\theta}\tilde{\sigma}_{\eta}(p^{\prime}\mid\theta)\mathbf{P}[\theta]\right)}\right\rvert}$
		$\displaystyle\leq\frac{1}{\sigma(p)\cdot(\gamma-\tilde{\varepsilon})}\biggl{\lvert}\sum_{\theta}v_{\theta}(\sigma(p\mid\theta)-\tilde{\sigma}_{\eta}(p^{\prime}\mid\theta))\mathbf{P}[\theta]\sum_{\tilde{\theta}}\sigma(p\mid\tilde{\theta})\mathbf{P}[\tilde{\theta}]$
		$\displaystyle+\sum_{\theta}v_{\theta}\sigma(p\mid\theta)\mathbf{P}[\theta]\sum_{\tilde{\theta}}(\tilde{\sigma}_{\eta}(p^{\prime}\mid\tilde{\theta})-\sigma(p\mid\tilde{\theta}))\mathbf{P}[\tilde{\theta}]\biggr{\rvert}$
		$\displaystyle\leq\frac{1}{\sigma(p)\cdot(\gamma-\tilde{\varepsilon})}\biggl{(}\overbrace{\biggl{\lvert}\sum_{\theta}v_{\theta}(\sigma(p\mid\theta)-\tilde{\sigma}_{\eta}(p^{\prime}\mid\theta))\sum_{\tilde{\theta}}\sigma(p\mid\tilde{\theta})\mathbf{P}[\tilde{\theta}]\biggr{\rvert}}^{\leq\overline{v}\tilde{\varepsilon}\sigma(p)}$
		$\displaystyle+\overbrace{\biggl{\lvert}\sum_{\theta}v_{\theta}\sigma(p\mid\theta)\mathbf{P}[\theta]\sum_{\tilde{\theta}}(\tilde{\sigma}_{\eta}(p^{\prime}\mid\tilde{\theta})-\sigma(p\mid\tilde{\theta}))\mathbf{P}[\tilde{\theta}]\biggr{\rvert}}^{\leq\overline{v}\cdot\tilde{\varepsilon}\cdot\sigma(p)}\biggr{)}$
		$\displaystyle\leq\frac{2\overline{v}\tilde{\varepsilon}}{\gamma-\tilde{\varepsilon}}.$

The first inequality follows from adding and subtracting $\sum_{\theta}v_{\theta}\sigma(p\mid\theta)\mathbf{P}[\theta]\sum_{\tilde{\theta}}\sigma(p\mid\tilde{\theta})\mathbf{P}[\tilde{\theta}]$ to the numerator inside the absolute value (as well as the lower bound on $\tilde{\sigma}_{\eta}(p)$ ), and the second inequality is from the triangle inequality, and the overbraced expression follows from $v_{\theta}\leq\overline{v}$ and uniform convergence of $\tilde{\sigma}_{\eta}$ to $\sigma$ .

So for any fixed $\gamma$ , we can find some some $\eta$ such that whenever $\eta<\overline{\eta}$ , we can ensure that on $K_{\gamma}$ , $\mathinner{\!\left\lvert\mathbf{E}[v_{\theta}\mid\tilde{\sigma}_{\eta},p]-\mathbf{E}[v_{\theta}\mid\sigma,p]\right\rvert}<\varepsilon^{*}$ , by choosing $\tilde{\varepsilon}$ sufficiently small so that $\frac{2\tilde{\varepsilon}}{\gamma(\gamma-\tilde{\varepsilon})}<\varepsilon^{*}$ . It follows that if the receiver’s classifier converges to a rule that is $\varepsilon$ -optimal under $\tilde{\sigma}_{\eta}$ , it converges to a rule that is $\varepsilon+\varepsilon^{*}$ optimal under $\sigma$ . The probability that this fails to occur is simply the probability that the price is outside of $K_{\gamma}$ , which can be made arbitrarily small by taking $\gamma\rightarrow 0$ , since we can approximate the support of $\sigma$ arbitrarily well.

D.1.3. Step Three

Note that, for an arbitrary continuous distribution $f$ , if $p\sim f$ we have (for any compact $K$ ):

\mathbf{P}_{f}[L_{\gamma}]=\int_{K}\mathbf{1}[p\mathrel{\mathop{\mathchar 58\relax}}f(p)\leq\gamma]f(p)dp\leq\int_{K}\mathbf{1}[p\mathrel{\mathop{\mathchar 58\relax}}f(p)\leq\gamma]\gamma dp\leq\mu(K)\cdot\gamma,

where $\mu$ is Lebesgue measure. It follows that the probability that $p\in L_{\gamma}$ , is small if $\gamma$ is small, and furthermore that this probability can be made small uniformly, using only $\gamma$ .

As shown by the claim above, by taking $\eta$ small, we can ensure that the difference in the conditional expected value is small with high probability. By taking $\gamma$ small, we ensure that the probability of a different outcome due to smoothing goes to 0, implying the result.

Appendix E Proofs for Examples

Proof of Lemma 6.2.

It suffices to show that if $p>v_{L}$ and $\mathbf{E}(v|p)-p\geq 0$ , then the expected profit from $p$ is strictly less than $\pi_{L}v_{L}$ . We write the proof in \citeasnounRubinstein93 for the later reference. For any price $p$ satisfying

\mathbf{P}(H|p)v_{H}+\mathbf{P}(L|p)v_{L}\geq p,

the revenue cannot exceed

\mathbf{P}(H|p)v_{H}+\mathbf{P}(L|p)v_{L}

but the cost is

\mathbf{P}(H|p)(1-r)c_{2}+\mathbf{P}(H|p)rc_{1}.

Thus, the sender’s expected payoff is at most

\mathbf{P}(L|p)v_{L}+\mathbf{P}(H|p)((1-r)(v_{H}-c_{2})+r(v_{H}-c_{1}))

Because of the lemon’s problem,

(1-r)(v_{H}-c_{2})+r(v_{H}-c_{1})<0

and

\mathbf{P}(H|p)>0

to satisfy

\mathbf{P}(H|p)v_{H}+\mathbf{P}(L|p)v_{L}\geq p>v_{L}.

Integrating over $p$ , we conclude that the ex ante profit is strictly less than $\pi_{L}v_{L}$ .

∎

References

[1] \harvarditem[Akerlof]Akerlof1970Akerlof70 Akerlof, G. A. (1970): “The Market for ”Lemons”: Quality Uncertainty and the Market Mechanism,” Quarterly Journal of Economics, 84(3), 488–500.
[2] \harvarditem[Al-Najjar]Al-Najjar2009AlNajjar09 Al-Najjar, N. I. (2009): “Decision Makers as Statisticians: Diversity, Ambiguity and Learning,” Econometrica, 77(5), 1371–1401.
[3] \harvarditem[Al-Najjar and Pai]Al-Najjar and Pai2014AlNajjarandPai2014 Al-Najjar, N. I., and M. M. Pai (2014): “Coarse decision making and overfitting,” J. Economic Theory, 150, 467–486.
[4] \harvarditem[Arora, Dekel, and Tewari]Arora, Dekel, and Tewari2012AroraEtAl2012 Arora, R., O. Dekel, and A. Tewari (2012): “Online Bandit Learning against an Adaptive Adversary: from Regret to Policy Regret,” in Proceedings of the 29th international coference on international conference on machine learning, pp. 1747–1754.
[5] \harvarditem[Arora, Dinitz, Marinov, and Mohri]Arora, Dinitz, Marinov, and Mohri2018AroraEtAl2018 Arora, R., M. Dinitz, T. Marinov, and M. Mohri (2018): “Policy Regret in Repeated Games,” in Proceedings of the 32nd international conference on neural information processing systems.
[6] \harvarditem[Blum, Hajiaghayi, Ligett, and Roth]Blum, Hajiaghayi, Ligett, and Roth2008Blumetal2008 Blum, A., M. Hajiaghayi, K. Ligett, and A. Roth (2008): “Regret minimization and the price of total anarchy,” in Proceedings of the fortieth annual ACM symposium on Theory of computing, pp. 373–382.
[7] \harvarditem[Braverman, Mao, Schneider, and Weinberg]Braverman, Mao, Schneider, and Weinberg2018Bravermanetal2018 Braverman, M., J. Mao, J. Schneider, and M. Weinberg (2018): “Selling to a No-Regret Buyer,” in ACM Conf. on ACM Conference on Economics and Computation (ACM EC), pp. 523–538.
[8] \harvarditem[Camara, Hartline, and Johnsen]Camara, Hartline, and Johnsen2020CamaraEtAl2020 Camara, M., J. Hartline, and A. Johnsen (2020): “Mechanisms for a No-Regret Agent: Beyond the Common Prior,” FOCS.
[9] \harvarditem[Cherry and Salant]Cherry and Salant2019CherryandSalant2019 Cherry, J., and Y. Salant (2019): “Statistical Inference in Games,” Northwestern University.
[10] \harvarditem[Dembo and Zeitouni]Dembo and Zeitouni1998DemboandZeitouni98 Dembo, A., and O. Zeitouni (1998): Large Deviations Techniques and Applications. Springer-Verlag, New York, 2nd edn.
[11] \harvarditem[Deng, Schneider, and Sivan]Deng, Schneider, and Sivan2019Dengetal2019 Deng, Y., J. Schneider, and B. Sivan (2019): “Strategizing against No-regret Learners,” Discussion paper.
[12] \harvarditem[Dietterich]Dietterich2000Dietterich00 Dietterich, T. G. (2000): “Ensemble Methods in Machine Learning,” in Multiple Classifier Systems, pp. 1–15, Berlin, Heidelberg. Springer Berlin Heidelberg.
[13] \harvarditem[Eliaz and Spiegler]Eliaz and SpieglerForthcomingEliazandSpiegler2018 Eliaz, K., and R. Spiegler (Forthcoming): “The Model Selection Curse,” American Economic Review: Insights.
[14] \harvarditem[Esponda and Pouzo]Esponda and Pouzo2014EspondaandPouzo14 Esponda, I., and D. Pouzo (2014): “An Equilibrium Framework for Players with Misspecified Models,” University of Washington and University of California, Berkeley.
[15] \harvarditem[Fudenberg and He]Fudenberg and He2018FudenbergandHe2018 Fudenberg, D., and K. He (2018): “Learning and Type Compatibility in Signaling Games,” Econometrica, 86(4), 1215–1255.
[16] \harvarditem[Fudenberg and Kreps]Fudenberg and Kreps1995FudenbergandKreps1995 Fudenberg, D., and D. M. Kreps (1995): “Learning in Extensive Form Games I: Self-confirming Equilibria,” Journal of Economic Theory, 8(1), 20–55.
[17] \harvarditem[Fudenberg and Levine]Fudenberg and Levine1993FudenbergandLevine1993 Fudenberg, D., and D. K. Levine (1993): “Steady State Learning and Nash Equilibrium,” Econometrica, 61(3), 547–573.
[18] \harvarditem[Fudenberg and Levine]Fudenberg and Levine2006FudenbergandLevine06 (2006): “Superstition and Rational Learning,” American Economic Reivew, 96, 630–651.
[19] \harvarditem[Gilboa and Samet]Gilboa and Samet1989GilboaSamet1989 Gilboa, I., and D. Samet (1989): “Bounded versus Unbounded Rationality: The Tyrrany of the Weak,” Games and Economic Behavior, 1, 213–221.
[20] \harvarditem[Kamenica and Gentzkow]Kamenica and Gentzkow2011KamenicaGentzkow2011 Kamenica, E., and M. Gentzkow (2011): “Bayesian Persuasion,” American Economic Reivew, 101(6), 2590–2615.
[21] \harvarditem[Liang]Liang2018Liang2018 Liang, A. (2018): “Games of Incomplete Information Played by Statisticians,” Discussion paper, University of Pennsylvania.
[22] \harvarditem[Maskin and Tirole]Maskin and Tirole1992MaskinTirole92 Maskin, E., and J. Tirole (1992): “ The Principal-Agent Relationship with an Informed Principal, II: Common Values,” Econometrica, 60(1), 1–42.
[23] \harvarditem[Meyn]Meyn2007Meyn07 Meyn, S. P. (2007): Control Techniques for Complex Networks. Cambridge University Press.
[24] \harvarditem[Mukherjee and Schapire]Mukherjee and Schapire2013MukherjeeSchapire2013 Mukherjee, I., and R. E. Schapire (2013): “A Theory of Multiclass Boosting,” Journal of Machine Learning Research, 14, 437–497.
[25] \harvarditem[Nekipelov, Syrgkanis, and Tardos]Nekipelov, Syrgkanis, and Tardos2015Nekipelovetal2015 Nekipelov, D., V. Syrgkanis, and E. Tardos (2015): “Econometrics for Learning Agents,” in Proceedings of the Sixteenth ACM Conference on Economics and Computation, pp. 1–18.
[26] \harvarditem[Olea, Ortoleva, Pai, and Prat]Olea, Ortoleva, Pai, and Prat2019OleaOrtolevaPaiPrat2019 Olea, J. L. M., P. Ortoleva, M. M. Pai, and A. Prat (2019): “Competing Models,” Columbia University, Princeton University and Rice University.
[27] \harvarditem[Rambachan, Kleinberg, Mullainathan, and Ludwig]Rambachan, Kleinberg, Mullainathan, and Ludwig2020Rambachanetal2020 Rambachan, A., J. Kleinberg, S. Mullainathan, and J. Ludwig (2020): “An Economic Approach to Regulating Algorithms,” Discussion paper, Harvard Universitiy, Cornell University, and University of Chicago.
[28] \harvarditem[Rubinstein]Rubinstein1993Rubinstein93 Rubinstein, A. (1993): “On Price Recognition and Computational Complexity in a Monopolistic Model,” Journal of Political Economy, 101(3), 473–484.
[29] \harvarditem[Schapire and Freund]Schapire and Freund2012SchapireandFreund12 Schapire, R. E., and Y. Freund (2012): Boosting: Foundations and Algorithms. MIT Press.
[30] \harvarditem[Shalev-Shwartz and Ben-David]Shalev-Shwartz and Ben-David2014Shalev-ShwartzandBen-David14 Shalev-Shwartz, S., and S. Ben-David (2014): Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.
[31] \harvarditem[Spence]Spence1973Spence73 Spence, A. M. (1973): “Job Market Signaling,” Quarterly Journal of Economics, 87(3), 355–374.
[32] \harvarditem[Spiegler]Spiegler2016Spiegler2016 Spiegler, R. (2016): “ Bayesian Networks and Boundedly Rational Expectations *,” The Quarterly Journal of Economics, 131(3), 1243–1290.
[33] \harvarditem[Zhao, Ke, Wang, and Hsieh]Zhao, Ke, Wang, and Hsieh2020Zhaoetal2020 Zhao, C., S. Ke, Z. Wang, and S.-L. Hsieh (2020): “Behavioral Neural Networks,” Discussion paper.
[34]

Machine Learning for Strategic Inference

Abstract.

1. Introduction

2. Literature

3. (Sender-Receiver) Stage Games

3.1. Actions and Parameters

3.2. Payoffs and the Rational Benchmark

3.3. Examples of Stage Games

3.3.1. Insurance

3.3.2. Labor Market Signaling

3.3.3. Monopoly Market

3.4. Introducing Time

4. Algorithm game

4.1. Choices of Algorithms

Definition 4.1.

Definition 4.2.

Definition 4.3.

4.2. Timing and Objectives

Definition 4.4.

Definition 4.5.

Definition 4.6.

4.3. Specifying 𝒯\mathcal{T}

Definition 4.7.

Remark 4.8.

5. Main Results

Proposition 5.1.

Proof.

Proposition 5.2.

Proposition 5.3.

5.1. Specifying the Algorithm and Weak Learnability (Proposition 5.2)

5.1.1. Weak Learnability

Definition 5.4.

Definition 5.5.

Proposition 5.6.

Proof.

5.1.2. From Weak Learnability to Decision Rules

5.2. Inferring the Rational Label (Proposition 5.3)

Definition 5.7.

Definition 5.8.

Lemma 5.9.

Proposition 5.10.

Proof.

5.3. Discussion

5.3.1. Accommodating Multiple Actions

5.3.2. On the Necessity of Expanding ℋ\mathcal{H}

Proposition 5.11.

Proposition 5.12.

5.3.3. Accommodating Richer Principal Action Spaces

Proposition 5.13.

6. Review of Examples

6.1. Informed Principal

Lemma 6.1.

Proof.

6.2. Labor Market Signaling

6.2.1. Monopoly Market

Lemma 6.2.

Proof.

Proposition 6.3.

7. Conclusion

Appendix A Weak Learnability Proofs

Lemma A.1.

Proof.

Proof of Proposition 5.6.

Appendix B Specifying the Algorithm Parameters and the Proof of Proposition 5.10.

B.1. Convergence of τA\tau_{A}

B.1.1. The |A|=2\mathinner{\!\left\lvert A\right\rvert}=2 case

B.2. The |A|>2\mathinner{\!\left\lvert A\right\rvert}>2 case

B.3. Convergence of τA^\tau_{\hat{A}}

Appendix C Proofs for Section 5.3.2

Proof of Proposition 5.11.

Proof of Proposition 5.12.

Appendix D Proofs for Section 5.3.3

D.1. Proof of Proposition 5.13

D.1.1. Step One

D.1.2. Step Two

D.1.3. Step Three

Appendix E Proofs for Examples

Proof of Lemma 6.2.

References

4.3. Specifying $\mathcal{T}$

5.3.2. On the Necessity of Expanding $\mathcal{H}$

B.1. Convergence of $\tau_{A}$

B.1.1. The $\mathinner{\!\left\lvert A\right\rvert}=2$ case

B.2. The $\mathinner{\!\left\lvert A\right\rvert}>2$ case

B.3. Convergence of $\tau_{\hat{A}}$