

CORRESPONDING AUTHOR: Emiliano Dall’Anese (e-mail: [email protected]). This work was supported by the National Science Foundation (NSF) awards 1941896 and 2044946, and by the NSF Mathematical Sciences Graduate Internship Program.

Solving Decision-Dependent Games by Learning from Feedback

Killian Wood (1), Ahmed Zamzam (3), Emiliano Dall’Anese (1,2)
(1) Department of Applied Mathematics, University of Colorado Boulder, CO 80309 USA
(2) Department of Electrical, Computer, and Energy Engineering, University of Colorado Boulder, CO 80309 USA
(3) National Renewable Energy Laboratory, Golden, CO 80523 USA
Abstract

This paper tackles the problem of solving stochastic optimization problems with a decision-dependent distribution in the setting of stochastic strongly-monotone games and when the distributional dependence is unknown. A two-stage approach is proposed, which initially involves estimating the distributional dependence on decision variables, and subsequently optimizing over the estimated distributional map. The paper presents guarantees for the approximation of the cost of each agent. Furthermore, a stochastic gradient-based algorithm is developed and analyzed for finding the Nash equilibrium in a distributed fashion. Numerical simulations are provided for a novel electric vehicle charging market formulation using real-world data.

Index Terms

Optimization, Stochastic monotone games, Decision-dependent distribution, Learning.

1 INTRODUCTION

The efficacy of stochastic optimization [nemirovskistochastic] and stochastic games [lei2022stochastic, koshal2010single, liu2011stochastic, lei2020synchronous, franci2021stochastic] generally hinges on the premise that the underlying data distribution is stationary. This means that the distribution of the data, which parameterize the problem or the game, does not change throughout the execution of the algorithm used to solve the stochastic problem or game, and is neither influenced by nor dependent on time or the optimization variables themselves. This is a common setup that has been considered when game-theoretic frameworks have been applied to problems in, for example, ride hailing [fabiani2022stochastic], routing [bakhshayesh2021decentralized], charging of electric vehicles (EVs) [fele2019scenario, paccagnan2018nash], power markets [kannan2011strategic], power systems [zhou2019online], and in several approaches for training of neural networks [franci2021training]. However, this assumption can be invalid in a variety of setups in which the cost to be minimized is parameterized by data that is received from populations or a collection of automated control systems, whose response is uncertain and depends on the output of the optimization problem itself. As an example, in a competitive market for EV charging [fele2019scenario, tushar2012economics], the operators seek to find the charging prices (i.e., the optimization variables) to maximize the revenue from EVs; however, the expected demand (i.e., the “data” of the problem) is indeed dependent on the price itself. More broadly, power consumption in power distribution grids depends on electricity prices [mathieu2011examining]. A similar example pertains to ride hailing [fabiani2022stochastic].

To accommodate this scenario, the so-called stochastic optimization with decision-dependent distributions (also known as performative prediction [perdomo2020performative]) posits that we instead represent the data distribution used in optimization as a distributional map x\mapsto D(x), where x collects the decision variables [drusvyatskiy2022stochastic, miller2021outside, narang2022learning, perdomo2020performative, wood2023stochastic]. In this work, we study decision-dependent stochastic games in which each player seeks to minimize their cost (a function of their own optimization variables) given the other players’ optimization variables, and where the data distribution of each player depends on the actions of all players (we will use the terms player and agent interchangeably).

We focus on solving the Nash equilibrium problem of a game, which is to find a decision from which no agent is incentivized by their own cost to deviate when it is played. Formally, the stochastic Nash equilibrium problem with decision-dependent distributions considered in this paper is to find a point x^{*}=(x_{1}^{*},\ldots,x_{n}^{*})\in\mathbb{R}^{d} such that

x_{i}^{*}\in\operatorname*{arg\,min}_{x_{i}\in{\cal X}_{i}}F_{i}(x_{i},x_{-i}^{*}),\quad\forall\,i\in\{1,\ldots,n\} (1)

with F_{i}(x_{i},x_{-i}^{*}) defined as:

F_{i}(x_{i},x_{-i}^{*}):=\underset{z_{i}\sim D_{i}(x_{i},x_{-i}^{*})}{{\mathbb{E}}}f_{i}(x_{i},x_{-i}^{*},z_{i}) (2)

where: z_{i} denotes a random variable supported on \mathbb{R}^{k_{i}}; f_{i}:\mathbb{R}^{d}\times\mathbb{R}^{k_{i}}\rightarrow\mathbb{R} is a scalar-valued function that is convex and continuously differentiable in x_{i}; {\cal X}_{i}\subseteq\mathbb{R}^{d_{i}} is a compact convex set; and D_{i}:\mathbb{R}^{d}\rightarrow{\cal P}(\mathbb{R}^{k_{i}}) is a distributional map whose output is a probability distribution supported on \mathbb{R}^{k_{i}}.

Standard stochastic first-order methods are insufficient for solving problems of this form. As we will demonstrate later in the paper, even estimating the expected gradient from samples requires knowledge of the probability density function associated with D_{i}—which is not available in a majority of practical applications.

Hereafter, we use the term “system” to refer to a population or a collection of automated controllers producing a response z_{i}\in\mathbb{R}^{k_{i}} upon observing x. To illustrate our setup, consider again the example where each agent represents an EV charging provider. Here, x_{i}\in\mathbb{R}^{d_{i}} represents the charging price at a station managed by provider i, expressed in $/kWh. Correspondingly, z_{i} indicates the demand for the service at that price, while f_{i} is the service cost (or the negative of the total profit) for provider i. This is an example of a competitive market in which the demand for service is a function of the prices of all providers; see, for example, the game-theoretic approaches presented in [mediwaththe2018game, fele2019scenario] and the Stackelberg game presented in [tushar2012economics]. However, compared to existing game-theoretic models for EV markets, the framework proposed in this paper allows for an uncertain response of EV owners to price variations; this randomness is difficult to model, as it is related to the drivers’ preferences and other externalities such as the locations of the charging stations, as explained in, e.g., [mediwaththe2018game, latinopoulos2017response, daina2017electric].

Challenges in solving problems of this form typically stem from the fact that the distributional maps D_{i} are often unknown [sun2014bayesian, den2015dynamic, cheung2017dynamic, chen2021nonparametric]. To overcome this challenge, we propose a learning-based optimization procedure – in the spirit of the methods proposed for convex optimization in [lin2023plug, miller2021outside] – to tackle the multi-player decision-dependent stochastic game. The key idea behind this framework is that we first propose a parameterization for the distributional map of the system and estimate it from responses. Then, we use the estimated distributional map throughout the game without requiring further interaction with the system.

1.1 RELATED WORK

Our work incorporates themes from games, learning, as well as stochastic optimization with decision-dependent distributions. We highlight the relationship with this relevant literature below.

Games. Within the context of games, our work is specifically focused on solving Nash equilibrium problems using gradient-based methods and a variational inequality (VI) framework. The literature on stochastic games is extensive; for a comprehensive yet concise review of the subject, we refer the reader to the tutorials [scutari2010convex] and [lei2022stochastic]; see also pertinent references therein. A common denominator of existing frameworks is that the data distribution is stationary. The work of [rosen1965existence] demonstrates that strictly monotone games have unique solutions and that gradient play converges to it. The modern approach of solving Nash equilibrium problems for continuous games via variational inequalities can be attributed to Facchinei and Pang [facchinei2003finite, harker1990finite]. For solving strongly monotone variational inequalities, the projected gradient method is capable of converging linearly.

Our work adds the additional complexity of minimizing communication between agents and hence we use a distributed gradient approach in our optimization algorithm. Distributed gradient methods have been explored extensively in the literature on convex optimization, though less so in that of variational inequalities. We refer the reader to [nedic2020distributed] for a review in the convex optimization setting, and [kovalev2022optimal] for variational inequalities.

Decision-Dependent Data. This paper contributes to the growing body of literature that studies stochastic optimization with decision-dependent data distributions. While the concept of decision-dependent distributions has existed for some time, its modern treatment can be traced to the “Performative Prediction” framework and its use within the machine learning community [perdomo2020performative]; this work posits the formulation of optimization problems in which the data distribution is explicitly dependent on the optimization variables, and proposes repeated retraining (and the limit points thereof) as a solution. The limit points form a family of points that solve the induced stationary optimization problem. For repeated retraining, convergence of various stochastic gradient algorithms is studied in [drusvyatskiy2022stochastic, mendler2020stochastic] in the batch setting, and in the time-varying setting in [cutler2021stochastic, wood2021online]. The extension to games includes two-player zero-sum games in [wood2021online], and general multiplayer games in [narang2022learning]. Additional recent extensions to this line of work include distributionally robust optimization [basciftci2021distributionally, luo2020distributionally] and time-varying optimization [wood2021online, cutler2021stochastic].

Recent works have since pivoted towards directly solving the optimization problem by incorporating a model for the distributional map. In [miller2021outside], the authors provide conditions that ensure convexity of the expected cost, alongside a learning-based algorithm for finding solutions to problems with location-scale families. They show that location-scale families can be learned from the system prior to optimization, without requiring further interaction with the system during optimization. The work of [lin2023plug] extends this learning framework beyond location-scale families and derives regret bounds that incorporate the associated approximation and statistical errors arising from the statistical learning problem. Outside of the distribution learning approach, derivative-free methods have been studied in [miller2021outside, narang2022learning, wood2023stochastic]. While these methods do not depend on a learned model of the distributional map, they generally exhibit slower convergence rates. On the other hand, adaptive learning strategies, particularly for location-scale models, have shown promise, as discussed in [narang2022learning], whereby the model is learned during optimization. However, their exploration remains limited to this specific model category.

This work complements the technical findings in [lin2023plug] by offering a generalization of the so-called “plug-in” approach presented in that work to the multi-agent setting. Relative to [narang2022learning], we focus on solving the Nash equilibrium problem for a general class of models by learning the distributional map prior to optimization—thereby reducing the number of interactions with the system required to obtain a reasonable solution.

1.2 CONTRIBUTIONS

In this work, we provide the following contributions to the body of work on stochastic optimization and Nash equilibrium problems with decision-dependent distributions.

  • (i)

    We propose an algorithm for finding a Nash equilibrium in stochastic games with decision-dependent distributions where: (i) the distributional map for each player’s cost is estimated from samples, and (ii) the estimated distributional map is used in gradient-based strategies.

  • (ii)

    We provide guarantees on the approximation error of distributional maps for a class of map learning problems.

  • (iii)

    We show that the parameterized cost approximates the ground-truth in high-probability.

  • (iv)

    We propose a stochastic gradient-based algorithm for solving a parameterized strongly-monotone game, and we demonstrate linear convergence in expectation.

  • (v)

    Finally, we provide numerical simulations of an EV charging market formulation using real-world data. The EV market formulation is new in the context of energy markets, thus providing contributions in this area.

1.3 ORGANIZATION

In Section 2, we provide necessary notation and background for our analysis. In Section 3, we discuss the proposed learning algorithm in detail and present our primary result. Section 4 discusses the details of the optimization stage. We provide our numerical simulations in Section 5. Proofs of the results are provided in the Appendix.

2 Notation and Preliminaries

Throughout the paper, d\mathbb{R}^{d} denotes the dd-dimensional Euclidean space with inner product ,\langle\cdot,\cdot\rangle, and Euclidean norm \|\cdot\|. For a matrix Xn×mX\in\mathbb{R}^{n\times m}, X\|X\| denotes the spectral norm. For a given integer nn, [n][n] denotes the set {1,2,,n}\{1,2,\ldots,n\} and 𝒮n1\mathcal{S}^{n-1} denotes the Euclidean hypersphere in nn dimensions, {xn|x2=1}\{x\in\mathbb{R}^{n}|\ \|x\|_{2}=1\}. The symbol 𝟙d{\bf\mathds{1}}_{d} is used to denote the dd-dimensional vector of all ones. Given vectors xnx\in\mathbb{R}^{n} and zmz\in\mathbb{R}^{m}, we let (x,z)n+m(x,z)\in\mathbb{R}^{n+m} denote their concatenation.

For a symmetric positive definite matrix W\in\mathbb{R}^{d\times d}, the weighted inner product is defined by \langle x,y\rangle_{W}=\langle x,Wy\rangle, with corresponding weighted norm \|x\|_{W}=\sqrt{\langle x,x\rangle_{W}}, for any x,y\in\mathbb{R}^{d}. The weighted projection onto a set {\cal X}\subseteq\mathbb{R}^{d} with respect to the symmetric positive definite matrix W\in\mathbb{R}^{d\times d} is given by the map

\mathsf{proj}_{{\cal X},W}(x):=\operatorname*{arg\,min}_{y\in{\cal X}}\frac{1}{2}\|x-y\|_{W}^{2} (3)

for any x\in\mathbb{R}^{d}.
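For example, when W is diagonal and {\cal X} is a box (a product of intervals), the minimization in (3) separates across coordinates, so the weighted projection reduces to the ordinary coordinate-wise projection; this observation is useful when implementing the per-agent updates of Section 4.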

2.1 Probability measures

Throughout this work, we restrict our focus to random variables drawn from continuous probability distributions supported over the Euclidean space. When random variables X,YkX,Y\in\mathbb{R}^{k} are equal in distribution, i.e., P(Xx)=P(Yx)P(X\leq x)=P(Y\leq x) for all xkx\in\mathbb{R}^{k}, we write X=𝑑YX\overset{d}{=}Y.

Our analysis includes study of sub-exponential random vectors. A univariate random variable XX\in\mathbb{R} is said to be sub-exponential with modulus θ>0\theta>0 provided that the survival function satisfies (|X|t)2exp(t/θ){\mathbb{P}}(|X|\geq t)\leq 2\exp(-t/\theta) for all t0t\geq 0. By extension, a random vector XkX\in\mathbb{R}^{k} is sub-exponential provided that u,X\langle u,X\rangle is a sub-exponential random variable for all u𝒮k1u\in{\cal S}^{k-1}.

To compare probability distributions, we will be interested in computing the distance between their associated probability measures—for which we need a complete metric space. We let 𝒫(k)\mathcal{P}(\mathbb{R}^{k}) denote the set of finite first moment probability measures supported on k\mathbb{R}^{k} and write the Wasserstein-11 distance as

W_{1}(\mu,\nu)=\sup_{h\in\mathcal{L}_{1}}\bigg\{{\mathbb{E}}_{X\sim\mu}[h(X)]-{\mathbb{E}}_{Y\sim\nu}[h(Y)]\bigg\}

for any μ,ν𝒫(k)\mu,\nu\in\mathcal{P}(\mathbb{R}^{k}), where 1\mathcal{L}_{1} is the set of all 11-Lipschitz continuous functions h:kh:\mathbb{R}^{k}\rightarrow\mathbb{R}. Under these conditions, the set (𝒫(k),W1)(\mathcal{P}(\mathbb{R}^{k}),W_{1}) forms a complete metric space [bogachev2012monge].
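As a simple illustration (useful later for location-scale models): if \mu and \nu are the distributions of \xi+a and \xi+b for a common random vector \xi and fixed vectors a,b\in\mathbb{R}^{k}, then coupling the two distributions through the same \xi gives W_{1}(\mu,\nu)\leq{\mathbb{E}}\|(\xi+a)-(\xi+b)\|=\|a-b\|.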

2.2 Games

We consider a game that consists of nn players. Each player has a cost function FiF_{i}, distributional map DiD_{i}, and decision set 𝒳idi\mathcal{X}_{i}\subseteq\mathbb{R}^{d_{i}}. Hence, each player chooses a decision, or strategy xi𝒳idix_{i}\in\mathcal{X}_{i}\subseteq\mathbb{R}^{d_{i}}. The concatenation of the decision variables is written as x=(x1,,xn)𝒳dx=(x_{1},\ldots,x_{n})\in\mathcal{X}\subseteq\mathbb{R}^{d} where 𝒳=i=1n𝒳i\mathcal{X}=\prod_{i=1}^{n}\mathcal{X}_{i} and d=i=1ndid=\sum_{i=1}^{n}d_{i}. For a fixed agent ii, we will decompose the decision xx as x=(xi,xi)x=(x_{i},x_{-i}) where xiddix_{-i}\in\mathbb{R}^{d-d_{i}} is the strategy vector of all agents excluding the iith one.

The collection of costs FiF_{i} and decision sets 𝒳i{\cal X}_{i} defines the game

minxi𝒳iFi(xi,xi),i[n].\min_{x_{i}\in\mathcal{X}_{i}}F_{i}(x_{i},x_{-i}),\,\,\,\,\ i\in[n]. (4)

A Nash equilibrium of this game is a point x𝒳x^{*}\in\mathcal{X} provided that

xiargminxi𝒳iFi(xi,xi)x^{*}_{i}\in\operatorname*{arg\,min}_{x_{i}\in{\cal X}_{i}}F_{i}(x_{i},x_{-i}^{*}) (5)

for all i[n]i\in[n]. Intuitively, xx^{*} is a strategy such that no agent can be incentivized by its cost to deviate from xix_{i}^{*} when all other agents play xix_{-i}^{*}. Finding Nash equilibria is the primary focus of this work.

Games of this form are commonly cast into a variational inequality framework. This is due, in part, to the observation that the Nash equilibria x𝒳x^{*}\in{\cal X} are the solutions to the variational inequality

xx,G(x)0,x𝒳,\langle x-x^{*},G(x^{*})\rangle\geq 0,\,\,\,\,\forall\,x\in{\cal X},

where the gradient map G:ddG:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d} is defined as

G(x)=(1F1(x),,nFn(x)).G(x)=\left(\nabla_{1}F_{1}(x),\ldots,\nabla_{n}F_{n}(x)\right). (6)

Here, the notation i\nabla_{i} is used to represent the partial gradient xi\nabla_{x_{i}}. We will denote the set of Nash equilibrium of a game with gradient map GG and domain 𝒳{\cal X} as Nash(G,𝒳)\texttt{Nash}(G,{\cal X}). Existence of solutions to variational inequalities of this form is guaranteed provided that the set 𝒳{\cal X} is convex and compact and the gradient map GG is monotone; uniqueness is guaranteed when GG is strongly-monotone [facchinei2003finite]. We say that GG is α\alpha-strongly-monotone on 𝒳{\cal X} provided that there exists α>0\alpha>0 such that

xy,G(x)G(y)αxy2,x,y𝒳,\langle x-y,G(x)-G(y)\rangle\geq\alpha\|x-y\|^{2},\,\,\,\forall\,x,y\in{\cal X}, (7)

and monotone when α=0\alpha=0. In this work, we primarily focus on strongly-monotone games. While monotone games are tractable, methods for solving them with decision-dependent distributions require alternative gradient estimators—a topic we leave to future work.

2.3 Monotonicity in Decision-Dependent Games

In this work, we introduce the additional complexity to the formulation in (4) that the FiF_{i}’s are the expected cost over a distributional map Di:d𝒫(ki)D_{i}:\mathbb{R}^{d}\rightarrow{\cal P}(\mathbb{R}^{k_{i}}). In particular, we write the cost as

Fi(xi,xi):=𝔼ziDi(xi,xi)fi(xi,xi,zi).F_{i}(x_{i},x_{-i}):=\underset{z_{i}\sim D_{i}(x_{i},x_{-i})}{{\mathbb{E}}}f_{i}(x_{i},x_{-i},z_{i}). (8)

This can be written alternatively as the integral

Fi(x)=kifi(x,zi)pi(zi,x)dziF_{i}(x)=\int_{\mathbb{R}^{k_{i}}}f_{i}(x,z_{i})p_{i}(z_{i},x)\text{d}z_{i} (9)

where pip_{i} is the probability density function for the distribution Di(x)D_{i}(x). When the integral satisfies the Dominated Convergence Theorem, computing the gradient amounts to differentiating under the integral and using the product rule. We then obtain

iFi(x)=𝔼ziDi(x)[xifi(x,zi)+fi(x,zi)ilogpi(x;zi)],\nabla_{i}F_{i}(x)=\underset{z_{i}\sim D_{i}(x)}{{\mathbb{E}}}\left[\nabla_{x_{i}}f_{i}(x,z_{i})+f_{i}(x,z_{i})\nabla_{i}\log p_{i}(x;z_{i})\right], (10)

where we recall that G(x)=(1F1(x),,nFn(x))G(x)=\left(\nabla_{1}F_{1}(x),\ldots,\nabla_{n}F_{n}(x)\right). In short, characterizing the gradient of this decision-dependent game requires assumptions not only on fif_{i}, but also on the properties of the distributional map DiD_{i}. Sufficient conditions for strong monotonicity of the game in (4) are due to [narang2022learning] and are stated in terms of the decoupled costs, given by

Fi(x,y)=𝔼ziDi(y)fi(x,zi)F_{i}(x,y)=\underset{z_{i}\sim D_{i}(y)}{{\mathbb{E}}}f_{i}(x,z_{i}) (11)

for all x,ydx,y\in\mathbb{R}^{d}, and their associated decoupled partial gradients

Gi(x,y)=𝔼ziDi(y)ifi(x,zi),G_{i}(x,y)=\underset{z_{i}\sim D_{i}(y)}{{\mathbb{E}}}\nabla_{i}f_{i}(x,z_{i}), (12)

for all x,ydx,y\in\mathbb{R}^{d} and

Hi(x,y)=yi𝔼ziDi(y)fi(x,zi)H_{i}(x,y)=\nabla_{y_{i}}\underset{z_{i}\sim D_{i}(y)}{{\mathbb{E}}}f_{i}(x,z_{i}) (13)

for all x,ydx,y\in\mathbb{R}^{d}. A key observation used in the proof is that Gi(x)=iFi(x)=Gi(x,x)+Hi(x,x)G_{i}(x)=\nabla_{i}F_{i}(x)=G_{i}(x,x)+H_{i}(x,x).

Theorem 1.

(Strong Monotonicity, [narang2022learning]) Suppose that,

  • (i)

    For all y𝒳y\in{\cal X}, xG(x,y)x\mapsto G(x,y) is λ\lambda-strongly monotone,

  • (ii)

    For all x𝒳x\in{\cal X}, yH(x,y)y\mapsto H(x,y) is monotone,

and that for all i[n]i\in[n],

  • (iii)

    For all x𝒳,ziifi(x,zi)x\in{\cal X},z_{i}\mapsto\nabla_{i}f_{i}(x,z_{i}) is LiL_{i}-Lipschitz continuous,

  • (iv)

y\mapsto D_{i}(y) is \gamma_{i}-Lipschitz continuous on ({\cal P}(\mathbb{R}^{k_{i}}),W_{1}).

Set κ=i=1n(γiLiλ)2\kappa=\sqrt{\sum_{i=1}^{n}(\frac{\gamma_{i}L_{i}}{\lambda})^{2}}. Then if κ<1/2\kappa<1/2, xG(x)x\mapsto G(x) is α=(12κ)λ\alpha=(1-2\kappa)\lambda-strongly monotone. \Box
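As a numerical illustration (with hypothetical constants, not taken from the paper): for n=2 players with \gamma_{i}L_{i}/\lambda=0.3 for both, we get \kappa=\sqrt{2\cdot 0.3^{2}}\approx 0.42<1/2, so x\mapsto G(x) is \alpha-strongly monotone with \alpha=(1-2\kappa)\lambda\approx 0.15\lambda.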

3 Learning-based Decision-Dependent Games

In this work, we aim to solve the stochastic Nash equilibrium problem with decision-dependent data distributions as formulated in (1). Methods for finding Nash equilibria in games with decision-dependent data distributions either use derivative-free optimization, at the expense of an extremely slow rate, or use derivative information in conjunction with a learned model of the distributional map [narang2022learning].

In [lin2023plug], it is shown that a “plug-in” optimization approach, whereby a model for the distributional map is learned from samples prior to optimization, yields a bounded excess risk for convex optimization problems with decision-dependent data. In this work, we leverage the properties of the system to simplify the communication structure of our approach, which we depict in Figure 1. We assume that realizations of z_{i} can be directly observed from the system, and the decisions x_{-i} can be obtained from a server or are made public (for example, the prices of EV charging of different providers can be observed at the various stations).

Input: m, \{D_{x_{i}}\}_{i=1}^{n}
for j\in[m] do
       for i\in[n] do
             Draw x_{i}^{(j)}\sim D_{x_{i}} ;
       end for
       Deploy x^{(j)} ;
       Observe z_{i}^{(j)}\sim D_{i}(x^{(j)}) ;
end for
for i\in[n] do
       Fit \widehat{\beta}_{i}\in\operatorname*{arg\,min}_{\beta_{i}\in{\cal B}_{i}}\frac{1}{m}\sum_{j=1}^{m}R_{i}(x^{(j)},z_{i}^{(j)},\beta_{i}) ;
end for
Compute \widehat{x}\in\texttt{Nash}(G_{\widehat{\beta}},{\cal X}) ;
Algorithm 1 Multi-phase Optimization

To accommodate this setting, we propose a multi-stage approach consisting of the following phases: (i) sampling; (ii) learning; (iii) optimization. It is important to note that, following the learning phase, players only need to participate in gradient play without receiving any additional feedback from the system in the form of z_{i}\sim D_{i}(x). This is distinct from existing approaches in which performatively stable points can only be reached after several (even thousands of) rounds of feedback [perdomo2020performative, narang2022learning, wood2023stochastic], and performatively optimal points can only be reached for models known to be location-scale families a priori [miller2021outside, narang2022learning].

Figure 1: Communication structure: agents interact with the system (the square block) by sending their decisions x_{i}. After deploying, agents can receive feedback from the system in the form of the other agents’ decisions x_{-i} and data z_{i}.

Sampling. In the sampling phase, we require that players collaborate by each deploying a set of decisions \{x_{i}^{(j)}\}_{j=1}^{m}\overset{i.i.d.}{\sim}D_{x_{i}} so that they can collectively receive feedback z_{i}^{(j)}\sim D_{i}(x^{(j)}) from the system (in response to their deployed decisions \{x_{i}^{(j)}\}_{j=1}^{m}). The result is that each agent has access to a dataset \{x^{(j)},z_{i}^{(j)}\}_{j=1}^{m}, which they can use to learn their distributional map D_{i}.

Learning. In this procedure, each player will choose a hypothesis class of parameterized functions

i={Dβi|βiii},{\cal H}_{{\cal B}_{i}}=\left\{D_{\beta_{i}}|\ \beta_{i}\in{\cal B}_{i}\subseteq\mathbb{R}^{\ell_{i}}\right\}, (14)

as well as a suitable criterion or risk function RiR_{i}, to formulate their own expected risk minimization problem

βiargminβii𝔼xDx,ziDi(x)Ri(x,zi,βi)\beta^{*}_{i}\in\operatorname*{arg\,min}_{\beta_{i}\in{\cal B}_{i}}\underset{x\sim D_{x},\\ z_{i}\sim D_{i}(x)}{{\mathbb{E}}}R_{i}(x,z_{i},\beta_{i}) (15)

over the random variable (x,zi)(x,z_{i}) drawn from the coupled distribution (Dx,Di(x))(D_{x},D_{i}(x)). Then, using the set of samples from the previous sampling phase, they can formulate the corresponding empirical risk minimization problem

β^iargminβii1mj=1mRi(x(j),zi(j),βi).\widehat{\beta}_{i}\in\operatorname*{arg\,min}_{\beta_{i}\in{\cal B}_{i}}\frac{1}{m}\sum_{j=1}^{m}R_{i}(x^{(j)},z_{i}^{(j)},\beta_{i}). (16)

The result is a learned distributional map Dβ^iD_{\widehat{\beta}_{i}} approximating DiD_{i}, which we can now use to solve the approximate Nash equilibrium problem.

Optimization. Following the approximation phase, each player now has a learned model of their distributional map D_{\widehat{\beta}_{i}}, which can be used to formulate an approximation of the ground-truth cost F_{i} and hence an approximate Nash equilibrium problem:

x^iargminxi𝒳iFβ^i(xi,x^i)\displaystyle\widehat{x}_{i}\in\operatorname*{arg\,min}_{x_{i}\in{\cal X}_{i}}F_{\widehat{\beta}_{i}}(x_{i},\widehat{x}_{-i}) (17)

for all i[n]i\in[n], where

Fβ^i(xi,x^i):=𝔼ziDβ^i(xi,x^i)fi(xi,x^i,zi).\displaystyle F_{\widehat{\beta}_{i}}(x_{i},\widehat{x}_{-i}):=\underset{z_{i}\sim D_{\widehat{\beta}_{i}}(x_{i},\widehat{x}_{-i})}{{\mathbb{E}}}f_{i}(x_{i},\widehat{x}_{-i},z_{i})\,. (18)

Hereafter, we denote the Nash equilibrium of the approximate game as \widehat{x} to distinguish it from the ground truth x^{*}. In Algorithm 1, we write the set of Nash equilibria for the operator G_{\widehat{\beta}} with domain {\cal X} as \texttt{Nash}(G_{\widehat{\beta}},{\cal X}). In practice, we will impose assumptions that guarantee uniqueness of this assignment, in which case the set inclusion is simply an equality.
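To make the phases above concrete, the following is a minimal Python sketch of the sampling and learning phases of Algorithm 1, assuming (purely for illustration) one scalar decision per player, a location-scale hypothesis class fitted by ordinary least squares, and a synthetic stand-in for the unknown system response; none of these choices are prescribed by the paper.

import numpy as np

rng = np.random.default_rng(0)
n_agents = 3   # number of players; each has a scalar decision here
m = 500        # number of exploration rounds

# Hypothetical ground-truth system z_i = <b_i, x> + noise, unknown to the agents.
B_true = rng.normal(scale=0.1, size=(n_agents, n_agents))

def true_response(x):
    """Draw one response vector z = (z_1, ..., z_n) from the unknown system D(x)."""
    return B_true @ x + rng.normal(scale=0.05, size=n_agents)

# Sampling phase: deploy i.i.d. exploration decisions and record the feedback.
X = rng.uniform(0.0, 1.0, size=(m, n_agents))   # row j is the deployed x^(j)
Z = np.array([true_response(x) for x in X])     # row j collects z_1^(j), ..., z_n^(j)

# Learning phase: each agent i fits its own row of B by least squares, i.e., the
# empirical risk minimization (16) with the squared-error risk of Proposition 4.
B_hat = np.zeros((n_agents, n_agents))
for i in range(n_agents):
    B_hat[i], *_ = np.linalg.lstsq(X, Z[:, i], rcond=None)

print("estimation error ||B_hat - B_true||:", np.linalg.norm(B_hat - B_true))
# Optimization phase: gradient play is then run on the game induced by the learned
# maps D_{B_hat} (Section 4), without any further feedback from the system.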

By solving (17) instead of (1) we have introduced two errors: (i) the approximation error of the distributional map DiD_{i} by elements of the hypothesis class i{\cal H}_{{\cal B}_{i}}, and (ii) the estimation or statistical error by solving the ERM problem instead of the expected risk minimization problem. In [lin2023plug], the main result demonstrates that these two sources of error propagate through the optimization problem, and that the resulting excess risk can be bounded in terms of the sample complexity mm. Our goal is to expand this result and provide additional analysis to our setting.

3.1 Parameter Estimation for Regular Problems

A critical component of our analysis is the estimation or learning of the distributional map and the subsequent characterization of the estimation error. In this section, we outline a class of expected risk minimization problems, which we call regular problems, for which we can characterize the distance between expected risk minimization solutions and empirical risk minimization solutions. Throughout, we write Ri(βi)=𝔼(x,z)[R(x,zi,βi)]R_{i}(\beta_{i})={\mathbb{E}}_{(x,z)}[R(x,z_{i},\beta_{i})] and R^i(βi)=(1/m)j=1mRi(x(j),zi(j),βi)\widehat{R}_{i}(\beta_{i})=(1/m)\sum_{j=1}^{m}R_{i}(x^{(j)},z_{i}^{(j)},\beta_{i}) for βii\beta_{i}\in\mathbb{R}^{\ell_{i}} to denote the expected and empirical risk, respectively.

Definition 1.

(Map Learning Regularity) A map learning problem, consisting of the optimization problems with costs RiR_{i} and R^i\widehat{R}_{i} over i{\cal B}_{i}, is regular provided that:

  • (a)

    Convexity: The expected risk βiRi(βi)\beta_{i}\mapsto R_{i}(\beta_{i}) is μi\mu_{i}-strongly convex, and the empirical risk βiR^i(βi)\beta_{i}\mapsto\widehat{R}_{i}(\beta_{i}) is convex.

  • (b)

    Smoothness: For all realizations of x𝒳x\in{\cal X} and zikiz_{i}\in\mathbb{R}^{k_{i}}, βiβiRi(x,zi,βi)\beta_{i}\mapsto\nabla_{\beta_{i}}R_{i}(x,z_{i},\beta_{i}) is LβiL_{\beta_{i}}-Lipschitz continuous.

  • (c)

    Boundedness: The set ii{\cal B}_{i}\subseteq\mathbb{R}^{\ell_{i}} is convex and compact.

  • (d)

    Sub-Exponential gradient: For all βii\beta_{i}\in{\cal B}_{i}, βiRi(x,zi,βi)\nabla_{\beta_{i}}R_{i}(x,z_{i},\beta_{i}) is a sub-exponential vector with parameter θi>0\theta_{i}>0. \Box

Items (a) and (c), taken together, guarantee existence of β^\widehat{\beta} and uniqueness of β\beta^{*} as defined in (16) and (15), respectively. Furthermore, the inclusion of item (b) is necessary to guarantee that first-order stochastic gradient methods will converge at least sub-linearly to β^\widehat{\beta}. Lastly, the heavy-tail assumption [vershynin2018high] will allow us to describe the concentration of the gradient estimates. Together, they allow us to relate the solutions to the sample complexity in the following lemma.

Lemma 2.

(Uniform Gradient Bound) If the smoothness and sub-exponential gradient assumptions in Definition 1 hold for player i[n]i\in[n], then for any δ(0,1/2)\delta\in(0,1/2) and any mm such that m/log(m)2(i+log(1/δ))m/\log(m)\geq 2(\ell_{i}+\log(1/\delta)), we have that:

supβR^i(β)Ri(β)Cilog(m)(i+log(1/δ))m\sup_{\beta\in{\cal B}}\|\nabla\widehat{R}_{i}(\beta)-\nabla R_{i}(\beta)\|\leq C_{i}\sqrt{\frac{\log(m)(\ell_{i}+\log(1/\delta))}{m}} (19)

with probability at least 1δ1-\delta, where Ci=4max{Lβi/15ri,θi}C_{i}=4\max\{L_{\beta_{i}}/15r_{i},\theta_{i}\}. \Box

The proof of this result is provided in Appendix 6.1. This result offers a broad generalization of [mou2019diffusion, Equation (19b)] to any risk with Lipschitz-continuous sub-exponential gradients over any convex and compact set. Our result is comparable to the \mathcal{O}(\sqrt{\ell_{i}/m}) rate that can be found for specific problem instances such as linear least squares regression and logistic regression, but with the addition of a \sqrt{\log m} factor. Indeed, the generality of the risk function requires that we enforce compactness of the domain, thus giving rise to this extra logarithmic factor. This gradient estimation result will now allow us to reach our desired bounded distance result, which we present in the following theorem.

Theorem 3.

(ERM Approximation) If the map learning problem is regular for player i[n]i\in[n] (i.e., it satisfies the assumptions in Definition 1), then for any δ(0,1/2)\delta\in(0,1/2) and any mm such that m/log(m)2(i+log(1/δ))m/\log(m)\geq 2(\ell_{i}+\log(1/\delta)) we have that:

β^iβiCilog(m)(i+log(1/δ))m\|\widehat{\beta}_{i}-\beta_{i}^{*}\|\leq C_{i}^{\prime}\sqrt{\frac{\log(m)(\ell_{i}+\log(1/\delta))}{m}} (20)

with probability at least 1δ1-\delta, where Ci=(4/μi)max{Lβi/15ri,θi}C_{i}^{\prime}=(4/\mu_{i})\max\{L_{\beta_{i}}/15r_{i},\theta_{i}\}. \Box

The proof of Theorem 3 is provided in Appendix 6.2. The power in this characterization lies in the fact that it holds for any statistical learning problem satisfying the assumptions listed in Definition 1, and is not specific to the setting of learning distributional maps. We note that our Definition 1, which is the property used in Theorem 3, is different from the one in [lin2023plug] and involves conditions that are easier to check.
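For a sense of scale (illustrative numbers, using natural logarithms): with \ell_{i}=5, \delta=0.05, and m=10^{4}, the sample-size condition holds since m/\log(m)\approx 1086\geq 2(\ell_{i}+\log(1/\delta))\approx 16, and the right-hand side of (20) scales with \sqrt{\log(m)(\ell_{i}+\log(1/\delta))/m}\approx\sqrt{9.2\cdot 8.0/10^{4}}\approx 0.086.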

As an example, we provide conditions for which a linear least squares problem satisfies the regularity conditions and hence is subject to the above ERM approximation result.

Proposition 4.

(Linear Least Squares Regularity) Consider the linear least squares problem with the expected risk minimization problem

BiargminBi12𝔼(x,z)Bxz2,B_{i}^{*}\in\operatorname*{arg\,min}_{B\in{\cal B}_{i}}\frac{1}{2}{\mathbb{E}}_{(x,z)}\|Bx-z\|^{2},

and empirical risk minimization problem

B^iargminBi12mj=1mBxi(j)zi(j)2.\widehat{B}_{i}\in\operatorname*{arg\,min}_{B\in{\cal B}_{i}}\frac{1}{2m}\sum_{j=1}^{m}\|Bx_{i}^{(j)}-{z_{i}^{(j)}}\|^{2}.

Let x\sim D_{x} with zero mean and covariance matrix \Sigma_{i}. If
(i) there exist \gamma_{i},L_{i}>0 such that \gamma_{i}I\preceq\Sigma_{i}\preceq L_{i}I,
(ii) the entries of xx^{T} and z_{i}x^{T} are sub-exponential, and
(iii) the constraint set {\cal B}_{i} is convex and compact,
then the map learning problem is regular. \Box

The proof of Proposition 4 is provided in Appendix 6.3. Deriving conditions for the more general case of non-linear regression is attainable but outside the scope of this work.

3.2 Bounding the Approximation Error

Finding a relationship between x^\widehat{x} and xx^{*} will require that we first characterize an appropriate hypothesis class of distributions for learning. Here, we formalize the notion of misspecification and sensitivity for a hypothesis class i{\cal H}_{{\cal B}_{i}}.

Definition 2.

(Misspecification [lin2023plug]) A hypothesis class {\cal H}_{{\cal B}_{i}} is \eta_{i}-misspecified provided that there exists \eta_{i}>0 such that

W_{1}(D_{\beta_{i}^{*}}(x),D_{i}(x))\leq\eta_{i} (21)

for all x𝒳x\in{\cal X}. \Box

Observe that if \eta_{i}=0, then our approximation error is zero and we incur no bias in our problem formulation. This corresponds to the case where the map D_{i} is realizable [shalev2014understanding]. However, small values of \eta_{i} are still reasonable, as D_{i} may be complex enough that approximating it with a finite-dimensional parameter space is difficult.

Definition 3.

(Sensitivity [lin2023plug]) The hypothesis class i{\cal H}_{{\cal B}_{i}} is εi\varepsilon_{i}-sensitive if, for any βi,βii\beta_{i},\beta_{i}^{\prime}\in{\cal B}_{i},

W1(Dβi(x),Dβi(x))εiβiβiW_{1}(D_{\beta_{i}}(x),D_{\beta_{i}^{\prime}}(x))\leq\varepsilon_{i}\|\beta_{i}-\beta_{i}^{\prime}\| (22)

for all x𝒳x\in{\cal X}. \Box

Sensitivity of {\cal H}_{{\cal B}_{i}} is merely a convenient name for the condition that \beta_{i}\mapsto D_{\beta_{i}}(x) be \varepsilon_{i}-Lipschitz continuous for all realizations of x\in{\cal X}. In the result that follows, we demonstrate that an appropriately misspecified and sensitive hypothesis class induces a cost that has bounded distance to the ground-truth cost in (1).

Theorem 5.

(Bounded Approximation) Suppose that the following conditions hold for all i[n]i\in[n]:
(i) The hypothesis class i{\cal H}_{{\cal B}_{i}} is ηi\eta_{i}-misspecified, and εi\varepsilon_{i}-sensitive.
(ii) The map learning problem is regular.
(iii) For all x\in{\cal X}, z_{i}\mapsto f_{i}(x,z_{i}) is L_{z_{i}}-Lipschitz continuous.
Then, the bound

|Fβ^i(x)Fi(x)|ηiLzi+LziεiCiζi(m,δ),|F_{\widehat{\beta}_{i}}(x)-F_{i}(x)|\leq\eta_{i}L_{z_{i}}+L_{z_{i}}\varepsilon_{i}C_{i}^{\prime}\zeta_{i}(m,\delta), (23)

holds with probability 1δ1-\delta for any x𝒳x\in{\cal X}, where

ζi(m,δ):=log(m)(i+log(1/δ))m.\zeta_{i}(m,\delta):=\sqrt{\frac{\log(m)(\ell_{i}+\log(1/\delta))}{m}}\,. (24)

and where CiC_{i}^{\prime} is as in (20)\eqref{eqn:beta_bound}. \Box

The proof of Theorem 5 is provided in Appendix 6.4. By including an additional assumption on the Lipschitz continuity of the cost f_{i} with respect to the other agents’ decisions x_{-i}, we can provide an analog of the excess risk result in [lin2023plug].

Corollary 6.

Suppose that for all i[n]i\in[n], the function xiFβi(xi,xi)x_{-i}\mapsto F_{\beta_{i}}(x_{i},x_{-i}) is LiβL^{\beta}_{-i}-Lipschitz on 𝒳i{\cal X}_{-i} for any xi𝒳ix_{i}\in{\cal X}_{i}. Then,

|F_{i}(\widehat{x})-F_{i}(x^{*})|\leq 2\eta_{i}L_{z_{i}}+2L_{z_{i}}\varepsilon_{i}C_{i}^{\prime}\zeta_{i}(m,\delta)+2\sqrt{2}\bar{L}_{i}\mathsf{diam}({\cal X}) (25)

holds with probability 1-\delta for any \widehat{x}\in\texttt{Nash}(G_{\widehat{\beta}},{\cal X}) and x^{*}\in\texttt{Nash}(G,{\cal X}), where \bar{L}_{i}:=\max\{L_{i}^{\beta^{*}},L_{-i}^{\beta^{*}},L_{i}^{\widehat{\beta}},L_{-i}^{\widehat{\beta}}\} and C_{i}^{\prime} is as in (20). \Box

The analysis in this section demonstrates that the estimation procedure in Algorithm 1 yields a cost function that approximates the original cost in (1) with an error that decreases as the number of samples increases. Furthermore, this bound holds independently of the conditioning of the Nash equilibrium problem we solve in the optimization phase. We note that (23) is similar to the result in [lin2023plug], but it is based on a different definition of regular problem (see Definition 1); the bound (25) is unique to this paper.

In the section that follows, we examine a family of hypothesis classes that allows the approximated game to be monotone, and provide suitable algorithms for solving them with convergence guarantees.

4 Solving Strongly-monotone Decision-dependent Games

Since the agents lack full knowledge of the system and hence the ground truth distributional map DiD_{i} in (1), we cannot hope to enforce that DiD_{i} satisfy any assumptions to encourage tractability of our optimization problem. We can however impose conditions on the hypothesis class i{\cal H}_{{\cal B}_{i}}, which is chosen by the agents. To successfully find a Nash equilibrium of the approximate problem in (17), it will be crucial that agents choose a class that balances expressiveness of the system (thereby making ηi\eta_{i} small) with tractability of the optimization.

Perhaps the simplest model capable of achieving this goal is the location-scale family [miller2021outside, narang2022learning, wood2023stochastic]. In our setting, a location scale family parameterization for agent ii is a distributional map DBiD_{B_{i}} having matrix parameter Biki×dB_{i}\in\mathbb{R}^{k_{i}\times d} where ziDBiz_{i}\sim D_{B_{i}} if and only if

zi=𝑑ξi+Bixz_{i}\overset{d}{=}\xi_{i}+B_{i}x (26)

for stationary random variable ξiDξi\xi_{i}\sim D_{\xi_{i}}. We note that this parameterization can be written alternatively as zi=𝑑ξi+Biixi+Biixiz_{i}\overset{d}{=}\xi_{i}+B_{i}^{i}x_{i}+B_{-i}^{i}x_{-i}, where Biiki×diB_{i}^{i}\in\mathbb{R}^{k_{i}\times d_{i}} and Biiki×(ddi)B_{-i}^{i}\in\mathbb{R}^{k_{i}\times(d-d_{i})} are block matrices such that Bix=Biixi+BiixiB_{i}x=B_{i}^{i}x_{i}+B_{-i}^{i}x_{-i} due to linearity. The resulting partial gradient has the form

iFi(x)=𝔼ziD(x)[ifi(x,zi)+(Bii)Tzifi(x,zi)],\nabla_{i}F_{i}(x)={\mathbb{E}}_{z_{i}\sim D(x)}\left[\nabla_{i}f_{i}(x,z_{i})+(B_{i}^{i})^{T}\nabla_{z_{i}}f_{i}(x,z_{i})\right],

which is typically much simpler to analyze than alternative models. Intuitively, this model allows us to express z_{i} as the sum of a stationary random variable drawn from a base distribution and a linear term depending on x, where the matrix parameter B_{i} weights the responsiveness of the population to the agents’ decisions.

This model is particularly appealing as guarantees for learning B_{i} are known and established in Proposition 4. Moreover, the matter of expressiveness is due to the fact that location-scale families are a particular instance of strategic regression [perdomo2020performative, lin2023plug], in which members of the population interact with agents by modifying their stationary data (such as features in a learning task) \xi_{i} in an optimal way upon observing x:

zi=𝑑argminy[uβi(x,y)+12yξi2],z_{i}\overset{d}{=}\operatorname*{arg\,min}_{y}\left[-u_{\beta_{i}}(x,y)+\frac{1}{2}\|y-\xi_{i}\|^{2}\right],

where u_{\beta_{i}} is a utility function parameterized by \beta_{i}\in{\cal B}_{i}, corresponding to the utility that members of the population derive from changing their data in response to the decisions in x, and the quadratic term \frac{1}{2}\|y-\xi_{i}\|^{2} is the cost of changing their data from \xi_{i} to y. Indeed, when u_{\beta_{i}}(x,y)=\langle y,B_{i}x\rangle for \beta_{i}=B_{i}\in\mathbb{R}^{k_{i}\times d}, we recover the form above.
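To verify this, note that the objective in the display above is strongly convex in y with first-order condition -B_{i}x+(y-\xi_{i})=0, whose unique solution is y=\xi_{i}+B_{i}x; that is, z_{i}\overset{d}{=}\xi_{i}+B_{i}x as in (26).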

Furthermore, location-scale families immediately satisfy several of the assumptions required for further analysis. In particular, it is known that Sensitivity (Definition 3) holds with \varepsilon_{i}=\max_{x\in{\cal X}}\|x\|^{2}, Lipschitz continuity of x\mapsto D_{B_{i}} holds with \gamma_{i}=\|B_{i}\|^{2}, and Lipschitz continuity of G_{B_{i}} holds due to the following result.

Lemma 7.

(Lipschitz Gradient, [narang2022learning]) Suppose that DβiD_{\beta_{i}} is such that z=𝑑Bix+ξiz\overset{d}{=}B_{i}x+\xi_{i} with βi=Bi\beta_{i}=B_{i}, and that for each i[n]i\in[n] there exists ζi0\zeta_{i}\geq 0 such that (x,zi)i,zifi(x,zi)(x,z_{i})\mapsto\nabla_{i,z_{i}}f_{i}(x,z_{i}) is ζi\zeta_{i}-Lipschitz continuous. Then GβiG_{\beta_{i}} is LL-Lipschitz continuous with

L:=i=1nζi2max{1,Bii2}(1+Bi2).L:=\sqrt{\sum_{i=1}^{n}\zeta_{i}^{2}\max\{1,\|B_{i}^{i}\|^{2}\}(1+\|B_{i}\|^{2})}\,. (27)

\Box

Strong monotonicity will follow from Theorem 1 provided that G_{\beta_{i}} satisfies the remaining hypotheses on f_{\beta_{i}}—which must be verified on a case-by-case basis. We will not require that G_{\beta_{i}} use this parameterization in our analysis; however, we can proceed with the knowledge that a model class satisfying our hypotheses does exist.

4.1 Distributed Gradient-based Method

In our optimization phase, we seek to use a gradient-based algorithm that respects the agents’ communication structure with the system. For the sake of readability, we will suppress the \beta_{i} subscript and instead refer to quantities G_{i}, keeping in mind that they correspond to the approximate Nash equilibrium problem in (17) with solution \widehat{x}.

We will assume that each agent has access to an estimator of the gradient iFi\nabla_{i}F_{i} and is capable of projecting onto their decision set 𝒳i{\cal X}_{i}. In the constant step-size setup, each agent chooses a rate ωi>0\omega_{i}>0 and performs the update

xit+1=𝗉𝗋𝗈𝗃𝒳i(xitωi1git),x_{i}^{t+1}=\mathsf{proj}_{{\cal X}_{i}}\left(x_{i}^{t}-\omega_{i}^{-1}g_{i}^{t}\right),

where g_{i}^{t} is a stochastic gradient estimator for \nabla_{i}F_{i} used at iteration t, which is then reported to the system and made available to all agents. For the sake of analysis, we will assume without loss of generality that the step-sizes satisfy the ordering

ω1ω2ωn\omega_{1}\geq\omega_{2}\geq~{}\ldots~{}\geq\omega_{n}

and hence ω1=maxi[n]ωi\omega_{1}=\max_{i\in[n]}\omega_{i} and ωn=mini[n]ωi\omega_{n}=\min_{i\in[n]}\omega_{i}. The collective update can be written compactly as

xt+1=𝗉𝗋𝗈𝗃𝒳,W(xtW1gt),x^{t+1}=\mathsf{proj}_{{\cal X},W}\left(x^{t}-W^{-1}g^{t}\right), (28)

where W=𝖽𝗂𝖺𝗀(ω1𝟙d1,,ωn𝟙dn)W=\mathsf{diag}(\omega_{1}{\bf\mathds{1}}_{d_{1}},\ldots,\omega_{n}{\bf\mathds{1}}_{d_{n}}) and gtg^{t} is an estimator for G(xt)G(x^{t}) at iteration tt. Convergence of this procedure hinges on the following assumptions.

Assumption 1.

The gradient function G:𝒳ddG:{\cal X}\subseteq\mathbb{R}^{d}\rightarrow\mathbb{R}^{d} is α\alpha-strongly monotone and LL-Lipschitz continuous.

Assumption 2.

(Stochastic Framework) Let 𝔽=(t)t0{\mathbb{F}}=({\cal F}_{t})_{t\geq 0} with elements

t=σ(gτ,τt){\cal F}_{t}=\sigma(g^{\tau},\tau\leq t) (29)

be the natural filtration of the Borel σ\sigma-algebra over d\mathbb{R}^{d} with respect to gtg^{t}, and use the short-hand notation 𝔼t:=𝔼zD(xt)[|t]{\mathbb{E}}_{t}\cdot:={\mathbb{E}}_{z\sim D(x^{t})}[\cdot|{\cal F}_{t}] as the conditional expectation over the product distribution D(xt)=i=1nDi(xt)D(x^{t})=\prod_{i=1}^{n}D_{i}(x^{t}). There exist bounded sequences {ρt}t0,{σt}t0+\{\rho^{t}\}_{t\geq 0},\{\sigma^{t}\}_{t\geq 0}\subseteq\mathbb{R}_{+} such that

(Bias) 𝔼tgtG(xt)ρt\displaystyle\qquad\qquad\|{\mathbb{E}}_{t}g^{t}-G(x^{t})\|\leq\rho^{t}
(Variance) 𝔼tgt𝔼tgt2(σt)2\displaystyle\qquad\qquad{\mathbb{E}}_{t}\|g^{t}-{\mathbb{E}}_{t}g^{t}\|^{2}\leq(\sigma^{t})^{2}

where \rho^{t}\leq\rho and \sigma^{t}\leq\sigma for all t\geq 0.

Assumption 1 is standard for guaranteeing convergence of gradient play [facchinei2003finite], and the uniformly bounded variance component of Assumption 2 is standard for the convergence of stochastic algorithms. As we will show shortly, convergence with bias is possible, and the result reduces to the unbiased case when \rho^{t}=0. The next result quantifies the one-step improvement of the iteration in (28).

Lemma 8.

(One-step Improvement) Let Assumptions 1 and 2 hold. Then, the sequence generated by iteration (28) satisfies:

𝔼txt+1x^W2\displaystyle{\mathbb{E}}_{t}\|x^{t+1}-\widehat{x}\|_{W}^{2}\leq ω1ωn+αxtx^W2\displaystyle\frac{\omega_{1}}{\omega_{n}+\alpha}\|x^{t}-\widehat{x}\|_{W}^{2}
+2ω1(ω1ρ2+ασ2)αωn(ωn+α)\displaystyle+\frac{2\omega_{1}\left(\omega_{1}\rho^{2}+\alpha\sigma^{2}\right)}{\alpha\omega_{n}\left(\omega_{n}+\alpha\right)}

for all t0t\geq 0, provided that ω1/ωn2α/(4L2)\omega_{1}/\omega_{n}^{2}\leq\alpha/(4L^{2}). \Box

The proof of Lemma 8 is provided in the Appendix 6.5. We note that setting ωi=ω\omega_{i}=\omega for some ω>0\omega>0 recovers the result in [narang2022learning, Theorem 15]. Following this one-step analysis, we can show convergence to a neighborhood of the Nash equilibrium.

Theorem 9.

(Neighborhood Convergence) Let Assumptions 1 and 2 hold, and suppose that (ω1ωn)<α(\omega_{1}-\omega_{n})<\alpha. Then,

lim supt𝔼xtx^22ω1(ω1ρ2+ασ2)αωn(ωn+α).\limsup_{t\to\infty}{\mathbb{E}}\|x^{t}-\widehat{x}\|^{2}\leq\frac{2\omega_{1}\left(\omega_{1}\rho^{2}+\alpha\sigma^{2}\right)}{\alpha\omega_{n}\left(\omega_{n}+\alpha\right)}. (30)

The proof can be found in Appendix 6.6. The result shows that the algorithm converges linearly to a neighborhood of the Nash equilibrium x^\widehat{x}, where the radius of the neighborhood is dictated by the step-size, variance, and bias bounds. When ρ=σ=0\rho=\sigma=0, we retrieve linear convergence. In order to converge to x^\widehat{x} directly, we will require a decaying step-size policy. For example, we consider the following policy:

ωt=α(r+t2)2\omega^{t}=\frac{\alpha(r+t-2)}{2} (31)

for a fixed constant r>2, which we assume to be shared by all agents. Hence, the decaying step-size update is given by

xt+1=𝗉𝗋𝗈𝗃𝒳(xt(ωt)1gt).x^{t+1}=\mathsf{proj}_{{\cal X}}\left(x^{t}-(\omega^{t})^{-1}g^{t}\right). (32)

In the theorem that follows, we show that this sequence converges to x^\widehat{x} provided that the bias shares an asymptotic rate with (ωt)1(\omega^{t})^{-1}.

Theorem 10.

(Convergence) Suppose that Assumptions 1 and 2 hold and that there exist \bar{\rho},s>0 such that

\|{\mathbb{E}}_{t}g^{t}-G(x^{t})\|\leq\frac{\bar{\rho}}{s+t} (33)

for all t\geq 0. Then,

{\mathbb{E}}\|x^{t}-\widehat{x}\|^{2}\leq\frac{A}{\alpha^{2}(r+t)} (34)

where

A=\max\left\{\alpha^{2}r\|x^{0}-\widehat{x}\|^{2},4\bar{\rho}^{2}\max\left\{\frac{r}{s},1\right\}+\frac{8r\sigma^{2}}{r-2}\right\}.

The proof of this result follows by a standard induction argument, and can be found in Appendix 6.7.
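To illustrate the optimization phase, the following is a minimal Python sketch of the update (28) with the decaying policy (31)–(32), applied to a hypothetical strongly monotone quadratic game over box constraints; the gradient map, constants, and noise model below are illustrative stand-ins rather than the EV market of Section 5.

import numpy as np

rng = np.random.default_rng(1)
n = 4                                   # players, one scalar decision each
A = 2.0 * np.eye(n) + 0.1 * rng.standard_normal((n, n))

def G(x):
    # Affine gradient map of a quadratic game; strongly monotone since A + A^T > 0 here.
    return A @ x + 1.0

lo, hi = np.zeros(n), np.ones(n)        # box constraints X_i = [0, 1]
alpha = 1.0                             # conservative strong-monotonicity constant
r = 3.0                                 # policy parameter, r > 2
sigma = 0.05                            # std of the stochastic gradient noise

x = hi.copy()                           # initial decision profile
for t in range(2000):
    omega_t = alpha * (r + t - 2.0) / 2.0         # policy (31); omega_t is an inverse step size
    g = G(x) + sigma * rng.standard_normal(n)     # unbiased stochastic gradient g^t
    # For a diagonal weight matrix and box constraints, the weighted projection (3)
    # reduces to a coordinate-wise clip, so the update (32) decouples across agents.
    x = np.clip(x - g / omega_t, lo, hi)

print("approximate equilibrium:", x)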

5 Numerical Experiments on Electric Vehicle Charging

In this section, we consider a competitive game between nn distinct electric vehicle charging station operators, where stations are equipped with renewable power sources. The goal of each player is to set prices to maximize their own profit in a system where demand for their station will change in response to the prices set by other competing stations as well. The cost function (negative profit) takes the form

f_{i}(x,z_{i})=\underbrace{-z_{i}x_{i}+\frac{\lambda_{i}}{2}x_{i}^{2}}_{\text{service profit}}-\underbrace{p_{w}\phi(w_{i}-z_{i})}_{\text{renewable profit}}+\underbrace{p_{r}\phi(z_{i}-w_{i})}_{\text{operational cost}}

where ϕ(y)=log(1+exp(y))\phi(y)=\log(1+\exp(y)) for all yy\in\mathbb{R}. The renewable profit and operational cost terms allow us to describe the trade-off between profit from renewable power generation sold to the grid at rate pwp_{w}, and surplus power required from the grid to meet demand at rate prp_{r}. To set prices, we can formulate a Nash equilibrium problem over the expected costs Fi(x)=𝔼ziDi(x)[fi(x,zi)]F_{i}(x)={\mathbb{E}}_{z_{i}\sim D_{i}(x)}[f_{i}(x,z_{i})] for i[n]i\in[n] and x𝒳=Πi=1n𝒳ix\in{\cal X}=\Pi_{i=1}^{n}{\cal X}_{i}, where 𝒳i=[pw,pr]{\cal X}_{i}=[p_{w},p_{r}] is the interval of price values between the wholesale and retail price.

Since the set of reasonable prices will be quite small, we hypothesize that the price and demand have a linear relationship of the form z_{i}\overset{d}{=}\xi_{i}+\langle b_{i},x\rangle, where b_{i}\in\mathbb{R}^{n}, with \xi_{i}\sim D_{\xi_{i}} corresponding to the base demand. Since we have a simple model, the first and second derivatives can be computed in closed form, and the relevant constants can be computed directly. Indeed, we find that the hypotheses of Theorem 1 are satisfied with \lambda=\min_{i}\lambda_{i}, which we set to 1, L_{i}=1, and \gamma_{i}=\|b_{i}\|^{2}. We conclude that G:\mathbb{R}^{n}\rightarrow\mathbb{R}^{n} is \alpha=(1-2\|B\|_{F})-strongly monotone, where B is the parameter matrix whose columns are b_{i}.
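The following is a minimal sketch (with hypothetical values for p_w, p_r, \lambda_{i}, and w_{i}) of the cost f_{i} above together with the corresponding location-scale partial gradient from Section 4, \nabla_{i}F_{i}(x)={\mathbb{E}}[\partial f_{i}/\partial x_{i}+b_{ii}\,\partial f_{i}/\partial z_{i}], estimated by Monte Carlo over samples of the base demand \xi_{i}.

import numpy as np

p_w, p_r = 0.05, 0.35      # wholesale / retail prices ($/kWh); illustrative values
lam = 1.0                  # regularization weight lambda_i

def phi(y):
    return np.log1p(np.exp(y))          # softplus

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))     # derivative of phi

def f_i(x_i, z_i, w_i):
    """Negative profit of provider i at price x_i, demand z_i, renewable generation w_i."""
    return (-z_i * x_i + 0.5 * lam * x_i ** 2
            - p_w * phi(w_i - z_i) + p_r * phi(z_i - w_i))

def grad_i_F(i, x, b_i, xi_samples, w_i):
    """Monte Carlo estimate of grad_{x_i} F_i(x) under the model z_i = xi_i + <b_i, x>."""
    z = xi_samples + b_i @ x                      # demand samples at the current prices
    df_dxi = -z + lam * x[i]                      # partial of f_i w.r.t. x_i
    df_dzi = -x[i] + p_w * sigmoid(w_i - z) + p_r * sigmoid(z - w_i)  # partial w.r.t. z_i
    return np.mean(df_dxi + b_i[i] * df_dzi)      # chain rule through z_i = xi_i + <b_i, x>

# Example usage with hypothetical numbers:
rng = np.random.default_rng(2)
x = np.array([0.20, 0.25, 0.30])                  # current prices of three providers
b_1 = np.array([-1.0 / 18, 1.0 / 18, 1.0 / 18])   # demand sensitivities of provider 1
xi = rng.normal(size=1000)                        # standardized base-demand samples
print("cost estimate:", np.mean(f_i(x[0], xi + b_1 @ x, 0.5)))
print("gradient estimate:", grad_i_F(0, x, b_1, xi, 0.5))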

Figure 2: Standardized demand data for six medium-demand EVCSs consisting of either 2 or 6 ports and port power ratings of 50, 150, and 350 kW. Standardization maps raw demand instances to deviations from the average demand at each station.
Figure 3: Expected error curve and confidence interval for regularized stochastic gradient descent with decaying step size for a location-scale model.

Our data depict the demand of electricity across an hour-long period for 6 ports of varying power profiles for each day in the year. We standardize the data to be zero mean and unit variance across each station. Solutions are calculated by performing expected gradient play with constant step size; the expectation is estimated via the empirical mean over the dataset.

We set b_{ii}=-1/18+\nu and b_{ij}=1/18+\nu, where we use \nu\sim{\cal N}(0,10^{-5}) to simulate learning B from samples. Hence, demand for agent i decreases as their own price increases, and increases as the prices of the other agents increase. We run the stochastic gradient play algorithm initialized at x^{0}=p_{r}{\bf\mathds{1}}_{n} with a single sample at each round and a decaying step-size policy \omega_{t}=\alpha(r+t-2)/2 for r=3. In Figure 3 we plot the mean error trajectory and confidence interval over 50 trials of 2000 iterations.

6 CONCLUSIONS

In this work, we studied a class of stochastic Nash equilibrium problems characterized by data distributions that are dependent on the decisions of all involved players. We showed that a learning-based approach enables the formulation of an approximate Nash equilibrium problem that is solvable using a stochastic gradient play algorithm. The result of this procedure is a cost that can be related to the cost of the original Nash equilibrium problem via an error that depends on both our approximation and estimation errors. To demonstrate the flexibility of these findings, we simulated these techniques in an electric vehicle charging market problem in which service providers set prices, and users modify their demand based on prices set by providers. Future research will look at a scenario where the estimate of the distributional map is improved during the operation of the algorithm, based on the feedback received. Future applications will demonstrate the efficacy of more complex models (beyond linear).

APPENDIX

6.1 Proof of Lemma 2

For the sake of notational convenience and visual clarity, we will suppress the i index throughout the proof. We denote the gradient error by J(\beta)=\nabla\widehat{R}(\beta)-\nabla R(\beta) for all \beta\in\mathbb{R}^{\ell}.

To begin, we will generate coverings for the unit sphere in \mathbb{R}^{\ell} and for {\cal B}\subseteq\mathbb{R}^{\ell}, and use a discretization argument to create bounds over these finite sets. Fix \beta\in{\cal B} and u\in{\cal S}^{\ell-1}. Let \{u_{j}\}_{j=1}^{N} be an arbitrary 1/2-covering of the sphere {\cal S}^{\ell-1} with respect to the Euclidean norm. From [wainwright_2019, Lemma 5.7], we know that N\leq 5^{\ell}. From our covering, we have that there exists u_{j} in the covering such that \|u-u_{j}\|\leq 1/2. Hence,

u,J(β)\displaystyle\langle u,J(\beta)\rangle =uj+(uuj),J(β)\displaystyle=\langle u_{j}+(u-u_{j}),J(\beta)\rangle
=uj,J(β)+uuj,J(β)\displaystyle=\langle u_{j},J(\beta)\rangle+\langle u-u_{j},J(\beta)\rangle
uj,J(β)+uujJ(β)\displaystyle\leq\langle u_{j},J(\beta)\rangle+\|u-u_{j}\|\|J(\beta)\|
uj,J(β)+12J(β)\displaystyle\leq\langle u_{j},J(\beta)\rangle+\frac{1}{2}\|J(\beta)\|
maxj[N]uj,J(β)+12J(β).\displaystyle\leq\max_{j\in[N]}\langle u_{j},J(\beta)\rangle+\frac{1}{2}\|J(\beta)\|\,.

Since this is true for any u\in{\cal S}^{\ell-1}, it holds in particular for u=J(\beta)/\|J(\beta)\|. Thus the above becomes

\|J(\beta)\|\leq 2\max_{j\in[N]}\langle u_{j},J(\beta)\rangle. (35)

Now we fix \varepsilon\in(0,1] and choose an \varepsilon-covering of the set {\cal B}, which we will write as \{\beta_{k}\}_{k=1}^{M}. Recall that {\cal B} is bounded, so there exists a constant r>0 such that \|\beta\|\leq r for all \beta\in{\cal B}. Hence {\cal B}\subseteq B(r). From [vershynin2018high, Proposition 4.2.12], we have that

Mvol(B(r)+ε2B(1))vol(ε2B(1))=vol(32B(r))vol(ε2B(1))=(3rε).M\leq\frac{\operatorname*{vol}\left(B(r)+\frac{\varepsilon}{2}B(1)\right)}{\operatorname*{vol}\left(\frac{\varepsilon}{2}B(1)\right)}=\frac{\operatorname*{vol}\left(\frac{3}{2}B(r)\right)}{\operatorname*{vol}\left(\frac{\varepsilon}{2}B(1)\right)}=\left(\frac{3r}{\varepsilon}\right)^{\ell}. (36)

Thus, we conclude that M(3r/ε)M\leq(3r/\varepsilon)^{\ell}.

Now by our discretization argument, there exists k[M]k\in[M] such that ββkε\|\beta-\beta_{k}\|\leq\varepsilon and hence

\max_{j\in[N]}\langle u_{j},J(\beta)\rangle =\max_{j\in[N]}\langle u_{j},J(\beta_{k})+(J(\beta)-J(\beta_{k}))\rangle
=\max_{j\in[N]}\left[\langle u_{j},J(\beta_{k})\rangle+\langle u_{j},J(\beta)-J(\beta_{k})\rangle\right]
\leq\max_{j\in[N]}\langle u_{j},J(\beta_{k})\rangle+\max_{j\in[N]}\langle u_{j},J(\beta)-J(\beta_{k})\rangle
\leq\max_{k\in[M]}\max_{j\in[N]}\langle u_{j},J(\beta_{k})\rangle
+\sup_{\|\alpha-\alpha^{\prime}\|\leq\varepsilon}\max_{j\in[N]}\langle u_{j},J(\alpha)-J(\alpha^{\prime})\rangle.

We observe that if α,α\alpha,\alpha^{\prime}\in{\cal B} are such that ααε\|\alpha-\alpha^{\prime}\|\leq\varepsilon, then applying our smoothness assumption yields

\langle u_{j},J(\alpha)-J(\alpha^{\prime})\rangle
=\langle u_{j},(\nabla\widehat{R}(\alpha)-\nabla R(\alpha))-(\nabla\widehat{R}(\alpha^{\prime})-\nabla R(\alpha^{\prime}))\rangle
=\langle u_{j},\nabla\widehat{R}(\alpha)-\nabla\widehat{R}(\alpha^{\prime})\rangle+\langle u_{j},\nabla R(\alpha^{\prime})-\nabla R(\alpha)\rangle
\leq\|u_{j}\|\|\nabla\widehat{R}(\alpha)-\nabla\widehat{R}(\alpha^{\prime})\|+\|u_{j}\|\|\nabla R(\alpha)-\nabla R(\alpha^{\prime})\|
\leq L_{\beta}\|\alpha-\alpha^{\prime}\|+L_{\beta}\|\alpha-\alpha^{\prime}\|
\leq 2L_{\beta}\varepsilon,

where the second-to-last inequality uses \|u_{j}\|=1.

To bound the remaining term, we use a concentration inequality for sub-exponential random variables (Bernstein's inequality), combined with the union bound. We have that

{\mathbb{P}}\left(\langle u_{j},J(\beta_{k})\rangle\geq t\right)\leq 2\exp\left(-\frac{mt^{2}}{2\theta^{2}}\right)

for all t\leq\theta, and hence

{\mathbb{P}}\left(\max_{k\in[M]}\max_{j\in[N]}\langle u_{j},J(\beta_{k})\rangle\geq t\right)
={\mathbb{P}}\left(\bigcup_{k\in[M]}\bigcup_{j\in[N]}\{\langle u_{j},J(\beta_{k})\rangle\geq t\}\right)
\leq\sum_{k\in[M]}\sum_{j\in[N]}{\mathbb{P}}\left(\langle u_{j},J(\beta_{k})\rangle\geq t\right)
\leq\sum_{k\in[M]}\sum_{j\in[N]}2\exp\left(-\frac{mt^{2}}{2\theta^{2}}\right)
=M\cdot N\cdot 2\exp\left(-\frac{mt^{2}}{2\theta^{2}}\right)
\leq 2\left(\frac{15r}{\varepsilon}\right)^{\ell}\exp\left(-\frac{mt^{2}}{2\theta^{2}}\right)

for all t\leq\theta, where we used the facts that M\leq(3r/\varepsilon)^{\ell} and N\leq 5^{\ell}. Setting the right-hand side equal to 2\delta yields

t=\sqrt{2}\theta\sqrt{\frac{\ell\log(15r/\varepsilon)+\log(1/\delta)}{m}}. (37)

Next, we choose \varepsilon=\frac{1}{15r}\sqrt{\frac{\ell+\log(1/\delta)}{m}} so that

t=\sqrt{2}\theta\sqrt{\frac{\ell\log(15r/\varepsilon)+\log(1/\delta)}{m}}
=\sqrt{2}\theta\sqrt{\frac{\frac{\ell}{2}\log(m)-\frac{\ell}{2}\log(\ell+\log(1/\delta))+\log(1/\delta)}{m}}
\leq\sqrt{2}\theta\sqrt{\frac{\ell\log(m)+\log(1/\delta)}{m}}
\leq\sqrt{2}\theta\sqrt{\frac{\log(m)(\ell+\log(1/\delta))}{m}}.

By requiring that m satisfy m/\log(m)\geq 2(\ell+\log(1/\delta)), we enforce that t\leq\theta. Combining the above bounds, we observe that

t+2\varepsilon L
\leq\sqrt{2}\theta\sqrt{\frac{\log(m)(\ell+\log(1/\delta))}{m}}+\frac{2L}{15r}\sqrt{\frac{\ell+\log(1/\delta)}{m}}
\leq 2\left(\theta+\frac{L}{15r}\right)\sqrt{\frac{\log(m)(\ell+\log(1/\delta))}{m}}
\leq 4\max\left\{\frac{L}{15r},\theta\right\}\sqrt{\frac{\log(m)(\ell+\log(1/\delta))}{m}},

and the result follows.
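
To give a rough numerical sense of the \sqrt{\log(m)(\ell+\log(1/\delta))/m} rate (a toy Monte Carlo check only; the quadratic risk below is an assumption chosen so that the gradient error is constant in \beta, and all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
ell = 10
for m in [100, 1_000, 10_000]:
    # Toy risk R(beta) = 0.5*E||beta - z||^2 with z ~ N(0, I):
    # grad R_hat(beta) - grad R(beta) = E[z] - z_bar is constant in beta,
    # so the uniform gradient deviation equals the sample-mean error norm.
    errs = [np.linalg.norm(rng.normal(size=(m, ell)).mean(axis=0)) for _ in range(20)]
    rate = np.sqrt(ell * np.log(m) / m)
    print(f"m={m:>6d}  avg sup-gradient-error={np.mean(errs):.4f}  sqrt(ell*log(m)/m)={rate:.4f}")
```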

6.2 Proof of Theorem 3

We suppress the subscript i for notational simplicity. We recall that the \mu-strong convexity of the map \beta\mapsto R(x,z;\beta) implies \mu-strong monotonicity of \nabla R(\beta) and \nabla\widehat{R}(\beta). It follows that

\mu\|\widehat{\beta}-\beta^{*}\|^{2} \leq\langle\widehat{\beta}-\beta^{*},\nabla R(\widehat{\beta})-\nabla R(\beta^{*})\rangle
=\langle\widehat{\beta}-\beta^{*},\nabla R(\widehat{\beta})\rangle-\langle\widehat{\beta}-\beta^{*},\nabla R(\beta^{*})\rangle
\leq\langle\widehat{\beta}-\beta^{*},\nabla R(\widehat{\beta})\rangle
\leq\langle\widehat{\beta}-\beta^{*},\nabla R(\widehat{\beta})\rangle+\langle\beta^{*}-\widehat{\beta},\nabla\widehat{R}(\widehat{\beta})\rangle
=\langle\widehat{\beta}-\beta^{*},\nabla R(\widehat{\beta})-\nabla\widehat{R}(\widehat{\beta})\rangle
\leq\|\widehat{\beta}-\beta^{*}\|\sup_{\beta\in{\cal B}}\|\nabla R(\beta)-\nabla\widehat{R}(\beta)\|

where the second inequality uses the first-order optimality of \beta^{*} for R over {\cal B} (so that \langle\widehat{\beta}-\beta^{*},\nabla R(\beta^{*})\rangle\geq 0) and the third uses the first-order optimality of \widehat{\beta} for \widehat{R}; hence

\|\widehat{\beta}-\beta^{*}\|\leq\frac{1}{\mu}\sup_{\beta\in{\cal B}}\|\nabla R(\beta)-\nabla\widehat{R}(\beta)\|. (38)

The result now follows by applying Lemma 2. \Box

6.3 Proof of Proposition 4

We suppress the i index throughout. The associated risk function is R(x,z,B)=\frac{1}{2}\|Bx-z\|^{2}, so that \nabla R(x,z,B)=(Bx-z)x^{T}=Bxx^{T}-zx^{T} and \nabla^{2}R(x,z,B)=xx^{T} are the corresponding gradient and Hessian with respect to B. We observe that enforcing \gamma I\leq{\mathbb{E}}[xx^{T}]\leq LI for some \gamma,L>0 ensures \gamma-strong convexity and L-smoothness of the expected risk. Similarly, the empirical risk has gradient \nabla R_{m}(B)=\frac{1}{m}(BXX^{T}-ZX^{T}) and Hessian \nabla^{2}R_{m}(B)=\frac{1}{m}XX^{T}. Since this Hessian is symmetric and positive semi-definite, R_{m} is convex. Furthermore, smoothness of R_{m} follows with constant \max\{L,\|XX^{T}\|_{2}\}. Lastly, since zx^{T} and xx^{T} have sub-exponential entries, the gradient is sub-exponential and the result follows. \Box
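
As a concrete illustration of this least-squares structure, the following minimal Python sketch (not part of the proof; the dimensions, noise model, and all variable names are illustrative assumptions) computes the empirical risk minimizer in closed form and reads off the strong-convexity and smoothness constants from the spectrum of the sample second-moment matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
ell, d, m = 4, 3, 500            # response dim, decision dim, sample size
B_true = rng.normal(size=(ell, d))

# Columns of X are decisions x; columns of Z are responses z ~ B_true x + noise.
X = rng.normal(size=(d, m))
Z = B_true @ X + 0.1 * rng.normal(size=(ell, m))

# Empirical risk R_m(B) = (1/2m)||B X - Z||_F^2 has gradient (1/m)(B X X^T - Z X^T),
# so the minimizer solves B (X X^T) = Z X^T.
B_hat = np.linalg.solve(X @ X.T, (Z @ X.T).T).T

# Strong convexity / smoothness of R_m are governed by the spectrum of (1/m) X X^T.
eigs = np.linalg.eigvalsh(X @ X.T / m)
print("estimation error ||B_hat - B_true||:", np.linalg.norm(B_hat - B_true))
print("min eigenvalue:", eigs.min(), " max eigenvalue:", eigs.max())
```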

6.4 Proof of Theorem 5

We observe that for any fixed x\in\mathcal{X}, we have |F_{\widehat{\beta}_{i}}(x)-F_{i}(x)|\leq|F_{\widehat{\beta}_{i}}(x)-F_{\beta_{i}^{*}}(x)|+|F_{\beta_{i}^{*}}(x)-F_{i}(x)|. The first term describes our statistical error at x. We let \Pi(D_{\widehat{\beta}_{i}}(x),D_{\beta_{i}^{*}}(x)) denote a coupling of D_{\widehat{\beta}_{i}}(x) and D_{\beta_{i}^{*}}(x) in {\cal P}(\mathbb{R}^{m_{i}}\times\mathbb{R}^{m_{i}}), so that

|F_{\widehat{\beta}_{i}}(x)-F_{\beta_{i}^{*}}(x)|
=\left|\inf_{\Pi(D_{\widehat{\beta}_{i}}(x),D_{\beta_{i}^{*}}(x))}\ {\mathbb{E}}_{(z,z^{\prime})\sim\Pi(D_{\widehat{\beta}_{i}}(x),D_{\beta_{i}^{*}}(x))}\left(f(x,z)-f(x,z^{\prime})\right)\right|
\leq\inf_{\Pi(D_{\widehat{\beta}_{i}}(x),D_{\beta_{i}^{*}}(x))}\ {\mathbb{E}}_{(z,z^{\prime})\sim\Pi(D_{\widehat{\beta}_{i}}(x),D_{\beta_{i}^{*}}(x))}\left|f(x,z)-f(x,z^{\prime})\right|
\leq L_{z_{i}}\left(\inf_{\Pi(D_{\widehat{\beta}_{i}}(x),D_{\beta_{i}^{*}}(x))}\ {\mathbb{E}}_{(z,z^{\prime})\sim\Pi(D_{\widehat{\beta}_{i}}(x),D_{\beta_{i}^{*}}(x))}\|z-z^{\prime}\|\right)
=L_{z_{i}}W_{1}(D_{\widehat{\beta}_{i}}(x),D_{\beta_{i}^{*}}(x))
\leq L_{z_{i}}\varepsilon_{i}\|\widehat{\beta}_{i}-\beta_{i}^{*}\|\,.

By a similar argument, we find that |F_{\beta_{i}^{*}}(x)-F_{i}(x)|\leq L_{z_{i}}W_{1}(D_{\beta_{i}^{*}}(x),D_{i}(x))\leq L_{z_{i}}\gamma_{i}. Combining the two bounds, we get |F_{\widehat{\beta}_{i}}(x)-F_{i}(x)|\leq L_{z_{i}}\varepsilon_{i}\|\widehat{\beta}_{i}-\beta_{i}^{*}\|+L_{z_{i}}\gamma_{i}. Lastly, \|\widehat{\beta}_{i}-\beta_{i}^{*}\| can be bounded as in Theorem 3.
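
As an aside, the Lipschitz-in-parameter step above can be visualized numerically. The sketch below is purely illustrative and assumes a Gaussian location model D_{\beta}(x)=\mathcal{N}(\beta^{\top}x,1), which is an assumption and not the paper's general setting; it estimates W_{1}(D_{\widehat{\beta}}(x),D_{\beta^{*}}(x)) from samples and compares it with the Cauchy-Schwarz bound \|x\|\,\|\widehat{\beta}-\beta^{*}\|:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(2)
x = rng.normal(size=5)                       # a fixed decision vector
beta_star = rng.normal(size=5)
beta_hat = beta_star + 0.05 * rng.normal(size=5)

# Assumed model D_beta(x) = N(beta^T x, 1): for two Gaussians with equal variance,
# W_1 equals the absolute mean difference, bounded by ||x|| * ||beta_hat - beta_star||.
samples_hat = beta_hat @ x + rng.normal(size=100_000)
samples_star = beta_star @ x + rng.normal(size=100_000)

print("empirical W1      :", wasserstein_distance(samples_hat, samples_star))
print("mean gap          :", abs((beta_hat - beta_star) @ x))
print("Lipschitz bound   :", np.linalg.norm(x) * np.linalg.norm(beta_hat - beta_star))
```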

Regarding the second bound, we have that

F_{i}(\widehat{x})-F_{i}(x^{*})
=\left[F_{i}(\widehat{x})-F_{\beta_{i}^{*}}(\widehat{x})\right]+\left[F_{\beta_{i}^{*}}(\widehat{x})-F_{\widehat{\beta_{i}}}(\widehat{x})\right]
+\left[F_{\widehat{\beta_{i}}}(\widehat{x})-F_{\widehat{\beta_{i}}}(x^{**})\right]+\left[F_{\widehat{\beta_{i}}}(x^{**})-F_{\beta_{i}^{*}}(x^{**})\right]
+\left[F_{\beta_{i}^{*}}(x^{**})-F_{\beta_{i}^{*}}(x^{*})\right]+\left[F_{\beta_{i}^{*}}(x^{*})-F_{i}(x^{*})\right]
\leq 2\|F_{i}-F_{\beta_{i}^{*}}\|_{\infty}+2\|F_{\beta_{i}^{*}}-F_{\widehat{\beta_{i}}}\|_{\infty}
+\left[F_{\widehat{\beta_{i}}}(\widehat{x})-F_{\widehat{\beta_{i}}}(x^{**})\right]+\left[F_{\beta_{i}^{*}}(x^{**})-F_{\beta_{i}^{*}}(x^{*})\right]

where x^{**}\in\mathcal{X}^{**}, and \mathcal{X}^{**} is the set of equilibria of

{x}_{i}^{**}\in\operatorname*{arg\,min}_{x_{i}\in{\cal X}_{i}}F_{{\beta}^{*}_{i}}(x_{i},{x}^{**}_{-i}),\quad i\in[n], (39)

where

F_{{\beta}^{*}_{i}}(x_{i},{x}^{**}_{-i}):=\underset{z_{i}\sim D_{{\beta}^{*}_{i}}(x_{i},{x}^{**}_{-i})}{{\mathbb{E}}}f_{i}(x_{i},{x}^{**}_{-i},z_{i})\,.

Then,

F_{\widehat{\beta_{i}}}(\widehat{x})-F_{\widehat{\beta_{i}}}(x^{**}) \leq\left[F_{\widehat{\beta_{i}}}(\widehat{x}_{i},\widehat{x}_{-i})-F_{\widehat{\beta_{i}}}(x^{**}_{i},\widehat{x}_{-i})\right]
+\left[F_{\widehat{\beta_{i}}}(x^{**}_{i},\widehat{x}_{-i})-F_{\widehat{\beta_{i}}}(x^{**}_{i},x^{**}_{-i})\right]
\leq L_{i}^{\widehat{\beta}}\|\widehat{x}_{i}-x_{i}^{**}\|+L_{-i}^{\widehat{\beta}}\|\widehat{x}_{-i}-x^{**}_{-i}\|
\leq\sqrt{2}\max\{L_{i}^{\widehat{\beta}},L_{-i}^{\widehat{\beta}}\}\|\widehat{x}-x^{**}\|

where we have used the inequality \sqrt{a}+\sqrt{b}\leq\sqrt{2}\sqrt{a+b}, valid for any a,b\geq 0.

Next, consider

F_{\beta^{*}_{i}}(x^{**})-F_{\beta^{*}_{i}}(x^{*}) =\left[F_{\beta^{*}_{i}}(x_{i}^{**},x_{-i}^{**})-F_{\beta^{*}_{i}}(x_{i}^{**},x^{*}_{-i})\right]
+\left[F_{\beta^{*}_{i}}(x_{i}^{**},x^{*}_{-i})-F_{\beta^{*}_{i}}(x_{i}^{*},x_{-i}^{*})\right]
\leq L_{-i}^{\beta^{*}}\|x_{-i}^{**}-x_{-i}^{*}\|+L_{i}^{\beta^{*}}\|x_{i}^{**}-x_{i}^{*}\|
\leq\sqrt{2}\max\{L_{i}^{\beta^{*}},L_{-i}^{\beta^{*}}\}\|x^{**}-x^{*}\|\,.

Combining the bounds yields

F_{i}(\widehat{x})-F_{i}(x^{*})
\leq 2\|F_{i}-F_{\beta^{*}_{i}}\|_{\infty}+2\|F_{\beta^{*}_{i}}-F_{\widehat{\beta_{i}}}\|_{\infty}
+\sqrt{2}\max\{L_{i}^{\widehat{\beta}},L_{-i}^{\widehat{\beta}}\}\|\widehat{x}-x^{**}\|
+\sqrt{2}\max\{L_{i}^{\beta^{*}},L_{-i}^{\beta^{*}}\}\|x^{**}-x^{*}\|
\leq 2\gamma_{i}L_{z_{i}}+2\varepsilon_{i}L_{z_{i}}\|\widehat{\beta}_{i}-\beta_{i}^{*}\|
+\sqrt{2}\left(\max\{L_{i}^{\widehat{\beta}},L_{-i}^{\widehat{\beta}}\}+\max\{L_{i}^{\beta^{*}},L_{-i}^{\beta^{*}}\}\right)\mathsf{diam}(\mathcal{X})
\leq 2\gamma_{i}L_{z_{i}}+2\varepsilon_{i}L_{z_{i}}\|\widehat{\beta}_{i}-\beta_{i}^{*}\|+2\sqrt{2}\bar{L}_{i}\,\mathsf{diam}(\mathcal{X}).

Then, (25) follows using the bound on \|\widehat{\beta}_{i}-\beta_{i}^{*}\| from Theorem 3. \Box

6.5 Proof of Lemma 8

Consider the function \varphi:\mathbb{R}^{d}\rightarrow\mathbb{R} defined by \varphi(y)=\frac{1}{2}\|x^{t}-W^{-1}g^{t}-y\|_{W}^{2} for all y\in{\cal X}. Then, \varphi is \omega_{n}-strongly convex over {\cal X} and has a unique minimizer x^{t+1}\in{\cal X}. This implies that:

\varphi(x^{*})\geq\varphi(x^{t+1})+\langle x^{*}-x^{t+1},\nabla\varphi(x^{t+1})\rangle+\frac{\omega_{n}}{2}\|x^{t+1}-x^{*}\|^{2}\,.

Since \langle x-x^{t+1},\nabla\varphi(x^{t+1})\rangle\geq 0 for all x\in{\cal X}, we obtain

\omega_{n}\|x^{t+1}-x^{*}\|^{2}\leq\|x^{t}-W^{-1}g^{t}-x^{*}\|_{W}^{2}-\|x^{t}-W^{-1}g^{t}-x^{t+1}\|_{W}^{2}.

It follows that

\frac{\omega_{n}}{\omega_{1}}\|x^{t+1}-x^{*}\|_{W}^{2} \leq\|x^{t}-x^{*}\|_{W}^{2}-\|x^{t}-x^{t+1}\|_{W}^{2}
-2\langle x^{t}-x^{*},g^{t}\rangle+2\langle x^{t}-x^{t+1},g^{t}\rangle.

We now take the conditional expectation {\mathbb{E}}_{t}[\,\cdot\,]:={\mathbb{E}}_{z\sim D(x^{t})}[\,\cdot\,|\,{\cal F}_{t}], where {\cal F}_{t}=\sigma(g^{\tau},\,\tau<t) denotes the history of the gradient estimates. We find that

\frac{\omega_{n}}{\omega_{1}}{\mathbb{E}}_{t}\|x^{t+1}-x^{*}\|_{W}^{2}
\leq{\mathbb{E}}_{t}\|x^{t}-x^{*}\|_{W}^{2}-{\mathbb{E}}_{t}\|x^{t}-x^{t+1}\|_{W}^{2}
-2{\mathbb{E}}_{t}\langle x^{t}-x^{*},g^{t}\rangle-2{\mathbb{E}}_{t}\langle x^{t+1}-x^{t},g^{t}\rangle
=\|x^{t}-x^{*}\|_{W}^{2}-{\mathbb{E}}_{t}\|x^{t}-x^{t+1}\|_{W}^{2}
-2\langle x^{t}-x^{*},\mu^{t}\rangle-2{\mathbb{E}}_{t}\langle x^{t+1}-x^{t},g^{t}\rangle
=\|x^{t}-x^{*}\|_{W}^{2}-{\mathbb{E}}_{t}\|x^{t}-x^{t+1}\|_{W}^{2}
+2{\mathbb{E}}_{t}\langle x^{t}-x^{t+1},g^{t}-\mu^{t}\rangle+2{\mathbb{E}}_{t}\langle x^{*}-x^{t+1},\mu^{t}\rangle
=\|x^{t}-x^{*}\|_{W}^{2}-{\mathbb{E}}_{t}\|x^{t}-x^{t+1}\|_{W}^{2}-2{\mathbb{E}}_{t}\langle x^{t+1}-x^{*},G(x^{t+1})\rangle
+2{\mathbb{E}}_{t}\langle x^{*}-x^{t+1},\mu^{t}-G(x^{t+1})\rangle
+2{\mathbb{E}}_{t}\langle x^{t}-x^{t+1},g^{t}-\mu^{t}\rangle.

To proceed, we bound the inner product terms. Using the \alpha-strong monotonicity of G together with the equilibrium condition \langle x-x^{*},G(x^{*})\rangle\geq 0 for all x\in{\cal X}, we have that

{\mathbb{E}}_{t}\langle x^{t+1}-x^{*},G(x^{t+1})\rangle \geq\alpha{\mathbb{E}}_{t}\|x^{t+1}-x^{*}\|^{2}
\geq\frac{\alpha}{\omega_{1}}{\mathbb{E}}_{t}\|x^{t+1}-x^{*}\|_{W}^{2}.

Furthermore, we observe that

{\mathbb{E}}_{t}\langle x^{*}-x^{t+1},\mu^{t}-G(x^{t+1})\rangle
={\mathbb{E}}_{t}\langle x^{*}-x^{t+1},\mu^{t}-G(x^{t})\rangle
+{\mathbb{E}}_{t}\langle x^{*}-x^{t+1},G(x^{t})-G(x^{t+1})\rangle.

To bound the remaining terms, we use arguments based on a weighted Young's inequality. Let \Delta_{1},\Delta_{2},\Delta_{3}>0 be fixed constants. It follows that

2{\mathbb{E}}_{t}\langle x^{t}-x^{t+1},g^{t}-\mu^{t}\rangle
\leq\Delta_{1}{\mathbb{E}}_{t}\|x^{t+1}-x^{t}\|^{2}+\frac{1}{\Delta_{1}}{\mathbb{E}}_{t}\|g^{t}-\mu^{t}\|^{2}
\leq\frac{\Delta_{1}}{\omega_{n}}{\mathbb{E}}_{t}\|x^{t+1}-x^{t}\|_{W}^{2}+\frac{1}{\Delta_{1}}\sum_{i=1}^{n}{\mathbb{E}}_{t}\|g_{i}^{t}-\mu_{i}^{t}\|^{2}
\leq\frac{\Delta_{1}}{\omega_{n}}{\mathbb{E}}_{t}\|x^{t+1}-x^{t}\|_{W}^{2}+\frac{1}{\Delta_{1}}\sum_{i=1}^{n}\sigma_{i}^{2}
\leq\frac{\Delta_{1}}{\omega_{n}}{\mathbb{E}}_{t}\|x^{t+1}-x^{t}\|_{W}^{2}+\frac{\sigma^{2}}{\Delta_{1}},

and

2{\mathbb{E}}_{t}\langle x^{*}-x^{t+1},\mu^{t}-G(x^{t})\rangle
\leq\Delta_{2}{\mathbb{E}}_{t}\|x^{t+1}-x^{*}\|^{2}+\frac{1}{\Delta_{2}}{\mathbb{E}}_{t}\|\mu^{t}-G(x^{t})\|^{2}
\leq\frac{\Delta_{2}}{\omega_{n}}{\mathbb{E}}_{t}\|x^{t+1}-x^{*}\|_{W}^{2}+\frac{1}{\Delta_{2}}\sum_{i=1}^{n}{\mathbb{E}}_{t}\|\mu_{i}^{t}-G_{i}(x^{t})\|^{2}
\leq\frac{\Delta_{2}}{\omega_{n}}{\mathbb{E}}_{t}\|x^{t+1}-x^{*}\|_{W}^{2}+\frac{1}{\Delta_{2}}\sum_{i=1}^{n}\rho_{i}^{2}
\leq\frac{\Delta_{2}}{\omega_{n}}{\mathbb{E}}_{t}\|x^{t+1}-x^{*}\|_{W}^{2}+\frac{\rho^{2}}{\Delta_{2}}.

Additionally, we have that

2{\mathbb{E}}_{t}\langle x^{*}-x^{t+1},G(x^{t})-G(x^{t+1})\rangle
\leq\Delta_{3}{\mathbb{E}}_{t}\|x^{t+1}-x^{*}\|^{2}+\frac{1}{\Delta_{3}}{\mathbb{E}}_{t}\|G(x^{t})-G(x^{t+1})\|^{2}
\leq\frac{\Delta_{3}}{\omega_{n}}{\mathbb{E}}_{t}\|x^{t+1}-x^{*}\|_{W}^{2}+\frac{L^{2}}{\Delta_{3}}{\mathbb{E}}_{t}\|x^{t+1}-x^{t}\|^{2}
\leq\frac{\Delta_{3}}{\omega_{n}}{\mathbb{E}}_{t}\|x^{t+1}-x^{*}\|_{W}^{2}+\frac{L^{2}}{\omega_{n}\Delta_{3}}{\mathbb{E}}_{t}\|x^{t+1}-x^{t}\|^{2}_{W}.

Combining these estimates yields

\frac{\omega_{n}}{\omega_{1}}{\mathbb{E}}_{t}\|x^{t+1}-x^{*}\|^{2}_{W}
\leq\|x^{t}-x^{*}\|_{W}^{2}-{\mathbb{E}}_{t}\|x^{t+1}-x^{t}\|_{W}^{2}-\frac{2\alpha}{\omega_{1}}{\mathbb{E}}_{t}\|x^{t+1}-x^{*}\|_{W}^{2}
+\left(\frac{\Delta_{1}}{\omega_{n}}{\mathbb{E}}_{t}\|x^{t+1}-x^{t}\|_{W}^{2}+\frac{\sigma^{2}}{\Delta_{1}}\right)
+\left(\frac{\Delta_{2}}{\omega_{n}}{\mathbb{E}}_{t}\|x^{t+1}-x^{*}\|_{W}^{2}+\frac{\rho^{2}}{\Delta_{2}}\right)
+\left(\frac{\Delta_{3}}{\omega_{n}}{\mathbb{E}}_{t}\|x^{t+1}-x^{*}\|_{W}^{2}+\frac{L^{2}}{\omega_{n}\Delta_{3}}{\mathbb{E}}_{t}\|x^{t+1}-x^{t}\|^{2}_{W}\right)
=\|x^{t}-x^{*}\|_{W}^{2}+\left(\frac{\Delta_{1}}{\omega_{n}}+\frac{L^{2}}{\omega_{n}\Delta_{3}}-1\right){\mathbb{E}}_{t}\|x^{t+1}-x^{t}\|_{W}^{2}
+\left(\frac{\Delta_{2}}{\omega_{n}}+\frac{\Delta_{3}}{\omega_{n}}-\frac{2\alpha}{\omega_{1}}\right){\mathbb{E}}_{t}\|x^{t+1}-x^{*}\|_{W}^{2}
+\left(\frac{\sigma^{2}}{\Delta_{1}}+\frac{\rho^{2}}{\Delta_{2}}\right)

and simplifying gives

\left(\frac{\omega_{n}}{\omega_{1}}+\frac{2\alpha}{\omega_{1}}-\frac{\Delta_{2}}{\omega_{n}}-\frac{\Delta_{3}}{\omega_{n}}\right){\mathbb{E}}_{t}\|x^{t+1}-x^{*}\|_{W}^{2}
\leq\|x^{t}-x^{*}\|_{W}^{2}+\left(\frac{\sigma^{2}}{\Delta_{1}}+\frac{\rho^{2}}{\Delta_{2}}\right)
+\left(\frac{\Delta_{1}}{\omega_{n}}+\frac{L^{2}}{\omega_{n}\Delta_{3}}-1\right){\mathbb{E}}_{t}\|x^{t+1}-x^{t}\|_{W}^{2}.

To proceed, we choose \Delta_{2}=\Delta_{3}=\frac{\alpha\omega_{n}}{2\omega_{1}} and \Delta_{1}=\omega_{n}-2\omega_{1}L^{2}/(\alpha\omega_{n}), so that the coefficient on the {\mathbb{E}}_{t}\|x^{t+1}-x^{t}\|_{W}^{2} term is zero and the coefficient on the left-hand side equals (\alpha+\omega_{n})/\omega_{1}. Furthermore, enforcing that \frac{\omega_{1}}{\omega_{n}^{2}}\leq\frac{\alpha}{4L^{2}} guarantees that \Delta_{1}\geq\omega_{n}/2, i.e., \Delta_{1}^{-1}\leq 2\omega_{n}^{-1}, so that the variance term satisfies \frac{\sigma^{2}}{\Delta_{1}}+\frac{\rho^{2}}{\Delta_{2}}\leq\frac{2(\alpha\sigma^{2}+\omega_{1}\rho^{2})}{\alpha\omega_{n}}. Substituting these values and simplifying yields the result. \Box
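
For an algorithmic reading of the update analyzed in this lemma, the iterate is x^{t+1}=\operatorname*{arg\,min}_{y\in{\cal X}}\frac{1}{2}\|x^{t}-W^{-1}g^{t}-y\|_{W}^{2}, i.e., a W-weighted projected stochastic-gradient step. The Python sketch below is a minimal illustration under assumed data (a strongly monotone linear game gradient, a box constraint set, and a diagonal W are all assumptions); with diagonal W the weighted projection onto a box is separable and reduces to clipping:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 3                                    # three players, one scalar decision each
alpha = 1.0

# Illustrative strongly monotone "game gradient" G(x) = M x + b:
# symmetric part of M is alpha*I, so G is alpha-strongly monotone.
M = alpha * np.eye(n) + 0.3 * np.array([[0, 1, -1], [-1, 0, 1], [1, -1, 0]])
b = rng.normal(size=n)
x_ref = np.clip(np.linalg.solve(M, -b), 0.0, 1.0)   # indicative reference only

W = np.diag([4.0, 5.0, 6.0])             # diagonal step-size matrix
x = np.zeros(n)
for t in range(2000):
    g = M @ x + b + 0.05 * rng.normal(size=n)       # noisy gradient estimate g^t
    y = x - np.linalg.solve(W, g)                   # x^t - W^{-1} g^t
    # W-weighted projection onto the box [0,1]^n: separable for diagonal W,
    # hence identical to componentwise clipping.
    x = np.clip(y, 0.0, 1.0)

print("final iterate:", x)
print("distance to clipped unconstrained zero of G:", np.linalg.norm(x - x_ref))
```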

6.6 Proof of Theorem 9

For notational convenience, we use the shorthand e^{t}:=\|x^{t}-x^{*}\|_{W}^{2}, c=\omega_{1}/(\alpha+\omega_{n}), and

A=2\,\frac{\alpha\sigma^{2}+\omega_{1}\rho^{2}}{\alpha\omega_{n}}\,.

Hence, the result in Lemma 8 can be written compactly as

{\mathbb{E}}_{t-1}e^{t}\leq ce^{t-1}+cA.

By recursively applying this result and using the law of total expectation, we find that

{\mathbb{E}}\,e^{t}\leq c^{t}e^{0}+cA\sum_{j=0}^{t-1}c^{j}\leq c^{t}e^{0}+cA\,\frac{1-c^{t}}{1-c}.

Furthermore, if \omega_{1}-\omega_{n}<\alpha, then c<1, the geometric series converges, and its partial sums are bounded by the limit 1/(1-c). Hence {\mathbb{E}}\,e^{t}\leq c^{t}e^{0}+A\,\frac{c}{1-c}. \Box
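
As a quick numerical sanity check of this contraction argument (the constants below are arbitrary and purely illustrative), iterating the worst case of the recursion {\mathbb{E}}_{t-1}e^{t}\leq ce^{t-1}+cA shows geometric convergence of the error bound to a neighborhood of size cA/(1-c):

```python
# Worst-case iteration of the bound E[e^t] <= c*e^{t-1} + c*A from Lemma 8.
c, A, e = 0.9, 0.5, 10.0          # arbitrary illustrative constants with c < 1
limit = c * A / (1 - c)           # asymptotic error neighborhood
for t in range(1, 201):
    e = c * e + c * A
    if t % 50 == 0:
        print(f"t={t:3d}  e^t={e:.6f}  neighborhood cA/(1-c)={limit:.6f}")
```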

6.7 Proof of Theorem 10

Fix t\geq 0. For notational convenience, we denote e^{t}=\|x^{t}-x^{*}\|^{2}. Replacing the step-size matrix in Lemma 8 with W=\omega^{t}I_{d\times d} yields

{\mathbb{E}}_{t}e^{t+1}\leq\frac{\omega^{t}}{\omega^{t}+\alpha}e^{t}+\frac{2\sigma^{2}}{\omega^{t}(\omega^{t}+\alpha)}+\frac{2(\rho^{t})^{2}}{\alpha(\omega^{t}+\alpha)}. (40)

To proceed, we will use the observation that

\frac{1}{(s+t)(r+t)}=\frac{r+t}{(s+t)(r+t)^{2}}\leq\frac{\max\{\frac{r}{s},1\}}{(r+t)^{2}} (41)

and

\frac{1}{(r+t)(r+t-2)}\leq\frac{\frac{r}{r-2}}{(r+t)^{2}}. (42)

By substituting our expressions for \omega^{t}, \rho^{t}, and e^{t} into (40), we obtain

{\mathbb{E}}_{t}e^{t+1} \leq\frac{r+t-2}{\alpha^{2}(r+t)^{2}}A+\frac{8\sigma^{2}}{\alpha^{2}(r+t-2)(r+t)}
+\frac{4\bar{\rho}}{\alpha^{2}(s+t)(r+t)}
\leq\frac{r+t-2}{\alpha^{2}(r+t)^{2}}A+\frac{8\sigma^{2}\left(\frac{r}{r-2}\right)}{\alpha^{2}(r+t)^{2}}+\frac{4\bar{\rho}\max\left\{\frac{r}{s},1\right\}}{\alpha^{2}(r+t)^{2}}
=\frac{r+t-1}{\alpha^{2}(r+t)^{2}}A
+\frac{-A+8\sigma^{2}\left(\frac{r}{r-2}\right)+4\bar{\rho}\max\left\{\frac{r}{s},1\right\}}{\alpha^{2}(r+t)^{2}}
\leq\frac{r+t-1}{\alpha^{2}(r+t)^{2}}A
\leq\frac{A}{\alpha^{2}(r+t+1)}.

Here, the last two steps follow from the construction of A (which must satisfy A\geq 8\sigma^{2}\frac{r}{r-2}+4\bar{\rho}\max\{\frac{r}{s},1\} for the penultimate inequality) and from the fact that (r+t+1)(r+t-1)\leq(r+t)^{2}. \Box
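
The O(1/t) behavior can likewise be checked numerically. The following Python sketch iterates the bound (40) with decaying sequences \omega^{t}=\alpha(r+t-2)/2 and (\rho^{t})^{2}=\bar{\rho}/(s+t); these specific forms are consistent with the substitutions above but are stated here, together with the numerical constants, as assumptions for illustration only. The iterates remain below the envelope A/(\alpha^{2}(r+t)):

```python
import numpy as np

# Illustrative constants; specific values are assumptions, not the paper's.
alpha, sigma2, rho_bar = 1.0, 0.5, 0.3
r, s = 4.0, 4.0
# Any A at least this large supports the induction in the proof above.
A = 8 * sigma2 * r / (r - 2) + 4 * rho_bar * max(r / s, 1.0)

e = A / (alpha**2 * r)            # start on the claimed envelope A/(alpha^2 (r+t)) at t=0
for t in range(0, 200):
    omega_t = alpha * (r + t - 2) / 2.0          # assumed decaying step-size parameter
    rho_t2 = rho_bar / (s + t)                   # assumed decaying bias term
    e = (omega_t / (omega_t + alpha)) * e \
        + 2 * sigma2 / (omega_t * (omega_t + alpha)) \
        + 2 * rho_t2 / (alpha * (omega_t + alpha))
    envelope = A / (alpha**2 * (r + t + 1))
    assert e <= envelope + 1e-12, (t, e, envelope)
print("recursion (40) stays below the A / (alpha^2 (r+t)) envelope for all tested t")
```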

ACKNOWLEDGMENTS

This research was supported in part by the National Science Foundation (NSF) Mathematical Sciences Graduate Internship (MSGI) Program and by NSF awards 1941896 and 2044946.

The MSGI program is administered by the Oak Ridge Institute for Science and Education (ORISE) through an inter-agency agreement between the U.S. Department of Energy (DOE) and NSF. ORISE is managed for DOE by ORAU. All opinions expressed in this paper are the authors’ and do not necessarily reflect the policies and views of NSF, ORAU/ORISE, or DOE.

{IEEEbiography}

Killian Wood is a Ph.D. candidate in applied mathematics at the University of Colorado Boulder, advised by Prof. Emiliano Dall’Anese. He received a B.A. degree in mathematics from California State University Fullerton, Fullerton, California, USA in 2019 and an M.S. degree in applied mathematics from University of Colorado Boulder, Boulder, Colorado, USA in 2022. His research interests include the development and analysis of stochastic optimization for decision-dependent data distributions, as well as randomized derivative-free methods in scientific computing.

{IEEEbiography}

Ahmed S. Zamzam (Member, IEEE) is a senior research scientist at the National Renewable Energy Laboratory, where he is part of the Power Systems Engineering Center. He received the B.Sc. degree from Cairo University in 2013 and the Ph.D. degree in electrical engineering from the University of Minnesota, Minneapolis, MN, USA, in 2019. He received the Louis John Schnell Fellowship (2015) and the Doctoral Dissertation Fellowship (2018) from the University of Minnesota. His research interests include machine learning and optimization for smart grid applications, large-scale complex energy systems optimization, and grid data analytics.

{IEEEbiography}

Emiliano Dall’Anese (Member, IEEE) is an Associate Professor in the Department of Electrical, Computer, and Energy Engineering at the University of Colorado Boulder, where he is also an affiliate Faculty with the Department of Applied Mathematics. He received the Ph.D. in Information Engineering from the Department of Information Engineering, University of Padova, Italy, in 2011. He was with the University of Minnesota as a postdoc (2011-2014) and the National Renewable Energy Laboratory as a senior researcher (2014-2018). His research interests span the areas of optimization, control, and learning; current applications include power systems and autonomous systems. He received the National Science Foundation CAREER Award in 2020, the IEEE PES Prize Paper Award in 2021, and the IEEE Transactions on Control of Network Systems Best Paper Award in 2023.