

CORRESPONDING AUTHOR: Emiliano Dall’Anese (e-mail: [email protected]). This work was supported by the National Science Foundation (NSF) awards 1941896 and 2044946, and by the NSF Mathematical Sciences Graduate Internship Program.

Solving Decision-Dependent Games by Learning from Feedback

Killian Wood (1), Ahmed Zamzam (3), Emiliano Dall’Anese (1,2)
(1) Department of Applied Mathematics, University of Colorado Boulder, CO 80309 USA
(2) Department of Electrical, Computer, and Energy Engineering, University of Colorado Boulder, CO 80309 USA
(3) National Renewable Energy Laboratory, Golden, CO 80523 USA
Abstract

This paper tackles the problem of solving stochastic optimization problems with a decision-dependent distribution in the setting of stochastic strongly-monotone games and when the distributional dependence is unknown. A two-stage approach is proposed, which initially involves estimating the distributional dependence on decision variables, and subsequently optimizing over the estimated distributional map. The paper presents guarantees for the approximation of the cost of each agent. Furthermore, a stochastic gradient-based algorithm is developed and analyzed for finding the Nash equilibrium in a distributed fashion. Numerical simulations are provided for a novel electric vehicle charging market formulation using real-world data.

Index Terms

Optimization, Stochastic monotone games, Decision-dependent distribution, Learning.

1 INTRODUCTION

The efficacy of stochastic optimization [nemirovskistochastic] and stochastic games [lei2022stochastic, koshal2010single, liu2011stochastic, lei2020synchronous, franci2021stochastic] generally hinges on the premise that the underlying data distribution is stationary. This means that the distribution of the data, which parameterize the problem or the game, does not change throughout the execution of the algorithm used to solve the stochastic problem or game, and is neither influenced by nor dependent on time or the optimization variables themselves. This is a common setup that has been considered when game-theoretic frameworks have been applied to problems in, for example, ride hailing [fabiani2022stochastic], routing [bakhshayesh2021decentralized], charging of electric vehicles (EVs) [fele2019scenario, paccagnan2018nash], power markets [kannan2011strategic], power systems [zhou2019online], and in several approaches for training of neural networks [franci2021training]. However, this assumption can be invalid in a variety of setups in which the cost to be minimized is parameterized by data that is received from populations or a collection of automated control systems, whose response is uncertain and depends on the output of the optimization problem itself. As an example, in a competitive market for EV charging [fele2019scenario, tushar2012economics], the operators seek to find the charging prices (i.e., the optimization variables) to maximize the revenue from EVs; however, the expected demand (i.e., the “data” of the problem) is indeed dependent on the price itself. More broadly, power consumption in power distribution grids depends on electricity prices [mathieu2011examining]. A similar example pertains to ride hailing [fabiani2022stochastic].

To accommodate this scenario, the so-called stochastic optimization with decision-dependent distributions (also known as performative prediction [perdomo2020performative]) posits that we instead represent the data distribution used in optimization as a distributional map x\mapsto D(x), where x collects the decision variables [drusvyatskiy2022stochastic, miller2021outside, narang2022learning, perdomo2020performative, wood2023stochastic]. In this work, we study decision-dependent stochastic games in which each player seeks to minimize their cost (a function of their own optimization variables) given the other players’ optimization variables, and where the data distribution of each player depends on the actions of all players (we will use the terms player and agent interchangeably).

We focus on solving the Nash equilibrium problem of a game, which is to find a decision from which no agent is incentivized by their own cost to deviate when it is played. Formally, the stochastic Nash equilibrium problem with decision-dependent distributions considered in this paper is to find a point x^{*}=(x_{1}^{*},\ldots,x_{n}^{*})\in\mathbb{R}^{d} such that

x_{i}^{*}\in\operatorname*{arg\,min}_{x_{i}\in{\cal X}_{i}}F_{i}(x_{i},x_{-i}^{*}),\quad\forall\,i\in\{1,\ldots,n\} (1)

with F_{i}(x_{i},x_{-i}^{*}) defined as:

F_{i}(x_{i},x_{-i}^{*}):=\underset{z_{i}\sim D_{i}(x_{i},x_{-i}^{*})}{{\mathbb{E}}}f_{i}(x_{i},x_{-i}^{*},z_{i}) (2)

where: z_{i} denotes a random variable supported on \mathbb{R}^{k_{i}}; f_{i}:\mathbb{R}^{d}\times\mathbb{R}^{k_{i}}\rightarrow\mathbb{R} is a scalar-valued function that is convex and continuously differentiable in x_{i}; {\cal X}_{i}\subseteq\mathbb{R}^{d_{i}} is a compact convex set; and D_{i}:\mathbb{R}^{d}\rightarrow{\cal P}(\mathbb{R}^{k_{i}}) is a distributional map whose output is a probability distribution supported on \mathbb{R}^{k_{i}}.

Standard stochastic first-order methods are insufficient for solving problems of this form. As we will demonstrate later in the paper, even estimating the expected gradient from samples requires knowledge of the probability density function associated with D_{i}—which is not available in a majority of practical applications.

Hereafter, we use the term “system” to refer to a population or a collection of automated controllers producing a response z_{i}\in\mathbb{R}^{k_{i}} upon observing x. To illustrate our setup, consider again the example where each agent represents an EV charging provider. Here, x_{i}\in\mathbb{R}^{d_{i}} represents the charging price at a station managed by provider i, expressed in $/kWh. Correspondingly, z_{i} indicates the demand for the service at that price, while f_{i} is the service cost (or the negative of the total profit) for provider i. This is an example of a competitive market in which the demand for service is a function of the prices of all providers; see, for example, the game-theoretic approaches presented in [mediwaththe2018game, fele2019scenario] and the Stackelberg game presented in [tushar2012economics]. However, compared to existing game-theoretic models for EV markets, the framework proposed in this paper allows for an uncertain response of EV owners to price variations; this randomness is difficult to model, as it is related to the drivers’ preferences and other externalities such as the locations of the charging stations, as explained in, e.g., [mediwaththe2018game, latinopoulos2017response, daina2017electric].

Challenges in solving problems of this form typically stem from the fact that the distributional maps D_{i} are often unknown [sun2014bayesian, den2015dynamic, cheung2017dynamic, chen2021nonparametric]. To overcome this challenge, we propose a learning-based optimization procedure – in the spirit of the methods proposed for convex optimization in [lin2023plug, miller2021outside] – to tackle the multi-player decision-dependent stochastic game. The key idea behind this framework is that we first propose a parameterization for the distributional map of the system and estimate it from responses. Then, we use the estimated distributional map throughout the game without requiring further interaction with the system.

1.1 RELATED WORK

Our work incorporates themes from games, learning, as well as stochastic optimization with decision-dependent distributions. We highlight the relationship with this relevant literature below.

Games. Within the context of games, our work is specifically focused on solving Nash equilibrium problems using gradient-based methods and a variational inequality (VI) framework. The literature on stochastic games is extensive; for a comprehensive yet concise review of the subject, we refer the reader to the tutorials [scutari2010convex] and [lei2022stochastic]; see also pertinent references therein. A common denominator of existing frameworks is that the data distribution is stationary. The work of [rosen1965existence] demonstrates that strictly monotone games have unique solutions and that gradient play converges to it. The modern approach of solving Nash equilibrium problems for continuous games via variational inequalities can be attributed to Facchinei and Pang [facchinei2003finite, harker1990finite]. For solving strongly monotone variational inequalities, the projected gradient method is capable of converging linearly.

Our work adds the additional complexity of minimizing communication between agents and hence we use a distributed gradient approach in our optimization algorithm. Distributed gradient methods have been explored extensively in the literature on convex optimization, though less so in that of variational inequalities. We refer the reader to [nedic2020distributed] for a review in the convex optimization setting, and [kovalev2022optimal] for variational inequalities.

Decision-Dependent Data. This paper contributes to the growing body of literature that studies stochastic optimization with decision-dependent data distributions. While the concept of decision-dependent distributions has existed for some time, its modern treatment can be traced to the “Performative Prediction” framework and its use within the machine learning community [perdomo2020performative]; this work posits the formulation of optimization problems in which the data distribution is explicitly dependent on the optimization variables, and proposes repeated retraining (and the limit points thereof) as a solution. The limit points form a family of points that solve the induced stationary optimization problem. For repeated retraining, convergence of various stochastic gradient algorithms is studied in [drusvyatskiy2022stochastic, mendler2020stochastic] in the batch setting, and in the time-varying setting in [cutler2021stochastic, wood2021online]. The extension to games includes two-player zero-sum games in [wood2021online], and general multiplayer games in [narang2022learning]. Additional recent extensions to this line of work include distributionally robust optimization [basciftci2021distributionally, luo2020distributionally] and time-varying optimization [wood2021online, cutler2021stochastic].

Recent works have since pivoted towards directly solving the optimization problem by incorporating a model for the distributional map. In [miller2021outside], the authors provide conditions that ensure convexity of the expected cost, alongside a learning-based algorithm for finding solutions to problems with location-scale families. They show that location-scale families can be learned from the system prior to optimization, without requiring further interaction with the system during optimization. The work of [lin2023plug] extends this learning framework beyond location-scale families and derives regret bounds that incorporate the associated approximation and statistical errors arising from the statistical learning problem. Outside of the distribution learning approach, derivative-free methods have been studied in [miller2021outside, narang2022learning, wood2023stochastic]. While these methods do not depend on a learned model of the distributional map, they generally exhibit slower convergence rates. On the other hand, adaptive learning strategies, particularly for location-scale models, have shown promise, as discussed in [narang2022learning], whereby the model is learned during optimization. However, their exploration remains limited to this specific model category.

This work complements the technical findings in [lin2023plug] by offering a generalization of the so-called “plug-in” approach presented in that work to the multi-agent setting. Relative to [narang2022learning], we focus on solving the Nash equilibrium problem for a general class of models by learning the distributional map prior to optimization—thereby reducing the number of interactions with the system required to obtain a reasonable solution.

1.2 CONTRIBUTIONS

In this work, we provide the following contributions to the body of work on stochastic optimization and Nash equilibrium problems with decision-dependent distributions.

  • (i)

    We propose an algorithm for finding a Nash equilibrium in stochastic games with decision-dependent distributions where: (i) the distributional map for each player’s cost is estimated from samples, and (ii) the estimated distributional map is used in gradient-based strategies.

  • (ii)

    We provide guarantees on the approximation error of distributional maps for a class of map learning problems.

  • (iii)

    We show that the parameterized cost approximates the ground-truth in high-probability.

  • (iv)

    We propose a stochastic gradient-based algorithm for solving a parameterized strongly-monotone game, and we demonstrate linear convergence in expectation.

  • (v)

    Finally, we provide numerical simulations of an EV charging market formulation using real-world data. The EV market formulation is new in the context of energy markets, thus providing contributions in this area.

1.3 ORGANIZATION

In Section 2, we provide necessary notation and background for our analysis. In Section 3, we discuss the proposed learning algorithm in detail and present our primary result. Section 4 discusses the details of the optimization stage. We provide our numerical simulations in Section 5. Proofs of the results are provided in the Appendix.

2 Notation and Preliminaries

Throughout the paper, d\mathbb{R}^{d} denotes the dd-dimensional Euclidean space with inner product ,\langle\cdot,\cdot\rangle, and Euclidean norm \|\cdot\|. For a matrix Xn×mX\in\mathbb{R}^{n\times m}, X\|X\| denotes the spectral norm. For a given integer nn, [n][n] denotes the set {1,2,,n}\{1,2,\ldots,n\} and 𝒮n1\mathcal{S}^{n-1} denotes the Euclidean hypersphere in nn dimensions, {xn|x2=1}\{x\in\mathbb{R}^{n}|\ \|x\|_{2}=1\}. The symbol 𝟙d{\bf\mathds{1}}_{d} is used to denote the dd-dimensional vector of all ones. Given vectors xnx\in\mathbb{R}^{n} and zmz\in\mathbb{R}^{m}, we let (x,z)n+m(x,z)\in\mathbb{R}^{n+m} denote their concatenation.

For a symmetric positive definite matrix W\in\mathbb{R}^{d\times d}, the weighted inner product is defined by \langle x,y\rangle_{W}=\langle x,Wy\rangle, with corresponding weighted norm \|x\|_{W}=\sqrt{\langle x,x\rangle_{W}}, for any x,y\in\mathbb{R}^{d}. The weighted projection onto a set {\cal X}\subseteq\mathbb{R}^{d} with respect to the symmetric positive definite matrix W\in\mathbb{R}^{d\times d} is given by the map

\mathsf{proj}_{{\cal X},W}(x):=\operatorname*{arg\,min}_{y\in{\cal X}}\frac{1}{2}\|x-y\|_{W}^{2} (3)

for any x\in\mathbb{R}^{d}.
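For example, when W is diagonal and {\cal X} is a box (a product of intervals), the minimization in (3) separates across coordinates, so the weighted projection reduces to the ordinary coordinate-wise projection; this observation is useful when implementing the per-agent updates of Section 4.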

2.1 Probability measures

Throughout this work, we restrict our focus to random variables drawn from continuous probability distributions supported over the Euclidean space. When random variables X,YkX,Y\in\mathbb{R}^{k} are equal in distribution, i.e., P(Xx)=P(Yx)P(X\leq x)=P(Y\leq x) for all xkx\in\mathbb{R}^{k}, we write X=𝑑YX\overset{d}{=}Y.

Our analysis includes study of sub-exponential random vectors. A univariate random variable XX\in\mathbb{R} is said to be sub-exponential with modulus θ>0\theta>0 provided that the survival function satisfies (|X|t)2exp(t/θ){\mathbb{P}}(|X|\geq t)\leq 2\exp(-t/\theta) for all t0t\geq 0. By extension, a random vector XkX\in\mathbb{R}^{k} is sub-exponential provided that u,X\langle u,X\rangle is a sub-exponential random variable for all u𝒮k1u\in{\cal S}^{k-1}.

To compare probability distributions, we will be interested in computing the distance between their associated probability measures—for which we need a complete metric space. We let 𝒫(k)\mathcal{P}(\mathbb{R}^{k}) denote the set of finite first moment probability measures supported on k\mathbb{R}^{k} and write the Wasserstein-11 distance as

W_{1}(\mu,\nu)=\sup_{h\in\mathcal{L}_{1}}\bigg\{{\mathbb{E}}_{X\sim\mu}[h(X)]-{\mathbb{E}}_{Y\sim\nu}[h(Y)]\bigg\}

for any μ,ν𝒫(k)\mu,\nu\in\mathcal{P}(\mathbb{R}^{k}), where 1\mathcal{L}_{1} is the set of all 11-Lipschitz continuous functions h:kh:\mathbb{R}^{k}\rightarrow\mathbb{R}. Under these conditions, the set (𝒫(k),W1)(\mathcal{P}(\mathbb{R}^{k}),W_{1}) forms a complete metric space [bogachev2012monge].
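As a simple illustration (useful later for location-scale models): if \mu and \nu are the distributions of \xi+a and \xi+b for a common random vector \xi and fixed vectors a,b\in\mathbb{R}^{k}, then coupling the two distributions through the same \xi gives W_{1}(\mu,\nu)\leq{\mathbb{E}}\|(\xi+a)-(\xi+b)\|=\|a-b\|.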

2.2 Games

We consider a game that consists of nn players. Each player has a cost function FiF_{i}, distributional map DiD_{i}, and decision set 𝒳idi\mathcal{X}_{i}\subseteq\mathbb{R}^{d_{i}}. Hence, each player chooses a decision, or strategy xi𝒳idix_{i}\in\mathcal{X}_{i}\subseteq\mathbb{R}^{d_{i}}. The concatenation of the decision variables is written as x=(x1,,xn)𝒳dx=(x_{1},\ldots,x_{n})\in\mathcal{X}\subseteq\mathbb{R}^{d} where 𝒳=i=1n𝒳i\mathcal{X}=\prod_{i=1}^{n}\mathcal{X}_{i} and d=i=1ndid=\sum_{i=1}^{n}d_{i}. For a fixed agent ii, we will decompose the decision xx as x=(xi,xi)x=(x_{i},x_{-i}) where xiddix_{-i}\in\mathbb{R}^{d-d_{i}} is the strategy vector of all agents excluding the iith one.

The collection of costs FiF_{i} and decision sets 𝒳i{\cal X}_{i} defines the game

minxi𝒳iFi(xi,xi),i[n].\min_{x_{i}\in\mathcal{X}_{i}}F_{i}(x_{i},x_{-i}),\,\,\,\,\ i\in[n]. (4)

A Nash equilibrium of this game is a point x𝒳x^{*}\in\mathcal{X} provided that

xiargminxi𝒳iFi(xi,xi)x^{*}_{i}\in\operatorname*{arg\,min}_{x_{i}\in{\cal X}_{i}}F_{i}(x_{i},x_{-i}^{*}) (5)

for all i[n]i\in[n]. Intuitively, xx^{*} is a strategy such that no agent can be incentivized by its cost to deviate from xix_{i}^{*} when all other agents play xix_{-i}^{*}. Finding Nash equilibria is the primary focus of this work.

Games of this form are commonly cast into a variational inequality framework. This is due, in part, to the observation that the Nash equilibria x𝒳x^{*}\in{\cal X} are the solutions to the variational inequality

xx,G(x)0,x𝒳,\langle x-x^{*},G(x^{*})\rangle\geq 0,\,\,\,\,\forall\,x\in{\cal X},

where the gradient map G:ddG:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d} is defined as

G(x)=(1F1(x),,nFn(x)).G(x)=\left(\nabla_{1}F_{1}(x),\ldots,\nabla_{n}F_{n}(x)\right). (6)

Here, the notation i\nabla_{i} is used to represent the partial gradient xi\nabla_{x_{i}}. We will denote the set of Nash equilibrium of a game with gradient map GG and domain 𝒳{\cal X} as Nash(G,𝒳)\texttt{Nash}(G,{\cal X}). Existence of solutions to variational inequalities of this form is guaranteed provided that the set 𝒳{\cal X} is convex and compact and the gradient map GG is monotone; uniqueness is guaranteed when GG is strongly-monotone [facchinei2003finite]. We say that GG is α\alpha-strongly-monotone on 𝒳{\cal X} provided that there exists α>0\alpha>0 such that

xy,G(x)G(y)αxy2,x,y𝒳,\langle x-y,G(x)-G(y)\rangle\geq\alpha\|x-y\|^{2},\,\,\,\forall\,x,y\in{\cal X}, (7)

and monotone when α=0\alpha=0. In this work, we primarily focus on strongly-monotone games. While monotone games are tractable, methods for solving them with decision-dependent distributions require alternative gradient estimators—a topic we leave to future work.

2.3 Monotonicity in Decision-Dependent Games

In this work, we introduce the additional complexity to the formulation in (4) that the FiF_{i}’s are the expected cost over a distributional map Di:d𝒫(ki)D_{i}:\mathbb{R}^{d}\rightarrow{\cal P}(\mathbb{R}^{k_{i}}). In particular, we write the cost as

Fi(xi,xi):=𝔼ziDi(xi,xi)fi(xi,xi,zi).F_{i}(x_{i},x_{-i}):=\underset{z_{i}\sim D_{i}(x_{i},x_{-i})}{{\mathbb{E}}}f_{i}(x_{i},x_{-i},z_{i}). (8)

This can be written alternatively as the integral

Fi(x)=kifi(x,zi)pi(zi,x)dziF_{i}(x)=\int_{\mathbb{R}^{k_{i}}}f_{i}(x,z_{i})p_{i}(z_{i},x)\text{d}z_{i} (9)

where pip_{i} is the probability density function for the distribution Di(x)D_{i}(x). When the integral satisfies the Dominated Convergence Theorem, computing the gradient amounts to differentiating under the integral and using the product rule. We then obtain

iFi(x)=𝔼ziDi(x)[xifi(x,zi)+fi(x,zi)ilogpi(x;zi)],\nabla_{i}F_{i}(x)=\underset{z_{i}\sim D_{i}(x)}{{\mathbb{E}}}\left[\nabla_{x_{i}}f_{i}(x,z_{i})+f_{i}(x,z_{i})\nabla_{i}\log p_{i}(x;z_{i})\right], (10)

where we recall that G(x)=(1F1(x),,nFn(x))G(x)=\left(\nabla_{1}F_{1}(x),\ldots,\nabla_{n}F_{n}(x)\right). In short, characterizing the gradient of this decision-dependent game requires assumptions not only on fif_{i}, but also on the properties of the distributional map DiD_{i}. Sufficient conditions for strong monotonicity of the game in (4) are due to [narang2022learning] and are stated in terms of the decoupled costs, given by

Fi(x,y)=𝔼ziDi(y)fi(x,zi)F_{i}(x,y)=\underset{z_{i}\sim D_{i}(y)}{{\mathbb{E}}}f_{i}(x,z_{i}) (11)

for all x,ydx,y\in\mathbb{R}^{d}, and their associated decoupled partial gradients

Gi(x,y)=𝔼ziDi(y)ifi(x,zi),G_{i}(x,y)=\underset{z_{i}\sim D_{i}(y)}{{\mathbb{E}}}\nabla_{i}f_{i}(x,z_{i}), (12)

for all x,ydx,y\in\mathbb{R}^{d} and

Hi(x,y)=yi𝔼ziDi(y)fi(x,zi)H_{i}(x,y)=\nabla_{y_{i}}\underset{z_{i}\sim D_{i}(y)}{{\mathbb{E}}}f_{i}(x,z_{i}) (13)

for all x,ydx,y\in\mathbb{R}^{d}. A key observation used in the proof is that Gi(x)=iFi(x)=Gi(x,x)+Hi(x,x)G_{i}(x)=\nabla_{i}F_{i}(x)=G_{i}(x,x)+H_{i}(x,x).

Theorem 1.

(Strong Monotonicity, [narang2022learning]) Suppose that,

  • (i)

    For all y𝒳y\in{\cal X}, xG(x,y)x\mapsto G(x,y) is λ\lambda-strongly monotone,

  • (ii)

    For all x𝒳x\in{\cal X}, yH(x,y)y\mapsto H(x,y) is monotone,

and that for all i[n]i\in[n],

  • (iii)

    For all x𝒳,ziifi(x,zi)x\in{\cal X},z_{i}\mapsto\nabla_{i}f_{i}(x,z_{i}) is LiL_{i}-Lipschitz continuous,

  • (iv)

y\mapsto D_{i}(y) is \gamma_{i}-Lipschitz continuous on ({\cal P}(\mathbb{R}^{k_{i}}),W_{1}).

Set κ=i=1n(γiLiλ)2\kappa=\sqrt{\sum_{i=1}^{n}(\frac{\gamma_{i}L_{i}}{\lambda})^{2}}. Then if κ<1/2\kappa<1/2, xG(x)x\mapsto G(x) is α=(12κ)λ\alpha=(1-2\kappa)\lambda-strongly monotone. \Box
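As a numerical illustration (with hypothetical constants, not taken from the paper): for n=2 players with \gamma_{i}L_{i}/\lambda=0.3 for both, we get \kappa=\sqrt{2\cdot 0.3^{2}}\approx 0.42<1/2, so x\mapsto G(x) is \alpha-strongly monotone with \alpha=(1-2\kappa)\lambda\approx 0.15\lambda.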

3 Learning-based Decision-Dependent Games

In this work, we aim to solve the stochastic Nash equilibrium problem with decision-dependent data distributions as formulated in (1). Methods for finding Nash equilibria in games with decision-dependent data distributions either use derivative-free optimization, at the expense of an extremely slow rate, or use derivative information in conjunction with a learned model of the distributional map [narang2022learning].

In [lin2023plug], it is shown that a “plug-in” optimization approach, whereby a model for the distributional map is learned from samples prior to optimization, yields a bounded excess risk for convex optimization problems with decision-dependent data. In this work, we leverage the properties of the system to simplify the communication structure of our approach, which we depict in Figure 1. We assume that realizations of z_{i} can be directly observed from the system, and the decisions x_{-i} can be obtained from a server or are made public (for example, the prices of EV charging of different providers can be observed at the various stations).

Input: m, \{D_{x_{i}}\}_{i=1}^{n}
for j\in[m] do
       for i\in[n] do
             Draw x_{i}^{(j)}\sim D_{x_{i}} ;
       end for
       Deploy x^{(j)} ;
       Observe z_{i}^{(j)}\sim D_{i}(x^{(j)}) ;
end for
for i\in[n] do
       Fit \widehat{\beta}_{i}\in\operatorname*{arg\,min}_{\beta_{i}\in{\cal B}_{i}}\frac{1}{m}\sum_{j=1}^{m}R_{i}(x^{(j)},z_{i}^{(j)},\beta_{i}) ;
end for
Compute \widehat{x}\in\texttt{Nash}(G_{\widehat{\beta}},{\cal X}) ;
Algorithm 1 Multi-phase Optimization

To accommodate this setting, we propose a multi-stage approach consisting of the following phases: (i) sampling; (ii) learning; (iii) optimization. It is important to note that, following the learning phase, players only need to participate in gradient play without receiving any additional feedback from the system in the form of z_{i}\sim D_{i}(x). This is distinct from existing approaches in which performatively stable points can only be reached after several (even thousands of) rounds of feedback [perdomo2020performative, narang2022learning, wood2023stochastic], and performatively optimal points can only be reached for models known to be location-scale families a priori [miller2021outside, narang2022learning].

Figure 1: Communication structure: agents interact with the system (the square block) by sending their decisions x_{i}. After deploying, agents can receive feedback from the system in the form of the other agents’ decisions x_{-i} and data z_{i}.

Sampling. In the sampling phase, we require that players collaborate by each deploying a set of decisions \{x_{i}^{(j)}\}_{j=1}^{m}\overset{i.i.d.}{\sim}D_{x_{i}} so that they can collectively receive feedback z_{i}^{(j)}\sim D_{i}(x^{(j)}) from the system (in response to their deployed decisions \{x_{i}^{(j)}\}_{j=1}^{m}). The result is that each agent has access to a dataset \{x^{(j)},z_{i}^{(j)}\}_{j=1}^{m}, which they can use to learn their distributional map D_{i}.

Learning. In this procedure, each player will choose a hypothesis class of parameterized functions

i={Dβi|βiii},{\cal H}_{{\cal B}_{i}}=\left\{D_{\beta_{i}}|\ \beta_{i}\in{\cal B}_{i}\subseteq\mathbb{R}^{\ell_{i}}\right\}, (14)

as well as a suitable criterion or risk function RiR_{i}, to formulate their own expected risk minimization problem

βiargminβii𝔼xDx,ziDi(x)Ri(x,zi,βi)\beta^{*}_{i}\in\operatorname*{arg\,min}_{\beta_{i}\in{\cal B}_{i}}\underset{x\sim D_{x},\\ z_{i}\sim D_{i}(x)}{{\mathbb{E}}}R_{i}(x,z_{i},\beta_{i}) (15)

over the random variable (x,zi)(x,z_{i}) drawn from the coupled distribution (Dx,Di(x))(D_{x},D_{i}(x)). Then, using the set of samples from the previous sampling phase, they can formulate the corresponding empirical risk minimization problem

β^iargminβii1mj=1mRi(x(j),zi(j),βi).\widehat{\beta}_{i}\in\operatorname*{arg\,min}_{\beta_{i}\in{\cal B}_{i}}\frac{1}{m}\sum_{j=1}^{m}R_{i}(x^{(j)},z_{i}^{(j)},\beta_{i}). (16)

The result is a learned distributional map Dβ^iD_{\widehat{\beta}_{i}} approximating DiD_{i}, which we can now use to solve the approximate Nash equilibrium problem.

Optimization. Following the approximation phase, each player now has a learned model of their distributional map D_{\widehat{\beta}_{i}}, which can be used to formulate an approximation of the ground-truth cost F_{i} and hence an approximate Nash equilibrium problem:

x^iargminxi𝒳iFβ^i(xi,x^i)\displaystyle\widehat{x}_{i}\in\operatorname*{arg\,min}_{x_{i}\in{\cal X}_{i}}F_{\widehat{\beta}_{i}}(x_{i},\widehat{x}_{-i}) (17)

for all i[n]i\in[n], where

Fβ^i(xi,x^i):=𝔼ziDβ^i(xi,x^i)fi(xi,x^i,zi).\displaystyle F_{\widehat{\beta}_{i}}(x_{i},\widehat{x}_{-i}):=\underset{z_{i}\sim D_{\widehat{\beta}_{i}}(x_{i},\widehat{x}_{-i})}{{\mathbb{E}}}f_{i}(x_{i},\widehat{x}_{-i},z_{i})\,. (18)

Hereafter, we denote the Nash equilibrium of the approximate game as \widehat{x} to distinguish it from the ground truth x^{*}. In Algorithm 1, we write the set of Nash equilibria for the operator G_{\widehat{\beta}} with domain {\cal X} as \texttt{Nash}(G_{\widehat{\beta}},{\cal X}). In practice, we will impose assumptions that guarantee uniqueness of this assignment, in which case the set inclusion is simply an equality.
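To make the phases above concrete, the following is a minimal Python sketch of the sampling and learning phases of Algorithm 1, assuming (purely for illustration) one scalar decision per player, a location-scale hypothesis class fitted by ordinary least squares, and a synthetic stand-in for the unknown system response; none of these choices are prescribed by the paper.

import numpy as np

rng = np.random.default_rng(0)
n_agents = 3   # number of players; each has a scalar decision here
m = 500        # number of exploration rounds

# Hypothetical ground-truth system z_i = <b_i, x> + noise, unknown to the agents.
B_true = rng.normal(scale=0.1, size=(n_agents, n_agents))

def true_response(x):
    """Draw one response vector z = (z_1, ..., z_n) from the unknown system D(x)."""
    return B_true @ x + rng.normal(scale=0.05, size=n_agents)

# Sampling phase: deploy i.i.d. exploration decisions and record the feedback.
X = rng.uniform(0.0, 1.0, size=(m, n_agents))   # row j is the deployed x^(j)
Z = np.array([true_response(x) for x in X])     # row j collects z_1^(j), ..., z_n^(j)

# Learning phase: each agent i fits its own row of B by least squares, i.e., the
# empirical risk minimization (16) with the squared-error risk of Proposition 4.
B_hat = np.zeros((n_agents, n_agents))
for i in range(n_agents):
    B_hat[i], *_ = np.linalg.lstsq(X, Z[:, i], rcond=None)

print("estimation error ||B_hat - B_true||:", np.linalg.norm(B_hat - B_true))
# Optimization phase: gradient play is then run on the game induced by the learned
# maps D_{B_hat} (Section 4), without any further feedback from the system.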

By solving (17) instead of (1) we have introduced two errors: (i) the approximation error of the distributional map DiD_{i} by elements of the hypothesis class i{\cal H}_{{\cal B}_{i}}, and (ii) the estimation or statistical error by solving the ERM problem instead of the expected risk minimization problem. In [lin2023plug], the main result demonstrates that these two sources of error propagate through the optimization problem, and that the resulting excess risk can be bounded in terms of the sample complexity mm. Our goal is to expand this result and provide additional analysis to our setting.

3.1 Parameter Estimation for Regular Problems

A critical component of our analysis is the estimation or learning of the distributional map and the subsequent characterization of the estimation error. In this section, we outline a class of expected risk minimization problems, which we call regular problems, for which we can characterize the distance between expected risk minimization solutions and empirical risk minimization solutions. Throughout, we write Ri(βi)=𝔼(x,z)[R(x,zi,βi)]R_{i}(\beta_{i})={\mathbb{E}}_{(x,z)}[R(x,z_{i},\beta_{i})] and R^i(βi)=(1/m)j=1mRi(x(j),zi(j),βi)\widehat{R}_{i}(\beta_{i})=(1/m)\sum_{j=1}^{m}R_{i}(x^{(j)},z_{i}^{(j)},\beta_{i}) for βii\beta_{i}\in\mathbb{R}^{\ell_{i}} to denote the expected and empirical risk, respectively.

Definition 1.

(Map Learning Regularity) A map learning problem, consisting of the optimization problems with costs RiR_{i} and R^i\widehat{R}_{i} over i{\cal B}_{i}, is regular provided that:

  • (a)

    Convexity: The expected risk βiRi(βi)\beta_{i}\mapsto R_{i}(\beta_{i}) is μi\mu_{i}-strongly convex, and the empirical risk βiR^i(βi)\beta_{i}\mapsto\widehat{R}_{i}(\beta_{i}) is convex.

  • (b)

    Smoothness: For all realizations of x𝒳x\in{\cal X} and zikiz_{i}\in\mathbb{R}^{k_{i}}, βiβiRi(x,zi,βi)\beta_{i}\mapsto\nabla_{\beta_{i}}R_{i}(x,z_{i},\beta_{i}) is LβiL_{\beta_{i}}-Lipschitz continuous.

  • (c)

    Boundedness: The set ii{\cal B}_{i}\subseteq\mathbb{R}^{\ell_{i}} is convex and compact.

  • (d)

    Sub-Exponential gradient: For all βii\beta_{i}\in{\cal B}_{i}, βiRi(x,zi,βi)\nabla_{\beta_{i}}R_{i}(x,z_{i},\beta_{i}) is a sub-exponential vector with parameter θi>0\theta_{i}>0. \Box

Items (a) and (c), taken together, guarantee existence of β^\widehat{\beta} and uniqueness of β\beta^{*} as defined in (16) and (15), respectively. Furthermore, the inclusion of item (b) is necessary to guarantee that first-order stochastic gradient methods will converge at least sub-linearly to β^\widehat{\beta}. Lastly, the heavy-tail assumption [vershynin2018high] will allow us to describe the concentration of the gradient estimates. Together, they allow us to relate the solutions to the sample complexity in the following lemma.

Lemma 2.

(Uniform Gradient Bound) If the smoothness and sub-exponential gradient assumptions in Definition 1 hold for player i[n]i\in[n], then for any δ(0,1/2)\delta\in(0,1/2) and any mm such that m/log(m)2(i+log(1/δ))m/\log(m)\geq 2(\ell_{i}+\log(1/\delta)), we have that:

supβR^i(β)Ri(β)Cilog(m)(i+log(1/δ))m\sup_{\beta\in{\cal B}}\|\nabla\widehat{R}_{i}(\beta)-\nabla R_{i}(\beta)\|\leq C_{i}\sqrt{\frac{\log(m)(\ell_{i}+\log(1/\delta))}{m}} (19)

with probability at least 1δ1-\delta, where Ci=4max{Lβi/15ri,θi}C_{i}=4\max\{L_{\beta_{i}}/15r_{i},\theta_{i}\}. \Box

The proof of this result is provided in Appendix 6.1. This result offers a broad generalization of [mou2019diffusion, Equation (19b)] to any risk with Lipschitz-continuous sub-exponential gradients over any convex and compact set. Our result is comparable to the \mathcal{O}(\sqrt{\ell_{i}/m}) rate that can be found for specific problem instances such as linear least squares regression and logistic regression, but with the addition of a \sqrt{\log m} factor. Indeed, the generality of the risk function requires that we enforce compactness of the domain, thus giving rise to this extra logarithmic factor. This gradient estimation result will now allow us to reach our desired bounded distance result, which we present in the following theorem.

Theorem 3.

(ERM Approximation) If the map learning problem is regular for player i[n]i\in[n] (i.e., it satisfies the assumptions in Definition 1), then for any δ(0,1/2)\delta\in(0,1/2) and any mm such that m/log(m)2(i+log(1/δ))m/\log(m)\geq 2(\ell_{i}+\log(1/\delta)) we have that:

β^iβiCilog(m)(i+log(1/δ))m\|\widehat{\beta}_{i}-\beta_{i}^{*}\|\leq C_{i}^{\prime}\sqrt{\frac{\log(m)(\ell_{i}+\log(1/\delta))}{m}} (20)

with probability at least 1δ1-\delta, where Ci=(4/μi)max{Lβi/15ri,θi}C_{i}^{\prime}=(4/\mu_{i})\max\{L_{\beta_{i}}/15r_{i},\theta_{i}\}. \Box

The proof of Theorem 3 is provided in Appendix 6.2. The power in this characterization lies in the fact that it holds for any statistical learning problem satisfying the assumptions listed in Definition 1, and is not specific to the setting of learning distributional maps. We note that our Definition 1, which is the property used in Theorem 3, is different from the one in [lin2023plug] and involves conditions that are easier to check.
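For a sense of scale (illustrative numbers, using natural logarithms): with \ell_{i}=5, \delta=0.05, and m=10^{4}, the sample-size condition holds since m/\log(m)\approx 1086\geq 2(\ell_{i}+\log(1/\delta))\approx 16, and the right-hand side of (20) scales with \sqrt{\log(m)(\ell_{i}+\log(1/\delta))/m}\approx\sqrt{9.2\cdot 8.0/10^{4}}\approx 0.086.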

As an example, we provide conditions for which a linear least squares problem satisfies the regularity conditions and hence is subject to the above ERM approximation result.

Proposition 4.

(Linear Least Squares Regularity) Consider the linear least squares problem with the expected risk minimization problem

BiargminBi12𝔼(x,z)Bxz2,B_{i}^{*}\in\operatorname*{arg\,min}_{B\in{\cal B}_{i}}\frac{1}{2}{\mathbb{E}}_{(x,z)}\|Bx-z\|^{2},

and empirical risk minimization problem

B^iargminBi12mj=1mBxi(j)zi(j)2.\widehat{B}_{i}\in\operatorname*{arg\,min}_{B\in{\cal B}_{i}}\frac{1}{2m}\sum_{j=1}^{m}\|Bx_{i}^{(j)}-{z_{i}^{(j)}}\|^{2}.

Let x\sim D_{x} with zero mean and covariance matrix \Sigma_{i}. If
(i) there exist \gamma_{i},L_{i}>0 such that \gamma_{i}I\preceq\Sigma_{i}\preceq L_{i}I,
(ii) the entries of xx^{T} and z_{i}x^{T} are sub-exponential, and
(iii) the constraint set {\cal B}_{i} is convex and compact,
then the map learning problem is regular. \Box

The proof of Proposition 4 is provided in Appendix 6.3. Deriving conditions for the more general case of non-linear regression is attainable but outside the scope of this work.

3.2 Bounding the Approximation Error

Finding a relationship between x^\widehat{x} and xx^{*} will require that we first characterize an appropriate hypothesis class of distributions for learning. Here, we formalize the notion of misspecification and sensitivity for a hypothesis class i{\cal H}_{{\cal B}_{i}}.

Definition 2.

(Misspecification [lin2023plug]) A hypothesis class {\cal H}_{{\cal B}_{i}} is \eta_{i}-misspecified provided that there exists \eta_{i}>0 such that

W_{1}(D_{\beta_{i}^{*}}(x),D_{i}(x))\leq\eta_{i} (21)

for all x𝒳x\in{\cal X}. \Box

Observe that if \eta_{i}=0, then our approximation error is zero and we incur no bias in our problem formulation. This corresponds to the case where the map D_{i} is realizable [shalev2014understanding]. However, small values of \eta_{i} are still reasonable, as D_{i} may be complex enough that approximating it with a finite-dimensional parameter space is difficult.

Definition 3.

(Sensitivity [lin2023plug]) The hypothesis class i{\cal H}_{{\cal B}_{i}} is εi\varepsilon_{i}-sensitive if, for any βi,βii\beta_{i},\beta_{i}^{\prime}\in{\cal B}_{i},

W1(Dβi(x),Dβi(x))εiβiβiW_{1}(D_{\beta_{i}}(x),D_{\beta_{i}^{\prime}}(x))\leq\varepsilon_{i}\|\beta_{i}-\beta_{i}^{\prime}\| (22)

for all x𝒳x\in{\cal X}. \Box

Sensitivity of {\cal H}_{{\cal B}_{i}} is merely a convenient name for the condition that \beta_{i}\mapsto D_{\beta_{i}}(x) be \varepsilon_{i}-Lipschitz continuous for all realizations of x\in{\cal X}. In the result that follows, we demonstrate that an appropriately misspecified and sensitive hypothesis class induces a cost that has bounded distance to the ground-truth cost in (1).

Theorem 5.

(Bounded Approximation) Suppose that the following conditions hold for all i[n]i\in[n]:
(i) The hypothesis class i{\cal H}_{{\cal B}_{i}} is ηi\eta_{i}-misspecified, and εi\varepsilon_{i}-sensitive.
(ii) The map learning problem is regular.
(iii) For all x\in{\cal X}, z_{i}\mapsto f_{i}(x,z_{i}) is L_{z_{i}}-Lipschitz continuous.
Then, the bound

|Fβ^i(x)Fi(x)|ηiLzi+LziεiCiζi(m,δ),|F_{\widehat{\beta}_{i}}(x)-F_{i}(x)|\leq\eta_{i}L_{z_{i}}+L_{z_{i}}\varepsilon_{i}C_{i}^{\prime}\zeta_{i}(m,\delta), (23)

holds with probability 1δ1-\delta for any x𝒳x\in{\cal X}, where

ζi(m,δ):=log(m)(i+log(1/δ))m.\zeta_{i}(m,\delta):=\sqrt{\frac{\log(m)(\ell_{i}+\log(1/\delta))}{m}}\,. (24)

and where CiC_{i}^{\prime} is as in (20)\eqref{eqn:beta_bound}. \Box

The proof of Theorem 5 is provided in Appendix 6.4. By including an additional assumption on the Lipschitz continuity of the cost f_{i} with respect to the other agents’ decisions x_{-i}, we can provide an analog of the excess risk result in [lin2023plug].

Corollary 6.

Suppose that for all i[n]i\in[n], the function xiFβi(xi,xi)x_{-i}\mapsto F_{\beta_{i}}(x_{i},x_{-i}) is LiβL^{\beta}_{-i}-Lipschitz on 𝒳i{\cal X}_{-i} for any xi𝒳ix_{i}\in{\cal X}_{i}. Then,

|F_{i}(\widehat{x})-F_{i}(x^{*})|\leq 2\eta_{i}L_{z_{i}}+2L_{z_{i}}\varepsilon_{i}C_{i}^{\prime}\zeta_{i}(m,\delta)+2\sqrt{2}\bar{L}_{i}\mathsf{diam}({\cal X}) (25)

holds with probability 1-\delta for any \widehat{x}\in\texttt{Nash}(G_{\widehat{\beta}},{\cal X}) and x^{*}\in\texttt{Nash}(G,{\cal X}), where \bar{L}_{i}:=\max\{L_{i}^{\beta^{*}},L_{-i}^{\beta^{*}},L_{i}^{\widehat{\beta}},L_{-i}^{\widehat{\beta}}\} and C_{i}^{\prime} is as in (20). \Box

The analysis in this section demonstrates that the estimation procedure in Algorithm 1 yields a cost function that approximates the original cost in (1) with an error that decreases as the number of samples increases. Furthermore, this bound holds independently of the conditioning of the Nash equilibrium problem we solve in the optimization phase. We note that (23) is similar to the result in [lin2023plug], but it is based on a different definition of regular problem (see Definition 1); the bound (25) is unique to this paper.

In the section that follows, we examine a family of hypothesis classes that allows the approximated game to be monotone, and provide suitable algorithms for solving them with convergence guarantees.

4 Solving Strongly-monotone Decision-dependent Games

Since the agents lack full knowledge of the system and hence the ground truth distributional map DiD_{i} in (1), we cannot hope to enforce that DiD_{i} satisfy any assumptions to encourage tractability of our optimization problem. We can however impose conditions on the hypothesis class i{\cal H}_{{\cal B}_{i}}, which is chosen by the agents. To successfully find a Nash equilibrium of the approximate problem in (17), it will be crucial that agents choose a class that balances expressiveness of the system (thereby making ηi\eta_{i} small) with tractability of the optimization.

Perhaps the simplest model capable of achieving this goal is the location-scale family [miller2021outside, narang2022learning, wood2023stochastic]. In our setting, a location scale family parameterization for agent ii is a distributional map DBiD_{B_{i}} having matrix parameter Biki×dB_{i}\in\mathbb{R}^{k_{i}\times d} where ziDBiz_{i}\sim D_{B_{i}} if and only if

zi=𝑑ξi+Bixz_{i}\overset{d}{=}\xi_{i}+B_{i}x (26)

for stationary random variable ξiDξi\xi_{i}\sim D_{\xi_{i}}. We note that this parameterization can be written alternatively as zi=𝑑ξi+Biixi+Biixiz_{i}\overset{d}{=}\xi_{i}+B_{i}^{i}x_{i}+B_{-i}^{i}x_{-i}, where Biiki×diB_{i}^{i}\in\mathbb{R}^{k_{i}\times d_{i}} and Biiki×(ddi)B_{-i}^{i}\in\mathbb{R}^{k_{i}\times(d-d_{i})} are block matrices such that Bix=Biixi+BiixiB_{i}x=B_{i}^{i}x_{i}+B_{-i}^{i}x_{-i} due to linearity. The resulting partial gradient has the form

iFi(x)=𝔼ziD(x)[ifi(x,zi)+(Bii)Tzifi(x,zi)],\nabla_{i}F_{i}(x)={\mathbb{E}}_{z_{i}\sim D(x)}\left[\nabla_{i}f_{i}(x,z_{i})+(B_{i}^{i})^{T}\nabla_{z_{i}}f_{i}(x,z_{i})\right],

which is typically much simpler to analyze than alternative models. Intuitively, this model allows us to express z_{i} as the sum of a stationary random variable drawn from a base distribution and a linear term depending on x, where the matrix parameter B_{i} weights the responsiveness of the population to the agents’ decisions.

This model is particularly appealing as guarantees for learning B_{i} are known and established in Proposition 4. Moreover, the matter of expressiveness is due to the fact that location-scale families are a particular instance of strategic regression [perdomo2020performative, lin2023plug], in which members of the population interact with agents by modifying their stationary data (such as features in a learning task) \xi_{i} in an optimal way upon observing x:

zi=𝑑argminy[uβi(x,y)+12yξi2],z_{i}\overset{d}{=}\operatorname*{arg\,min}_{y}\left[-u_{\beta_{i}}(x,y)+\frac{1}{2}\|y-\xi_{i}\|^{2}\right],

where u_{\beta_{i}} is a utility function parameterized by \beta_{i}\in{\cal B}_{i}, corresponding to the utility that members of the population derive from changing their data in response to the decisions in x, and the quadratic term \frac{1}{2}\|y-\xi_{i}\|^{2} is the cost of changing their data from \xi_{i} to y. Indeed, when u_{\beta_{i}}(x,y)=\langle y,B_{i}x\rangle for \beta_{i}=B_{i}\in\mathbb{R}^{k_{i}\times d}, we recover the form above.
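To verify this, note that the objective in the display above is strongly convex in y with first-order condition -B_{i}x+(y-\xi_{i})=0, whose unique solution is y=\xi_{i}+B_{i}x; that is, z_{i}\overset{d}{=}\xi_{i}+B_{i}x as in (26).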

Furthermore, location-scale families immediately satisfy several of the assumptions required for further analysis. In particular, it is known that Sensitivity (Definition 3) holds with \varepsilon_{i}=\max_{x\in{\cal X}}\|x\|^{2}, Lipschitz continuity of x\mapsto D_{B_{i}} holds with \gamma_{i}=\|B_{i}\|^{2}, and Lipschitz continuity of G_{B_{i}} holds due to the following result.

Lemma 7.

(Lipschitz Gradient, [narang2022learning]) Suppose that DβiD_{\beta_{i}} is such that z=𝑑Bix+ξiz\overset{d}{=}B_{i}x+\xi_{i} with βi=Bi\beta_{i}=B_{i}, and that for each i[n]i\in[n] there exists ζi0\zeta_{i}\geq 0 such that (x,zi)i,zifi(x,zi)(x,z_{i})\mapsto\nabla_{i,z_{i}}f_{i}(x,z_{i}) is ζi\zeta_{i}-Lipschitz continuous. Then GβiG_{\beta_{i}} is LL-Lipschitz continuous with

L:=i=1nζi2max{1,Bii2}(1+Bi2).L:=\sqrt{\sum_{i=1}^{n}\zeta_{i}^{2}\max\{1,\|B_{i}^{i}\|^{2}\}(1+\|B_{i}\|^{2})}\,. (27)

\Box

Strong monotonicity will follow from Theorem 1 provided that G_{\beta_{i}} satisfies the remaining hypotheses on f_{\beta_{i}}—which must be verified on a case-by-case basis. We will not require that G_{\beta_{i}} use this parameterization in our analysis; however, we can proceed with the knowledge that a model class satisfying our hypotheses does exist.

4.1 Distributed Gradient-based Method

In our optimization phase, we seek to use a gradient-based algorithm that respects the agents’ communication structure with the system. For the sake of readability, we will suppress the \beta_{i} subscript and instead refer to quantities G_{i}, keeping in mind that they correspond to the approximate Nash equilibrium problem in (17) with solution \widehat{x}.

We will assume that each agent has access to an estimator of the gradient iFi\nabla_{i}F_{i} and is capable of projecting onto their decision set 𝒳i{\cal X}_{i}. In the constant step-size setup, each agent chooses a rate ωi>0\omega_{i}>0 and performs the update

xit+1=𝗉𝗋𝗈𝗃𝒳i(xitωi1git),x_{i}^{t+1}=\mathsf{proj}_{{\cal X}_{i}}\left(x_{i}^{t}-\omega_{i}^{-1}g_{i}^{t}\right),

where g_{i}^{t} is a stochastic gradient estimator for \nabla_{i}F_{i} used at iteration t, which is then reported to the system and made available to all agents. For the sake of analysis, we will assume without loss of generality that the step-sizes satisfy the ordering

ω1ω2ωn\omega_{1}\geq\omega_{2}\geq~{}\ldots~{}\geq\omega_{n}

and hence ω1=maxi[n]ωi\omega_{1}=\max_{i\in[n]}\omega_{i} and ωn=mini[n]ωi\omega_{n}=\min_{i\in[n]}\omega_{i}. The collective update can be written compactly as

xt+1=𝗉𝗋𝗈𝗃𝒳,W(xtW1gt),x^{t+1}=\mathsf{proj}_{{\cal X},W}\left(x^{t}-W^{-1}g^{t}\right), (28)

where W=𝖽𝗂𝖺𝗀(ω1𝟙d1,,ωn𝟙dn)W=\mathsf{diag}(\omega_{1}{\bf\mathds{1}}_{d_{1}},\ldots,\omega_{n}{\bf\mathds{1}}_{d_{n}}) and gtg^{t} is an estimator for G(xt)G(x^{t}) at iteration tt. Convergence of this procedure hinges on the following assumptions.

Assumption 1.

The gradient function G:𝒳ddG:{\cal X}\subseteq\mathbb{R}^{d}\rightarrow\mathbb{R}^{d} is α\alpha-strongly monotone and LL-Lipschitz continuous.

Assumption 2.

(Stochastic Framework) Let 𝔽=(t)t0{\mathbb{F}}=({\cal F}_{t})_{t\geq 0} with elements

t=σ(gτ,τt){\cal F}_{t}=\sigma(g^{\tau},\tau\leq t) (29)

be the natural filtration of the Borel σ\sigma-algebra over d\mathbb{R}^{d} with respect to gtg^{t}, and use the short-hand notation 𝔼t:=𝔼zD(xt)[|t]{\mathbb{E}}_{t}\cdot:={\mathbb{E}}_{z\sim D(x^{t})}[\cdot|{\cal F}_{t}] as the conditional expectation over the product distribution D(xt)=i=1nDi(xt)D(x^{t})=\prod_{i=1}^{n}D_{i}(x^{t}). There exist bounded sequences {ρt}t0,{σt}t0+\{\rho^{t}\}_{t\geq 0},\{\sigma^{t}\}_{t\geq 0}\subseteq\mathbb{R}_{+} such that

(Bias) 𝔼tgtG(xt)ρt\displaystyle\qquad\qquad\|{\mathbb{E}}_{t}g^{t}-G(x^{t})\|\leq\rho^{t}
(Variance) 𝔼tgt𝔼tgt2(σt)2\displaystyle\qquad\qquad{\mathbb{E}}_{t}\|g^{t}-{\mathbb{E}}_{t}g^{t}\|^{2}\leq(\sigma^{t})^{2}

where \rho^{t}\leq\rho and \sigma^{t}\leq\sigma for all t\geq 0.

Assumption 1 is standard for guaranteeing convergence of gradient play [facchinei2003finite], and the uniformly bounded variance component of Assumption 2 is standard for the convergence of stochastic algorithms. As we will show shortly, convergence with bias is possible, and the result reduces to the unbiased case when \rho^{t}=0. The next result quantifies the one-step improvement of the iteration in (28).

Lemma 8.

(One-step Improvement) Let Assumptions 1 and 2 hold. Then, the sequence generated by iteration (28) satisfies:

𝔼txt+1x^W2\displaystyle{\mathbb{E}}_{t}\|x^{t+1}-\widehat{x}\|_{W}^{2}\leq ω1ωn+αxtx^W2\displaystyle\frac{\omega_{1}}{\omega_{n}+\alpha}\|x^{t}-\widehat{x}\|_{W}^{2}
+2ω1(ω1ρ2+ασ2)αωn(ωn+α)\displaystyle+\frac{2\omega_{1}\left(\omega_{1}\rho^{2}+\alpha\sigma^{2}\right)}{\alpha\omega_{n}\left(\omega_{n}+\alpha\right)}

for all t0t\geq 0, provided that ω1/ωn2α/(4L2)\omega_{1}/\omega_{n}^{2}\leq\alpha/(4L^{2}). \Box

The proof of Lemma 8 is provided in the Appendix 6.5. We note that setting ωi=ω\omega_{i}=\omega for some ω>0\omega>0 recovers the result in [narang2022learning, Theorem 15]. Following this one-step analysis, we can show convergence to a neighborhood of the Nash equilibrium.

Theorem 9.

(Neighborhood Convergence) Let Assumptions 1 and 2 hold, and suppose that (ω1ωn)<α(\omega_{1}-\omega_{n})<\alpha. Then,

lim supt𝔼xtx^22ω1(ω1ρ2+ασ2)αωn(ωn+α).\limsup_{t\to\infty}{\mathbb{E}}\|x^{t}-\widehat{x}\|^{2}\leq\frac{2\omega_{1}\left(\omega_{1}\rho^{2}+\alpha\sigma^{2}\right)}{\alpha\omega_{n}\left(\omega_{n}+\alpha\right)}. (30)

The proof can be found in Appendix 6.6. The result shows that the algorithm converges linearly to a neighborhood of the Nash equilibrium x^\widehat{x}, where the radius of the neighborhood is dictated by the step-size, variance, and bias bounds. When ρ=σ=0\rho=\sigma=0, we retrieve linear convergence. In order to converge to x^\widehat{x} directly, we will require a decaying step-size policy. For example, we consider the following policy:

ωt=α(r+t2)2\omega^{t}=\frac{\alpha(r+t-2)}{2} (31)

for a fixed constant r>2, which we assume to be shared by all agents. Hence, the decaying step-size update is given by

xt+1=𝗉𝗋𝗈𝗃𝒳(xt(ωt)1gt).x^{t+1}=\mathsf{proj}_{{\cal X}}\left(x^{t}-(\omega^{t})^{-1}g^{t}\right). (32)

In the theorem that follows, we show that this sequence converges to x^\widehat{x} provided that the bias shares an asymptotic rate with (ωt)1(\omega^{t})^{-1}.

Theorem 10.

(Convergence) Suppose that Assumptions 1 and 2 hold and that there exist \bar{\rho},s>0 such that

\|{\mathbb{E}}_{t}g^{t}-G(x^{t})\|\leq\frac{\bar{\rho}}{s+t} (33)

for all t\geq 0. Then,

{\mathbb{E}}\|x^{t}-\widehat{x}\|^{2}\leq\frac{A}{\alpha^{2}(r+t)} (34)

where

A=\max\left\{\alpha^{2}r\|x^{0}-\widehat{x}\|^{2},4\bar{\rho}^{2}\max\left\{\frac{r}{s},1\right\}+\frac{8r\sigma^{2}}{r-2}\right\}.

The proof of this result follows by a standard induction argument, and can be found in Appendix 6.7.
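To illustrate the optimization phase, the following is a minimal Python sketch of the update (28) with the decaying policy (31)–(32), applied to a hypothetical strongly monotone quadratic game over box constraints; the gradient map, constants, and noise model below are illustrative stand-ins rather than the EV market of Section 5.

import numpy as np

rng = np.random.default_rng(1)
n = 4                                   # players, one scalar decision each
A = 2.0 * np.eye(n) + 0.1 * rng.standard_normal((n, n))

def G(x):
    # Affine gradient map of a quadratic game; strongly monotone since A + A^T > 0 here.
    return A @ x + 1.0

lo, hi = np.zeros(n), np.ones(n)        # box constraints X_i = [0, 1]
alpha = 1.0                             # conservative strong-monotonicity constant
r = 3.0                                 # policy parameter, r > 2
sigma = 0.05                            # std of the stochastic gradient noise

x = hi.copy()                           # initial decision profile
for t in range(2000):
    omega_t = alpha * (r + t - 2.0) / 2.0         # policy (31); omega_t is an inverse step size
    g = G(x) + sigma * rng.standard_normal(n)     # unbiased stochastic gradient g^t
    # For a diagonal weight matrix and box constraints, the weighted projection (3)
    # reduces to a coordinate-wise clip, so the update (32) decouples across agents.
    x = np.clip(x - g / omega_t, lo, hi)

print("approximate equilibrium:", x)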

5 Numerical Experiments on Electric Vehicle Charging

In this section, we consider a competitive game between nn distinct electric vehicle charging station operators, where stations are equipped with renewable power sources. The goal of each player is to set prices to maximize their own profit in a system where demand for their station will change in response to the prices set by other competing stations as well. The cost function (negative profit) takes the form

f_{i}(x,z_{i})=\underbrace{-z_{i}x_{i}+\frac{\lambda_{i}}{2}x_{i}^{2}}_{\text{service profit}}-\underbrace{p_{w}\phi(w_{i}-z_{i})}_{\text{renewable profit}}+\underbrace{p_{r}\phi(z_{i}-w_{i})}_{\text{operational cost}}

where ϕ(y)=log(1+exp(y))\phi(y)=\log(1+\exp(y)) for all yy\in\mathbb{R}. The renewable profit and operational cost terms allow us to describe the trade-off between profit from renewable power generation sold to the grid at rate pwp_{w}, and surplus power required from the grid to meet demand at rate prp_{r}. To set prices, we can formulate a Nash equilibrium problem over the expected costs Fi(x)=𝔼ziDi(x)[fi(x,zi)]F_{i}(x)={\mathbb{E}}_{z_{i}\sim D_{i}(x)}[f_{i}(x,z_{i})] for i[n]i\in[n] and x𝒳=Πi=1n𝒳ix\in{\cal X}=\Pi_{i=1}^{n}{\cal X}_{i}, where 𝒳i=[pw,pr]{\cal X}_{i}=[p_{w},p_{r}] is the interval of price values between the wholesale and retail price.

Since the set of reasonable prices will be quite small, we hypothesize that the price and demand have a linear relationship of the form z_{i}\overset{d}{=}\xi_{i}+\langle b_{i},x\rangle, where b_{i}\in\mathbb{R}^{n}, with \xi_{i}\sim D_{\xi_{i}} corresponding to the base demand. Since we have a simple model, the first and second derivatives can be computed in closed form, and the relevant constants can be computed directly. Indeed, we find that the hypotheses of Theorem 1 are satisfied with \lambda=\min_{i}\lambda_{i}, which we set to 1, L_{i}=1, and \gamma_{i}=\|b_{i}\|^{2}. We conclude that G:\mathbb{R}^{n}\rightarrow\mathbb{R}^{n} is \alpha=(1-2\|B\|_{F})-strongly monotone, where B is the parameter matrix whose columns are b_{i}.
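The following is a minimal sketch (with hypothetical values for p_w, p_r, \lambda_{i}, and w_{i}) of the cost f_{i} above together with the corresponding location-scale partial gradient from Section 4, \nabla_{i}F_{i}(x)={\mathbb{E}}[\partial f_{i}/\partial x_{i}+b_{ii}\,\partial f_{i}/\partial z_{i}], estimated by Monte Carlo over samples of the base demand \xi_{i}.

import numpy as np

p_w, p_r = 0.05, 0.35      # wholesale / retail prices ($/kWh); illustrative values
lam = 1.0                  # regularization weight lambda_i

def phi(y):
    return np.log1p(np.exp(y))          # softplus

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))     # derivative of phi

def f_i(x_i, z_i, w_i):
    """Negative profit of provider i at price x_i, demand z_i, renewable generation w_i."""
    return (-z_i * x_i + 0.5 * lam * x_i ** 2
            - p_w * phi(w_i - z_i) + p_r * phi(z_i - w_i))

def grad_i_F(i, x, b_i, xi_samples, w_i):
    """Monte Carlo estimate of grad_{x_i} F_i(x) under the model z_i = xi_i + <b_i, x>."""
    z = xi_samples + b_i @ x                      # demand samples at the current prices
    df_dxi = -z + lam * x[i]                      # partial of f_i w.r.t. x_i
    df_dzi = -x[i] + p_w * sigmoid(w_i - z) + p_r * sigmoid(z - w_i)  # partial w.r.t. z_i
    return np.mean(df_dxi + b_i[i] * df_dzi)      # chain rule through z_i = xi_i + <b_i, x>

# Example usage with hypothetical numbers:
rng = np.random.default_rng(2)
x = np.array([0.20, 0.25, 0.30])                  # current prices of three providers
b_1 = np.array([-1.0 / 18, 1.0 / 18, 1.0 / 18])   # demand sensitivities of provider 1
xi = rng.normal(size=1000)                        # standardized base-demand samples
print("cost estimate:", np.mean(f_i(x[0], xi + b_1 @ x, 0.5)))
print("gradient estimate:", grad_i_F(0, x, b_1, xi, 0.5))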

Figure 2: Standardized demand data for six medium-demand EVCSs consisting of either 2 or 6 ports and port power ratings of 50, 150, and 350 kW. Standardization maps raw demand instances to deviations from the average demand at each station.
Figure 3: Expected error curve and confidence interval for regularized stochastic gradient descent with decaying step size for a location-scale model.

Our data depict the demand of electricity across an hour-long period for 6 ports of varying power profiles for each day in the year. We standardize the data to be zero mean and unit variance across each station. Solutions are calculated by performing expected gradient play with constant step size; the expectation is estimated via the empirical mean over the dataset.

We set b_{ii}=-1/18+\nu and b_{ij}=1/18+\nu, where we use \nu\sim{\cal N}(0,10^{-5}) to simulate learning B from samples. Hence, demand for agent i decreases as their own price increases, and increases as the prices of the other agents increase. We run the stochastic gradient play algorithm initialized at x^{0}=p_{r}{\bf\mathds{1}}_{n} with a single sample at each round and a decaying step-size policy \omega_{t}=\alpha(r+t-2)/2 for r=3. In Figure 3 we plot the mean error trajectory and confidence interval over 50 trials of 2000 iterations.

6 CONCLUSIONS

In this work, we studied a class of stochastic Nash equilibrium problems characterized by data distributions that are dependent on the decisions of all involved players. We showed that a learning-based approach enables the formulation of an approximate Nash equilibrium problem that is solvable using a stochastic gradient play algorithm. The result of this procedure is a cost that can be related to the cost of the original Nash equilibrium problem via an error that depends on both our approximation and estimation errors. To demonstrate the flexibility of these findings, we simulated these techniques in an electric vehicle charging market problem in which service providers set prices, and users modify their demand based on prices set by providers. Future research will look at a scenario where the estimate of the distributional map is improved during the operation of the algorithm, based on the feedback received. Future applications will demonstrate the efficacy of more complex models (beyond linear).

APPENDIX

6.1 Proof of Lemma 2

For the sake of notational convenience and visual clarity, we will suppress the i index throughout the proof. We denote the gradient error by J(\beta)=\nabla\widehat{R}(\beta)-\nabla R(\beta) for all \beta\in\mathbb{R}^{\ell}.

To begin, we will generate coverings for the unit sphere in \mathbb{R}^{\ell} and for {\cal B}\subseteq\mathbb{R}^{\ell}, and use a discretization argument to create bounds over these finite sets. Fix \beta\in{\cal B} and u\in{\cal S}^{\ell-1}. Let \{u_{j}\}_{j=1}^{N} be an arbitrary 1/2-covering of the sphere {\cal S}^{\ell-1} with respect to the Euclidean norm. From [wainwright_2019, Lemma 5.7], we know that N\leq 5^{\ell}. From our covering, we have that there exists u_{j} in the covering such that \|u-u_{j}\|\leq 1/2. Hence,

u,J(β)\displaystyle\langle u,J(\beta)\rangle =uj+(uuj),J(β)\displaystyle=\langle u_{j}+(u-u_{j}),J(\beta)\rangle
=uj,J(β)+uuj,J(β)\displaystyle=\langle u_{j},J(\beta)\rangle+\langle u-u_{j},J(\beta)\rangle
uj,J(β)+uujJ(β)\displaystyle\leq\langle u_{j},J(\beta)\rangle+\|u-u_{j}\|\|J(\beta)\|
uj,J(β)+12J(β)\displaystyle\leq\langle u_{j},J(\beta)\rangle+\frac{1}{2}\|J(\beta)\|
maxj[N]uj,J(β)+12J(β).\displaystyle\leq\max_{j\in[N]}\langle u_{j},J(\beta)\rangle+\frac{1}{2}\|J(\beta)\|\,.

Since this is true for any u\in{\cal S}^{\ell-1}, it holds in particular for u=J(\beta)/\|J(\beta)\|. Thus the above becomes

\|J(\beta)\|\leq 2\max_{j\in[N]}\langle u_{j},J(\beta)\rangle. (35)

Now we fix \varepsilon\in(0,1] and choose an \varepsilon-covering of the set {\cal B}, which we will write as \{\beta_{k}\}_{k=1}^{M}. Recall that {\cal B} is bounded, so there exists a constant r>0 such that \|\beta\|\leq r for all \beta\in{\cal B}. Hence {\cal B}\subseteq B(r). From [vershynin2018high, Proposition 4.2.12], we have that

Mvol(B(r)+ε2B(1))vol(ε2B(1))=vol(32B(r))vol(ε2B(1))=(3rε).M\leq\frac{\operatorname*{vol}\left(B(r)+\frac{\varepsilon}{2}B(1)\right)}{\operatorname*{vol}\left(\frac{\varepsilon}{2}B(1)\right)}=\frac{\operatorname*{vol}\left(\frac{3}{2}B(r)\right)}{\operatorname*{vol}\left(\frac{\varepsilon}{2}B(1)\right)}=\left(\frac{3r}{\varepsilon}\right)^{\ell}. (36)

Thus, we conclude that M(3r/ε)M\leq(3r/\varepsilon)^{\ell}.

Now by our discretization argument, there exists k[M]k\in[M] such that ββkε\|\beta-\beta_{k}\|\leq\varepsilon and hence

\max_{j\in[N]}\langle u_{j},J(\beta)\rangle =\max_{j\in[N]}\langle u_{j},J(\beta_{k})+(J(\beta)-J(\beta_{k}))\rangle
=\max_{j\in[N]}\left[\langle u_{j},J(\beta_{k})\rangle+\langle u_{j},J(\beta)-J(\beta_{k})\rangle\right]
\leq\max_{j\in[N]}\langle u_{j},J(\beta_{k})\rangle+\max_{j\in[N]}\langle u_{j},J(\beta)-J(\beta_{k})\rangle
\leq\max_{k\in[M]}\max_{j\in[N]}\langle u_{j},J(\beta_{k})\rangle
+\sup_{\|\alpha-\alpha^{\prime}\|\leq\varepsilon}\max_{j\in[N]}\langle u_{j},J(\alpha)-J(\alpha^{\prime})\rangle.

We observe that if α,α\alpha,\alpha^{\prime}\in{\cal B} are such that ααε\|\alpha-\alpha^{\prime}\|\leq\varepsilon, then applying our smoothness assumption yields

\langle u_{j},J(\alpha)-J(\alpha^{\prime})\rangle
=\langle u_{j},(\nabla\widehat{R}(\alpha)-\nabla R(\alpha))-(\nabla\widehat{R}(\alpha^{\prime})-\nabla R(\alpha^{\prime}))\rangle
=\langle u_{j},\nabla\widehat{R}(\alpha)-\nabla\widehat{R}(\alpha^{\prime})\rangle+\langle u_{j},\nabla R(\alpha^{\prime})-\nabla R(\alpha)\rangle
\leq\|u_{j}\|\|\nabla\widehat{R}(\alpha)-\nabla\widehat{R}(\alpha^{\prime})\|+\|u_{j}\|\|\nabla R(\alpha)-\nabla R(\alpha^{\prime})\|
\leq L_{\beta}\|\alpha-\alpha^{\prime}\|+L_{\beta}\|\alpha-\alpha^{\prime}\|
\leq 2L_{\beta}\varepsilon,

where the second-to-last inequality uses \|u_{j}\|=1.

To bound the remaining term, we use a concentration inequality for sub-exponential random variables (Bernstein's inequality), combined with the union bound. We have that

{\mathbb{P}}\left(\langle u_{j},J(\beta_{k})\rangle\geq t\right)\leq 2\exp\left(-\frac{mt^{2}}{2\theta^{2}}\right)

for all t\leq\theta, and hence

{\mathbb{P}}\left(\max_{k\in[M]}\max_{j\in[N]}\langle u_{j},J(\beta_{k})\rangle\geq t\right)
={\mathbb{P}}\left(\bigcup_{k\in[M]}\bigcup_{j\in[N]}\{\langle u_{j},J(\beta_{k})\rangle\geq t\}\right)
\leq\sum_{k\in[M]}\sum_{j\in[N]}{\mathbb{P}}\left(\langle u_{j},J(\beta_{k})\rangle\geq t\right)
\leq\sum_{k\in[M]}\sum_{j\in[N]}2\exp\left(-\frac{mt^{2}}{2\theta^{2}}\right)
=M\cdot N\cdot 2\exp\left(-\frac{mt^{2}}{2\theta^{2}}\right)
\leq 2\left(\frac{15r}{\varepsilon}\right)^{\ell}\exp\left(-\frac{mt^{2}}{2\theta^{2}}\right)

for all t\leq\theta, where we used the facts that M\leq(3r/\varepsilon)^{\ell} and N\leq 5^{\ell}. Setting the right-hand side equal to 2\delta yields

t=\sqrt{2}\theta\sqrt{\frac{\ell\log(15r/\varepsilon)+\log(1/\delta)}{m}}. (37)

Next, we choose \varepsilon=\frac{1}{15r}\sqrt{\frac{\ell+\log(1/\delta)}{m}} so that

t=\sqrt{2}\theta\sqrt{\frac{\ell\log(15r/\varepsilon)+\log(1/\delta)}{m}}
=\sqrt{2}\theta\sqrt{\frac{\frac{\ell}{2}\log(m)-\frac{\ell}{2}\log(\ell+\log(1/\delta))+\log(1/\delta)}{m}}
\leq\sqrt{2}\theta\sqrt{\frac{\ell\log(m)+\log(1/\delta)}{m}}
\leq\sqrt{2}\theta\sqrt{\frac{\log(m)(\ell+\log(1/\delta))}{m}}.

By requiring that m satisfy m/\log(m)\geq 2(\ell+\log(1/\delta)), we enforce that t\leq\theta. Combining the above bounds, we observe that

t+2\varepsilon L
\leq\sqrt{2}\theta\sqrt{\frac{\log(m)(\ell+\log(1/\delta))}{m}}+\frac{2L}{15r}\sqrt{\frac{\ell+\log(1/\delta)}{m}}
\leq 2\left(\theta+\frac{L}{15r}\right)\sqrt{\frac{\log(m)(\ell+\log(1/\delta))}{m}}
\leq 4\max\left\{\frac{L}{15r},\theta\right\}\sqrt{\frac{\log(m)(\ell+\log(1/\delta))}{m}},

and the result follows.
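
To give a rough numerical sense of the \sqrt{\log(m)(\ell+\log(1/\delta))/m} rate (a toy Monte Carlo check only; the quadratic risk below is an assumption chosen so that the gradient error is constant in \beta, and all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
ell = 10
for m in [100, 1_000, 10_000]:
    # Toy risk R(beta) = 0.5*E||beta - z||^2 with z ~ N(0, I):
    # grad R_hat(beta) - grad R(beta) = E[z] - z_bar is constant in beta,
    # so the uniform gradient deviation equals the sample-mean error norm.
    errs = [np.linalg.norm(rng.normal(size=(m, ell)).mean(axis=0)) for _ in range(20)]
    rate = np.sqrt(ell * np.log(m) / m)
    print(f"m={m:>6d}  avg sup-gradient-error={np.mean(errs):.4f}  sqrt(ell*log(m)/m)={rate:.4f}")
```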

6.2 Proof of Theorem 3

We suppress the subscript i for notational simplicity. We recall that the \mu-strong convexity of the map \beta\mapsto R(x,z;\beta) implies \mu-strong monotonicity of \nabla R(\beta) and \nabla\widehat{R}(\beta). It follows that

\mu\|\widehat{\beta}-\beta^{*}\|^{2} \leq\langle\widehat{\beta}-\beta^{*},\nabla R(\widehat{\beta})-\nabla R(\beta^{*})\rangle
=\langle\widehat{\beta}-\beta^{*},\nabla R(\widehat{\beta})\rangle-\langle\widehat{\beta}-\beta^{*},\nabla R(\beta^{*})\rangle
\leq\langle\widehat{\beta}-\beta^{*},\nabla R(\widehat{\beta})\rangle
\leq\langle\widehat{\beta}-\beta^{*},\nabla R(\widehat{\beta})\rangle+\langle\beta^{*}-\widehat{\beta},\nabla\widehat{R}(\widehat{\beta})\rangle
=\langle\widehat{\beta}-\beta^{*},\nabla R(\widehat{\beta})-\nabla\widehat{R}(\widehat{\beta})\rangle
\leq\|\widehat{\beta}-\beta^{*}\|\sup_{\beta\in{\cal B}}\|\nabla R(\beta)-\nabla\widehat{R}(\beta)\|

where the second inequality uses the first-order optimality of \beta^{*} for R over {\cal B} (so that \langle\widehat{\beta}-\beta^{*},\nabla R(\beta^{*})\rangle\geq 0) and the third uses the first-order optimality of \widehat{\beta} for \widehat{R}; hence

\|\widehat{\beta}-\beta^{*}\|\leq\frac{1}{\mu}\sup_{\beta\in{\cal B}}\|\nabla R(\beta)-\nabla\widehat{R}(\beta)\|. (38)

The result now follows by applying Lemma 2. \Box

6.3 Proof of Proposition 4

We suppress the i index throughout. The associated risk function is R(x,z,B)=\frac{1}{2}\|Bx-z\|^{2}, so that \nabla R(x,z,B)=(Bx-z)x^{T}=Bxx^{T}-zx^{T} and \nabla^{2}R(x,z,B)=xx^{T} are the corresponding gradient and Hessian with respect to B. We observe that enforcing \gamma I\leq{\mathbb{E}}[xx^{T}]\leq LI for some \gamma,L>0 ensures \gamma-strong convexity and L-smoothness of the expected risk. Similarly, the empirical risk has gradient \nabla R_{m}(B)=\frac{1}{m}(BXX^{T}-ZX^{T}) and Hessian \nabla^{2}R_{m}(B)=\frac{1}{m}XX^{T}. Since this Hessian is symmetric and positive semi-definite, R_{m} is convex. Furthermore, smoothness of R_{m} follows with constant \max\{L,\|XX^{T}\|_{2}\}. Lastly, since zx^{T} and xx^{T} have sub-exponential entries, the gradient is sub-exponential and the result follows. \Box
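
As a concrete illustration of this least-squares structure, the following minimal Python sketch (not part of the proof; the dimensions, noise model, and all variable names are illustrative assumptions) computes the empirical risk minimizer in closed form and reads off the strong-convexity and smoothness constants from the spectrum of the sample second-moment matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
ell, d, m = 4, 3, 500            # response dim, decision dim, sample size
B_true = rng.normal(size=(ell, d))

# Columns of X are decisions x; columns of Z are responses z ~ B_true x + noise.
X = rng.normal(size=(d, m))
Z = B_true @ X + 0.1 * rng.normal(size=(ell, m))

# Empirical risk R_m(B) = (1/2m)||B X - Z||_F^2 has gradient (1/m)(B X X^T - Z X^T),
# so the minimizer solves B (X X^T) = Z X^T.
B_hat = np.linalg.solve(X @ X.T, (Z @ X.T).T).T

# Strong convexity / smoothness of R_m are governed by the spectrum of (1/m) X X^T.
eigs = np.linalg.eigvalsh(X @ X.T / m)
print("estimation error ||B_hat - B_true||:", np.linalg.norm(B_hat - B_true))
print("min eigenvalue:", eigs.min(), " max eigenvalue:", eigs.max())
```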

6.4 Proof of Theorem 5

We observe that for any fixed x\in\mathcal{X}, we have |F_{\widehat{\beta}_{i}}(x)-F_{i}(x)|\leq|F_{\widehat{\beta}_{i}}(x)-F_{\beta_{i}^{*}}(x)|+|F_{\beta_{i}^{*}}(x)-F_{i}(x)|. The first term describes our statistical error at x. We let \Pi(D_{\widehat{\beta}_{i}}(x),D_{\beta_{i}^{*}}(x)) denote a coupling of D_{\widehat{\beta}_{i}}(x) and D_{\beta_{i}^{*}}(x) in {\cal P}(\mathbb{R}^{m_{i}}\times\mathbb{R}^{m_{i}}), so that

|F_{\widehat{\beta}_{i}}(x)-F_{\beta_{i}^{*}}(x)|
=\left|\inf_{\Pi(D_{\widehat{\beta}_{i}}(x),D_{\beta_{i}^{*}}(x))}\ {\mathbb{E}}_{(z,z^{\prime})\sim\Pi(D_{\widehat{\beta}_{i}}(x),D_{\beta_{i}^{*}}(x))}\left(f(x,z)-f(x,z^{\prime})\right)\right|
\leq\inf_{\Pi(D_{\widehat{\beta}_{i}}(x),D_{\beta_{i}^{*}}(x))}\ {\mathbb{E}}_{(z,z^{\prime})\sim\Pi(D_{\widehat{\beta}_{i}}(x),D_{\beta_{i}^{*}}(x))}\left|f(x,z)-f(x,z^{\prime})\right|
\leq L_{z_{i}}\left(\inf_{\Pi(D_{\widehat{\beta}_{i}}(x),D_{\beta_{i}^{*}}(x))}\ {\mathbb{E}}_{(z,z^{\prime})\sim\Pi(D_{\widehat{\beta}_{i}}(x),D_{\beta_{i}^{*}}(x))}\|z-z^{\prime}\|\right)
=L_{z_{i}}W_{1}(D_{\widehat{\beta}_{i}}(x),D_{\beta_{i}^{*}}(x))
\leq L_{z_{i}}\varepsilon_{i}\|\widehat{\beta}_{i}-\beta_{i}^{*}\|\,.

By a similar argument, we find that |F_{\beta_{i}^{*}}(x)-F_{i}(x)|\leq L_{z_{i}}W_{1}(D_{\beta_{i}^{*}}(x),D_{i}(x))\leq L_{z_{i}}\gamma_{i}. Combining the two bounds, we get |F_{\widehat{\beta}_{i}}(x)-F_{i}(x)|\leq L_{z_{i}}\varepsilon_{i}\|\widehat{\beta}_{i}-\beta_{i}^{*}\|+L_{z_{i}}\gamma_{i}. Lastly, \|\widehat{\beta}_{i}-\beta_{i}^{*}\| can be bounded as in Theorem 3.
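
As an aside, the Lipschitz-in-parameter step above can be visualized numerically. The sketch below is purely illustrative and assumes a Gaussian location model D_{\beta}(x)=\mathcal{N}(\beta^{\top}x,1), which is an assumption and not the paper's general setting; it estimates W_{1}(D_{\widehat{\beta}}(x),D_{\beta^{*}}(x)) from samples and compares it with the Cauchy-Schwarz bound \|x\|\,\|\widehat{\beta}-\beta^{*}\|:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(2)
x = rng.normal(size=5)                       # a fixed decision vector
beta_star = rng.normal(size=5)
beta_hat = beta_star + 0.05 * rng.normal(size=5)

# Assumed model D_beta(x) = N(beta^T x, 1): for two Gaussians with equal variance,
# W_1 equals the absolute mean difference, bounded by ||x|| * ||beta_hat - beta_star||.
samples_hat = beta_hat @ x + rng.normal(size=100_000)
samples_star = beta_star @ x + rng.normal(size=100_000)

print("empirical W1      :", wasserstein_distance(samples_hat, samples_star))
print("mean gap          :", abs((beta_hat - beta_star) @ x))
print("Lipschitz bound   :", np.linalg.norm(x) * np.linalg.norm(beta_hat - beta_star))
```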

Regarding the second bound, we have that

F_{i}(\widehat{x})-F_{i}(x^{*})
=\left[F_{i}(\widehat{x})-F_{\beta_{i}^{*}}(\widehat{x})\right]+\left[F_{\beta_{i}^{*}}(\widehat{x})-F_{\widehat{\beta_{i}}}(\widehat{x})\right]
+\left[F_{\widehat{\beta_{i}}}(\widehat{x})-F_{\widehat{\beta_{i}}}(x^{**})\right]+\left[F_{\widehat{\beta_{i}}}(x^{**})-F_{\beta_{i}^{*}}(x^{**})\right]
+\left[F_{\beta_{i}^{*}}(x^{**})-F_{\beta_{i}^{*}}(x^{*})\right]+\left[F_{\beta_{i}^{*}}(x^{*})-F_{i}(x^{*})\right]
\leq 2\|F_{i}-F_{\beta_{i}^{*}}\|_{\infty}+2\|F_{\beta_{i}^{*}}-F_{\widehat{\beta_{i}}}\|_{\infty}
+\left[F_{\widehat{\beta_{i}}}(\widehat{x})-F_{\widehat{\beta_{i}}}(x^{**})\right]+\left[F_{\beta_{i}^{*}}(x^{**})-F_{\beta_{i}^{*}}(x^{*})\right]

where x^{**}\in\mathcal{X}^{**}, and \mathcal{X}^{**} is the set of equilibria of

{x}_{i}^{**}\in\operatorname*{arg\,min}_{x_{i}\in{\cal X}_{i}}F_{{\beta}^{*}_{i}}(x_{i},{x}^{**}_{-i}),\quad i\in[n], (39)

where

F_{{\beta}^{*}_{i}}(x_{i},{x}^{**}_{-i}):=\underset{z_{i}\sim D_{{\beta}^{*}_{i}}(x_{i},{x}^{**}_{-i})}{{\mathbb{E}}}f_{i}(x_{i},{x}^{**}_{-i},z_{i})\,.

Then,

F_{\widehat{\beta_{i}}}(\widehat{x})-F_{\widehat{\beta_{i}}}(x^{**}) \leq\left[F_{\widehat{\beta_{i}}}(\widehat{x}_{i},\widehat{x}_{-i})-F_{\widehat{\beta_{i}}}(x^{**}_{i},\widehat{x}_{-i})\right]
+\left[F_{\widehat{\beta_{i}}}(x^{**}_{i},\widehat{x}_{-i})-F_{\widehat{\beta_{i}}}(x^{**}_{i},x^{**}_{-i})\right]
\leq L_{i}^{\widehat{\beta}}\|\widehat{x}_{i}-x_{i}^{**}\|+L_{-i}^{\widehat{\beta}}\|\widehat{x}_{-i}-x^{**}_{-i}\|
\leq\sqrt{2}\max\{L_{i}^{\widehat{\beta}},L_{-i}^{\widehat{\beta}}\}\|\widehat{x}-x^{**}\|

where we have used the inequality \sqrt{a}+\sqrt{b}\leq\sqrt{2}\sqrt{a+b}, valid for any a,b\geq 0.

Next, consider

F_{\beta^{*}_{i}}(x^{**})-F_{\beta^{*}_{i}}(x^{*}) =\left[F_{\beta^{*}_{i}}(x_{i}^{**},x_{-i}^{**})-F_{\beta^{*}_{i}}(x_{i}^{**},x^{*}_{-i})\right]
+\left[F_{\beta^{*}_{i}}(x_{i}^{**},x^{*}_{-i})-F_{\beta^{*}_{i}}(x_{i}^{*},x_{-i}^{*})\right]
\leq L_{-i}^{\beta^{*}}\|x_{-i}^{**}-x_{-i}^{*}\|+L_{i}^{\beta^{*}}\|x_{i}^{**}-x_{i}^{*}\|
\leq\sqrt{2}\max\{L_{i}^{\beta^{*}},L_{-i}^{\beta^{*}}\}\|x^{**}-x^{*}\|\,.

Combining the bounds yields

F_{i}(\widehat{x})-F_{i}(x^{*})
\leq 2\|F_{i}-F_{\beta^{*}_{i}}\|_{\infty}+2\|F_{\beta^{*}_{i}}-F_{\widehat{\beta_{i}}}\|_{\infty}
+\sqrt{2}\max\{L_{i}^{\widehat{\beta}},L_{-i}^{\widehat{\beta}}\}\|\widehat{x}-x^{**}\|
+\sqrt{2}\max\{L_{i}^{\beta^{*}},L_{-i}^{\beta^{*}}\}\|x^{**}-x^{*}\|
\leq 2\gamma_{i}L_{z_{i}}+2\varepsilon_{i}L_{z_{i}}\|\widehat{\beta}_{i}-\beta_{i}^{*}\|
+\sqrt{2}\left(\max\{L_{i}^{\widehat{\beta}},L_{-i}^{\widehat{\beta}}\}+\max\{L_{i}^{\beta^{*}},L_{-i}^{\beta^{*}}\}\right)\mathsf{diam}(\mathcal{X})
\leq 2\gamma_{i}L_{z_{i}}+2\varepsilon_{i}L_{z_{i}}\|\widehat{\beta}_{i}-\beta_{i}^{*}\|+2\sqrt{2}\bar{L}_{i}\,\mathsf{diam}(\mathcal{X}).

Then, (25) follows using the bound on \|\widehat{\beta}_{i}-\beta_{i}^{*}\| from Theorem 3. \Box

6.5 Proof of Lemma 8

Consider the function \varphi:\mathbb{R}^{d}\rightarrow\mathbb{R} defined by \varphi(y)=\frac{1}{2}\|x^{t}-W^{-1}g^{t}-y\|_{W}^{2} for all y\in{\cal X}. Then, \varphi is \omega_{n}-strongly convex over {\cal X} and has a unique minimizer x^{t+1}\in{\cal X}. This implies that:

\varphi(x^{*})\geq\varphi(x^{t+1})+\langle x^{*}-x^{t+1},\nabla\varphi(x^{t+1})\rangle+\frac{\omega_{n}}{2}\|x^{t+1}-x^{*}\|^{2}\,.

Since \langle x-x^{t+1},\nabla\varphi(x^{t+1})\rangle\geq 0 for all x\in{\cal X}, we obtain

\omega_{n}\|x^{t+1}-x^{*}\|^{2}\leq\|x^{t}-W^{-1}g^{t}-x^{*}\|_{W}^{2}-\|x^{t}-W^{-1}g^{t}-x^{t+1}\|_{W}^{2}.

It follows that

\frac{\omega_{n}}{\omega_{1}}\|x^{t+1}-x^{*}\|_{W}^{2} \leq\|x^{t}-x^{*}\|_{W}^{2}-\|x^{t}-x^{t+1}\|_{W}^{2}
-2\langle x^{t}-x^{*},g^{t}\rangle+2\langle x^{t}-x^{t+1},g^{t}\rangle.

We now take the conditional expectation {\mathbb{E}}_{t}[\,\cdot\,]:={\mathbb{E}}_{z\sim D(x^{t})}[\,\cdot\,|\,{\cal F}_{t}], where {\cal F}_{t}=\sigma(g^{\tau},\,\tau<t) denotes the history of the gradient estimates. We find that

\frac{\omega_{n}}{\omega_{1}}{\mathbb{E}}_{t}\|x^{t+1}-x^{*}\|_{W}^{2}
\leq{\mathbb{E}}_{t}\|x^{t}-x^{*}\|_{W}^{2}-{\mathbb{E}}_{t}\|x^{t}-x^{t+1}\|_{W}^{2}
-2{\mathbb{E}}_{t}\langle x^{t}-x^{*},g^{t}\rangle-2{\mathbb{E}}_{t}\langle x^{t+1}-x^{t},g^{t}\rangle
=\|x^{t}-x^{*}\|_{W}^{2}-{\mathbb{E}}_{t}\|x^{t}-x^{t+1}\|_{W}^{2}
-2\langle x^{t}-x^{*},\mu^{t}\rangle-2{\mathbb{E}}_{t}\langle x^{t+1}-x^{t},g^{t}\rangle
=\|x^{t}-x^{*}\|_{W}^{2}-{\mathbb{E}}_{t}\|x^{t}-x^{t+1}\|_{W}^{2}
+2{\mathbb{E}}_{t}\langle x^{t}-x^{t+1},g^{t}-\mu^{t}\rangle+2{\mathbb{E}}_{t}\langle x^{*}-x^{t+1},\mu^{t}\rangle
=\|x^{t}-x^{*}\|_{W}^{2}-{\mathbb{E}}_{t}\|x^{t}-x^{t+1}\|_{W}^{2}-2{\mathbb{E}}_{t}\langle x^{t+1}-x^{*},G(x^{t+1})\rangle
+2{\mathbb{E}}_{t}\langle x^{*}-x^{t+1},\mu^{t}-G(x^{t+1})\rangle
+2{\mathbb{E}}_{t}\langle x^{t}-x^{t+1},g^{t}-\mu^{t}\rangle.

To proceed, we bound the inner product terms. Using the \alpha-strong monotonicity of G together with the equilibrium condition \langle x-x^{*},G(x^{*})\rangle\geq 0 for all x\in{\cal X}, we have that

{\mathbb{E}}_{t}\langle x^{t+1}-x^{*},G(x^{t+1})\rangle \geq\alpha{\mathbb{E}}_{t}\|x^{t+1}-x^{*}\|^{2}
\geq\frac{\alpha}{\omega_{1}}{\mathbb{E}}_{t}\|x^{t+1}-x^{*}\|_{W}^{2}.

Furthermore, we observe that

{\mathbb{E}}_{t}\langle x^{*}-x^{t+1},\mu^{t}-G(x^{t+1})\rangle
={\mathbb{E}}_{t}\langle x^{*}-x^{t+1},\mu^{t}-G(x^{t})\rangle
+{\mathbb{E}}_{t}\langle x^{*}-x^{t+1},G(x^{t})-G(x^{t+1})\rangle.

To bound the remaining terms, we use arguments based on a weighted Young's inequality. Let \Delta_{1},\Delta_{2},\Delta_{3}>0 be fixed constants. It follows that

2{\mathbb{E}}_{t}\langle x^{t}-x^{t+1},g^{t}-\mu^{t}\rangle
\leq\Delta_{1}{\mathbb{E}}_{t}\|x^{t+1}-x^{t}\|^{2}+\frac{1}{\Delta_{1}}{\mathbb{E}}_{t}\|g^{t}-\mu^{t}\|^{2}
\leq\frac{\Delta_{1}}{\omega_{n}}{\mathbb{E}}_{t}\|x^{t+1}-x^{t}\|_{W}^{2}+\frac{1}{\Delta_{1}}\sum_{i=1}^{n}{\mathbb{E}}_{t}\|g_{i}^{t}-\mu_{i}^{t}\|^{2}
\leq\frac{\Delta_{1}}{\omega_{n}}{\mathbb{E}}_{t}\|x^{t+1}-x^{t}\|_{W}^{2}+\frac{1}{\Delta_{1}}\sum_{i=1}^{n}\sigma_{i}^{2}
\leq\frac{\Delta_{1}}{\omega_{n}}{\mathbb{E}}_{t}\|x^{t+1}-x^{t}\|_{W}^{2}+\frac{\sigma^{2}}{\Delta_{1}},

and

2{\mathbb{E}}_{t}\langle x^{*}-x^{t+1},\mu^{t}-G(x^{t})\rangle
\leq\Delta_{2}{\mathbb{E}}_{t}\|x^{t+1}-x^{*}\|^{2}+\frac{1}{\Delta_{2}}{\mathbb{E}}_{t}\|\mu^{t}-G(x^{t})\|^{2}
\leq\frac{\Delta_{2}}{\omega_{n}}{\mathbb{E}}_{t}\|x^{t+1}-x^{*}\|_{W}^{2}+\frac{1}{\Delta_{2}}\sum_{i=1}^{n}{\mathbb{E}}_{t}\|\mu_{i}^{t}-G_{i}(x^{t})\|^{2}
\leq\frac{\Delta_{2}}{\omega_{n}}{\mathbb{E}}_{t}\|x^{t+1}-x^{*}\|_{W}^{2}+\frac{1}{\Delta_{2}}\sum_{i=1}^{n}\rho_{i}^{2}
\leq\frac{\Delta_{2}}{\omega_{n}}{\mathbb{E}}_{t}\|x^{t+1}-x^{*}\|_{W}^{2}+\frac{\rho^{2}}{\Delta_{2}}.

Additionally, we have that

2{\mathbb{E}}_{t}\langle x^{*}-x^{t+1},G(x^{t})-G(x^{t+1})\rangle
\leq\Delta_{3}{\mathbb{E}}_{t}\|x^{t+1}-x^{*}\|^{2}+\frac{1}{\Delta_{3}}{\mathbb{E}}_{t}\|G(x^{t})-G(x^{t+1})\|^{2}
\leq\frac{\Delta_{3}}{\omega_{n}}{\mathbb{E}}_{t}\|x^{t+1}-x^{*}\|_{W}^{2}+\frac{L^{2}}{\Delta_{3}}{\mathbb{E}}_{t}\|x^{t+1}-x^{t}\|^{2}
\leq\frac{\Delta_{3}}{\omega_{n}}{\mathbb{E}}_{t}\|x^{t+1}-x^{*}\|_{W}^{2}+\frac{L^{2}}{\omega_{n}\Delta_{3}}{\mathbb{E}}_{t}\|x^{t+1}-x^{t}\|^{2}_{W}.

Combining these estimates yields

\frac{\omega_{n}}{\omega_{1}}{\mathbb{E}}_{t}\|x^{t+1}-x^{*}\|^{2}_{W}
\leq\|x^{t}-x^{*}\|_{W}^{2}-{\mathbb{E}}_{t}\|x^{t+1}-x^{t}\|_{W}^{2}-\frac{2\alpha}{\omega_{1}}{\mathbb{E}}_{t}\|x^{t+1}-x^{*}\|_{W}^{2}
+\left(\frac{\Delta_{1}}{\omega_{n}}{\mathbb{E}}_{t}\|x^{t+1}-x^{t}\|_{W}^{2}+\frac{\sigma^{2}}{\Delta_{1}}\right)
+\left(\frac{\Delta_{2}}{\omega_{n}}{\mathbb{E}}_{t}\|x^{t+1}-x^{*}\|_{W}^{2}+\frac{\rho^{2}}{\Delta_{2}}\right)
+\left(\frac{\Delta_{3}}{\omega_{n}}{\mathbb{E}}_{t}\|x^{t+1}-x^{*}\|_{W}^{2}+\frac{L^{2}}{\omega_{n}\Delta_{3}}{\mathbb{E}}_{t}\|x^{t+1}-x^{t}\|^{2}_{W}\right)
=\|x^{t}-x^{*}\|_{W}^{2}+\left(\frac{\Delta_{1}}{\omega_{n}}+\frac{L^{2}}{\omega_{n}\Delta_{3}}-1\right){\mathbb{E}}_{t}\|x^{t+1}-x^{t}\|_{W}^{2}
+\left(\frac{\Delta_{2}}{\omega_{n}}+\frac{\Delta_{3}}{\omega_{n}}-\frac{2\alpha}{\omega_{1}}\right){\mathbb{E}}_{t}\|x^{t+1}-x^{*}\|_{W}^{2}
+\left(\frac{\sigma^{2}}{\Delta_{1}}+\frac{\rho^{2}}{\Delta_{2}}\right)

and simplifying gives

\left(\frac{\omega_{n}}{\omega_{1}}+\frac{2\alpha}{\omega_{1}}-\frac{\Delta_{2}}{\omega_{n}}-\frac{\Delta_{3}}{\omega_{n}}\right){\mathbb{E}}_{t}\|x^{t+1}-x^{*}\|_{W}^{2}
\leq\|x^{t}-x^{*}\|_{W}^{2}+\left(\frac{\sigma^{2}}{\Delta_{1}}+\frac{\rho^{2}}{\Delta_{2}}\right)
+\left(\frac{\Delta_{1}}{\omega_{n}}+\frac{L^{2}}{\omega_{n}\Delta_{3}}-1\right){\mathbb{E}}_{t}\|x^{t+1}-x^{t}\|_{W}^{2}.

To proceed, we choose \Delta_{2}=\Delta_{3}=\frac{\alpha\omega_{n}}{2\omega_{1}} and \Delta_{1}=\omega_{n}-2\omega_{1}L^{2}/(\alpha\omega_{n}), so that the coefficient on the {\mathbb{E}}_{t}\|x^{t+1}-x^{t}\|_{W}^{2} term is zero and the coefficient on the left-hand side equals (\alpha+\omega_{n})/\omega_{1}. Furthermore, enforcing that \frac{\omega_{1}}{\omega_{n}^{2}}\leq\frac{\alpha}{4L^{2}} guarantees that \Delta_{1}\geq\omega_{n}/2, i.e., \Delta_{1}^{-1}\leq 2\omega_{n}^{-1}, so that the variance term satisfies \frac{\sigma^{2}}{\Delta_{1}}+\frac{\rho^{2}}{\Delta_{2}}\leq\frac{2(\alpha\sigma^{2}+\omega_{1}\rho^{2})}{\alpha\omega_{n}}. Substituting these values and simplifying yields the result. \Box
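
For an algorithmic reading of the update analyzed in this lemma, the iterate is x^{t+1}=\operatorname*{arg\,min}_{y\in{\cal X}}\frac{1}{2}\|x^{t}-W^{-1}g^{t}-y\|_{W}^{2}, i.e., a W-weighted projected stochastic-gradient step. The Python sketch below is a minimal illustration under assumed data (a strongly monotone linear game gradient, a box constraint set, and a diagonal W are all assumptions); with diagonal W the weighted projection onto a box is separable and reduces to clipping:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 3                                    # three players, one scalar decision each
alpha = 1.0

# Illustrative strongly monotone "game gradient" G(x) = M x + b:
# symmetric part of M is alpha*I, so G is alpha-strongly monotone.
M = alpha * np.eye(n) + 0.3 * np.array([[0, 1, -1], [-1, 0, 1], [1, -1, 0]])
b = rng.normal(size=n)
x_ref = np.clip(np.linalg.solve(M, -b), 0.0, 1.0)   # indicative reference only

W = np.diag([4.0, 5.0, 6.0])             # diagonal step-size matrix
x = np.zeros(n)
for t in range(2000):
    g = M @ x + b + 0.05 * rng.normal(size=n)       # noisy gradient estimate g^t
    y = x - np.linalg.solve(W, g)                   # x^t - W^{-1} g^t
    # W-weighted projection onto the box [0,1]^n: separable for diagonal W,
    # hence identical to componentwise clipping.
    x = np.clip(y, 0.0, 1.0)

print("final iterate:", x)
print("distance to clipped unconstrained zero of G:", np.linalg.norm(x - x_ref))
```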

6.6 Proof of Theorem 9

For notational convenience, we use the shorthand e^{t}:=\|x^{t}-x^{*}\|_{W}^{2}, c=\omega_{1}/(\alpha+\omega_{n}), and

A=2\,\frac{\alpha\sigma^{2}+\omega_{1}\rho^{2}}{\alpha\omega_{n}}\,.

Hence, the result in Lemma 8 can be written compactly as

{\mathbb{E}}_{t-1}e^{t}\leq ce^{t-1}+cA.

By recursively applying this result and using the law of total expectation, we find that

{\mathbb{E}}\,e^{t}\leq c^{t}e^{0}+cA\sum_{j=0}^{t-1}c^{j}\leq c^{t}e^{0}+cA\,\frac{1-c^{t}}{1-c}.

Furthermore, if \omega_{1}-\omega_{n}<\alpha, then c<1, the geometric series converges, and its partial sums are bounded by the limit 1/(1-c). Hence {\mathbb{E}}\,e^{t}\leq c^{t}e^{0}+A\,\frac{c}{1-c}. \Box
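
As a quick numerical sanity check of this contraction argument (the constants below are arbitrary and purely illustrative), iterating the worst case of the recursion {\mathbb{E}}_{t-1}e^{t}\leq ce^{t-1}+cA shows geometric convergence of the error bound to a neighborhood of size cA/(1-c):

```python
# Worst-case iteration of the bound E[e^t] <= c*e^{t-1} + c*A from Lemma 8.
c, A, e = 0.9, 0.5, 10.0          # arbitrary illustrative constants with c < 1
limit = c * A / (1 - c)           # asymptotic error neighborhood
for t in range(1, 201):
    e = c * e + c * A
    if t % 50 == 0:
        print(f"t={t:3d}  e^t={e:.6f}  neighborhood cA/(1-c)={limit:.6f}")
```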

6.7 Proof of Theorem 10

Fix t\geq 0. For notational convenience, we denote e^{t}=\|x^{t}-x^{*}\|^{2}. Replacing the step-size matrix in Lemma 8 with W=\omega^{t}I_{d\times d} yields

{\mathbb{E}}_{t}e^{t+1}\leq\frac{\omega^{t}}{\omega^{t}+\alpha}e^{t}+\frac{2\sigma^{2}}{\omega^{t}(\omega^{t}+\alpha)}+\frac{2(\rho^{t})^{2}}{\alpha(\omega^{t}+\alpha)}. (40)

To proceed, we will use the observation that

\frac{1}{(s+t)(r+t)}=\frac{r+t}{(s+t)(r+t)^{2}}\leq\frac{\max\{\frac{r}{s},1\}}{(r+t)^{2}} (41)

and

\frac{1}{(r+t)(r+t-2)}\leq\frac{\frac{r}{r-2}}{(r+t)^{2}}. (42)

By substituting our expressions for \omega^{t}, \rho^{t}, and e^{t} into (40), we obtain

{\mathbb{E}}_{t}e^{t+1} \leq\frac{r+t-2}{\alpha^{2}(r+t)^{2}}A+\frac{8\sigma^{2}}{\alpha^{2}(r+t-2)(r+t)}
+\frac{4\bar{\rho}}{\alpha^{2}(s+t)(r+t)}
\leq\frac{r+t-2}{\alpha^{2}(r+t)^{2}}A+\frac{8\sigma^{2}\left(\frac{r}{r-2}\right)}{\alpha^{2}(r+t)^{2}}+\frac{4\bar{\rho}\max\left\{\frac{r}{s},1\right\}}{\alpha^{2}(r+t)^{2}}
=\frac{r+t-1}{\alpha^{2}(r+t)^{2}}A
+\frac{-A+8\sigma^{2}\left(\frac{r}{r-2}\right)+4\bar{\rho}\max\left\{\frac{r}{s},1\right\}}{\alpha^{2}(r+t)^{2}}
\leq\frac{r+t-1}{\alpha^{2}(r+t)^{2}}A
\leq\frac{A}{\alpha^{2}(r+t+1)}.

Here, the last two steps follow from the construction of A (which must satisfy A\geq 8\sigma^{2}\frac{r}{r-2}+4\bar{\rho}\max\{\frac{r}{s},1\} for the penultimate inequality) and from the fact that (r+t+1)(r+t-1)\leq(r+t)^{2}. \Box
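
The O(1/t) behavior can likewise be checked numerically. The following Python sketch iterates the bound (40) with decaying sequences \omega^{t}=\alpha(r+t-2)/2 and (\rho^{t})^{2}=\bar{\rho}/(s+t); these specific forms are consistent with the substitutions above but are stated here, together with the numerical constants, as assumptions for illustration only. The iterates remain below the envelope A/(\alpha^{2}(r+t)):

```python
import numpy as np

# Illustrative constants; specific values are assumptions, not the paper's.
alpha, sigma2, rho_bar = 1.0, 0.5, 0.3
r, s = 4.0, 4.0
# Any A at least this large supports the induction in the proof above.
A = 8 * sigma2 * r / (r - 2) + 4 * rho_bar * max(r / s, 1.0)

e = A / (alpha**2 * r)            # start on the claimed envelope A/(alpha^2 (r+t)) at t=0
for t in range(0, 200):
    omega_t = alpha * (r + t - 2) / 2.0          # assumed decaying step-size parameter
    rho_t2 = rho_bar / (s + t)                   # assumed decaying bias term
    e = (omega_t / (omega_t + alpha)) * e \
        + 2 * sigma2 / (omega_t * (omega_t + alpha)) \
        + 2 * rho_t2 / (alpha * (omega_t + alpha))
    envelope = A / (alpha**2 * (r + t + 1))
    assert e <= envelope + 1e-12, (t, e, envelope)
print("recursion (40) stays below the A / (alpha^2 (r+t)) envelope for all tested t")
```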

ACKNOWLEDGMENTS

This research was supported in part by the National Science Foundation (NSF) Mathematical Sciences Graduate Internship (MSGI) Program and by NSF awards 1941896 and 2044946.

The MSGI program is administered by the Oak Ridge Institute for Science and Education (ORISE) through an inter-agency agreement between the U.S. Department of Energy (DOE) and NSF. ORISE is managed for DOE by ORAU. All opinions expressed in this paper are the authors’ and do not necessarily reflect the policies and views of NSF, ORAU/ORISE, or DOE.

{IEEEbiography}

Killian Wood is a Ph.D. candidate in applied mathematics at the University of Colorado Boulder, advised by Prof. Emiliano Dall’Anese. He received a B.A. degree in mathematics from California State University Fullerton, Fullerton, California, USA in 2019 and an M.S. degree in applied mathematics from University of Colorado Boulder, Boulder, Colorado, USA in 2022. His research interests include the development and analysis of stochastic optimization for decision-dependent data distributions, as well as randomized derivative-free methods in scientific computing.

{IEEEbiography}

Ahmed S. Zamzam (Member, IEEE) is a senior research scientist at the National Renewable Energy Laboratory, where he is part of the Power Systems Engineering Center. He received the B.Sc. degree from Cairo University in 2013 and the Ph.D. degree in electrical engineering from the University of Minnesota, Minneapolis, MN, USA, in 2019. He received the Louis John Schnell Fellowship (2015) and the Doctoral Dissertation Fellowship (2018) from the University of Minnesota. His research interests include machine learning and optimization for smart grid applications, large-scale complex energy systems optimization, and grid data analytics.

{IEEEbiography}

Emiliano Dall’Anese (Member, IEEE) is an Associate Professor in the Department of Electrical, Computer, and Energy Engineering at the University of Colorado Boulder, where he is also an affiliate Faculty with the Department of Applied Mathematics. He received the Ph.D. in Information Engineering from the Department of Information Engineering, University of Padova, Italy, in 2011. He was with the University of Minnesota as a postdoc (2011-2014) and the National Renewable Energy Laboratory as a senior researcher (2014-2018). His research interests span the areas of optimization, control, and learning; current applications include power systems and autonomous systems. He received the National Science Foundation CAREER Award in 2020, the IEEE PES Prize Paper Award in 2021, and the IEEE Transactions on Control of Network Systems Best Paper Award in 2023.