Inducing Equilibria via Incentives:
Simultaneous Design-and-Play Ensures Global Convergence
Abstract
To regulate a social system comprised of self-interested agents, economic incentives are often required to induce a desirable outcome. This incentive design problem naturally possesses a bilevel structure, in which a designer modifies the rewards of the agents with incentives while anticipating the response of the agents, who play a non-cooperative game that converges to an equilibrium. The existing bilevel optimization algorithms raise a dilemma when applied to this problem: anticipating how incentives affect the agents at equilibrium requires solving the equilibrium problem repeatedly, which is computationally inefficient; bypassing the time-consuming step of equilibrium-finding can reduce the computational cost, but may lead the designer to a sub-optimal solution. To address such a dilemma, we propose a method that tackles the designer’s and agents’ problems simultaneously in a single loop. Specifically, at each iteration, both the designer and the agents only move one step. Nevertheless, we allow the designer to gradually learn the overall influence of the incentives on the agents, which guarantees optimality after convergence. The convergence rate of the proposed scheme is also established for a broad class of games.
1 Introduction
A common thread running through human history is the question of how to “properly” regulate a social system comprised of self-interested individuals. In a laissez-faire economy, for example, the competitive market itself is the primary regulatory mechanism [47, 16]. However, a laissez-faire economy may falter due to the existence of significant “externalities” [8, 18], which may arise wherever the self-interested agents do not bear the external cost of their behaviors in its entirety. The right response, many argue, is to introduce corrective policies in the form of economic incentives (e.g., tolls, taxes, and subsidies) [32]. By modifying the rewards of the agents, these incentives can encourage (discourage) the agents to engage in activities that create positive (negative) side effects for society, and thus guide the self-interests of the agents towards a socially desirable end. For example, carbon taxes can be levied on carbon emissions to protect the environment during the production of goods and services [35]. Surge pricing has been widely used to boost supply and dampen demand in volatile ride-hail markets [37]. More recently, subsidies and penalties have both been introduced to overcome vaccine hesitancy in the world’s hard-fought battle against the COVID-19 pandemic.
The goal of this paper is to develop a provably efficient method for guiding the agents in a non-cooperative game towards a socially desirable outcome — e.g., the one that maximizes the social welfare — by modifying their payoffs with incentives. The resulting problem may be naturally interpreted as a Stackelberg game [50] in which the “incentive designer” is the leader while the agents being regulated are the followers. Hence, it naturally possesses a bilevel structure [3]: at the upper level, the “designer” optimizes the incentives by anticipating and regulating the best response of the agents, who play a non-cooperative game at the lower level. As the lower-level agents pursue their self-interests freely, their best response can be predicted by the Nash equilibrium [39], which dictates no agent can do better by unilaterally changing their strategy. Accordingly, the incentive design problem is a mathematical program with equilibrium constraints (MPEC) [30].
In the optimization literature, MPECs are well-known for their intractability [10]. Specifically, even getting a first-order derivative through their bilevel structure is a challenge. In the incentive design problem, for example, to calculate the gradient of the designer’s objective at equilibrium, which provides a principled direction for the designer to update the incentives, one must anticipate how the equilibrium is affected by the changes [17]. This is usually achieved by performing a sensitivity analysis, which in turn requires differentiation through the lower-level equilibrium problem, either implicitly or explicitly [25]. No matter how the sensitivity analysis is carried out, the equilibrium problem must be solved before updating the incentives. The resulting algorithm thus admits a double loop structure: in the outer loop, the designer iteratively moves along the gradient; but to find the gradient, it must allow the lower level game dynamics to run its course to arrive at the equilibrium given the current incentives.
Because of the inherent inefficiency of the double-loop structure, many heuristic methods have also been developed for bilevel programs in machine learning [27, 29, 14]. When applied to the incentive design problem, these methods do not require the designer to solve the equilibrium problem exactly when evaluating the gradient. Instead, at each iteration, the game is allowed to run just a few rounds, enough for the designer to obtain a reasonable approximation of the gradient. Although such a method promises to reduce the computational cost significantly at each iteration, it may never converge to the same optimal solution obtained without the approximation.
Contribution. In a nutshell, correctly anticipating how incentives affect the agents at equilibrium requires solving the equilibrium problem repeatedly, which is computationally inefficient. On the other hand, simply bypassing the time-consuming step of equilibrium-finding may lead the designer to a sub-optimal solution. This dilemma prompts the following question that motivates this study: can we obtain the optimal solution to an incentive design problem without repeatedly solving the equilibrium problem?
In this paper, we propose an efficient, principled method that tackles the designer’s and the agents’ problems simultaneously in a single loop. At the lower level, we use the mirror descent method [40] to model the process through which the agents move towards equilibrium. At the upper level, we use the gradient descent method to update the incentives towards optimality. At each iteration, both the designer and the agents only move one step based on the first-order information. However, as discussed before, the upper-level gradient relies on the corresponding lower-level equilibrium, which is not available in the single-loop update. Hence, we propose to use the implicit differentiation formula—with the equilibrium strategy replaced by the current strategy—to estimate the upper-level gradient, which might be biased at the beginning. Nevertheless, we prove that if we improve the lower-level solution with larger step sizes, the upper-level and lower-level problems may converge simultaneously at a fast rate. The proposed scheme hence guarantees optimality because it can anticipate the overall influence of the incentives on the agents eventually after convergence.
Organization. In Section 2, we discuss related work. In Section 3, we provide the mathematical formulation of the incentive design problem. In Section 4, we design algorithms for solving the problem. In Section 5, we establish conditions under which the proposed scheme globally converges to the optimal solution and analyze the convergence rate. The convergence analysis is restricted to games with a unique equilibrium. In Section 6, we discuss how to apply our algorithms to games with multiple equilibria. Finally, we conduct experiments to test our algorithms in Section 7.
Notation. We denote ⟨·,·⟩ as the inner product in vector spaces. For a vector form , we denote . For a finite set , we denote . For any vector norm ‖·‖, we denote ‖·‖∗ as its dual norm. We refer readers to Appendix A for a collection of frequently used problem-specific notations.
2 Related work
The incentive design problem studied in this paper is a special case of mathematical programs with equilibrium constraints (MPEC) [19], a class of optimization problems constrained by equilibrium conditions. MPEC is closely related to bilevel programs [10], which bind two mathematical programs together by treating one program as part of the constraints for the other.
Bilevel Programming. In the optimization literature, bilevel programming was first introduced to tackle resource allocation problems [7] and has since found applications in such diverse topics as revenue management, network design, traffic control, and energy systems. In the past decade, researchers have discovered numerous applications of bilevel programming in machine learning (ML), including meta-learning [14], adversarial learning [22], hyperparameter optimization [31], and neural architecture search [27]. These newly found bilevel programs in ML are often solved by gradient descent methods, which require differentiating through the (usually unconstrained) lower-level optimization problem [28]. The differentiation can be carried out either implicitly on the optimality conditions as in the conventional sensitivity analysis [see e.g., 2, 43, 4], or explicitly by unrolling the numerical procedure used to solve the lower-level problem [see e.g., 31, 15]. In the explicit approach, one may “partially” unroll the solution procedure (i.e., stop after just a few rounds, or even only one round) to reduce the computational cost. Although this popular heuristic has delivered satisfactory performance on many practical tasks [29, 36, 14, 27], it cannot guarantee optimality for bilevel programs under the general setting, as it cannot derive the accurate upper-level gradient at each iteration [53].
MPEC. Unlike bilevel programs, MPEC is relatively under-explored in the ML literature so far. Recently, Li et al. [25] extended the explicit differentiation method for bilevel programs to MPECs. Their algorithm unrolls an iterative projection algorithm for solving the lower-level problem, which is formulated as a variational inequality (VI) problem. Leveraging the recent advance in differentiable programming [2], they embedded each projection iteration as a differentiable layer in a computational graph and, accordingly, transformed the explicit differentiation into standard backpropagation through the graph. The algorithm proposed by Li et al. [26] has a similar overall structure, but theirs casts the lower-level solution process as the imitative logit dynamics [6] drawn from evolutionary game theory, which can be more efficiently unrolled. Although backpropagation is efficient, constructing and storing such a graph — with potentially a large number of projection layers needed to find a good solution to the lower-level problem — is still demanding. To reduce this burden, one may partially unroll the iterative projection algorithm. Yet, doing so still cannot guarantee optimality for MPECs, for the same reason as for bilevel programs.
The simultaneous design-and-play approach is proposed to address this dilemma. Our approach follows the algorithms of Hong et al. [21] and Chen et al. [9], which solve bilevel programs via single-loop updates. Importantly, they solve both the upper- and the lower-level problems using a gradient descent algorithm and establish the relationship between the convergence rate of the single-loop algorithm and the step sizes used in gradient descent. However, their algorithms are limited to the cases where the lower-level optimization problem is unconstrained. Our work extends these single-loop algorithms to MPECs that have an equilibrium problem at the lower level. We choose mirror descent as the solution method for the lower-level problem because of its broad applicability to optimization problems with constraints [40] and its generality in the behavioral interpretation of games [34, 23]. We show that the convergence of the proposed simultaneous design-and-play approach relies on the setting of the step sizes for both the upper- and lower-level updates, a finding that echoes the key result in [21]. We first give the convergence rate under mirror descent and the unconstrained assumption and then extend the result to the constrained case. For the latter, we show that convergence cannot be guaranteed if the lower-level solution gets too close to the boundary early in the simultaneous solution process. To avoid this trap, the standard mirror descent method is revised to carefully steer the lower-level solution away from the boundary.
3 Problem Formulation
We study incentive design in both atomic games [39] and nonatomic games [45], classified depending on whether the set of agents is endowed with an atomic or a nonatomic measure. In social systems, both types of games can be useful, although the application context varies. Atomic games are typically employed when each agent has a non-trivial influence on the rewards of other agents. In a nonatomic game, on the contrary, a single agent’s influence is negligible, and rewards are affected only by the collective behavior of the agents.
Atomic Game. Consider a game played by a finite set of agents , where each agent selects a strategy to maximize the reward received, which is determined by a continuously differentiable function . Formally, a joint strategy is a Nash equilibrium if
Suppose that for all , the strategy set is closed and convex, and the reward function is convex in , then is a Nash equilibrium if and only if there exists such that [46]
(3.1)
Example 3.1 (Oligopoly model).
In an oligopoly model, there is a finite set of firms, each of which supplies the market with a quantity () of goods. Under this setting, we have . The good is then priced as , where and is the total output. The profit and the marginal profit of the firm are then given by
respectively, where is the constant marginal cost for firm . (Throughout this paper, we use the term “reward” to describe the scenario where the agents aim to maximize , and use “cost” when the agents aim to do the opposite.)
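To make the structure of this example concrete, the sketch below evaluates the price, profits, and marginal profits under a linear inverse demand; the demand parameters, the cost vector, and the quantities are hypothetical placeholders rather than values taken from the paper.

```python
import numpy as np

# Hypothetical linear inverse demand p(Q) = d0 - d1 * Q and constant marginal costs c.
d0, d1 = 10.0, 0.5
c = np.array([1.0, 1.2, 1.5])          # marginal cost of each firm (illustrative)

def price(q):
    return d0 - d1 * q.sum()           # market price given the total output Q

def profit(q):
    return q * price(q) - c * q        # profit of each firm

def marginal_profit(q):
    # d/dq_i [ q_i * p(Q) - c_i * q_i ] = p(Q) + q_i * p'(Q) - c_i
    return price(q) - d1 * q - c

q = np.array([2.0, 2.0, 2.0])
print(profit(q), marginal_profit(q))
```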
Nonatomic Game. Consider a game played by a continuous set of agents, which can be divided into a finite set of classes . We assume that each represents a class of infinitesimal and homogeneous agents sharing the finite strategy set with . The mass distribution for the class is defined as a vector that gives the proportion of agents using each strategy. Let the cost of an agent in class to select a strategy given be . Formally, a joint mass distribution is a Nash equilibrium, also known as a Wardrop equilibrium [51], if for all , there exists such that
The following result extends the VI formulation of the Nash equilibrium to nonatomic games: denote , then is a Nash equilibrium if and only if [11]
(3.2)
Example 3.2 (Routing game).
Consider a set of agents traveling from source nodes to sink nodes in a directed graph with nodes and edges . Denote as the set of source-sink pairs, as the set of paths connecting and as the set of all edges on the path . Suppose that each source-sink pair is associated with nonatomic agents aiming to choose a route from to minimize the total cost incurred. Let be the proportion of travelers using the path , be the number of travelers using the edge and be the cost for using edge . Then we have where equals 1 if and 0 otherwise. The total cost for a traveler selecting a path will then be .
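The relation between path flows, edge flows, and path costs described above can be written compactly with a path–edge incidence matrix. The sketch below illustrates this on a made-up two-path network with affine edge delays; none of these numbers come from the paper.

```python
import numpy as np

# Hypothetical network: 3 edges, 2 paths; Delta[a, p] = 1 if edge a lies on path p.
Delta = np.array([[1, 0],
                  [0, 1],
                  [1, 1]], dtype=float)
demand = 10.0                                   # travelers on this origin-destination pair

def edge_cost(w):
    return 1.0 + 0.1 * w                        # illustrative affine delay on each edge

def path_cost(x):                               # x: proportion of travelers on each path
    w = Delta @ (demand * x)                    # edge flows induced by the path flows
    return Delta.T @ edge_cost(w)               # path cost = sum of the costs of its edges

print(path_cost(np.array([0.5, 0.5])))
```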
Incentive Design. Despite the difference, we see that an equilibrium of both atomic and nonatomic games can be formulated as a solution to a corresponding VI problem in the following form
(3.3)
where and denote different terms in the two types of games. Suppose that there exists an incentive designer aiming to induce a desired equilibrium. To this end, the designer can add incentives , which is assumed to enter the reward/cost functions and thus leads to a parameterized . We assume that the designer’s objective is determined by a function . Denote as the solution set to (3.3). We then obtain the uniform formulation of the incentive design problem for both atomic games and nonatomic games
(3.4)
If the equilibrium problem admits multiple solutions, the agents may converge to different ones, and it would be difficult to determine which one better predicts the behaviors of the agents without additional information. In this paper, we first consider the case where the game admits a unique equilibrium; sufficient conditions under which this holds will also be provided later. We then consider the non-unique case and show that our algorithms remain applicable after adding an appropriate regularizer to the cost function.
Stochastic Environment. In the aforementioned settings, is a deterministic function. Although most MPEC algorithms in the optimization literature follow this deterministic setting, in this paper, we hope our algorithm can handle more realistic scenarios. Specifically, in the real world, the environment could be stochastic because some environment parameters fluctuate from day to day. In a traffic system, for example, both bad weather and special events may affect the road conditions, and hence the travel times experienced by the drivers. We expect our algorithm to still work in the face of such stochasticity. To this end, we assume that represents the expected value of the cost function. On each day, however, the agents can only receive noisy feedback as an estimate. In the next section, we develop algorithms based on such noisy feedback.
4 Algorithm
We propose to update and simultaneously to improve the computational efficiency. The game dynamics at the lower level is modeled using the mirror descent method. Specifically, at the stage , given and , the agent first receives as the feedback. After receiving the feedback, the agents update their strategies via
(4.1)
where is the Bregman divergence induced by a strongly convex function . The accurate value of , the gradient of its objective function, equals
which requires the exact lower-level equilibrium . However, at the stage , we only have the current strategy . Therefore, we also have to establish an estimator of and using , the form of which will be specified later.
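For concreteness, the following minimal sketch shows what the mirror-descent update (4.1) looks like under two standard choices of the Bregman divergence; the feedback vector and the step size are placeholders supplied by the caller.

```python
import numpy as np

def mirror_step_euclidean(x, v_hat, beta):
    # psi(x) = ||x||^2 / 2: the Bregman divergence is the squared Euclidean distance,
    # and the mirror-descent step reduces to a plain gradient step on the cost.
    return x - beta * v_hat

def mirror_step_entropic(x, v_hat, beta):
    # psi(x) = sum_a x_a * log(x_a): the Bregman divergence is the KL divergence,
    # and the step becomes a multiplicative-weights update on the probability simplex.
    y = x * np.exp(-beta * v_hat)
    return y / y.sum()
```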
Remark 4.1.
The standard gradient descent method is double-loop because at each iteration it involves an inner loop for solving the exact value of and then calculating the exact gradient.
4.1 Unconstrained Game
We first consider unconstrained games with , for all . We select as a smooth function, i.e., there exists a constant such that for all and ,
(4.2)
Examples satisfying this assumption include (but are not limited to) , where is a positive definite matrix. It can be directly checked that we can set , where is the largest singular value of . In this case, the corresponding Bregman divergence becomes , which is known as the squared Mahalanobis distance. Before laying out the algorithm, we first give the following lemma characterizing .
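A quick numerical check of the claim above, with an arbitrary symmetric positive definite matrix standing in for the (unspecified) matrix in the quadratic example:

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((4, 4))
Q = M @ M.T + 4 * np.eye(4)                       # symmetric positive definite stand-in

psi = lambda x: 0.5 * x @ Q @ x                   # smooth, strongly convex mirror map
grad_psi = lambda x: Q @ x
bregman = lambda x, y: psi(x) - psi(y) - grad_psi(y) @ (x - y)

x, y = rng.standard_normal(4), rng.standard_normal(4)
# The induced Bregman divergence equals the squared Mahalanobis distance.
assert np.isclose(bregman(x, y), 0.5 * (x - y) @ Q @ (x - y))
```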
Lemma 4.2.
When and is non-singular, it holds that
Proof.
See Appendix B.2 for detailed proof. ∎
For any given and , we define
(4.3)
Although we cannot obtain the exact value of , we may use as a surrogate and update based on instead. Now we are ready to present the following bilevel incentive design algorithm for unconstrained games (see Algorithm 1).
(4.4)
In Algorithm 1, if and converge to fixed points and , respectively, then is expected to be satisfied. Thus, we would also have , and the optimality of can then be guaranteed. This implies that the algorithm finds the optimal solution if it converges. The difficult part, instead, is how to design appropriate step sizes that ensure convergence. In this paper, we provide such conditions in Section 5.1.
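The following sketch illustrates the single-loop logic of Algorithm 1 on a toy unconstrained quadratic game; the game matrix, the designer's objective, and the step-size schedules are invented for illustration and are not the instances or constants used in our analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))
A = 0.5 * (A + A.T) + n * np.eye(n)          # symmetric positive definite: strongly monotone game
b = rng.standard_normal(n)
x_target = np.zeros(n)                       # equilibrium the designer would like to induce

def v(x, theta):                             # pseudo-gradient of the incentivized game
    return A @ x + b - theta

def grad_f_theta(theta, x):                  # designer's partial gradient in theta
    return 0.1 * theta

def grad_f_x(theta, x):                      # designer's partial gradient in x
    return x - x_target

theta, x = np.zeros(n), rng.standard_normal(n)
for k in range(1000):
    beta = 0.2 / (k + 1) ** 0.5              # agents' step size (decays more slowly)
    alpha = 0.2 / (k + 1)                    # designer's step size (decays faster)
    x = x - beta * v(x, theta)               # agents: one mirror-descent step
    # Implicit-differentiation surrogate: dx*/dtheta = -(grad_x v)^{-1} grad_theta v = A^{-1},
    # evaluated with the *current* strategy x instead of the exact equilibrium.
    dx_dtheta = np.linalg.solve(A, np.eye(n))
    grad_F = grad_f_theta(theta, x) + dx_dtheta.T @ grad_f_x(theta, x)
    theta = theta - alpha * grad_F           # designer: one gradient step (projection onto a
                                             # compact incentive set is omitted in this toy sketch)
```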
4.2 Simplex-Constrained Game
We then consider the case where for all , is constrained within the probability simplex
where is the vector of all ones. Here we remark that many classic game-theoretic models are simplex-constrained. In fact, as long as the agent faces a finite number of choices and adopts a mixed strategy, its decision space is a probability simplex [39]. In addition, some other types of decisions may also be constrained by a simplex. For example, financial investment concerns how to split a budget across different assets. In such a scenario, the budget constraint can also be represented by a probability simplex.
In such a case, we naturally consider , which is the Shannon entropy. Such a choice gives the Bregman divergence , which is known as the KL divergence. In this case, we still first need to characterize , which also has an analytic form. Specifically, if we define a function that satisfies
then for any , satisfies [12]. Implicitly differentiating through this fixed point equation then yields . Then, similar to (4.3), we may use
(4.5)
to approximate the actual gradient and then update based on instead.
Remark 4.3.
The mapping has an analytic expression, which reads
Hence, both and can also be calculated analytically.
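The analytic expression referenced in this remark is not reproduced above; as a purely illustrative stand-in, the sketch below shows a logit (softmax) mapping of the kind that typically appears in entropic settings, together with its closed-form Jacobian.

```python
import numpy as np

def softmax(y):
    z = np.exp(y - y.max())          # subtract the max for numerical stability
    return z / z.sum()

def softmax_jacobian(y):
    # For s = softmax(y), the Jacobian is diag(s) - s s^T, available in closed form.
    s = softmax(y)
    return np.diag(s) - np.outer(s, s)
```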
In addition to a different gradient estimate, we also modify Algorithm 1 to keep the iterations from hitting the boundary at the early stage. The modification involves an additional step that mixes the strategy with a uniform strategy , i.e., imposing an additional step
upon finishing the update (4.4), where is the mixing parameter, decreasing to 0 when . In the following, we give the formal presentation of the modified bilevel incentive design algorithm for simplex-constrained games (see Algorithm 2).
(4.6)
Similar to Algorithm 1, at the core of the convergence of Algorithm 2 is still the step size. This case is even more complicated, as we need to design , , and at the same time. In this paper, a provably convergent scheme is provided in Section 5.2.
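A minimal sketch of one iteration of this simplex-constrained scheme for a single agent class is given below: an entropic mirror-descent step, the mixing step with the uniform strategy, and one designer update. The feedback vector, surrogate gradient, and step sizes are placeholders; handling multiple agent classes and the projection of the incentive are omitted.

```python
import numpy as np

def design_and_play_step_simplex(x, theta, v_hat, grad_F_hat, beta, alpha, nu):
    """One illustrative iteration of Algorithm 2 (single agent class)."""
    y = x * np.exp(-beta * v_hat)              # entropic mirror-descent step (4.4)
    x_new = y / y.sum()
    n = x_new.size
    x_new = (1.0 - nu) * x_new + nu / n        # mix with the uniform strategy (4.6)
    theta_new = theta - alpha * grad_F_hat     # designer's step with the surrogate gradient
    return x_new, theta_new
```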
Before closing this section, we remark that the algorithm can be easily adapted to other types of constraints by using another to model the game dynamics. Particularly, the projected gradient descent dynamics has very broad applicability. In this case, the algorithm for calculating and is given by, for example, Amos and Kolter [2]. The additional step (4.6) then becomes unnecessary as it is dedicated to simplex constraints.
5 Convergence Analysis
In this section, we study the convergence of the proposed algorithms. For simplicity, define We make the following assumptions.
Assumption 5.1.
The lower-level problem in (3.4) satisfies the following conditions.
• The strategy set of agent is a nonempty, compact, and convex subset of .
• For each , the gradient is -Lipschitz continuous with respect to , i.e., for all and , .
• There exist constants such that for all and , , and .
• For all , the equilibrium of the game is strongly stable with respect to , i.e., for all , .
Assumption 5.2.
The upper-level problem in (3.4) satisfies the following properties.
• The set is compact and convex. The function is -strongly convex and has -norm uniformly bounded by .
• The extended gradient is -Lipschitz continuous with respect to , i.e., for all , .
Assumption 5.3.
Define the filtration by , , , and We impose the following assumptions.
• The feedback is an unbiased estimate, i.e., for all , we have .
• The feedback has bounded mean squared estimation errors, i.e., there exists such that for all .
Below we discuss when the proposed assumptions hold and, if they are violated, how our algorithm would work. Assumption 5.1 includes the condition that is strongly stable. In this case, it is the unique Nash equilibrium of the game [34]. It is also a common assumption in the analysis of the mirror descent dynamics itself [13]. We provide sufficient conditions for checking strong stability in Appendix B.1. We refer the readers to Section 6 for an explanation of how to extend our algorithm when this assumption is violated. Assumption 5.2 includes the convexity of the upper-level problem, which is usually a necessary condition to ensure global convergence. Yet, without convexity, our algorithm can still converge to a local minimum. Assumption 5.3 becomes unnecessary if we simply assume that the environment is deterministic. In this case, the accurate value of is available. Yet, if noise is added to the feedback, assuming that the noisy feedback is unbiased and bounded is still reasonable.
5.1 Unconstrained Game
In this part, we establish the convergence guarantee of Algorithm 1 for unconstrained games. We define the optimality gap and the equilibrium gap as
We track these two gaps as the convergence criteria in the subsequent results.
Theorem 5.4.
Proof.
See Appendix C for detailed proof and a detailed expression of convergence rates. ∎
5.2 Simplex-Constrained Game
In this part, we establish the convergence guarantee of Algorithm 2 for simplex-constrained games. We still define the optimality gap as . Yet, corresponding to (4.6), we track as a measure of convergence for the strategies of the agents, which is defined as
where . We are now ready to give the convergence guarantee of Algorithm 2.
Theorem 5.5.
Proof.
See Appendix D for detailed proof and a detailed form of the convergence rates. ∎
6 Extensions to Games with Multiple Equilibria
We then briefly discuss how to apply our algorithms when the lower-level game has multiple equilibria.
Case I: If the function is strongly monotone in the neighbourhood of each equilibrium, then all equilibria are strongly stable in a neighbourhood and hence isolated [38]. In this case, our algorithms can be directly applied as is non-singular in these neighbourhoods. It is commonly believed that the most likely equilibrium is the one reached by the game dynamics [52]; our algorithm naturally converges to this one.
Case II: If the function is monotone but not strongly monotone, then the equilibrium set is a convex and closed region [34]. This case is challenging, as the matrix that needs to be inverted becomes singular. Nevertheless, we can simply assume that the agents are boundedly rational [42, 1, 33]. Bounded rationality results in a quantal response equilibrium for predicting the agents’ responses. We refer the readers to Appendix F for a detailed explanation (with numerical examples for illustration). Here we briefly sum up the key takeaways: (1) it is equivalent to adding a regularizer to for some and ; (2) as long as , the strong stability condition in Assumption 5.1 would then be satisfied, hence a unique equilibrium would exist; (3) as long as , the Lipschitz continuity condition in Assumption 5.1 will also not be violated. In a nutshell, the bounded rationality assumption can simultaneously make our model more realistic and satisfy the assumptions in Section 5.
7 Numerical Experiments
In this section, we conduct two numerical experiments to test our algorithms. All numerical results reported in this section were produced on a MacBook Pro (15-inch, 2017) with a 2.9 GHz Quad-Core Intel Core i7 CPU.
7.1 Pollution Control via Emission Tax
We first consider the oligopoly model introduced in Example 3.1. We assume that producing units of output, firm would generate units of emissions. We consider the following social welfare function [44]
where the first term is the consumers’ surplus, the second term is the total production cost, and the third term is the social damage caused by pollution. To maximize social welfare, an authority can impose emission taxes on production. Specifically, whenever producing units of output, firm could be charged , where is specialized for firm . In the experiment, we set , , , , and , where are evenly spaced between 1 and 2 and are the white noises. Under this setting, and are negatively correlated, which is realistic because if a firm hopes to reduce its pollution by upgrading its emission control systems, its production cost must increase accordingly.
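For reference, the sketch below shows one way to code a social welfare function with the three ingredients described above (a consumers'-surplus term, total production cost, and pollution damage); the linear demand, the damage term, and all parameter values are stand-ins, since the exact expressions and constants used in the experiment are not reproduced here.

```python
import numpy as np

d0, d1, damage_coef = 10.0, 0.5, 0.2       # hypothetical demand and damage parameters
c = np.linspace(1.0, 2.0, 5)               # marginal production costs (evenly spaced)
e = 1.0 / c                                # emission rates, negatively correlated with c

def social_welfare(q):
    Q = q.sum()
    consumer_surplus = d0 * Q - 0.5 * d1 * Q ** 2        # stand-in consumers' surplus term
    production_cost = np.dot(c, q)
    pollution_damage = damage_coef * np.dot(e, q) ** 2   # damage from total emissions
    return consumer_surplus - production_cost - pollution_damage
```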
Through this small numerical example, we hope to illustrate that the single-loop scheme developed in this paper is indeed much more efficient than previous double-loop algorithms. To this end, we compare our algorithm with two double-loop schemes proposed by Li et al. [25]. In both approaches, the lower-level equilibrium problem is solved exactly first at each iteration. But afterwards, the upper-level gradients are respectively obtained via automatic differentiation (AD) and implicit differentiation (ID). To make a fair comparison, the same hyperparameters — including the initial solutions, the learning rates, and the tolerance values for both upper- and lower-level problems — are employed for the tested algorithms (double-loop AD, double-loop ID, and our algorithm).
Table 1 reports statistics related to the computational performance, including the total CPU time, the total iteration number, and the CPU time per iteration. The results reveal that all tested algorithms take a similar number of iterations to reach the same level of precision. However, the running time per iteration required by our algorithm is significantly lower than that of the two double-loop approaches. Hence, in general, our scheme is more efficient.
Method | Double-loop AD | Double-loop ID | Our algorithm |
Total CPU time (s) | 71.56 | 11.29 | 2.09 |
Total iteration number | 519 | 626 | 813 |
Time per iteration (ms) | 137.89 | 18.04 | 2.57 |
7.2 Second-Best Congestion Pricing
We then consider the routing game model introduced in Example 3.2. To minimize the total travel delay, an authority on behalf of the public sector could impose appropriate tolls on selected roads [49]. This problem of determining tolls is commonly known as the congestion pricing problem. The second-best scheme assumes that only a subset of links can be charged [48]. Specifically, we write as the toll imposed on all the links and as the set of tollable links. We model total cost for a traveler selecting a path as
where we add an extra term to characterize the uncertainties in travelers’ route choices [20, 5]. It results in a quantal response equilibrium, as discussed in Section 6. We test our algorithm on a real-world traffic network: the Sioux-Falls network (See Lawphongpanich and Hearn [24] for its structure). We select 20 links (11, 35, 32, 68, 46, 21, 65, 52, 71, 74, 33, 64, 69, 14, 18, 39, 57, 48, 15, 51) for imposing congestion tolls.
We run Algorithm 2 to solve the problem and compare four different settings for the selection of step sizes. Setting A: , , and . It ensures convergence according to Theorem 5.5. Setting B: , , and . It only increases the decreasing rate of the step size in the inner loop. Setting C: , , and . It assumes that all step sizes decrease based on the classic rate. Setting D: , , and . It does not adopt the mixing step proposed in our paper. We add white noise to the costs received by the agents based on Gaussian distributions and run our algorithm 10 times under each setting. The mean values of the upper-level optimality gaps and lower-level equilibrium gaps are reported in Figure 1 (the shadowed areas are plotted based on the “mean ± std” range over all sampled trajectories).
Below we summarize a few observations. First, without the mixing step proposed in our paper, the algorithm indeed hits the boundary too early and fails after just a few iterations. This verifies our earlier claim that directly extending previous methods [21, 9] designed for bilevel optimization problems to MPECs is problematic. Second, the step sizes given by Theorem 5.5 ensure the fastest and most stable convergence.
References
- Ahmed et al. [2019] Ahmed, Z., Le Roux, N., Norouzi, M. and Schuurmans, D. (2019). Understanding the impact of entropy on policy optimization. In International conference on machine learning. PMLR.
- Amos and Kolter [2017] Amos, B. and Kolter, J. Z. (2017). Optnet: Differentiable optimization as a layer in neural networks. In Proceedings of the 34th International Conference on Machine Learning.
- Anandalingam and Friesz [1992] Anandalingam, G. and Friesz, T. L. (1992). Hierarchical optimization: An introduction. Annals of Operations Research 34 1–11.
- Bai et al. [2019] Bai, S., Kolter, J. Z. and Koltun, V. (2019). Deep equilibrium models. In Advances in Neural Information Processing Systems.
- Bar-Gera and Boyce [1999] Bar-Gera, H. and Boyce, D. (1999). Route flow entropy maximization in origin-based traffic assignment. In 14th International Symposium on Transportation and Traffic Theory. Transportation Research Institute.
- Björnerstedt and Weibull [1994] Björnerstedt, J. and Weibull, J. W. (1994). Nash equilibrium and evolution by imitation. Tech. rep., IUI Working Paper.
- Bracken and McGill [1973] Bracken, J. and McGill, J. T. (1973). Mathematical programs with optimization problems in the constraints. Operations Research 21 37–44.
- Buchanan and Stubblebine [1962] Buchanan, J. M. and Stubblebine, W. C. (1962). Externality. In Classic papers in natural resource economics. Springer, 138–154.
- Chen et al. [2021] Chen, T., Sun, Y. and Yin, W. (2021). A single-timescale stochastic bilevel optimization method. arXiv preprint arXiv:2102.04671 .
- Colson et al. [2007] Colson, B., Marcotte, P. and Savard, G. (2007). An overview of bilevel optimization. Annals of operations research 153 235–256.
- Dafermos [1980] Dafermos, S. (1980). Traffic equilibrium and variational inequalities. Transportation Science 14 42–54.
- Dafermos [1983] Dafermos, S. (1983). An iterative scheme for variational inequalities. Mathematical Programming 26 40–47.
- D’Orazio et al. [2021] D’Orazio, R., Loizou, N., Laradji, I. and Mitliagkas, I. (2021). Stochastic mirror descent: Convergence analysis and adaptive variants via the mirror stochastic polyak stepsize. arXiv preprint arXiv:2110.15412 .
- Finn et al. [2017] Finn, C., Abbeel, P. and Levine, S. (2017). Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning. PMLR.
- Franceschi et al. [2017] Franceschi, L., Donini, M., Frasconi, P. and Pontil, M. (2017). Forward and reverse gradient-based hyperparameter optimization. In International Conference on Machine Learning. PMLR.
- Friedman and Friedman [1990] Friedman, M. and Friedman, R. (1990). Free to choose: A personal statement. Houghton Mifflin Harcourt.
- Friesz et al. [1990] Friesz, T. L., Tobin, R. L., Cho, H.-J. and Mehta, N. J. (1990). Sensitivity analysis based heuristic algorithms for mathematical programs with variational inequality constraints. Mathematical Programming 48 265–284.
- Greenwald and Stiglitz [1986] Greenwald, B. C. and Stiglitz, J. E. (1986). Externalities in economies with imperfect information and incomplete markets. The quarterly journal of economics 101 229–264.
- Harker and Pang [1988] Harker, P. T. and Pang, J.-S. (1988). Existence of optimal solutions to mathematical programs with equilibrium constraints. Operations research letters 7 61–64.
- Hazelton [1998] Hazelton, M. L. (1998). Some remarks on stochastic user equilibrium. Transportation Research Part B: Methodological 32 101–108.
- Hong et al. [2020] Hong, M., Wai, H.-T., Wang, Z. and Yang, Z. (2020). A two-timescale framework for bilevel optimization: Complexity analysis and application to actor-critic. arXiv preprint arXiv:2007.05170 .
- Jiang et al. [2021] Jiang, H., Chen, Z., Shi, Y., Dai, B. and Zhao, T. (2021). Learning to defend by learning to attack. In International Conference on Artificial Intelligence and Statistics. PMLR.
- Krichene et al. [2015] Krichene, W., Krichene, S. and Bayen, A. (2015). Convergence of mirror descent dynamics in the routing game. In 2015 European Control Conference (ECC). IEEE.
- Lawphongpanich and Hearn [2004] Lawphongpanich, S. and Hearn, D. W. (2004). An mpec approach to second-best toll pricing. Mathematical programming 101 33–55.
- Li et al. [2020] Li, J., Yu, J., Nie, Y. M. and Wang, Z. (2020). End-to-end learning and intervention in games. Advances in Neural Information Processing Systems 33.
- Li et al. [2022] Li, J., Yu, J., Wang, Q., Liu, B., Wang, Z. and Nie, Y. M. (2022). Differentiable bilevel programming for stackelberg congestion games. arXiv preprint arXiv:2209.07618 .
- Liu et al. [2018] Liu, H., Simonyan, K. and Yang, Y. (2018). Darts: Differentiable architecture search. arXiv preprint arXiv:1806.09055 .
- Liu et al. [2021] Liu, R., Gao, J., Zhang, J., Meng, D. and Lin, Z. (2021). Investigating bi-level optimization for learning and vision from a unified perspective: A survey and beyond. arXiv preprint arXiv:2101.11517 .
- Luketina et al. [2016] Luketina, J., Berglund, M., Greff, K. and Raiko, T. (2016). Scalable gradient-based tuning of continuous regularization hyperparameters. In International conference on machine learning. PMLR.
- Luo et al. [1996] Luo, Z.-Q., Pang, J.-S. and Ralph, D. (1996). Mathematical programs with equilibrium constraints. Cambridge University Press.
- Maclaurin et al. [2015] Maclaurin, D., Duvenaud, D. and Adams, R. (2015). Gradient-based hyperparameter optimization through reversible learning. In International conference on machine learning. PMLR.
- Maskin [1994] Maskin, E. S. (1994). The invisible hand and externalities. The American Economic Review 84 333–337.
- Melo [2021] Melo, E. (2021). On the uniqueness of quantal response equilibria and its application to network games. Economic Theory 1–45.
- Mertikopoulos and Zhou [2019] Mertikopoulos, P. and Zhou, Z. (2019). Learning in games with continuous action sets and unknown payoff functions. Mathematical Programming 173 465–507.
- Metcalf and Weisbach [2009] Metcalf, G. E. and Weisbach, D. (2009). The design of a carbon tax. Harv. Envtl. L. Rev. 33 499.
- Metz et al. [2016] Metz, L., Poole, B., Pfau, D. and Sohl-Dickstein, J. (2016). Unrolled generative adversarial networks. arXiv preprint arXiv:1611.02163 .
- Mguni et al. [2019] Mguni, D., Jennings, J., Sison, E., Valcarcel Macua, S., Ceppi, S. and Munoz de Cote, E. (2019). Coordinating the crowd: Inducing desirable equilibria in non-cooperative systems. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems.
- Nagurney [2013] Nagurney, A. (2013). Network economics: A variational inequality approach, vol. 10. Springer Science & Business Media.
- Nash [1951] Nash, J. (1951). Non-cooperative games. Annals of mathematics 286–295.
- Nemirovskij and Yudin [1983] Nemirovskij, A. S. and Yudin, D. B. (1983). Problem Complexity and Method Efficiency in Optimization. A Wiley-Interscience publication, Wiley.
- Parise and Ozdaglar [2017] Parise, F. and Ozdaglar, A. (2017). Sensitivity analysis for network aggregative games. In 2017 IEEE 56th Annual Conference on Decision and Control (CDC). IEEE.
- Prashker and Bekhor [2004] Prashker, J. N. and Bekhor, S. (2004). Route choice models used in the stochastic user equilibrium problem: a review. Transport reviews 24 437–463.
- Rajeswaran et al. [2019] Rajeswaran, A., Finn, C., Kakade, S. and Levine, S. (2019). Meta-learning with implicit gradients. arXiv preprint arXiv:1909.04630 .
- Requate [1993] Requate, T. (1993). Pollution control in a cournot duopoly via taxes or permits. Journal of Economics 58 255–291.
- Schmeidler [1973] Schmeidler, D. (1973). Equilibrium points of nonatomic games. Journal of statistical Physics 7 295–300.
- Scutari et al. [2010] Scutari, G., Palomar, D. P., Facchinei, F. and Pang, J.-S. (2010). Convex optimization, game theory, and variational inequality theory. IEEE Signal Processing Magazine 27 35–49.
- Smith [1791] Smith, A. (1791). An Inquiry into the Nature and Causes of the Wealth of Nations, vol. 1. Librito Mondi.
- Verhoef [2002] Verhoef, E. T. (2002). Second-best congestion pricing in general networks. heuristic algorithms for finding second-best optimal toll levels and toll points. Transportation Research Part B: Methodological 36 707–729.
- Vickrey [1969] Vickrey, W. S. (1969). Congestion theory and transport investment. The American Economic Review 59 251–260.
- von Stackelberg [1952] von Stackelberg, H. (1952). The theory of the market economy. Oxford University Press.
- Wardrop [1952] Wardrop, J. G. (1952). Some theoretical aspects of road traffic research. In Proceedings of the Institution of Civil Engineers, vol. 1.
- Weibull [1997] Weibull, J. W. (1997). Evolutionary game theory. MIT press.
- Wu et al. [2018] Wu, Y., Ren, M., Liao, R. and Grosse, R. (2018). Understanding short-horizon bias in stochastic meta-optimization. arXiv preprint arXiv:1803.02021 .
Appendix A Notations
We collect the most frequently used notations here.
Category | Symbol | Definition |
Game (lower level) | set of all agents | |
agent ’s strategy | ||
reward function of agent | ||
, unit reward of agent | ||
variational stability coefficient of agent | ||
incentivized reward and unit reward | ||
Incentive designer (upper level) | incentive distributed to the agents | |
objective of the incentive designer | ||
equilibrium induced by incentive | ||
designer’s objective value under | ||
Algorithm (-th iteration) | Bregman divergence in agent ’s strategy update | |
equilibrium learning rate for agent | ||
incentive learning rate for the designer | ||
estimate of incentivized unit cost | ||
approximate incentive gradient | ||
equilibrium learning dynamic | ||
agent ’s strategy mixed with uniform strategy | ||
Analysis (assumptions) | Lipschitz continuity coefficient of | |
spectral upper bound of | ||
spectral lower bound of | ||
strong convexity coefficient of | ||
upper bound of | ||
Lipschitz continuity coefficient of | ||
estimation MSE upper bound of |
Appendix B Detailed Discussions on the Games
In this section, we present detailed discussions on the properties of the games.
B.1 Strong Stability
We establish the following sufficient condition for strong stability.
Lemma B.1.
Define the matrix as a block matrix with -th block taking the form of
Suppose that satisfies the smooth condition (4.2). If for some , then the Nash equilibrium is -strongly stable with respect to the Bregman divergence .
Proof.
We define the -weighted gradient of the game as
By definition, we can verify that is the Jacobian matrix of with respect to . For any , let . We have
(B.1) |
Taking additional dot product on both sides of (B.1), we have
(B.2) |
where the inequality follows from . Setting , we further have
(B.3) |
where the second inequality follows from the fact that is a Nash equilibrium. Eventually, as satisfies (4.2), we have
(B.4) |
Lemma B.2.
Define the matrix as
Suppose that is the negative entropy function. If for some , then Nash equilibrium is -strongly stable with respect to the KL divergence.
B.2 Sensitivity of Nash Equilibrium
Unconstrained game. We now provide the proof of Lemma 4.2.
Proof of Lemma 4.2.
Since for all , by the definition of , we have for all . Then, differentiating this equality with respect to on both ends, for any , we have
Thus, we have
∎
As a consequence of Lemma 4.2, we have the following lemma addressing the sensitivity of Nash equilibrium with respect to the incentive parameter .
Proof.
Simplex-Constrained Game. We first provide the following lemma.
Lemma B.4 (Theorem 1 in [41]).
When and is non-singular, the Jacobian of takes the form
where
Here , where is obtained by deleting the rows in the identity matrix with .
As a consequence of Lemma B.4, we have the following lemma as the simplex constrained version of Lemma B.3.
Lemma B.5.
Appendix C Proof of Theorem 5.4
We first present the following two lemmas under the conditions presented in Theorem 5.4.
Lemma C.1.
For all , we have
Proof.
See Appendix C.1 for detailed proof. ∎
Lemma C.2.
For all , we have
Proof.
See Appendix C.2 for detailed proof. ∎
Now we are ready to present the proof of Theorem 5.4.
Proof of Theorem 5.4.
C.1 Proof of Lemma C.1
Proof.
Recall that
We have
where the second inequality follows from the fact that . Taking the conditional expectation given , we obtain
(C.3) | ||||
where the second inequality follows from Assumption 5.3. Summing up (C.3) for all , we have
(C.4) |
By the -strong variational stability of , we have
(C.5) |
Moreover, by Assumption 5.3 we have
summing up which gives
(C.6) |
where the first inequality follows from the optimality condition that , and the second inequality follows from Assumption 5.1. Thus, taking (C.5) and (C.1) into (C.1), we obtain
(C.7) | ||||
where the last inequality holds with , which is satisfied by . By Lemma E.1, we have for any ,
(C.8) | ||||
where the third inequality follows from Lemma B.3, and the last inequality follows from the fact that the proximal mapping is non-expansive. Taking (C.8) into (C.7) and choosing , we obtain
(C.9) | ||||
By Assumptions 5.2 and 5.3, we have
(C.10) | ||||
taking which into (C.9) and taking expectation on both sides, we further obtain
By , we have
where the first inequality follows from , and the second inequality follows from , , and . Thus, we obtain
(C.11) |
Recursively applying (C.11), we obtain
Thus, we conclude the proof of Lemma C.1. ∎
C.2 Proof of Lemma C.2
Proof.
Since the projection is non-expansive, we have
Taking the conditional expectation given , we obtain
(C.12) |
where the second inequality follows from Assumption 5.2 and the Cauchy-Schwartz inequality, and the last inequality follows from Assumption 5.2. Applying (C.10) to (C.2) and taking expectation on both sides, we get
(C.13) |
Recursively applying (C.13), we obtain
Thus, we conclude the proof of Lemma C.2. ∎
Appendix D Proof of Theorem 5.5
We first present the following two lemmas under the conditions presented in Theorem 5.5
Lemma D.1.
For all , we have
Proof.
See Appendix D.1 for detailed proof. ∎
Lemma D.2.
For all , we have
Proof.
See Appendix D.2 for detailed proof. ∎
Now we are ready to prove Theorem 5.5.
Proof of Theorem 5.5.
For , by [21, Lemma C.4], we have
Thus, by Lemma C.1, we have
(D.1) |
Since , we have
taking which into (D), we obtain
(D.2) |
where
Similarly, we have for ,
Thus, combining (C) with Lemma C.2, we obtain
(D.3) | ||||
Here the third inequality follows from . Therefore, (D) and (D.3) conclude the proof of Theorem 5.5. ∎
D.1 Proof of Lemma D.1
Proof.
We first show that
(D.4) |
Since
we have the exact form of as
(D.5) |
By the definition of KL divergence, we have
(D.6) | ||||
Let be the normalization factor of the exact form of in (D.5). We have
(D.7) | ||||
where the last equality follows from the fact that . Taking (D.7) into (D.6), we obtain
which concludes the proof of (D.4).
Continuing from (D.4), we have
where the second inequality follows from the fact that . Taking the conditional expectation given , we obtain
(D.8) | ||||
where the last inequality follows from Cauchy-Schwartz inequality and the fact that . By the -strong variational stability of , we have
(D.9) |
Moreover, by Assumption 5.3, we have
summing up which gives
(D.10) | ||||
where the second inequality follows from , and the third inequality follows from Assumption 5.1. Thus, taking (D.9) and (D.10) into (D.8), we obtain for (i.e., ) that,
where the second inequality follows from Lemma E.3. By Lemma E.3, we further have for ,
(D.11) | ||||
By Lemma E.2, we have for any ,
(D.12) | ||||
where the second inequality follows from Lemma B.3. Taking (D.12) into (D.11) and choosing , we obtain
(D.13) | ||||
By Assumption 5.3, we have
(D.14) | ||||
where the third inequality follows from Assumptions 5.2 and 5.3, and the last inequality follows from Lemma E.3. Taking (D.14) into (D.13) and taking expectation on both sides, we obtain
(D.15) |
With , , , , and , we have
taking which into (D.1), we obtain
(D.16) |
Recursively applying (D.1), we obtain
∎
D.2 Proof of Lemma D.2
Appendix E Properties of the Bregman Divergence
The following lemma is used in the analysis of unconstrained games.
Lemma E.1.
Let be -strongly convex with respect to the norm . Assume that (4.2) holds. We have for any ,
Proof.
By the definition of Bregman divergence, we have for any ,
where the inequality follows from -strong convexity of and the Cauchy-Schwartz inequality. By (4.2), we further have
(E.1) |
where the second inequality follows from -strong convexity of . Rearranging the terms in (E), we finish the proof of Lemma E.1. ∎
The following two lemmas are involved in the analysis of simplex constrained games.
Lemma E.2.
Proof.
Lemma E.3.
Let and be the sequences of strategy profiles generated by Algorithm 2 with . We have
(E.2) | ||||
(E.3) | ||||
(E.4) |
Proof.
Appendix F Extensions to Games with Multiple Equilibria
In this section, we provide more details on how to extend our algorithms to games with multiple equilibria. As discussed in Section 6, the only challenging case is when the function is monotone but not strongly monotone, so that the equilibrium set is a convex and closed region.
Example F.1.
We consider a specific routing game (see Example 3.2). Specifically, we study a simple network as shown in Figure 2. Suppose that the number of nonatomic agents aiming to travel from node 1 to node 4 is .

The costs for using the 4 edges are given by , , and , respectively. There are 4 paths connecting node 1 and node 4 (path 1 uses link 1 and link 3, path 2 uses link 2 and link 4, path 3 uses link 1 and link 4, path 4 uses link 2 and 3). Let be the number of agents using each path and be cost of using each path. Then we have
where . Therefore, we have , which is semi-positive definite but not positive definite because the matrix is singular. Under this setting, the set of equilibria to this routing game can be written as
(F.1) |
Yet, we can simply add a regularizer to each . We note that the Jacobian of is . The unique equilibrium then would be strongly stable (see Lemma B.2). Meanwhile, if , the Jacobian would also be upper-bounded, hence the function is still Lipschitz continuous. We summarize a few properties of this regularizer. First, as long as and , all conditions in Assumption 5.1 would be satisfied. Second, if and , the resulting equilibrium is often referred to as a quantal response equilibrium, which is more realistic than a Nash (Wardrop) equilibrium if we believe that human beings are not always fully rational [42, 1, 33]. Third, if and , the resulting equilibrium would be a “smoothed” quantal response equilibrium. It is close to the quantal response equilibrium when is sufficiently small, but it preserves the Lipschitz continuity.
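The effect of the entropy regularizer can be checked numerically. The sketch below builds the path-cost Jacobian for the four-path network of Example F.1 under hypothetical affine edge delays with unit slopes, confirms that it is only positive semi-definite, and shows that adding the entropy term makes it positive definite at any interior point; the slopes, the weight of the regularizer, and the evaluation point are illustrative.

```python
import numpy as np

# Path-edge incidence (rows: edges 1-4, columns: paths 1-4), following Example F.1:
# path 1 uses edges 1,3; path 2 uses edges 2,4; path 3 uses edges 1,4; path 4 uses edges 2,3.
Delta = np.array([[1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [1, 0, 0, 1],
                  [0, 1, 1, 0]], dtype=float)
s = np.ones(4)                                 # hypothetical slopes of the affine edge delays

J = Delta.T @ np.diag(s) @ Delta               # Jacobian of the unregularized path-cost map
print(np.linalg.eigvalsh(J))                   # smallest eigenvalue is (numerically) zero

eps = 0.1                                      # weight of the entropy regularizer
x = np.full(4, 0.25)                           # an interior strategy profile
J_reg = J + eps * np.diag(1.0 / x)             # the entropy term adds eps * diag(1 / x)
print(np.linalg.eigvalsh(J_reg))               # all eigenvalues strictly positive
```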