
Online Optimization in Games via Control Theory:
Connecting Regret, Passivity and Poincaré Recurrence

Yun Kuen Cheung
Royal Holloway
University of London
   Georgios Piliouras
Singapore University of
Technology and Design
Abstract

We present a novel control-theoretic understanding of online optimization and learning in games, via the notion of passivity. Passivity is a fundamental concept in control theory, which abstracts energy conservation and dissipation in physical systems. It has become a standard tool in the analysis of general feedback systems, to which game dynamics belong. Our starting point is to show that all continuous-time Follow-the-Regularized-Leader (FTRL) dynamics, which include the well-known Replicator Dynamic, are lossless, i.e. passive with no energy dissipation. Interestingly, we prove that passivity implies bounded regret, connecting two fundamental primitives of control theory and online optimization.

The observation of energy conservation in FTRL inspires us to present a family of lossless learning dynamics, each of which has an underlying energy function with a simple gradient structure. This family is closed under convex combination; as an immediate corollary, any convex combination of FTRL dynamics is lossless and thus has bounded regret. This allows us to extend the framework of Fox and Shamma [FS13] to prove not just global asymptotic stability results for game dynamics, but Poincaré recurrence results as well. Intuitively, when a lossless game (e.g. graphical constant-sum game) is coupled with lossless learning dynamics, their feedback interconnection is also lossless, which results in a pendulum-like energy-preserving recurrent behavior, generalizing [PS14, MPP18].

Lossless Learning Dynamic | Physical System
  • examples: Replicator Dynamic, Follow-the-Regularized-Leader (FTRL) dynamics | gravity
  • state: $\mathbf{q} = \int \mathbf{p}(t)\,\mathsf{d}t$ = cumulative payoffs | $h$ = vertical height
  • energy: $E(\mathbf{q})$ = storage function | $V(h)$ = (negative) potential energy
  • gradient of energy: $\mathbf{x} = \nabla E(\mathbf{q})$ = mixed strategies of agents | $F = \nabla V(h)$ = gravitational force
  • invariant property: convex combination (CC): any CC of storage functions produces a lossless learning dynamic that yields the same CC of mixed strategies | linear combination (LC): any LC of potential energy functions is a potential energy function that yields the same LC of gravitational forces
  • another analogue: $\mathbf{p}$ = instantaneous payoffs | $v$ = velocity
  • change of energy value: $\int \left\langle \mathbf{x},\mathbf{p} \right\rangle \mathsf{d}t$ | $\int \left\langle F,v \right\rangle \mathsf{d}t$

Figure 1: Analogues between a lossless online learning dynamic and a physical system. The evolution of a system of learning dynamics can be thought of as capturing the movements of particles; thus tools from control theory can find direct application in the study of learning dynamics in games.

1 Introduction

Online optimization aims at designing algorithms that can maximize performance in unpredictable and even adversarially evolving environments. The standard benchmark for success in these environments is minimizing regret, which is defined as the difference between the accumulated performance of the algorithm and that of the best action in hindsight. One of the most important achievements of the field has been to establish that such regret minimizing algorithms exist [CBL06, SS12]. Amongst the most well-known such algorithms is the class of Follow-the-Regularized-Leader (FTRL) algorithms, which include as special cases ubiquitous meta-algorithms such as Multiplicative Weights Update [FS99, AHK05] and Gradient Descent [SS12]. It is well known that such algorithms can achieve $\mathcal{O}(\sqrt{T})$ regret by employing slowly decreasing step-sizes, and that this bound is effectively optimal given arbitrary payoff sequences. When applying such algorithms in games, as the sequence of payoffs becomes more predictable, stronger regret guarantees are possible [RS13, FLST16, SALS15, BP19]. In the continuous-time model, FTRL dynamics are once again optimal, achieving bounded regret in general settings [KM17, MPP18, BGP20]. Hence, both from the perspective of optimization as well as game theory, FTRL dynamics constitute effectively an optimal choice.

Control theory, on the other hand, is motivated by a seemingly unrelated set of questions. It aims to develop methodologies for stabilizing complex processes and machines. Due to its intrinsic connections to real-world systems, control theory revolves around concepts with a strong grounding in physical systems. A fundamental property of numerous physical systems is passivity, which is typically defined in terms of energy dissipation, conservation and transformation [Wil72a, Wil72b, OPNSR13]. Passivity is an "input-output" property of a system, and expresses that a system which is supplied with bounded energy can only output bounded energy. Passive systems come equipped with a storage function that accumulates the supplied energy, perhaps with some loss (cf. energy loss due to friction in mechanical systems). Overall, passivity encodes a useful notion of stability, since such a system cannot explode into unpredictable, out-of-control motion, as that would correspond to unbounded energy output.

Although the fields of online optimization and control theory are both well developed with long and distinct histories, their interconnection is still rather nascent. Online algorithms can be abstractly thought of as input-output operators where the input is a stream of payoffs, and the output is a stream of behavioral outcomes. Both regret and passivity are properties of such input-output algorithms/operators, and both encode a notion of predictability and stability around a reference frame. In regret, the reference frame is given by the cumulative payoff of past actions; in passivity, by energy level sets. This raises our first set of questions:

Are there formal connections between regret and passivity? Moreover, can we interpret the optimal regret of FTRL dynamics from a passivity perspective? Are there similarly optimal learning dynamics / input-output operators?

Any formal connection across the two fields is clearly valuable, as it allows for a fusion of ideas and methodologies between two well-developed fields and expedites progress on areas of interest common to both, such as game theory [FL98, CBL06, MS15, MS18]. For related issues on the intersection of learning, control theory and games, see [Sha20] and the references therein.

Notably, Fox and Shamma [FS13] proposed a control-theoretic framework for analyzing learning in games. One of their key contributions is to identify a modular approach, where an analysis can be performed by studying a learning operator (which converts payoff input to strategy output) and a game operator (which converts strategy input to payoff output) independently, while the whole game dynamic is a feedback interconnection system of the two operators (see Figure 2). By focusing on coupling learning heuristics that are strictly passive with passive games (e.g. zero-sum games), the resulting strictly passive systems were shown to converge to equilibria, generalizing and unifying numerous prior results, e.g. [HS09].

The modular approach has allowed numerous works which study learning or game operators separately [MS16, PSM18, Mab18, GP19]. Despite this progress, settings of critical importance for AI, such as understanding the perfectly recurrent non-equilibrating behaviors of Gradient Descent (and other FTRL dynamics) in zero-sum games, have so far remained outside the reach of these techniques [PS14, MPP18, BRM+18, VGFP19, PML+20]. This raises our second question:

Can the pendulum-like cyclic behavior of FTRL dynamics in zero-sum games be understood and generalized via passivity?

Figure 2: A feedback interconnection system that captures a game dynamic, by interconnecting a learning operator and a game operator. $\hat{\mathbf{r}}$ is a (random) perturbation to the payoffs; in this paper, we only consider the case where $\hat{\mathbf{r}}$ is the zero function.

Our Contributions.

We provide affirmative answers to both questions raised above. We show that any finitely passive learning dynamic guarantees constant regret (Theorem 8). (In the control theory literature, a passive operator usually has a storage function with a finite lower bound. In this paper, storage functions may or may not have a finite lower bound, depending on the type of operator; we emphasize operators whose storage function has a finite lower bound as being finitely passive. For the definition of a finitely passive learning dynamic, see Definition 5.) By using the notion of convex conjugate from convex analysis, we show that any continuous-time FTRL dynamic is finitely passive and lossless (Theorem 7); the same holds for certain escort replicator dynamics [Har11] (Appendix D). These results generalize [Mab18], which showed that Replicator Dynamic is finitely lossless. Combining the two theorems above immediately recovers the result in [MPP18] that any FTRL dynamic guarantees constant regret. We note that in the analysis of Mabrok [Mab18], the state space (i.e. the domain of the storage function) is the space of mixed strategies, while we turn to a new state space of cumulative payoffs for FTRL. This choice is crucial for the generalization to FTRL dynamics, and it permits a cleaner proof via tools established in convex analysis.

A key observation that enables the above results is that every FTRL dynamic admits a storage function with a simple gradient structure, described formally in Section 5. This motivates us to study a new family of lossless learning dynamics, which is in one-to-one correspondence with the family of storage functions possessing that gradient structure. By observing that such storage functions are closed under convex combination, we discover that any convex combination of FTRL dynamics is finitely passive and lossless, and thus guarantees constant regret (Theorem 16). "Convex combination of FTRL dynamics" means the following: suppose there are $k$ FTRL dynamics, indexed from $1$ to $k$. The $j$-th one converts the cumulative payoffs at time $t$ to a mixed strategy $\mathbf{x}_j^t$. A convex combination of these FTRL dynamics converts the cumulative payoffs at time $t$ to the mixed strategy $\sum_{j=1}^k \alpha_j \cdot \mathbf{x}_j^t$, where the $\alpha_j$'s are positive constants satisfying $\sum_{j=1}^k \alpha_j = 1$.

Convex combinations of lossless dynamics are directly analogous to linear combinations of conservative vector fields in analyzing physical dynamics (see Figure 1). This technique is also of practical relevance, since we might want to mix-and-match different dynamics to elicit their advantages. For instance, different learning dynamics may lean toward either exploitation or exploration. By combining them via convex combination with our own choice of $\alpha_j$'s, we can control our desired balance between exploitation and exploration (see Example 14).

We also show that every graphical constant-sum game (e.g. two-person zero-sum game) that admits a fully-mixed Nash equilibrium corresponds to a finitely lossless game operator (Proposition 17). Thus, the game dynamic of any convex combination of FTRL dynamics in such a game corresponds to a finitely lossless operator. We use this observation to show that the game dynamic is almost perfectly recurrent, via the notion of Poincaré recurrence (Theorem 19). This distinguishes our work from [FS13] and its subsequent works on learning in games, as they mostly concern stability/convergence, while we study recurrence.

Roadmap.

In Sections 2 and 3, we present the necessary background for this work, including the definitions of different operators, the notions of (lossless) passivity and storage function, and some basic results about passivity. In Section 4, we show the desirable properties (finite losslessness, constant regret) of FTRL dynamics. In Section 5, we present a characterization of lossless learning dynamics via the above-mentioned gradient structure of storage functions, and we discuss some properties of convex combinations of such learning dynamics. The results about Poincaré recurrence of learning in graphical constant-sum games are presented in Section 6. All missing proofs can be found in the appendix.

2 Preliminary

In this section, we define the operators depicted in Figure 2. We first define a learning dynamic and its corresponding learning operator. When there are multiple agents and each agent uses one learning dynamic, their learning operators can be concatenated naturally to form a merged learning operator. Then we define the game operator, whose interconnection with a merged learning operator in the manner of Figure 2 is called a dynamical game system.

We use bold lower-case letters to denote vector variables. Let $\mathbb{R}^+ := [0,+\infty)$ denote the set of non-negative real numbers. In this paper, every function from $\mathbb{R}^+$ to $\mathbb{R}^d$ is assumed to be square integrable, and we call it a function-of-time. An (input-output) operator is a mapping whose input and output are both functions-of-time. Let $\Delta^n$ denote the probability simplex over $n$ actions, i.e. $\Delta^n := \{(x_1,\ldots,x_n) ~|~ \sum_{j=1}^n x_j = 1; ~\text{for}~ 1\leq j\leq n,~ x_j \geq 0\}$. $\left\langle \mathbf{a},\mathbf{b} \right\rangle$ denotes the inner product of two vectors $\mathbf{a},\mathbf{b}$ of the same dimension $d$, i.e. $\left\langle \mathbf{a},\mathbf{b} \right\rangle = \sum_{j=1}^d a_j b_j$.

Learning Dynamic and Learning Operator.

We focus on the following type of continuous-time learning dynamics. An agent has an action set $A$; let $n := |A|$. The process starts at time 0. For any time $t \geq 0$ and for each action $j \in A$, the agent computes the cumulative payoff she would have received by playing action $j$ throughout the time interval $[0,t]$, denoted by $q_j(t)$. Formally, for each action $j$, let $p_j(\tau)$ denote the (instantaneous) payoff to action $j$ at time $\tau$. Then

q_j(t) := q_j^0 + \int_0^t p_j(\tau)\,\mathsf{d}\tau,

where $q_j^0$ is a constant chosen at the beginning of the process. Let $\mathbf{q}(t) := (q_1(t),\cdots,q_n(t))$. For any $t \geq 0$, the agent uses a conversion function $f$ which takes $\mathbf{q}(t)$ as input, and outputs a mixed strategy $\mathbf{x}(t) \in \Delta^n$ over the $n$ actions. The process can be expressed compactly as an ordinary differential equation (ODE) system:

\mathbf{q}(0) = \mathbf{q}^0    (Initial condition)
\dot{\mathbf{q}}(t) = \mathbf{p}(t)    (Cumulative payoff/state update)    (1)
\mathbf{x}(t) = f(\mathbf{q}(t)).    (Behavioral/strategy output)

When $\mathbf{q}^0 = (q_1^0,\ldots,q_n^0)$ is already given, the conversion function $f$ specifies the learning dynamic. The learning dynamic can be viewed as an operator which takes the function-of-time $\mathbf{p} : \mathbb{R}^+ \rightarrow \mathbb{R}^n$ as input, and outputs another function-of-time $\mathbf{x} : \mathbb{R}^+ \rightarrow \Delta^n$.
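
To make the operator view concrete, the following is a minimal Python sketch (our own illustration, not from the paper) that discretizes the ODE system (1) with a forward-Euler step: it consumes a payoff stream and produces a strategy stream through a user-supplied conversion function. The function names and step size are illustrative choices.

```python
import numpy as np

def simulate_learning_dynamic(f, p, q0, dt=0.01, T=10.0):
    """Forward-Euler discretization of the ODE system (1).

    f  : conversion function mapping cumulative payoffs q -> mixed strategy x
    p  : payoff stream, a function of time t returning a vector in R^n
    q0 : initial cumulative payoffs q^0
    Returns the trajectories of q(t) and x(t) at the grid points.
    """
    q = np.array(q0, dtype=float)
    ts = np.arange(0.0, T, dt)
    qs, xs = [], []
    for t in ts:
        x = f(q)              # strategy output: x(t) = f(q(t))
        qs.append(q.copy())
        xs.append(x)
        q = q + dt * p(t)     # state update: dq/dt = p(t)
    return ts, np.array(qs), np.array(xs)
```

For instance, plugging in the logit choice map (2) below as `f` yields a discretized Replicator Dynamic trajectory.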

Regret.

For any $\mathbf{p} : \mathbb{R}^+ \rightarrow \mathbb{R}^n$, the regret of a learning dynamic at time $T > 0$ is

\left( \max_{j\in A} \int_0^T p_j(\tau)\,\mathsf{d}\tau \right) - \int_0^T \left\langle \mathbf{p}(\tau), \mathbf{x}(\tau) \right\rangle \mathsf{d}\tau.

We say a learning dynamic guarantees constant regret if for any $\mathbf{p} : \mathbb{R}^+ \rightarrow \mathbb{R}^n$ and any $T > 0$, the regret at time $T$ is bounded from above by a constant that depends on $\mathbf{q}^0$ only.

Replicator Dynamic and FTRL Learning Dynamics.

When the conversion function $f$ is the logit choice map

f_{\text{RD}}(\mathbf{q}) = \left( \frac{\exp(q_1)}{\mathcal{N}}, \frac{\exp(q_2)}{\mathcal{N}}, \cdots, \frac{\exp(q_n)}{\mathcal{N}} \right),    (2)

where $\mathcal{N} = \sum_{j=1}^n \exp(q_j)$, the learning dynamic (1) is equivalent to the well-known Replicator Dynamic [HS98, San10], which is the continuous-time analogue of Multiplicative Weights Update.

A FTRL learning dynamic is specified by a strictly convex regularizer function $h : \Delta^n \rightarrow \mathbb{R}$, which determines the conversion function $f$ as follows:

f(\mathbf{q}) = \operatorname*{arg\,max}_{\mathbf{x}\in\Delta^n} \left\{ \left\langle \mathbf{q},\mathbf{x} \right\rangle - h(\mathbf{x}) \right\}.    (3)

It is known that Replicator Dynamic is a special case of FTRL, obtained by setting $h(\mathbf{x}) = \sum_{j=1}^n x_j \log x_j$. Online Gradient Descent (OGD) is another commonly studied learning dynamic that is also a special case of FTRL, with the $L^2$ regularizer $h(\mathbf{x}) = \frac{1}{2}\sum_{j=1}^n (x_j)^2$ [H+16]. When $n = 2$, the conversion function of OGD is

f_{\text{OGD}}(\mathbf{q}) = \begin{cases} (1,0) & \text{if } q_1 - q_2 \geq 1; \\ \left( \frac{q_1-q_2+1}{2}, \frac{q_2-q_1+1}{2} \right) & \text{if } 1 > q_1 - q_2 > -1; \\ (0,1) & \text{if } q_1 - q_2 \leq -1. \end{cases}    (4)
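
As a concrete illustration, here is a minimal Python sketch (our own, not part of the paper) of the two conversion functions above: the logit choice map (2) and the two-action OGD map (4). It can be plugged directly into the Euler simulation sketched earlier.

```python
import numpy as np

def f_rd(q):
    """Logit choice map (2): softmax of the cumulative payoffs."""
    z = np.exp(q - np.max(q))        # subtract max for numerical stability
    return z / z.sum()

def f_ogd_2actions(q):
    """OGD conversion function (4) for n = 2 actions."""
    d = q[0] - q[1]
    if d >= 1:
        return np.array([1.0, 0.0])
    if d <= -1:
        return np.array([0.0, 1.0])
    return np.array([(d + 1) / 2, (1 - d) / 2])

# Evaluation at q = (0.6, 0), the point used in Example 14 later in the paper.
print(f_rd(np.array([0.6, 0.0])))            # approx (0.6457, 0.3543)
print(f_ogd_2actions(np.array([0.6, 0.0])))  # (0.8, 0.2)
```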

Merged Learning Operator.

When a system has $m \geq 2$ agents, and each agent uses a learning dynamic, we concatenate the corresponding learning operators together to form a merged learning operator (MLO). Precisely, the input to the MLO is $\hat{\mathbf{p}} = (\mathbf{p}^1,\cdots,\mathbf{p}^m)$, and its output is $\hat{\mathbf{x}} = (\mathbf{x}^1,\cdots,\mathbf{x}^m)$, where $\mathbf{x}^i$ is the output of the learning operator of agent $i$ when its input is $\mathbf{p}^i$. In this paper, we use hat notation (e.g. $\hat{\mathbf{p}},\hat{\mathbf{x}}$) to denote variables formed by such concatenations of the variables of individual agents.

Game Operator and Dynamical Game System.

Here, we provide a general definition of game operators (as in Figure 2), and leave the discussion about graphical constant-sum games, which appear in our Poincaré recurrence results, to Section 6.

A game has $m$ agents. Each agent $i$ has $n_i$ actions. After each agent chooses a mixed strategy over her own actions, the game determines a payoff vector $\hat{\mathbf{p}} \in \mathbb{R}^{n_1}\times\cdots\times\mathbb{R}^{n_m}$, where $\hat{p}_{k\ell}$ is the payoff to action $\ell$ of agent $k$. Let $\Delta := \Delta^{n_1}\times\cdots\times\Delta^{n_m}$, and $\mathcal{P} := \mathbb{R}^{n_1}\times\cdots\times\mathbb{R}^{n_m}$. We can think of the game as a function $G : \Delta \rightarrow \mathcal{P}$. Its game operator takes a function-of-time $\hat{\mathbf{x}} : \mathbb{R}^+ \rightarrow \Delta$ as input, and it outputs a function-of-time $\hat{\mathbf{p}} : \mathbb{R}^+ \rightarrow \mathcal{P}$, where $\hat{\mathbf{p}}(t) = G(\hat{\mathbf{x}}(t))$ for all $t \geq 0$.

A dynamical game system (DGS) comprises $m$ agents. The game operator has input $\hat{\mathbf{x}}$ and output $\hat{\mathbf{p}}$. Each agent uses a learning dynamic of the form (1). The agents' learning operators are concatenated to form a MLO, which has input $\hat{\mathbf{p}} = (\mathbf{p}^1,\cdots,\mathbf{p}^m)$ and output $\hat{\mathbf{x}} = (\mathbf{x}^1,\cdots,\mathbf{x}^m)$ when $\hat{\mathbf{r}} \equiv \mathbf{0}$. The MLO is interconnected with the game operator in the manner of Figure 2.

3 Passivity

To motivate the notions of passivity and energy, consider an electrical network connected to a power source, where the voltage and current across the network at time $\tau$ are respectively $v(\tau)$ and $i(\tau)$. Let $E(t)$ denote the energy stored in the network at time $t$. We have $E(t) \leq E(0) + \int_0^t v(\tau)\cdot i(\tau)\,\mathsf{d}\tau$. The reason for the inequality (rather than an exact equality) is that energy might dissipate from the network. In this setting, the function-of-time $v$ is the input, while the function-of-time $i$ is the output, so the network is indeed an operator; and as we shall see, this operator is passive.

Passivity of State Space System.

To generalize the above idea to passivity of an ODE system, we need several mathematical notations. Let $\mathbb{L}_2$ denote the Hilbert space of square integrable functions mapping $\mathbb{R}^+$ to $\mathbb{R}^n$ with inner product $\left\langle f,g \right\rangle_T := \int_0^T \left\langle f(t),g(t) \right\rangle \mathsf{d}t$. Let $\mathbb{L}_{2,e} := \{ f : \mathbb{R}^+ \rightarrow \mathbb{R}^n ~|~ \left\langle f,f \right\rangle_T < \infty ~\text{for all } T\in\mathbb{R}^+ \}$. An (input-output) operator is simply a mapping $S : \mathbb{U} \rightarrow \mathbb{Y}$, where $\mathbb{U},\mathbb{Y} \subset \mathbb{L}_{2,e}$.

We consider the following type of operators, which can be represented by an ODE system, called a state space system (SSS), of the following general form:

\mathbf{z}(0) = \mathbf{z}^0;
\dot{\mathbf{z}}(t) = g_1(\mathbf{z}(t),\mathbf{u}(t));    (5)
\mathbf{y}(t) = g_2(\mathbf{z}(t),\mathbf{u}(t)),

where $\mathbf{z}^0 \in \mathcal{Z} \subset \mathbb{R}^{d_1}$, $\mathbf{z} : \mathbb{R}^+ \rightarrow \mathcal{Z}$, and $\mathbf{u},\mathbf{y} : \mathbb{R}^+ \rightarrow \mathbb{R}^{d_2}$. The set $\mathcal{Z}$ is called the set of states. As the notation suggests, $\mathbf{u}$ and $\mathbf{y}$ are the input and output of this operator respectively. When $\mathbf{u}$ is fed into the operator, the first two equalities define a well-posed ODE system, so a unique solution $\mathbf{z}$ exists under mild conditions. Then $\mathbf{y}$ is the output determined by the function $g_2$ of the unique solution $\mathbf{z}$ and the input $\mathbf{u}$. The learning dynamic (1) is such an operator, obtained by viewing $\mathbf{q},\mathbf{p},\mathbf{x},f(\mathbf{q}(t))$ in (1) as $\mathbf{z},\mathbf{u},\mathbf{y},g_2(\mathbf{z}(t),\mathbf{u}(t))$ in (5) respectively. We are ready to present the definition of passivity for such operators.

Definition 1.

A SSS is passive if there exists a storage function $L : \mathcal{Z} \rightarrow \mathbb{R}$ such that for all $\mathbf{z}^0 \in \mathcal{Z}$, $t \in \mathbb{R}^+$ and all input-output pairs $\mathbf{u}\in\mathbb{U},~\mathbf{y}\in\mathbb{Y}$, we have

L(\mathbf{z}(t)) ~\leq~ L(\mathbf{z}^0) + \int_0^t \left\langle \mathbf{u}(\tau),\mathbf{y}(\tau) \right\rangle \mathsf{d}\tau.    (6)

If the equality always holds, then we say the SSS is lossless passive or simply lossless. If a SSS is passive (resp. lossless) via a storage function LL that has a finite lower bound, we say it is finitely passive (resp. finitely lossless).

We note that inequality (6) is reminiscent of the inequality in the motivating example of the electrical network.

When a SSS is finitely passive/lossless, we may assume, without loss of generality, that the finite lower bound of its storage function is zero. We make this assumption on all finitely passive/lossless SSS in the rest of this paper.

Figure 3: A FIC system which includes two operators $S_1$ and $S_2$. We assume $\mathbf{r} \equiv \mathbf{0}$. $S_1$ is a merged learning operator which converts payoffs to mixed strategies of agents with a shift $\hat{\mathbf{x}}^*$, while $S_2$ is a game operator which converts the mixed strategies of agents (with shift $\hat{\mathbf{x}}^*$) to the negation of payoffs.

Feedback Interconnection System.

In Figure 2, we presented a feedback interconnection (FIC) system. While that figure gives an intuitive first understanding of the concept from control theory, for our analysis it is more convenient to use Figure 3. The FIC system in Figure 3 consists of two SSS $S_1 : \mathbb{U} \rightarrow \mathbb{Y}$ and $S_2 : \mathbb{Y} \rightarrow \mathbb{U}$, with an external input source $\mathbf{r} \in \mathbb{U}$, while its output is $\mathbf{y}_1 \in \mathbb{Y}$; note that the FIC system is an operator by definition. The variables are related via: $\mathbf{u}_1 = \mathbf{r} - \mathbf{y}_2$, $\mathbf{y}_1 = S_1(\mathbf{u}_1)$, $\mathbf{u}_2 = \mathbf{y}_1$, and $\mathbf{y}_2 = S_2(\mathbf{u}_2)$.

An important property of passive operators is that passivity is composable, i.e. the composition of two passive operators results in a passive system. Intuitively, if no operator in the total system is able to produce energy, then the system as a whole cannot produce energy either. The following theorem formally captures this intuition for FIC systems.

Theorem 2 ([FS13] Theorem 3.2).

Consider the FIC system in Figure 3. Suppose that for $i = 1,2$, $S_i$ is passive via storage function $L_i$. Then the FIC system is a passive operator via storage function $L_1 + L_2$. Precisely, for any $\mathbf{z}_1^0 \in \mathcal{Z}_1$, $\mathbf{z}_2^0 \in \mathcal{Z}_2$ and $t \in \mathbb{R}^+$,

L_1(\mathbf{z}_1(t)) + L_2(\mathbf{z}_2(t)) ~\leq~ L_1(\mathbf{z}_1^0) + L_2(\mathbf{z}_2^0) + \int_0^t \left\langle \mathbf{r}(\tau),\mathbf{y}_1(\tau) \right\rangle \mathsf{d}\tau.

If $S_1,S_2$ are lossless, then the FIC system is lossless via storage function $L_1 + L_2$, i.e. the inequality above becomes an equality.

Note that in the above theorem, if the FIC system is lossless and $\mathbf{r}$ is the zero function, then $L_1(\mathbf{z}_1(t)) + L_2(\mathbf{z}_2(t)) = L_1(\mathbf{z}_1^0) + L_2(\mathbf{z}_2^0)$ for all $t \geq 0$, i.e. the value of $L_1 + L_2$ does not change over time. The underlying dynamic is then said to admit a constant-of-motion.

DGS as a FIC System.

In the context of DGS, Figure 3 is obtained after several modifications from Figure 2:

  • In Figure 2, the game operator's output is $\hat{\mathbf{p}}$, which is added to $\hat{\mathbf{r}}$ to form the MLO's input. In Figure 3, the game operator's output is $-\hat{\mathbf{p}}$ instead, and it is subtracted from $\mathbf{r}$ to form the MLO's input.

  • The MLO's output is $\hat{\mathbf{x}}$ in Figure 2, but the MLO's output in Figure 3 has a constant shift $\hat{\mathbf{x}}^* = (\mathbf{x}^{*,1},\ldots,\mathbf{x}^{*,m}) \in \Delta$. Precisely, the output is $\hat{\mathbf{x}} - \hat{\mathbf{x}}^*$. We call this operator the "MLO with shift $\hat{\mathbf{x}}^*$".

  • In Figure 3, the game operator's input is $\hat{\mathbf{x}} - \hat{\mathbf{x}}^*$ instead of $\hat{\mathbf{x}}$, while its output is $-\hat{\mathbf{p}}$ instead of $\hat{\mathbf{p}}$.

Basic Results about Passivity of Learning Operators.

By viewing a DGS as a FIC system, we are interested in MLOs and game operators which possess good properties like passivity. The key advantage of this approach is that it permits a DGS to be decoupled into two distinct operators, which can be analysed separately.

$S_1$ is a MLO. First, we show that it is passive if all the learning operators possessed by the agents are passive. This allows us to turn our focus to analyzing whether each individual learning operator is passive/lossless or not.

Proposition 3.

Suppose that for each agent $i$, her learning operator with shift $\mathbf{x}^{*,i}$ is passive (resp. lossless) via a storage function $L^i$. Then the MLO with shift $\hat{\mathbf{x}}^* = (\mathbf{x}^{*,1},\ldots,\mathbf{x}^{*,m})$ is passive (resp. lossless) via the storage function $\sum_{i=1}^m L^i$.

By the next proposition, the choice of shift does not affect passivity of the learning operator. Thus, we say a learning dynamic is passive when its learning operator with any shift is passive.

Proposition 4.

Let $S^a, S^b$ be two learning operators of the same learning dynamic, with shifts $\mathbf{x}^{*,a}, \mathbf{x}^{*,b}$ respectively. $S^a$ is passive (resp. lossless) via storage function $L^a(\mathbf{q})$ if and only if $S^b$ is passive (resp. lossless) via storage function $L^a(\mathbf{q}) - \left\langle \mathbf{x}^{*,b} - \mathbf{x}^{*,a}, \mathbf{q} \right\rangle + c$, where $c$ is any constant.

The above proposition works even when the shifts are not mixed strategies, e.g. when $\mathbf{x}^{*,a}$ is the zero vector. We use $E$ to denote a storage function of a learning operator with zero shift, and $L$ to denote a storage function of a learning operator whose shift is a mixed strategy.

While a shift does not affect passivity, it does affect whether the learning operator is finitely passive or not. In order to prove that certain learning dynamics guarantee constant regret, we need their learning operators with some specific shifts to be finitely passive. This motivates the following definition of finitely passive learning dynamics. Let $\mathbf{e}_j$ denote the vector whose $j$-th entry is one and all other entries are zero.

Definition 5.

A learning dynamic is finitely passive (resp. finitely lossless) if for every action $j$, its learning operator with shift $\mathbf{e}_j$ is finitely passive (resp. finitely lossless).

Proposition 6.

If a learning dynamic is finitely passive (resp. finitely lossless), then for any mixed strategy $\mathbf{x}^*$, the learning operator with shift $\mathbf{x}^*$ is finitely passive (resp. finitely lossless).

4 Passivity of Learning Operators

4.1 FTRL Dynamics are Finitely Lossless

We start the analysis by establishing a strong connection between FTRL dynamics and passivity. Specifically, FTRL dynamics are finitely lossless.

Theorem 7.

Given any FTRL dynamic over $n$ actions and with regularizer function $h$, let the convex conjugate of $h$ be

h^*(\mathbf{q}) := \max_{\mathbf{x}\in\Delta^n} \left\{ \left\langle \mathbf{q},\mathbf{x} \right\rangle - h(\mathbf{x}) \right\}.    (7)

Then for any $\mathbf{x}^* \in \Delta^n$, the learning operator with shift $\mathbf{x}^*$ is finitely lossless via the storage function $L(\mathbf{q})$ given below:

L(\mathbf{q}) ~=~ h^*(\mathbf{q}) - \left\langle \mathbf{q},\mathbf{x}^* \right\rangle + h(\mathbf{x}^*).    (8)

In particular, for any action $j$, the learning operator with shift $\mathbf{e}_j$ is finitely lossless, and hence any FTRL dynamic is finitely lossless by Definition 5.

Proof: .

By the "maximizing argument" identity on p. 149 of [SS12], we have $\nabla h^*(\mathbf{q}(t)) = \mathbf{x}(t)$. Hence, $\nabla L(\mathbf{q}(t)) = \nabla h^*(\mathbf{q}(t)) - \mathbf{x}^* = \mathbf{x}(t) - \mathbf{x}^*$. By the chain rule, $\frac{\mathsf{d}L(\mathbf{q}(t))}{\mathsf{d}t} = \left\langle \mathbf{x}(t) - \mathbf{x}^*, \mathbf{p}(t) \right\rangle$, and hence $L(\mathbf{q}(t)) = L(\mathbf{q}(0)) + \int_0^t \left\langle \mathbf{x}(\tau) - \mathbf{x}^*, \mathbf{p}(\tau) \right\rangle \mathsf{d}\tau$, verifying that the operator is lossless. Moreover, $L$ is bounded below by zero, since by the definition of $h^*$, for any $\mathbf{q}$, $L(\mathbf{q}) = h^*(\mathbf{q}) - \left\langle \mathbf{q},\mathbf{x}^* \right\rangle + h(\mathbf{x}^*) \geq \left( \left\langle \mathbf{q},\mathbf{x}^* \right\rangle - h(\mathbf{x}^*) \right) - \left\langle \mathbf{q},\mathbf{x}^* \right\rangle + h(\mathbf{x}^*) = 0$. ∎
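
For intuition, the identity in the proof can be checked numerically. The following self-contained Python sketch (our own illustration, not from the paper) uses the entropy regularizer $h(\mathbf{x}) = \sum_j x_j \log x_j$, whose convex conjugate over the simplex is $h^*(\mathbf{q}) = \log\sum_j \exp(q_j)$, and verifies along a simulated trajectory that $L(\mathbf{q}(t)) - L(\mathbf{q}(0))$ matches the supplied energy $\int_0^t \left\langle \mathbf{x}(\tau)-\mathbf{x}^*, \mathbf{p}(\tau) \right\rangle \mathsf{d}\tau$ up to discretization error.

```python
import numpy as np

def h_star(q):                       # convex conjugate of the entropy regularizer
    return np.log(np.sum(np.exp(q)))

def f_rd(q):                         # logit choice map (2), i.e. the gradient of h_star
    z = np.exp(q - np.max(q))
    return z / z.sum()

def h(x):                            # entropy regularizer (0 log 0 := 0)
    x = np.clip(x, 1e-300, 1.0)
    return np.sum(x * np.log(x))

n, dt, steps = 3, 1e-3, 20000
x_star = np.array([0.2, 0.5, 0.3])   # an arbitrary shift in the simplex
L = lambda q: h_star(q) - q @ x_star + h(x_star)   # storage function (8)

q = np.array([0.4, -0.1, 0.7])       # arbitrary initial cumulative payoffs
supplied = 0.0                       # running value of the supplied-energy integral
L0 = L(q)
for t in range(steps):
    p = np.sin(0.01 * t * np.arange(1, n + 1))     # some bounded payoff stream
    supplied += dt * (f_rd(q) - x_star) @ p
    q = q + dt * p

print(L(q) - L0, supplied)           # the two values agree up to O(dt)
```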

We summarize the properties of the operators of FTRL dynamics with various shifts in Figure 4.

FTRL with zero shift | FTRL with shift of a mixed strategy $\mathbf{x}^*$
  • input: $\mathbf{p}$ (payoff) | $\mathbf{p}$
  • output: $\mathbf{x}$ (mixed strategy) | $\mathbf{x} - \mathbf{x}^*$
  • state: $\mathbf{q}$ (cumulative payoff) | $\mathbf{q}$
  • storage function: $E(\mathbf{q}) = h^*(\mathbf{q})$ (see (7)) | $L(\mathbf{q}) = h^*(\mathbf{q}) - \left\langle \mathbf{q},\mathbf{x}^* \right\rangle + h(\mathbf{x}^*)$
  • infimum: $-\infty$ | 0
  • property: lossless | finitely lossless
Figure 4: FTRL learning operators with various shifts.

4.2 Relationship to Regret

Theorem 8.

Any finitely passive learning dynamic guarantees constant regret.

Proof: .

Let $L^j$ denote the storage function of the learning operator with shift $\mathbf{e}_j$. Since the learning dynamic is finitely passive, we can assume the infimum of $L^j$ is zero. By the definition of passivity,

L^j(\mathbf{q}(t)) ~\leq~ L^j(\mathbf{q}(0)) + \int_0^t \left\langle \mathbf{x}(\tau) - \mathbf{e}_j, \mathbf{p}(\tau) \right\rangle \mathsf{d}\tau.

Hence, the regret w.r.t. action $j$ at time $t$ satisfies:

\int_0^t \left\langle \mathbf{e}_j, \mathbf{p}(\tau) \right\rangle \mathsf{d}\tau ~-~ \int_0^t \left\langle \mathbf{x}(\tau), \mathbf{p}(\tau) \right\rangle \mathsf{d}\tau ~\leq~ L^j(\mathbf{q}(0)) - L^j(\mathbf{q}(t)) ~\leq~ L^j(\mathbf{q}(0)).

Thus, the regret up to time $t$ is bounded by $\max_j \left\{ L^j(\mathbf{q}(0)) \right\}$, which is a constant that depends only on the initial state $\mathbf{q}^0$. ∎
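
As a sanity check of this bound, the following self-contained Python sketch (our own, assuming the entropy regularizer, for which $L^j(\mathbf{q}^0) = \log\sum_\ell \exp(q^0_\ell) - q^0_j$) simulates Replicator Dynamic against a bounded payoff stream and compares the realized regret with the constant $\max_j L^j(\mathbf{q}^0)$.

```python
import numpy as np

def f_rd(q):                          # logit choice map (2)
    z = np.exp(q - np.max(q))
    return z / z.sum()

n, dt, steps = 3, 1e-3, 50000
q0 = np.zeros(n)
q = q0.copy()
cum_payoff = np.zeros(n)              # integral of p_j(tau) for each action j
cum_reward = 0.0                      # integral of <x(tau), p(tau)>

for t in range(steps):
    p = np.cos(0.002 * t * np.arange(1, n + 1))   # some bounded payoff stream
    x = f_rd(q)
    cum_payoff += dt * p
    cum_reward += dt * x @ p
    q = q + dt * p

regret = cum_payoff.max() - cum_reward
bound = max(np.log(np.sum(np.exp(q0))) - q0[j] for j in range(n))  # = log(n) here
print(regret, bound)                  # regret stays below the bound, up to discretization error
```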

As an immediate corollary of Theorems 7 and 8, all FTRL dynamics guarantee constant regret.

Corollary 9 ([MPP18]).

Every FTRL dynamic guarantees constant regret.

Theorem 8 states that finite passivity implies constant regret. The following proposition states that the converse (constant regret implies finite passivity) is also true, if we restrict to lossless learning dynamics.

Proposition 10.

Suppose that a learning dynamic is lossless. Then the learning dynamic guarantees constant regret if and only if it is finitely lossless.

5 A Characterization of Lossless Learning Dynamics, and Their Convex Combinations

In Section 4.1, we showed that every FTRL dynamic is lossless. A FTRL dynamic is specified by a convex regularizer function $h$, while the storage function $L$ has an implicit form in terms of $h$. Here, we ask how we can specify the storage function directly to generate a lossless learning dynamic. Not all storage functions work, so we first seek some necessary conditions on them. These conditions are stated using the storage function $E$ of the learning operator with zero shift. By Proposition 4, the lossless storage function of the learning operator with shift $\mathbf{x}^*$ is $E(\mathbf{q}) - \left\langle \mathbf{q},\mathbf{x}^* \right\rangle + c$.

Suppose there is a lossless learning dynamic in the form of (1); recall that when $\mathbf{q}^0$ is already given, the learning dynamic is specified by its conversion function $f$. Let $E$ be the storage function of its learning operator with zero shift. We present two necessary conditions on $E$ and $f$.

Necessary Condition 1 (NC1). (i) For any $j = 1,\cdots,n$, $\nabla_j E(\mathbf{q}) \geq 0$. (ii) For any $\mathbf{q} \in \mathbb{R}^n$, we have $\nabla E(\mathbf{q}) = f(\mathbf{q})$.

Reason: Since the learning dynamic is lossless, $E(\mathbf{q}(t)) = E(\mathbf{q}^0) + \int_0^t \left\langle \mathbf{x}(\tau),\mathbf{p}(\tau) \right\rangle \mathsf{d}\tau$. Taking the time-derivative on both sides yields

\frac{\mathsf{d}E(\mathbf{q}(t))}{\mathsf{d}t} = \left\langle \mathbf{x}(t),\mathbf{p}(t) \right\rangle = \left\langle f(\mathbf{q}(t)),\mathbf{p}(t) \right\rangle.

On the other hand, by the chain rule, we have

\frac{\mathsf{d}E(\mathbf{q}(t))}{\mathsf{d}t} = \left\langle \nabla E(\mathbf{q}(t)),\dot{\mathbf{q}} \right\rangle = \left\langle \nabla E(\mathbf{q}(t)),\mathbf{p}(t) \right\rangle.

Thus, $\left\langle f(\mathbf{q}(t)),\mathbf{p}(t) \right\rangle = \left\langle \nabla E(\mathbf{q}(t)),\mathbf{p}(t) \right\rangle$ for any $\mathbf{p}(t)$ and $\mathbf{q}(t)$. This readily implies that condition (ii) holds. (To argue this formally, for any $\mathbf{q} \in \mathbb{R}^n$ and for each action $j$, we construct a continuous $\mathbf{p} : [0,t] \rightarrow \mathbb{R}^n$ such that $\mathbf{q}^0 + \int_0^t \mathbf{p}(\tau)\,\mathsf{d}\tau = \mathbf{q}$ and $\mathbf{p}(t) = \mathbf{e}_j$; this implies that the $j$-th component of $f(\mathbf{q})$ equals the $j$-th component of $\nabla E(\mathbf{q})$. The construction is easy to make.) Since $f(\mathbf{q}) \in \Delta^n$, condition (i) holds.

Necessary Condition 2 (NC2). For any real number $r$ and any $\mathbf{q}$, $E(\mathbf{q} + r\cdot\mathbf{1}) = E(\mathbf{q}) + r$.

Reason: By NC1, $\nabla E(\mathbf{q}) = f(\mathbf{q})$, which is in $\Delta^n$. Thus, the directional derivative of $E$ along the direction $\mathbf{1}$ is $\left\langle \nabla E(\mathbf{q}),\mathbf{1} \right\rangle = \left\langle f(\mathbf{q}),\mathbf{1} \right\rangle = 1$.

Indeed, it is easy to verify that the above two necessary conditions are also sufficient.

Proposition 11.

Any smooth function $E(\mathbf{q})$ which satisfies NC1(i) and NC2 is a lossless storage function of a learning operator with zero shift. The conversion function $f$ satisfies $f(\mathbf{q}) = \nabla E(\mathbf{q})$.

Proof: .

Due to NC2, $\left\langle \nabla E(\mathbf{q}),\mathbf{1} \right\rangle = 1$, thus $\sum_{j=1}^n \nabla_j E(\mathbf{q}) = 1$. This equality and NC1(i) imply $\nabla E(\mathbf{q}) \in \Delta^n$. Now, consider a learning dynamic with conversion function $f = \nabla E$. Then for any function-of-time $\mathbf{p}$ and any $t > 0$, we have $\frac{\mathsf{d}E(\mathbf{q}(t))}{\mathsf{d}t} = \left\langle \nabla E(\mathbf{q}(t)),\dot{\mathbf{q}} \right\rangle = \left\langle f(\mathbf{q}(t)),\mathbf{p}(t) \right\rangle = \left\langle \mathbf{x}(t),\mathbf{p}(t) \right\rangle$. Integrating both sides w.r.t. $t$ shows that the learning dynamic is lossless via the storage function $E$. ∎

Example 12.

Consider the learning dynamic whose conversion function always outputs the uniform mixed strategy, i.e. for all $\mathbf{q}$, $f(\mathbf{q}) = \frac{1}{n}\cdot\mathbf{1}$. By Proposition 11, it is a lossless learning dynamic via $E(\mathbf{q}) = \left\langle \frac{1}{n}\cdot\mathbf{1},\mathbf{q} \right\rangle$, a storage function that clearly satisfies NC1(i) and NC2. Of course, no real "learning" happens with this conversion function, and it is easy to construct examples demonstrating that its regret can be unbounded. By Proposition 10, this learning dynamic is not finitely lossless. This shows that the family of finitely lossless learning dynamics is a proper subset of the family of lossless learning dynamics.

The family of smooth functions $E$ satisfying NC1(i) and NC2 can be represented compactly as

\mathcal{E} := \{ E : \mathbb{R}^n \rightarrow \mathbb{R} ~|~ \forall \mathbf{q},~ \nabla E(\mathbf{q}) \geq \mathbf{0} ~~\text{and}~~ \left\langle \nabla E(\mathbf{q}),\mathbf{1} \right\rangle = 1 \}.

An interesting observation is that this family is closed under convex combination, i.e. if $E_1,\ldots,E_k \in \mathcal{E}$, then for any real numbers $\alpha_1,\ldots,\alpha_k \geq 0$ such that $\sum_{\ell=1}^k \alpha_\ell = 1$, we have $\left( \sum_{\ell=1}^k \alpha_\ell \cdot E_\ell \right) \in \mathcal{E}$. By Proposition 11, $E_1,\ldots,E_k$ are lossless storage functions of some learning operators with zero shift. Suppose the conversion functions are $f_1,\ldots,f_k$ respectively. Then by Proposition 11 again, $\left( \sum_{\ell=1}^k \alpha_\ell \cdot E_\ell \right)$ is a lossless storage function of a learning operator with zero shift, with conversion function $\left( \sum_{\ell=1}^k \alpha_\ell \cdot f_\ell \right)$. This motivates the following definition of convex combination of learning dynamics.

Definition 13.

Given $k$ learning dynamics, each over $n$ actions, let $f_\ell$ denote the conversion function of the $\ell$-th learning dynamic. Given any non-negative constants $\alpha_1,\cdots,\alpha_k$ with $\sum_{\ell=1}^k \alpha_\ell = 1$, the convex combination of the $k$ learning dynamics with parameters $\alpha_1,\cdots,\alpha_k$ is the learning dynamic with conversion function $\left( \sum_{\ell=1}^k \alpha_\ell \cdot f_\ell \right)$.

Example 14.

Suppose we are using the half-half convex combination of Replicator Dynamic (RD) and Online Gradient Descent (OGD), and $n = 2$. Recall the conversion functions of RD and OGD in (2) and (4). Suppose that at some time $t$, $\mathbf{q}(t) = (0.6, 0)$. Then the mixed strategy at time $t$ with the half-half convex combination of RD and OGD is

\mathbf{x}(t) = \frac{1}{2}\cdot f_{\text{RD}}\left((0.6,0)\right) + \frac{1}{2}\cdot f_{\text{OGD}}\left((0.6,0)\right)
= \frac{1}{2}\left[ \left( \frac{\exp(0.6)}{\exp(0.6)+\exp(0)}, \frac{\exp(0)}{\exp(0.6)+\exp(0)} \right) + \left( \frac{0.6-0+1}{2}, \frac{0-0.6+1}{2} \right) \right]
\approx \frac{1}{2}\left[ (0.6457, 0.3543) + (0.8, 0.2) \right] \approx (0.7228, 0.2772).

By (4), $f_{\text{OGD}}(\mathbf{q})$ outputs a strategy with zero probability of choosing action 2 whenever $q_1 - q_2 \geq 1$. In contrast, $f_{\text{RD}}(\mathbf{q})$ maintains a tiny but positive probability of choosing action 2 even when $q_1$ is much larger than $q_2$. We may say OGD leans toward exploitation while RD leans toward exploration. By combining the two learning dynamics via convex combination with our own choice of $\alpha$'s, we obtain a new lossless learning dynamic with our desired balance between exploitation and exploration.
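
The arithmetic in Example 14 can be reproduced with a few lines of Python; the following self-contained sketch (the helper names are our own) illustrates Definition 13 by combining the maps (2) and (4) with weights $\alpha_1 = \alpha_2 = \tfrac{1}{2}$.

```python
import numpy as np

def f_rd(q):                                   # logit choice map (2)
    z = np.exp(q - np.max(q))
    return z / z.sum()

def f_ogd(q):                                  # OGD map (4) for n = 2 actions
    d = np.clip(q[0] - q[1], -1.0, 1.0)
    return np.array([(d + 1) / 2, (1 - d) / 2])

def convex_combination(fs, alphas):
    """Definition 13: the conversion function sum_l alpha_l * f_l."""
    return lambda q: sum(a * f(q) for a, f in zip(alphas, fs))

f_mix = convex_combination([f_rd, f_ogd], [0.5, 0.5])
print(f_mix(np.array([0.6, 0.0])))             # approx (0.7228, 0.2772)
```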

Convex combination preserves not only losslessness, but also finite losslessness. Suppose there are several finitely lossless learning dynamics. By Definition 5, for every action $j$, the storage functions of their learning operators with shift $\mathbf{e}_j$ have finite lower bounds. It is easy to verify that for a convex combination of these lossless learning dynamics, its learning operator with shift $\mathbf{e}_j$ is lossless via the same convex combination of the storage functions mentioned above. Since those storage functions have finite lower bounds, their convex combination has a finite lower bound too.

Theorem 15.

Given any $k$ lossless (resp. finitely lossless) learning dynamics, any convex combination of them is a lossless (resp. finitely lossless) learning dynamic.

Theorems 7, 8 and 15 lead to the interesting theorem below.

Theorem 16.

Any convex combination of any finitely lossless learning dynamics is a learning dynamic that guarantees constant regret. In particular, any convex combination of any FTRL learning dynamics is a learning dynamic that guarantees constant regret.

Remark. Note that the family of smooth storage functions that generate finitely lossless learning dynamics is

\left\{ E \in \mathcal{E} ~|~ \forall j,~ E(\mathbf{q}) - \left\langle \mathbf{q},\mathbf{e}_j \right\rangle ~\text{is bounded from below} \right\}.

Again, this family is closed under convex combination. By Theorem 7, the family of FTRL learning dynamics is a subset of the family of finitely lossless learning dynamics. It is not clear whether the two families are equal; we believe they are not. For instance, for the half-half convex combination of Replicator Dynamic and Online Gradient Descent, we cannot find any regularizer function certifying that it is a FTRL dynamic.

6 Lossless DGS and Poincaré Recurrence

In the last section, we presented a potentially broad family of (finitely) lossless learning dynamics. In this section, our focus is on DGS in which the agents use such learning dynamics. We first prove that for certain graphical constant-sum games, their game operators are finitely lossless. Thus, any DGS comprising such a game and finitely lossless learning dynamics admits a constant-of-motion by Theorem 2 when $\mathbf{r} \equiv 0$, i.e. the sum of the two storage function values is constant for all $t \geq 0$. Then we use this constant-of-motion and follow a principled approach proposed by Mertikopoulos et al. [MPP18] to show our main result here: the DGS is Poincaré recurrent. In the rest of this section, we first define graphical constant-sum games and Poincaré recurrence, then we apply the principled approach to prove our main result.

Graphical Constant-sum Game.

There are $m$ agents. Each agent $i$ has $n_i$ actions. In a graphical game [KLS01], there is a normal-form (matrix) game between every pair of agents, which we call an edge-game. The edge-game between agents $i,k$ is specified by two matrices, $\mathbf{A}^{ik} \in \mathbb{R}^{n_i\times n_k}$ and $\mathbf{A}^{ki} \in \mathbb{R}^{n_k\times n_i}$. When each agent $k \neq i$ chooses a mixed strategy $\mathbf{x}_k \in \Delta^{n_k}$, the payoff vector of agent $i$, which contains the payoff to each action of agent $i$, is the sum of the payoff vectors in all her edge-games. Precisely, the payoff vector of agent $i$ is

\mathbf{p}_i ~=~ \sum_{1\leq k\leq m,~k\neq i} \mathbf{A}^{ik}\cdot\mathbf{x}_k.    (9)

A graphical constant-sum game [DP09] is a graphical game such that for every pair of agents $\{i,k\}$, there exists a constant $c^{\{i,k\}}$ satisfying $\mathbf{A}^{ik}_{j\ell} + \mathbf{A}^{ki}_{\ell j} = c^{\{i,k\}}$ for any action $j$ of agent $i$ and action $\ell$ of agent $k$.
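
The following short Python sketch (our own illustration) computes the payoff vectors of a graphical game via (9) and checks the constant-sum condition for each edge; the three-agent game used here is the Cyclic Matching Pennies game from Figure 5, for which every edge constant is $c^{\{i,k\}} = 0$.

```python
import numpy as np

I2 = np.eye(2)
# Edge-game matrices of Cyclic Matching Pennies (see Figure 5): A[(i, k)] is A^{ik}.
A = {(0, 1): I2, (1, 2): I2, (2, 0): I2,
     (1, 0): -I2, (2, 1): -I2, (0, 2): -I2}

def payoffs(x):
    """Payoff vector of each agent via (9): p_i = sum_{k != i} A^{ik} x_k."""
    return [sum(A[(i, k)] @ x[k] for k in range(3) if k != i) for i in range(3)]

# Constant-sum check: A^{ik}_{jl} + A^{ki}_{lj} is the same constant for all j, l.
for (i, k) in [(0, 1), (1, 2), (2, 0)]:
    print((i, k), A[(i, k)] + A[(k, i)].T)      # the zero matrix, so c^{i,k} = 0

x = [np.array([0.9, 0.1]), np.array([0.88, 0.12]), np.array([0.4, 0.6])]
print(payoffs(x))
```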

Game Operator.

Recall that a game operator takes as input the mixed strategies of the agents with shift $\hat{\mathbf{x}}^*$, while its output is the negation of the payoff vector over the agents' actions. We point out one useful fact, which follows quite readily from [PS14].

Proposition 17.

If $\hat{\mathbf{x}}^*$ is a Nash equilibrium of a graphical constant-sum game, then the game operator with shift $\hat{\mathbf{x}}^*$ is passive; the storage function is the zero function. Moreover, if $\hat{\mathbf{x}}^*$ is fully-mixed (i.e. every entry in $\hat{\mathbf{x}}^*$ is positive), then the game operator is lossless via the zero storage function.

When the game is lossless, by the composability of passive operators (Theorem 2), if it is coupled with passive (resp. lossless) learning dynamics, then the DGS is passive (resp. lossless).

Poincaré Recurrence.

In a DGS, $\hat{\mathbf{p}}$ is a function of $\hat{\mathbf{x}}$ via the game operator, while $\hat{\mathbf{x}}$ is a function of $\hat{\mathbf{q}}$ via the conversion functions. Thus, $\hat{\mathbf{p}}$ is a function of $\hat{\mathbf{q}}$. We say the ODE system (1) is divergence-free if $\sum_{i=1}^m \sum_j \frac{\partial p_{ij}}{\partial q_{ij}}$ is zero everywhere. When $\hat{\mathbf{p}}$ is derived from a graphical game via (9), $p_{ij}$ does not depend on $q_{ij}$, since for every $k \neq i$, $\mathbf{x}_k$ is a function of $\mathbf{q}_k$ only. Thus, the game dynamic is divergence-free.

Intuitively, an ODE system with domain $\mathbb{R}^N$ is Poincaré recurrent if almost all trajectories return arbitrarily close to their initial position infinitely often. In order to work formally with the notion of Poincaré recurrence, we need to define a measure on $\mathbb{R}^N$. We use the standard Lebesgue measure on $\mathbb{R}^N$. Liouville's formula states that divergence-free ODE systems preserve volume [Wei95]; see Appendix A for more discussion of volume and Liouville's formula. Thus, the following Poincaré Recurrence Theorem is applicable provided its bounded-orbit requirement is satisfied.

Theorem 18 ([Poi90, Bar06]).

If a transformation preserves volume and has bounded orbits, then it is Poincaré recurrent, i.e. for each open set there exist orbits that intersect this set infinitely often.

Given any $\epsilon > 0$, we can cover $\mathbb{R}^N$ by countably many balls of radius $\epsilon$, and apply the theorem to each ball. We conclude that almost every point returns to within an $\epsilon$-neighbourhood of itself. Since $\epsilon > 0$ is arbitrary, we conclude that almost every initial point is almost recurrent.

Poincaré Recurrence in DGS.

Recall that in each learning operator of a DGS, the state is represented by a vector $\mathbf{q} \in \mathbb{R}^n$ of cumulative payoffs. Clearly, $\mathbf{q}$ can become unbounded as time passes, even in a graphical constant-sum game (for instance, in a two-person zero-sum game in which every payoff entry to agent 1 is strictly positive), thus prohibiting us from using Theorem 18. Instead, as in [MPP18], we consider a transformation that maps $\mathbf{q} = (q_1,\ldots,q_n)$ to

(q_1', q_2', \cdots, q_{n-1}') := (q_1 - q_n, \ldots, q_{n-1} - q_n) \in \mathbb{R}^{n-1}.

It is well known that any FTRL dynamic with starting point $\mathbf{q}^0 = (q_1^0,\ldots,q_n^0)$ and conversion function $f : \mathbb{R}^n \rightarrow \Delta^n$ is equivalent to the following dynamic with state variables $\mathbf{q}' = (q_1',\ldots,q_{n-1}') \in \mathbb{R}^{n-1}$, where $1 \leq j \leq n-1$:

q_j'(0) = q_j^0 - q_n^0
\dot{q}_j'(t) = p_j(t) - p_n(t)
\mathbf{x}(t) = f(q_1'(t), q_2'(t), \cdots, q_{n-1}'(t), 0).    (10)

Given a mixed strategy $\mathbf{x}^*$, if we cast the output of the above dynamic to be $(x_1(t) - x_1^*, \cdots, x_{n-1}(t) - x_{n-1}^*)$, then it is easy to verify that the learning operator is finitely lossless via the storage function $\overline{L}(q_1',\ldots,q_{n-1}') = L(q_1',\cdots,q_{n-1}',0)$, where $L$ is the storage function of the original learning operator with shift $\mathbf{x}^*$.

Theorem 19.

Poincaré recurrence occurs in the strategy space Δ\Delta for any dynamical game system where (1) each agent employs a learning dynamic which is a convex combination of FTRL; and (2) the underlying game is a graphical constant-sum game with a fully-mixed Nash equilibrium.

See Figure 5 for an illustration of Poincaré recurrence under conditions (1) and (2). To prove the theorem, we first show that Poincaré recurrence occurs in the space that contains $\hat{\mathbf{q}}'$ (the concatenation of the $\mathbf{q}'$ of all agents). This comprises two major steps:

  • The dynamic preserves volume, since the dynamic is divergence-free.

  • For any starting point $\hat{\mathbf{q}}'(0)$, the dynamic remains bounded. To show this, we use the fact that $\overline{L}$ is a constant-of-motion of the game dynamic, so for any $t \geq 0$, $\hat{\mathbf{q}}'(t)$ must stay within a level set of $\overline{L}$, which is bounded (see Appendix C).

Poincaré recurrence in the space that contains $\hat{\mathbf{q}}'$ implies Poincaré recurrence in the strategy space $\Delta$, since the conversion function in (10) is continuous.

Figure 5: The trajectories of learning in a graphical constant-sum game called Cyclic Matching Pennies. In this game, there are three agents, each with two actions. $\mathbf{A}^{12} = \mathbf{A}^{23} = \mathbf{A}^{31} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$, while $\mathbf{A}^{21} = \mathbf{A}^{32} = \mathbf{A}^{13} = \begin{bmatrix} -1 & 0 \\ 0 & -1 \end{bmatrix}$. The mixed strategies of the three agents are denoted by $(x_1, 1-x_1)$, $(x_2, 1-x_2)$ and $(x_3, 1-x_3)$ respectively. All trajectories start at $x_1 = 0.9$, $x_2 = 0.88$ and $x_3 = 0.4$ (the black dot in the figure). The trajectories are simulated when all agents use Replicator Dynamic (blue), Online Gradient Descent (black), and the half-half convex combination of the former two dynamics (red), respectively. All trajectories are recurrent.
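
The qualitative behavior in Figure 5 can be reproduced with a simple Euler simulation; the following self-contained Python sketch (our own, with an illustrative step size and horizon) runs Replicator Dynamic for all three agents in Cyclic Matching Pennies and reports how close the strategy profile returns to its starting point.

```python
import numpy as np

I2 = np.eye(2)
A = {(0, 1): I2, (1, 2): I2, (2, 0): I2,       # Cyclic Matching Pennies (Figure 5)
     (1, 0): -I2, (2, 1): -I2, (0, 2): -I2}

def f_rd(q):                                    # logit choice map (2)
    z = np.exp(q - np.max(q))
    return z / z.sum()

def logit(x1):
    """Cumulative payoffs (a, 0) such that f_rd((a, 0)) = (x1, 1 - x1)."""
    return np.array([np.log(x1 / (1 - x1)), 0.0])

q = [logit(0.9), logit(0.88), logit(0.4)]       # the starting point used in Figure 5
x0 = np.concatenate([f_rd(qi) for qi in q])

dt, steps = 1e-3, 200000
min_dist = np.inf
for t in range(steps):
    x = [f_rd(qi) for qi in q]
    p = [sum(A[(i, k)] @ x[k] for k in range(3) if k != i) for i in range(3)]  # (9)
    q = [qi + dt * pi for qi, pi in zip(q, p)]
    if t > 1000:                                # ignore the first moments of the run
        min_dist = min(min_dist, np.linalg.norm(np.concatenate(x) - x0))

print(min_dist)   # the trajectory comes back close to its starting strategy profile
```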

7 Conclusion

We present a control-theoretic perspective for understanding popular learning dynamics like FTRL and escort replicator dynamics. At the heart of it is the use of storage (energy) functions to govern how a dynamic turns the history of payoffs into strategic choices. This mirrors the study of physical dynamics, e.g. electrical networks [Kha15]. Analysis via storage functions permits us to prove optimal regret bounds and inspires interesting generalizations of FTRL via convex combinations.

An important benefit of these control-theoretic tools is, as pointed out by Fox and Shamma [FS13], that they allow the decoupling of game dynamics into learning operators and game operators. This provides a framework for understanding learning-in-games via a modular approach, by analyzing these operators separately. This technique can liberate us from analyzing each individual learning-in-game system in an ad-hoc manner.

In our work, we initiate the study of connections between online optimization and control theory with continuous-time learning dynamics. An interesting problem is how such a connection can be generalized to discrete-time learning algorithms, e.g. Multiplicative Weights Update and its optimistic variant [BP18, Che18, DP19, CP20]. There does exist theory that generalizes passivity to discrete-time settings, e.g. Section VI.7 in [DV75]. We hope our work inspires further studies in this direction.

Lastly, we believe that this control-theoretic perspective is also useful for understanding learning dynamics/algorithms which are not always passive. The perspective can help us spot the situations in which the learning dynamics/algorithms create or dissipate energy. By avoiding situations where energy is created, it may be possible to achieve stable outcomes in learning processes.

Acknowledgements

This research/project is supported in part by NRF2019-NRF-ANR095 ALIAS grant, grant PIE-SGP-AI-2018-01, NRF 2018 Fellowship NRF-NRFF2018-07, AME Programmatic Fund (Grant No. A20H6b0151) from the Agency for Science, Technology and Research (A*STAR) and the National Research Foundation, Singapore under its AI Singapore Program (AISG Award No: AISG2-RP-2020-016).

References

  • [AHK05] Sanjeev Arora, Elad Hazan, and Satyen Kale. The multiplicative weights update method: a meta algorithm and applications. Technical report, 2005.
  • [Bar06] Luis Barreira. Poincare recurrence: old and new. In XIVth International Congress on Mathematical Physics. World Scientific., pages 415–422, 2006.
  • [BGP20] James P. Bailey, Gauthier Gidel, and Georgios Piliouras. Finite regret and cycles with fixed step-size via alternating gradient descent-ascent. In Conference on Learning Theory, COLT 2020, 9-12 July 2020, Virtual Event [Graz, Austria], volume 125 of Proceedings of Machine Learning Research, pages 391–407. PMLR, 2020.
  • [BP18] James P. Bailey and Georgios Piliouras. Multiplicative weights update in zero-sum games. In Proceedings of the 2018 ACM Conference on Economics and Computation, Ithaca, NY, USA, June 18-22, 2018, pages 321–338. ACM, 2018.
  • [BP19] James Bailey and Georgios Piliouras. Fast and furious learning in zero-sum games: vanishing regret with non-vanishing step sizes. In Advances in Neural Information Processing Systems, pages 12977–12987, 2019.
  • [BRM+18] David Balduzzi, Sebastien Racaniere, James Martens, Jakob Foerster, Karl Tuyls, and Thore Graepel. The mechanics of n-player differentiable games. In ICML, 2018.
  • [CBL06] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
  • [Che18] Yun Kuen Cheung. Multiplicative weights updates with constant step-size in graphical constant-sum games. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pages 3532–3542, 2018.
  • [CP19] Yun Kuen Cheung and Georgios Piliouras. Vortices instead of equilibria in minmax optimization: Chaos and butterfly effects of online learning in zero-sum games. In Conference on Learning Theory, COLT 2019, 25-28 June 2019, Phoenix, AZ, USA, volume 99 of Proceedings of Machine Learning Research, pages 807–834. PMLR, 2019.
  • [CP20] Yun Kuen Cheung and Georgios Piliouras. Chaos, extremism and optimism: Volume analysis of learning in games. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
  • [DP09] Constantinos Daskalakis and Christos H. Papadimitriou. On a network generalization of the minmax theorem. In ICALP 2009: Proceedings of the 2009 International Colloquium on Automata, Languages, and Programming, 2009.
  • [DP19] Constantinos Daskalakis and Ioannis Panageas. Last-iterate convergence: Zero-sum games and constrained min-max optimization. In 10th Innovations in Theoretical Computer Science Conference, ITCS 2019, January 10-12, 2019, San Diego, California, USA, volume 124 of LIPIcs, pages 27:1–27:18. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2019.
  • [DV75] C.A. Desoer and M. Vidyasagar. Feedback Systems: Input-output Properties. Electrical science series. Academic Press, 1975.
  • [FL98] Drew Fudenberg and David K. Levine. The Theory of Learning in Games. MIT Press Books. The MIT Press, 1998.
  • [FLST16] Dylan J Foster, Thodoris Lykouris, Karthik Sridharan, and Eva Tardos. Learning in games: Robustness of fast convergence. In Advances in Neural Information Processing Systems, pages 4727–4735, 2016.
  • [FS99] Yoav Freund and Robert E Schapire. Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29(1-2):79–103, 1999.
  • [FS13] Michael J Fox and Jeff S Shamma. Population games, stable games, and passivity. Games, 4(4):561–583, 2013.
  • [GP19] D. Gadjov and L. Pavel. A passivity-based approach to Nash equilibrium seeking over networks. IEEE Transactions on Automatic Control, 64(3):1077–1092, 2019.
  • [H+16] Elad Hazan et al. Introduction to online convex optimization. Foundations and Trends® in Optimization, 2(3-4):157–325, 2016.
  • [Har11] Marc Harper. Escort evolutionary game theory. Physica D: Nonlinear Phenomena, 240(18):1411–1415, 2011.
  • [HS98] Josef Hofbauer and Karl Sigmund. Evolutionary Games and Population Dynamics. Cambridge University Press, Cambridge, UK, 1998.
  • [HS09] Josef Hofbauer and William H Sandholm. Stable games and their dynamics. Journal of Economic theory, 144(4):1665–1693, 2009.
  • [Kha15] Hassan K. Khalil. Nonlinear Control. Pearson Education, 2015.
  • [KLS01] M. J. Kearns, M. L. Littman, and S. P. Singh. Graphical models for game theory. In UAI, 2001.
  • [KM17] Joon Kwon and Panayotis Mertikopoulos. A continuous-time approach to online optimization. Journal of Dynamics and Games, 4(2):125–148, April 2017.
  • [Mab18] M. A. Mabrok. Passivity Analysis of Replicator Dynamics and its Variations. arXiv e-prints, page arXiv:1812.07164, Dec 2018.
  • [MPP18] Panayotis Mertikopoulos, Christos Papadimitriou, and Georgios Piliouras. Cycles in adversarial regularized learning. In ACM-SIAM Symposium on Discrete Algorithms, 2018.
  • [MS15] Jason R Marden and Jeff S Shamma. Game theory and distributed control. In Handbook of game theory with economic applications, volume 4, pages 861–899. Elsevier, 2015.
  • [MS16] M. A. Mabrok and J. S. Shamma. Passivity analysis of higher order evolutionary dynamics and population games. In 2016 IEEE 55th Conference on Decision and Control (CDC), pages 6129–6134, 2016.
  • [MS18] Jason R Marden and Jeff S Shamma. Game theory and control. Annual Review of Control, Robotics, and Autonomous Systems, 1:105–134, 2018.
  • [OPNSR13] Romeo Ortega, Julio Antonio Loría Perez, Per Johan Nicklasson, and Hebertt J Sira-Ramirez. Passivity-based control of Euler-Lagrange systems: mechanical, electrical and electromechanical applications. Springer Science & Business Media, 2013.
  • [PML+20] Julien Perolat, Remi Munos, Jean-Baptiste Lespiau, Shayegan Omidshafiei, Mark Rowland, Pedro Ortega, Neil Burch, Thomas Anthony, David Balduzzi, Bart De Vylder, et al. From poincaré recurrence to convergence in imperfect information games: Finding equilibrium via regularization. arXiv preprint arXiv:2002.08456, 2020.
  • [Poi90] Henri Poincaré. Sur le problème des trois corps et les équations de la dynamique. Acta mathematica, 13(1):A3–A270, 1890.
  • [PS14] Georgios Piliouras and Jeff S Shamma. Optimization despite chaos: Convex relaxations to complex limit sets via Poincaré recurrence. In Proceedings of the twenty-fifth annual ACM-SIAM symposium on Discrete algorithms, pages 861–873. SIAM, 2014.
  • [PSM18] S. Park, J. S. Shamma, and N. C. Martins. Passivity and evolutionary game dynamics. In 2018 IEEE Conference on Decision and Control (CDC), pages 3553–3560, 2018.
  • [RS13] Alexander Rakhlin and Karthik Sridharan. Online learning with predictable sequences. In COLT, 2013.
  • [SALS15] Vasilis Syrgkanis, Alekh Agarwal, Haipeng Luo, and Robert E. Schapire. Fast convergence of regularized learning in games. In Proceedings of the 28th International Conference on Neural Information Processing Systems, NIPS’15, pages 2989–2997, Cambridge, MA, USA, 2015. MIT Press.
  • [San10] William H. Sandholm. Population Games and Evolutionary Dynamics. MIT Press, 2010.
  • [Sha20] Jeff S. Shamma. Feedback control perspectives on learning, 2020.
  • [SS12] Shai Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends® in Machine Learning, 4(2):107–194, 2012.
  • [VGFP19] Emmanouil-Vasileios Vlatakis-Gkaragkounis, Lampros Flokas, and Georgios Piliouras. Poincaré recurrence, cycles and spurious equilibria in gradient-descent-ascent for non-convex non-concave zero-sum games. In Advances in Neural Information Processing Systems, pages 10450–10461, 2019.
  • [Wei95] J. W. Weibull. Evolutionary Game Theory. MIT Press; Cambridge, MA: Cambridge University Press., 1995.
  • [Wil72a] Jan C. Willems. Dissipative dynamical systems Part I: General theory. Archive for Rational Mechanics and Analysis, 45(5):321–351, 1972.
  • [Wil72b] Jan C. Willems. Dissipative dynamical systems Part II: Linear systems with quadratic supply rates. Archive for Rational Mechanics and Analysis, 45(5):352–393, 1972.

Appendix

Appendix A Volume and Liouville’s Formula

Let g:ddg:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d} be a function. Given an ODE system 𝐳˙=g(𝐳)\dot{\mathbf{z}}=g(\mathbf{z}) with the freedom to choose the starting point, let Φ(𝐳0,t)\Phi(\mathbf{z}^{0},t) be the solution of the ODE system at time tt with starting point 𝐳0\mathbf{z}^{0}. Given any set AA, let A(t)={Φ(𝐳0,t)|𝐳0A}A(t)=\{\Phi(\mathbf{z}^{0},t)~{}|~{}\mathbf{z}^{0}\in A\}. When AA is measurable, under mild conditions on the ODE system, A(t)A(t) is measurable and its volume is vol[A(t)]=A(t)𝖽v\text{vol}[A(t)]=\int_{A(t)}\,\mathsf{d}v. Liouville’s formula states that the time derivative of the volume of A(t)A(t) is equal to the integral of the divergence of the ODE system over A(t)A(t):

ddtvol[A(t)]=A(t)trace(g𝐳)𝖽v,\frac{d}{dt}\text{vol}[A(t)]=\int_{A(t)}\text{trace}\left(\frac{\partial g}{\partial\mathbf{z}}\right)\,\mathsf{d}v,

where g𝐳\frac{\partial g}{\partial\mathbf{z}} is the Jacobian of the ODE system. Note that trace(g𝐳)=j=1dgjzj\text{trace}\left(\frac{\partial g}{\partial\mathbf{z}}\right)=\sum_{j=1}^{d}\frac{\partial g_{j}}{\partial z_{j}}, where gjg_{j} is the jj-th component of the function gg. This immediately implies volume preservation for divergence-free systems.
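To make the formula concrete, here is a minimal numerical sanity check that we add for illustration (it is not part of the original paper): for a divergence-free vector field, Liouville's formula says the flow preserves volume. As a hypothetical example we use the ideal pendulum field g(θ,ω)=(ω,−sin θ) and estimate its divergence by finite differences.

```python
# A minimal sanity check of Liouville's formula (illustrative; not from the paper).
# For a divergence-free vector field the flow preserves volume.  We use the ideal
# pendulum g(theta, omega) = (omega, -sin(theta)) as a hypothetical example.
import numpy as np

def g(z):
    theta, omega = z
    return np.array([omega, -np.sin(theta)])

def divergence(z, eps=1e-6):
    """Finite-difference estimate of trace(dg/dz) at the point z."""
    d = len(z)
    return sum(
        (g(z + eps * np.eye(d)[j])[j] - g(z - eps * np.eye(d)[j])[j]) / (2 * eps)
        for j in range(d)
    )

rng = np.random.default_rng(0)
points = rng.uniform(-2.0, 2.0, size=(5, 2))
print([round(divergence(z), 8) for z in points])   # all approximately 0.0
```

Since the divergence vanishes everywhere, Liouville's formula gives d/dt vol[A(t)] = 0, i.e. the flow is volume-preserving; the same combination of volume preservation and bounded orbits is what drives the Poincaré recurrence results later in the appendix.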

Appendix B Missing Proofs

Proof of Proposition 4.

It suffices to prove the forward (only if) direction, as the other direction is symmetric. By the definition of passivity,

La(𝐪(t))La(𝐪0)+0t(𝐱(τ)𝐱,a),𝐩(τ)𝖽τ.L^{a}(\mathbf{q}(t))\leq L^{a}(\mathbf{q}^{0})+\int_{0}^{t}\left\langle(\mathbf{x}(\tau)-\mathbf{x}^{*,a}),\mathbf{p}(\tau)\right\rangle\,\mathsf{d}\tau.

This implies

La(𝐪(t))\displaystyle L^{a}(\mathbf{q}(t)) La(𝐪0)+0t(𝐱(τ)𝐱,b),𝐩(τ)𝖽τ+0t(𝐱,b𝐱,a),𝐩(τ)𝖽τ\displaystyle~{}\leq~{}L^{a}(\mathbf{q}^{0})+\int_{0}^{t}\left\langle(\mathbf{x}(\tau)-\mathbf{x}^{*,b}),\mathbf{p}(\tau)\right\rangle\,\mathsf{d}\tau+\int_{0}^{t}\left\langle(\mathbf{x}^{*,b}-\mathbf{x}^{*,a}),\mathbf{p}(\tau)\right\rangle\,\mathsf{d}\tau
=La(𝐪0)+0t(𝐱(τ)𝐱,b),𝐩(τ)𝖽τ+(𝐱,b𝐱,a),(𝐪(t)𝐪0).\displaystyle~{}=~{}L^{a}(\mathbf{q}^{0})+\int_{0}^{t}\left\langle(\mathbf{x}(\tau)-\mathbf{x}^{*,b}),\mathbf{p}(\tau)\right\rangle\,\mathsf{d}\tau+\left\langle(\mathbf{x}^{*,b}-\mathbf{x}^{*,a}),(\mathbf{q}(t)-\mathbf{q}^{0})\right\rangle.

Thus, by setting Lb(𝐪):=La(𝐪)(𝐱,b𝐱,a),𝐪+cL^{b}(\mathbf{q}):=L^{a}(\mathbf{q})-\left\langle(\mathbf{x}^{*,b}-\mathbf{x}^{*,a}),\mathbf{q}\right\rangle+c, we have

Lb(𝐪(t))Lb(𝐪0)+0t(𝐱(τ)𝐱,b),𝐩(τ)𝖽τ,L^{b}(\mathbf{q}(t))\leq L^{b}(\mathbf{q}^{0})+\int_{0}^{t}\left\langle(\mathbf{x}(\tau)-\mathbf{x}^{*,b}),\mathbf{p}(\tau)\right\rangle\,\mathsf{d}\tau,

certifying passivity of the operator SbS^{b}. ∎

Proof of Proposition 6.

Suppose that for each action jj, the learning operator with shift 𝐞j\mathbf{e}_{j} is finitely passive via storage function LjL^{j}. We claim that the storage function j=1nxjLj\sum_{j=1}^{n}x^{*}_{j}\cdot L^{j} certifies finite passivity of the learning operator with shift 𝐱\mathbf{x}^{*}. To see why, note that for each action jj, we have

Lj(𝐪(t))Lj(𝐪0)+0t(𝐱(τ)𝐞j),𝐩(τ)𝖽τ.L^{j}(\mathbf{q}(t))\leq L^{j}(\mathbf{q}^{0})+\int_{0}^{t}\left\langle(\mathbf{x}(\tau)-\mathbf{e}_{j}),\mathbf{p}(\tau)\right\rangle\,\mathsf{d}\tau.

We are done by multiplying both sides of the inequality by xjx^{*}_{j} and then summing over all jj. ∎

Proof of Proposition 10.

()(\Leftarrow) This direction follows from Theorem 8.

()(\Rightarrow) Suppose, towards a contradiction, that the learning algorithm is lossless but there exists jj such that the learning operator with shift 𝐞j\mathbf{e}_{j} is not finitely lossless. Then it has a storage function LjL^{j} which is not bounded from below, and

Lj(𝐪(t))=Lj(𝐪(0))+0t𝐱(τ),𝐩(τ)𝖽τ0t𝐞j,𝐩(τ)𝖽τ.L^{j}(\mathbf{q}(t))~{}=~{}L^{j}(\mathbf{q}(0))+\int_{0}^{t}\left\langle\mathbf{x}(\tau),\mathbf{p}(\tau)\right\rangle\,\mathsf{d}\tau-\int_{0}^{t}\left\langle\mathbf{e}_{j},\mathbf{p}(\tau)\right\rangle\,\mathsf{d}\tau.

Following the calculation in the proof of Theorem 8, the regret w.r.t. action jj at time tt is exactly equal to Lj(𝐪0)Lj(𝐪(t))L^{j}(\mathbf{q}^{0})-L^{j}(\mathbf{q}(t)).

Since LjL^{j} is not bounded from below, for any r<0r<0, there exists 𝐪~\tilde{\mathbf{q}} such that Lj(𝐪~)rL^{j}(\tilde{\mathbf{q}})\leq r. It is easy to construct 𝐩\mathbf{p} such that 𝐪(t)=𝐪0+0t𝐩(τ)𝖽τ=𝐪~\mathbf{q}(t)=\mathbf{q}^{0}+\int_{0}^{t}\mathbf{p}(\tau)\,\mathsf{d}\tau=\tilde{\mathbf{q}}; for instance, set 𝐩(τ)=(𝐪~𝐪0)/t\mathbf{p}(\tau)=(\tilde{\mathbf{q}}-\mathbf{q}^{0})/t for all τ[0,t]\tau\in[0,t]. For this choice of 𝐩\mathbf{p}, the regret at time tt is Lj(𝐪0)Lj(𝐪~)Lj(𝐪0)rL^{j}(\mathbf{q}^{0})-L^{j}(\tilde{\mathbf{q}})\geq L^{j}(\mathbf{q}^{0})-r. Since rr can be chosen to be arbitrarily negative, the learning dynamic cannot guarantee constant regret, a contradiction. ∎

Proof of Proposition 17.

To show that the game operator is passive, according to Definition 1 and the input-output choice of S2S_{2} (see Figure 3), it suffices to show that

0t(𝐱^(τ)𝐱^),(𝐩^(τ))𝖽τ=0t𝐱^(τ),𝐩^(τ)V1𝖽τ+0t𝐱^,𝐩^(τ)V2𝖽τ0.\int_{0}^{t}\left\langle(\hat{\mathbf{x}}(\tau)-\hat{\mathbf{x}}^{*}),(-\hat{\mathbf{p}}(\tau))\right\rangle\,\mathsf{d}\tau~{}=~{}-\int_{0}^{t}\underbrace{\left\langle\hat{\mathbf{x}}(\tau),\hat{\mathbf{p}}(\tau)\right\rangle}_{V_{1}}\,\mathsf{d}\tau+\int_{0}^{t}\underbrace{\left\langle\hat{\mathbf{x}}^{*},\hat{\mathbf{p}}(\tau)\right\rangle}_{V_{2}}\,\mathsf{d}\tau~{}\geq~{}0.

Recall the definition of c{i,k}c^{\{i,k\}} in a graphical constant-sum game. Since V1V_{1} is simply the total payoff to all agents, V1V_{1} equals the sum of the constants c{i,k}c^{\{i,k\}} over all edge-games, i.e. V1=i=1m1k=i+1mc{i,k}V_{1}=\sum_{i=1}^{m-1}\sum_{k=i+1}^{m}c^{\{i,k\}}. We denote this double summation by VV. To show that the game operator is passive, it remains to show that V2VV_{2}\geq V at all times; to show that it is lossless, it remains to show that V2=VV_{2}=V at all times.

Let the action set of agent ii be SiS_{i}. We first expand V2=𝐱^,𝐩^V_{2}=\left\langle\hat{\mathbf{x}}^{*},\hat{\mathbf{p}}\right\rangle as follows:

𝐱^,𝐩^=i=1mjSixijk=1kim[𝐀ik𝐱k]j=i=1mk=1kim(𝐱i)𝖳𝐀ik𝐱k=i=1mk=1kim(𝐱k)𝖳(𝐀ik)𝖳𝐱i;\left\langle\hat{\mathbf{x}}^{*},\hat{\mathbf{p}}\right\rangle~{}=~{}\sum_{i=1}^{m}\sum_{j\in S_{i}}x_{ij}^{*}\sum_{\begin{subarray}{c}k=1\\ k\neq i\end{subarray}}^{m}[\mathbf{A}^{ik}\mathbf{x}_{k}]_{j}~{}=~{}\sum_{i=1}^{m}\sum_{\begin{subarray}{c}k=1\\ k\neq i\end{subarray}}^{m}(\mathbf{x}_{i}^{*})^{\mathsf{T}}\mathbf{A}^{ik}\mathbf{x}_{k}~{}=~{}\sum_{i=1}^{m}\sum_{\begin{subarray}{c}k=1\\ k\neq i\end{subarray}}^{m}(\mathbf{x}_{k})^{\mathsf{T}}(\mathbf{A}^{ik})^{\mathsf{T}}\mathbf{x}_{i}^{*}~{};

the last equality holds since each summand (𝐱i)𝖳𝐀ik𝐱k(\mathbf{x}_{i}^{*})^{\mathsf{T}}\mathbf{A}^{ik}\mathbf{x}_{k} is a scalar, and a scalar equals its own transpose (𝐱k)𝖳(𝐀ik)𝖳𝐱i(\mathbf{x}_{k})^{\mathsf{T}}(\mathbf{A}^{ik})^{\mathsf{T}}\mathbf{x}_{i}^{*}. Then we rewrite the double summation on the RHS as follows:

i=1mk=1kim(𝐱k)𝖳(𝐀ik)𝖳𝐱i\displaystyle\sum_{i=1}^{m}\sum_{\begin{subarray}{c}k=1\\ k\neq i\end{subarray}}^{m}(\mathbf{x}_{k})^{\mathsf{T}}(\mathbf{A}^{ik})^{\mathsf{T}}\mathbf{x}_{i}^{*} =i=1mk=1kim[c{i,k}(𝐱k)𝖳𝐀ki𝐱i](definition of constant-sum edge-game)\displaystyle~{}=~{}\sum_{i=1}^{m}\sum_{\begin{subarray}{c}k=1\\ k\neq i\end{subarray}}^{m}\left[c^{\{i,k\}}-(\mathbf{x}_{k})^{\mathsf{T}}\mathbf{A}^{ki}\mathbf{x}_{i}^{*}\right]~{}~{}~{}~{}~{}~{}\text{(definition of constant-sum edge-game)}
=i=1mk=1kimc{i,k}k=1mi=1ikm(𝐱k)𝖳𝐀ki𝐱i\displaystyle~{}=~{}\sum_{i=1}^{m}\sum_{\begin{subarray}{c}k=1\\ k\neq i\end{subarray}}^{m}c^{\{i,k\}}~{}-~{}\sum_{k=1}^{m}\sum_{\begin{subarray}{c}i=1\\ i\neq k\end{subarray}}^{m}(\mathbf{x}_{k})^{\mathsf{T}}\mathbf{A}^{ki}\mathbf{x}_{i}^{*}
=2Vk=1mi=1ikm(𝐱k)𝖳𝐀ki𝐱iUk.\displaystyle~{}=~{}2V~{}-~{}\sum_{k=1}^{m}\underbrace{\sum_{\begin{subarray}{c}i=1\\ i\neq k\end{subarray}}^{m}(\mathbf{x}_{k})^{\mathsf{T}}\mathbf{A}^{ki}\mathbf{x}_{i}^{*}}_{U_{k}}.

It remains to bound the term k=1mUk\sum_{k=1}^{m}U_{k}. Observe that for each agent kk, UkU_{k} is the payoff to agent kk when she chooses the mixed strategy 𝐱k\mathbf{x}_{k}, while every other agent ii chooses the mixed strategy 𝐱i\mathbf{x}_{i}^{*}. Since the mixed strategies 𝐱i\mathbf{x}_{i}^{*} come from a Nash equilibrium (NE), by the definition of NE, UkvkU_{k}\leq v_{k}^{*}, where vkv_{k}^{*} is the payoff to agent kk at the NE. Thus, k=1mUkk=1mvk\sum_{k=1}^{m}U_{k}\leq\sum_{k=1}^{m}v_{k}^{*}, where the RHS is the total payoff to all agents at the NE. Since the game is constant-sum, we have k=1mvk=V\sum_{k=1}^{m}v_{k}^{*}=V. Hence, V2=𝐱^,𝐩^2VV=VV_{2}=\left\langle\hat{\mathbf{x}}^{*},\hat{\mathbf{p}}\right\rangle\geq 2V-V=V.

When the NE is fully-mixed, we have the following extra property: at the NE, for each agent kk, her payoff from every one of her actions is the same and equals vkv_{k}^{*}; hence any mixed strategy of agent kk, played against the NE strategies of the other agents, also earns exactly vkv_{k}^{*}. Thus, UkU_{k} equals vkv_{k}^{*} exactly, so V2=VV_{2}=V. ∎
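As a quick illustration of the identity V2 = V (a numerical check we add here, not part of the original proof), consider the two-agent zero-sum Rock-Paper-Scissors game with the standard payoff matrix; its fully-mixed Nash equilibrium is the uniform strategy for both agents, and with the second agent's matrix equal to the negative transpose of the first's, the game is constant-sum with V = 0.

```python
# A numerical check (illustrative; not from the paper) that V2 = V in two-agent
# zero-sum Rock-Paper-Scissors, whose fully-mixed Nash equilibrium is uniform.
import numpy as np

A12 = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]], dtype=float)  # agent 1's edge-game payoffs
A21 = -A12.T                                                       # agent 2's payoffs (zero-sum, so V = 0)
x_star = np.full(3, 1.0 / 3.0)                                     # fully-mixed NE strategy of both agents

rng = np.random.default_rng(1)
for _ in range(3):
    x1, x2 = rng.dirichlet(np.ones(3)), rng.dirichlet(np.ones(3))  # arbitrary current strategies
    V1 = x1 @ A12 @ x2 + x2 @ A21 @ x1                             # total payoff; equals V = 0
    V2 = x_star @ (A12 @ x2) + x_star @ (A21 @ x1)                 # NE strategies against current payoffs
    print(round(V1, 12), round(V2, 12))                            # both print 0.0 up to rounding
```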

Appendix C Poincaré Recurrence

We first formally state the following corollary of Proposition 17 and Theorem 2.

Corollary 20.

The FIC system which corresponds to a dynamical game system, in which S1S_{1} is any finitely lossless MLO and S2S_{2} is any game operator corresponding to a graphical constant-sum game with a fully-mixed Nash equilibrium, is finitely lossless. The storage function that demonstrates finite losslessness of the FIC system is the same as the storage function of S1S_{1}. When the external input 𝐫\mathbf{r} is the zero function, the storage function becomes a constant of motion.

To complete the proof of Theorem 19, we need to establish the second property required by the principled approach of [MPP18]. The argument relies crucially on the following lemma. Recall the function defined in the main paper, which converts the storage function of the original learning operator into the storage function of the new learning operator in (10):

L¯(q1,q2,,qn1)=L(q1,q2,,qn1,0),\overline{L}(q^{\prime}_{1},q^{\prime}_{2},\cdots,q^{\prime}_{n-1})=L(q^{\prime}_{1},q^{\prime}_{2},\cdots,q^{\prime}_{n-1},0), (11)
Lemma 21 (Adapted from [MPP18], Appendix D).

For any continuous FTRL dynamic and for any 𝐱Δn\mathbf{x}^{*}\in\Delta^{n}, let LL be its finitely lossless storage function defined in (8), and let L¯\overline{L} be the function defined on n1\mathbb{R}^{n-1} as in (11). Then any level set of L¯\overline{L} is bounded in n1\mathbb{R}^{n-1}, i.e. for any real number c¯\bar{c}, the set below is bounded:

{(q1,,qn1)|L¯(q1,,qn1)c¯}.\{~{}(q^{\prime}_{1},\cdots,q^{\prime}_{n-1})~{}\big{|}~{}\overline{L}(q^{\prime}_{1},\cdots,q^{\prime}_{n-1})\leq\bar{c}~{}\}.

Recall the definition of FTRL and Theorem 7. For each agent ii, suppose she uses a convex combination of i\ell_{i} FTRL dynamics indexed by i1,i2,,iii1,i2,\cdots,i\ell_{i}. Let the storage functions of these FTRL dynamics be Li1,Li2,,LiiL^{i1},L^{i2},\cdots,L^{i\ell_{i}}. Also, let 𝐪,i\mathbf{q}^{\prime,i} denote a vector in ni1\mathbb{R}^{n_{i}-1} for agent ii. Then the storage function of the whole dynamical game system is

i=1mj=1iαijL¯ij(𝐪,i),whereαij>0,andi,j=1iαij=1.\sum_{i=1}^{m}\sum_{j=1}^{\ell_{i}}\alpha_{ij}\cdot\overline{L}^{ij}(\mathbf{q}^{\prime,i}),~{}~{}~{}\text{where}~{}\alpha_{ij}>0,~{}\text{and}~{}\forall i,~{}\sum_{j=1}^{\ell_{i}}\alpha_{ij}=1.

By Corollary 20, this storage function is a constant of motion when 𝐫0\mathbf{r}\equiv 0, and thus it is bounded by some constant c¯\bar{c} once the starting point is fixed. Since every LijL^{ij}, and hence every L¯ij\overline{L}^{ij}, has infimum zero, we must have, for each agent ii, αi1L¯i1(𝐪,i)c¯\alpha_{i1}\cdot\overline{L}^{i1}(\mathbf{q}^{\prime,i})\leq\bar{c}, and hence L¯i1(𝐪,i)c¯/αi1\overline{L}^{i1}(\mathbf{q}^{\prime,i})\leq\bar{c}/\alpha_{i1}. Then by Lemma 21, for each agent ii, 𝐪,i(t)\mathbf{q}^{\prime,i}(t) remains bounded for all tt, and thus the overall vector 𝐪^(t)=(𝐪,1(t),𝐪,2(t),,𝐪,m(t))\hat{\mathbf{q}}^{\prime}(t)=(\mathbf{q}^{\prime,1}(t),\mathbf{q}^{\prime,2}(t),\cdots,\mathbf{q}^{\prime,m}(t)) also remains bounded for all tt.

Appendix D Escort Learning Dynamics

An escort learning dynamic [Har11] is a system of differential equations on variable 𝐱Δn\mathbf{x}\in\Delta^{n}: for each 1jn1\leq j\leq n,

x˙j=ϕj(xj)[pj=1nϕ(x)p=1nϕ(x)],\dot{x}_{j}~{}=~{}\phi_{j}(x_{j})\cdot\left[p_{j}-\frac{\sum_{\ell=1}^{n}\phi_{\ell}(x_{\ell})\cdot p_{\ell}}{\sum_{\ell=1}^{n}\phi_{\ell}(x_{\ell})}\right],

where each ϕj\phi_{j} is a positive function on the domain (0,1)(0,1). Note that when ϕj(xj)=xj\phi_{j}(x_{j})=x_{j}, this is the Replicator Dynamic.
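As a small sanity check, added here as an illustration (it is not taken from [Har11]), the sketch below verifies numerically that the escort vector field is tangent to the simplex: the components of ẋ sum to zero, so the dynamic never leaves the simplex. Taking ϕ(x)=x recovers the Replicator Dynamic, while ϕ(x)=x² is another admissible escort.

```python
# A minimal check (illustrative; not from the paper) that the escort vector field
# is tangent to the simplex: the entries of x_dot sum to zero for any payoff vector.
# phi(x) = x recovers the Replicator Dynamic; phi(x) = x**2 is another valid escort.
import numpy as np

def escort_rhs(x, p, phi):
    w = phi(x)                               # escort weights phi_j(x_j)
    return w * (p - np.dot(w, p) / w.sum())  # x_dot of the escort learning dynamic

rng = np.random.default_rng(0)
for phi in (lambda x: x, lambda x: x ** 2):
    x = rng.dirichlet(np.ones(4))            # random interior point of the simplex
    p = rng.normal(size=4)                   # arbitrary instantaneous payoff vector
    print(round(escort_rhs(x, p, phi).sum(), 12))   # 0.0 up to rounding
```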

Proposition 22.

Suppose a learning dynamic has the following property: if it starts at a point in the interior of Δn\Delta^{n}, then it stays in the interior forever. Then the learning dynamic is an FTRL dynamic via a separable strictly convex regularizer function h(𝐱)=i=1nhi(𝐱i)h(\mathbf{x})=\sum_{i=1}^{n}h_{i}(\mathbf{x}_{i}) if and only if it is an escort replicator dynamic.

Proof.

If the specified learning dynamic is FTRL, recall that the conversion function takes 𝐪\mathbf{q} as input and outputs the mixed strategy

argmax𝐱Δn{𝐪,𝐱h(𝐱)},\operatorname*{arg\,max}_{\mathbf{x}\in\Delta^{n}}\left\{\left\langle\mathbf{q},\mathbf{x}\right\rangle-h(\mathbf{x})\right\},

which we denote by 𝐱(𝐪)\mathbf{x}(\mathbf{q}) in this proof. Let x¯j=1/hj′′(xj)\bar{x}_{j}=1/h_{j}^{\prime\prime}(x_{j}) and H:=jx¯jH:=\sum_{j}\bar{x}_{j}. When 𝐱(𝐪)\mathbf{x}(\mathbf{q}) is in the interior of Δn\Delta^{n} for some 𝐪\mathbf{q}, by Appendix D of [CP19], we have

xjqj=x¯j[x¯j]2Handj,xjq=x¯jx¯H.\frac{\partial x_{j}}{\partial q_{j}}~{}=~{}\bar{x}_{j}-\frac{[\bar{x}_{j}]^{2}}{H}~{}~{}~{}~{}~{}~{}~{}~{}\text{and}~{}~{}~{}~{}~{}~{}~{}~{}\forall\ell\neq j,~{}~{}\frac{\partial x_{j}}{\partial q_{\ell}}~{}=~{}-\frac{\bar{x}_{j}\bar{x}_{\ell}}{H}.

By the chain rule,

x˙j=[x¯j[x¯j]2H]pj+j[x¯jx¯H]p=x¯j(pj=1nx¯p=1nx¯).\dot{x}_{j}~{}=~{}\left[\bar{x}_{j}-\frac{[\bar{x}_{j}]^{2}}{H}\right]\cdot p_{j}~{}+~{}\sum_{\ell\neq j}\left[-\frac{\bar{x}_{j}\bar{x}_{\ell}}{H}\right]p_{\ell}~{}=~{}\bar{x}_{j}\left(p_{j}-\frac{\sum_{\ell=1}^{n}\bar{x}_{\ell}p_{\ell}}{\sum_{\ell=1}^{n}\bar{x}_{\ell}}\right).

By identifying ϕj(xj)\phi_{j}(x_{j}) with x¯j\bar{x}_{j}, we see that the FTRL dynamic is an escort replicator dynamic. Precisely, we set ϕj(xj)=1/hj′′(xj)\phi_{j}(x_{j})=1/h_{j}^{\prime\prime}(x_{j}). Since hh is strictly convex, hj′′h_{j}^{\prime\prime} is a positive function, hence ϕj\phi_{j} is a positive function too.

Conversely, if the specified algorithm is an escort learning dynamic with escort function ϕj\phi_{j} for each jj, to show that it is an FTRL dynamic with some strictly convex regularizer hh, we set hh to be separable with, for each jj, hj′′(xj)=1/ϕj(xj)h_{j}^{\prime\prime}(x_{j})=1/\phi_{j}(x_{j}). Thus, it suffices to set hjh_{j} to be any double anti-derivative of 1/ϕj1/\phi_{j}. Since hj′′(xj)=1/ϕj(xj)>0h_{j}^{\prime\prime}(x_{j})=1/\phi_{j}(x_{j})>0, each hjh_{j} is strictly convex, and hence hh is strictly convex. ∎
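For concreteness, here are two standard instances of this correspondence, added for illustration and following directly from the rule \phi_{j}=1/h_{j}^{\prime\prime}: the entropic regularizer recovers the Replicator Dynamic, while the log-barrier regularizer yields a quadratic escort function:

h_{j}(x_{j})=x_{j}\ln x_{j}~\Rightarrow~h_{j}^{\prime\prime}(x_{j})=1/x_{j}~\Rightarrow~\phi_{j}(x_{j})=x_{j}\quad\text{(Replicator Dynamic)};

h_{j}(x_{j})=-\ln x_{j}~\Rightarrow~h_{j}^{\prime\prime}(x_{j})=1/x_{j}^{2}~\Rightarrow~\phi_{j}(x_{j})=x_{j}^{2}\quad\text{(log-barrier regularizer)}.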

Appendix E More Plots Illuminating Poincaré Recurrences

Finally, we present more plots below that illuminate Poincaré recurrences of learning in games.

Figure 6: Poincaré recurrences of Replicator Dynamics (RD; top) and Online Gradient Descent (OGD; bottom) in the classical two-agent Rock-Paper-Scissors game. Agent 1 starts with mixed strategy 𝐱1(0)=(0.5,0.25,0.25)\mathbf{x}_{1}(0)=(0.5,0.25,0.25), while agent 2 starts with mixed strategy 𝐱2(0)=(0.6,0.3,0.1)\mathbf{x}_{2}(0)=(0.6,0.3,0.1). The two graphs plot the logarithm of the Euclidean distance between (𝐱1(t),𝐱2(t))(\mathbf{x}_{1}(t),\mathbf{x}_{2}(t)) and (𝐱1(0),𝐱2(0))(\mathbf{x}_{1}(0),\mathbf{x}_{2}(0)), from t=0.1t=0.1 to t=500t=500. Every downward spike corresponds to a moment when the flow returns close to the starting point. In both cases, the distance drops below 10310^{-3} multiple times.
Figure 7: Poincaré recurrence of αRD+(1α)OGD\alpha\cdot\textsf{RD}+(1-\alpha)\cdot\textsf{OGD} in the classical Rock-Paper-Scissors game, for α=1/4\alpha=1/4 (top), α=1/2\alpha=1/2 (middle) and α=3/4\alpha=3/4 (bottom). The starting point is (𝐱1(0),𝐱2(0))=((0.5,0.25,0.25),(0.6,0.3,0.1))(\mathbf{x}_{1}(0),\mathbf{x}_{2}(0))=((0.5,0.25,0.25),(0.6,0.3,0.1)). The graphs plot the logarithm of the Euclidean distance between (𝐱1(t),𝐱2(t))(\mathbf{x}_{1}(t),\mathbf{x}_{2}(t)) and (𝐱1(0),𝐱2(0))(\mathbf{x}_{1}(0),\mathbf{x}_{2}(0)), from t=0.1t=0.1 to t=500t=500.
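For readers who wish to reproduce such plots, below is a minimal sketch (our illustration, not the authors' plotting code) that integrates the two-agent Replicator Dynamics in Rock-Paper-Scissors from the starting point above and plots the logarithm of the distance to the starting point, in the spirit of the top panel of Figure 6.

```python
# A minimal sketch (illustrative; not the authors' code) of a recurrence plot in
# the spirit of Figure 6 (top): Replicator Dynamics in Rock-Paper-Scissors.
import numpy as np
import matplotlib.pyplot as plt
from scipy.integrate import solve_ivp

A = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]], dtype=float)  # RPS payoff matrix of agent 1

def rhs(t, z):
    x1, x2 = z[:3], z[3:]
    p1, p2 = A @ x2, -A.T @ x1                   # instantaneous payoffs (zero-sum game)
    d1 = x1 * (p1 - np.dot(x1, p1))              # Replicator Dynamic for agent 1
    d2 = x2 * (p2 - np.dot(x2, p2))              # Replicator Dynamic for agent 2
    return np.concatenate([d1, d2])

z0 = np.array([0.5, 0.25, 0.25, 0.6, 0.3, 0.1])  # starting point used in Figure 6
ts = np.linspace(0.1, 500.0, 20000)
sol = solve_ivp(rhs, (0.0, 500.0), z0, t_eval=ts, rtol=1e-9, atol=1e-9)
dist = np.linalg.norm(sol.y.T - z0, axis=1)      # Euclidean distance to the starting point

plt.plot(ts, np.log10(dist))
plt.xlabel("time t")
plt.ylabel("log10 of distance to the starting point")
plt.show()
```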