
Online Optimization in Games via Control Theory:
Connecting Regret, Passivity and Poincaré Recurrence

Yun Kuen Cheung
Royal Holloway
University of London
   Georgios Piliouras
Singapore University of
Technology and Design
Abstract

We present a novel control-theoretic understanding of online optimization and learning in games, via the notion of passivity. Passivity is a fundamental concept in control theory, which abstracts energy conservation and dissipation in physical systems. It has become a standard tool in the analysis of general feedback systems, to which game dynamics belong. Our starting point is to show that all continuous-time Follow-the-Regularized-Leader (FTRL) dynamics, which include the well-known Replicator Dynamic, are lossless, i.e. passive with no energy dissipation. Interestingly, we prove that passivity implies bounded regret, connecting two fundamental primitives of control theory and online optimization.

The observation of energy conservation in FTRL inspires us to present a family of lossless learning dynamics, each of which has an underlying energy function with a simple gradient structure. This family is closed under convex combination; as an immediate corollary, any convex combination of FTRL dynamics is lossless and thus has bounded regret. This allows us to extend the framework of Fox and Shamma [FS13] to prove not just global asymptotic stability results for game dynamics, but Poincaré recurrence results as well. Intuitively, when a lossless game (e.g. graphical constant-sum game) is coupled with lossless learning dynamics, their feedback interconnection is also lossless, which results in a pendulum-like energy-preserving recurrent behavior, generalizing [PS14, MPP18].

Lossless Learning Dynamic | Physical System
  • examples: Replicator Dynamic, Follow-the-Regularized-Leader (FTRL) dynamics | gravity
  • state: $\mathbf{q} = \int \mathbf{p}(t)\,\mathsf{d}t$ = cumulative payoffs | $h$ = vertical height
  • energy: $E(\mathbf{q})$ = storage function | $V(h)$ = (negative) potential energy
  • gradient of energy: $\mathbf{x} = \nabla E(\mathbf{q})$ = mixed strategies of agents | $F = \nabla V(h)$ = gravitational force
  • invariant property: convex combination (CC): any CC of storage functions produces a lossless learning dynamic that yields the same CC of mixed strategies | linear combination (LC): any LC of potential energy functions is a potential energy function that yields the same LC of gravitational forces
  • another analogue: $\mathbf{p}$ = instantaneous payoffs | $v$ = velocity
  • change of energy value: $\int \left\langle \mathbf{x},\mathbf{p} \right\rangle \mathsf{d}t$ | $\int \left\langle F,v \right\rangle \mathsf{d}t$

Figure 1: Analogues between a lossless online learning dynamic and a physical system. The evolution of a system of learning dynamics can be thought of as capturing the movements of particles; thus tools from control theory can find direct application in the study of learning dynamics in games.

1 Introduction

Online optimization aims at designing algorithms that can maximize performance in unpredictable and even adversarially evolving environments. The standard benchmark for success in these environments is minimizing regret, which is defined as the difference between the accumulated performance of the algorithm and that of the best action in hindsight. One of the most important achievements of the field has been to establish that such regret minimizing algorithms exist [CBL06, SS12]. Amongst the most well-known such algorithms is the class of Follow-the-Regularized-Leader (FTRL) algorithms, which include as special cases ubiquitous meta-algorithms such as Multiplicative Weights Update [FS99, AHK05] and Gradient Descent [SS12]. It is well known that such algorithms can achieve $\mathcal{O}(\sqrt{T})$ regret by employing slowly decreasing step-sizes, and that this bound is effectively optimal given arbitrary payoff sequences. When applying such algorithms in games, as the sequence of payoffs becomes more predictable, stronger regret guarantees are possible [RS13, FLST16, SALS15, BP19]. In the continuous-time model, FTRL dynamics are once again optimal, achieving bounded regret in general settings [KM17, MPP18, BGP20]. Hence, both from the perspective of optimization as well as game theory, FTRL dynamics constitute effectively an optimal choice.

Control theory, on the other hand, is motivated by a seemingly unrelated set of questions. It aims to develop methodologies for stabilizing complex processes and machines. Due to its intrinsic connections to real-world systems, control theory revolves around concepts with a strong grounding in physical systems. A fundamental property of numerous physical systems is passivity, which is typically defined in terms of energy dissipation, conservation and transformation [Wil72a, Wil72b, OPNSR13]. Passivity is an "input-output" property of a system, and expresses that a system which is supplied with bounded energy can only output bounded energy. Passive systems come equipped with a storage function that accumulates the supplied energy, perhaps with some loss (cf. energy loss due to friction in mechanical systems). Overall, passivity encodes a useful notion of stability, since such a system cannot explode into unpredictable, out-of-control motion, as that would correspond to unbounded energy output.

Although the fields of online optimization and control theory are both well developed with long and distinct histories, their interconnection is still rather nascent. Online algorithms can be abstractly thought of as input-output operators where the input is a stream of payoffs, and the output is a stream of behavioral outcomes. Both regret and passivity are properties of such input-output algorithms/operators, and both encode a notion of predictability and stability around a reference frame. In regret, the reference frame is given by the cumulative payoff of past actions; in passivity, by energy level sets. This raises our first set of questions:

Are there formal connections between regret and passivity? Moreover, can we interpret the optimal regret of FTRL dynamics from a passivity perspective? Are there similarly optimal learning dynamics / input-output operators?

Any formal connection across the two fields is clearly valuable, as it allows for a fusion of ideas and methodologies between two well-developed fields and expedites progress on areas of interest common to both, such as game theory [FL98, CBL06, MS15, MS18]. For related issues on the intersection of learning, control theory and games, see [Sha20] and the references therein.

Notably, Fox and Shamma [FS13] proposed a control-theoretic framework for analyzing learning in games. One of their key contributions is to identify a modular approach, where an analysis can be performed by studying a learning operator (which converts payoff input to strategy output) and a game operator (which converts strategy input to payoff output) independently, while the whole game dynamic is a feedback interconnection system of the two operators (see Figure 2). By focusing on coupling learning heuristics that are strictly passive with passive games (e.g. zero-sum games), the resulting strictly passive systems were shown to converge to equilibria, generalizing and unifying numerous prior results, e.g. [HS09].

The modular approach has allowed numerous works which study learning or game operators separately [MS16, PSM18, Mab18, GP19]. Despite this progress, settings of critical importance for AI, such as understanding the perfectly recurrent non-equilibrating behaviors of Gradient Descent (and other FTRL dynamics) in zero-sum games, have so far remained outside the reach of these techniques [PS14, MPP18, BRM+18, VGFP19, PML+20]. This raises our second question:

Can the pendulum-like cyclic behavior of FTRL dynamics in zero-sum games be understood and generalized via passivity?

Figure 2: A feedback interconnection system that captures a game dynamic, by interconnecting a learning operator and a game operator. $\hat{\mathbf{r}}$ is a (random) perturbation to the payoffs; in this paper, we only consider the case where $\hat{\mathbf{r}}$ is the zero function.

Our Contributions.

We provide affirmative answers to both questions raised above. We show that any finitely passive learning dynamic guarantees constant regret (Theorem 8). (In the control theory literature, a passive operator usually has a storage function with a finite lower bound. In this paper, storage functions may or may not have a finite lower bound, depending on the type of operator; we emphasize operators whose storage function has a finite lower bound as being finitely passive. For the definition of a finitely passive learning dynamic, see Definition 5.) By using the notion of convex conjugate from convex analysis, we show that any continuous-time FTRL dynamic is finitely passive and lossless (Theorem 7); the same holds for certain escort replicator dynamics [Har11] (Appendix D). These results generalize [Mab18], which showed that Replicator Dynamic is finitely lossless. Combining the two theorems above immediately recovers the result in [MPP18] that any FTRL dynamic guarantees constant regret. We note that in the analysis of Mabrok [Mab18], the state space (i.e. the domain of the storage function) is the space of mixed strategies, while we turn to a new state space of cumulative payoffs for FTRL. This choice is crucial for the generalization to FTRL dynamics, and it permits a cleaner proof via tools established in convex analysis.

A key observation that enables the above results is that every FTRL dynamic admits a storage function with a simple gradient structure, described formally in Section 5. This motivates us to study a new family of lossless learning dynamics, which is in one-to-one correspondence with the family of storage functions possessing that gradient structure. By observing that such storage functions are closed under convex combination, we discover that any convex combination of FTRL dynamics is finitely passive and lossless, and thus guarantees constant regret (Theorem 16). "Convex combination of FTRL dynamics" means the following: suppose there are $k$ FTRL dynamics, indexed from $1$ to $k$. The $j$-th one converts the cumulative payoffs at time $t$ to a mixed strategy $\mathbf{x}_j^t$. A convex combination of these FTRL dynamics converts the cumulative payoffs at time $t$ to the mixed strategy $\sum_{j=1}^k \alpha_j \cdot \mathbf{x}_j^t$, where the $\alpha_j$'s are positive constants satisfying $\sum_{j=1}^k \alpha_j = 1$.

Convex combinations of lossless dynamics are directly analogous to linear combinations of conservative vector fields in analyzing physical dynamics (see Figure 1). This technique is also of practical relevance, since we might want to mix-and-match different dynamics to elicit their advantages. For instance, different learning dynamics may lean toward either exploitation or exploration. By combining them via convex combination with our own choice of $\alpha_j$'s, we can control our desired balance between exploitation and exploration (see Example 14).

We also show that every graphical constant-sum game (e.g. two-person zero-sum game) that admits a fully-mixed Nash equilibrium corresponds to a finitely lossless game operator (Proposition 17). Thus, the game dynamic of any convex combination of FTRL dynamics in such a game corresponds to a finitely lossless operator. We use this observation to show that the game dynamic is almost perfectly recurrent, via the notion of Poincaré recurrence (Theorem 19). This distinguishes our work from [FS13] and its subsequent works on learning in games, as they mostly concern stability/convergence, while we study recurrence.

Roadmap.

In Sections 2 and 3, we present the necessary background for this work, including the definitions of different operators, the notions of (lossless) passivity and storage function, and some basic results about passivity. In Section 4, we show the desirable properties (finite losslessness, constant regret) of FTRL dynamics. In Section 5, we present a characterization of lossless learning dynamics via the above-mentioned gradient structure of storage functions, and we discuss some properties of convex combinations of such learning dynamics. The results about Poincaré recurrence of learning in graphical constant-sum games are presented in Section 6. All missing proofs can be found in the appendix.

2 Preliminary

In this section, we define the operators depicted in Figure 2. We first define a learning dynamic and its corresponding learning operator. When there are multiple agents and each agent uses one learning dynamic, their learning operators can be concatenated naturally to form a merged learning operator. Then we define the game operator, whose interconnection with a merged learning operator in the manner of Figure 2 is called a dynamical game system.

We use bold lower-case letters to denote vector variables. Let $\mathbb{R}^+ := [0,+\infty)$ denote the set of non-negative real numbers. In this paper, every function from $\mathbb{R}^+$ to $\mathbb{R}^d$ is assumed to be square integrable, and we call it a function-of-time. An (input-output) operator is a mapping whose input and output are both functions-of-time. Let $\Delta^n$ denote the probability simplex over $n$ actions, i.e. $\Delta^n := \{(x_1,\ldots,x_n) ~|~ \sum_{j=1}^n x_j = 1; ~\text{for}~ 1\leq j\leq n,~ x_j \geq 0\}$. $\left\langle \mathbf{a},\mathbf{b} \right\rangle$ denotes the inner product of two vectors $\mathbf{a},\mathbf{b}$ of the same dimension $d$, i.e. $\left\langle \mathbf{a},\mathbf{b} \right\rangle = \sum_{j=1}^d a_j b_j$.

Learning Dynamic and Learning Operator.

We focus on the following type of continuous-time learning dynamics. An agent has an action set $A$; let $n := |A|$. The process starts at time 0. For any time $t \geq 0$ and for each action $j \in A$, the agent computes the cumulative payoff she would have received by playing action $j$ throughout the time interval $[0,t]$, denoted by $q_j(t)$. Formally, for each action $j$, let $p_j(\tau)$ denote the (instantaneous) payoff to action $j$ at time $\tau$. Then

q_j(t) := q_j^0 + \int_0^t p_j(\tau)\,\mathsf{d}\tau,

where $q_j^0$ is a constant chosen at the beginning of the process. Let $\mathbf{q}(t) := (q_1(t),\cdots,q_n(t))$. For any $t \geq 0$, the agent uses a conversion function $f$ which takes $\mathbf{q}(t)$ as input, and outputs a mixed strategy $\mathbf{x}(t) \in \Delta^n$ over the $n$ actions. The process can be expressed compactly as an ordinary differential equation (ODE) system:

\mathbf{q}(0) = \mathbf{q}^0    (Initial condition)
\dot{\mathbf{q}}(t) = \mathbf{p}(t)    (Cumulative payoff/state update)    (1)
\mathbf{x}(t) = f(\mathbf{q}(t)).    (Behavioral/strategy output)

When $\mathbf{q}^0 = (q_1^0,\ldots,q_n^0)$ is already given, the conversion function $f$ specifies the learning dynamic. The learning dynamic can be viewed as an operator which takes the function-of-time $\mathbf{p} : \mathbb{R}^+ \rightarrow \mathbb{R}^n$ as input, and outputs another function-of-time $\mathbf{x} : \mathbb{R}^+ \rightarrow \Delta^n$.
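
To make the operator view concrete, the following is a minimal Python sketch (our own illustration, not from the paper) that discretizes the ODE system (1) with a forward-Euler step: it consumes a payoff stream and produces a strategy stream through a user-supplied conversion function. The function names and step size are illustrative choices.

```python
import numpy as np

def simulate_learning_dynamic(f, p, q0, dt=0.01, T=10.0):
    """Forward-Euler discretization of the ODE system (1).

    f  : conversion function mapping cumulative payoffs q -> mixed strategy x
    p  : payoff stream, a function of time t returning a vector in R^n
    q0 : initial cumulative payoffs q^0
    Returns the trajectories of q(t) and x(t) at the grid points.
    """
    q = np.array(q0, dtype=float)
    ts = np.arange(0.0, T, dt)
    qs, xs = [], []
    for t in ts:
        x = f(q)              # strategy output: x(t) = f(q(t))
        qs.append(q.copy())
        xs.append(x)
        q = q + dt * p(t)     # state update: dq/dt = p(t)
    return ts, np.array(qs), np.array(xs)
```

For instance, plugging in the logit choice map (2) below as `f` yields a discretized Replicator Dynamic trajectory.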

Regret.

For any $\mathbf{p} : \mathbb{R}^+ \rightarrow \mathbb{R}^n$, the regret of a learning dynamic at time $T > 0$ is

\left( \max_{j\in A} \int_0^T p_j(\tau)\,\mathsf{d}\tau \right) - \int_0^T \left\langle \mathbf{p}(\tau), \mathbf{x}(\tau) \right\rangle \mathsf{d}\tau.

We say a learning dynamic guarantees constant regret if for any $\mathbf{p} : \mathbb{R}^+ \rightarrow \mathbb{R}^n$ and any $T > 0$, the regret at time $T$ is bounded from above by a constant that depends on $\mathbf{q}^0$ only.

Replicator Dynamic and FTRL Learning Dynamics.

When the conversion function $f$ is the logit choice map

f_{\text{RD}}(\mathbf{q}) = \left( \frac{\exp(q_1)}{\mathcal{N}}, \frac{\exp(q_2)}{\mathcal{N}}, \cdots, \frac{\exp(q_n)}{\mathcal{N}} \right),    (2)

where $\mathcal{N} = \sum_{j=1}^n \exp(q_j)$, the learning dynamic (1) is equivalent to the well-known Replicator Dynamic [HS98, San10], which is the continuous-time analogue of Multiplicative Weights Update.

A FTRL learning dynamic is specified by a strictly convex regularizer function $h : \Delta^n \rightarrow \mathbb{R}$, which determines the conversion function $f$ as follows:

f(\mathbf{q}) = \operatorname*{arg\,max}_{\mathbf{x}\in\Delta^n} \left\{ \left\langle \mathbf{q},\mathbf{x} \right\rangle - h(\mathbf{x}) \right\}.    (3)

It is known that Replicator Dynamic is a special case of FTRL, obtained by setting $h(\mathbf{x}) = \sum_{j=1}^n x_j \log x_j$. Online Gradient Descent (OGD) is another commonly studied learning dynamic that is also a special case of FTRL, with the $L^2$ regularizer $h(\mathbf{x}) = \frac{1}{2}\sum_{j=1}^n (x_j)^2$ [H+16]. When $n = 2$, the conversion function of OGD is

f_{\text{OGD}}(\mathbf{q}) = \begin{cases} (1,0) & \text{if } q_1 - q_2 \geq 1; \\ \left( \frac{q_1-q_2+1}{2}, \frac{q_2-q_1+1}{2} \right) & \text{if } 1 > q_1 - q_2 > -1; \\ (0,1) & \text{if } q_1 - q_2 \leq -1. \end{cases}    (4)
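
As a concrete illustration, here is a minimal Python sketch (our own, not part of the paper) of the two conversion functions above: the logit choice map (2) and the two-action OGD map (4). It can be plugged directly into the Euler simulation sketched earlier.

```python
import numpy as np

def f_rd(q):
    """Logit choice map (2): softmax of the cumulative payoffs."""
    z = np.exp(q - np.max(q))        # subtract max for numerical stability
    return z / z.sum()

def f_ogd_2actions(q):
    """OGD conversion function (4) for n = 2 actions."""
    d = q[0] - q[1]
    if d >= 1:
        return np.array([1.0, 0.0])
    if d <= -1:
        return np.array([0.0, 1.0])
    return np.array([(d + 1) / 2, (1 - d) / 2])

# Evaluation at q = (0.6, 0), the point used in Example 14 later in the paper.
print(f_rd(np.array([0.6, 0.0])))            # approx (0.6457, 0.3543)
print(f_ogd_2actions(np.array([0.6, 0.0])))  # (0.8, 0.2)
```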

Merged Learning Operator.

When a system has $m \geq 2$ agents, and each agent uses a learning dynamic, we concatenate the corresponding learning operators together to form a merged learning operator (MLO). Precisely, the input to the MLO is $\hat{\mathbf{p}} = (\mathbf{p}^1,\cdots,\mathbf{p}^m)$, and its output is $\hat{\mathbf{x}} = (\mathbf{x}^1,\cdots,\mathbf{x}^m)$, where $\mathbf{x}^i$ is the output of the learning operator of agent $i$ when its input is $\mathbf{p}^i$. In this paper, we use hat notation (e.g. $\hat{\mathbf{p}},\hat{\mathbf{x}}$) to denote variables formed by such concatenations of the variables of individual agents.

Game Operator and Dynamical Game System.

Here, we provide a general definition of game operators (as in Figure 2), and leave the discussion about graphical constant-sum games, which appear in our Poincaré recurrence results, to Section 6.

A game has $m$ agents. Each agent $i$ has $n_i$ actions. After each agent chooses a mixed strategy over her own actions, the game determines a payoff vector $\hat{\mathbf{p}} \in \mathbb{R}^{n_1}\times\cdots\times\mathbb{R}^{n_m}$, where $\hat{p}_{k\ell}$ is the payoff to action $\ell$ of agent $k$. Let $\Delta := \Delta^{n_1}\times\cdots\times\Delta^{n_m}$, and $\mathcal{P} := \mathbb{R}^{n_1}\times\cdots\times\mathbb{R}^{n_m}$. We can think of the game as a function $G : \Delta \rightarrow \mathcal{P}$. Its game operator takes a function-of-time $\hat{\mathbf{x}} : \mathbb{R}^+ \rightarrow \Delta$ as input, and it outputs a function-of-time $\hat{\mathbf{p}} : \mathbb{R}^+ \rightarrow \mathcal{P}$, where $\hat{\mathbf{p}}(t) = G(\hat{\mathbf{x}}(t))$ for all $t \geq 0$.

A dynamical game system (DGS) comprises $m$ agents. The game operator has input $\hat{\mathbf{x}}$ and output $\hat{\mathbf{p}}$. Each agent uses a learning dynamic of the form (1). The agents' learning operators are concatenated to form a MLO, which has input $\hat{\mathbf{p}} = (\mathbf{p}^1,\cdots,\mathbf{p}^m)$ and output $\hat{\mathbf{x}} = (\mathbf{x}^1,\cdots,\mathbf{x}^m)$ when $\hat{\mathbf{r}} \equiv \mathbf{0}$. The MLO is interconnected with the game operator in the manner of Figure 2.

3 Passivity

To motivate the notions of passivity and energy, consider an electrical network connected to a power source, where the voltage and current across the network at time $\tau$ are respectively $v(\tau)$ and $i(\tau)$. Let $E(t)$ denote the energy stored in the network at time $t$. We have $E(t) \leq E(0) + \int_0^t v(\tau)\cdot i(\tau)\,\mathsf{d}\tau$. The reason for the inequality (rather than an exact equality) is that energy might dissipate from the network. In this setting, the function-of-time $v$ is the input, while the function-of-time $i$ is the output, so the network is indeed an operator; and as we shall see, this operator is passive.

Passivity of State Space System.

To generalize the above idea to passivity of an ODE system, we need several mathematical notations. Let $\mathbb{L}_2$ denote the Hilbert space of square integrable functions mapping $\mathbb{R}^+$ to $\mathbb{R}^n$ with inner product $\left\langle f,g \right\rangle_T := \int_0^T \left\langle f(t),g(t) \right\rangle \mathsf{d}t$. Let $\mathbb{L}_{2,e} := \{ f : \mathbb{R}^+ \rightarrow \mathbb{R}^n ~|~ \left\langle f,f \right\rangle_T < \infty ~\text{for all } T\in\mathbb{R}^+ \}$. An (input-output) operator is simply a mapping $S : \mathbb{U} \rightarrow \mathbb{Y}$, where $\mathbb{U},\mathbb{Y} \subset \mathbb{L}_{2,e}$.

We consider the following type of operators, which can be represented by an ODE system, called a state space system (SSS), of the following general form:

\mathbf{z}(0) = \mathbf{z}^0;
\dot{\mathbf{z}}(t) = g_1(\mathbf{z}(t),\mathbf{u}(t));    (5)
\mathbf{y}(t) = g_2(\mathbf{z}(t),\mathbf{u}(t)),

where $\mathbf{z}^0 \in \mathcal{Z} \subset \mathbb{R}^{d_1}$, $\mathbf{z} : \mathbb{R}^+ \rightarrow \mathcal{Z}$, and $\mathbf{u},\mathbf{y} : \mathbb{R}^+ \rightarrow \mathbb{R}^{d_2}$. The set $\mathcal{Z}$ is called the set of states. As the notation suggests, $\mathbf{u}$ and $\mathbf{y}$ are the input and output of this operator respectively. When $\mathbf{u}$ is fed into the operator, the first two equalities define a well-posed ODE system, so a unique solution $\mathbf{z}$ exists under mild conditions. Then $\mathbf{y}$ is the output determined by the function $g_2$ of the unique solution $\mathbf{z}$ and the input $\mathbf{u}$. The learning dynamic (1) is such an operator, obtained by viewing $\mathbf{q},\mathbf{p},\mathbf{x},f(\mathbf{q}(t))$ in (1) as $\mathbf{z},\mathbf{u},\mathbf{y},g_2(\mathbf{z}(t),\mathbf{u}(t))$ in (5) respectively. We are ready to present the definition of passivity for such operators.

Definition 1.

A SSS is passive if there exists a storage function $L : \mathcal{Z} \rightarrow \mathbb{R}$ such that for all $\mathbf{z}^0 \in \mathcal{Z}$, $t \in \mathbb{R}^+$ and all input-output pairs $\mathbf{u}\in\mathbb{U},~\mathbf{y}\in\mathbb{Y}$, we have

L(\mathbf{z}(t)) ~\leq~ L(\mathbf{z}^0) + \int_0^t \left\langle \mathbf{u}(\tau),\mathbf{y}(\tau) \right\rangle \mathsf{d}\tau.    (6)

If the equality always holds, then we say the SSS is lossless passive or simply lossless. If a SSS is passive (resp. lossless) via a storage function LL that has a finite lower bound, we say it is finitely passive (resp. finitely lossless).

We note that inequality (6) is reminiscent of the inequality in the motivating example of the electrical network.

When a SSS is finitely passive/lossless, we may assume, without loss of generality, that the finite lower bound of its storage function is zero. We make this assumption on all finitely passive/lossless SSS in the rest of this paper.

Figure 3: A FIC system which includes two operators $S_1$ and $S_2$. We assume $\mathbf{r} \equiv \mathbf{0}$. $S_1$ is a merged learning operator which converts payoffs to mixed strategies of agents with a shift $\hat{\mathbf{x}}^*$, while $S_2$ is a game operator which converts the mixed strategies of agents (with shift $\hat{\mathbf{x}}^*$) to the negation of payoffs.

Feedback Interconnection System.

In Figure 2, we presented a feedback interconnection (FIC) system. While that figure gives an intuitive first understanding of the concept from control theory, for our analysis it is more convenient to use Figure 3. The FIC system in Figure 3 consists of two SSS $S_1 : \mathbb{U} \rightarrow \mathbb{Y}$ and $S_2 : \mathbb{Y} \rightarrow \mathbb{U}$, with an external input source $\mathbf{r} \in \mathbb{U}$, while its output is $\mathbf{y}_1 \in \mathbb{Y}$; note that the FIC system is an operator by definition. The variables are related via: $\mathbf{u}_1 = \mathbf{r} - \mathbf{y}_2$, $\mathbf{y}_1 = S_1(\mathbf{u}_1)$, $\mathbf{u}_2 = \mathbf{y}_1$, and $\mathbf{y}_2 = S_2(\mathbf{u}_2)$.

An important property of passive operators is that passivity is composable, i.e. the composition of two passive operators results in a passive system. Intuitively, if no operator in the total system is able to produce energy, then the system as a whole cannot produce energy either. The following theorem formally captures this intuition for FIC systems.

Theorem 2 ([FS13] Theorem 3.2).

Consider the FIC system in Figure 3. Suppose that for $i = 1,2$, $S_i$ is passive via storage function $L_i$. Then the FIC system is a passive operator via storage function $L_1 + L_2$. Precisely, for any $\mathbf{z}_1^0 \in \mathcal{Z}_1$, $\mathbf{z}_2^0 \in \mathcal{Z}_2$ and $t \in \mathbb{R}^+$,

L_1(\mathbf{z}_1(t)) + L_2(\mathbf{z}_2(t)) ~\leq~ L_1(\mathbf{z}_1^0) + L_2(\mathbf{z}_2^0) + \int_0^t \left\langle \mathbf{r}(\tau),\mathbf{y}_1(\tau) \right\rangle \mathsf{d}\tau.

If $S_1,S_2$ are lossless, then the FIC system is lossless via storage function $L_1 + L_2$, i.e. the inequality above becomes an equality.

Note that in the above theorem, if the FIC system is lossless and $\mathbf{r}$ is the zero function, then $L_1(\mathbf{z}_1(t)) + L_2(\mathbf{z}_2(t)) = L_1(\mathbf{z}_1^0) + L_2(\mathbf{z}_2^0)$ for all $t \geq 0$, i.e. the value of $L_1 + L_2$ does not change over time. The underlying dynamic is then said to admit a constant-of-motion.

DGS as a FIC System.

In the context of DGS, Figure 3 is obtained after several modifications from Figure 2:

  • In Figure 2, the game operator's output is $\hat{\mathbf{p}}$, which is added to $\hat{\mathbf{r}}$ to form the MLO's input. In Figure 3, the game operator's output is $-\hat{\mathbf{p}}$ instead, and it is subtracted from $\mathbf{r}$ to form the MLO's input.

  • The MLO's output is $\hat{\mathbf{x}}$ in Figure 2, but the MLO's output in Figure 3 has a constant shift $\hat{\mathbf{x}}^* = (\mathbf{x}^{*,1},\ldots,\mathbf{x}^{*,m}) \in \Delta$. Precisely, the output is $\hat{\mathbf{x}} - \hat{\mathbf{x}}^*$. We call this operator the "MLO with shift $\hat{\mathbf{x}}^*$".

  • In Figure 3, the game operator's input is $\hat{\mathbf{x}} - \hat{\mathbf{x}}^*$ instead of $\hat{\mathbf{x}}$, while its output is $-\hat{\mathbf{p}}$ instead of $\hat{\mathbf{p}}$.

Basic Results about Passivity of Learning Operators.

By viewing a DGS as a FIC system, we are interested in MLOs and game operators which possess good properties like passivity. The key advantage of this approach is that it permits a DGS to be decoupled into two distinct operators, which can be analysed separately.

$S_1$ is a MLO. First, we show that it is passive if all the learning operators possessed by the agents are passive. This allows us to turn our focus to analyzing whether each individual learning operator is passive/lossless or not.

Proposition 3.

Suppose that for each agent $i$, her learning operator with shift $\mathbf{x}^{*,i}$ is passive (resp. lossless) via a storage function $L^i$. Then the MLO with shift $\hat{\mathbf{x}}^* = (\mathbf{x}^{*,1},\ldots,\mathbf{x}^{*,m})$ is passive (resp. lossless) via the storage function $\sum_{i=1}^m L^i$.

By the next proposition, the choice of shift does not affect passivity of the learning operator. Thus, we say a learning dynamic is passive when its learning operator with any shift is passive.

Proposition 4.

Let $S^a, S^b$ be two learning operators of the same learning dynamic, with shifts $\mathbf{x}^{*,a}, \mathbf{x}^{*,b}$ respectively. $S^a$ is passive (resp. lossless) via storage function $L^a(\mathbf{q})$ if and only if $S^b$ is passive (resp. lossless) via storage function $L^a(\mathbf{q}) - \left\langle \mathbf{x}^{*,b} - \mathbf{x}^{*,a}, \mathbf{q} \right\rangle + c$, where $c$ is any constant.

The above proposition works even when the shifts are not mixed strategies, e.g. when $\mathbf{x}^{*,a}$ is the zero vector. We use $E$ to denote a storage function of a learning operator with zero shift, and $L$ to denote a storage function of a learning operator whose shift is a mixed strategy.

While a shift does not affect passivity, it does affect whether the learning operator is finitely passive or not. In order to prove that certain learning dynamics guarantee constant regret, we need their learning operators with some specific shifts to be finitely passive. This motivates the following definition of finitely passive learning dynamics. Let $\mathbf{e}_j$ denote the vector whose $j$-th entry is one and all other entries are zero.

Definition 5.

A learning dynamic is finitely passive (resp. finitely lossless) if for every action $j$, its learning operator with shift $\mathbf{e}_j$ is finitely passive (resp. finitely lossless).

Proposition 6.

If a learning dynamic is finitely passive (resp. finitely lossless), then for any mixed strategy $\mathbf{x}^*$, the learning operator with shift $\mathbf{x}^*$ is finitely passive (resp. finitely lossless).

4 Passivity of Learning Operators

4.1 FTRL Dynamics are Finitely Lossless

We start the analysis by establishing a strong connection between FTRL dynamics and passivity. Specifically, FTRL dynamics are finitely lossless.

Theorem 7.

Given any FTRL dynamic over $n$ actions and with regularizer function $h$, let the convex conjugate of $h$ be

h^*(\mathbf{q}) := \max_{\mathbf{x}\in\Delta^n} \left\{ \left\langle \mathbf{q},\mathbf{x} \right\rangle - h(\mathbf{x}) \right\}.    (7)

Then for any $\mathbf{x}^* \in \Delta^n$, the learning operator with shift $\mathbf{x}^*$ is finitely lossless via the storage function $L(\mathbf{q})$ given below:

L(\mathbf{q}) ~=~ h^*(\mathbf{q}) - \left\langle \mathbf{q},\mathbf{x}^* \right\rangle + h(\mathbf{x}^*).    (8)

In particular, for any action $j$, the learning operator with shift $\mathbf{e}_j$ is finitely lossless, and hence any FTRL dynamic is finitely lossless by Definition 5.

Proof: .

By the "maximizing argument" identity on p. 149 of [SS12], we have $\nabla h^*(\mathbf{q}(t)) = \mathbf{x}(t)$. Hence, $\nabla L(\mathbf{q}(t)) = \nabla h^*(\mathbf{q}(t)) - \mathbf{x}^* = \mathbf{x}(t) - \mathbf{x}^*$. By the chain rule, $\frac{\mathsf{d}L(\mathbf{q}(t))}{\mathsf{d}t} = \left\langle \mathbf{x}(t) - \mathbf{x}^*, \mathbf{p}(t) \right\rangle$, and hence $L(\mathbf{q}(t)) = L(\mathbf{q}(0)) + \int_0^t \left\langle \mathbf{x}(\tau) - \mathbf{x}^*, \mathbf{p}(\tau) \right\rangle \mathsf{d}\tau$, verifying that the operator is lossless. Moreover, $L$ is bounded below by zero, since by the definition of $h^*$, for any $\mathbf{q}$, $L(\mathbf{q}) = h^*(\mathbf{q}) - \left\langle \mathbf{q},\mathbf{x}^* \right\rangle + h(\mathbf{x}^*) \geq \left( \left\langle \mathbf{q},\mathbf{x}^* \right\rangle - h(\mathbf{x}^*) \right) - \left\langle \mathbf{q},\mathbf{x}^* \right\rangle + h(\mathbf{x}^*) = 0$. ∎
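
For intuition, the identity in the proof can be checked numerically. The following self-contained Python sketch (our own illustration, not from the paper) uses the entropy regularizer $h(\mathbf{x}) = \sum_j x_j \log x_j$, whose convex conjugate over the simplex is $h^*(\mathbf{q}) = \log\sum_j \exp(q_j)$, and verifies along a simulated trajectory that $L(\mathbf{q}(t)) - L(\mathbf{q}(0))$ matches the supplied energy $\int_0^t \left\langle \mathbf{x}(\tau)-\mathbf{x}^*, \mathbf{p}(\tau) \right\rangle \mathsf{d}\tau$ up to discretization error.

```python
import numpy as np

def h_star(q):                       # convex conjugate of the entropy regularizer
    return np.log(np.sum(np.exp(q)))

def f_rd(q):                         # logit choice map (2), i.e. the gradient of h_star
    z = np.exp(q - np.max(q))
    return z / z.sum()

def h(x):                            # entropy regularizer (0 log 0 := 0)
    x = np.clip(x, 1e-300, 1.0)
    return np.sum(x * np.log(x))

n, dt, steps = 3, 1e-3, 20000
x_star = np.array([0.2, 0.5, 0.3])   # an arbitrary shift in the simplex
L = lambda q: h_star(q) - q @ x_star + h(x_star)   # storage function (8)

q = np.array([0.4, -0.1, 0.7])       # arbitrary initial cumulative payoffs
supplied = 0.0                       # running value of the supplied-energy integral
L0 = L(q)
for t in range(steps):
    p = np.sin(0.01 * t * np.arange(1, n + 1))     # some bounded payoff stream
    supplied += dt * (f_rd(q) - x_star) @ p
    q = q + dt * p

print(L(q) - L0, supplied)           # the two values agree up to O(dt)
```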

We summarize the properties of the operators of FTRL dynamics with various shifts in Figure 4.

FTRL with zero shift | FTRL with shift of a mixed strategy $\mathbf{x}^*$
  • input: $\mathbf{p}$ (payoff) | $\mathbf{p}$
  • output: $\mathbf{x}$ (mixed strategy) | $\mathbf{x} - \mathbf{x}^*$
  • state: $\mathbf{q}$ (cumulative payoff) | $\mathbf{q}$
  • storage function: $E(\mathbf{q}) = h^*(\mathbf{q})$ (see (7)) | $L(\mathbf{q}) = h^*(\mathbf{q}) - \left\langle \mathbf{q},\mathbf{x}^* \right\rangle + h(\mathbf{x}^*)$
  • infimum: $-\infty$ | 0
  • property: lossless | finitely lossless
Figure 4: FTRL learning operators with various shifts.

4.2 Relationship to Regret

Theorem 8.

Any finitely passive learning dynamic guarantees constant regret.

Proof: .

Let $L^j$ denote the storage function of the learning operator with shift $\mathbf{e}_j$. Since the learning dynamic is finitely passive, we can assume the infimum of $L^j$ is zero. By the definition of passivity,

L^j(\mathbf{q}(t)) ~\leq~ L^j(\mathbf{q}(0)) + \int_0^t \left\langle \mathbf{x}(\tau) - \mathbf{e}_j, \mathbf{p}(\tau) \right\rangle \mathsf{d}\tau.

Hence, the regret w.r.t. action $j$ at time $t$ satisfies:

\int_0^t \left\langle \mathbf{e}_j, \mathbf{p}(\tau) \right\rangle \mathsf{d}\tau ~-~ \int_0^t \left\langle \mathbf{x}(\tau), \mathbf{p}(\tau) \right\rangle \mathsf{d}\tau ~\leq~ L^j(\mathbf{q}(0)) - L^j(\mathbf{q}(t)) ~\leq~ L^j(\mathbf{q}(0)).

Thus, the regret up to time $t$ is bounded by $\max_j \left\{ L^j(\mathbf{q}(0)) \right\}$, which is a constant that depends only on the initial state $\mathbf{q}^0$. ∎
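
As a sanity check of this bound, the following self-contained Python sketch (our own, assuming the entropy regularizer, for which $L^j(\mathbf{q}^0) = \log\sum_\ell \exp(q^0_\ell) - q^0_j$) simulates Replicator Dynamic against a bounded payoff stream and compares the realized regret with the constant $\max_j L^j(\mathbf{q}^0)$.

```python
import numpy as np

def f_rd(q):                          # logit choice map (2)
    z = np.exp(q - np.max(q))
    return z / z.sum()

n, dt, steps = 3, 1e-3, 50000
q0 = np.zeros(n)
q = q0.copy()
cum_payoff = np.zeros(n)              # integral of p_j(tau) for each action j
cum_reward = 0.0                      # integral of <x(tau), p(tau)>

for t in range(steps):
    p = np.cos(0.002 * t * np.arange(1, n + 1))   # some bounded payoff stream
    x = f_rd(q)
    cum_payoff += dt * p
    cum_reward += dt * x @ p
    q = q + dt * p

regret = cum_payoff.max() - cum_reward
bound = max(np.log(np.sum(np.exp(q0))) - q0[j] for j in range(n))  # = log(n) here
print(regret, bound)                  # regret stays below the bound, up to discretization error
```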

As an immediate corollary of Theorems 7 and 8, all FTRL dynamics guarantee constant regret.

Corollary 9 ([MPP18]).

Every FTRL dynamic guarantees constant regret.

Theorem 8 states that finite passivity implies constant regret. The following proposition states that the converse (constant regret implies finite passivity) is also true, if we restrict to lossless learning dynamics.

Proposition 10.

Suppose that a learning dynamic is lossless. Then the learning dynamic guarantees constant regret if and only if it is finitely lossless.

5 A Characterization of Lossless Learning Dynamics, and Their Convex Combinations

In Section 4.1, we showed that every FTRL dynamic is lossless. A FTRL dynamic is specified by a convex regularizer function $h$, while the storage function $L$ has an implicit form in terms of $h$. Here, we ask how we can specify the storage function directly to generate a lossless learning dynamic. Not all storage functions work, so we first seek some necessary conditions on them. These conditions are stated using the storage function $E$ of the learning operator with zero shift. By Proposition 4, the lossless storage function of the learning operator with shift $\mathbf{x}^*$ is $E(\mathbf{q}) - \left\langle \mathbf{q},\mathbf{x}^* \right\rangle + c$.

Suppose there is a lossless learning dynamic in the form of (1); recall that when $\mathbf{q}^0$ is already given, the learning dynamic is specified by its conversion function $f$. Let $E$ be the storage function of its learning operator with zero shift. We present two necessary conditions on $E$ and $f$.

Necessary Condition 1 (NC1). (i) For any $j = 1,\cdots,n$, $\nabla_j E(\mathbf{q}) \geq 0$. (ii) For any $\mathbf{q} \in \mathbb{R}^n$, we have $\nabla E(\mathbf{q}) = f(\mathbf{q})$.

Reason: Since the learning dynamic is lossless, $E(\mathbf{q}(t)) = E(\mathbf{q}^0) + \int_0^t \left\langle \mathbf{x}(\tau),\mathbf{p}(\tau) \right\rangle \mathsf{d}\tau$. Taking the time-derivative on both sides yields

\frac{\mathsf{d}E(\mathbf{q}(t))}{\mathsf{d}t} = \left\langle \mathbf{x}(t),\mathbf{p}(t) \right\rangle = \left\langle f(\mathbf{q}(t)),\mathbf{p}(t) \right\rangle.

On the other hand, by the chain rule, we have

\frac{\mathsf{d}E(\mathbf{q}(t))}{\mathsf{d}t} = \left\langle \nabla E(\mathbf{q}(t)),\dot{\mathbf{q}} \right\rangle = \left\langle \nabla E(\mathbf{q}(t)),\mathbf{p}(t) \right\rangle.

Thus, $\left\langle f(\mathbf{q}(t)),\mathbf{p}(t) \right\rangle = \left\langle \nabla E(\mathbf{q}(t)),\mathbf{p}(t) \right\rangle$ for any $\mathbf{p}(t)$ and $\mathbf{q}(t)$. This readily implies that condition (ii) holds. (To argue this formally, for any $\mathbf{q} \in \mathbb{R}^n$ and for each action $j$, we construct a continuous $\mathbf{p} : [0,t] \rightarrow \mathbb{R}^n$ such that $\mathbf{q}^0 + \int_0^t \mathbf{p}(\tau)\,\mathsf{d}\tau = \mathbf{q}$ and $\mathbf{p}(t) = \mathbf{e}_j$; this implies that the $j$-th component of $f(\mathbf{q})$ equals the $j$-th component of $\nabla E(\mathbf{q})$. The construction is easy to make.) Since $f(\mathbf{q}) \in \Delta^n$, condition (i) holds.

Necessary Condition 2 (NC2). For any real number $r$ and any $\mathbf{q}$, $E(\mathbf{q} + r\cdot\mathbf{1}) = E(\mathbf{q}) + r$.

Reason: By NC1, $\nabla E(\mathbf{q}) = f(\mathbf{q})$, which is in $\Delta^n$. Thus, the directional derivative of $E$ along the direction $\mathbf{1}$ is $\left\langle \nabla E(\mathbf{q}),\mathbf{1} \right\rangle = \left\langle f(\mathbf{q}),\mathbf{1} \right\rangle = 1$.

Indeed, it is easy to verify that the above two necessary conditions are also sufficient.

Proposition 11.

Any smooth function $E(\mathbf{q})$ which satisfies NC1(i) and NC2 is a lossless storage function of a learning operator with zero shift. The conversion function $f$ satisfies $f(\mathbf{q}) = \nabla E(\mathbf{q})$.

Proof: .

Due to NC2, $\left\langle \nabla E(\mathbf{q}),\mathbf{1} \right\rangle = 1$, thus $\sum_{j=1}^n \nabla_j E(\mathbf{q}) = 1$. This equality and NC1(i) imply $\nabla E(\mathbf{q}) \in \Delta^n$. Now, consider a learning dynamic with conversion function $f = \nabla E$. Then for any function-of-time $\mathbf{p}$ and any $t > 0$, we have $\frac{\mathsf{d}E(\mathbf{q}(t))}{\mathsf{d}t} = \left\langle \nabla E(\mathbf{q}(t)),\dot{\mathbf{q}} \right\rangle = \left\langle f(\mathbf{q}(t)),\mathbf{p}(t) \right\rangle = \left\langle \mathbf{x}(t),\mathbf{p}(t) \right\rangle$. Integrating both sides w.r.t. $t$ shows that the learning dynamic is lossless via the storage function $E$. ∎

Example 12.

Consider the learning dynamic whose conversion function always outputs the uniform mixed strategy, i.e. for all $\mathbf{q}$, $f(\mathbf{q}) = \frac{1}{n}\cdot\mathbf{1}$. By Proposition 11, it is a lossless learning dynamic via $E(\mathbf{q}) = \left\langle \frac{1}{n}\cdot\mathbf{1},\mathbf{q} \right\rangle$, a storage function that clearly satisfies NC1(i) and NC2. Of course, no real "learning" happens with this conversion function, and it is easy to construct examples demonstrating that its regret can be unbounded. By Proposition 10, this learning dynamic is not finitely lossless. This shows that the family of finitely lossless learning dynamics is a proper subset of the family of lossless learning dynamics.

The family of smooth functions $E$ satisfying NC1(i) and NC2 can be represented compactly as

\mathcal{E} := \{ E : \mathbb{R}^n \rightarrow \mathbb{R} ~|~ \forall \mathbf{q},~ \nabla E(\mathbf{q}) \geq \mathbf{0} ~~\text{and}~~ \left\langle \nabla E(\mathbf{q}),\mathbf{1} \right\rangle = 1 \}.

An interesting observation is that this family is closed under convex combination, i.e. if $E_1,\ldots,E_k \in \mathcal{E}$, then for any real numbers $\alpha_1,\ldots,\alpha_k \geq 0$ such that $\sum_{\ell=1}^k \alpha_\ell = 1$, we have $\left( \sum_{\ell=1}^k \alpha_\ell \cdot E_\ell \right) \in \mathcal{E}$. By Proposition 11, $E_1,\ldots,E_k$ are lossless storage functions of some learning operators with zero shift. Suppose the conversion functions are $f_1,\ldots,f_k$ respectively. Then by Proposition 11 again, $\left( \sum_{\ell=1}^k \alpha_\ell \cdot E_\ell \right)$ is a lossless storage function of a learning operator with zero shift, with conversion function $\left( \sum_{\ell=1}^k \alpha_\ell \cdot f_\ell \right)$. This motivates the following definition of convex combination of learning dynamics.

Definition 13.

Given $k$ learning dynamics, each over $n$ actions, let $f_\ell$ denote the conversion function of the $\ell$-th learning dynamic. Given any non-negative constants $\alpha_1,\cdots,\alpha_k$ with $\sum_{\ell=1}^k \alpha_\ell = 1$, the convex combination of the $k$ learning dynamics with parameters $\alpha_1,\cdots,\alpha_k$ is the learning dynamic with conversion function $\left( \sum_{\ell=1}^k \alpha_\ell \cdot f_\ell \right)$.

Example 14.

Suppose we are using the half-half convex combination of Replicator Dynamic (RD) and Online Gradient Descent (OGD), and $n = 2$. Recall the conversion functions of RD and OGD in (2) and (4). Suppose that at some time $t$, $\mathbf{q}(t) = (0.6, 0)$. Then the mixed strategy at time $t$ with the half-half convex combination of RD and OGD is

\mathbf{x}(t) = \frac{1}{2}\cdot f_{\text{RD}}\left((0.6,0)\right) + \frac{1}{2}\cdot f_{\text{OGD}}\left((0.6,0)\right)
= \frac{1}{2}\left[ \left( \frac{\exp(0.6)}{\exp(0.6)+\exp(0)}, \frac{\exp(0)}{\exp(0.6)+\exp(0)} \right) + \left( \frac{0.6-0+1}{2}, \frac{0-0.6+1}{2} \right) \right]
\approx \frac{1}{2}\left[ (0.6457, 0.3543) + (0.8, 0.2) \right] \approx (0.7228, 0.2772).

By (4), $f_{\text{OGD}}(\mathbf{q})$ outputs a strategy with zero probability of choosing action 2 whenever $q_1 - q_2 \geq 1$. In contrast, $f_{\text{RD}}(\mathbf{q})$ maintains a tiny but positive probability of choosing action 2 even when $q_1$ is much larger than $q_2$. We may say OGD leans toward exploitation while RD leans toward exploration. By combining the two learning dynamics via convex combination with our own choice of $\alpha$'s, we obtain a new lossless learning dynamic with our desired balance between exploitation and exploration.
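
The arithmetic in Example 14 can be reproduced with a few lines of Python; the following self-contained sketch (the helper names are our own) illustrates Definition 13 by combining the maps (2) and (4) with weights $\alpha_1 = \alpha_2 = \tfrac{1}{2}$.

```python
import numpy as np

def f_rd(q):                                   # logit choice map (2)
    z = np.exp(q - np.max(q))
    return z / z.sum()

def f_ogd(q):                                  # OGD map (4) for n = 2 actions
    d = np.clip(q[0] - q[1], -1.0, 1.0)
    return np.array([(d + 1) / 2, (1 - d) / 2])

def convex_combination(fs, alphas):
    """Definition 13: the conversion function sum_l alpha_l * f_l."""
    return lambda q: sum(a * f(q) for a, f in zip(alphas, fs))

f_mix = convex_combination([f_rd, f_ogd], [0.5, 0.5])
print(f_mix(np.array([0.6, 0.0])))             # approx (0.7228, 0.2772)
```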

Convex combination preserves not only losslessness, but also finite losslessness. Suppose there are several finitely lossless learning dynamics. By Definition 5, for every action $j$, the storage functions of their learning operators with shift $\mathbf{e}_j$ have finite lower bounds. It is easy to verify that for a convex combination of these lossless learning dynamics, its learning operator with shift $\mathbf{e}_j$ is lossless via the same convex combination of the storage functions mentioned above. Since those storage functions have finite lower bounds, their convex combination has a finite lower bound too.

Theorem 15.

Given any $k$ lossless (resp. finitely lossless) learning dynamics, any convex combination of them is a lossless (resp. finitely lossless) learning dynamic.

Theorems 7, 8 and 15 lead to the interesting theorem below.

Theorem 16.

Any convex combination of any finitely lossless learning dynamics is a learning dynamic that guarantees constant regret. In particular, any convex combination of any FTRL learning dynamics is a learning dynamic that guarantees constant regret.

Remark. Note that the family of smooth storage functions that generate finitely lossless learning dynamics is

\left\{ E \in \mathcal{E} ~|~ \forall j,~ E(\mathbf{q}) - \left\langle \mathbf{q},\mathbf{e}_j \right\rangle ~\text{is bounded from below} \right\}.

Again, this family is closed under convex combination. By Theorem 7, the family of FTRL learning dynamics is a subset of the family of finitely lossless learning dynamics. It is not clear whether the two families are equal; we believe they are not. For instance, for the half-half convex combination of Replicator Dynamic and Online Gradient Descent, we cannot find any regularizer function certifying that it is a FTRL dynamic.

6 Lossless DGS and Poincaré Recurrence

In the last section, we presented a potentially broad family of (finitely) lossless learning dynamics. In this section, our focus is on DGS in which the agents use such learning dynamics. We first prove that for certain graphical constant-sum games, their game operators are finitely lossless. Thus, any DGS comprising such a game and finitely lossless learning dynamics admits a constant-of-motion by Theorem 2 when $\mathbf{r} \equiv 0$, i.e. the sum of the two storage function values is constant for all $t \geq 0$. Then we use this constant-of-motion and follow a principled approach proposed by Mertikopoulos et al. [MPP18] to show our main result here: the DGS is Poincaré recurrent. In the rest of this section, we first define graphical constant-sum games and Poincaré recurrence, then we apply the principled approach to prove our main result.

Graphical Constant-sum Game.

There are $m$ agents. Each agent $i$ has $n_i$ actions. In a graphical game [KLS01], there is a normal-form (matrix) game between every pair of agents, which we call an edge-game. The edge-game between agents $i,k$ is specified by two matrices, $\mathbf{A}^{ik} \in \mathbb{R}^{n_i\times n_k}$ and $\mathbf{A}^{ki} \in \mathbb{R}^{n_k\times n_i}$. When each agent $k \neq i$ chooses a mixed strategy $\mathbf{x}_k \in \Delta^{n_k}$, the payoff vector of agent $i$, which contains the payoff to each action of agent $i$, is the sum of the payoff vectors in all her edge-games. Precisely, the payoff vector of agent $i$ is

\mathbf{p}_i ~=~ \sum_{1\leq k\leq m,~k\neq i} \mathbf{A}^{ik}\cdot\mathbf{x}_k.    (9)

A graphical constant-sum game [DP09] is a graphical game such that for every pair of agents $\{i,k\}$, there exists a constant $c^{\{i,k\}}$ satisfying $\mathbf{A}^{ik}_{j\ell} + \mathbf{A}^{ki}_{\ell j} = c^{\{i,k\}}$ for any action $j$ of agent $i$ and action $\ell$ of agent $k$.
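
The following short Python sketch (our own illustration) computes the payoff vectors of a graphical game via (9) and checks the constant-sum condition for each edge; the three-agent game used here is the Cyclic Matching Pennies game from Figure 5, for which every edge constant is $c^{\{i,k\}} = 0$.

```python
import numpy as np

I2 = np.eye(2)
# Edge-game matrices of Cyclic Matching Pennies (see Figure 5): A[(i, k)] is A^{ik}.
A = {(0, 1): I2, (1, 2): I2, (2, 0): I2,
     (1, 0): -I2, (2, 1): -I2, (0, 2): -I2}

def payoffs(x):
    """Payoff vector of each agent via (9): p_i = sum_{k != i} A^{ik} x_k."""
    return [sum(A[(i, k)] @ x[k] for k in range(3) if k != i) for i in range(3)]

# Constant-sum check: A^{ik}_{jl} + A^{ki}_{lj} is the same constant for all j, l.
for (i, k) in [(0, 1), (1, 2), (2, 0)]:
    print((i, k), A[(i, k)] + A[(k, i)].T)      # the zero matrix, so c^{i,k} = 0

x = [np.array([0.9, 0.1]), np.array([0.88, 0.12]), np.array([0.4, 0.6])]
print(payoffs(x))
```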

Game Operator.

Recall that a game operator takes as input the mixed strategies of the agents with shift $\hat{\mathbf{x}}^*$, while its output is the negation of the payoff vector over the agents' actions. We point out one useful fact, which follows quite readily from [PS14].

Proposition 17.

If $\hat{\mathbf{x}}^*$ is a Nash equilibrium of a graphical constant-sum game, then the game operator with shift $\hat{\mathbf{x}}^*$ is passive; the storage function is the zero function. Moreover, if $\hat{\mathbf{x}}^*$ is fully-mixed (i.e. every entry in $\hat{\mathbf{x}}^*$ is positive), then the game operator is lossless via the zero storage function.

When the game is lossless, by the composability of passive operators (Theorem 2), if it is coupled with passive (resp. lossless) learning dynamics, then the DGS is passive (resp. lossless).

Poincaré Recurrence.

In a DGS, $\hat{\mathbf{p}}$ is a function of $\hat{\mathbf{x}}$ via the game operator, while $\hat{\mathbf{x}}$ is a function of $\hat{\mathbf{q}}$ via the conversion functions. Thus, $\hat{\mathbf{p}}$ is a function of $\hat{\mathbf{q}}$. We say the ODE system (1) is divergence-free if $\sum_{i=1}^m \sum_j \frac{\partial p_{ij}}{\partial q_{ij}}$ is zero everywhere. When $\hat{\mathbf{p}}$ is derived from a graphical game via (9), $p_{ij}$ does not depend on $q_{ij}$, since for every $k \neq i$, $\mathbf{x}_k$ is a function of $\mathbf{q}_k$ only. Thus, the game dynamic is divergence-free.

Intuitively, an ODE system with domain $\mathbb{R}^N$ is Poincaré recurrent if almost all trajectories return arbitrarily close to their initial position infinitely often. In order to work formally with the notion of Poincaré recurrence, we need to define a measure on $\mathbb{R}^N$. We use the standard Lebesgue measure on $\mathbb{R}^N$. Liouville's formula states that divergence-free ODE systems preserve volume [Wei95]; see Appendix A for more discussion of volume and Liouville's formula. Thus, the following Poincaré Recurrence Theorem is applicable provided its bounded-orbit requirement is satisfied.

Theorem 18 ([Poi90, Bar06]).

If a transformation preserves volume and has bounded orbits, then it is Poincaré recurrent, i.e. for each open set there exist orbits that intersect this set infinitely often.

Given any $\epsilon > 0$, we can cover $\mathbb{R}^N$ by countably many balls of radius $\epsilon$, and apply the theorem to each ball. We conclude that almost every point returns to within an $\epsilon$-neighbourhood of itself. Since $\epsilon > 0$ is arbitrary, we conclude that almost every initial point is almost recurrent.

Poincaré Recurrence in DGS.

Recall that in each learning operator of a DGS, the state is represented by a vector $\mathbf{q} \in \mathbb{R}^n$ of cumulative payoffs. Clearly, $\mathbf{q}$ can become unbounded as time passes, even in a graphical constant-sum game (for instance, in a two-person zero-sum game in which every payoff entry to agent 1 is strictly positive), thus prohibiting us from using Theorem 18. Instead, as in [MPP18], we consider a transformation that maps $\mathbf{q} = (q_1,\ldots,q_n)$ to

(q_1', q_2', \cdots, q_{n-1}') := (q_1 - q_n, \ldots, q_{n-1} - q_n) \in \mathbb{R}^{n-1}.

It is well known that any FTRL dynamic with starting point $\mathbf{q}^0 = (q_1^0,\ldots,q_n^0)$ and conversion function $f : \mathbb{R}^n \rightarrow \Delta^n$ is equivalent to the following dynamic with state variables $\mathbf{q}' = (q_1',\ldots,q_{n-1}') \in \mathbb{R}^{n-1}$, where $1 \leq j \leq n-1$:

q_j'(0) = q_j^0 - q_n^0
\dot{q}_j'(t) = p_j(t) - p_n(t)
\mathbf{x}(t) = f(q_1'(t), q_2'(t), \cdots, q_{n-1}'(t), 0).    (10)

Given a mixed strategy $\mathbf{x}^*$, if we cast the output of the above dynamic to be $(x_1(t) - x_1^*, \cdots, x_{n-1}(t) - x_{n-1}^*)$, then it is easy to verify that the learning operator is finitely lossless via the storage function $\overline{L}(q_1',\ldots,q_{n-1}') = L(q_1',\cdots,q_{n-1}',0)$, where $L$ is the storage function of the original learning operator with shift $\mathbf{x}^*$.

Theorem 19.

Poincaré recurrence occurs in the strategy space Δ\Delta for any dynamical game system where (1) each agent employs a learning dynamic which is a convex combination of FTRL; and (2) the underlying game is a graphical constant-sum game with a fully-mixed Nash equilibrium.

See Figure 5 for an illustration of Poincaré recurrence under conditions (1) and (2). To prove the theorem, we first show that Poincaré recurrence occurs in the space that contains $\hat{\mathbf{q}}'$ (the concatenation of the $\mathbf{q}'$ of all agents). This comprises two major steps:

  • The dynamic preserves volume, since the dynamic is divergence-free.

  • For any starting point $\hat{\mathbf{q}}'(0)$, the dynamic remains bounded. To show this, we use the fact that $\overline{L}$ is a constant-of-motion of the game dynamic, so for any $t \geq 0$, $\hat{\mathbf{q}}'(t)$ must stay within a level set of $\overline{L}$, which is bounded (see Appendix C).

Poincaré recurrence in the space that contains $\hat{\mathbf{q}}'$ implies Poincaré recurrence in the strategy space $\Delta$, since the conversion function in (10) is continuous.

Figure 5: The trajectories of learning in a graphical constant-sum game called Cyclic Matching Pennies. In this game, there are three agents, each with two actions. $\mathbf{A}^{12} = \mathbf{A}^{23} = \mathbf{A}^{31} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$, while $\mathbf{A}^{21} = \mathbf{A}^{32} = \mathbf{A}^{13} = \begin{bmatrix} -1 & 0 \\ 0 & -1 \end{bmatrix}$. The mixed strategies of the three agents are denoted by $(x_1, 1-x_1)$, $(x_2, 1-x_2)$ and $(x_3, 1-x_3)$ respectively. All trajectories start at $x_1 = 0.9$, $x_2 = 0.88$ and $x_3 = 0.4$ (the black dot in the figure). The trajectories are simulated when all agents use Replicator Dynamic (blue), Online Gradient Descent (black), and the half-half convex combination of the former two dynamics (red), respectively. All trajectories are recurrent.
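
The qualitative behavior in Figure 5 can be reproduced with a simple Euler simulation; the following self-contained Python sketch (our own, with an illustrative step size and horizon) runs Replicator Dynamic for all three agents in Cyclic Matching Pennies and reports how close the strategy profile returns to its starting point.

```python
import numpy as np

I2 = np.eye(2)
A = {(0, 1): I2, (1, 2): I2, (2, 0): I2,       # Cyclic Matching Pennies (Figure 5)
     (1, 0): -I2, (2, 1): -I2, (0, 2): -I2}

def f_rd(q):                                    # logit choice map (2)
    z = np.exp(q - np.max(q))
    return z / z.sum()

def logit(x1):
    """Cumulative payoffs (a, 0) such that f_rd((a, 0)) = (x1, 1 - x1)."""
    return np.array([np.log(x1 / (1 - x1)), 0.0])

q = [logit(0.9), logit(0.88), logit(0.4)]       # the starting point used in Figure 5
x0 = np.concatenate([f_rd(qi) for qi in q])

dt, steps = 1e-3, 200000
min_dist = np.inf
for t in range(steps):
    x = [f_rd(qi) for qi in q]
    p = [sum(A[(i, k)] @ x[k] for k in range(3) if k != i) for i in range(3)]  # (9)
    q = [qi + dt * pi for qi, pi in zip(q, p)]
    if t > 1000:                                # ignore the first moments of the run
        min_dist = min(min_dist, np.linalg.norm(np.concatenate(x) - x0))

print(min_dist)   # the trajectory comes back close to its starting strategy profile
```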

7 Conclusion

We present a control-theoretic perspective for understanding popular learning dynamics like FTRL and escort replicator dynamics. At the heart of it is the use of storage (energy) functions to govern how a dynamic turns the history of payoffs into strategic choices. This mirrors the study of physical dynamics, e.g. electrical networks [Kha15]. Analysis via storage functions permits us to prove optimal regret bounds and inspires interesting generalizations of FTRL via convex combinations.

An important benefit of these control-theoretic tools is, as pointed out by Fox and Shamma [FS13], that they allow the decoupling of game dynamics into learning operators and game operators. This provides a framework for understanding learning-in-games via a modular approach, by analyzing these operators separately. This technique can liberate us from analyzing each individual learning-in-game system in an ad-hoc manner.

In our work, we initiate the study of connections between online optimization and control theory with continuous-time learning dynamics. An interesting problem is how such a connection can be generalized to discrete-time learning algorithms, e.g. Multiplicative Weights Update and its optimistic variant [BP18, Che18, DP19, CP20]. There does exist theory that generalizes passivity to discrete-time settings, e.g. Section VI.7 in [DV75]. We hope our work inspires further studies in this direction.

Lastly, we believe that this control-theoretic perspective is also useful for understanding learning dynamics/algorithms which are not always passive. The perspective can help us spot the situations in which the learning dynamics/algorithms create or dissipate energy. By avoiding situations where energy is created, it may be possible to achieve stable outcomes in learning processes.

Acknowledgements

This research/project is supported in part by NRF2019-NRF-ANR095 ALIAS grant, grant PIE-SGP-AI-2018-01, NRF 2018 Fellowship NRF-NRFF2018-07, AME Programmatic Fund (Grant No. A20H6b0151) from the Agency for Science, Technology and Research (A*STAR) and the National Research Foundation, Singapore under its AI Singapore Program (AISG Award No: AISG2-RP-2020-016).

References

  • [AHK05] Sanjeev Arora, Elad Hazan, and Satyen Kale. The multiplicative weights update method: a meta algorithm and applications. Technical report, 2005.
  • [Bar06] Luis Barreira. Poincare recurrence: old and new. In XIVth International Congress on Mathematical Physics. World Scientific., pages 415–422, 2006.
  • [BGP20] James P. Bailey, Gauthier Gidel, and Georgios Piliouras. Finite regret and cycles with fixed step-size via alternating gradient descent-ascent. In Conference on Learning Theory, COLT 2020, 9-12 July 2020, Virtual Event [Graz, Austria], volume 125 of Proceedings of Machine Learning Research, pages 391–407. PMLR, 2020.
  • [BP18] James P. Bailey and Georgios Piliouras. Multiplicative weights update in zero-sum games. In Proceedings of the 2018 ACM Conference on Economics and Computation, Ithaca, NY, USA, June 18-22, 2018, pages 321–338. ACM, 2018.
  • [BP19] James Bailey and Georgios Piliouras. Fast and furious learning in zero-sum games: vanishing regret with non-vanishing step sizes. In Advances in Neural Information Processing Systems, pages 12977–12987, 2019.
  • [BRM+18] David Balduzzi, Sebastien Racaniere, James Martens, Jakob Foerster, Karl Tuyls, and Thore Graepel. The mechanics of n-player differentiable games. In ICML, 2018.
  • [CBL06] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
  • [Che18] Yun Kuen Cheung. Multiplicative weights updates with constant step-size in graphical constant-sum games. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pages 3532–3542, 2018.
  • [CP19] Yun Kuen Cheung and Georgios Piliouras. Vortices instead of equilibria in minmax optimization: Chaos and butterfly effects of online learning in zero-sum games. In Conference on Learning Theory, COLT 2019, 25-28 June 2019, Phoenix, AZ, USA, volume 99 of Proceedings of Machine Learning Research, pages 807–834. PMLR, 2019.
  • [CP20] Yun Kuen Cheung and Georgios Piliouras. Chaos, extremism and optimism: Volume analysis of learning in games. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
  • [DP09] Constantinos Daskalakis and Christos H. Papadimitriou. On a network generalization of the minmax theorem. In ICALP 2009: Proceedings of the 2009 International Colloquium on Automata, Languages, and Programming, 2009.
  • [DP19] Constantinos Daskalakis and Ioannis Panageas. Last-iterate convergence: Zero-sum games and constrained min-max optimization. In 10th Innovations in Theoretical Computer Science Conference, ITCS 2019, January 10-12, 2019, San Diego, California, USA, volume 124 of LIPIcs, pages 27:1–27:18. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2019.
  • [DV75] C.A. Desoer and M. Vidyasagar. Feedback Systems: Input-output Properties. Electrical science series. Academic Press, 1975.
  • [FL98] Drew Fudenberg and David K. Levine. The Theory of Learning in Games. MIT Press Books. The MIT Press, 1998.
  • [FLST16] Dylan J Foster, Thodoris Lykouris, Karthik Sridharan, and Eva Tardos. Learning in games: Robustness of fast convergence. In Advances in Neural Information Processing Systems, pages 4727–4735, 2016.
  • [FS99] Yoav Freund and Robert E Schapire. Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29(1-2):79–103, 1999.
  • [FS13] Michael J Fox and Jeff S Shamma. Population games, stable games, and passivity. Games, 4(4):561–583, 2013.
  • [GP19] D. Gadjov and L. Pavel. A passivity-based approach to Nash equilibrium seeking over networks. IEEE Transactions on Automatic Control, 64(3):1077–1092, 2019.
  • [H+16] Elad Hazan et al. Introduction to online convex optimization. Foundations and Trends® in Optimization, 2(3-4):157–325, 2016.
  • [Har11] Marc Harper. Escort evolutionary game theory. Physica D: Nonlinear Phenomena, 240(18):1411–1415, 2011.
  • [HS98] Josef Hofbauer and Karl Sigmund. Evolutionary Games and Population Dynamics. Cambridge University Press, Cambridge, UK, 1998.
  • [HS09] Josef Hofbauer and William H Sandholm. Stable games and their dynamics. Journal of Economic theory, 144(4):1665–1693, 2009.
  • [Kha15] Hassan K. Khalil. Nonlinear Control. Pearson Education, 2015.
  • [KLS01] M. J. Kearns, M. L. Littman, and S. P. Singh. Graphical models for game theory. In UAI, 2001.
  • [KM17] Joon Kwon and Panayotis Mertikopoulos. A continuous-time approach to online optimization. Journal of Dynamics and Games, 4(2):125–148, April 2017.
  • [Mab18] M. A. Mabrok. Passivity Analysis of Replicator Dynamics and its Variations. arXiv e-prints, page arXiv:1812.07164, Dec 2018.
  • [MPP18] Panayotis Mertikopoulos, Christos Papadimitriou, and Georgios Piliouras. Cycles in adversarial regularized learning. In ACM-SIAM Symposium on Discrete Algorithms, 2018.
  • [MS15] Jason R Marden and Jeff S Shamma. Game theory and distributed control. In Handbook of game theory with economic applications, volume 4, pages 861–899. Elsevier, 2015.
  • [MS16] M. A. Mabrok and J. S. Shamma. Passivity analysis of higher order evolutionary dynamics and population games. In 2016 IEEE 55th Conference on Decision and Control (CDC), pages 6129–6134, 2016.
  • [MS18] Jason R Marden and Jeff S Shamma. Game theory and control. Annual Review of Control, Robotics, and Autonomous Systems, 1:105–134, 2018.
  • [OPNSR13] Romeo Ortega, Julio Antonio Loría Perez, Per Johan Nicklasson, and Hebertt J Sira-Ramirez. Passivity-based control of Euler-Lagrange systems: mechanical, electrical and electromechanical applications. Springer Science & Business Media, 2013.
  • [PML+20] Julien Perolat, Remi Munos, Jean-Baptiste Lespiau, Shayegan Omidshafiei, Mark Rowland, Pedro Ortega, Neil Burch, Thomas Anthony, David Balduzzi, Bart De Vylder, et al. From poincaré recurrence to convergence in imperfect information games: Finding equilibrium via regularization. arXiv preprint arXiv:2002.08456, 2020.
  • [Poi90] Henri Poincaré. Sur le problème des trois corps et les équations de la dynamique. Acta mathematica, 13(1):A3–A270, 1890.
  • [PS14] Georgios Piliouras and Jeff S Shamma. Optimization despite chaos: Convex relaxations to complex limit sets via Poincaré recurrence. In Proceedings of the twenty-fifth annual ACM-SIAM symposium on Discrete algorithms, pages 861–873. SIAM, 2014.
  • [PSM18] S. Park, J. S. Shamma, and N. C. Martins. Passivity and evolutionary game dynamics. In 2018 IEEE Conference on Decision and Control (CDC), pages 3553–3560, 2018.
  • [RS13] Alexander Rakhlin and Karthik Sridharan. Online learning with predictable sequences. In COLT, 2013.
  • [SALS15] Vasilis Syrgkanis, Alekh Agarwal, Haipeng Luo, and Robert E. Schapire. Fast convergence of regularized learning in games. In Proceedings of the 28th International Conference on Neural Information Processing Systems, NIPS’15, pages 2989–2997, Cambridge, MA, USA, 2015. MIT Press.
  • [San10] William H. Sandholm. Population Games and Evolutionary Dynamics. MIT Press, 2010.
  • [Sha20] Jeff S. Shamma. Feedback control perspectives on learning, 2020.
  • [SS12] Shai Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends® in Machine Learning, 4(2):107–194, 2012.
  • [VGFP19] Emmanouil-Vasileios Vlatakis-Gkaragkounis, Lampros Flokas, and Georgios Piliouras. Poincaré recurrence, cycles and spurious equilibria in gradient-descent-ascent for non-convex non-concave zero-sum games. In Advances in Neural Information Processing Systems, pages 10450–10461, 2019.
  • [Wei95] J. W. Weibull. Evolutionary Game Theory. MIT Press; Cambridge, MA: Cambridge University Press., 1995.
  • [Wil72a] Jan C. Willems. Dissipative dynamical systems Part I: General theory. Archive for Rational Mechanics and Analysis, 45(5):321–351, 1972.
  • [Wil72b] Jan C. Willems. Dissipative dynamical systems Part II: Linear systems with quadratic supply rates. Archive for Rational Mechanics and Analysis, 45(5):352–393, 1972.

Appendix

Appendix A Volume and Liouville’s Formula

Let g:ddg:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d} be a function. Given an ODE system 𝐳˙=g(𝐳)\dot{\mathbf{z}}=g(\mathbf{z}) with the freedom to choose the starting point, let Φ(𝐳0,t)\Phi(\mathbf{z}^{0},t) be the solution of the ODE system at time tt with starting point 𝐳0\mathbf{z}^{0}. Given any set AA, let A(t)={Φ(𝐳0,t)|𝐳0A}A(t)=\{\Phi(\mathbf{z}^{0},t)~{}|~{}\mathbf{z}^{0}\in A\}. When AA is measurable, under mild conditions on the ODE system, A(t)A(t) is measurable and its volume is vol[A(t)]=A(t)𝖽v\text{vol}[A(t)]=\int_{A(t)}\,\mathsf{d}v. Liouville’s formula states that the time derivative of the volume of A(t)A(t) is equal to the integral of the divergence of the ODE system over A(t)A(t):

ddtvol[A(t)]=A(t)trace(g𝐳)𝖽v,\frac{d}{dt}\text{vol}[A(t)]=\int_{A(t)}\text{trace}\left(\frac{\partial g}{\partial\mathbf{z}}\right)\,\mathsf{d}v,

where g𝐳\frac{\partial g}{\partial\mathbf{z}} is the Jacobian of the ODE system. Note that trace(g𝐳)=j=1dgjzj\text{trace}\left(\frac{\partial g}{\partial\mathbf{z}}\right)=\sum_{j=1}^{d}\frac{\partial g_{j}}{\partial z_{j}}, where gjg_{j} is the jj-th component of the function gg. This immediately implies volume preservation for divergence-free systems.
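To make the formula concrete, here is a minimal numerical sanity check that we add for illustration (it is not part of the original paper): for a divergence-free vector field, Liouville's formula says the flow preserves volume. As a hypothetical example we use the ideal pendulum field g(θ,ω)=(ω,−sin θ) and estimate its divergence by finite differences.

```python
# A minimal sanity check of Liouville's formula (illustrative; not from the paper).
# For a divergence-free vector field the flow preserves volume.  We use the ideal
# pendulum g(theta, omega) = (omega, -sin(theta)) as a hypothetical example.
import numpy as np

def g(z):
    theta, omega = z
    return np.array([omega, -np.sin(theta)])

def divergence(z, eps=1e-6):
    """Finite-difference estimate of trace(dg/dz) at the point z."""
    d = len(z)
    return sum(
        (g(z + eps * np.eye(d)[j])[j] - g(z - eps * np.eye(d)[j])[j]) / (2 * eps)
        for j in range(d)
    )

rng = np.random.default_rng(0)
points = rng.uniform(-2.0, 2.0, size=(5, 2))
print([round(divergence(z), 8) for z in points])   # all approximately 0.0
```

Since the divergence vanishes everywhere, Liouville's formula gives d/dt vol[A(t)] = 0, i.e. the flow is volume-preserving; the same combination of volume preservation and bounded orbits is what drives the Poincaré recurrence results later in the appendix.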

Appendix B Missing Proofs

Proof of Proposition 4.

It suffices to prove the forward (only if) direction, as the other direction is symmetric. By the definition of passivity,

La(𝐪(t))La(𝐪0)+0t(𝐱(τ)𝐱,a),𝐩(τ)𝖽τ.L^{a}(\mathbf{q}(t))\leq L^{a}(\mathbf{q}^{0})+\int_{0}^{t}\left\langle(\mathbf{x}(\tau)-\mathbf{x}^{*,a}),\mathbf{p}(\tau)\right\rangle\,\mathsf{d}\tau.

This implies

La(𝐪(t))\displaystyle L^{a}(\mathbf{q}(t)) La(𝐪0)+0t(𝐱(τ)𝐱,b),𝐩(τ)𝖽τ+0t(𝐱,b𝐱,a),𝐩(τ)𝖽τ\displaystyle~{}\leq~{}L^{a}(\mathbf{q}^{0})+\int_{0}^{t}\left\langle(\mathbf{x}(\tau)-\mathbf{x}^{*,b}),\mathbf{p}(\tau)\right\rangle\,\mathsf{d}\tau+\int_{0}^{t}\left\langle(\mathbf{x}^{*,b}-\mathbf{x}^{*,a}),\mathbf{p}(\tau)\right\rangle\,\mathsf{d}\tau
=La(𝐪0)+0t(𝐱(τ)𝐱,b),𝐩(τ)𝖽τ+(𝐱,b𝐱,a),(𝐪(t)𝐪0).\displaystyle~{}=~{}L^{a}(\mathbf{q}^{0})+\int_{0}^{t}\left\langle(\mathbf{x}(\tau)-\mathbf{x}^{*,b}),\mathbf{p}(\tau)\right\rangle\,\mathsf{d}\tau+\left\langle(\mathbf{x}^{*,b}-\mathbf{x}^{*,a}),(\mathbf{q}(t)-\mathbf{q}^{0})\right\rangle.

Thus, by setting Lb(𝐪):=La(𝐪)(𝐱,b𝐱,a),𝐪+cL^{b}(\mathbf{q}):=L^{a}(\mathbf{q})-\left\langle(\mathbf{x}^{*,b}-\mathbf{x}^{*,a}),\mathbf{q}\right\rangle+c, we have

Lb(𝐪(t))Lb(𝐪0)+0t(𝐱(τ)𝐱,b),𝐩(τ)𝖽τ,L^{b}(\mathbf{q}(t))\leq L^{b}(\mathbf{q}^{0})+\int_{0}^{t}\left\langle(\mathbf{x}(\tau)-\mathbf{x}^{*,b}),\mathbf{p}(\tau)\right\rangle\,\mathsf{d}\tau,

certifying passivity of the operator SbS^{b}. ∎

Proof of Proposition 6.

Suppose that for each action jj, the learning operator with shift 𝐞j\mathbf{e}_{j} is finitely passive via storage function LjL^{j}. We claim that the storage function j=1nxjLj\sum_{j=1}^{n}x^{*}_{j}\cdot L^{j} certifies finite passivity of the learning operator with shift 𝐱\mathbf{x}^{*}. To see why, note that for each action jj, we have

Lj(𝐪(t))Lj(𝐪0)+0t(𝐱(τ)𝐞j),𝐩(τ)𝖽τ.L^{j}(\mathbf{q}(t))\leq L^{j}(\mathbf{q}^{0})+\int_{0}^{t}\left\langle(\mathbf{x}(\tau)-\mathbf{e}_{j}),\mathbf{p}(\tau)\right\rangle\,\mathsf{d}\tau.

We are done by multiplying both sides of the inequality by xjx^{*}_{j} and then summing over all jj. ∎

Proof of Proposition 10.

()(\Leftarrow) This direction follows from Theorem 8.

()(\Rightarrow) Suppose, towards a contradiction, that the learning algorithm is lossless but there exists jj such that the learning operator with shift 𝐞j\mathbf{e}_{j} is not finitely lossless. Then it has a storage function LjL^{j} which is not bounded from below, and

Lj(𝐪(t))=Lj(𝐪(0))+0t𝐱(τ),𝐩(τ)𝖽τ0t𝐞j,𝐩(τ)𝖽τ.L^{j}(\mathbf{q}(t))~{}=~{}L^{j}(\mathbf{q}(0))+\int_{0}^{t}\left\langle\mathbf{x}(\tau),\mathbf{p}(\tau)\right\rangle\,\mathsf{d}\tau-\int_{0}^{t}\left\langle\mathbf{e}_{j},\mathbf{p}(\tau)\right\rangle\,\mathsf{d}\tau.

Following the calculation in the proof of Theorem 8, the regret w.r.t. action jj at time tt is exactly equal to Lj(𝐪0)Lj(𝐪(t))L^{j}(\mathbf{q}^{0})-L^{j}(\mathbf{q}(t)).

Since LjL^{j} is not bounded from below, for any r<0r<0, there exists 𝐪~\tilde{\mathbf{q}} such that Lj(𝐪~)rL^{j}(\tilde{\mathbf{q}})\leq r. It is easy to construct 𝐩\mathbf{p} such that 𝐪(t)=𝐪0+0t𝐩(τ)𝖽τ=𝐪~\mathbf{q}(t)=\mathbf{q}^{0}+\int_{0}^{t}\mathbf{p}(\tau)\,\mathsf{d}\tau=\tilde{\mathbf{q}}; for instance, set 𝐩(τ)=(𝐪~𝐪0)/t\mathbf{p}(\tau)=(\tilde{\mathbf{q}}-\mathbf{q}^{0})/t for all τ[0,t]\tau\in[0,t]. For this choice of 𝐩\mathbf{p}, the regret at time tt is Lj(𝐪0)Lj(𝐪~)Lj(𝐪0)rL^{j}(\mathbf{q}^{0})-L^{j}(\tilde{\mathbf{q}})\geq L^{j}(\mathbf{q}^{0})-r. Since rr can be chosen to be arbitrarily negative, the learning dynamic cannot guarantee constant regret, a contradiction. ∎

Proof of Proposition 17.

To show that the game operator is passive, according to Definition 1 and the input-output choice of S2S_{2} (see Figure 3), it suffices to show that

0t(𝐱^(τ)𝐱^),(𝐩^(τ))𝖽τ=0t𝐱^(τ),𝐩^(τ)V1𝖽τ+0t𝐱^,𝐩^(τ)V2𝖽τ0.\int_{0}^{t}\left\langle(\hat{\mathbf{x}}(\tau)-\hat{\mathbf{x}}^{*}),(-\hat{\mathbf{p}}(\tau))\right\rangle\,\mathsf{d}\tau~{}=~{}-\int_{0}^{t}\underbrace{\left\langle\hat{\mathbf{x}}(\tau),\hat{\mathbf{p}}(\tau)\right\rangle}_{V_{1}}\,\mathsf{d}\tau+\int_{0}^{t}\underbrace{\left\langle\hat{\mathbf{x}}^{*},\hat{\mathbf{p}}(\tau)\right\rangle}_{V_{2}}\,\mathsf{d}\tau~{}\geq~{}0.

Recall the definition of c{i,k}c^{\{i,k\}} in a graphical constant-sum game. Since V1V_{1} is simply the total payoff to all agents, V1V_{1} equals the sum of the constants c{i,k}c^{\{i,k\}} over all edge-games, i.e. V1=i=1m1k=i+1mc{i,k}V_{1}=\sum_{i=1}^{m-1}\sum_{k=i+1}^{m}c^{\{i,k\}}. We denote this double summation by VV. To show that the game operator is passive, it remains to show that V2VV_{2}\geq V at all times; to show that it is lossless, it remains to show that V2=VV_{2}=V at all times.

Let the action set of agent ii be SiS_{i}. We first expand V2=𝐱^,𝐩^V_{2}=\left\langle\hat{\mathbf{x}}^{*},\hat{\mathbf{p}}\right\rangle as follows:

𝐱^,𝐩^=i=1mjSixijk=1kim[𝐀ik𝐱k]j=i=1mk=1kim(𝐱i)𝖳𝐀ik𝐱k=i=1mk=1kim(𝐱k)𝖳(𝐀ik)𝖳𝐱i;\left\langle\hat{\mathbf{x}}^{*},\hat{\mathbf{p}}\right\rangle~{}=~{}\sum_{i=1}^{m}\sum_{j\in S_{i}}x_{ij}^{*}\sum_{\begin{subarray}{c}k=1\\ k\neq i\end{subarray}}^{m}[\mathbf{A}^{ik}\mathbf{x}_{k}]_{j}~{}=~{}\sum_{i=1}^{m}\sum_{\begin{subarray}{c}k=1\\ k\neq i\end{subarray}}^{m}(\mathbf{x}_{i}^{*})^{\mathsf{T}}\mathbf{A}^{ik}\mathbf{x}_{k}~{}=~{}\sum_{i=1}^{m}\sum_{\begin{subarray}{c}k=1\\ k\neq i\end{subarray}}^{m}(\mathbf{x}_{k})^{\mathsf{T}}(\mathbf{A}^{ik})^{\mathsf{T}}\mathbf{x}_{i}^{*}~{};

the last equality holds since each summand (𝐱i)𝖳𝐀ik𝐱k(\mathbf{x}_{i}^{*})^{\mathsf{T}}\mathbf{A}^{ik}\mathbf{x}_{k} is a scalar, and a scalar equals its own transpose (𝐱k)𝖳(𝐀ik)𝖳𝐱i(\mathbf{x}_{k})^{\mathsf{T}}(\mathbf{A}^{ik})^{\mathsf{T}}\mathbf{x}_{i}^{*}. Then we rewrite the double summation on the RHS as follows:

i=1mk=1kim(𝐱k)𝖳(𝐀ik)𝖳𝐱i\displaystyle\sum_{i=1}^{m}\sum_{\begin{subarray}{c}k=1\\ k\neq i\end{subarray}}^{m}(\mathbf{x}_{k})^{\mathsf{T}}(\mathbf{A}^{ik})^{\mathsf{T}}\mathbf{x}_{i}^{*} =i=1mk=1kim[c{i,k}(𝐱k)𝖳𝐀ki𝐱i](definition of constant-sum edge-game)\displaystyle~{}=~{}\sum_{i=1}^{m}\sum_{\begin{subarray}{c}k=1\\ k\neq i\end{subarray}}^{m}\left[c^{\{i,k\}}-(\mathbf{x}_{k})^{\mathsf{T}}\mathbf{A}^{ki}\mathbf{x}_{i}^{*}\right]~{}~{}~{}~{}~{}~{}\text{(definition of constant-sum edge-game)}
=i=1mk=1kimc{i,k}k=1mi=1ikm(𝐱k)𝖳𝐀ki𝐱i\displaystyle~{}=~{}\sum_{i=1}^{m}\sum_{\begin{subarray}{c}k=1\\ k\neq i\end{subarray}}^{m}c^{\{i,k\}}~{}-~{}\sum_{k=1}^{m}\sum_{\begin{subarray}{c}i=1\\ i\neq k\end{subarray}}^{m}(\mathbf{x}_{k})^{\mathsf{T}}\mathbf{A}^{ki}\mathbf{x}_{i}^{*}
=2Vk=1mi=1ikm(𝐱k)𝖳𝐀ki𝐱iUk.\displaystyle~{}=~{}2V~{}-~{}\sum_{k=1}^{m}\underbrace{\sum_{\begin{subarray}{c}i=1\\ i\neq k\end{subarray}}^{m}(\mathbf{x}_{k})^{\mathsf{T}}\mathbf{A}^{ki}\mathbf{x}_{i}^{*}}_{U_{k}}.

It remains to bound the term k=1mUk\sum_{k=1}^{m}U_{k}. Observe that for each agent kk, UkU_{k} is the payoff to agent kk when she chooses the mixed strategy 𝐱k\mathbf{x}_{k}, while every other agent ii chooses the mixed strategy 𝐱i\mathbf{x}_{i}^{*}. Since the mixed strategies 𝐱i\mathbf{x}_{i}^{*} come from a Nash equilibrium (NE), by the definition of NE, UkvkU_{k}\leq v_{k}^{*}, where vkv_{k}^{*} is the payoff to agent kk at the NE. Thus, k=1mUkk=1mvk\sum_{k=1}^{m}U_{k}\leq\sum_{k=1}^{m}v_{k}^{*}, where the RHS is the total payoff to all agents at the NE. Since the game is constant-sum, we have k=1mvk=V\sum_{k=1}^{m}v_{k}^{*}=V. Hence, V2=𝐱^,𝐩^2VV=VV_{2}=\left\langle\hat{\mathbf{x}}^{*},\hat{\mathbf{p}}\right\rangle\geq 2V-V=V.

When the NE is fully-mixed, we have the following extra property: at the NE, for each agent kk, her payoff from every one of her actions is the same and equals vkv_{k}^{*}; hence any mixed strategy of agent kk, played against the NE strategies of the other agents, also earns exactly vkv_{k}^{*}. Thus, UkU_{k} equals vkv_{k}^{*} exactly, so V2=VV_{2}=V. ∎
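As a quick illustration of the identity V2 = V (a numerical check we add here, not part of the original proof), consider the two-agent zero-sum Rock-Paper-Scissors game with the standard payoff matrix; its fully-mixed Nash equilibrium is the uniform strategy for both agents, and with the second agent's matrix equal to the negative transpose of the first's, the game is constant-sum with V = 0.

```python
# A numerical check (illustrative; not from the paper) that V2 = V in two-agent
# zero-sum Rock-Paper-Scissors, whose fully-mixed Nash equilibrium is uniform.
import numpy as np

A12 = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]], dtype=float)  # agent 1's edge-game payoffs
A21 = -A12.T                                                       # agent 2's payoffs (zero-sum, so V = 0)
x_star = np.full(3, 1.0 / 3.0)                                     # fully-mixed NE strategy of both agents

rng = np.random.default_rng(1)
for _ in range(3):
    x1, x2 = rng.dirichlet(np.ones(3)), rng.dirichlet(np.ones(3))  # arbitrary current strategies
    V1 = x1 @ A12 @ x2 + x2 @ A21 @ x1                             # total payoff; equals V = 0
    V2 = x_star @ (A12 @ x2) + x_star @ (A21 @ x1)                 # NE strategies against current payoffs
    print(round(V1, 12), round(V2, 12))                            # both print 0.0 up to rounding
```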

Appendix C Poincaré Recurrence

We first formally state the following corollary of Proposition 17 and Theorem 2.

Corollary 20.

The FIC system which corresponds to a dynamical game system, in which S1S_{1} is any finitely lossless MLO and S2S_{2} is any game operator corresponding to a graphical constant-sum game with a fully-mixed Nash equilibrium, is finitely lossless. The storage function that demonstrates finite losslessness of the FIC system is the same as the storage function of S1S_{1}. When the external input 𝐫\mathbf{r} is the zero function, the storage function becomes a constant of motion.

To complete the proof of Theorem 19, we need to establish the second property required by the principled approach of [MPP18]. The argument relies crucially on the following lemma. Recall the function defined in the main paper, which converts the storage function of the original learning operator into the storage function of the new learning operator in (10):

L¯(q1,q2,,qn1)=L(q1,q2,,qn1,0),\overline{L}(q^{\prime}_{1},q^{\prime}_{2},\cdots,q^{\prime}_{n-1})=L(q^{\prime}_{1},q^{\prime}_{2},\cdots,q^{\prime}_{n-1},0), (11)
Lemma 21 (Adapted from [MPP18], Appendix D).

For any continuous FTRL dynamic and for any 𝐱Δn\mathbf{x}^{*}\in\Delta^{n}, let LL be its finitely lossless storage function defined in (8), and let L¯\overline{L} be the function defined on n1\mathbb{R}^{n-1} as in (11). Then any level set of L¯\overline{L} is bounded in n1\mathbb{R}^{n-1}, i.e. for any real number c¯\bar{c}, the set below is bounded:

{(q1,,qn1)|L¯(q1,,qn1)c¯}.\{~{}(q^{\prime}_{1},\cdots,q^{\prime}_{n-1})~{}\big{|}~{}\overline{L}(q^{\prime}_{1},\cdots,q^{\prime}_{n-1})\leq\bar{c}~{}\}.

Recall the definition of FTRL and Theorem 7. For each agent ii, suppose she uses a convex combination of i\ell_{i} FTRL dynamics indexed by i1,i2,,iii1,i2,\cdots,i\ell_{i}. Let the storage functions of these FTRL dynamics be Li1,Li2,,LiiL^{i1},L^{i2},\cdots,L^{i\ell_{i}}. Also, let 𝐪,i\mathbf{q}^{\prime,i} denote a vector in ni1\mathbb{R}^{n_{i}-1} for agent ii. Then the storage function of the whole dynamical game system is

i=1mj=1iαijL¯ij(𝐪,i),whereαij>0,andi,j=1iαij=1.\sum_{i=1}^{m}\sum_{j=1}^{\ell_{i}}\alpha_{ij}\cdot\overline{L}^{ij}(\mathbf{q}^{\prime,i}),~{}~{}~{}\text{where}~{}\alpha_{ij}>0,~{}\text{and}~{}\forall i,~{}\sum_{j=1}^{\ell_{i}}\alpha_{ij}=1.

By Corollary 20, this storage function is a constant of motion when 𝐫0\mathbf{r}\equiv 0, and thus it is bounded by some constant c¯\bar{c} once the starting point is fixed. Since every LijL^{ij}, and hence every L¯ij\overline{L}^{ij}, has infimum zero, we must have, for each agent ii, αi1L¯i1(𝐪,i)c¯\alpha_{i1}\cdot\overline{L}^{i1}(\mathbf{q}^{\prime,i})\leq\bar{c}, and hence L¯i1(𝐪,i)c¯/αi1\overline{L}^{i1}(\mathbf{q}^{\prime,i})\leq\bar{c}/\alpha_{i1}. Then by Lemma 21, for each agent ii, 𝐪,i(t)\mathbf{q}^{\prime,i}(t) remains bounded for all tt, and thus the overall vector 𝐪^(t)=(𝐪,1(t),𝐪,2(t),,𝐪,m(t))\hat{\mathbf{q}}^{\prime}(t)=(\mathbf{q}^{\prime,1}(t),\mathbf{q}^{\prime,2}(t),\cdots,\mathbf{q}^{\prime,m}(t)) also remains bounded for all tt.

Appendix D Escort Learning Dynamics

An escort learning dynamic [Har11] is a system of differential equations on variable 𝐱Δn\mathbf{x}\in\Delta^{n}: for each 1jn1\leq j\leq n,

x˙j=ϕj(xj)[pj=1nϕ(x)p=1nϕ(x)],\dot{x}_{j}~{}=~{}\phi_{j}(x_{j})\cdot\left[p_{j}-\frac{\sum_{\ell=1}^{n}\phi_{\ell}(x_{\ell})\cdot p_{\ell}}{\sum_{\ell=1}^{n}\phi_{\ell}(x_{\ell})}\right],

where each ϕj\phi_{j} is a positive function on the domain (0,1)(0,1). Note that when ϕj(xj)=xj\phi_{j}(x_{j})=x_{j}, this is the Replicator Dynamic.
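As a small sanity check, added here as an illustration (it is not taken from [Har11]), the sketch below verifies numerically that the escort vector field is tangent to the simplex: the components of ẋ sum to zero, so the dynamic never leaves the simplex. Taking ϕ(x)=x recovers the Replicator Dynamic, while ϕ(x)=x² is another admissible escort.

```python
# A minimal check (illustrative; not from the paper) that the escort vector field
# is tangent to the simplex: the entries of x_dot sum to zero for any payoff vector.
# phi(x) = x recovers the Replicator Dynamic; phi(x) = x**2 is another valid escort.
import numpy as np

def escort_rhs(x, p, phi):
    w = phi(x)                               # escort weights phi_j(x_j)
    return w * (p - np.dot(w, p) / w.sum())  # x_dot of the escort learning dynamic

rng = np.random.default_rng(0)
for phi in (lambda x: x, lambda x: x ** 2):
    x = rng.dirichlet(np.ones(4))            # random interior point of the simplex
    p = rng.normal(size=4)                   # arbitrary instantaneous payoff vector
    print(round(escort_rhs(x, p, phi).sum(), 12))   # 0.0 up to rounding
```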

Proposition 22.

Suppose a learning dynamic has the following property: if it starts at a point in the interior of Δn\Delta^{n}, then it stays in the interior forever. Then the learning dynamic is an FTRL dynamic via a separable strictly convex regularizer function h(𝐱)=i=1nhi(𝐱i)h(\mathbf{x})=\sum_{i=1}^{n}h_{i}(\mathbf{x}_{i}) if and only if it is an escort replicator dynamic.

Proof.

If the specified learning dynamic is FTRL, recall that the conversion function takes 𝐪\mathbf{q} as input and outputs the mixed strategy

argmax𝐱Δn{𝐪,𝐱h(𝐱)},\operatorname*{arg\,max}_{\mathbf{x}\in\Delta^{n}}\left\{\left\langle\mathbf{q},\mathbf{x}\right\rangle-h(\mathbf{x})\right\},

which we denote by 𝐱(𝐪)\mathbf{x}(\mathbf{q}) in this proof. Let x¯j=1/hj′′(xj)\bar{x}_{j}=1/h_{j}^{\prime\prime}(x_{j}) and H:=jx¯jH:=\sum_{j}\bar{x}_{j}. When 𝐱(𝐪)\mathbf{x}(\mathbf{q}) is in the interior of Δn\Delta^{n} for some 𝐪\mathbf{q}, by Appendix D of [CP19], we have

xjqj=x¯j[x¯j]2Handj,xjq=x¯jx¯H.\frac{\partial x_{j}}{\partial q_{j}}~{}=~{}\bar{x}_{j}-\frac{[\bar{x}_{j}]^{2}}{H}~{}~{}~{}~{}~{}~{}~{}~{}\text{and}~{}~{}~{}~{}~{}~{}~{}~{}\forall\ell\neq j,~{}~{}\frac{\partial x_{j}}{\partial q_{\ell}}~{}=~{}-\frac{\bar{x}_{j}\bar{x}_{\ell}}{H}.

By the chain rule,

x˙j=[x¯j[x¯j]2H]pj+j[x¯jx¯H]p=x¯j(pj=1nx¯p=1nx¯).\dot{x}_{j}~{}=~{}\left[\bar{x}_{j}-\frac{[\bar{x}_{j}]^{2}}{H}\right]\cdot p_{j}~{}+~{}\sum_{\ell\neq j}\left[-\frac{\bar{x}_{j}\bar{x}_{\ell}}{H}\right]p_{\ell}~{}=~{}\bar{x}_{j}\left(p_{j}-\frac{\sum_{\ell=1}^{n}\bar{x}_{\ell}p_{\ell}}{\sum_{\ell=1}^{n}\bar{x}_{\ell}}\right).

By identifying ϕj(xj)\phi_{j}(x_{j}) with x¯j\bar{x}_{j}, we see that the FTRL dynamic is an escort replicator dynamic. Precisely, we set ϕj(xj)=1/hj′′(xj)\phi_{j}(x_{j})=1/h_{j}^{\prime\prime}(x_{j}). Since hh is strictly convex, hj′′h_{j}^{\prime\prime} is a positive function, hence ϕj\phi_{j} is a positive function too.

Conversely, if the specified algorithm is an escort learning dynamic with escort function ϕj\phi_{j} for each jj, to show that it is an FTRL dynamic with some strictly convex regularizer hh, we set hh to be separable with, for each jj, hj′′(xj)=1/ϕj(xj)h_{j}^{\prime\prime}(x_{j})=1/\phi_{j}(x_{j}). Thus, it suffices to set hjh_{j} to be any double anti-derivative of 1/ϕj1/\phi_{j}. Since hj′′(xj)=1/ϕj(xj)>0h_{j}^{\prime\prime}(x_{j})=1/\phi_{j}(x_{j})>0, each hjh_{j} is strictly convex, and hence hh is strictly convex. ∎
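For concreteness, here are two standard instances of this correspondence, added for illustration and following directly from the rule \phi_{j}=1/h_{j}^{\prime\prime}: the entropic regularizer recovers the Replicator Dynamic, while the log-barrier regularizer yields a quadratic escort function:

h_{j}(x_{j})=x_{j}\ln x_{j}~\Rightarrow~h_{j}^{\prime\prime}(x_{j})=1/x_{j}~\Rightarrow~\phi_{j}(x_{j})=x_{j}\quad\text{(Replicator Dynamic)};

h_{j}(x_{j})=-\ln x_{j}~\Rightarrow~h_{j}^{\prime\prime}(x_{j})=1/x_{j}^{2}~\Rightarrow~\phi_{j}(x_{j})=x_{j}^{2}\quad\text{(log-barrier regularizer)}.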

Appendix E More Plots Illuminating Poincaré Recurrences

Finally, we present more plots below that illuminate Poincaré recurrences of learning in games.

Figure 6: Poincaré recurrences of Replicator Dynamics (RD; top) and Online Gradient Descent (OGD; bottom) in the classical two-agent Rock-Paper-Scissors game. Agent 1 starts with mixed strategy 𝐱1(0)=(0.5,0.25,0.25)\mathbf{x}_{1}(0)=(0.5,0.25,0.25), while agent 2 starts with mixed strategy 𝐱2(0)=(0.6,0.3,0.1)\mathbf{x}_{2}(0)=(0.6,0.3,0.1). The two graphs plot the logarithm of the Euclidean distance between (𝐱1(t),𝐱2(t))(\mathbf{x}_{1}(t),\mathbf{x}_{2}(t)) and (𝐱1(0),𝐱2(0))(\mathbf{x}_{1}(0),\mathbf{x}_{2}(0)), from t=0.1t=0.1 to t=500t=500. Every downward spike corresponds to a moment when the flow returns close to the starting point. In both cases, the distance drops below 10310^{-3} multiple times.
Figure 7: Poincaré recurrence of αRD+(1α)OGD\alpha\cdot\textsf{RD}+(1-\alpha)\cdot\textsf{OGD} in the classical Rock-Paper-Scissors game, for α=1/4\alpha=1/4 (top), α=1/2\alpha=1/2 (middle) and α=3/4\alpha=3/4 (bottom). The starting point is (𝐱1(0),𝐱2(0))=((0.5,0.25,0.25),(0.6,0.3,0.1))(\mathbf{x}_{1}(0),\mathbf{x}_{2}(0))=((0.5,0.25,0.25),(0.6,0.3,0.1)). The graphs plot the logarithm of the Euclidean distance between (𝐱1(t),𝐱2(t))(\mathbf{x}_{1}(t),\mathbf{x}_{2}(t)) and (𝐱1(0),𝐱2(0))(\mathbf{x}_{1}(0),\mathbf{x}_{2}(0)), from t=0.1t=0.1 to t=500t=500.
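For readers who wish to reproduce such plots, below is a minimal sketch (our illustration, not the authors' plotting code) that integrates the two-agent Replicator Dynamics in Rock-Paper-Scissors from the starting point above and plots the logarithm of the distance to the starting point, in the spirit of the top panel of Figure 6.

```python
# A minimal sketch (illustrative; not the authors' code) of a recurrence plot in
# the spirit of Figure 6 (top): Replicator Dynamics in Rock-Paper-Scissors.
import numpy as np
import matplotlib.pyplot as plt
from scipy.integrate import solve_ivp

A = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]], dtype=float)  # RPS payoff matrix of agent 1

def rhs(t, z):
    x1, x2 = z[:3], z[3:]
    p1, p2 = A @ x2, -A.T @ x1                   # instantaneous payoffs (zero-sum game)
    d1 = x1 * (p1 - np.dot(x1, p1))              # Replicator Dynamic for agent 1
    d2 = x2 * (p2 - np.dot(x2, p2))              # Replicator Dynamic for agent 2
    return np.concatenate([d1, d2])

z0 = np.array([0.5, 0.25, 0.25, 0.6, 0.3, 0.1])  # starting point used in Figure 6
ts = np.linspace(0.1, 500.0, 20000)
sol = solve_ivp(rhs, (0.0, 500.0), z0, t_eval=ts, rtol=1e-9, atol=1e-9)
dist = np.linalg.norm(sol.y.T - z0, axis=1)      # Euclidean distance to the starting point

plt.plot(ts, np.log10(dist))
plt.xlabel("time t")
plt.ylabel("log10 of distance to the starting point")
plt.show()
```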