
Time-dependent Blackwell approachability and application to absorbing games

Joon Kwon, Yijun Wan & Bruno Ziliotto
Abstract.

Blackwell’s approachability [5, 6] is a very general online learning framework where a Decision Maker obtains vector-valued outcomes and aims at the convergence of the average outcome to a given “target” set. Blackwell gave a sufficient condition for the Decision Maker to have a strategy guaranteeing such a convergence against an adversarial environment, as well as what is now called Blackwell’s algorithm, which ensures this convergence. Blackwell’s approachability has since been applied to numerous problems, in regret minimization and game theory in particular. We extend this framework by allowing the outcome function and the inner product to be time-dependent. We establish a general guarantee for the natural extension of Blackwell’s algorithm to this framework. In the case where the target set is an orthant, we present a family of time-dependent inner products which yields different convergence speeds for each coordinate of the average outcome. We apply this framework to absorbing games (an important class of stochastic games), for which we construct $\varepsilon$-uniformly optimal strategies using Blackwell’s algorithm in a well-chosen auxiliary approachability problem, thereby giving a novel illustration of the relevance of online learning tools for solving games.

1. Introduction

Von Neumann’s fundamental minimax theorem characterizes the highest payoff that each player of a finite two-player zero-sum game can guarantee. Blackwell [5, 6] proposed a surprising extension of this result to repeated games with vector-valued outcomes. For a given set called the target, he gave a sufficient condition for the player to have a strategy that guarantees convergence of the average outcome to the target. This condition is also necessary for convex sets. When the condition is satisfied, the original algorithm proposed by Blackwell, as well as many recent alternatives [22, 17, 1, 26, 50, 18, 10], guarantees such a convergence.

This approachability framework has since found numerous applications in online learning and game theory—see Perchet [44] for a survey. The very first application was an alternative solution to the fundamental sequential decision problem called regret minimization [5, 19, 8]. It has since been noticed that many variants and extensions of regret minimization are also special cases, e.g. internal/swap regret [54, 7], with variable stage duration [32], with sleeping experts [7], with fairness constraints [9], with global costs [13], etc. Further important areas of application include online resource allocation problems such as scheduling [23], capacity pooling [58], and reinforcement learning [31, 42, 57, 28]. A recent paper also proposed an interesting application of Blackwell’s algorithm to efficiently solving saddle-point problems [18].

Blackwell’s approachability has also been applied to game theory in various ways. It was first observed that regret minimization can be used to iteratively compute game solutions: Freund and Schapire [16] constructively proved von Neumann’s minimax theorem using the exponential weights algorithm. More generally, Hart and Mas-Colell [21] used Blackwell’s approachability to define an algorithm called Regret Matching and showed convergence to the set of correlated equilibria. Regret Matching turned out to be very effective in practice, and modern variants achieve state-of-the-art performance in solving large games such as poker [60, 55, 14]. Further applications include repeated games [33], with partial monitoring [43, 34, 35, 36], with incomplete information [25], and coalitional games [46]. Approachability was also studied in the context of stochastic games with vector-valued payoffs [15, 45].

Another important example of an (online) learning tool that has been successfully applied to games is reinforcement learning [48], which has been extended to multi-agent reinforcement learning in the context of stochastic games [29]. Reinforcement learning methods typically need each state to be visited infinitely often. By contrast, we present a novel application of (an extension of) Blackwell’s approachability to the important class of absorbing games, for which such an assumption cannot hold, because of irreversible state transitions.

Zero-sum stochastic games [49] feature two players who repeatedly play a zero-sum game with an outcome function that depends on a state variable. The state follows a Markov chain controlled by both players. In the $T$-stage game, the payoff function is the expected average payoff over $T$ stages, and we call $v_T$ its value. A large part of the literature studies the asymptotic properties of this model as the number of stages $T$ grows to infinity—see Mertens et al. [40], Sorin [53], Solan and Ziliotto [52], Solan [51] for general references. A subclass that has been intensively studied is absorbing games, where the state can move at most once during the game, thus in an irreversible fashion. When the state space and the action sets are finite, Kohlberg [24] proved the convergence of $v_T$ as $T\to+\infty$ [4]—the limit $v_\infty$ is called the limit value. Moreover, he proved the existence of the uniform value [38], i.e. for each $\varepsilon>0$, Player I (resp. Player II) has a strategy that guarantees $v_\infty-\varepsilon$ (resp. $v_\infty+\varepsilon$) for any sufficiently large $T$. Strategies satisfying such a property are called $\varepsilon$-uniformly optimal. The existence of the limit value and the uniform value were generalized to any stochastic game with finite state space and finite action sets, respectively in Bewley and Kohlberg [4] and Mertens and Neyman [38]. The limit value may fail to exist when the state space is infinite [59] or when one of the action sets is infinite [56], even under strong topological assumptions. The situation is more positive in the case of absorbing games. Indeed, as far as the existence of the limit value and uniform value is concerned, absorbing games with infinite state space can easily be reduced to a finite state space setting. Moreover, absorbing games with compact action sets and separately continuous payoff and transition functions have a uniform value [39]. The latter result was refined in Hansen et al. [20], who proved that there exist $\varepsilon$-uniformly optimal strategies that can be generated by a finite-state space automaton with transition functions that only depend on the stage of the game.

In this paper, we introduce an extension of Blackwell’s approachability framework where the outcome function and the inner product vary with time, and apply it to obtain a new proof of the existence of the limit value and the uniform value in absorbing games with compact action sets. As a result, we obtain a construction of $\varepsilon$-uniformly optimal strategies that is fairly different from the ones in Mertens et al. [39] and Hansen et al. [20]. The reader can find a detailed comparison between the constructions in Section 8. To our knowledge, this is the first time that approachability is used to design $\varepsilon$-uniformly optimal strategies. The connection between approachability and the uniform value builds on the following simple idea: an $\varepsilon$-uniformly optimal strategy should induce good stage payoffs while ensuring that the state does not get absorbed into a bad state. This dual objective can be recast as an approachability problem with a two-dimensional vector payoff function. The simplicity and generality of this idea make it amenable to promising extensions, and we hope that these tools and ideas will lead to a systematic approach for solving other stochastic games and sequential decision problems. In addition, our work constitutes another illustration of the relevance of online learning tools for solving games. Possible applications of our framework to regret minimization are mentioned in Section 9.

1.1. Contributions and Summary

We introduce in Section 2 an extension of Blackwell’s approachability framework where the outcome function and the inner product vary with time. We define the associated Blackwell’s condition and Blackwell’s algorithm.

In Section 3, we establish a general bound on the distance of the average outcome to the target set, measured by the time-dependent inner product. The bound holds as soon as Blackwell’s condition is satisfied and depends on the norm of each past outcome, where the outcome vector at time $t\geqslant 1$ is measured with the corresponding norm at time $t$—this feature being essential in the application to absorbing games.

In the case where the target set is an orthant, Corollary 3.3 proposes a family of time-dependent inner products which yields different convergence speeds for each coordinate of the average outcome vector.

In Sections 4 to 7, we recall the definition of absorbing games and showcase the above tools and results by applying them to the construction of $\varepsilon$-uniformly optimal strategies.

1.2. Related Work

Our presentation of Blackwell’s approachability focuses on the case of target sets which are closed convex cones, as in Abernethy et al. [1] where the properties of the convex cone are used to convert regret minimization algorithms into approachability ones. Time-dependent outcome functions also appear in Lee et al. [27] with the restriction that they are convex-concave.

1.3. Notation

Let $d\geqslant 1$ be an integer. For a vector $u\in\mathbb{R}^{d}$, we denote by $u=(u^{(1)},\dots,u^{(d)})$ its components.

2. A Generalized Approachability Framework

We define an extension of Blackwell’s approachability framework and condition, and establish a guarantee for the corresponding Blackwell’s algorithm. Our approach allows the outcome function and the inner product to vary with time.

Let $\mathcal{A},\mathcal{B}$ be action sets (with no particular structure) for the Decision Maker and Nature respectively. Let $d\geqslant 1$ be an integer and $\mathcal{R}$ a subset of $(\mathbb{R}^{d})^{\mathcal{A}\times\mathcal{B}}$.

The interaction goes as follows. At time $t\geqslant 1$,

  • Nature chooses an outcome function $\rho_{t}\in\mathcal{R}$ and reveals it to the Decision Maker,

  • the Decision Maker chooses an action $a_{t}\in\mathcal{A}$,

  • Nature chooses an action $b_{t}\in\mathcal{B}$,

  • the outcome vector $r_{t}:=\rho_{t}(a_{t},b_{t})\in\mathbb{R}^{d}$ is revealed to the Decision Maker.

Formally, an algorithm of the Decision Maker is a sequence of maps $(\sigma_{t})_{t\geqslant 1}$ where, for $t\geqslant 1$, $\sigma_{t}:(\mathbb{R}^{d})^{t-1}\times\mathcal{R}^{t}\to\mathcal{A}$. For a given such algorithm, a sequence of outcome functions $(\rho_{t})_{t\geqslant 1}$ in $\mathcal{R}$ and a sequence of Nature’s actions $(b_{t})_{t\geqslant 1}$ in $\mathcal{B}$, the actions of the Decision Maker are then defined for $t\geqslant 1$ as:

a_{t}=\sigma_{t}(r_{1},\dots,r_{t-1},\rho_{1},\dots,\rho_{t})=\sigma_{t}\left(\rho_{1}(a_{1},b_{1}),\dots,\rho_{t-1}(a_{t-1},b_{t-1}),\rho_{1},\dots,\rho_{t}\right).

Our construction and analysis can be extended to general closed convex target sets, as discussed in Section A; here we restrict to closed convex cones for simplicity.

Definition 2.1.

A nonempty subset $\mathcal{C}\subset\mathbb{R}^{d}$ is a closed convex cone if it is closed and if for all $z,z^{\prime}\in\mathcal{C}$ and $\lambda\in\mathbb{R}_{+}$, it holds that $z+z^{\prime}\in\mathcal{C}$ and $\lambda z\in\mathcal{C}$.

Remark 2.2.

It is immediate to verify that a closed convex cone is indeed convex. Consequently, the orthogonal projection onto a closed convex cone (with respect to any inner product) is well-defined.

Let $\mathcal{C}\subset\mathbb{R}^{d}$ be a closed convex cone called the target and $(\left<\,\cdot\,,\,\cdot\,\right>_{(t)})_{t\geqslant 1}$ a sequence of inner products on $\mathbb{R}^{d}$. For $t\geqslant 1$, denote by $\left\|\,\cdot\,\right\|_{(t)}$ and $\pi_{(t)}^{\mathcal{C}}$ the associated Euclidean norm and the associated Euclidean projection onto $\mathcal{C}$ respectively. In other words, for $r\in\mathbb{R}^{d}$,

\left\|r\right\|_{(t)}=\left<r,r\right>_{(t)}^{1/2},\qquad
\pi_{(t)}^{\mathcal{C}}(r)=\operatorname*{arg\,min}_{r^{\prime}\in\mathcal{C}}\left\|r^{\prime}-r\right\|_{(t)}.

We now present the extension of Blackwell’s condition [5, 6] to our framework.

Definition 2.3 (Blackwell’s condition).

$\mathcal{C}$ satisfies Blackwell’s condition (with respect to $\mathcal{R}$ and $(\left<\,\cdot\,,\,\cdot\,\right>_{(t)})_{t\geqslant 1}$) if for all $t\geqslant 1$, $\rho\in\mathcal{R}$ and $r\in\mathbb{R}^{d}$, there exists $a\left[t,\rho,r\right]\in\mathcal{A}$ such that for all $b\in\mathcal{B}$,

\left<\rho\left(a\left[t,\rho,r\right],b\right),r-\pi_{(t)}^{\mathcal{C}}(r)\right>_{(t)}\leqslant 0.

The above map

a:\ \mathbb{N}^{*}\times\mathcal{R}\times\mathbb{R}^{d}\longrightarrow\mathcal{A},\qquad(t,\rho,r)\longmapsto a\left[t,\rho,r\right]

is called an oracle associated with $\mathcal{C}$, $\mathcal{R}$ and $(\left<\,\cdot\,,\,\cdot\,\right>_{(t)})_{t\geqslant 1}$.

Definition 2.4 (Blackwell’s algorithm).

Let $\mathcal{C}$ satisfy Blackwell’s condition (with respect to $\mathcal{R}$ and $(\left<\,\cdot\,,\,\cdot\,\right>_{(t)})_{t\geqslant 1}$), let $a$ be an associated oracle, and let $a_{1}\in\mathcal{A}$. The corresponding Blackwell’s algorithm is defined by $\sigma_{1}=a_{1}$ and, for all $t\geqslant 2$, $r_{1},\dots,r_{t-1}\in\mathbb{R}^{d}$ and $\rho_{1},\dots,\rho_{t}\in\mathcal{R}$,

\sigma_{t}(r_{1},\dots,r_{t-1},\rho_{1},\dots,\rho_{t})=a\left[t,\rho_{t},\sum_{s=1}^{t-1}r_{s}\right].

In other words, for a given sequence of outcome functions $(\rho_{t})_{t\geqslant 1}$ in $\mathcal{R}$ and a sequence of Nature’s actions $(b_{t})_{t\geqslant 1}$ in $\mathcal{B}$, Blackwell’s algorithm yields the following actions for the Decision Maker:

a_{t}=a\left[t,\rho_{t},\sum_{s=1}^{t-1}\rho_{s}(a_{s},b_{s})\right],\quad t\geqslant 2.
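The loop below is a minimal sketch of this interaction, given purely as an illustration and not as part of the formal development: `oracle`, the pairs `(rho_t, nature_action)` yielded by `rounds`, and the initial action `a1` are hypothetical placeholders for an oracle satisfying Blackwell’s condition, the stage outcome functions together with Nature’s (possibly adversarial) replies, and the action $a_{1}$ of Definition 2.4.

```python
import numpy as np

def blackwell_algorithm(oracle, a1, rounds):
    """Run the time-dependent Blackwell algorithm of Definition 2.4 (sketch).

    oracle(t, rho_t, R) is assumed to implement the map a[t, rho, r] of
    Definition 2.3; `rounds` yields, for each stage t >= 1, a pair
    (rho_t, nature_action) where rho_t(a, b) returns an outcome vector
    and nature_action(a) is Nature's reply b_t.
    """
    R = None                 # cumulative outcome R_{t-1} = r_1 + ... + r_{t-1}
    outcomes = []
    for t, (rho_t, nature_action) in enumerate(rounds, start=1):
        # stage 1 plays the fixed action a_1; afterwards, the oracle is called
        # on the current outcome function and the cumulative outcome
        a_t = a1 if t == 1 else oracle(t, rho_t, R)
        b_t = nature_action(a_t)
        r_t = np.asarray(rho_t(a_t, b_t), dtype=float)
        R = r_t if R is None else R + r_t
        outcomes.append(r_t)
    return np.array(outcomes)    # averaging the rows gives (1/T) * sum_t r_t
```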
Remark 2.5.

For Blackwell’s algorithm to be used by the Decision Maker, the latter needs to know the outcome function $\rho_{t}$ and the inner product $\left<\,\cdot\,,\,\cdot\,\right>_{(t)}$ before choosing its action at stage $t\geqslant 1$, as the oracle depends on both.

In the case where the action sets are convex with $\mathcal{A}$ compact, and the outcome functions are bi-affine, an equivalent Blackwell’s dual condition is given below; it is often easier to verify, but does not explicitly provide an oracle—there are however alternative approachability algorithms based on the dual condition [3]. An interesting consequence of the following is that whether a closed convex target set $\mathcal{C}$ satisfies Blackwell’s condition does not depend on the sequence of inner products $(\left<\,\cdot\,,\,\cdot\,\right>_{(t)})_{t\geqslant 1}$.

Proposition 2.6 (Blackwell’s dual condition).

Assume that $\mathcal{A}$ and $\mathcal{B}$ are convex subsets of finite-dimensional vector spaces, that $\mathcal{A}$ is compact, and that all outcome functions $\rho\in\mathcal{R}$ are bi-affine. Then, a closed convex cone $\mathcal{C}\subset\mathbb{R}^{d}$ satisfies Blackwell’s condition with respect to $\mathcal{R}$ and $(\left<\,\cdot\,,\,\cdot\,\right>_{(t)})_{t\geqslant 1}$ if, and only if,

(\ast)\qquad\forall\rho\in\mathcal{R},\ \forall b\in\mathcal{B},\ \exists a\in\mathcal{A},\quad\rho(a,b)\in\mathcal{C}.
Proof.

Let $t\geqslant 1$. We first introduce the polar cone of $\mathcal{C}$ associated with the inner product $\left<\,\cdot\,,\,\cdot\,\right>_{(t)}$:

\mathcal{C}^{\circ}_{(t)}:=\left\{z\in\mathbb{R}^{d}\,\middle|\,\forall r\in\mathcal{C},\ \left<r,z\right>_{(t)}\leqslant 0\right\}.

Since $\mathcal{C}$ is a closed convex cone, an important property is that $(\mathcal{C}^{\circ}_{(t)})^{\circ}_{(t)}=\mathcal{C}$; in other words, $r$ belongs to $\mathcal{C}$ if, and only if, $\max_{z\in\mathcal{C}^{\circ}_{(t)}}\left<r,z\right>_{(t)}\leqslant 0$. Moreover, Moreau’s decomposition theorem states that $r=\pi_{(t)}^{\mathcal{C}}(r)+\pi_{(t)}^{\mathcal{C}^{\circ}_{(t)}}(r)$ for all $r\in\mathbb{R}^{d}$. As an immediate consequence, it holds that $\left\{r-\pi_{(t)}^{\mathcal{C}}(r)\right\}_{r\in\mathbb{R}^{d}}=\mathcal{C}^{\circ}_{(t)}$.

Then, Blackwell’s condition can be written

\forall t\geqslant 1,\ \forall\rho\in\mathcal{R},\quad\max_{z\in\mathcal{C}^{\circ}_{(t)}}\min_{a\in\mathcal{A}}\max_{b\in\mathcal{B}}\left<\rho(a,b),z\right>_{(t)}\leqslant 0.

Since the quantity $\left<\rho(a,b),z\right>_{(t)}$ is affine in each of the variables $a$, $b$ and $z$, all three sets are convex, and $\mathcal{A}$ is compact, we can apply Sion’s minimax theorem twice and equivalently write:

\forall t\geqslant 1,\ \forall\rho\in\mathcal{R},\quad\max_{b\in\mathcal{B}}\min_{a\in\mathcal{A}}\max_{z\in\mathcal{C}^{\circ}_{(t)}}\left<\rho(a,b),z\right>_{(t)}\leqslant 0,

which is exactly the dual condition ($\ast$). ∎

3. Analysis

We first give a general guarantee for Blackwell’s algorithm.

Theorem 3.1.

If the family of norms $(\left\|\,\cdot\,\right\|_{(t)})_{t\geqslant 1}$ is nonincreasing (meaning $\left\|\,\cdot\,\right\|_{(t+1)}\leqslant\left\|\,\cdot\,\right\|_{(t)}$ for all $t\geqslant 1$), and $\mathcal{C}$ satisfies Blackwell’s condition (with respect to $\mathcal{R}$ and $(\left<\,\cdot\,,\,\cdot\,\right>_{(t)})_{t\geqslant 1}$), then Blackwell’s algorithm, associated with a corresponding oracle, guarantees for all $T\geqslant 1$,

\min_{r\in\mathcal{C}}\left\|\frac{1}{T}\sum_{t=1}^{T}r_{t}-r\right\|_{(T)}\leqslant\frac{1}{T}\sqrt{\sum_{t=1}^{T}\left\|r_{t}\right\|_{(t)}^{2}}.
Proof.

Denote $R_{0}=0$ and $R_{t}=\sum_{s=1}^{t}r_{s}$ for $t\geqslant 1$. For all $t\geqslant 1$,

\min_{r\in\mathcal{C}}\left\|R_{t}-r\right\|_{(t)}^{2}=\left\|R_{t}-\pi_{(t)}^{\mathcal{C}}(R_{t})\right\|^{2}_{(t)}\leqslant\left\|R_{t}-\pi_{(t)}^{\mathcal{C}}(R_{t-1})\right\|_{(t)}^{2}
=\left\|R_{t-1}+r_{t}-\pi_{(t)}^{\mathcal{C}}(R_{t-1})\right\|_{(t)}^{2}
=\left\|R_{t-1}-\pi_{(t)}^{\mathcal{C}}(R_{t-1})\right\|_{(t)}^{2}+2\left<R_{t-1}-\pi_{(t)}^{\mathcal{C}}(R_{t-1}),r_{t}\right>_{(t)}+\left\|r_{t}\right\|_{(t)}^{2}.

First note that for $t=1$, the first two terms of the last expression above are zero because $R_{t-1}=\pi_{(t)}^{\mathcal{C}}(R_{t-1})=0$. For $t\geqslant 2$, we use the nonincreasingness of the family of norms to bound the first term of the last expression above as:

(1)\qquad\left\|R_{t-1}-\pi_{(t)}^{\mathcal{C}}(R_{t-1})\right\|_{(t)}^{2}\leqslant\left\|R_{t-1}-\pi_{(t)}^{\mathcal{C}}(R_{t-1})\right\|_{(t-1)}^{2}=\min_{r\in\mathcal{C}}\left\|R_{t-1}-r\right\|_{(t-1)}^{2},

and the second term (the inner product) is nonpositive because the actions $(a_{t})_{t\geqslant 2}$ are given by Blackwell’s algorithm: denoting by $a$ the involved oracle, we get

\left<R_{t-1}-\pi_{(t)}^{\mathcal{C}}(R_{t-1}),r_{t}\right>_{(t)}=\left<R_{t-1}-\pi_{(t)}^{\mathcal{C}}(R_{t-1}),\rho_{t}(a\left[t,\rho_{t},R_{t-1}\right],b_{t})\right>_{(t)},

which is nonpositive by Blackwell’s condition.

Therefore, it holds for $t\geqslant 2$ that $\min_{r\in\mathcal{C}}\left\|R_{t}-r\right\|_{(t)}^{2}\leqslant\min_{r\in\mathcal{C}}\left\|R_{t-1}-r\right\|_{(t-1)}^{2}+\left\|r_{t}\right\|_{(t)}^{2}$, and $\min_{r\in\mathcal{C}}\left\|R_{1}-r\right\|_{(1)}^{2}\leqslant\left\|r_{1}\right\|_{(1)}^{2}$. Summing and taking the square root gives

\min_{r\in\mathcal{C}}\left\|R_{T}-r\right\|_{(T)}\leqslant\sqrt{\sum_{t=1}^{T}\left\|r_{t}\right\|_{(t)}^{2}}.

We then conclude by remarking that, $\mathcal{C}$ being a cone, it holds that $T\mathcal{C}=\mathcal{C}$ and

\min_{r\in\mathcal{C}}\left\|\frac{1}{T}\sum_{t=1}^{T}r_{t}-r\right\|_{(T)}=\min_{r\in T\mathcal{C}}\frac{1}{T}\left\|R_{T}-r\right\|_{(T)}=\frac{1}{T}\min_{r\in\mathcal{C}}\left\|R_{T}-r\right\|_{(T)}.

In the basic setting where the inner product is constant and the outcome vectors are bounded by a given quantity, the above guarantee recovers the classical $1/\sqrt{T}$ convergence speed.
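For instance, if the inner product does not depend on $t$ and the outcomes satisfy $\left\|r_{t}\right\|\leqslant M$ for all $t\geqslant 1$ (where $M$ denotes this common bound), the right-hand side of Theorem 3.1 becomes

\frac{1}{T}\sqrt{\sum_{t=1}^{T}M^{2}}=\frac{M}{\sqrt{T}},

so the distance of the average outcome to the target decreases at rate $M/\sqrt{T}$.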

Remark 3.2.

A similar analysis could be carried out without the assumption that the norms $(\left\|\,\cdot\,\right\|_{(t)})_{t\geqslant 1}$ are nonincreasing, but inequality (1) would not hold and the derived upper bound would contain additional correction terms and would not be as neat as in Theorem 3.1.

We assume in the following corollary that the target set is the negative orthant $\mathcal{C}=\mathbb{R}_{-}^{d}$ (which is indeed a closed convex cone) and present a general choice of time-dependent inner products which yields different convergence rates for each component of the average outcome vector. This construction and analysis will be applied to absorbing games in Sections 4 to 7.

Corollary 3.3.

Let $(\mu_{t}^{(1)})_{t\geqslant 1},\dots,(\mu_{t}^{(d)})_{t\geqslant 1}$ be $d$ positive and nonincreasing sequences, and consider the following inner products and associated norms:

\left<u,v\right>_{(t)}:=\sum_{i=1}^{d}(\mu_{t}^{(i)})^{2}u^{(i)}v^{(i)},\qquad\left\|u\right\|_{(t)}:=\left<u,u\right>_{(t)}^{1/2},\qquad u,v\in\mathbb{R}^{d},\quad t\geqslant 1.

If $\mathcal{C}=\mathbb{R}_{-}^{d}$ satisfies Blackwell’s condition with respect to $\mathcal{R}$ and $(\left<\,\cdot\,,\,\cdot\,\right>_{(t)})_{t\geqslant 1}$, then Blackwell’s algorithm associated with a corresponding oracle guarantees, for each component $1\leqslant i\leqslant d$ and all $T\geqslant 1$,

\frac{1}{T}\sum_{t=1}^{T}r_{t}^{(i)}\leqslant\frac{1}{T\mu_{T}^{(i)}}\sqrt{\sum_{t=1}^{T}\left\|r_{t}\right\|_{(t)}^{2}}.
Proof.

Because the sequences $(\mu_{t}^{(1)})_{t\geqslant 1},\dots,(\mu_{t}^{(d)})_{t\geqslant 1}$ are nonincreasing, so are the associated norms $(\left\|\,\cdot\,\right\|_{(t)})_{t\geqslant 1}$, and Theorem 3.1 applies and gives

\min_{r\in\mathbb{R}_{-}^{d}}\left\|\frac{1}{T}\sum_{t=1}^{T}r_{t}-r\right\|_{(T)}\leqslant\frac{1}{T}\sqrt{\sum_{t=1}^{T}\left\|r_{t}\right\|_{(t)}^{2}}.

Besides, for each component $1\leqslant i_{0}\leqslant d$, we can write

\min_{r\in\mathbb{R}_{-}^{d}}\left\|\frac{1}{T}\sum_{t=1}^{T}r_{t}-r\right\|_{(T)}=\min_{r^{(1)},\dots,r^{(d)}\leqslant 0}\sqrt{\sum_{i=1}^{d}(\mu_{T}^{(i)})^{2}\left(\frac{1}{T}\sum_{t=1}^{T}r_{t}^{(i)}-r^{(i)}\right)^{2}}
\geqslant\min_{r^{(i_{0})}\leqslant 0}\mu_{T}^{(i_{0})}\left|\frac{1}{T}\sum_{t=1}^{T}r_{t}^{(i_{0})}-r^{(i_{0})}\right|
\geqslant\frac{\mu_{T}^{(i_{0})}}{T}\sum_{t=1}^{T}r_{t}^{(i_{0})}.

The result follows. ∎
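As an illustration of how the guarantee of Corollary 3.3 is evaluated, the following sketch (with hypothetical inputs `mu` and `outcomes`) computes the per-coordinate right-hand side of the corollary; the random outcomes in the example are placeholder data used only to exercise the formula, not outcomes produced by Blackwell’s algorithm.

```python
import numpy as np

def corollary_3_3_bounds(mu, outcomes):
    """Per-coordinate upper bounds of Corollary 3.3.

    mu       : array of shape (T, d); mu[t-1, i] = mu_t^{(i)} > 0, nonincreasing in t
    outcomes : array of shape (T, d); outcomes[t-1] = r_t
    Returns an array of shape (d,) whose i-th entry equals
    (1 / (T * mu_T^{(i)})) * sqrt(sum_t ||r_t||_{(t)}^2).
    """
    mu = np.asarray(mu, dtype=float)
    r = np.asarray(outcomes, dtype=float)
    T, _ = r.shape
    squared_norms = np.sum((mu * r) ** 2, axis=1)   # ||r_t||_{(t)}^2 for each t
    radius = np.sqrt(squared_norms.sum())
    return radius / (T * mu[-1])                    # one bound per coordinate

# Example: constant weight on coordinate 1, decaying weight on coordinate 2,
# which yields a slower guaranteed rate for the second coordinate.
T = 1000
mu = np.column_stack([np.ones(T), 1.0 / np.sqrt(np.arange(1, T + 1))])
rng = np.random.default_rng(0)
outcomes = rng.uniform(-1.0, 1.0, size=(T, 2))
print(corollary_3_3_bounds(mu, outcomes))
```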

4. Absorbing Games

We now apply the above approachability tools to two-player zero-sum absorbing games, thereby giving a novel illustration of the relevance of online learning tools for solving games. We give an alternative proof of a classical result [24, 39]: the existence, for each player and each $\varepsilon>0$, of a strategy that is $\varepsilon$-optimal irrespective of the duration of the game, provided that the duration is long enough. We first recall in this section the definition of absorbing games and the result we aim at proving, and Sections 5 to 7 are dedicated to the proof. Section 8 compares our approach to the related literature.

For a compact or measurable space $A$, we denote by $\Delta(A)$ the set of probability distributions over $A$. If $A$ is compact, $\Delta(A)$ equipped with the weak topology is also compact.

4.1. Definition

A two-player zero-sum absorbing game is described by a tuple $(\Omega,I,J,q,g,g^{*})$, where:

  • $\Omega=\Omega^{*}\cup\left\{\omega\right\}$ is the state space, which is the union of a finite set $\Omega^{*}$ of absorbing states and the singleton $\left\{\omega\right\}$ containing the unique non-absorbing state;

  • $I$ and $J$ are the pure action sets of Player I and Player II, respectively; they are assumed to be compact topological sets; the elements of $\Delta(I)$ and $\Delta(J)$ are called mixed actions;

  • $q:I\times J\to\Delta(\Omega)$ is the state transition function, and is assumed to be separately continuous on $I\times J$; for $i\in I$, $j\in J$ and $\omega\in\Omega$, we denote by $q(\omega|i,j)$ the component (probability weight) of $q(i,j)$ corresponding to $\omega$;

  • $g:I\times J\to\mathbb{R}$ is the non-absorbing payoff function, and is assumed to be separately continuous on $I\times J$ and bounded;

  • $g^{*}:\Omega\rightarrow\mathbb{R}$ is the absorbing payoff function.

Starting from the initial state $\omega_{1}=\omega$, at each stage $t\geqslant 1$, the game proceeds as follows.

  • If the current state $\omega_{t}$ is $\omega$, then simultaneously, Player I (resp. Player II) chooses some action $i_{t}\in I$ (resp. $j_{t}\in J$), possibly drawn according to a mixed action $x_{t}\in\Delta(I)$ (resp. $y_{t}\in\Delta(J)$). Player I (resp. Player II) gets a stage payoff $g(i_{t},j_{t})$ (resp. $-g(i_{t},j_{t})$). A new state $\omega_{t+1}$ is drawn according to the distribution $q(i_{t},j_{t})$, and the players observe $(i_{t},j_{t},\omega_{t+1})$.

  • If the current state $\omega_{t}$ belongs to $\Omega^{*}$, then there is no strategic interaction, and Player I gets a stage payoff $g^{*}(\omega_{t})$. The next state is again $\omega_{t+1}=\omega_{t}$.

We denote the stage payoff as

(2)\qquad g_{t}:=\begin{cases}g(i_{t},j_{t})&\text{if }\omega_{t}=\omega\\ g^{*}(\omega_{t})&\text{if }\omega_{t}\in\Omega^{*}.\end{cases}
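To fix ideas, the following sketch simulates this interaction under simplifying assumptions that are ours, not the paper’s: $I$ and $J$ are taken finite, mixed actions are probability vectors, and `q`, `g`, `g_star`, `strategy_I`, `strategy_II` are hypothetical callables encoding the transition function, the payoff functions and the players’ strategies.

```python
import numpy as np

def simulate_play(states, omega, I, J, q, g, g_star, strategy_I, strategy_II, T, seed=0):
    """Simulate T stages of an absorbing game and return the average payoff.

    q(i, j) returns a probability vector over `states`; strategy_I / strategy_II
    map the list of past action pairs to mixed actions over I / J.
    """
    rng = np.random.default_rng(seed)
    state, history, payoffs = omega, [], []
    for t in range(1, T + 1):
        if state == omega:                       # non-absorbing state: play the stage game
            x_t = strategy_I(history)
            y_t = strategy_II(history)
            i_t = I[rng.choice(len(I), p=x_t)]
            j_t = J[rng.choice(len(J), p=y_t)]
            payoffs.append(g(i_t, j_t))
            state = states[rng.choice(len(states), p=q(i_t, j_t))]
            history.append((i_t, j_t))
        else:                                    # absorbed: the payoff is frozen at g*(state)
            payoffs.append(g_star(state))
    return np.mean(payoffs)                      # empirical counterpart of gamma_T
```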
Remark 4.1.

Once an absorbing state is reached, the game is essentially over, at least from a strategic point of view. For this reason, we only consider $\omega$ as a possible initial state.

A sequence $(\omega_{1},i_{1},j_{1},\dots,\omega_{t},i_{t},j_{t},\dots)\in(\Omega\times I\times J)^{\mathbb{N}^{*}}$ is called a play.

  • A behavior strategy for a player specifies a mixed action for each possible set of past observations. Formally, for Player I, it is defined as a collection of measurable maps $\sigma=(\sigma_{t})_{t\geqslant 1}$, where $\sigma_{t}:(I\times J)^{t-1}\rightarrow\Delta(I)$. A behavior strategy $\tau=(\tau_{t})_{t\geqslant 1}$ for Player II is defined similarly.

  • A Markov strategy, in the context of absorbing games, is a strategy that plays as a function of the current stage. It can be identified with a sequence of elements in $\Delta(I)$ (resp. $\Delta(J)$) for Player I (resp. Player II). Pure Markov strategies are identified with sequences in $I$ and $J$.

  • A stationary strategy, in the context of absorbing games, is a strategy that plays independently of the past history. It can be identified with an element of $\Delta(I)$ for Player I (resp. $\Delta(J)$ for Player II).

The sets of behavior strategies for Player I and II are denoted by $\Sigma$ and $\mathcal{T}$, respectively. A pair of strategies $(\sigma,\tau)\in\Sigma\times\mathcal{T}$ naturally defines, using Kolmogorov’s extension theorem, a probability distribution $\mathbb{P}_{\sigma,\tau}$ over the set of plays $(\Omega\times I\times J)^{\mathbb{N}^{*}}$, satisfying, for $(\omega_{1},i_{1},j_{1},\dots,\omega_{t},i_{t},j_{t},\dots)\sim\mathbb{P}_{\sigma,\tau}$, $\omega_{1}=\omega$ almost surely, and then for all $t\geqslant 1$,

h_{t-1}:=(i_{1},j_{1},\dots,i_{t-1},j_{t-1}),\qquad x_{t}:=\sigma_{t}(h_{t-1}),\qquad y_{t}:=\tau_{t}(h_{t-1}),
i_{t}\mid h_{t-1}\sim x_{t},\qquad j_{t}\mid h_{t-1}\sim y_{t},\qquad i_{t}\perp j_{t}\mid h_{t-1},
\omega_{t+1}\mid i_{t},j_{t}\sim\begin{cases}\delta_{\omega_{t}}&\text{if }\omega_{t}\in\Omega^{*}\\ q(i_{t},j_{t})&\text{otherwise}.\end{cases}

We denote by $\mathbb{E}_{\sigma,\tau}\left[\,\cdot\,\right]$ the expectation with respect to $\mathbb{P}_{\sigma,\tau}$.

Let $T\geqslant 1$ and $\lambda\in(0,1)$. The $T$-stage game (resp. $\lambda$-discounted game), denoted by $\Gamma_{T}$ (resp. $\Gamma_{\lambda}$), is the game with strategy sets $\Sigma$ and $\mathcal{T}$, and payoff function

\gamma_{T}(\sigma,\tau):=\mathbb{E}_{\sigma,\tau}\left[\frac{1}{T}\sum_{t=1}^{T}g_{t}\right],\qquad\sigma\in\Sigma,\ \tau\in\mathcal{T},

respectively

\gamma_{\lambda}(\sigma,\tau):=\mathbb{E}_{\sigma,\tau}\left[\sum_{t\geqslant 1}\lambda(1-\lambda)^{t-1}g_{t}\right],\qquad\sigma\in\Sigma,\ \tau\in\mathcal{T},

where $g_{t}$ is defined as in (2).

4.2. Values

The games $\Gamma_{\lambda}$ and $\Gamma_{T}$ are known to have values [30], which are

v_{T}:=\max_{\sigma\in\Sigma}\min_{\tau\in\mathcal{T}}\gamma_{T}(\sigma,\tau)=\min_{\tau\in\mathcal{T}}\max_{\sigma\in\Sigma}\gamma_{T}(\sigma,\tau),
(3)\qquad v_{\lambda}:=\max_{\sigma\in\Sigma}\min_{\tau\in\mathcal{T}}\gamma_{\lambda}(\sigma,\tau)=\min_{\tau\in\mathcal{T}}\max_{\sigma\in\Sigma}\gamma_{\lambda}(\sigma,\tau).

The value $v_{\lambda}$ can be interpreted as the payoff solution of the game $\Gamma_{\lambda}$: when both players play rationally, Player I (resp. Player II) should get at least $v_{\lambda}$ (resp. at most $-v_{\lambda}$). A strategy is optimal for Player I (resp. for Player II) in $\Gamma_{\lambda}$ if it reaches the left-hand-side maximum (resp. the right-hand-side minimum) in (3). Similar interpretations and definitions hold for $\Gamma_{T}$.

In what follows, we consider the linear extension of the non-absorbing payoff function $g$ to $\Delta(I)\times J$, meaning

g(x,j):=\int_{I}g(i,j)\,\mathrm{d}x(i),\qquad x\in\Delta(I),\ j\in J.

Similarly, the transition function $q$ is linearly extended to $\Delta(I)\times J$.

Optimal strategies can a priori be quite sophisticated. It turns out that in the discounted game, there exist “simple” optimal strategies [30], as shown by the following proposition.

Proposition 4.2.

Let $\lambda\in(0,1)$. Each player has an optimal stationary strategy in $\Gamma_{\lambda}$. Moreover, if $x_{\lambda}\in\Delta(I)$ is an optimal stationary strategy for Player I, then for all $j\in J$,

v_{\lambda}\leqslant\lambda g(x_{\lambda},j)+(1-\lambda)\left(\sum_{\omega^{*}\in\Omega^{*}}q(\omega^{*}|x_{\lambda},j)g^{*}(\omega^{*})+q(\omega|x_{\lambda},j)v_{\lambda}\right).

The above inequality stems from the fact that $v_{\lambda}$ satisfies a functional equation called the Shapley equation (we refer the reader to Shapley [49] when the state space and the action sets are finite, and to Maitra and Parthasarathy [30] for the general case).

Let us explain the above inequality intuitively. Assume that, in state $\omega$ in the discounted game $\Gamma_{\lambda}$, Player I plays the stationary strategy $x_{\lambda}$ at every stage. Assume moreover that Player II plays the pure action $j$ at the first stage, and then plays optimally in $\Gamma_{\lambda}$ from stage 2 onward. At stage 1, the expected stage payoff is $g(x_{\lambda},j)$, and the next state is distributed according to $q(x_{\lambda},j)$. If the next state is $\omega^{*}\in\Omega^{*}$, then the total payoff from the next stage onward is $\sum_{t\geqslant 2}\lambda(1-\lambda)^{t-1}g^{*}(\omega^{*})=(1-\lambda)g^{*}(\omega^{*})$. If the next state is $\omega$, since both players play optimally from stage 2 onward, the payoff from stage 2 is equal to $(1-\lambda)v_{\lambda}$. Hence, the right-hand side of the inequality is exactly what Player I gets in $\Gamma_{\lambda}$ under this scenario. Since $x_{\lambda}$ is optimal, this quantity should be at least $v_{\lambda}$, and this yields the inequality.

Definition 4.3.

Let $w\in\mathbb{R}$.

  1. (i)

    Player I can uniformly guarantee $w$ if for all $\varepsilon>0$, there exist $\sigma\in\Sigma$ and $T_{0}\geqslant 1$ such that for all $T\geqslant T_{0}$ and $\tau\in\mathcal{T}$, $\gamma_{T}(\sigma,\tau)\geqslant w-\varepsilon$.

  2. (ii)

    Similarly, Player II can uniformly guarantee $w$ if for each $\varepsilon>0$, there exist $\tau\in\mathcal{T}$ and $T_{0}\geqslant 1$ such that for all $T\geqslant T_{0}$ and $\sigma\in\Sigma$, $\gamma_{T}(\sigma,\tau)\leqslant w+\varepsilon$.

Mertens et al. [39] have proved the following result.

Theorem 4.4.

There exists a unique $v\in\mathbb{R}$ such that both players can uniformly guarantee $v$. Moreover,

v=\lim_{T\to+\infty}v_{T}=\lim_{\lambda\to 0^{+}}v_{\lambda}.

This generalizes a result of Kohlberg [24], that treated the case of finite action sets. Another proof of Theorem 4.4 was also given in Hansen et al. [20].

Remark 4.5.

The uniqueness of $v$ follows from Definition 4.3, and $v$ is called the uniform value. The fact that the existence of $v$ implies the convergence of $v_{T}$ and $v_{\lambda}$ to $v$ is rather straightforward, and can be found for instance in [53, Chapter 4]. This limit $v$ is called the limit value.

Our goal is to give a proof of Theorem 4.4 based on Blackwell’s approachability. We will prove that Player I can uniformly guarantee $\limsup_{\lambda\rightarrow 0^{+}}v_{\lambda}$. By switching the players’ roles, we then immediately deduce that Player II can uniformly guarantee $\liminf_{\lambda\rightarrow 0^{+}}v_{\lambda}$, hence can uniformly guarantee $\limsup_{\lambda\rightarrow 0^{+}}v_{\lambda}$. Together with Remark 4.5, this proves the theorem. The reader can find a comparison between this new proof and the proofs from Kohlberg [24], Mertens et al. [39] and Hansen et al. [20] in Section 8.

Let $\varepsilon\in(0,1/2)$ be fixed. The remainder of the paper is devoted to constructing a strategy for Player I that guarantees $\limsup_{\lambda\rightarrow 0^{+}}v_{\lambda}-O(\varepsilon)$ in $\Gamma_{T}$ for large enough $T$. Up to translating and multiplying the payoffs by a scalar, we can assume the following without loss of generality.

Assumption 1.
\limsup_{\lambda\to 0^{+}}v_{\lambda}=\varepsilon,\qquad\left\|g\right\|_{\infty}\leqslant 1\quad\text{and}\quad\left\|g^{*}\right\|_{\infty}\leqslant 1.

5. Balanced Strategies and Lower Bounds on the Average Payoff

For each $\lambda\in(0,1)$, let $x_{\lambda}\in\Delta(I)$ be an optimal stationary strategy for Player I. We consider a sequence $(\lambda_{k})_{k\geqslant 1}$ such that

(4)\qquad v_{\lambda_{k}}\xrightarrow[k\rightarrow+\infty]{}\limsup_{\lambda\to 0^{+}}v_{\lambda}\quad\text{and}\quad x_{\lambda_{k}}\ \text{converges to some}\ x_{0}\in\Delta(I).

Thanks to Assumption 1 and the fact that $g$ is separately continuous, there exists $\lambda\in(0,1)$ such that the following holds.

For the remainder of the paper, let $\lambda\in(0,1)$ be such that for all $j\in J$,

(5)\qquad|g(x_{0},j)-g(x_{\lambda},j)|\leqslant\varepsilon/2\qquad\text{and}\qquad v_{\lambda}\geqslant\varepsilon/2.

Definition 5.1.

A strategy $(\sigma_{t})_{t\geqslant 1}$ of Player I is said to be balanced (with respect to $x_{0}$ and $x_{\lambda}$) if for any $t\geqslant 1$ and history $h_{t-1}=(i_{1},j_{1},\dots,i_{t-1},j_{t-1})\in(I\times J)^{t-1}$,

  1. (i)

    $\sigma_{t}(h_{t-1})$ only depends on $(j_{1},\dots,j_{t-1})$;

  2. (ii)

    $\sigma_{t}(h_{t-1})$ is a convex combination of $x_{0}$ and $x_{\lambda}$.

A first important point is that, in order to prove that a balanced strategy guarantees some quantity, it is sufficient to verify that it guarantees that quantity against all pure Markov strategies of Player II. The proof is quite standard and is given in Section B.1 for completeness.

Lemma 5.2.

Let $\sigma\in\Sigma$ be a strategy that satisfies (i) in Definition 5.1. Let $w\in\mathbb{R}$ and $T\geqslant 1$. Assume that for any pure Markov strategy $\tau=(j_{t})_{t\geqslant 1}$ of Player II, $\gamma_{T}(\sigma,\tau)\geqslant w$. Then for any $\tau\in\mathcal{T}$, $\gamma_{T}(\sigma,\tau)\geqslant w$.


For the remainder of the paper,

  • let $\sigma$ be a balanced strategy for Player I with respect to $x_{0}$ and $x_{\lambda}$,

  • let $\tau=(j_{t})_{t\geqslant 1}$ be a pure Markov strategy for Player II,

  • and for each $t\geqslant 1$, let $a_{t}\in\left[0,1\right]$ be the weight on the mixed action $x_{\lambda}$ when $\sigma$ is played against $\tau$, in other words:

    (6)\qquad\sigma_{t}(j_{1},\dots,j_{t-1})=a_{t}x_{\lambda}+(1-a_{t})x_{0}.

We now introduce three quantities that play a crucial role in our construction. For all $j\in J$, let

g^{\sharp}(j):=\sum_{\omega^{*}\in\Omega^{*}}g^{*}(\omega^{*})\,q(\omega^{*}|x_{\lambda},j)\quad\text{and}\quad g^{\flat}(j):=g(x_{0},j).

The quantity $g^{\sharp}(j)$ corresponds to the future expected payoff (induced by $x_{\lambda}$ and $j$) in the case where the next state is absorbed, whereas the quantity $g^{\flat}(j)$ is the present payoff in $\omega$, which is close to the payoff given by $x_{\lambda}$ and $j$, thanks to (5).

The case where $g^{\flat}(j)=g(x_{0},j)\geqslant 0$ for all $j\in J$ is easy, because the strategy for Player I that plays the mixed action $x_{0}$ at each stage obviously guarantees $0$ in $\Gamma_{T}$ for all $T\geqslant 1$, which reaches our goal. Therefore, we will consider only the following case.

Assumption 2.

There exists $j\in J$ such that $g^{\flat}(j)<0$.

The following lemma gives an expression for the probability that the game is in state $\omega$ at stage $t$. The proof is given in Appendix B.2.

Lemma 5.3.

Let

(7)\qquad\alpha_{t}:=\prod_{s=1}^{t-1}\big{(}a_{s}q(\omega|x_{\lambda},j_{s})+(1-a_{s})q(\omega|x_{0},j_{s})\big{)},\qquad t\geqslant 1.

Then, $\mathbb{P}_{\sigma,\tau}\left[\omega_{t}=\omega\right]=\alpha_{t}$.

We are now ready to state a key lower bound on the payoff $\gamma_{T}(\sigma,\tau)$ that only involves the quantities $g^{\sharp}(j)$, $g^{\flat}(j)$, $\alpha_{s}$ and $a_{s}$.

Proposition 5.4.

For all $T\geqslant 1$,

(8)\qquad\gamma_{T}(\sigma,\tau)\geqslant\frac{1}{T}\left(\sum_{t=1}^{T}\alpha_{t}g^{\flat}(j_{t})+\sum_{t=1}^{T}\sum_{s=1}^{t-1}\alpha_{s}a_{s}g^{\sharp}(j_{s})\right)-\varepsilon.

We first prove the following lemma and then establish Proposition 5.4.

Lemma 5.5.

Let $j\in J$. The following statements hold.

  1. (i)
    \lambda g^{\flat}(j)+(1-\lambda)g^{\sharp}(j)\geqslant 0.

    In particular, either $g^{\sharp}(j)\geqslant 0$ or $g^{\flat}(j)\geqslant 0$.

  2. (ii)
    \sum_{\omega^{*}\in\Omega^{*}}g^{*}(\omega^{*})\,q(\omega^{*}|\,x_{0},j)\geqslant 0.
Proof.
  1. (i)

    Using (5) and Proposition 4.2, we get

    \lambda g^{\flat}(j)+(1-\lambda)g^{\sharp}(j)=\lambda\,g(x_{0},j)+(1-\lambda)\sum_{\omega^{*}\in\Omega^{*}}g^{*}(\omega^{*})q(\omega^{*}|x_{\lambda},j)
    \geqslant\lambda\,g(x_{\lambda},j)+(1-\lambda)\sum_{\omega^{*}\in\Omega^{*}}g^{*}(\omega^{*})\,q(\omega^{*}|x_{\lambda},j)-\lambda\varepsilon/2
    \geqslant v_{\lambda}\big{(}1-(1-\lambda)q(\omega|x_{\lambda},j)\big{)}-\lambda\varepsilon/2
    \geqslant\frac{\varepsilon}{2}\big{(}1-(1-\lambda)\big{)}-\lambda\varepsilon/2
    \geqslant 0.
  2. (ii)

    Using Assumption 1, Proposition 4.2, the continuity of the transition kernel $q$ with respect to $i\in I$ (thus continuity with respect to the weak topology on the space of mixed actions), and a sequence $(\lambda_{k})_{k\geqslant 1}$ satisfying (4),

    \sum_{\omega^{*}\in\Omega^{*}}g^{*}(\omega^{*})q(\omega^{*}|x_{0},j)=\lim_{k\rightarrow+\infty}\left(\lambda_{k}g(x_{\lambda_{k}},j)+(1-\lambda_{k})\sum_{\omega^{*}\in\Omega^{*}}g^{*}(\omega^{*})q(\omega^{*}|x_{\lambda_{k}},j)\right)
    \geqslant\lim_{k\rightarrow+\infty}v_{\lambda_{k}}(1-(1-\lambda_{k})q(\omega|x_{\lambda_{k}},j))\geqslant 0.

Proof of Proposition 5.4.

We have

\gamma_{T}(\sigma,\tau)=\frac{1}{T}\mathbb{E}_{\sigma,\tau}\left[\sum_{t=1}^{T}\mathbbm{1}_{\{\omega_{t}=\omega\}}g(i_{t},j_{t})\right]+\frac{1}{T}\mathbb{E}_{\sigma,\tau}\left[\sum_{t=1}^{T}\sum_{\omega^{*}\in\Omega^{*}}\mathbbm{1}_{\{\omega_{t}=\omega^{*}\}}g^{*}(\omega^{*})\right]
=\frac{1}{T}\mathbb{E}_{\sigma,\tau}\left[\sum_{t=1}^{T}\mathbbm{1}_{\{\omega_{t}=\omega\}}g(i_{t},j_{t})+\sum_{t=1}^{T}\sum_{s=1}^{t-1}\mathbbm{1}_{\{\omega_{s}=\omega\}}\sum_{\omega^{*}\in\Omega^{*}}g^{*}(\omega^{*})q(\omega^{*}\,|\,i_{s},j_{s})\right].

Let us bound from below the term inside the first sum. We have

\mathbb{E}_{\sigma,\tau}\left[\mathbbm{1}_{\{\omega_{t}=\omega\}}g(i_{t},j_{t})\right]=\mathbb{P}_{\sigma,\tau}\left[\omega_{t}=\omega\right]\left(a_{t}g(x_{\lambda},j_{t})+(1-a_{t})g(x_{0},j_{t})\right)
\geqslant\alpha_{t}\left(a_{t}(g(x_{0},j_{t})-\varepsilon/2)+(1-a_{t})g(x_{0},j_{t})\right)
\geqslant\alpha_{t}g^{\flat}(j_{t})-\varepsilon,

where the first equality stems from the independence of $\omega_{t}$ and $i_{t}$, and the second inequality comes from (5). Using the same independence and Property (ii) from Lemma 5.5, we get

\mathbb{E}_{\sigma,\tau}\left[\mathbbm{1}_{\{\omega_{s}=\omega\}}\sum_{\omega^{*}\in\Omega^{*}}g^{*}(\omega^{*})q(\omega^{*}|i_{s},j_{s})\right]=\alpha_{s}\sum_{\omega^{*}\in\Omega^{*}}g^{*}(\omega^{*})\left(a_{s}q(\omega^{*}|x_{\lambda},j_{s})+(1-a_{s})q(\omega^{*}|x_{0},j_{s})\right)\geqslant\alpha_{s}a_{s}g^{\sharp}(j_{s}).

The result follows from plugging the two previous inequalities into the first equality. ∎

Now that the proof of Proposition 5.4 is complete, let us explain at a broad level how we can relate the lower bound to the approachability tools that we developed in the first part of the paper. The lower bound (8) involves two sums. A first idea would be to consider a two-dimensional vector payoff corresponding to the two terms involved in the sums, namely $(-\alpha_{s}g^{\flat}(j_{s}),-\alpha_{s}a_{s}g^{\sharp}(j_{s}))$, and show that Player I can approach the set $\mathbb{R}^{2}_{-}$ for this vector payoff function. This would hopefully prove that both sums are greater than some $-O(\varepsilon)$ when $T$ is large enough. This simple idea does not work, for two reasons. First, the component $-\alpha_{s}g^{\flat}(j_{s})$ does not depend on $a_{s}$, hence there is no hope that Player I can force this component to be negative, even on average. Moreover, the second sum is a double sum. As a consequence, a single quantity $a_{s}$ can have a contribution to the average payoff that does not vanish as $T\to+\infty$. Hence, we in fact need all the $a_{s}$ to be very small, which motivates the introduction of a scaling factor that multiplies the $a_{s}$. These considerations lead to the following modified lower bound.

Proposition 5.6.

Let $(\mu_{t})_{t\geqslant 1}$ be a positive and nonincreasing sequence and $(a_{t})_{t\geqslant 1}$ be defined as in (6). If $a_{t}\in\left[0,\mu_{t}\right]$ for all $t\geqslant 1$, then for all $T\geqslant 4/(\lambda\varepsilon^{2}\mu_{T})$,

(9)\qquad\gamma_{T}(\sigma,\tau)\geqslant\frac{1}{T}\left(\sum_{t=1}^{T}\alpha_{t}\left(1-\mu_{t}^{-1}a_{t}\right)g^{\flat}(j_{t})+\sum_{t=1}^{T}\sum_{s=1}^{t-1}\alpha_{s}a_{s}g^{\sharp}(j_{s})\right)-2\varepsilon.
Proof.

From Proposition 5.4, we have

\gamma_{T}(\sigma,\tau)\geqslant\frac{1}{T}\left(\sum_{t=1}^{T}\alpha_{t}g^{\flat}(j_{t})+\sum_{t=1}^{T}\sum_{s=1}^{t-1}\alpha_{s}a_{s}g^{\sharp}(j_{s})\right)-\varepsilon
=\frac{1}{T}\left(\sum_{t=1}^{T}\alpha_{t}\left(1-\mu_{t}^{-1}a_{t}\right)g^{\flat}(j_{t})+\sum_{t=1}^{T}\sum_{s=1}^{t-1}\alpha_{s}a_{s}g^{\sharp}(j_{s})\right)+\frac{1}{T}\sum_{t=1}^{T}\mu^{-1}_{t}\alpha_{t}a_{t}\,g^{\flat}(j_{t})-\varepsilon.

To prove Proposition 5.6, it is enough to show that for $T\geqslant 4/(\lambda\varepsilon^{2}\mu_{T})$, the right-most average above is higher than $-\varepsilon$.

Denote $q^{*}(j):=1-q(\omega\,|\,x_{\lambda},j)$, which is the probability that the game is absorbed when Player I plays $x_{\lambda}$ and Player II plays $j$, and $J(\varepsilon):=\{j\in J:q^{*}(j)\leqslant\lambda\varepsilon/2\}$. Let $j\in J(\varepsilon)$. Using Proposition 4.2 and the fact that $g^{*}\leqslant 1$,

\lambda g(x_{\lambda},j)\geqslant v_{\lambda}\big{(}1-(1-\lambda)q(\omega|x_{\lambda},j)\big{)}-(1-\lambda)\sum_{\omega^{*}\in\Omega^{*}}g^{*}(\omega^{*})q(\omega^{*}|x_{\lambda},j)
\geqslant v_{\lambda}\big{(}1-(1-\lambda)(1-q^{*}(j))\big{)}-(1-\lambda)\sum_{\omega^{*}\in\Omega^{*}}q(\omega^{*}|x_{\lambda},j)
=v_{\lambda}\big{(}\lambda+(1-\lambda)q^{*}(j)\big{)}-(1-\lambda)q^{*}(j)
=\lambda v_{\lambda}+(1-\lambda)q^{*}(j)(v_{\lambda}-1)
\geqslant\frac{\lambda\varepsilon}{2}-\frac{\lambda\varepsilon}{2}=0,

where the last inequality follows from (5) and the fact that $j\in J(\varepsilon)$. Hence $g(x_{\lambda},j)\geqslant 0$, and we deduce that

\frac{1}{T}\sum_{t=1}^{T}\alpha_{t}\mu^{-1}_{t}a_{t}g(x_{\lambda},j_{t})\geqslant\frac{1}{T}\sum_{\begin{subarray}{c}1\leqslant t\leqslant T\\ j_{t}\notin J(\varepsilon)\end{subarray}}\alpha_{t}\mu^{-1}_{t}a_{t}g(x_{\lambda},j_{t}).

Let $t\geqslant 1$ be such that $j_{t}\in J\setminus J(\varepsilon)$. We have

\alpha_{t+1}=\big{(}a_{t}q(\omega|x_{\lambda},j_{t})+(1-a_{t})q(\omega|x_{0},j_{t})\big{)}\alpha_{t}
\leqslant\big{(}a_{t}(1-q^{*}(j_{t}))+(1-a_{t})\big{)}\alpha_{t}\leqslant\left(1-\frac{\lambda\varepsilon}{2}a_{t}\right)\alpha_{t},

from which we deduce that for all $t\geqslant 1$,

\alpha_{t}\leqslant\prod_{\begin{subarray}{c}1\leqslant s<t\\ j_{s}\notin J(\varepsilon)\end{subarray}}\left(1-\frac{\lambda\varepsilon}{2}a_{s}\right).

Combining with the fact that $g\geqslant-1$ and $(\mu_{t})_{t\geqslant 1}$ is nonincreasing, we obtain

\sum_{\begin{subarray}{c}1\leqslant t\leqslant T\\ j_{t}\notin J(\varepsilon)\end{subarray}}\alpha_{t}\mu^{-1}_{t}a_{t}g(x_{\lambda},j_{t})\geqslant-\mu^{-1}_{T}\sum_{\begin{subarray}{c}1\leqslant t\leqslant T\\ j_{t}\notin J(\varepsilon)\end{subarray}}\alpha_{t}a_{t}
\geqslant-\mu^{-1}_{T}\sum_{\begin{subarray}{c}1\leqslant t\leqslant T\\ j_{t}\notin J(\varepsilon)\end{subarray}}a_{t}\prod_{\begin{subarray}{c}1\leqslant s<t\\ j_{s}\notin J(\varepsilon)\end{subarray}}\left(1-\frac{\lambda\varepsilon}{2}a_{s}\right).

Define $p_{s}:=0$ if $j_{s}\in J(\varepsilon)$, and $p_{s}:=\lambda\varepsilon a_{s}/2$ otherwise, so that

\sum_{\begin{subarray}{c}1\leqslant t\leqslant T\\ j_{t}\notin J(\varepsilon)\end{subarray}}a_{t}\prod_{\begin{subarray}{c}1\leqslant s<t\\ j_{s}\notin J(\varepsilon)\end{subarray}}\left(1-\frac{\lambda\varepsilon}{2}a_{s}\right)=\frac{2}{\lambda\varepsilon}\sum_{t=1}^{\infty}p_{t}\prod_{s=1}^{t-1}\left(1-p_{s}\right).

Let $X_{1},X_{2},\dots$ be a sequence of independent Bernoulli random variables with $\mathbb{P}[X_{t}=1]=p_{t}$. Then

\sum_{t=1}^{\infty}p_{t}\prod_{s=1}^{t-1}\left(1-p_{s}\right)=\sum_{t=1}^{\infty}\mathbb{P}[X_{1},\dots,X_{t-1}=0,X_{t}=1]=\mathbb{P}[\exists\,t,\,X_{t}=1]\leqslant 1.

We deduce that

\sum_{\begin{subarray}{c}1\leqslant t\leqslant T\\ j_{t}\notin J(\varepsilon)\end{subarray}}a_{t}\prod_{\begin{subarray}{c}1\leqslant s<t\\ j_{s}\notin J(\varepsilon)\end{subarray}}\left(1-\frac{\lambda\varepsilon}{2}a_{s}\right)\leqslant\frac{2}{\lambda\varepsilon}.

Therefore, if $T\geqslant 4/(\lambda\varepsilon^{2}\mu_{T})$,

\frac{1}{T}\sum_{t=1}^{T}\alpha_{t}\mu^{-1}_{t}a_{t}g(x_{\lambda},j_{t})\geqslant-\frac{2}{\lambda\varepsilon\mu_{T}T}\geqslant-\frac{\varepsilon}{2}.

Using (5) and the fact that $a_{t}\leqslant\mu_{t}$ by assumption, we deduce that

\frac{1}{T}\sum_{t=1}^{T}\mu^{-1}_{t}\alpha_{t}a_{t}g(x_{0},j_{t})\geqslant-\varepsilon,

which completes the proof. ∎

In the next section, we apply the approachability tools from the first part of the paper to a two-dimensional vector payoff, composed of the two terms that appear in each sum of the lower bound from Proposition 5.6.

6. An Auxiliary Approachability Problem

We consider the sets of actions $\mathcal{A}=[0,1]$, $\mathcal{B}=J$, and target set $\mathcal{C}=\mathbb{R}_{-}^{2}$. For $\alpha\geqslant 0$ and $\mu>0$, denote

\rho_{\alpha,\mu}:(a,j)\mapsto\begin{pmatrix}-\alpha ag^{\sharp}(j)\\ \alpha(\mu^{-1}a-1)g^{\flat}(j)\end{pmatrix},

and let $\mathcal{R}=\left\{\rho_{\alpha,\mu}\right\}_{\alpha\geqslant 0,\,\mu\in(0,1]}$ be the set of possible outcome functions. Let $(\mu_{t})_{t\geqslant 1}$ be a positive and nonincreasing sequence in $(0,1]$, and define the following inner products on $\mathbb{R}^{2}$:

\left<u,v\right>_{(t)}=u^{(1)}v^{(1)}+\mu_{t}^{2}u^{(2)}v^{(2)},\qquad u,v\in\mathbb{R}^{2},\quad t\geqslant 1.

For $R\in\mathbb{R}^{2}$, define

\pi(R):=(\max(0,R^{(1)}),\max(0,R^{(2)})).
Proposition 6.1 (Blackwell’s condition).

Let $t\geqslant 1$, $\alpha\geqslant 0$, $\mu\in(0,1]$ and $R=(R^{(1)},R^{(2)})\in\mathbb{R}^{2}$. Denote by $(\tilde{R}^{(1)},\tilde{R}^{(2)})$ the components of $\pi(R)$.

  1. (i)

    $\pi(R)=R-\operatorname*{arg\,min}_{R^{\prime}\in\mathbb{R}_{-}^{2}}\left\|R^{\prime}-R\right\|_{(t)}$.

  2. (ii)

    The following quantity

    a[t,\rho_{\alpha,\mu},R]:=\begin{cases}0&\text{if }\pi(R)=0\\ \displaystyle\sup_{\begin{subarray}{c}j\in J\\ g^{\flat}(j)<0\end{subarray}}\frac{\mu_{t}^{2}g^{\flat}(j)\tilde{R}^{(2)}}{\mu^{2}_{t}\mu^{-1}g^{\flat}(j)\tilde{R}^{(2)}-g^{\sharp}(j)\tilde{R}^{(1)}}&\text{otherwise,}\end{cases}

    is well-defined and satisfies $0\leqslant a[t,\rho_{\alpha,\mu},R]\leqslant\mu\leqslant 1$;

  3. (iii)

    For all $j\in J$ and $R\in\mathbb{R}^{2}$,

    \left<\rho_{\alpha,\mu}(a[t,\rho_{\alpha,\mu},R],j),\pi(R)\right>_{(t)}\leqslant 0.

In other words, $\mathcal{C}=\mathbb{R}_{-}^{2}$ satisfies Blackwell’s condition with respect to $\mathcal{R}$ and $(\left<\,\cdot\,,\,\cdot\,\right>_{(t)})_{t\geqslant 1}$, with the mapping $(t,\rho,R)\mapsto a[t,\rho,R]$ being an associated oracle.

Proof.

We will repeatedly use the fact that $\tilde{R}^{(1)}$ and $\tilde{R}^{(2)}$ are nonnegative by definition.

  1. (i)

    $\pi(R)$ can be written as

    \pi(R)=R-\operatorname*{arg\,min}_{R^{\prime}\in\mathbb{R}_{-}^{2}}\left\|R^{\prime}-R\right\|_{(t)}^{2}
    =R-\operatorname*{arg\,min}_{R^{\prime(1)},R^{\prime(2)}\leqslant 0}\left\{(R^{\prime(1)}-R^{(1)})^{2}+\mu_{t}^{2}(R^{\prime(2)}-R^{(2)})^{2}\right\}.

    Therefore, for each component $i\in\left\{1,2\right\}$,

    \tilde{R}^{(i)}=R^{(i)}-\operatorname*{arg\,min}_{R^{\prime(i)}\leqslant 0}(R^{\prime(i)}-R^{(i)})^{2}=R^{(i)}-\min(0,R^{(i)})=\max(R^{(i)},0)\geqslant 0.

  2. (ii)

    $a[t,\rho_{\alpha,\mu},R]$ is either zero or defined as the supremum of a fraction. In the latter case, meaning $\pi(R)\neq 0$, first note that the supremum is taken over a nonempty set thanks to Assumption 2. Then, for $j\in J$ such that $g^{\flat}(j)<0$, we have $g^{\sharp}(j)>0$ by Property (i) in Lemma 5.5, and $\mu$ is positive from the statement. Therefore, the numerator is nonpositive and the denominator is negative, and thus

    0\leqslant\frac{\mu_{t}^{2}g^{\flat}(j)\tilde{R}^{(2)}}{\mu_{t}^{2}\mu^{-1}g^{\flat}(j)\tilde{R}^{(2)}-g^{\sharp}(j)\tilde{R}^{(1)}}\leqslant\frac{\mu_{t}^{2}g^{\flat}(j)\tilde{R}^{(2)}}{\mu_{t}^{2}\mu^{-1}g^{\flat}(j)\tilde{R}^{(2)}}=\mu.
  3. (iii)

    Let $j\in J$ be fixed and denote $g^{\sharp}:=g^{\sharp}(j)$ and $g^{\flat}:=g^{\flat}(j)$ for short. For every $a\in[0,\mu]$, the following inner product writes

    \left<\rho_{\alpha,\mu}(a,j),\pi(R)\right>_{(t)}=-\alpha\left(ag^{\sharp}\tilde{R}^{(1)}+\mu_{t}^{2}(1-\mu^{-1}a)g^{\flat}\tilde{R}^{(2)}\right)
    =-\alpha\left(a\left(g^{\sharp}\tilde{R}^{(1)}-\mu^{2}_{t}\mu^{-1}g^{\flat}\tilde{R}^{(2)}\right)+\mu_{t}^{2}g^{\flat}\tilde{R}^{(2)}\right).

    If $\pi(R)=0$, the result is true for any choice of $a$.

    Otherwise, if $g^{\flat}=0$, it follows that $g^{\sharp}\geqslant 0$ by Property (i) in Lemma 5.5, and

    \left<\rho_{\alpha,\mu}(a,j),\pi(R)\right>_{(t)}=-\alpha\,a\,g^{\sharp}\tilde{R}^{(1)}\leqslant 0.

    If $g^{\flat}>0$ and $g^{\sharp}\geqslant 0$, since $a\leqslant\mu$, we have

    \left<\rho_{\alpha,\mu}(a,j),\pi(R)\right>_{(t)}=-\alpha\left(ag^{\sharp}\tilde{R}^{(1)}+\mu_{t}^{2}(1-\mu^{-1}a)g^{\flat}\tilde{R}^{(2)}\right)\leqslant 0,

    because all quantities inside the outer parentheses are nonnegative.

    If $g^{\flat}<0$, which implies that $g^{\sharp}>0$, we have $g^{\sharp}\tilde{R}^{(1)}-\mu_{t}^{2}\mu^{-1}g^{\flat}\tilde{R}^{(2)}>0$ and therefore

    \left<\rho_{\alpha,\mu}(a[t,\rho_{\alpha,\mu},R],j),\pi(R)\right>_{(t)}
    =-\alpha\left(a[t,\rho_{\alpha,\mu},R]\left(g^{\sharp}\tilde{R}^{(1)}-\mu^{2}_{t}\mu^{-1}g^{\flat}\tilde{R}^{(2)}\right)+\mu_{t}^{2}g^{\flat}\tilde{R}^{(2)}\right)
    =-\alpha\left(\left(g^{\sharp}\tilde{R}^{(1)}-\mu^{2}_{t}\mu^{-1}g^{\flat}\tilde{R}^{(2)}\right)\sup_{\begin{subarray}{c}j^{\prime}\in J\\ g^{\flat}(j^{\prime})<0\end{subarray}}\frac{-\mu_{t}^{2}g^{\flat}(j^{\prime})\tilde{R}^{(2)}}{g^{\sharp}(j^{\prime})\tilde{R}^{(1)}-\mu_{t}^{2}\mu^{-1}g^{\flat}(j^{\prime})\tilde{R}^{(2)}}+\mu_{t}^{2}g^{\flat}\tilde{R}^{(2)}\right)
    \leqslant-\alpha\left(\left(g^{\sharp}\tilde{R}^{(1)}-\mu^{2}_{t}\mu^{-1}g^{\flat}\tilde{R}^{(2)}\right)\frac{-\mu_{t}^{2}g^{\flat}\tilde{R}^{(2)}}{g^{\sharp}\tilde{R}^{(1)}-\mu_{t}^{2}\mu^{-1}g^{\flat}\tilde{R}^{(2)}}+\mu_{t}^{2}g^{\flat}\tilde{R}^{(2)}\right)
    \leqslant 0.

    We now turn to the last remaining case and assume g>0g^{\flat}>0 and g0g^{\sharp}\leqslant 0. Then, note that μt2μ1gR~(2)gR~(1)0\mu^{2}_{t}\mu^{-1}g^{\flat}\tilde{R}^{(2)}-g^{\sharp}\tilde{R}^{(1)}\geqslant 0. Property (iii) is equivalent to

a[t,\rho_{\alpha,\mu},R]\left(\mu^{2}_{t}\mu^{-1}g^{\flat}\tilde{R}^{(2)}-g^{\sharp}\tilde{R}^{(1)}\right)-\mu_{t}^{2}g^{\flat}\tilde{R}^{(2)}\leqslant 0.

    If μt2μ1gR~(2)gR~(1)=0\mu^{2}_{t}\mu^{-1}g^{\flat}\tilde{R}^{(2)}-g^{\sharp}\tilde{R}^{(1)}=0 or R~(2)=0\tilde{R}^{(2)}=0, the property is easily satisfied (because R~(2)=0\tilde{R}^{(2)}=0 implies a[t,ρα,μ,R]=0a\left[t,\rho_{\alpha,\mu},R\right]=0). We now assume those two quantities to be positive. Then, the property is equivalent to

a\left[t,\rho_{\alpha,\mu},R\right]=\sup_{\begin{subarray}{c}j^{\prime}\in J\\ g^{\flat}(j^{\prime})<0\end{subarray}}\frac{\mu_{t}^{2}g^{\flat}(j^{\prime})\tilde{R}^{(2)}}{\mu^{2}_{t}\mu^{-1}g^{\flat}(j^{\prime})\tilde{R}^{(2)}-g^{\sharp}(j^{\prime})\tilde{R}^{(1)}}\leqslant\frac{\mu^{2}_{t}g^{\flat}\tilde{R}^{(2)}}{\mu^{2}_{t}\mu^{-1}g^{\flat}\tilde{R}^{(2)}-g^{\sharp}\tilde{R}^{(1)}},

    which, after simplification, is equivalent to

\sup_{\begin{subarray}{c}j^{\prime}\in J\\ g^{\flat}(j^{\prime})<0\end{subarray}}\frac{g^{\sharp}(j^{\prime})}{g^{\flat}(j^{\prime})}\leqslant\frac{g^{\sharp}}{g^{\flat}},

which we now prove. Let j^{\prime}\in J be such that g^{\flat}(j^{\prime})<0. Then Property (i) from Lemma 5.5, applied to j and j^{\prime}, gives

\lambda g^{\flat}+(1-\lambda)g^{\sharp}\geqslant 0\quad\text{and}\quad\lambda g^{\flat}(j^{\prime})+(1-\lambda)g^{\sharp}(j^{\prime})\geqslant 0.

Multiplying the first inequality above by -g^{\flat}(j^{\prime})\geqslant 0 and the second by g^{\flat}>0, summing, and simplifying gives

\frac{g^{\sharp}(j^{\prime})}{g^{\flat}(j^{\prime})}\leqslant\frac{g^{\sharp}}{g^{\flat}}.

Hence the result. ∎

Remark 6.2.

If our goal were solely to prove Theorem 4.4, we would not need an explicit expression for a[t,\rho_{\alpha,\mu},R]. Indeed, later on, we will only rely on the fact that a[t,\rho_{\alpha,\mu},R] lies within [0,\mu] and satisfies (iii), so the mere existence of such an a[t,\rho_{\alpha,\mu},R] suffices. This existence is actually quite straightforward from Blackwell's dual condition (see Proposition 2.6), which would have spared us the tedious case distinctions made in the proof of Proposition 6.1. However, we chose to give a formula for a[t,\rho_{\alpha,\mu},R] because it allows us to derive an explicit expression for the O(\varepsilon)-uniformly optimal strategy that we construct subsequently.

7. The Resulting Strategy for Player I in the Absorbing Game

We now define a strategy σ=(σt)t1\sigma=(\sigma_{t})_{t\geqslant 1} for Player I in the absorbing game. For (i1,j1,i2,j2,)(I×J)(i_{1},j_{1},i_{2},j_{2},\dots)\in(I\times J)^{\mathbb{N}^{*}} and all t1t\geqslant 1, recursively define

\tilde{R}_{t-1}^{(1)} =\max\left(0,\ -\sum_{s=1}^{t-1}\alpha_{s}a_{s}g^{\sharp}(j_{s})\right)
\tilde{R}_{t-1}^{(2)} =\max\left(0,\ \sum_{s=1}^{t-1}\alpha_{s}(\mu_{s}^{-1}a_{s}-1)g^{\flat}(j_{s})\right)
\alpha_{t} =\prod_{s=1}^{t-1}\left(a_{s}q(\omega|x_{\lambda},j_{s})+(1-a_{s})q(\omega|x_{0},j_{s})\right)
a_{t} =\begin{cases}0&\text{if $\tilde{R}_{t-1}^{(1)}=\tilde{R}_{t-1}^{(2)}=0$}\\ \displaystyle\sup_{\begin{subarray}{c}j\in J\\ g^{\flat}(j)<0\end{subarray}}\frac{\mu^{2}_{t}g^{\flat}(j)\tilde{R}^{(2)}_{t-1}}{\mu_{t}g^{\flat}(j)\tilde{R}^{(2)}_{t-1}-g^{\sharp}(j)\tilde{R}^{(1)}_{t-1}}&\text{otherwise},\end{cases}

where an easy induction proves that ata_{t} is indeed a function of (i1,j1,,it1,jt1)(i_{1},j_{1},\dots,i_{t-1},j_{t-1}), so that we can define

\sigma_{t}(i_{1},j_{1},\dots,i_{t-1},j_{t-1})=a_{t}x_{\lambda}+(1-a_{t})x_{0}.
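To make the recursion above concrete, here is a minimal computational sketch (our own illustration, not part of the paper's formal development). The arrays gsharp, gflat, q_lam, q_0 and the parameter eps are hypothetical stand-ins for the quantities g^{\sharp}(j), g^{\flat}(j), q(\omega|x_{\lambda},j), q(\omega|x_{0},j) and \varepsilon defined earlier in the paper; the function returns the weights a_{t} of x_{\lambda} in the mixture \sigma_{t}=a_{t}x_{\lambda}+(1-a_{t})x_{0}, under the choice \mu_{t}=\varepsilon t^{-3/4} of Theorem 7.1.

def mixture_weights(j_history, gsharp, gflat, q_lam, q_0, eps):
    """Weights (a_1, ..., a_{T+1}) of x_lambda in sigma, given Player II's
    past actions j_1, ..., j_T (indices into gsharp/gflat)."""
    T = len(j_history)
    weights = []
    R1 = R2 = 0.0      # running sums; their positive parts are R~(1), R~(2)
    alpha = 1.0        # alpha_t = prod_{s<t} (a_s q_lam[j_s] + (1 - a_s) q_0[j_s])
    for t in range(1, T + 2):
        mu_t = eps * t ** (-0.75)                 # mu_t = eps * t^{-3/4}
        R1t, R2t = max(R1, 0.0), max(R2, 0.0)     # R~(1)_{t-1}, R~(2)_{t-1}
        if R1t == 0.0 and R2t == 0.0:
            a_t = 0.0
        else:
            # Supremum over actions j with gflat[j] < 0; Assumption 2 and
            # Lemma 5.5 (gflat[j] < 0 implies gsharp[j] > 0) keep the
            # denominators below negative, hence nonzero.
            a_t = max(
                mu_t ** 2 * gflat[j] * R2t
                / (mu_t * gflat[j] * R2t - gsharp[j] * R1t)
                for j in range(len(gflat)) if gflat[j] < 0
            )
        weights.append(a_t)
        if t <= T:                                # observe j_t, update the sums
            j = j_history[t - 1]
            R1 += -alpha * a_t * gsharp[j]
            R2 += alpha * (a_t / mu_t - 1.0) * gflat[j]
            alpha *= a_t * q_lam[j] + (1.0 - a_t) * q_0[j]
    return weights

# Hypothetical example (toy values, not from the paper):
# mixture_weights([2, 2, 0], gsharp=[0.3, -0.1, 0.5], gflat=[0.2, 0.4, -0.3],
#                 q_lam=[0.9, 0.8, 0.95], q_0=[1.0, 1.0, 1.0], eps=0.1)

By Proposition 6.1, each a_{t} produced this way lies in [0,\mu_{t}], so the mixture defining \sigma_{t} is a well-defined mixed action.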
Theorem 7.1.

If μt=εt3/4\mu_{t}=\varepsilon t^{-3/4} for all t1t\geqslant 1, the above strategy σ\sigma guarantees that for all τ𝒯\tau\in\mathcal{T} and T256λ4ε12T\geqslant 256\lambda^{-4}\varepsilon^{-12},

\gamma_{T}(\sigma,\tau)\geqslant-8\varepsilon.
Proof.

Thanks to Lemma 5.2, it is sufficient to prove the result for pure Markov strategies of Player II. Let \tau be such a strategy, and let (\omega_{1},i_{1},j_{1},\dots,\omega_{t},i_{t},j_{t},\dots)\sim\mathbb{P}_{\sigma,\tau}. For all t\geqslant 1, let \alpha_{t} and a_{t} be as defined above, and set

\rho_{t}=\rho_{\alpha_{t},\mu_{t}}\qquad\text{and}\qquad r_{t}=\rho_{t}(a_{t},j_{t})=\alpha_{t}\begin{pmatrix}-a_{t}g^{\sharp}(j_{t})\\ (\mu^{-1}_{t}a_{t}-1)\,g^{\flat}(j_{t})\end{pmatrix}.

Let us first establish a bound on t=1+rt(t)2\sum_{t=1}^{+\infty}\left\|r_{t}\right\|_{(t)}^{2}. For all t1t\geqslant 1, using Property (ii) from Proposition 6.1 as well as αt[0,1]\alpha_{t}\in[0,1], at[0,μt]a_{t}\in[0,\mu_{t}] and g(jt),g(jt)[1,1]g^{\flat}(j_{t}),\,g^{\sharp}(j_{t})\in[-1,1] by Assumption 1, we have

\|r_{t}\|^{2}_{(t)}\leqslant|\alpha_{t}|^{2}\big{(}a_{t}^{2}+\mu^{2}_{t}(\mu^{-1}_{t}a_{t}-1)^{2}\big{)}\leqslant 2a^{2}_{t}+\mu^{2}_{t}-2\mu_{t}a_{t}\leqslant 3\mu^{2}_{t}\leqslant 3\varepsilon^{2}t^{-3/2}.

Therefore

\sum_{t=1}^{+\infty}\|r_{t}\|^{2}_{(t)}\leqslant 3\varepsilon^{2}\sum_{t=1}^{+\infty}t^{-3/2}\leqslant 9\varepsilon^{2}.
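For completeness, the numerical bound on the series follows from the standard integral comparison:

\sum_{t=1}^{+\infty}t^{-3/2}\leqslant 1+\int_{1}^{+\infty}t^{-3/2}\,\mathrm{d}t=1+2=3.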

Then, combining the above definition of σ\sigma with Proposition 6.1, it holds that

a_{t}=a\left[t,\rho_{t},\sum_{s=1}^{t-1}r_{s}\right],\qquad t\geqslant 1,

and Corollary 3.3 gives for all t1t\geqslant 1,

(10) -\sum_{s=1}^{t-1}\alpha_{s}a_{s}\,g^{\sharp}(j_{s}) =\sum_{s=1}^{t-1}r_{s}^{(1)}\leqslant\sqrt{\sum_{s=1}^{t-1}\left\|r_{s}\right\|_{(s)}^{2}}\leqslant 3\varepsilon
(11) \frac{1}{T}\sum_{t=1}^{T}\alpha_{t}(\mu_{t}^{-1}a_{t}-1)\,g^{\flat}(j_{t}) =\frac{1}{T}\sum_{t=1}^{T}r_{t}^{(2)}\leqslant\frac{1}{\mu_{T}T}\sqrt{\sum_{t=1}^{T}\left\|r_{t}\right\|_{(t)}^{2}}\leqslant\frac{3\varepsilon}{\mu_{T}T}\leqslant 3\varepsilon,

where the last inequality holds as soon as T\geqslant\varepsilon^{-4}. Combining (10) (which requires the sum from s=1 to t-1 for every t\geqslant 1) and (11) (which only requires the sum from 1 to T) with Proposition 5.6 gives the result. ∎

8. Comparison with Other Proofs of Theorem 4.4

Let us compare our proof of Theorem 4.4 to three other proofs from the literature.

The Original Proof of Kohlberg [24] for Finite Action Sets

Kohlberg [24] proved Theorem 4.4 in the case where the action sets are finite. The proof uses a matrix theory approach [41] to reduce the problem to the case where Player I has only two actions, and then defines a strategy of Player I by making explicit the probability of playing each of the two actions as a function of the past history. Such a strategy is not a balanced strategy and is not built from discounted optimal strategies. Hence, both the strategy and its analysis differ from our work.

The Proof of Mertens et al. [39] for Compact Action Sets

The strategy built in Mertens et al. [39] is based on discounted optimal strategies. Indeed, at each stage, the strategy plays optimally in a discounted game, where the discount factor depends on the past history. Unlike in the present work, the discount factor can take infinitely many possible values, whereas in our case it takes only two values (0 and \lambda, if we interpret the limit strategy x_{0} as an optimal strategy for the discount factor 0). Moreover, the proof in Mertens et al. [39] relies on the sophisticated machinery of Mertens and Neyman [38], who proved the existence of the uniform value in general stochastic games. It also builds on the characterization of the limit value obtained by Rosenberg and Sorin [47]. In contrast, our proof is more self-contained, since it neither relies on the Mertens and Neyman strategy nor requires the existence of the limit value.

The Proof of Hansen et al. [20]

Hansen et al. [20] build an \varepsilon-uniformly optimal strategy with finite memory and a clock. At each stage, the mixed action played by such a strategy depends only on the state variable of a finite automaton. The state variable is updated from one stage to the next as a function of the previous action of Player II, and its transitions depend on the stage. The automaton has three states. In two of them, the mixed action is a limit of discounted optimal strategies (careful action), which is similar to our mixed action x_{0}. In the third state, the mixed action is a discounted optimal strategy for some small discount factor (risky action), in the same fashion as our mixed action x_{\lambda}. In particular, the strategy built in Hansen et al. [20] is a balanced strategy. One main difference with our strategy is that in Hansen et al. [20], the weights put on the risky and careful actions are induced by the probability distribution over the states of the automaton. This probability depends on the past history and on the transitions of the automaton, and the definition of these transitions is rather sophisticated. In contrast, the weights that our strategy puts on the risky action and on the careful action are derived from an approachability problem. This makes the analysis shorter, and it provides intuition on how the weights should be defined so that the strategy works. Moreover, the approachability framework that we develop is quite general and offers promising perspectives for generalizations to stochastic games. Another difference is that the strategy in Hansen et al. [20] is built in blocks of increasing time lengths, whereas our strategy does not require such a progressive construction.

9. Conclusion and Perspectives

We introduced an extension of Blackwell's approachability framework in which the outcome function and the inner product vary with time, and studied the corresponding Blackwell's algorithm. In the case where the target set is an orthant, we presented a choice of time-dependent inner products which yields a different convergence speed for each coordinate of the average outcome vector.

We applied the latter case to the construction of ε\varepsilon-uniformly optimal strategies in absorbing games, thereby proposing a novel application of online learning tools for solving games. We hope that the present work can be extended into a new systematic approach for constructing optimal strategies in a wider class of stochastic games.

We believe our framework will also find various applications in online learning and sequential decision problems. For instance, an interesting direction would be the definition of a hybrid between the following two regret minimization algorithms, which enjoy different adaptive properties: Regret Matching [22] (which is based on Blackwell's approachability) and its variants have led to great success in the context of solving games, while the adaptive diagonal scalings of AdaGrad-Diagonal [37, 11] (and its popular variants such as RMSprop and Adam) have demonstrated excellent performance in deep learning optimization and continuous (stochastic) optimization in general. As the adaptive scalings of AdaGrad-Diagonal can be written as time-dependent metrics, our framework should allow the definition of an algorithm that combines Regret Matching and (the dual averaging version of) AdaGrad-Diagonal, and hopefully inherits adaptive properties and excellent practical performance from both.
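As a purely speculative illustration of this direction (not a construction from the present paper), the sketch below runs Regret Matching with a diagonal, AdaGrad-style time-dependent inner product \left<x,y\right>_{(t)}=\sum_{i}w_{t,i}x_{i}y_{i}. One can check that a diagonal metric leaves the projection onto the negative orthant componentwise, so Blackwell's condition is met by playing each action i with probability proportional to w_{t,i}[R_{i}]_{+}, where R is the cumulative regret vector. The specific weights w_{t,i} and the toy payoff sequence are our own illustrative choices.

import numpy as np

def hybrid_regret_matching(payoffs, d, T):
    """payoffs: function t -> payoff vector u_t in [-1, 1]^d, revealed after playing.
    Returns the average (expected) regret vector, which should approach R_-^d."""
    R = np.zeros(d)              # cumulative regret vector
    sq = np.full(d, 1e-12)       # accumulated squared regret increments
    for t in range(1, T + 1):
        w = 1.0 / np.sqrt(sq)    # AdaGrad-Diagonal-style metric weights w_{t,i}
        scores = w * np.maximum(R, 0.0)
        s = scores.sum()
        x = scores / s if s > 0 else np.full(d, 1.0 / d)
        u = payoffs(t)
        r = u - x @ u            # instantaneous regret of each action against x
        R += r
        sq += r ** 2
    return R / T

# Example: an arbitrary oblivious payoff sequence over d = 3 actions.
print(hybrid_regret_matching(lambda t: np.cos(t * np.arange(3)), d=3, T=5000))

Whether such a scheme actually inherits the adaptive guarantees of AdaGrad-Diagonal is precisely the open question raised above.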

Further possible extensions of our framework include the adaptation to potential-based [22] and regret-based [1, 50, 26] algorithms.

Acknowledgments

The authors are grateful to Sylvain Sorin and Vianney Perchet for valuable discussions that helped improve this paper. This work was supported by the French Agence Nationale de la Recherche (ANR) under reference ANR-21-CE40-0020 (CONVERGENCE project), and by a public grant as part of the Investissement d’avenir project, reference ANR-11-LABX-0056-LMH, LabEx LMH.

References

  • Abernethy et al. [2011] J. Abernethy, P. L. Bartlett, and E. Hazan. Blackwell approachability and low-regret learning are equivalent. In JMLR: Workshop and Conference Proceedings (COLT), volume 19, pages 27–46, 2011.
  • Bellman [1957] R. Bellman. A Markovian decision process. Technical report, DTIC Document, 1957.
  • Bernstein and Shimkin [2015] A. Bernstein and N. Shimkin. Response-based approachability with applications to generalized no-regret problems. The Journal of Machine Learning Research, 16(1):747–773, 2015.
  • Bewley and Kohlberg [1976] T. Bewley and E. Kohlberg. The asymptotic theory of stochastic games. Mathematics of Operations Research, 1(3):197–208, 1976.
  • Blackwell [1954] D. Blackwell. Controlled random walks. In Proceedings of the International Congress of Mathematicians, volume 3, pages 336–338, 1954.
  • Blackwell [1956] D. Blackwell. An analog of the minimax theorem for vector payoffs. Pacific Journal of Mathematics, 6(1):1–8, 1956.
  • Blum and Mansour [2007] A. Blum and Y. Mansour. From external to internal regret. Journal of Machine Learning Research, 8(6), 2007.
  • Cesa-Bianchi and Lugosi [2006] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
  • Chzhen et al. [2021] E. Chzhen, C. Giraud, and G. Stoltz. A unified approach to fair online learning via Blackwell approachability. Advances in Neural Information Processing Systems, 34:18280–18292, 2021.
  • Dann et al. [2023] C. Dann, Y. Mansour, M. Mohri, J. Schneider, and B. Sivan. Pseudonorm approachability and applications to regret minimization. In International Conference on Algorithmic Learning Theory. PMLR, 2023.
  • Duchi et al. [2011] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
  • Dynkin and Yushkevich [1979] E. B. Dynkin and A. A. Yushkevich. Controlled Markov processes, volume 235. Springer, 1979.
  • Even-Dar et al. [2009] E. Even-Dar, R. Kleinberg, S. Mannor, and Y. Mansour. Online learning for global cost functions. In COLT, 2009.
  • Farina et al. [2021] G. Farina, C. Kroer, and T. Sandholm. Faster game solving via predictive Blackwell approachability: Connecting regret matching and mirror descent. Proceedings of the AAAI Conference on Artificial Intelligence, 35(6):5363–5371, 2021.
  • Flesch et al. [2018] J. Flesch, R. Laraki, and V. Perchet. Approachability of convex sets in generalized quitting games. Games and Economic Behavior, 108:411–431, 2018.
  • Freund and Schapire [1999] Y. Freund and R. E. Schapire. Adaptive game playing using multiplicative weights. Games and Economic Behavior, 29(1):79–103, 1999.
  • Gordon [2007] G. J. Gordon. No-regret algorithms for online convex programs. In Advances in Neural Information Processing Systems, pages 489–496, 2007.
  • Grand-Clément and Kroer [2022] J. Grand-Clément and C. Kroer. Solving optimization problems with Blackwell approachability. arXiv preprint arXiv:2202.12277, 2022.
  • Hannan [1957] J. Hannan. Approximation to Bayes risk in repeated play. Contributions to the Theory of Games, 3:97–139, 1957.
  • Hansen et al. [2021] K. A. Hansen, R. Ibsen-Jensen, and A. Neyman. Absorbing games with a clock and two bits of memory. Games and Economic Behavior, 128:213–230, 2021.
  • Hart and Mas-Colell [2000] S. Hart and A. Mas-Colell. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68:1127–1150, 2000.
  • Hart and Mas-Colell [2001] S. Hart and A. Mas-Colell. A general class of adaptive strategies. Journal of Economic Theory, 98(1):26–54, 2001.
  • Hou and Kumar [2009] I.-H. Hou and P. R. Kumar. Admission control and scheduling for QoS guarantees for variable-bit-rate applications on wireless channels. In Proceedings of the tenth ACM international symposium on Mobile ad hoc networking and computing, pages 175–184, 2009.
  • Kohlberg [1974] E. Kohlberg. Repeated games with absorbing states. The Annals of Statistics, pages 724–738, 1974.
  • Kohlberg [1975] E. Kohlberg. Optimal strategies in repeated games with incomplete information. International Journal of Game Theory, 4(1):7–24, 1975.
  • Kwon [2021] J. Kwon. Refined approachability algorithms and application to regret minimization with global costs. Journal of Machine Learning Research, 22:200–1, 2021.
  • Lee et al. [2021] D. Lee, G. Noarov, M. Pai, and A. Roth. Online minimax multiobjective optimization: Multicalibeating and other applications. In Advances in Neural Information Processing Systems, 2021.
  • Li et al. [2021] T. Li, G. Peng, and Q. Zhu. Blackwell online learning for Markov decision processes. In 2021 55th Annual Conference on Information Sciences and Systems (CISS), pages 1–6. IEEE, 2021.
  • Littman [1994] M. L. Littman. Markov games as a framework for multi-agent reinforcement learning. In Machine learning proceedings, pages 157–163. Elsevier, 1994.
  • Maitra and Parthasarathy [1970] A. Maitra and T. Parthasarathy. On stochastic games. Journal of Optimization Theory and Applications, 5(4):289–300, 1970.
  • Mannor and Shimkin [2004] S. Mannor and N. Shimkin. A geometric approach to multi-criterion reinforcement learning. The Journal of Machine Learning Research, 5:325–360, 2004.
  • Mannor and Shimkin [2008] S. Mannor and N. Shimkin. Regret minimization in repeated matrix games with variable stage duration. Games and Economic Behavior, 63(1):227–258, 2008.
  • Mannor and Tsitsiklis [2009] S. Mannor and J. N. Tsitsiklis. Approachability in repeated games: Computational aspects and a Stackelberg variant. Games and Economic Behavior, 66(1):315–325, 2009.
  • Mannor et al. [2011] S. Mannor, V. Perchet, and G. Stoltz. Robust approachability and regret minimization in games with partial monitoring. In JMLR: Workshop and Conference Proceedings (COLT), volume 19, pages 515–536, 2011.
  • Mannor et al. [2013] S. Mannor, V. Perchet, and G. Stoltz. A primal condition for approachability with partial monitoring. Journal of Dynamics and Games, 1(3):447–469, 2013.
  • Mannor et al. [2014] S. Mannor, V. Perchet, and G. Stoltz. Set-valued approachability and online learning with partial monitoring. The Journal of Machine Learning Research, 15(1):3247–3295, 2014.
  • McMahan and Streeter [2010] H. B. McMahan and M. Streeter. Adaptive bound optimization for online convex optimization. In Proceedings of the 23rd Conference on Learning Theory (COLT), page 244, 2010.
  • Mertens and Neyman [1981] J.-F. Mertens and A. Neyman. Stochastic games. International Journal of Game Theory, 10:53–66, 1981.
  • Mertens et al. [2009] J.-F. Mertens, A. Neyman, and D. Rosenberg. Absorbing games with compact action spaces. Mathematics of Operations Research, 34(2):257–262, 2009.
  • Mertens et al. [2015] J.-F. Mertens, S. Sorin, and S. Zamir. Repeated games. Cambridge University Press, 2015.
  • Mills [1956] H. D. Mills. Marginal values of matrix games and linear programs. Linear inequalities and related systems, 38:183–193, 1956.
  • Miryoosefi et al. [2019] S. Miryoosefi, K. Brantley, H. Daumé III, M. Dudik, and R. E. Schapire. Reinforcement learning with convex constraints. Advances in Neural Information Processing Systems, 32, 2019.
  • Perchet [2011] V. Perchet. Approachability of convex sets in games with partial monitoring. Journal of Optimization Theory and Applications, 149(3):665–677, 2011.
  • Perchet [2014] V. Perchet. Approachability, regret and calibration: Implications and equivalences. Journal of Dynamics and Games, 1(2):181–254, 2014.
  • Ragel [2023] T. Ragel. Weak approachability of convex sets in absorbing games. Mathematics of Operations Research, 2023.
  • Raja and Grammatico [2020] A. A. Raja and S. Grammatico. On the approachability principle for distributed payoff allocation in coalitional games. IFAC-PapersOnLine, 53(2):2690–2695, 2020.
  • Rosenberg and Sorin [2001] D. Rosenberg and S. Sorin. An operator approach to zero-sum repeated games. Israel Journal of Mathematics, 121(1):221–246, 2001.
  • Samuel [1959] A. L. Samuel. Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 3(3):210–229, 1959.
  • Shapley [1953] L. S. Shapley. Stochastic games. Proceedings of the National Academy of Sciences, 39(10):1095–1100, 1953.
  • Shimkin [2016] N. Shimkin. An online convex optimization approach to Blackwell’s approachability. The Journal of Machine Learning Research, 17(1):4434–4456, 2016.
  • Solan [2022] E. Solan. A course in stochastic game theory, volume 103. Cambridge University Press, 2022.
  • Solan and Ziliotto [2016] E. Solan and B. Ziliotto. Stochastic games with signals. Advances in Dynamic and Evolutionary Games: Theory, Applications, and Numerical Methods, pages 77–94, 2016.
  • Sorin [2002] S. Sorin. A first course on zero-sum repeated games, volume 37. Springer Science & Business Media, 2002.
  • Stoltz and Lugosi [2005] G. Stoltz and G. Lugosi. Internal regret in on-line portfolio selection. Machine Learning, 59(1-2):125–159, 2005.
  • Tammelin et al. [2015] O. Tammelin, N. Burch, M. Johanson, and M. Bowling. Solving heads-up limit Texas hold’em. In Twenty-fourth international joint conference on artificial intelligence, 2015.
  • Vigeral [2013] G. Vigeral. A zero-sum stochastic game with compact action sets and no asymptotic value. Dynamic Games and Applications, 3(2):172–186, 2013.
  • Yu et al. [2021] T. Yu, Y. Tian, J. Zhang, and S. Sra. Provably efficient algorithms for multi-objective competitive RL. In International Conference on Machine Learning, pages 12167–12176. PMLR, 2021.
  • Zhong et al. [2018] Y. Zhong, Z. Zheng, M. C. Chou, and C.-P. Teo. Resource pooling and allocation policies to deliver differentiated service. Management Science, 64(4):1555–1573, 2018.
  • Ziliotto [2016] B. Ziliotto. Zero-sum repeated games: Counterexamples to the existence of the asymptotic value and the conjecture maxmin=limvn\operatorname{maxmin}=\operatorname{lim}v_{n}. The Annals of Probability, 44(2):1107–1133, 2016.
  • Zinkevich et al. [2007] M. Zinkevich, M. Johanson, M. Bowling, and C. Piccione. Regret minimization in games with incomplete information. Advances in neural information processing systems, 20:1729–1736, 2007.

Appendix A On General Closed Convex Targets

We here present one possible approach for reducing general closed convex targets to the conic case, which is inspired by [1]. Let us consider the assumptions from Proposition 2.6, but with \mathcal{C}\subset\mathbb{R}^{d} being a general nonempty closed convex set, not necessarily a cone. Further assume that \mathcal{C} satisfies the following Blackwell's condition: for all t\geqslant 1, \rho\in\mathcal{R} and r\in\mathbb{R}^{d}, there exists a\left[t,\rho,r\right]\in\mathcal{A} such that for all b\in\mathcal{B},

\left<\rho\left(a\left[t,\rho,r\right],b\right)-\pi_{(t)}^{\mathcal{C}}(r),r-\pi_{(t)}^{\mathcal{C}}(r)\right>_{(t)}\leqslant 0.

Then, by a straightforward adaptation of e.g. [44, Theorem 1.3], one can see that the above is equivalent to the same condition as in Proposition 2.6:

\forall\rho\in\mathcal{R},\ \forall b\in\mathcal{B},\ \exists a\in\mathcal{A},\ \rho(a,b)\in\mathcal{C}.

Now consider an auxiliary approachability problem with outcome functions ρ~t:𝒜×d+1\tilde{\rho}_{t}:\mathcal{A}\times\mathcal{B}\to\mathbb{R}^{d+1}, defined as

\tilde{\rho}_{t}(a,b)=(\rho_{t}(a,b),1),\qquad a\in\mathcal{A},\ b\in\mathcal{B},\ t\geqslant 1,

and target \tilde{\mathcal{C}}=\mathbb{R}_{+}(\mathcal{C}\times\left\{1\right\}), which is a closed convex cone. Then, for (a,b)\in\mathcal{A}\times\mathcal{B} and t\geqslant 1, \rho_{t}(a,b)\in\mathcal{C} if and only if \tilde{\rho}_{t}(a,b)\in\tilde{\mathcal{C}}. This shows that the target of the auxiliary approachability problem satisfies the dual condition and is therefore approachable. The results from Sections 2 and 3 thus apply. Besides, Abernethy et al. [1, Lemma 14] ensures that the distance of the average outcome to \mathcal{C} in the original problem differs from the distance of the average outcome to \tilde{\mathcal{C}} in the auxiliary problem by at most a factor of 2. Therefore, the convergence bounds from the auxiliary problem are easily transposed to the original problem.
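As a small numerical illustration of this reduction (our own example, not taken from [1]), take \mathcal{C} to be the closed Euclidean unit ball of \mathbb{R}^{d}; the lifted target \tilde{\mathcal{C}}=\mathbb{R}_{+}(\mathcal{C}\times\left\{1\right\}) is then the second-order cone \{(x,s):\left\|x\right\|\leqslant s\}, whose Euclidean projection has a closed form. The sketch below lifts a point r to (r,1) and compares its distance to \mathcal{C} with the distance of the lifted point to \tilde{\mathcal{C}}.

import numpy as np

def proj_second_order_cone(x, s):
    """Euclidean projection of (x, s) onto the cone {(x, s) : ||x||_2 <= s}."""
    nx = np.linalg.norm(x)
    if nx <= s:
        return x, s
    if nx <= -s:
        return np.zeros_like(x), 0.0
    t = (nx + s) / 2.0
    return t * x / nx, t

rng = np.random.default_rng(0)
r = 3.0 * rng.standard_normal(4)                  # an "average outcome" in R^4
dist_C = max(0.0, np.linalg.norm(r) - 1.0)        # distance of r to the unit ball C
px, ps = proj_second_order_cone(r, 1.0)           # project the lifted point (r, 1)
dist_cone = np.linalg.norm(np.append(r - px, 1.0 - ps))
print(dist_C, dist_cone, dist_C <= 2.0 * dist_cone)   # factor-2 comparison

In this particular example the printed check always succeeds, illustrating the factor-2 comparison mentioned above.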

Appendix B Postponed Proofs

B.1. Proof of Lemma 5.2

To prove the lemma, it is enough to build, for each T\geqslant 1, a pure Markov strategy \tau^{*} such that \gamma_{T}(\sigma,\tau^{*})=\min_{\tau\in\mathcal{T}}\gamma_{T}(\sigma,\tau). Let us take the point of view of Player II, who aims at minimizing the payoff in the T-stage game against the strategy \sigma. This problem can be viewed as a Markov Decision Process (1-player stochastic game) [2], where the decision-maker is Player II and the state space is the product \Omega\times\cup_{t\geqslant 1}J^{t-1}. The first component of the state represents the state of the absorbing game (which we call the original state), while the second component encodes the sequence of actions played by Player II in previous stages. Such a product variable is enough to describe the problem faced by Player II, because the mixed action played by Player I at each stage depends only on past actions of Player II. This MDP admits a pure optimal Markov strategy [12], that is, a strategy that, at each stage t, picks an action that depends only on the original state \omega_{t} and the sequence of past actions (j_{1},j_{2},\dots,j_{t-1}). The actions of Player II only have an influence when the original state is the non-absorbing state \omega, hence such a strategy is equivalent to a sequence of actions of Player II, that is, a pure Markov strategy in the absorbing game. This proves the lemma.

B.2. Proof of Lemma 5.3

Let t2t\geqslant 2. If [ωt1=ω]=0\mathbb{P}\left[\omega_{t-1}=\omega\right]=0, then [ωt=ω]=0\mathbb{P}\left[\omega_{t}=\omega\right]=0 by definition of the model. Otherwise we can write,

\mathbb{P}\left[\omega_{t}=\omega\right] =\mathbb{P}\left[\omega_{t}=\omega\text{ and }\omega_{t-1}=\omega\right]
    =\mathbb{P}\left[\omega_{t}=\omega\,\middle|\,\omega_{t-1}=\omega\right]\times\mathbb{P}\left[\omega_{t-1}=\omega\right]
    =\left(\int_{I}q(\omega|i,j_{t-1})\,\mathrm{d}(a_{t-1}x_{\lambda}+(1-a_{t-1})x_{0})(i)\right)\times\mathbb{P}\left[\omega_{t-1}=\omega\right]
    =q(\omega|a_{t-1}x_{\lambda}+(1-a_{t-1})x_{0},j_{t-1})\times\mathbb{P}\left[\omega_{t-1}=\omega\right]
    =\left(a_{t-1}\,q(\omega|x_{\lambda},j_{t-1})+(1-a_{t-1})q(\omega|x_{0},j_{t-1})\right)\times\mathbb{P}\left[\omega_{t-1}=\omega\right],

by linearity. In any case, the identity

\mathbb{P}\left[\omega_{t}=\omega\right]=\left(a_{t-1}\,q(\omega|x_{\lambda},j_{t-1})+(1-a_{t-1})q(\omega|x_{0},j_{t-1})\right)\times\mathbb{P}\left[\omega_{t-1}=\omega\right]

holds, and the result follows from a simple induction.