Hierarchical Deep Counterfactual Regret Minimization
Abstract
Imperfect Information Games (IIGs) offer robust models for scenarios where decision-makers face uncertainty or lack complete information. Counterfactual Regret Minimization (CFR) has been one of the most successful families of algorithms for tackling IIGs. The integration of skill-based strategy learning with CFR could potentially mirror a more human-like decision-making process and enhance the learning performance for complex IIGs. It enables the learning of a hierarchical strategy, wherein low-level components represent skills for solving subgames and the high-level component manages the transition between skills. In this paper, we introduce the first hierarchical version of Deep CFR (HDCFR), an innovative method that boosts learning efficiency in tasks involving extensively large state spaces and deep game trees. A notable advantage of HDCFR over previous works is its ability to facilitate learning with predefined (human) expertise and foster the acquisition of skills that can be transferred to similar tasks. To achieve this, we first construct our algorithm in a tabular setting, encompassing hierarchical CFR updating rules and a variance-reduced Monte Carlo sampling extension. Notably, we offer theoretical justifications, including the convergence rate of the proposed updating rule, the unbiasedness of the Monte Carlo regret estimator, and ideal criteria for effective variance reduction. Then, we employ neural networks as function approximators and develop deep learning objectives to adapt our proposed algorithms for large-scale tasks, while maintaining the theoretical support.
1 Introduction
Imperfect Information Games (IIGs) can be used to model various application domains where decision-makers have incomplete or uncertain information about the state of the environment, such as auctions (Noe et al. (2012)), diplomacy (Bakhtin et al. (2022)), cybersecurity (Kakkad et al. (2019)), etc. As one of the most successful families of algorithms for IIGs, variants of tabular Counterfactual Regret Minimization (CFR) (Zinkevich et al. (2007)) have been employed in all recent milestones of Poker AI, which serves as a quintessential benchmark for IIGs (Bowling et al. (2015); Moravčík et al. (2017); Brown and Sandholm (2018)). However, implementing tabular CFR in domains characterized by an exceedingly large state space necessitates the use of abstraction techniques that group similar states together (Ganzfried and Sandholm (2014a); Sandholm (2015)), which requires extensive domain-specific expertise. To address this challenge, researchers have proposed deep learning extensions of CFR (Brown et al. (2019); Li et al. (2020); Steinberger et al. (2020)), which leverage neural networks as function approximators, enabling generalization across the state space.
On the other hand, professionals in a field typically possess robust domain-specific skills, which they can employ to compose comprehensive strategies for tackling diverse and intricate task scenarios. Therefore, integrating skill-based strategy learning with CFR has the potential to enable human-like decision-making and enhance the learning performance for complex tasks with extended decision horizons, which is still an open problem. To accomplish this, the agent needs to learn a hierarchical strategy, in which the low-level components represent specific skills, and the high-level component coordinates the transition among skills. Notably, this is akin to the option framework (Sutton et al. (1999)) proposed in the context of reinforcement learning (RL), which enables learning or planning at multiple levels of temporal abstraction. Further, it’s worth noting that a hierarchical strategy is more interpretable, allowing humans to identify specific subcases where AI agents struggle. Targeted improvements can then be made by injecting critical skills that are defined by experts or learned through well-developed subgame-solving techniques (Moravcik et al. (2016); Brown and Sandholm (2017); Brown et al. (2018)). Also, skills acquired in one task, being more adaptable than the overarching strategy, can potentially be transferred to similar tasks to improve learning in new IIGs.
In this paper, we introduce the first hierarchical extension of Deep CFR (HDCFR), a novel approach that significantly enhances learning efficiency in tasks with exceptionally large state spaces and deep game trees and enables learning with transferred knowledge. To achieve this, we establish the theoretical foundations of our algorithm in the tabular setting, drawing inspiration from vanilla CFR (Zinkevich et al. (2007)) and Variance-Reduced Monte Carlo CFR (VR-MCCFR) (Davis et al. (2020)). Then, building on these results, we introduce deep learning objectives to ensure the scalability of HDCFR. In particular, our contributions are as follows. (1) We propose to learn a hierarchical strategy for each player, which contains low-level strategies to encode skills (represented as sequences of primitive actions) and a high-level strategy for skill selection. We provide formal definitions for the hierarchical strategy within the IIG model, and derive extended CFR updating rules for strategy learning (i.e., HCFR) with convergence guarantees. (2) Vanilla CFR requires a perfect game tree model and a full traversal of the game tree in each training iteration, which limits its use, especially for large-scale tasks. Thus, we propose a sample-based, model-free extension of HCFR, for which the key elements include unbiased Monte Carlo estimators of counterfactual regrets and a hierarchical baseline function for effective variance reduction. Note that controlling sample variance is vital for tasks with extended decision horizons, which our algorithm targets. Theoretical justifications are provided for each element of our design. (3) We present HDCFR, where the hierarchical strategy, regret, and baseline are approximated with Neural Networks, and the training objectives are demonstrated to be equivalent to those proposed in the tabular setting, i.e., (1) and (2), when optimality is achieved, thereby preserving the theoretical results while enjoying scalability.
2 Background
This section presents the background of our work, which includes two key concepts: Counterfactual Regret Minimization (CFR) and the option framework.
2.1 Counterfactual Regret Minimization
First, we introduce the extensive game model with imperfect information (Osborne and Rubinstein (1994)). In an extensive game, players make sequential moves represented by a game tree. At each non-terminal state, the player in control chooses from a set of available actions. At each terminal state, each player receives a payoff. In the presence of imperfect information, a player may not know which state they are in. For instance, in a poker game, a player sees its own cards and all cards laid on the table but not the opponents’ hands. Therefore, at each time step, each player makes decisions based on an information set – a collection of states that the controlling player cannot distinguish. Formally, the extensive game model can be represented by a tuple $\langle N, H, A, P, \sigma_c, u, \{\mathcal{I}_i\} \rangle$. $N$ is a finite set of players. $H$ is a set of histories, where each history $h \in H$ is a sequence of actions of all players from the start of the game and corresponds to a game state. For $h, h' \in H$, we write $h \sqsubseteq h'$ if $h$ is a prefix of $h'$. The set of actions available at $h$ is denoted as $A(h)$. Suppose $a \in A(h)$, then $ha$ is a successor history of $h$. Histories with no successors are terminal histories $Z \subseteq H$. $P: H \backslash Z \rightarrow N \cup \{c\}$ maps each non-terminal history to the player that chooses the next action, where $c$ is the chance player that acts according to a predefined distribution $\sigma_c$. This chance player represents the environment’s inherent randomness, such as using a dice roll to decide the starting player. The utility function $u$ assigns a payoff for every player at each terminal history. For a player $i \in N$, $\mathcal{I}_i$ is a partition of $\{h \in H: P(h) = i\}$ and each element $I \in \mathcal{I}_i$ is an information set as introduced above. $I$ also represents the observable information for player $i$ shared by all histories $h \in I$. Due to the indistinguishability, we have $A(h) = A(I)$ for all $h \in I$. Notably, our work focuses on the two-player zero-sum setting, where $N = \{1, 2\}$ and $u_1 = -u_2$, like previous works on CFR (Zinkevich et al. (2007); Brown et al. (2019); Davis et al. (2020)).
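To make these components concrete, the following minimal Python sketch represents an extensive game as plain data structures. All names (`ExtensiveGame`, `info_set_key`, etc.) are illustrative assumptions of ours and not taken from the paper's implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Action = str
History = Tuple[Action, ...]   # a sequence of actions taken from the root

@dataclass
class ExtensiveGame:
    """Illustrative container for the components of an extensive game."""
    players: List[int]                                        # N, e.g. [1, 2]
    actions: Callable[[History], List[Action]]                # A(h)
    acting_player: Callable[[History], int]                   # P(h); 0 denotes the chance player
    chance_policy: Callable[[History], Dict[Action, float]]   # sigma_c at chance nodes
    utility: Callable[[History, int], float]                  # u_i(z) at terminal histories
    info_set_key: Callable[[History, int], str]               # identifier of I(h) for player i
    is_terminal: Callable[[History], bool]

    def successors(self, h: History) -> List[History]:
        """All child histories obtained by appending one available action to h."""
        return [h + (a,) for a in self.actions(h)]
```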
Every player $i \in N$ selects actions according to a strategy $\sigma_i$ that maps each information set $I \in \mathcal{I}_i$ to a distribution over actions in $A(I)$. Note that $\sigma_i(h) = \sigma_i(I)$ for all $h \in I$. The learning target of CFR is a Nash Equilibrium (NE) strategy profile $\sigma^* = (\sigma_1^*, \sigma_2^*)$, where no player has an incentive to deviate from their specified strategy. That is, $u_i(\sigma^*) \geq \max_{\sigma_i'} u_i(\sigma_i', \sigma_{-i}^*)$ for all $i \in N$, where $-i$ represents the players other than $i$, and $u_i(\sigma)$ is the expected payoff to player $i$ of the strategy profile $\sigma$, defined as follows:
(1)   $u_i(\sigma) = \sum_{h \in Z} \pi^{\sigma}(h)\, u_i(h)$
$I(h)$ denotes the information set containing $h$, and $\pi^{\sigma}(h)$ is the reach probability of $h$ when employing $\sigma$. $\pi^{\sigma}(h)$ can be decomposed as $\pi^{\sigma}(h) = \prod_{i \in N \cup \{c\}} \pi_i^{\sigma}(h)$, where $\pi_i^{\sigma}(h)$ is the product of the action probabilities contributed by player $i$ (or the chance player $c$) along $h$. In addition, $\pi^{\sigma}(I) = \sum_{h \in I} \pi^{\sigma}(h)$ represents the reach probability of the information set $I$.
CFR, proposed in Zinkevich et al. (2007), is an iterative algorithm which accumulates the counterfactual regret $R_i^T(I, a)$ for each player $i$ at each information set $I \in \mathcal{I}_i$. This regret informs the strategy determination. $R_i^T(I, a)$ is defined as follows:
(2)   $R_i^T(I, a) = \frac{1}{T} \sum_{t=1}^{T} \big( v_i(\sigma^t|_{I \rightarrow a}, I) - v_i(\sigma^t, I) \big), \qquad v_i(\sigma, I) = \sum_{h \in I} \sum_{h' \in Z} \pi_{-i}^{\sigma}(h)\, \pi^{\sigma}(h, h')\, u_i(h')$
where $\sigma^t$ is the strategy profile at iteration $t$, $\sigma^t|_{I \rightarrow a}$ is identical to $\sigma^t$ except that the player always chooses the action $a$ at $I$, and $\pi^{\sigma}(h, h')$ denotes the reach probability from $h$ to $h'$, which equals $\pi^{\sigma}(h') / \pi^{\sigma}(h)$ if $h \sqsubseteq h'$ and 0 otherwise. Intuitively, $R_i^T(I, a)$ represents the expected regret of not choosing action $a$ at $I$. With $R_i^T(I, a)$, the next strategy profile is acquired with regret matching (Abernethy et al. (2011)), which sets probabilities proportional to the positive regrets: $\sigma_i^{T+1}(I)(a) = \frac{[R_i^T(I, a)]^+}{\sum_{a' \in A(I)} [R_i^T(I, a')]^+}$, where $[x]^+ = \max(x, 0)$ (a uniform distribution is used when the denominator is 0). Defining the average strategy $\bar{\sigma}_i^T$ such that $\bar{\sigma}_i^T(I)(a) = \frac{\sum_{t=1}^{T} \pi_i^{\sigma^t}(I)\, \sigma_i^t(I)(a)}{\sum_{t=1}^{T} \pi_i^{\sigma^t}(I)}$, CFR guarantees that the strategy profile $\bar{\sigma}^T = (\bar{\sigma}_1^T, \bar{\sigma}_2^T)$ converges to a Nash Equilibrium as $T \rightarrow \infty$.
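As a concrete illustration of the update just described, the sketch below implements regret matching and the reach-weighted average-strategy accumulation for a single information set. Variable names are ours; the fallback to a uniform strategy when no regret is positive follows the standard convention.

```python
import numpy as np

def regret_matching(cumulative_regret: np.ndarray) -> np.ndarray:
    """Return a strategy proportional to positive cumulative regrets.

    Falls back to the uniform strategy when no action has positive regret.
    """
    positive = np.maximum(cumulative_regret, 0.0)
    total = positive.sum()
    if total > 0.0:
        return positive / total
    return np.full_like(cumulative_regret, 1.0 / len(cumulative_regret))

def update_average(avg_numerator: np.ndarray, reach_i: float,
                   sigma_t: np.ndarray) -> np.ndarray:
    """Accumulate the numerator of the average strategy: each iteration's
    strategy sigma_t(I) is weighted by the player's own reach probability."""
    return avg_numerator + reach_i * sigma_t

# Example: three actions at one information set.
R = np.array([2.0, -1.0, 0.5])       # cumulative counterfactual regrets
print(regret_matching(R))            # -> [0.8, 0.0, 0.2]
```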
2.2 The Option Framework
As proposed in Sutton et al. (1999), an option $z \in \mathcal{Z}$ can be described with three components: an initiation set $\mathcal{I}_z \subseteq \mathcal{S}$, an intra-option policy $\pi_z$, and a termination function $\beta_z: \mathcal{S} \rightarrow [0, 1]$. Here, $\mathcal{S}$, $\mathcal{A}$, $\mathcal{Z}$ represent the state, action, and option space, respectively. An option $z$ is available in state $s$ if and only if $s \in \mathcal{I}_z$. Once the option is taken, actions are selected according to $\pi_z$ until it terminates stochastically according to $\beta_z$, i.e., the termination probability at the current state. A new option will be activated by a high-level policy once the previous option terminates. In this way, the high-level policy and the intra-option policies constitute a hierarchical policy for a certain task. Hierarchical policies tend to have superior performance on complex long-horizon tasks, which can be broken down into and processed as a series of subtasks.
The one-step option framework (Li et al. (2021a)) is proposed to learn the hierarchical policy without the extra need to justify the exact beginning and breaking condition of each option, i.e., $\mathcal{I}_z$ and $\beta_z$. First, it assumes that each option is available at each state, i.e., $\mathcal{I}_z = \mathcal{S}$. Second, it redefines the high-level and low-level policies as $\pi_h(z \mid s, z')$ ($z'$: the option of the previous timestep) and $\pi_l(a \mid s, z)$, respectively, and implements them as end-to-end neural networks. In particular, the Multi-Head Attention (MHA) mechanism (Vaswani et al. (2017)) is adopted in $\pi_h$, which enables it to temporally extend options in the absence of the termination function $\beta_z$. Intuitively, if $z'$ still fits the current state $s$, $\pi_h$ will assign a larger attention weight to $z'$ and thus tends to continue with it; otherwise, a new option with better compatibility will be sampled. Then, the option is sampled at each timestep rather than only after the previous option terminates. With this simplified framework, we only need to train the hierarchical policy, i.e., $\pi_h$ and $\pi_l$.
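A minimal PyTorch sketch of such a high-level policy is given below. The network sizes, the use of the attention weights directly as the option distribution, and all variable names are assumptions of ours rather than the exact architecture of Li et al. (2021a).

```python
import torch
import torch.nn as nn

class HighLevelPolicy(nn.Module):
    """Sketch of a one-step-option high-level policy pi_h(z | s, z_prev).

    Attention weights over learned option embeddings serve as the distribution
    over options; a previous option that still matches the current state tends
    to receive a large weight, which yields temporal extension of options.
    """
    def __init__(self, state_dim: int, num_options: int, embed_dim: int = 64):
        super().__init__()
        self.option_embed = nn.Embedding(num_options, embed_dim)
        self.query_net = nn.Linear(state_dim + embed_dim, embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads=4, batch_first=True)

    def forward(self, state: torch.Tensor, prev_option: torch.Tensor) -> torch.Tensor:
        # state: (B, state_dim); prev_option: (B,) integer option indices
        query = self.query_net(torch.cat([state, self.option_embed(prev_option)], dim=-1))
        keys = self.option_embed.weight.unsqueeze(0).expand(state.shape[0], -1, -1)
        _, weights = self.attn(query.unsqueeze(1), keys, keys, need_weights=True)
        return weights.squeeze(1)     # (B, num_options); each row sums to 1

# The option is (re)sampled at every timestep in the one-step option framework:
policy = HighLevelPolicy(state_dim=10, num_options=5)
probs = policy(torch.randn(2, 10), torch.tensor([3, 1]))
z = torch.multinomial(probs, num_samples=1)
```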
The option framework is proposed within the realm of RL as opposed to CFR; however, these two fields are closely related. The authors of (Srinivasan et al. (2018); Fu et al. (2022)) propose actor-critic algorithms for multi-agent adversarial games with partial observability and show that they are indeed a form of MCCFR for IIGs. This insight inspires our adoption of the one-step option framework to create a hierarchical extension for CFR.
3 Methodology
In this work, we aim to extend CFR to learn a hierarchical strategy in the form of Neural Networks (NNs) to solve IIGs with extensive state spaces and deep game trees. The high-level and low-level strategies serve distinct roles in the learning system, where low-level components represent various skills composed of primitive actions, and the high-level component orchestrates their utilization; thus, they should be defined and learned as different functions. In the absence of prior research on hierarchical extensions of CFR, we establish our work’s theoretical foundations by drawing upon tabular CFR algorithms. Firstly, we define the hierarchical strategy and hierarchical counterfactual regret, and provide corresponding updating rules along with the convergence guarantee. Subsequently, we propose that an unbiased estimation of the hierarchical counterfactual regret can be achieved through Monte Carlo sampling (Lanctot et al. (2009)) and that the sample variance can be reduced by introducing a hierarchical baseline function. This low-variance Monte Carlo sampling extension enables our algorithm to tackle domains with vast or unknown game trees (i.e., the model-free setting) - where standard CFR traversal is impractical - without compromising the convergence rate. Finally, with the theoretical foundations established in the tabular setting, we develop our algorithm, HDCFR, by approximating these hierarchical functions using NNs and training them with novel objective functions. These training objectives are demonstrated to be consistent with the updating rules in the tabular case when optimality is achieved, thereby maintaining the theoretical support.
3.1 Preliminaries
At a game state , the player makes its -th decision by selecting a hierarchical action , i.e., the option (a.k.a., skill) and primitive action, based on the observable information for player at , including the private observations and decision sequence of player , and the public information for all players (defined by the game). All histories that share the same observable information are considered indistinguishable to player and belong to the same information set . Thus, in this work, we also use to denote observations upon which player makes decisions. With the hierarchical actions, we can redefine the extensive game model as . Here, , , , and retain the definitions in Section 2.1. includes all the possible histories, each of which is a sequence of hierarchical actions of all players starting from the first time step. , where and represent the options and primitive actions available at respectively. , where is the predefined distribution in the original game model and (a dummy variable) is the only option choice for the chance player.
The learning target for player is a hierarchical strategy , which, by the chain rule, can be decomposed as . Note that although includes , we follow the conditional independence assumption of the one-step option framework (Li et al. (2021a); Zhang and Whiteson (2019)) which states that and , thus only () is used for () to determine (). With the hierarchical strategy, we can redefine the expected payoff and reach probability in Equation (1) by simply substituting with , based on which we have the definition of the average overall regret of player at iteration : (From this point forward, refers to a certain learning iteration rather than a time step within an iteration.)
(3)   $R_i^T = \frac{1}{T} \max_{\sigma_i'} \sum_{t=1}^{T} \big( u_i(\sigma_i', \sigma_{-i}^t) - u_i(\sigma^t) \big)$
The following theorem (Theorem 2 from Zinkevich et al. (2007)) provides a connection between the average overall regret and the Nash Equilibrium solution.
Theorem 1
In a two-player zero-sum game at time , if both players’ average overall regret is less than , then is a -Nash Equilibrium.
Here, the average strategy is defined as ():
(4)
An -Nash Equilibrium approximates a Nash Equilibrium, with the property that . Thus, measures the distance of to the Nash Equilibrium in expected payoff. Then, according to Theorem 1, as (), converges to NE. Notably, Theorem 1 can be applied directly to our hierarchical setting, as the only difference from the original setting related to Theorem 1 is the replacement of with in and . This difference can be viewed as employing a new action space (i.e., ) and is independent of using the option framework (i.e., the hierarchical extension).
3.2 Hierarchical Counterfactual Regret Minimization
One straightforward way to learn a hierarchical strategy is to view as a unified strategy defined on a new action set , and then apply CFR directly to learn it. However, this approach does not allow for the explicit separation and utilization of the high-level and low-level components, such as extracting and reusing skills (i.e., low-level parts) or initializing them with human knowledge. In this section, we treat and as distinct functions and introduce Hierarchical CFR (HCFR) to separately learn and . Additionally, we provide the convergence guarantee for HCFR.
Taking inspiration from CFR (Zinkevich et al. (2007)), we derive an upper bound for the average overall regret , which is given by the sum of high-level and low-level counterfactual regrets at each information set, namely and . In this way, we can minimize and for each individual independently by adjusting and respectively, and in doing so, minimize the average overall regret. The learning of the high-level and low-level strategy is also decoupled.
Theorem 2
With the following definitions of high-level and low-level counterfactual regrets:
(5)
we have .
Here, , is the expected payoff for choosing option at , is a hierarchical strategy profile identical to except that the intra-option (i.e., low-level) strategy of option at is always choosing . Detailed proof of Theorem 2 is available in Appendix A.
After obtaining and , we can compute the high-level and low-level strategies for the next iteration as follows: ()
(6)
In this way, the counterfactual regrets and strategies are computed alternately with Equation (5) and (6) for $T$ iterations until convergence. The convergence rate of this algorithm is presented in the following theorem:
Theorem 3
If player selects options and actions according to Equation (6), then , where , is the number of information sets for player , , .
Thus, as $T \rightarrow \infty$, $R_i^T \rightarrow 0$. Additionally, the convergence rate is $O(1/\sqrt{T})$, which is the same as CFR (Zinkevich et al. (2007)). Thus, the introduction of the option framework does not compromise the convergence guarantee, while allowing skill-based strategy learning. The proof of Theorem 3 is provided in Appendix B.
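To make the update in Equation (6) concrete, the following tabular sketch applies regret matching at both levels of a single information set; the array layout used to store the regrets is an assumption of ours.

```python
import numpy as np

def hierarchical_regret_matching(high_regret: np.ndarray, low_regret: np.ndarray):
    """Compute next-iteration strategies from hierarchical counterfactual regrets.

    high_regret: shape (num_options,)             -- high-level regrets per option
    low_regret:  shape (num_options, num_actions) -- low-level regrets per (option, action)
    Returns the high-level strategy over options and one low-level strategy over
    actions per option, each obtained by regret matching (uniform fallback).
    """
    def match(r: np.ndarray) -> np.ndarray:
        pos = np.maximum(r, 0.0)
        s = pos.sum(axis=-1, keepdims=True)
        uniform = np.full_like(r, 1.0 / r.shape[-1])
        return np.where(s > 0.0, pos / np.where(s > 0.0, s, 1.0), uniform)

    sigma_h = match(high_regret)          # distribution over options at I
    sigma_l = match(low_regret)           # one action distribution per option
    return sigma_h, sigma_l

sigma_h, sigma_l = hierarchical_regret_matching(
    np.array([1.0, 0.0, 3.0]),
    np.array([[0.5, -0.2], [0.0, 0.0], [2.0, 2.0]]))
```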
With and , we can compute the average high-level and low-level strategies as:
(7)
where . Then, we can state:
Proposition 1
If both players sequentially use their average high-level and low-level strategies following the one-step option model, i.e., , selecting an option according to and then selecting the action according to the corresponding intra-option strategy , the resulting strategy profile converges to a Nash Equilibrium as .
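A minimal sketch of the execution scheme in Proposition 1, assuming a tabular representation of the average strategies: sample an option from the average high-level strategy, then an action from the corresponding average intra-option strategy.

```python
import numpy as np

def act_with_average_strategy(avg_high: np.ndarray, avg_low: np.ndarray,
                              rng: np.random.Generator):
    """Execute one decision with the average hierarchical strategy:
    first sample an option z from the average high-level strategy, then an
    action a from the average intra-option strategy of that option."""
    z = rng.choice(len(avg_high), p=avg_high)
    a = rng.choice(avg_low.shape[1], p=avg_low[z])
    return z, a

rng = np.random.default_rng(0)
z, a = act_with_average_strategy(np.array([0.6, 0.4]),
                                 np.array([[0.7, 0.3], [0.1, 0.9]]), rng)
```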
3.3 Low-Variance Monte Carlo Sampling Extension
In vanilla CFR, counterfactual regrets and immediate strategies are updated for every information set during each iteration. This necessitates a complete traversal of the game tree, which becomes infeasible for large-scale game models. Monte Carlo CFR (MCCFR) (Lanctot et al. (2009)) is a framework that allows CFR to update regrets/strategies on only part of the tree for a single agent (i.e., the traverser) at each iteration. MCCFR features two sampling scheme variants: External Sampling (ES) and Outcome Sampling (OS). In OS, regrets/strategies are updated for information sets within a single trajectory that is generated by sampling one action at each decision point. In ES, a single action is sampled for non-traverser agents, while all actions of the traverser are explored, leading to updates over multiple trajectories. ES relies on perfect game models for backtracking and becomes impractical as the horizon increases, since the search breadth grows exponentially with it. Our algorithm is specifically designed for domains with deep game trees, leading us to adopt OS as the sampling scheme. Nevertheless, OS is challenged by high sample variance, an issue that is exacerbated as the decision-making horizon increases. Therefore, in this section, we further complete our algorithm with a low-variance outcome sampling extension.
MCCFR’s main insight is substituting the counterfactual regrets with unbiased estimations, while maintaining the other learning rules (as in Section 3.2). This allows for updating functions only on information sets within the sampled trajectories, bypassing the need to traverse the full game tree. With MCCFR, the average overall regret converges to zero at the same rate as vanilla CFR, with high probability, as stated in Theorem 5 of Lanctot et al. (2009). Therefore, to apply the Monte Carlo extension, we propose unbiased estimations of the high-level and low-level counterfactual regrets defined in Equation (5).
First, we define and with the immediate counterfactual regrets and values : ()
(8)
The equivalence between Equation (8) and (5) is proved in Appendix D.
Next, we propose to collect trajectories with the sample strategy at each iteration , and compute the corresponding sampled immediate counterfactual regrets and values as follows:
(9)
Here, inspired by Davis et al. (2020), and are incorporated with the baseline function for variance reduction: ()
(10)
where is the indicator function. Accordingly, and are defined as and . (For superscripts on and : use when the agent is in state or for high-level option choices, and in state for low-level action decisions.)
Theorem 4
For all , we have:
(11)
Therefore, we can acquire unbiased estimations of by substituting with in Equation (8). This theorem is proved in Appendix E. Notably, Theorem 4 doesn’t prescribe any specific form for the baseline function . Yet, the baseline design can affect the sample variance of these unbiased estimators. As posited in Gibson et al. (2012), given a fixed , estimators with reduced variance necessitate fewer iterations to converge to an -Nash equilibrium. Hence, we propose the following ideal criteria for the baseline function to minimize the sample variance:
Theorem 5
If and , for all , we have:
(12)
Consequently, and are minimized with respect to for all , , .
The proof can be found in Appendix F. The ideal criteria for the baseline function proposed in Theorem 5 is incorporated into our objective design in Section 3.4.
To sum up, by employing the immediate counterfactual regret estimators shown as Equation (9) and (10), and making appropriate choices for the baseline function (introduced in Section 3.4), we are able to bolster the adaptability and learning efficiency of our method through a low-variance outcome Monte Carlo sampling extension.
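For concreteness, the sketch below shows the standard control-variate construction that underlies such baseline-corrected estimates in VR-MCCFR (Schmid et al. (2019); Davis et al. (2020)): the sampled action's value is importance-corrected around its baseline, while unsampled actions fall back to their baselines. The exact bookkeeping used in Equation (10) for the hierarchical case may differ; this is an illustrative assumption.

```python
def baseline_corrected_values(child_values: dict, baselines: dict,
                              sampled: str, sample_prob: float) -> dict:
    """Control-variate estimate of every child value at an information set.

    child_values: downstream sampled value, known only for the sampled child.
    baselines:    b(I, a) for every available action a.
    For the sampled action a*, the estimate is b(I, a*) + (v(a*) - b(I, a*)) / q(a*);
    for every other action it is simply b(I, a).  The estimator is unbiased for
    any baseline; a baseline close to the true expected value reduces variance.
    """
    estimates = {}
    for a, b in baselines.items():
        if a == sampled:
            estimates[a] = b + (child_values[a] - b) / sample_prob
        else:
            estimates[a] = b
    return estimates

est = baseline_corrected_values({"raise": 4.0},
                                {"fold": -1.0, "call": 1.5, "raise": 3.0},
                                sampled="raise", sample_prob=0.5)
# est["raise"] == 5.0, while unsampled actions keep their baseline values.
```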
3.4 Hierarchical Deep Counterfactual Regret Minimization
Building upon theoretical foundations discussed in Section 3.2 and 3.3, we now present our algorithm – HDCFR. While the algorithm outline is similar to tabular CFR algorithms (Kakkad et al. (2019); Davis et al. (2020)), HDCFR differentiates itself by introducing NNs as function approximators for the counterfactual regret , average strategy , and baseline . These approximations enable HDCFR to handle large-scale state spaces and are trained with specially-designed objective functions. In this section, we introduce the deep learning objectives, demonstrate their alignment with the theoretical underpinnings provided in Section 3.2 and 3.3, and then present the complete algorithm in pseudo-code form.
Three types of networks are trained: the counterfactual regret networks , average strategy networks , and baseline network . Notably, we do not maintain the counterfactual values and baselines for each player. Instead, we leverage the property of two-player zero-sum games where the payoff of the two players offsets each other. Thus, we track the payoff for player 1 and use the opposite value as the payoff for player 2. That is, , .
First, the counterfactual regret networks are trained by minimizing the following two objectives, denoted as and , respectively.
(13)
Here, represents a memory containing the sampled immediate counterfactual regrets gathered from iterations 1 to . As mentioned in Section 3.3, the counterfactual regrets (i.e., and ) should be replaced with their unbiased estimations acquired via Monte Carlo sampling. As a justification of our objective design, we claim:
Proposition 2
Let and denote the minimal points of and , respectively. For all , and yield unbiased estimations of the true counterfactual regrets scaled by positive constant factors, i.e., and .
Please refer to Appendix G for the proof. Observe that the counterfactual regrets are employed solely for calculating the strategy in the subsequent iteration, as per Equation (6). The positive scale factors and do not impact this calculation, as they appear in both the numerator and denominator and cancel each other out. Thus, and can be used in place of and .
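A hedged PyTorch sketch of this regression step is shown below: the regret network is fit to the sampled immediate counterfactual regrets stored in the memory. The buffer layout and the per-sample weighting are assumptions of this sketch, not the paper's exact objective.

```python
import torch
import torch.nn as nn

def train_regret_net(net: nn.Module, memory, epochs: int = 10, lr: float = 1e-3):
    """Regress the regret network on sampled immediate counterfactual regrets.

    `memory` is assumed to yield batches of (features, option_or_action_index,
    sampled_regret, iteration_weight); the weighting scheme is an assumption.
    """
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(epochs):
        for feats, idx, target, weight in memory:
            # Predicted regret of the stored option/action at this information set.
            pred = net(feats).gather(1, idx.unsqueeze(1)).squeeze(1)
            loss = (weight * (pred - target) ** 2).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
```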
Second, the average strategy networks are learned based on the immediate strategies from iteration 1 to . Specifically, they are learned by minimizing and :
(14)
Notably, in our algorithm, the sampling scheme is specially designed to fulfill the subsequent proposition. Define as the sample strategy profile at iteration when is the traverser, meaning exploration occurs during ’s decision-making. is a uniformly random strategy when , and equals to when (i.e., the other player). Furthermore, samples in are gathered when the traverser is (so samples with ). With this scheme, we assert: (refer to Appendix H for proof)
Proposition 3
Let and represent the minimal points of and , respectively, and define as the partition of at iteration . If is a collection of random samples with the sampling scheme defined above, then and , , as ().
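The corresponding sketch for the average-strategy networks fits the predicted distribution to the immediate strategies stored over all iterations. The weighted mean-squared error used here is a stand-in for the objective in Equation (14), and the buffer format is an assumption of ours.

```python
import torch
import torch.nn as nn

def train_avg_strategy_net(net: nn.Module, strategy_memory,
                           epochs: int = 10, lr: float = 1e-3):
    """Fit the average-strategy network to stored immediate strategies.

    Each sample is assumed to carry the full probability vector of the
    immediate strategy at an information set and a weight (e.g. the iteration
    index); the predicted distribution is pulled toward that vector.
    """
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(epochs):
        for feats, sigma_t, weight in strategy_memory:
            pred = torch.softmax(net(feats), dim=-1)      # predicted distribution
            loss = (weight.unsqueeze(1) * (pred - sigma_t) ** 2).sum(dim=1).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
```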
Last, at the end of each iteration, we determine the baseline function for the subsequent iteration to reduce sample variance, which is achieved by minimizing the following objective:
(15)
Here, is a memory buffer including trajectories collected at iteration when player is the traverser. For each trajectory, we compute and record the sampled baseline values , which are defined as: ( if )
(16)
As for the high-level baseline function , for simplicity, it is not trained as another network but defined based on as: . With the specially-designed sampled baseline functions and the relation between and , we have:
Proposition 4
Denote as the minimal point of and consider trajectories in as independent and identically distributed random samples, then we have and , , as .
This proposition implies that the ideal criteria for the baseline function (i.e., Theorem 5) can be achieved at the optimal point of . For a detailed proof, please refer to Appendix I.
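A sketch of this baseline update is given below: the low-level baseline network is regressed toward the sampled baseline targets recorded along each trajectory, so that it moves toward the expected downstream value required by Theorem 5. The buffer format is again an assumption of ours.

```python
import torch
import torch.nn as nn

def train_baseline_net(net: nn.Module, baseline_memory,
                       epochs: int = 10, lr: float = 1e-3):
    """Regress the low-level baseline toward sampled baseline targets.

    `baseline_memory` is assumed to yield batches of (features, action_index,
    target_value) recorded along the traverser's trajectories.
    """
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for feats, action_idx, target_value in baseline_memory:
            pred = net(feats).gather(1, action_idx.unsqueeze(1)).squeeze(1)
            loss = loss_fn(pred, target_value)
            opt.zero_grad()
            loss.backward()
            opt.step()
```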
To sum up, we present the pseudo-code of HDCFR as Algorithms 1 and 2. There are $T$ iterations in total. (1) At each iteration $t$, the two players take turns being the traverser and collecting trajectories for training (Line 6 – 11 of Algorithm 1). Each trajectory is obtained via outcome Monte Carlo sampling, detailed as Algorithm 2. In the course of sampling, immediate counterfactual regrets for the traverser are calculated using Equation (9) and (10) and stored in the regret buffer, while the strategies for the non-traverser are derived from the regret networks according to Equation (6) and saved in the strategy buffer. (2) At the end of iteration $t$, the counterfactual regret networks are trained based on samples stored in the regret memory, according to Equation (13) (Line 12 – 14 of Algorithm 1). The trained regret networks define the strategy profile for the next iteration, based on which we can update the baseline function according to Equation (15) (Line 22 – 30 of Algorithm 1). The updated regret networks and baseline are then utilized for the next iteration. (3) After $T$ iterations, a hierarchical strategy profile is learned based on samples in the strategy memory using Equation (14) (Line 17 – 19 of Algorithm 1). The training result is then returned as an approximate Nash Equilibrium strategy profile.
4 Evaluation and Main Results
Table 1: Comparison of the evaluation benchmarks.
| Benchmark | Leduc | Leduc_10 | Leduc_15 | Leduc_20 | FHP | FHP_10 |
| Stack Size | 13 | 60 | 80 | 100 | 2000 | 4000 |
| Horizon | 4 | 20 | 30 | 40 | 8 | 20 |
| # of Nodes | 464 | 31814 | 67556 | 113954 | | |
In this section, we present a comprehensive analysis of our proposed HDCFR algorithm. In Section 4.1, we benchmark HDCFR against leading model-free methods for imperfect-information zero-sum games, including DREAM (Steinberger et al. (2020)), OSSDCFR (an outcome-sampling variant of DCFR) (Steinberger (2019); Brown et al. (2019)), and NFSP (Heinrich and Silver (2016)). Notably, like HDCFR, these algorithms do not require task-specific knowledge and can be applied in environments with unknown game tree models (i.e., the model-free setting). For evaluation benchmarks, as is common practice, we select poker games: Leduc (Southey et al. (2005)) and heads-up flop hold’em (FHP) (Brown et al. (2019)). Given its hierarchical design, HDCFR is poised for enhanced performance in tasks demanding extended decision-making horizons. To underscore this, we elevate the complexity of the standard poker benchmarks by raising the number of cards and the cap on the total raises and accordingly increasing the initial stack size for each player, compelling agents to strategize over longer horizons. Detailed comparisons among these benchmarks are available in Table 1. Then, in Section 4.2, we conduct an ablation study to highlight the importance of each component within our algorithm and elucidate the impact of key hyperparameters on its performance. Finally, in Section 4.3, we delve into the hierarchical strategy learned by HDCFR. We examine whether the high-level strategy can temporally extend skills and if the low-level ones (i.e., skills) can be transferred to new tasks as expert knowledge injections to aid learning. Notably, we utilize the baseline and benchmark implementation from Steinberger (2020), and provide the code for HDCFR and necessary resources to reproduce all experimental results of this paper at https://github.com/LucasCJYSDL/HDCFR.
4.1 Comparison with State-of-the-Art Model-free Algorithms for Zero-sum IIGs
[Figure 1: Exploitability learning curves of HDCFR and the baselines on the four Leduc benchmarks (Leduc, Leduc_10, Leduc_15, Leduc_20).]
For Leduc poker games, we can explicitly compute the best response (BR) function for the learned strategy profile. We can then employ the exploitability of the learned profile, defined in Equation (17), as the learning performance metric. Commonly used in extensive-form games, exploitability measures the distance from a Nash Equilibrium, and a lower value is preferable. For hold’em poker games (like our benchmarks), exploitability is usually quantified in milli big blinds per game (mbb/g).
(17)   $\mathrm{Expl}(\bar{\sigma}^T) = \tfrac{1}{2}\big( u_1(\mathrm{BR}(\bar{\sigma}_2^T), \bar{\sigma}_2^T) + u_2(\bar{\sigma}_1^T, \mathrm{BR}(\bar{\sigma}_1^T)) \big)$
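As a small illustration of the metric, the helper below converts the two players' best-response values into mbb/g, assuming exploitability is the average of the two best-response gains and that the conversion divides by the big-blind size; the exact normalization is the one given in Equation (17).

```python
def exploitability_mbb(br_value_p1: float, br_value_p2: float,
                       big_blind: float) -> float:
    """Average best-response value against the learned profile, in milli big
    blinds per game.  br_value_pi is the expected payoff of player i's best
    response when the opponent plays its part of the learned profile; at a
    Nash Equilibrium of a zero-sum game this average is zero."""
    return (br_value_p1 + br_value_p2) / 2.0 * 1000.0 / big_blind

# Example with hypothetical best-response values (in chips) and a big blind of 2:
print(exploitability_mbb(0.031, 0.046, big_blind=2.0))   # -> 19.25 mbb/g
```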
In Figure 1, we depict the learning curves of HDCFR and the baselines. Solid lines represent the mean, while shadowed areas indicate the 95% confidence intervals from repeated trials. (1) For CFR-based algorithms, the agent samples 900 trajectories, from the root to a termination state, in each training episode, and visits around game states in the learning process. In contrast, the RL-based NFSP algorithm is trained over more episodes () and the agent visits game states in total during training. However, NFSP consistently underperforms in all benchmarks. Note that NFSP utilizes a separate y-axis. Evidently, NFSP is less sample efficient than the CFR-based algorithms. (2) In the absence of game models, backtracking is not allowed and so the player can sample only one action at each information set, known as outcome sampling, during game tree traversals. Thus, algorithms that require backtracking, like DCFR (Brown et al. (2019)) and DNCFR (Li et al. (2020)), cannot work directly, unless adapted with the outcome sampling scheme. It can be observed that the performance of the resulting algorithm OSSDCFR declines significantly with increasing game complexity, primarily due to the high sample variance. (3) With variance reduction techniques, DREAM achieves comparable performance to HDCFR in simpler scenarios. Yet, HDCFR, owing to its hierarchical structure, excels over DREAM in games with extended horizons, where DREAM struggles to converge. Notably, HDCFR’s superiority becomes more significant as the game complexity increases.
Further, we conducted head-to-head tournaments between HDCFR and each baseline. We select the top three checkpoints for each algorithm, resulting in nine pairings in total. Each pair of strategy profiles competes over 1,000 hands. Table 2 shows the average payoff of HDCFR’s strategy profile (Equation (18)), along with 95% confidence intervals, measured in mbb/g. A higher payoff indicates superior decision-making performance and is therefore preferred.
(18)
Observations from Leduc poker games in this table align with conclusions (1)-(3) previously mentioned. To further show the superiority of our algorithm, we compare its performance with baselines on larger-scale FHP games, which boast a game tree exceeding in size. Due to the immense scale of FHP games, computing the best response functions is impractical, so we offer only head-to-head comparison results. Training an instance on FHP games requires roughly seven days using a device with 8 CPU cores (3rd Gen Intel Xeon) and 128 GB RAM. Our implementation leverages the RAY parallel computing framework (Moritz et al. (2018)). Still, we can see that the advantage of HDCFR grows as task difficulty goes up.
Table 2: Average payoff (mbb/g, with 95% confidence intervals) of HDCFR’s strategy profile in head-to-head matches against each baseline.
| Benchmark | DREAM | OSSDCFR | NFSP |
| Leduc | | | |
| Leduc_10 | | | |
| Leduc_15 | | | |
| Leduc_20 | | | |
| FHP | | | |
| FHP_10 | | | |
4.2 Ablation Analysis
HDCFR integrates the one-step option framework (Section 2.2) and variance-reduced Monte Carlo CFR (Section 3.2 and 3.3). This section offers an ablation analysis highlighting each crucial element of our algorithm: the option framework, variance reduction, Monte Carlo sampling, and CFR.
(1) The key component of the one-step option framework is the Multi-Head Attention (MHA) mechanism, which enables the agent to temporally extend skills and so form a hierarchical policy in the learning process. Without this component in the high-level strategy (NO_MHA in Figure 2(a)), the agent struggles to converge at the final stage, akin to the behavior observed for DREAM in Figure 1(d). (2) Within HDCFR, we incorporate a baseline function to reduce variance. This function proves pivotal for extended-horizon tasks where sampling variance can escalate. Excluding the baseline function from the hierarchical strategy, as marked by NO_BASELINE in Figure 2(a), results in a substantial performance decline. (3) In Monte Carlo sampling, as outlined in Section 3.4, the traverser should use a uniformly random sampling strategy. Yet, for fair comparisons, we employ a weighted average of a uniformly random strategy and the player’s current strategy. The controlling weight (on the uniformly random strategy) is set as 0.5, aligning with the configuration of the baselines. Figure 2(b) indicates that, as this weight increases, there is roughly a corresponding rise in learning performance. Notably, our design, which utilizes a purely random sampling strategy (i.e., a weight of 1), delivers the best result, amplifying the performance depicted in Figure 1(d). Another key aspect of Monte Carlo sampling is the number of sampled trajectories per training episode. According to Figure 2(c), increasing this count facilitates faster convergence in the initial training phase. However, it does not guarantee an improvement in the final model’s performance, and instead it proportionally increases the overall training time. (4) As indicated by Brown et al. (2019) and Steinberger et al. (2020), slightly modifying the CFR updating rule (Equation (6)), that is, greedily selecting the action with the largest regret rather than a random one when the sum of positive regrets is zero, can speed up the convergence. We adopt the same trick and find that it improves the convergence speed slightly, as compared to the original setting (CFR_RULE in Figure 2(a)).
[Figure 2: Ablation results: (a) HDCFR compared with the NO_MHA, NO_BASELINE, and CFR_RULE variants; (b) effect of the sampling-strategy weight; (c) effect of the number of sampled trajectories per training episode.]
4.3 Case Study: Delving into the Learned Hierarchical Strategy
[Figure 3: Learning on Leduc_20 with skills transferred from various Leduc games: (a) transferred skills trained jointly with the high-level strategy; (b) transferred skills kept static.]
One key benefit of hierarchical learning is the agent’s ability to use prelearned or predefined skills as foundational blocks for strategy learning, which provides a way to integrate expert knowledge. Even in the absence of domain-specific knowledge, where rule-based skills can’t be provided as expert guidance, we can leverage skills learned from similar scenarios. Skills, functioning as policy segments, often possess greater generality than complete strategies, enabling transferred use. In Figure 3, we demonstrate the transfer of skills from various Leduc games to Leduc_20 and depict the learning outcomes. For comparison, we also present the performance without the transferred skills, labeled as HDCFR. These prelearned skills can either remain static (Figure 3(b)) or be trained with the high-level strategy (Figure 3(a)). When kept static, the agent can focus on mastering its high-level strategy to select among a set of effective skills, resulting in quicker convergence and superior end performance. Notably, the final outcomes in Figure 3(b) are intrinsically tied to the predefined skills and positively correlate with the similarity between the skills’ source task and Leduc_20. On the other hand, if the skills evolve with the high-level strategy, the improvement in convergence speed may not be obvious, but the skills can be more customized for the current task and better performance may be achieved. For instance, with Leduc_15 skills, the peak performance is reached around episode 400; with Leduc skills, training with dynamic ones (Figure 3(a)) yields better results than with static ones (Figure 3(b)). However, for Leduc_20 skills, fixed skills work better. This could be because they originate from the same task, eliminating the need for further adaptation.
Table 3: Skill switch frequency on Leduc_20’s game tree for high-level strategies trained with fixed skills from each source task (mean and 95% confidence intervals).
| Source Task | Leduc | Leduc_10 | Leduc_15 | Leduc_20 |
| Switch Frequency | | | | |
We next delve into an analysis of the learned high-level strategy. As depicted in Figure 3(b), when utilizing fixed skills from various source tasks, corresponding high-level strategies can be acquired. To determine whether the high-level strategy promotes the temporal extension of skills – instead of frequently toggling between them – we employ the hierarchical strategy at each node of Leduc_20’s game tree (with 113954 nodes in total). We then calculate the frequency of skill switches in the game tree, considering all potential hands of cards and five repeated experiments. Table 3 presents the mean and 95% confidence intervals for these results. It is evident that as the decision horizon of the skill’s source task expands, the switch frequency diminishes due to prolonged single-skill durations. Notably, for Leduc_20 skills, skill switches between parent and child nodes occur only about 10% of the time. This indicates the agent’s preference for decision-making at the level of extended skills, approximately 10 steps long on average, rather than on individual actions, aligning with our expectations.
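The switch-frequency statistic can be computed as in the following sketch, here over sampled option sequences rather than an explicit parent-child traversal of the game tree; the traversal details are ours.

```python
from typing import List, Sequence

def switch_frequency(option_sequences: Sequence[List[int]]) -> float:
    """Fraction of consecutive decision pairs at which the selected option
    changes, averaged over all sequences (lower means longer skill durations)."""
    switches, pairs = 0, 0
    for seq in option_sequences:
        for prev, cur in zip(seq[:-1], seq[1:]):
            pairs += 1
            switches += int(prev != cur)
    return switches / max(pairs, 1)

# Example: one player keeps option 2 for three decisions before switching to 0.
print(switch_frequency([[2, 2, 2, 0], [1, 1, 1, 1]]))   # -> 1/6 ≈ 0.167
```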
5 Related Work
Counterfactual Regret Minimization (CFR) (Zinkevich et al. (2007)) is an algorithm for learning Nash Equilibria in extensive-form games through iterative self-play. As part of this process, it must traverse the entire game tree on every learning iteration, which is prohibitive for large-scale games. This motivates the development of Monte Carlo CFR (MCCFR) (Lanctot et al. (2009)), which samples trajectories traversing part of the tree to allow for significantly faster iterations. Yet, the variance of Monte Carlo outcome sampling could be an issue, especially for long sample trajectories. The authors of (Schmid et al. (2019); Davis et al. (2020)) then propose to introduce baseline functions for variance reduction. Notably, all methods mentioned above are tabular-based. For games with large state space, domain-specific abstraction schemes (Ganzfried and Sandholm (2014b); Moravčík et al. (2017)) are required to shrink them to a manageable size by clustering states into buckets, which necessitates expert knowledge and is not applicable to all games.
To obviate the need of abstractions, several CFR variants with function approximators have emerged. Pioneering this was Regression CFR (Waugh et al. (2015)), which adopts regression trees to model cumulative regrets but relies on hand-crafted features and full traversals of the game tree. Subsequently, several works (Brown et al. (2019); Li et al. (2020); Steinberger (2019); Li et al. (2021b)) propose to model the cumulative counterfactual regrets and average strategies in MCCFR as neural networks to enhance the scalability. However, all these methods rely on knowledge of the game model to realize backtracking (i.e., sampling multiple actions at an information set) for regret estimation. As a model-free approach, Neural Fictitious Self-Play (NFSP) (Heinrich and Silver (2016)) is the first deep reinforcement learning algorithm to learn a Nash Equilibrium in two-player imperfect information games through self-play. Since its advent, various policy gradient and actor-critic methods have been shown to have similar convergence properties if tuned appropriately (Lanctot et al. (2017); Srinivasan et al. (2018)). However, fictitious play empirically converges slower than CFR-based approaches in many settings. DREAM (Steinberger et al. (2020)) extends DCFR with variance-reduction techniques from Davis et al. (2020) and represents the state-of-the-art in model-free algorithms in this area. Compared with DREAM, our algorithm enables hierarchical learning with (prelearned) skills and empirically shows enhanced performance on longer-horizon games.
As another important module of HDCFR, the option framework (Sutton et al. (1999)) enables learning and planning at multiple temporal levels and has been widely adopted in reinforcement learning. Multiple research areas centered on this framework have been developed. Unsupervised Option Discovery aims at discovering skills that are diverse and efficient for downstream task learning without supervision from reward signals, for which algorithms have been proposed for both single-agent (Eysenbach et al. (2019); Jinnai et al. (2020); Chen et al. (2022a)) and collaborative multi-agent scenarios (Chen et al. (2022c, b); Zhang et al. (2022)). Hierarchical Reinforcement Learning (Zhang and Whiteson (2019); Li et al. (2021a)) and Hierarchical Imitation Learning (Jing et al. (2021); Chen et al. (2023a, b)), on the other hand, aim at directly learning a hierarchical policy incorporated with skills, either from interactions with the environment or expert demonstrations. As a pioneering effort to amalgamate options with CFR, HDCFR not only offers a robust theoretical foundation but also demonstrates resilient empirical performance against leading algorithms in zero-sum imperfect-information games.
6 Conclusion
In this research, we present the first hierarchical version of Counterfactual Regret Minimization (CFR) by utilizing the option framework. Initially, we establish its theoretical foundations in a tabular setting, introducing Hierarchical CFR updating rules that are guaranteed to converge. Then, we provide a low-variance Monte Carlo sampling extension for scalable learning in tasks without perfect game models or encompassing deep game trees. Further, we incorporate neural networks as function approximators, devising deep learning objectives that align with the theoretical outcomes in the tabular setting, thereby empowering our HDCFR algorithm to manage vast state spaces. Evaluations in complex two-player zero-sum games show HDCFR’s superiority over leading algorithms in this field, and its advantage becomes more significant as the decision horizon increases, underscoring HDCFR’s great potential in tasks involving deep game trees. Moreover, we show empirically that the learned high-level strategy can temporally extend skills to utilize the hierarchical subtask structures in long-horizon tasks, and the learned skills can be transferred to different tasks, serving as expert knowledge injections to facilitate learning. Finally, our algorithm provides a novel framework to learn with predefined skills in zero-sum IIGs. An interesting future research direction could be interactive learning with human inputs as skills.
References
- Abernethy et al. (2011) J. Abernethy, P. L. Bartlett, and E. Hazan. Blackwell approachability and no-regret learning are equivalent. In Proceedings of the 24th Annual Conference on Learning Theory, pages 27–46. JMLR Workshop and Conference Proceedings, 2011.
- Bakhtin et al. (2022) A. Bakhtin, N. Brown, E. Dinan, G. Farina, C. Flaherty, D. Fried, A. Goff, J. Gray, H. Hu, A. P. Jacob, M. Komeili, K. Konath, M. Kwon, A. Lerer, M. Lewis, A. H. Miller, S. Mitts, A. Renduchintala, S. Roller, D. Rowe, W. Shi, J. Spisak, A. Wei, D. Wu, H. Zhang, and M. Zijlstra. Human-level play in the game of diplomacy by combining language models with strategic reasoning. Science, pages 1067–1074, 2022.
- Bowling et al. (2015) M. Bowling, N. Burch, M. Johanson, and O. Tammelin. Heads-up limit hold’em poker is solved. Science, 347(6218):145–149, 2015.
- Brown and Sandholm (2017) N. Brown and T. Sandholm. Safe and nested subgame solving for imperfect-information games. In Proceedings of the 31st Annual Conference on Neural Information Processing Systems, pages 689–699, 2017.
- Brown and Sandholm (2018) N. Brown and T. Sandholm. Superhuman ai for heads-up no-limit poker: Libratus beats top professionals. Science, 359(6374):418–424, 2018.
- Brown et al. (2018) N. Brown, T. Sandholm, and B. Amos. Depth-limited solving for imperfect-information games. In Proceedings of the 32nd Annual Conference on Neural Information Processing Systems, pages 7674–7685, 2018.
- Brown et al. (2019) N. Brown, A. Lerer, S. Gross, and T. Sandholm. Deep counterfactual regret minimization. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 793–802. PMLR, 2019.
- Chen et al. (2022a) J. Chen, V. Aggarwal, and T. Lan. ODPP: A unified algorithm framework for unsupervised option discovery based on determinantal point process. CoRR, abs/2212.00211, 2022a.
- Chen et al. (2022b) J. Chen, J. Chen, T. Lan, and V. Aggarwal. Learning multi-agent options for tabular reinforcement learning using factor graphs. IEEE Transactions on Artificial Intelligence, pages 1–13, 2022b. doi: 10.1109/tai.2022.3195818.
- Chen et al. (2022c) J. Chen, J. Chen, T. Lan, and V. Aggarwal. Multi-agent covering option discovery based on kronecker product of factor graphs. IEEE Transactions on Artificial Intelligence, 2022c.
- Chen et al. (2023a) J. Chen, T. Lan, and V. Aggarwal. Option-aware adversarial inverse reinforcement learning for robotic control. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 5902–5908. IEEE, 2023a.
- Chen et al. (2023b) J. Chen, D. Tamboli, T. Lan, and V. Aggarwal. Multi-task hierarchical adversarial inverse reinforcement learning. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 4895–4920. PMLR, 2023b.
- Davis et al. (2020) T. Davis, M. Schmid, and M. Bowling. Low-variance and zero-variance baselines for extensive-form games. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 2392–2401. PMLR, 2020.
- Eysenbach et al. (2019) B. Eysenbach, A. Gupta, J. Ibarz, and S. Levine. Diversity is all you need: Learning skills without a reward function. In Proceedings of the 7th International Conference on Learning Representations. OpenReview.net, 2019.
- Fu et al. (2022) H. Fu, W. Liu, S. Wu, Y. Wang, T. Yang, K. Li, J. Xing, B. Li, B. Ma, Q. Fu, and W. Yang. Actor-critic policy optimization in a large-scale imperfect-information game. In Proceedings of the 10th International Conference on Learning Representations. OpenReview.net, 2022.
- Ganzfried and Sandholm (2014a) S. Ganzfried and T. Sandholm. Potential-aware imperfect-recall abstraction with earth mover’s distance in imperfect-information games. In Proceedings of the 28th AAAI Conference on Artificial Intelligence, pages 682–690. AAAI Press, 2014a.
- Ganzfried and Sandholm (2014b) S. Ganzfried and T. Sandholm. Potential-aware imperfect-recall abstraction with earth mover’s distance in imperfect-information games. In Proceedings of the 28th AAAI Conference on Artificial Intelligence, volume 28, 2014b.
- Gibson et al. (2012) R. G. Gibson, M. Lanctot, N. Burch, D. Szafron, and M. Bowling. Generalized sampling and variance in counterfactual regret minimization. In Proceedings of the 26th AAAI Conference on Artificial Intelligence. AAAI Press, 2012.
- Heinrich and Silver (2016) J. Heinrich and D. Silver. Deep reinforcement learning from self-play in imperfect-information games. CoRR, abs/1603.01121, 2016.
- Jing et al. (2021) M. Jing, W. Huang, F. Sun, X. Ma, T. Kong, C. Gan, and L. Li. Adversarial option-aware hierarchical imitation learning. In Proceedings of the 38th International Conference on Machine Learning, pages 5097–5106. PMLR, 2021.
- Jinnai et al. (2020) Y. Jinnai, J. W. Park, M. C. Machado, and G. D. Konidaris. Exploration in reinforcement learning with deep covering options. In Proceedings of the 8th International Conference on Learning Representations. OpenReview.net, 2020.
- Kakkad et al. (2019) V. Kakkad, H. Shah, R. Patel, and N. Doshi. A comparative study of applications of game theory in cyber security and cloud computing. Procedia Computer Science, 155:680–685, 2019.
- Lanctot et al. (2009) M. Lanctot, K. Waugh, M. Zinkevich, and M. H. Bowling. Monte carlo sampling for regret minimization in extensive games. In Proceedings of the 23rd Annual Conference on Neural Information Processing Systems, pages 1078–1086. Curran Associates, Inc., 2009.
- Lanctot et al. (2017) M. Lanctot, V. F. Zambaldi, A. Gruslys, A. Lazaridou, K. Tuyls, J. Pérolat, D. Silver, and T. Graepel. A unified game-theoretic approach to multiagent reinforcement learning. In Advances in Neural Information Processing Systems 30, pages 4190–4203, 2017.
- Li et al. (2021a) C. Li, D. Song, and D. Tao. The skill-action architecture: Learning abstract action embeddings for reinforcement learning. In Submissions of the 9th International Conference on Learning Representations, 2021a.
- Li et al. (2020) H. Li, K. Hu, S. Zhang, Y. Qi, and L. Song. Double neural counterfactual regret minimization. In Proceedings of the 8th International Conference on Learning Representations. OpenReview.net, 2020.
- Li et al. (2021b) H. Li, X. Wang, Z. Guo, J. Zhang, and S. Qi. D2cfr: Minimize counterfactual regret with deep dueling neural network. arXiv preprint arXiv:2105.12328, 2021b.
- Moravcik et al. (2016) M. Moravcik, M. Schmid, K. Ha, M. Hladík, and S. J. Gaukrodger. Refining subgames in large imperfect information games. In Proceedings of the 30th AAAI Conference on Artificial Intelligence, pages 572–578. AAAI Press, 2016.
- Moravčík et al. (2017) M. Moravčík, M. Schmid, N. Burch, V. Lisỳ, D. Morrill, N. Bard, T. Davis, K. Waugh, M. Johanson, and M. Bowling. Deepstack: Expert-level artificial intelligence in heads-up no-limit poker. Science, 356(6337):508–513, 2017.
- Moritz et al. (2018) P. Moritz, R. Nishihara, S. Wang, A. Tumanov, R. Liaw, E. Liang, M. Elibol, Z. Yang, W. Paul, M. I. Jordan, and I. Stoica. Ray: A distributed framework for emerging AI applications. In 13th USENIX Symposium on Operating Systems Design and Implementation, pages 561–577. USENIX Association, 2018.
- Noe et al. (2012) T. H. Noe, M. Rebello, and J. Wang. Learning to bid: The design of auctions under uncertainty and adaptation. Games and Economic Behavior, pages 620–636, 2012.
- Osborne and Rubinstein (1994) M. J. Osborne and A. Rubinstein. A course in game theory. MIT press, 1994.
- Sandholm (2015) T. Sandholm. Abstraction for solving large incomplete-information games. In Proceedings of the 29th AAAI Conference on Artificial Intelligence, pages 4127–4131. AAAI Press, 2015.
- Schmid et al. (2019) M. Schmid, N. Burch, M. Lanctot, M. Moravcik, R. Kadlec, and M. Bowling. Variance reduction in monte carlo counterfactual regret minimization (VR-MCCFR) for extensive form games using baselines. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence, pages 2157–2164. AAAI Press, 2019.
- Southey et al. (2005) F. Southey, M. H. Bowling, B. Larson, C. Piccione, N. Burch, D. Billings, and D. C. Rayner. Bayes’ bluff: Opponent modelling in poker. In Proceedings of the 21st Conference in Uncertainty in Artificial Intelligence, pages 550–558, 2005.
- Srinivasan et al. (2018) S. Srinivasan, M. Lanctot, V. F. Zambaldi, J. Pérolat, K. Tuyls, R. Munos, and M. Bowling. Actor-critic policy optimization in partially observable multiagent environments. In Proceedings of the 32nd Annual Conference on Neural Information Processing Systems, pages 3426–3439, 2018.
- Steinberger (2019) E. Steinberger. Single deep counterfactual regret minimization. CoRR, abs/1901.07621, 2019.
- Steinberger (2020) E. Steinberger. Pokerrl. https://github.com/EricSteinberger/DREAM, 2020.
- Steinberger et al. (2020) E. Steinberger, A. Lerer, and N. Brown. DREAM: deep regret minimization with advantage baselines and model-free learning. CoRR, abs/2006.10410, 2020.
- Sutton et al. (1999) R. S. Sutton, D. Precup, and S. Singh. Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1):181–211, 1999. ISSN 0004-3702. doi: https://doi.org/10.1016/S0004-3702(99)00052-1.
- Vaswani et al. (2017) A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In Proceedings of the 31st Annual Conference on Neural Information Processing Systems, pages 5998–6008, 2017.
- Waugh et al. (2015) K. Waugh, D. Morrill, J. A. Bagnell, and M. H. Bowling. Solving games with functional regret estimation. In Proceedings of the 29th AAAI Conference on Artificial Intelligence, pages 2138–2145. AAAI Press, 2015.
- Zhang et al. (2022) F. Zhang, C. Jia, Y.-C. Li, L. Yuan, Y. Yu, and Z. Zhang. Discovering generalizable multi-agent coordination skills from multi-task offline data. In Proceedings of the 11th International Conference on Learning Representations, 2022.
- Zhang and Whiteson (2019) S. Zhang and S. Whiteson. DAC: the double actor-critic architecture for learning options. In Proceedings of the 33rd Annual Conference on Neural Information Processing Systems, pages 2010–2020, 2019.
- Zinkevich et al. (2007) M. Zinkevich, M. Johanson, M. H. Bowling, and C. Piccione. Regret minimization in games with incomplete information. In Proceedings of the 21st Annual Conference on Neural Information Processing Systems, pages 1729–1736. Curran Associates, Inc., 2007.
Appendix A Proof of Theorem 2
Define to be the information sets of player reachable from (including ), and to be a strategy profile equal to except that player adopts in the information sets contained in . Then, the average overall regret starting from () can be defined as:
(19)
Further, we define to be the set of all possible next information sets of player given that action was just selected at and define , . Then, we have the following lemma:
Lemma 1
Proof
(20)
(21)
In Equation (20) and (21), we employ the one-step look-ahead expansion (Equation (10) in Zinkevich et al. (2007)) for the second line. At iteration , when player selects a hierarchical action , it will transit to the subsequent information set with a probability of , since only player will act between and according to . According to the definition of the reach probability, (since and are executed by player ) and . Combining Equation (20) and (21), we can get:
(22)
In previous derivations, we have repeatedly employed the inequality , which holds for all , as in the last inequality of Equation (20) and (21). By applying this inequality once more to Equation (22), we can obtain Lemma 1.
Lemma 2
Proof We prove this lemma by induction on the height of the information set on the game tree. When the height is 1, i.e., , , then Lemma 1 implies Lemma 2. Now, for the general case:
(23)
In the second line, we employ the induction hypothesis. In the third line, we use the following facts: , , and for all distinct . The third fact here is derived from the perfect recall property of the game: all players can recall their previous (hierarchical) actions and the corresponding information sets. Then, because elements from the two sets possess distinct prefixes (i.e., and ).
Last, for the average overall regret, we have , where corresponds to the start of the game tree and . Applying Lemma 2, we can get the theorem: .
Appendix B Proof of Theorem 3
Regret matching is defined on a domain with a fixed set of actions and a payoff function. At each iteration , a distribution over the actions, , is chosen based on the cumulative regret . Specifically, the cumulative regret at iteration for not playing action is defined as:
(24)
where is obtained by:
(25)
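For concreteness, the following Python sketch implements generic regret matching in the form referenced by Equation (25): actions are played in proportion to their positive cumulative regret, with a uniform fallback when no regret is positive. The array names are ours and the numbers are purely illustrative.

```python
import numpy as np

def regret_matching(cumulative_regret):
    """Generic regret matching: play in proportion to positive cumulative
    regret; fall back to a uniform distribution when no regret is positive."""
    positive = np.maximum(cumulative_regret, 0.0)
    total = positive.sum()
    if total > 0.0:
        return positive / total
    return np.full(len(positive), 1.0 / len(positive))

# Example: cumulative regrets [2, -1, 1] over three actions yield [2/3, 0, 1/3].
print(regret_matching(np.array([2.0, -1.0, 1.0])))
```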
Then, we have the following lemma (Theorem 8 in Zinkevich et al. (2007)):
Lemma 3
, where .
To apply this lemma, we must transform the definitions of and in Equation (5) to a form resembling Equation (24). With Equations (5) and (2), we get:
(26)
Applying the same process to , we get:
(27)
Then, we can apply Lemma 3 and obtain:
(28)
Here, is the range of the payoff function for , which covers and . We can apply Lemma 3 directly because regret matching is adopted at each information set independently, as defined in Equation (6). By integrating Equation (28) and Theorem 2, we then get:
(29)
where is the number of information sets for player , , .
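To give a concrete feel for this kind of bound, the sketch below runs regret matching in self-play on rock-paper-scissors: the average regret decays roughly as O(1/sqrt(T)) and the average strategies approach the uniform equilibrium. This is a generic demonstration of the regret-matching guarantee on a toy matrix game, not the HDCFR procedure; the payoff matrix and initial regrets are chosen only for illustration.

```python
import numpy as np

payoff = np.array([[0., -1., 1.],
                   [1., 0., -1.],
                   [-1., 1., 0.]])   # row player's payoff in rock-paper-scissors

def regret_matching(regret):
    pos = np.maximum(regret, 0.0)
    return pos / pos.sum() if pos.sum() > 0 else np.full(3, 1.0 / 3)

# Arbitrary nonzero initial regrets so the dynamics are not trivially uniform.
r1, r2 = np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])
s1_sum, s2_sum = np.zeros(3), np.zeros(3)
T = 100_000
for _ in range(T):
    s1, s2 = regret_matching(r1), regret_matching(r2)
    u1 = payoff @ s2            # expected payoff of each row action for player 1
    u2 = -payoff.T @ s1         # expected payoff of each column action for player 2
    r1 += u1 - s1 @ u1          # accumulate instantaneous regrets
    r2 += u2 - s2 @ u2
    s1_sum += s1
    s2_sum += s2

print(s1_sum / T, s2_sum / T)             # both close to the uniform equilibrium
print(max(r1.max(), r2.max(), 0.0) / T)   # average regret, close to zero
```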
Appendix C Proof of Proposition 1
According to Theorems 1 and 3, as , , and thus the average strategy converges to a Nash Equilibrium. We claim that .
Proof
(30)
Given this equivalence, we can infer that if both players adhere to the one-step option model for each , that is, selecting an option based on and subsequently choosing the action in accordance with the corresponding intra-option strategy , the result is an approximate NE solution.
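To illustrate what executing such a hierarchical strategy entails, the following sketch samples a hierarchical action with a one-step option model: an option is drawn from a high-level distribution, then a primitive action is drawn from that option's intra-option distribution. The probability tables are hypothetical placeholders rather than strategies from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_hierarchical_action(high_level_probs, intra_option_probs):
    """One-step option model (illustrative): sample an option from the
    high-level strategy, then an action from the chosen intra-option strategy."""
    option = rng.choice(len(high_level_probs), p=high_level_probs)
    action = rng.choice(intra_option_probs.shape[1], p=intra_option_probs[option])
    return option, action

# Hypothetical example: two options over three primitive actions.
high_level = np.array([0.6, 0.4])
intra_option = np.array([[0.5, 0.3, 0.2],
                         [0.1, 0.1, 0.8]])
print(sample_hierarchical_action(high_level, intra_option))
```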
Appendix D Proof of Equivalence between Equation (8) and (5)
Appendix E Proof of Theorem 4
Lemma 4
For all :
(33)
Proof
(34)
Using the definition of in Equation (10) and following the same process as above, we can obtain the second part of the lemma.
Now, we present the proof of the first part of Theorem 4.
(35)
For the second equality, we use the following fact:
(36)
Based on Equation (10), . Similarly to Equation (36), we can get , which completes the proof of Equation (35).
Lemma 5
For all :
(37)
Proof We prove this lemma by induction on the height of on the game tree. For the base case, if , we have:
(38)
Here, the first and third equalities are due to Lemma 4, and the others are based on the corresponding definitions. Also, for this base case, we have:
(39)
where the second equality comes from Equation (38). Then, we can move on to the general case, with the hypothesis that Lemma 5 holds for the nodes lower than on the game tree:
(40)
where the induction hypothesis is adopted for the fourth equality. Equations (39) and (40) imply that holds for the general case.
So far, we have proved the first part of Theorem 4, i.e., . The second part, , can be proved with the same process as above based on Lemma 4, so we omit the complete proof and only present the following lemma used within it.
Lemma 6
For all :
(41)
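As a numerical sanity check of the unbiasedness argument underlying Theorem 4, the sketch below shows the generic mechanism at work: a payoff observed on a sampled branch, reweighted by the ratio of target to sampling probability, is an unbiased estimate of the expectation under the target strategy. The toy strategies and payoffs are invented for the example and do not correspond to any quantity in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

target = np.array([0.7, 0.2, 0.1])    # strategy whose expected payoff we want
sampler = np.array([1/3, 1/3, 1/3])   # strategy actually used for sampling
payoff = np.array([1.0, -2.0, 3.0])   # payoff of each branch

exact = float(target @ payoff)
branches = rng.choice(3, size=200_000, p=sampler)
# Importance-weighted estimate: reweight each sampled payoff by target/sampler.
estimate = np.mean(target[branches] / sampler[branches] * payoff[branches])
print(exact, estimate)                # the two values should agree closely
```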
Appendix F Proof of Theorem 5
Part I:
First, we can apply the law of total variance to , conditioning on (i.e., if is reachable from ), and get:
(42)
The first term can be expanded as follows, where the second equality is due to when .
(43)
The second term can be converted as follows, based on the fact that (i.e., ) with probability , and (i.e., ) with probability .
(44)
Note that and is not affected by , so we focus on in Equation (43). Applying the law of total variance:
(45)
Fix , , based on the definition of and Lemma 5. Thus, the second term in Equation (45) is irrelevant to . According to Equations (42)-(45), we conclude that the minimum of with respect to can be achieved when . Following the same process, we can show that the minimum of with respect to can be achieved when .
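The conclusion just reached, namely that the variance is minimized when the baseline matches the corresponding expected value, can be checked numerically with a small control-variate example. The two-child game, strategy, and payoffs below are made up, and sampling follows the strategy itself.

```python
import numpy as np

rng = np.random.default_rng(0)

strategy = np.array([0.5, 0.5])       # also used as the sampling distribution
values = np.array([4.0, -2.0])        # true values of the two children
expected = float(strategy @ values)   # 1.0

def parent_estimate(baseline, n=100_000):
    """Baseline-corrected estimate: each child gets baseline(a) plus an
    importance-weighted correction on the sampled branch only; the parent
    value is the strategy-weighted sum, which simplifies to the line below."""
    sampled = rng.choice(2, size=n, p=strategy)
    est = strategy @ baseline + values[sampled] - baseline[sampled]
    return est.mean(), est.var()

print(parent_estimate(np.zeros(2)))   # unbiased, but high variance
print(parent_estimate(values))        # unbiased, variance (numerically) zero
```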
Lemma 7
If , for all , then
, .
Proof Pick any . Based on Lemma 5, . If , then . It follows that , based on the definitions of and . Now, for any :
(46)
Thus, , . Then, it follows that for any , . With the same process as above, we can show the second part of Lemma 7.
Given the discussions above, to complete the proof of Theorem 5, we need to further show that, , if and , for all , we have , for all .
Part II:
Lemma 8
For any and any baseline function :
(47)
Proof By conditioning on the option choice at , we apply the law of total variance to :
(48)
According to the definition of and the fact that , we have:
(49)
Then, we analyze the second term in Equation (48):
(50)
Based on Equations (48)-(50), we can get the first part of Lemma 8. The second part can be obtained similarly.
Lemma 8 illustrates the outcome of a single-step lookahead from state . Employing this in an inductive manner, we can derive the complete expansion of on the game tree as the following lemma:
Lemma 9
For any and any baseline function :
(51)
Proof We prove this lemma by induction on the height of on the game tree. For the base case, , then , so . In addition, we have . Thus, the lemma holds for the base case.
For the general case, , we apply Lemma 8 and get:
(52)
where the first equality is the result of the sequential use of the two formulas in Lemma 8 and the second equality is based on the definition of . Next, we apply the induction hypothesis to , i.e., a node lower than on the game tree, and get:
(53)
By integrating Equations (52) and (53), we can get:
(54)
For the second equality, we use the definitions of and , and the fact that they equal 1 when .
Before moving to the final proof, we introduce another lemma as follows.
Lemma 10
For any and any baseline function :
(55)
Proof Applying the fact to both variance terms of Equation (51) and rearranging the terms, we arrive at the following expression:
(56)
Note that the equation above holds for any . Then, to get an upper bound of , we go back to Lemma 8 and apply Equation (56) and to its first and second terms, respectively. After rearranging, we can get:
(57)
We note that the second term of Equation (55) can be obtained by combining the last two terms of Equation (57). The second and third terms of Equation (57) correspond to the sums over and , respectively.
Based on the discussions above, we give the upper bound of in the following lemma:
Lemma 11
For any and any baseline function :
(58)
Proof
(59)
Here, we apply the definition of to get the first equality, and the law of total variance conditioned on (given ) to get the second equality. Next, we analyze the two terms in the third and fourth lines of Equation (59) separately.
(60)
Note that can be 0 or 1 (with probability ), and the variance equals 0 when , so we get the first equality in Equation (60). Similarly, we can get:
(61)
Integrating Equations (59)-(61) and utilizing the upper bound proposed in Lemma 10, we can get:
(62)
Note that the sum of the first two terms of Equation (62) equals the first term of Equation (58). The first and second terms of Equation (62) correspond to the sums over and , respectively.
Similarly, we can derive the upper bound for as follows.
Lemma 12
For any and any baseline function :
(63)
Proof By applying the law of total variance conditioned on (given ) and following the same process as Equations (59)-(61), we can get:
(64)
Then, we can apply the upper bound shown in Equation (56) and get:
(65)
Again, we can combine the first two terms of Equation (65) and get the first term of the right-hand side of Equation (63), since the first term of Equation (65) corresponds to the case that and the second term is equivalent to the sum over .
Appendix G Proof of Proposition 2
We start from the definition:
(66)
Here, denotes whether is visited in the -th sampled trajectory at iteration , and serves as the normalizing factor.
Let denote a minimizer of . Utilizing the first-order necessary condition for optimality, we obtain: . Thus, for the entry of , we deduce:
(67)
where denotes the normalizing factor, which is a positive constant for a given memory , and is the termination state of the -th sampled trajectory at iteration . In the second line of Equation (67), the second equality is valid based on the definition of the sampled counterfactual regret (Equations (9) and (10)), which assigns non-zero values exclusively to information sets along the sampled trajectory. Now, we consider the expectation of on the set of sampled trajectories :
(68)
where and the second equality holds due to Theorem 4. The second part of Proposition 2, i.e., , can be demonstrated similarly.
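The first-order-optimality step used above can be reproduced on a toy example: minimizing the sum of squared errors over a memory of sampled targets forces the unconstrained prediction for each entry to equal the average of the targets stored for that entry. The memory keys and values below are made up.

```python
import numpy as np

# A made-up memory of (information-set key, sampled target) pairs.
memory = [("I1", 0.5), ("I1", -1.0), ("I2", 2.0), ("I1", 1.1), ("I2", 0.4)]

# Setting the gradient of sum_i (pred[key_i] - target_i)^2 to zero gives the
# per-key sample mean as the minimizer.
prediction = {}
for key in {k for k, _ in memory}:
    targets = [t for k, t in memory if k == key]
    prediction[key] = float(np.mean(targets))

print(prediction)   # approximately {'I1': 0.2, 'I2': 1.2}
```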
Appendix H Proof of Proposition 3
According to the definition of in Equation (14) and following the same process as Equations (66)-(67), we can obtain:
(69)
According to the law of large numbers, as (), we have:
(70)
Ideally, we should randomly select a single information set from each randomly-sampled trajectory and add its strategy distribution to the memory. This guarantees that occurrences of information sets within each iteration are independent and identically distributed, as the sampling strategy remains consistent and the number of samples (i.e., ) is equal across different iterations, thus validating the above formula. However, in practice (Algorithm 2), we gather the strategy distributions of all information sets for the non-traverser along each sampled trajectory to enhance sample efficiency, which has been empirically shown to be effective. In addition, at a certain iteration , the samples for updating the strategy of player are collected when is the traverser, so the probability of visiting a certain information set is .
To connect the convergence result in Equation (70) and the definition of in Equation (7), we need to show that , . According to the sampling scheme, is a uniformly random strategy when , and it is equal to when . Therefore, we have:
(71)
In our game setting, as in most common ones, remains the same for all . This is because histories within a single information set possess identical heights, and player consistently employs a uniformly random strategy. Similarly, we can deduce that using the aforementioned procedure, which we do not elaborate on here.
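The sampling argument in this appendix can also be verified numerically: when an information set's strategy is stored with probability equal to its reach probability at each iteration, the plain average of the stored strategies converges to the reach-probability-weighted average strategy. The reach probabilities and per-iteration strategies below are invented for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

reach = np.array([0.8, 0.2, 0.5])          # visit probability at 3 iterations
strategies = np.array([[0.9, 0.1],         # strategy stored at each iteration
                       [0.2, 0.8],
                       [0.5, 0.5]])

weighted_avg = (reach[:, None] * strategies).sum(axis=0) / reach.sum()

stored = []
for _ in range(200_000):
    t = rng.integers(3)                    # pick an iteration uniformly
    if rng.random() < reach[t]:            # the information set is visited w.p. reach[t]
        stored.append(strategies[t])

print(weighted_avg)                        # reach-weighted average strategy
print(np.mean(stored, axis=0))             # empirical memory average, close to the above
```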
Appendix I Proof of Proposition 4
First, we present a lemma concerning the sampled baseline values , as defined in Equation (16). This definition closely resembles that of the sampled counterfactual values in Equation (10), with two key distinctions: (1) is replaced with , as is not yet available; and (2) is substituted with , enabling the reuse of trajectories sampled with for updating , thereby enhancing efficiency.
Lemma 13
For all , we have:
(72)
Proof Given the similarity between and , we can follow the proofs of Lemmas 4 and 5 to justify the lemma here.
(73)
According to Algorithm 2, the trajectories for updating are sampled at iteration when player 1 is the traverser, so . Similarly, we can obtain .
Next, we can employ these two equations to perform induction on the height of within the game tree. If , based on the definition. If , we have:
(74)
Here, we employ the induction hypothesis in the fourth equality, and incorporate the pertinent definitions for the remaining equalities. It follows that:
(75)
By repeatedly applying the two equations above, we can show that holds for a general .
Next, we complete the proof of Proposition 4.
(76)
Here, denotes the number of occurrences of in the memory . Let denote a minimizer of . Utilizing the first-order necessary condition for optimality, we obtain: . Thus, for the entry of , we deduce:
(77)
The trajectories in can be considered as a sequence of independent and identically distributed random variables, since they are independently sampled with the same sampling strategy . Then, according to the law of large numbers, as , we conclude:
(78)
where the last equality comes from Lemma 13. It follows that:
(79)