
Deep Synoptic Monte Carlo Planning in Reconnaissance Blind Chess

Gregory Clark
ML Collective, Google
[email protected]
Abstract

This paper introduces deep synoptic Monte Carlo planning (DSMCP) for large imperfect information games. The algorithm constructs a belief state with an unweighted particle filter and plans via playouts that start at samples drawn from the belief state. The algorithm accounts for uncertainty by performing inference on “synopses,” a novel stochastic abstraction of information states. DSMCP is the basis of the program Penumbra, which won the official 2020 reconnaissance blind chess competition versus 33 other programs. This paper also evaluates algorithm variants that incorporate caution, paranoia, and a novel bandit algorithm. Furthermore, it audits the synopsis features used in Penumbra with per-bit saliency statistics.

1 Introduction

Choosing a Nash equilibrium strategy is rational when the opponent is able to identify and exploit suboptimal behavior (Bowling and Veloso, 2001). However, not all opponents are so responsive, and computing a Nash equilibrium is intractable for many games. This paper presents deep synoptic Monte Carlo planning (DSMCP), an algorithm for large imperfect information games that seeks a best-response strategy rather than a Nash equilibrium strategy.

When opponents use fixed policies, an imperfect information game may be viewed as a partially observable Markov decision process (POMDP) with the opponents as part of the environment. DSMCP treats playing against specific opponents as related offline reinforcement learning (RL) problems and exploits predictability. Importantly, the structure of having opponents with imperfect information is preserved in order to account for their uncertainty.

Figure 1: Within 98k historical reconnaissance blind chess (RBC) games between non-random players, the win percentage tends to decrease as the number of possible states increases. The games were replayed while tracking up to one million states. Each bucket is labeled with an inclusive upper bound. The median of the maximum number of possible states encountered during a game is 4,869.
Figure 2: Playing RBC well requires balancing risks and rewards. (a) On the left, Penumbra moved the white queen to g8. After sensing at d2, Black could infer that the white queen occupied one of 25 squares. That uncertainty allowed the white queen to survive and capture the black king on the next turn. (b) On the right, Penumbra moved the black queen to h2. In this case, the opponent detected and captured the black queen. The games are available online at https://rbc.jhuapl.edu/games/120174 and https://rbc.jhuapl.edu/games/124718.

DSMCP uses sampling to break the “curse of dimensionality” (Pineau et al., 2006) in three ways: sampling possible histories with a particle filter, sampling possible futures with upper confidence bound tree search (UCT) (Kocsis and Szepesvári, 2006), and sampling possible world states within each information state uniformly. It represents information states with a generally-applicable stochastic abstraction technique that produces a “synopsis” from sampled world states. This paper assesses DSMCP on reconnaissance blind chess (RBC), a large imperfect information chess variant.

2 Background

Significant progress has been made in recent years in both perfect and imperfect information settings. For example, using deep neural networks to guide UCT has enabled monumental achievements in abstract strategy games as well as computer games (Silver et al., 2016, 2017a, 2017b; Schrittwieser et al., 2019; Wu, 2020; Tomašev et al., 2020). This work employs deep learning in a similar fashion.

Recent advancements in imperfect information games are also remarkable. Several programs have reached superhuman performance in Poker (Moravčík et al., 2017; Brown and Sandholm, 2018, 2019; Brown et al., 2020). In particular, ReBeL (Brown et al., 2020) combines RL and search by converting imperfect information games into continuous state space perfect information games with public belief states as nodes. This approach is powerful, but it relies on public knowledge and fails to scale to games with hidden actions and substantial private information, such as RBC.

Information set search (Parker et al., 2006, 2010) is a limited-depth algorithm for imperfect information games that operates on information states according to a minimax rule. This algorithm was designed for and evaluated on Kriegspiel chess, which is comparable to RBC.

Partially observable Monte Carlo planning (POMCP) (Silver and Veness, 2010) achieves optimal policies for POMDPs by tracking approximate belief states with an unweighted particle filter and planning with a variant of UCT on a search tree of histories. In practice, POMCP can suffer from particle depletion, requiring a domain-specific workaround. This work combines an unweighted particle filter with a novel information state abstraction technique which increases sample quality and supports deep learning.

Smooth UCT (Heinrich and Silver, 2015) and information set Monte Carlo tree search (ISMCTS) (Cowling et al., 2012) may be viewed as multi-agent versions of POMCP. These two algorithms for playing extensive-form games build search trees (for each player) of information states. These two algorithms and DSMCP all perform playouts from determinized states that are accurate from the current player’s perspective, effectively granting the opponent extra information. Still, Smooth UCT approached a Nash equilibrium by incorporating a stochastic bandit algorithm into its tree search. DSMCP uses a similar bandit algorithm that mixes in a learned policy during early node visits.

While adapting perfect information algorithms has performed surprisingly well in some imperfect information settings (Long et al., 2010), the theoretical guarantees of variants of counterfactual regret minimization (CFR) (Neller and Lanctot, 2013; Brown et al., 2018) are enticing. Online outcome sampling (OOS) (Lisy et al., 2015) extends Monte Carlo counterfactual regret minimization (MCCFR) (Lanctot et al., 2009) by building its search tree incrementally and targeting playouts to relevant parts of the tree. OOS draws samples from the beginning of the game. MCCFR and OOS are theoretically guaranteed to converge to a Nash equilibrium strategy. Notably, CFR-based algorithms produce mixed strategies, while DSMCP relies on incidental stochasticity.

Neural fictitious self-play (NFSP) (Heinrich and Silver, 2016) is an RL algorithm for training two neural networks for imperfect information games. Experiments with NFSP employed compact observation embeddings of information states. DSMCP offers a generic technique for embedding information states in large games. Dual sequential Monte Carlo (DualSMC) (Wang et al., 2019) estimates belief states and plans in a continuous domain via sequential Monte Carlo with heuristics.

3 Reconnaissance blind chess

Reconnaissance blind chess (RBC) (Newman et al., 2016; Markowitz et al., 2018; Gardner et al., 2020) is a chess variant that incorporates uncertainty about the placement of the opposing pieces along with a private sensing mechanism. As shown in Figure 2, RBC players are often faced with thousands of possible game states, and reducing uncertainty increases the odds of winning.

Game rules

Pieces move in the same way in RBC as in chess. Players cannot directly observe the movement of the opposing pieces. However, at the beginning of each turn, players may view the ground truth of any 3×3 patch of the board. The information gained from the sensing action remains private to that player. Players are also informed of the location of all captures, but not the identity of capturing pieces. When a requested move is illegal, the move is substituted with the closest legal move and the player is notified of the substitution. For example, in Figure 2 (a), if Black attempted to move the rook from h8 to f8, the rook would capture the queen on g8 and stop there instead. Players are always able to track the placement of their own pieces. Capturing the opposing king wins the game, and players are not notified about check. Passing and moving into check are legal.
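
To make the sensing mechanic concrete, the sketch below extracts the ground truth of a 3×3 patch around a chosen square. It is a minimal illustration only, assuming the third-party python-chess package for board representation rather than the API used in the official competition.

```python
# Minimal sketch of an RBC sense action using the python-chess package
# (an assumption for illustration; not the official competition API).
import chess

def sense_result(board: chess.Board, center: chess.Square) -> dict:
    """Return the ground-truth contents of the 3x3 patch centered on `center`."""
    cf, cr = chess.square_file(center), chess.square_rank(center)
    patch = {}
    for df in (-1, 0, 1):
        for dr in (-1, 0, 1):
            f, r = cf + df, cr + dr
            if 0 <= f < 8 and 0 <= r < 8:
                sq = chess.square(f, r)
                patch[sq] = board.piece_at(sq)  # None means the square is empty
    return patch

# Example: sensing at d2 on the starting position reveals nine squares.
print(sense_result(chess.Board(), chess.D2))
```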

Official competition

This paper introduces the program Penumbra, the winner of the official 2020 RBC rating competition. In total, 34 programs competed to achieve the highest rating by playing public games. Ratings were computed with BayesElo (Coulom, 2008), and playing at least 100 games was required to be eligible to win. Figure 2 shows ground truth positions from the tournament in which Penumbra voluntarily put its queen in danger. Players were paired randomly, but the opponent’s identity was provided at the start of each game, which allowed programs to tailor their strategies to specific opponents. However, opponents were free to change their strategies at any point, so attempting to exploit others could backfire. Nonetheless, Penumbra sought to model and counter predictable opponents rather than focusing on finding a Nash equilibrium.

Other RBC programs

RBC programs have employed a variety of algorithms (Gardner et al., 2020) including Q-learning (Mnih et al., 2013), counterfactual regret minimization (CFR) (Zinkevich et al., 2008), online outcome sampling (OOS) (Lisy et al., 2015), and the Stockfish chess engine (Romstad et al., 2020). Another strong RBC program (Highley et al., 2020; Blowitski and Highley, 2021) maintains a probability distribution for each piece. Most RBC programs select sense actions and move actions in separate ways, while DSMCP unifies all action selection. Savelyev (2020) also applied UCT to RBC and modeled the root belief state with a distribution over 10,000 tracked positions. Input to a neural network consisted of the most-likely 100 positions, and storing a single training example required 3.5MB on average, large enough to hinder training. Penumbra circumvents this issue by representing training examples with compact synopses, each less than 1kB.

4 Terminology

Consider the two-player extensive-form game with agents $\mathcal{P}=\{\text{self},\text{opponent}\}$, actions $\mathcal{A}$, “ground truth” world states $\mathcal{X}$, and initial state $x_0\in\mathcal{X}$. Each time an action is taken, each agent $p\in\mathcal{P}$ is given an observation $\mathbf{o}_p\in\mathcal{O}$ that matches ($\sim$) the possible world states from $p$’s perspective. For simplicity, assume the game has deterministic actions such that each $a\in\mathcal{A}$ is a function $a:X\rightarrow\mathcal{X}$ defined on a subset of world states $X\subset\mathcal{X}$. Define $\mathcal{A}_x$ as the set of actions available from $x\in\mathcal{X}$.

An information state (infostate) $s\in\mathcal{S}$ for agent $p$ consists of all observations $p$ has received so far. (An infostate is equivalent to an information set, which is the set of all possible action histories from $p$’s perspective (Osborne and Rubinstein, 1994).) Let $\mathcal{X}_s\subset\mathcal{X}$ be the set of all world states that are possible from $p$’s perspective from $s$. In general, $\mathcal{X}_s$ contains less information than $s$ since some (sensing) actions may not affect the world state. Define a collection of limited-size world state sets $\mathcal{L}=\{L\subset\mathcal{X}_s : s\in\mathcal{S}, |L|\leq\ell\}$, given a constant $\ell$.

Let $\rho:\mathcal{X}\rightarrow\mathcal{P}$ indicate the agent to act in each world state. Assume that $\mathcal{A}_x=\mathcal{A}_y$ and $\rho(x)=\rho(y)$ for all $x,y\in\mathcal{X}_s$ and $s\in\mathcal{S}$. Then extend the definitions of actions available $\mathcal{A}_*$ and agent to act $\rho$ over sets of world states and over infostates in the natural way. A policy $\pi(a|s)$ is a distribution over actions given an infostate. A belief state $B(h)$ is a distribution over action histories. Creating a belief state from an infostate requires assumptions about the opponent’s action policy $\tau(a|s)$. Let $\mathcal{R}_p:\mathcal{X}\rightarrow\mathbb{R}$ map terminal states to the reward for player $p$. Then $(\mathcal{S},\mathcal{A},\mathcal{R}_{\text{self}},\tau,s_0)$ is a POMDP, where the opponent’s policy $\tau$ provides environment state transitions and $s_0$ is the initial infostate. In the rest of this paper, the word “state” refers to a world state unless otherwise specified.

5 Deep synoptic Monte Carlo planning

Figure 3: DSMCP approximates infostates with size-limited sets of possible states (circles). It tracks all possible states $X_t$ for each turn from its own perspective and constructs belief states $\hat{B}_t$ with approximate infostates from the opponent’s perspective. At the root of each playout, the initial approximate infostate for the opponent is sampled from $\hat{B}_t$, and the initial approximate infostate for itself is a random subset of $X_t$.
Algorithm 1 Bandit – Action selection with a stochastic multi-armed bandit
  Given: $c>0$, $m\geq 0$
  Input: policy $\pi$, visit counts $\vec{n}$, value totals $\vec{q}$
  $n\leftarrow\sum_{a}\vec{n}_a$
  if $e^{-mn}>\texttt{Uniform}([0,1])$ and $\neg\text{root}$ then
     return $a\leftarrow$ an action selected by policy $\pi$
  else
     return $a\leftarrow\operatorname{argmax}_a\left(\frac{\vec{q}_a}{\vec{n}_a}+c\pi_a\sqrt{\frac{\ln n}{\vec{n}_a}}\right)$
Algorithm 2 DrawSample – Sample selection with rejection
  Given: $k,\ell\in\mathbb{Z}^{+}$, synopsis function $\sigma$, policy $\hat{\tau}$
  Input: previous belief distribution $\hat{B}$, possible states $X$, visit counts $\mathbf{N}$, value totals $\mathbf{Q}$
  for 0 to $k$ do
     $J\leftarrow$ random sample from $\hat{B}$
     $a\leftarrow\texttt{Bandit}(\hat{\tau}(\sigma(J)),\mathbf{N}_J,\mathbf{Q}_J)$
     $I\leftarrow\{a(x):x\in J\}$
     $I\leftarrow$ random $\ell$ states from $I$ if $|I|>\ell$
     return $I$ if $I\cap X\neq\varnothing$
  return $\{$ random state from $X$ $\}$

Effective planning algorithms for imperfect information games must model agents’ choice of actions based on (belief states derived from) infostates, not on world states themselves. Deep synoptic Monte Carlo planning (DSMCP) approximates infostates with size-limited sets of possible world states in $\mathcal{L}$. It uses those approximations both to construct a belief state and as UCT nodes (Kocsis and Szepesvári, 2006). Figure 3 provides a high-level visualization of the algorithm.
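
As a concrete illustration of this approximation, the sketch below (hypothetical helper names, standard library only) subsamples a set of possible world states down to the size limit $\ell$, optionally keeping one designated state, mirroring how the playouts in Algorithm 3 always retain the playout’s ground-truth state.

```python
# Illustrative sketch of the size-limited infostate approximation
# (hypothetical helper, not Penumbra's code).
import random

def approximate_infostate(possible_states, ell=128, must_include=None):
    """Subsample a set of possible world states down to at most `ell` states."""
    states = set(possible_states)
    if len(states) <= ell:
        return states
    keep = {must_include} if must_include in states else set()
    keep |= set(random.sample(list(states - keep), ell - len(keep)))
    return keep
```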

Algorithm 3 ChooseAction – UCT playouts
  Given: $b,d,\ell,n_{\text{vl}},z\in\mathbb{Z}^{+}$, synopsis func. $\sigma$, policy $\pi$, opp. policy $\hat{\tau}$, value func. $\nu$
  Input: belief distribution $\hat{B}$, possible states $X$, visit counts $\mathbf{N}$, value totals $\mathbf{Q}$
  $\bar{\pi}\leftarrow$ average of $\pi\circ\sigma$ over $b$ samples from $\hat{B}$
  while time is left do
     $J\leftarrow$ random sample from $\hat{B}$
     $I\leftarrow$ random $\ell$ states in $X$ including one in $J$
     $a_0\leftarrow\texttt{Bandit}(\bar{\pi},\mathbf{N}_{\text{root}},\mathbf{Q}_{\text{root}})$
     for $t=0$ to $d$ do
        $x_t\leftarrow$ the one state in $I\cap J$
        $\mathbf{o}\leftarrow$ observations when $a_t$ is played on $x_t$
        $I\leftarrow\{a(x):x\in I, a\in\mathcal{A}_I, \mathbf{o}_{\text{self}}\sim a(x)\}$
        $J\leftarrow\{a(x):x\in J, a\in\mathcal{A}_J, \mathbf{o}_{\text{opp}}\sim a(x)\}$
        $I\leftarrow$ random $\ell$ states in $I$ including $a_t(x_t)$
        $J\leftarrow$ random $\ell$ states in $J$ including $a_t(x_t)$
        $K_t\leftarrow I$ and $\mu\leftarrow\pi$ if $I$ is to act
        $K_t\leftarrow J$ and $\mu\leftarrow\hat{\tau}$ if $J$ is to act
        $d \mathrel{+}= 1$ if $\mathbf{N}_{K_t,a_t}>z$
        for $i=0$ to $t$ do
           $\mathbf{N}_{K_i,a_i} \mathrel{+}= n_{\text{vl}}$
        $a_{t+1}\leftarrow\texttt{Bandit}(\mu(\sigma(K_t)),\mathbf{N}_{K_t},\mathbf{Q}_{K_t})$
        $q\leftarrow\nu(\sigma(K_t))$
        for $i=0$ to $t$ do
           $\mathbf{Q}_{K_i,a_i} \mathrel{+}= q$ if $\rho(K_t)=\rho(K_i)$ else $-q$
           $\mathbf{N}_{K_i,a_i} \mathrel{+}= 1-n_{\text{vl}}$
  return $a\leftarrow\operatorname{argmax}_a\mathbf{Q}_{\text{root}}$
Algorithm 4 PlayGame – DSMCP
  Given: $n_{\text{particles}}\in\mathbb{Z}^{+}$
  $\mathbf{N}_{*,*}\leftarrow\mathbf{0}$ // Visits $\forall$ (sample, action) $\in\mathcal{L}\times\mathcal{A}$
  $\mathbf{Q}_{*,*}\leftarrow\mathbf{0}$ // Values $\forall$ (sample, action) $\in\mathcal{L}\times\mathcal{A}$
  $X_0\leftarrow\{x_0\}$
  $\hat{B}_0\leftarrow\{X_0\}$
  while the game is not over do
     $t\leftarrow$ current turn
     $\mathbf{o}\leftarrow$ current observation
     // Track all possible world states
     $X_t\leftarrow\{a(x):x\in X_{t-1}, a\in\mathcal{A}_x, \mathbf{o}_{\text{self}}\sim a(x)\}$
     // Filter belief states with the new information
     for $i=t-1$ to 0 do
        $X_i\leftarrow\{x\in X_i : \exists a\in\mathcal{A}_x \text{ such that } a(x)\in X_{i+1}\}$
        $\hat{B}_i\leftarrow\{I\in\hat{B}_i : I\cap X_i\neq\varnothing\}$
     // Repopulate belief states with new particles
     while opponent to act or $|\hat{B}_t|<n_{\text{particles}}$ do
        $i\leftarrow$ smallest $i>0$ s.t. $|\hat{B}_i|<n_{\text{particles}}$
        $I\leftarrow\texttt{DrawSample}(\hat{B}_{i-1},X_i,\mathbf{N},\mathbf{Q})$
        $\hat{B}_i\leftarrow\hat{B}_i\cup\{I\}$
     if self to act then
        $a\leftarrow\texttt{ChooseAction}(\hat{B}_t,X_t,\mathbf{N},\mathbf{Q})$
        Perform action $a$

A bandit algorithm chooses an action during each node visit, as described in Algorithm 1. This bandit algorithm is similar to Smooth UCB (Heinrich and Silver, 2015) in that they both introduce stochasticity by mixing in a secondary policy. Smooth UCB empirically approached a Nash equilibrium utilizing the average policy according to action visits at each node. DSMCP mixes in a neural network’s policy ($\pi$) instead. The constant $c$ controls the level of exploration, and $m$ controls how the policy $\pi$ is mixed into the bandit algorithm. For example, taking $m=0$ always selects actions directly with $\pi$ without considering visit counts, and taking $m=\infty$ never mixes in $\pi$. As in Silver et al. (2016), $\pi$ provides per-action exploration values which guide the tree search.
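
The sketch below is a rough Python rendering of Algorithm 1 under assumed conventions (numpy arrays for the per-action visit and value totals, and a policy vector that sums to one); it is illustrative rather than Penumbra’s exact implementation.

```python
# Sketch of Algorithm 1. With probability exp(-m*n) the learned policy is
# sampled directly; otherwise a UCB1-style rule with policy priors is used.
import math
import random
import numpy as np

def bandit(policy, visits, values, c=2.0, m=1.0, at_root=False):
    n = visits.sum()
    if not at_root and random.random() < math.exp(-m * n):
        return int(np.random.choice(len(policy), p=policy))
    safe = np.maximum(visits, 1e-8)           # avoid dividing by zero
    scores = values / safe + c * policy * np.sqrt(math.log(max(n, 1)) / safe)
    return int(np.argmax(scores))
```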

Approximate belief states are constructed as subsets $\hat{B}\subset\mathcal{L}$, where each $L\in\hat{B}$ is a set of possible world states from the opponent’s perspective. This “second order” representation of belief states allows DSMCP to account for the opponent’s uncertainty. Infostates sampled with rejection (Algorithm 2) are used as the “particles” in a particle filter which models successive belief states. Sampling is guided by a neural network policy ($\hat{\tau}$) based on the identity of the opponent. To counter particle deprivation, if $k$ consecutive candidate samples are rejected as incompatible with the possible world states, then a singleton sample consisting of a randomly-chosen possible state is selected instead.
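
A simplified sketch of this rejection sampling is shown below; `step_particle` is a hypothetical callback that advances a sampled particle by one opponent action (chosen with the bandit rule) and returns the resulting set of world states.

```python
# Sketch of Algorithm 2's rejection sampling (hypothetical names). A candidate
# particle drawn from the previous belief state is advanced one step and kept
# only if it remains consistent with the tracked possible states X; after k
# failures a singleton fallback particle is used.
import random

def draw_sample(prev_belief, possible_states, step_particle, k=512, ell=128):
    for _ in range(k):
        particle = random.choice(prev_belief)      # a set of world states
        candidate = step_particle(particle)        # apply one sampled action
        if len(candidate) > ell:
            candidate = set(random.sample(list(candidate), ell))
        if candidate & possible_states:            # consistent with X?
            return candidate
    return {random.choice(list(possible_states))}
```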

The tree search, described in Algorithm 3, tracks an approximate infostate for each player while simulating playouts. Playouts are also guided by policy ($\pi$ and $\hat{\tau}$) and value ($\nu$) estimations from a neural network. A synopsis function $\sigma$ creates a fixed-size summary of each node as input for the network. The constant $b$ is the batch size for inference, $d$ is the search depth, $\ell$ is the size of approximate infostates, $n_{\text{vl}}$ is the virtual loss weight, and $z$ is a threshold for increasing search depth.

Algorithm 4 describes how to play an entire game, tracking all possible world states. Approximate belief states ($\hat{B}_t$) are constructed for each past turn by tracking $n_{\text{particles}}$ elements of $\mathcal{L}$ (from the opponent’s point of view) with an unweighted particle filter. Each time the agent receives a new observation, all of the (past) particles that are inconsistent with the observation are filtered out and replenished, starting with the oldest belief states.
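
The particle-filter maintenance step can be pictured with the simplified sketch below (hypothetical names; the full Algorithm 4 also re-filters the possible-state sets $X_i$ for earlier turns before pruning particles).

```python
# Simplified sketch of belief maintenance in Algorithm 4 (hypothetical names):
# drop particles that contradict the filtered possible-state sets, then
# replenish depleted belief states, oldest first, using DrawSample.
def filter_and_repopulate(beliefs, possible, draw_sample, n_particles=4096):
    # beliefs[i] is a list of particles (sets of world states) for turn i;
    # possible[i] is the set of all states still possible at turn i.
    for i, X in enumerate(possible):
        beliefs[i] = [J for J in beliefs[i] if J & X]
    for i in range(1, len(beliefs)):
        while len(beliefs[i]) < n_particles:
            beliefs[i].append(draw_sample(beliefs[i - 1], possible[i]))
    return beliefs
```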

5.1 Synopsis

# | White (sense : move) | Black (sense : move)
1 | g6 : e2e4 | e3 : h7h5
2 | g7 : d2d4 | f2 : f7f5
3 | g6 : e4f5 | e4 : h5h4
4 | d7 : f1e2 | g4 : b8c6
5 | g7 : e2h5 | g4 : h8h5
6 | b7 : d1h5 | g5 : g7g6
7 | e7 : h5e8 | g6 : d7d6
8 | g6 : g6e8 |
Figure 4: Penumbra played as White in this short game; each entry lists the sense square followed by the requested move. In the position shown on the left, Black just moved a knight from b8 to c6. From White’s perspective, the black pieces could be placed in 238 different ways. Figure 5 shows a set of synopsis bitboards for this information state.
Figure 5: This set of synopsis bitboards was used as input to the neural network before White’s sense on turn 5 of the game in Figure 4. The synopsis contains 104 bitboards. Each bitboard encodes 64 binary features of the possible state set that the synopsis summarizes. For example, bitboard #26 contains the possible locations of opposing pawns, and bitboard #27 contains the possible locations of opposing knights. An attentive reader may notice that the black pawn on h4 is missing from bitboard #26, which is due to subsampling to $\ell=128$ states before computing the bitboards. In this case, the true state was missing from the set of states used to create the synopsis. The features in each synopsis are only approximations of the infostates that they represent. The first 10 bitboards are constants, which provide information that is difficult for convolutions to construct otherwise (Liu et al., 2018).
(Figure 6 diagram: an 8×8×104 synopsis input feeds a tower of ten 8×8×128 residual blocks, which feeds 14 headsets; each headset produces Policy, Value, SoonWin, SoonLose, and PieceCount outputs.)
Figure 6: Penumbra’s network contains a shared tower with 10 residual blocks and 14 headsets. Each headset contains 5 heads for a total of 70 output heads. The residual blocks are shown with a double border, and they each contain two 3×3 convolutional layers and batch normalization. All of the convolutional layers in the headsets are 1×1 convolutions with the exception of the one residual block for each policy head. Each headset was trained on a separate subset of the data, as described in Table 1. The policy head provides logits for both sense and move actions.

One of the contributions of this paper is the methodology used to approximate and encode infostates. Games that consist of a fixed number of turns, such as poker, admit a naturally-compact infostate representation based on observations (Heinrich and Silver, 2015, 2016). However, perfect representations are not always practical. Game abstractions are often used to reduce computation and memory requirements. For example, imperfect recall is an effective abstraction when past actions are unnecessary for understanding the present situation (Waugh et al., 2009; Lanctot et al., 2012).

DSMCP employs a stochastic abstraction which represents infostates with sets of world states and then subsamples to a manageable cardinality ($\ell$). Finally, a permutation-invariant synopsis function $\sigma$ produces fixed-size summaries of the approximate infostates which are used for inference. An alternative is to run inference on “determinized” world states individually and then somehow aggregate the results. However, such aggregation can easily lead to strategy fusion (Frank et al., 1998). Other alternatives include evaluating states with a recurrent network (Rumelhart et al., 1986) one-at-a-time or using a permutation-invariant architecture (Zaheer et al., 2017; Wagstaff et al., 2019).

Given functions $g_i:\mathcal{X}\rightarrow\{0,1\}$ for $i=0,\dots,F$ that map states to binary features, define the $i^{\text{th}}$ component of a synopsis function $\sigma:\mathcal{L}\rightarrow\{0,1\}^{F}$ as

$\sigma_i(X)=g_i(x_0)*_i g_i(x_1)*_i\dots*_i g_i(x_\ell)$   (1)

where $X=\{x_0,x_1,\dots,x_\ell\}$ and $*_i$ is either the logical AND ($\land$) or the logical OR ($\lor$) operation. For example, if $g_i$ encodes whether an opposing knight can move to the d7 square of a chess board and $*_i=\land$, then $\sigma_i$ indicates that a knight can definitely move to d7. Figure 4 shows an example game, and Figure 5 shows an example output of Penumbra’s synopsis function, which consists of 104 bitboard feature planes each with 64 binary features. The appendix describes each feature plane.
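
The sketch below illustrates Equation 1 with generic binary features (hypothetical names): each feature is reduced across the sampled states with either AND (“true in every possible state”) or OR (“true in some possible state”).

```python
# Sketch of Equation 1 (hypothetical names): reduce each binary feature g_i
# over the sampled world states with AND or OR to form one synopsis bit.
def synopsis(states, features):
    """`features` is a list of (g, op) pairs, g: state -> 0/1, op in {"and", "or"}."""
    bits = []
    for g, op in features:
        values = [bool(g(x)) for x in states]
        bits.append(int(all(values) if op == "and" else any(values)))
    return bits

# Example with toy features on integer "states":
# bit 0 (OR): some state is even; bit 1 (AND): every state is positive.
print(synopsis({2, 3, 5}, [(lambda x: x % 2 == 0, "or"), (lambda x: x > 0, "and")]))
```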

5.2 Network architecture

Penumbra uses a residual neural network (He et al., 2016) as shown in Figure 6. The network contains 14 headsets, designed to model specific opponents and regularize each other as they are trained on different slices of data (Zhang et al., 2020). Each headset contains 5 heads: a policy head, a value head, two heads for predicting winning and losing within the next 5 actions, and a head for guessing the number of pieces of each type in the ground truth world state. The Top policy head and the All value head are used for planning as $\pi$ and $\nu$, respectively. The other heads (including the SoonWin, SoonLose, and PieceCount heads) provide auxiliary tasks for further regularization (Wu, 2020; Fifty et al., 2020). While playing against an opponent that is “recognized” (when a headset was trained on data from only that opponent), the policy head ($\hat{\tau}$) of the corresponding headset is used for the opponent’s moves while progressing the particle filter (Algorithm 2) and while constructing the UCT tree (Algorithm 3). When the opponent is unrecognized, the Top policy head is used by default.
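
For illustration, a rough PyTorch sketch of this architecture is given below; the layer shapes, head structure, and output sizes are assumptions read off Figure 6, not Penumbra’s code, and only the policy and value heads are sketched.

```python
# Rough PyTorch sketch of a shared residual tower with multiple headsets
# (assumed shapes; illustrative only).
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch=128):
        super().__init__()
        self.c1, self.b1 = nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch)
        self.c2, self.b2 = nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch)

    def forward(self, x):
        y = torch.relu(self.b1(self.c1(x)))
        return torch.relu(x + self.b2(self.c2(y)))

class Headset(nn.Module):
    def __init__(self, ch=128):
        super().__init__()
        # Policy: 8x8x65 map of logits, flattened; Value: scalar in [-1, 1].
        self.policy = nn.Sequential(nn.Conv2d(ch, 65, 1), nn.Flatten())
        self.value = nn.Sequential(nn.Conv2d(ch, 4, 1), nn.Flatten(),
                                   nn.Linear(4 * 64, 256), nn.ReLU(),
                                   nn.Linear(256, 1), nn.Tanh())

    def forward(self, x):
        return self.policy(x), self.value(x)

class PenumbraLikeNet(nn.Module):
    def __init__(self, n_headsets=14, blocks=10, ch=128):
        super().__init__()
        self.stem = nn.Conv2d(104, ch, 3, padding=1)
        self.tower = nn.Sequential(*[ResBlock(ch) for _ in range(blocks)])
        self.headsets = nn.ModuleList(Headset(ch) for _ in range(n_headsets))

    def forward(self, x, headset=0):
        h = self.tower(torch.relu(self.stem(x)))
        return self.headsets[headset](h)

# Usage: logits, value = PenumbraLikeNet()(torch.zeros(1, 104, 8, 8), headset=0)
```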

5.3 Training procedure

The network was trained on historical game data as described by Table 1. (The games were downloaded from rbmc.jhuapl.edu in June 2019 and rbc.jhuapl.edu in August 2020; additionally, 5,000 games were played locally by StockyInference.) The reported accuracies are averages over 5 training runs. The All headset was trained on all games, the Top headset was trained on games from the highest-rated players, the Human headset was trained on all games played by humans, and each of the other 11 headsets was trained to mimic a specific opponent.

10% of the games were used as validation data based on game filename hashes. Training examples were extracted from games multiple times since reducing possible state sets to \ell states is non-deterministic. A single step of vanilla stochastic gradient descent was applied to one headset at a time, alternating between headsets according to their training weights. See the appendix for hyperparameter settings and accuracy cross tables. Training and evaluation were run on four RTX 2080 Ti GPUs.
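
The alternation between headsets can be sketched as follows; this is one plausible reading of the schedule, and `loss_for_headset` and the per-headset data iterators are hypothetical stand-ins rather than Penumbra’s training code.

```python
# Sketch of alternating single SGD steps across headsets in proportion to
# their training weights (hypothetical names; PyTorch SGD for concreteness).
import random
import torch

def train(model, loss_for_headset, batches_by_headset, weights, n_steps, lr=0.0005):
    opt = torch.optim.SGD(model.parameters(), lr=lr)   # vanilla SGD
    names = list(weights)
    probs = [weights[name] for name in names]
    for _ in range(n_steps):
        name = random.choices(names, weights=probs, k=1)[0]
        batch = next(batches_by_headset[name])          # data slice for this headset
        loss = loss_for_headset(model, name, batch)     # hypothetical loss function
        opt.zero_grad()
        loss.backward()
        opt.step()
```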

Table 1: Summary of headset training data and resulting validation accuracies

Name | # Games | # Train Actions | # Validation Actions | Multiplicity | Training Weight | Stop Gradient | % Top-1 Accuracy | % Top-5 Accuracy | % Winner Accuracy
All | 85.6k | 16.8M | 1.8M | 4 | 15 | No | 33.6 | 62.0 | 74.8
Top | 49.5k | 16.0M | 1.6M | ~9.6 | 13 | No | 43.1 | 73.4 | 82.4
StrangeFish | 13.3k | 8.1M | 0.8M | ~20.7 | 10 | No | 45.9 | 74.3 | 86.6
LaSalle | 3.7k | 4.3M | 0.4M | 32 | 4 | No | 36.4 | 68.2 | 69.7
Dyn.Entropy | 7.8k | 7.8M | 0.8M | 32 | 7 | No | 50.0 | 76.7 | 78.9
Oracle | 20.3k | 17.6M | 1.9M | 32 | 9 | No | 49.3 | 77.4 | 81.4
StockyInfe. | 10.7k | 13.5M | 1.4M | 16 | 9 | No | 45.2 | 73.6 | 68.3
Marmot | 10.2k | 3.9M | 0.4M | 16 | 8 | No | 24.6 | 55.4 | 80.8
Genetic | 5.6k | 2.3M | 0.2M | 16 | 8 | No | 40.4 | 70.5 | 77.9
Zugzwang | 10.9k | 3.1M | 0.3M | ~12.0 | 5 | No | 47.0 | 71.6 | 80.1
Trout | 18.0k | 5.1M | 0.5M | 16 | 5 | No | 41.8 | 63.4 | 79.8
Human | 6.3k | 1.5M | 0.2M | 16 | 2 | No | 24.9 | 51.9 | 73.3
Attacker | 15.3k | 1.7M | 0.2M | 16 | 4 | No | 45.1 | 61.9 | 80.1
Random | 17.0k | 0.6M | 76k | 2 | 1 | Yes | 4.5 | 20.7 | 94.7

5.4 Implementation details

Penumbra plays RBC with DSMCP along with RBC-specific extensions. First, sense actions that are dominated by other sense actions are pruned from consideration. Second, Penumbra can detect some forced wins in the sense phase, the move phase, and during the opponent’s turn. This static analysis is applied at the root and to playouts; playouts are terminated as soon as a player could win, avoiding unnecessary neural network inference. The static analysis was also used to clean training games in which the losing player had sufficient information to find a forced win.

Piece placements are represented with bitboards (Browne, 2014), and the tree of approximate infostates is implemented with a hash table. Zobrist hashing (Zobrist, 1990) maintains hashes of piece placements incrementally. Hash table collisions are resolved by overwriting older entries. The tree was not implemented until after the competition, so fixed-depth playouts were used instead ($m=0$). Inference is done in batches of 256 during both training and online planning. The time used per action is approximately proportional to the time remaining. The program processes approximately 4,000 nodes per second, and it plays randomly when the number of possible states exceeds 9 million.
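
The incremental hashing can be sketched as follows; this is a generic Zobrist scheme with one random 64-bit key per (square, piece type), not Penumbra’s exact table layout.

```python
# Generic sketch of incremental Zobrist hashing for piece placements:
# moving a piece XORs its key off the old square and onto the new square in O(1).
import random

rng = random.Random(0)
ZOBRIST = [[rng.getrandbits(64) for _ in range(12)] for _ in range(64)]

def toggle(h, square, piece):
    """XOR a piece on or off a square; applying twice restores the hash."""
    return h ^ ZOBRIST[square][piece]

def apply_move(h, piece, from_sq, to_sq):
    return toggle(toggle(h, from_sq, piece), to_sq, piece)
```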

6 Experiments

This section presents the results of playing games between Penumbra and several publicly available RBC baselines (Gardner et al., 2020; Bernardoni, 2020). Each variant of Penumbra in Table 2 played 1000 games against each baseline, and each variant in Table 3 and Table 4 played 250 games against each baseline. Games with errors were ignored and replayed. The Elo ratings and 95% confidence intervals were computed with BayesElo (Coulom, 2008) and are all compatible. The scale was anchored with StockyInference at 1500 based on its rating during the competition.

Table 2 gives ratings of the baselines and five versions of Penumbra. PenumbraCache relied solely on the network policy for action selection in playouts ($m=0$), PenumbraTree built a UCT search tree ($m=\infty$), and PenumbraMixture mixed in the network policy during early node visits ($m=1$). The mixed strategy performed the best. PenumbraNetwork selected actions based on the network policy without performing any playouts. PenumbraSimple is the same as PenumbraMixture with the static analysis described in Section 5.4 disabled. PenumbraNetwork and PenumbraSimple serve as ablation studies; removing the search algorithm is detrimental while the effect of removing the static analysis is not statistically significant. Unexpectedly, Penumbra played the strongest against StockyInference when that program was unrecognized. So, in this case, modeling the opponent with a stronger policy outperformed modeling it more accurately.

Two algorithmic modifications that give the opponent an artificial advantage during planning were investigated. Table 3 reports the results of a grid search over “cautious” and “paranoid” variants of DSMCP. The caution parameter $\kappa$ specifies the percentage of playouts that use $\ell=4$ for the opponent instead of the higher default limit. Since each approximate infostate is guaranteed to contain the correct ground truth in playouts, reducing $\ell$ for the opponent gives the opponent higher-quality information, allowing the opponent to counter risky play more easily in the constructed UCT tree.

The paranoia parameter augments the exploration values in Algorithm 1 to incorporate the minimum value seen during the current playout. With paranoia $\phi$, actions are selected according to

$\operatorname{argmax}_a\left((1-\phi)\frac{\vec{q}_a}{\vec{n}_a}+\phi\vec{m}_a+c\pi_a\sqrt{\frac{\ln n}{\vec{n}_a}}\right)$   (2)

where $\vec{m}$ contains the minimum value observed for each action. This is akin to the notion of paranoia studied by Parker et al. (2006, 2010).
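
A sketch of the paranoid selection rule, under the same assumed conventions as the earlier bandit sketch, is shown below.

```python
# Sketch of Equation 2 (assumed conventions as before): blend each action's
# mean value with the minimum value observed for it during the current playout.
import numpy as np

def paranoid_bandit(policy, visits, values, min_values, c=2.0, phi=0.2):
    n = max(visits.sum(), 1)
    safe = np.maximum(visits, 1e-8)
    scores = ((1 - phi) * values / safe + phi * min_values
              + c * policy * np.sqrt(np.log(n) / safe))
    return int(np.argmax(scores))
```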

Table 2: Bot Elo scores
Bot | Recognized as | Elo score
PenumbraMixture | unrecognized | 1747 ± 11
PenumbraSimple | unrecognized | 1739 ± 10
PenumbraCache | unrecognized | 1727 ± 10
PenumbraTree | unrecognized | 1641 ± 9
StockyInference | Trout | 1610 ± 7
StockyInference | StrangeFi. | 1528 ± 8
StockyInference | Genetic | 1512 ± 8
StockyInference | StockyInf. | 1500 ± 8
StockyInference | unrecognized | 1474 ± 8
PenumbraNetwork | unrecognized | 1376 ± 9
AggressiveTree | unrecognized | 1134 ± 15
FullMonte | unrecognized | 1028 ± 20
Trout | Trout | 1005 ± 22
Trout | unrecognized | 997 ± 22
Table 3: Caution and paranoia grid search results
Paranoia \ Caution | 0% | 10% | 20% | 30%
0% | 1711±19 | 1714±19 | 1707±19 | 1702±19
10% | 1711±19 | 1705±19 | 1726±19 | 1695±18
20% | 1688±18 | 1700±19 | 1688±18 | 1670±18
30% | 1691±18 | 1683±18 | 1681±18 | 1666±18
Table 4: Exploration strategy grid search results
Bandit \ Exploration ratio c | 1 | 2 | 4
UCB1 | 1698±19 | 1686±18 | 1696±18
aVoP | 1696±18 | 1680±18 | 1695±18

Table 4 shows the results of a grid search over exploration constants and two bandit algorithms. UCB1 (Kuleshov and Precup, 2014) (with policy priors), which is used on the last line of Algorithm 1, is compared with “a variant of PUCT” (aVoP) (Silver et al., 2016; Tian et al., 2019; Lee et al., 2019), another popular bandit algorithm. This experiment used $\kappa=20\%$ and $\phi=20\%$. Figure 7 shows that Penumbra’s value head accounts for the uncertainty of the underlying infostate.

Figure 7: The mean historical win percentage and the mean network value assigned to (a) train and (b) validation synopses tend to decrease as the number of world states given to $\sigma$ increases.

7 Per-bit saliency

Saliency methods may be able to identify which of the synopsis feature planes are most important and which are least important. Gradients only provide local information, and some saliency methods fail basic sanity checks (Adebayo et al., 2018). Higher quality saliency information may be surfaced by integrating gradients over gradually-varied inputs (Sundararajan et al., 2017; Kapishnikov et al., 2019) and by smoothing gradients locally (Smilkov et al., 2017). Those saliency methods are not directly applicable to discrete inputs such as the synopses used in this work. So, this paper introduces a saliency method that aggregates gradient information across two separate dimensions: training examples and iterations. Per-batch saliency (PBS) averages the absolute value of gradients over random batches of test examples throughout training. Similarly, per-bit saliency (PbS) averages the absolute value of gradients over bits (with specific values) within batches of test examples throughout training. Gradients were taken both with respect to the loss and with respect to the action policy.
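
A minimal sketch of the per-batch statistic is shown below; a generic model and loss function are assumed, and the synopses are treated as float tensors so that input gradients are defined.

```python
# Sketch of per-batch saliency (PBS): average the absolute input gradient over
# a batch of test examples. Per-bit saliency (PbS) additionally groups the same
# gradients by each bit's value before averaging.
import torch

def per_batch_saliency(model, batch, targets, loss_fn):
    x = batch.clone().float().requires_grad_(True)
    loss = loss_fn(model(x), targets)
    loss.backward()
    return x.grad.abs().mean(dim=0)   # mean |gradient| per input feature
```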


Figure 8: (a) The loss per-batch saliency (PBS) and (b) the action per-bit saliency (PbS) are taken on test examples during training. These graphs show the saliency of feature plane #8, dark squares, for each headset in one training run. The large gradients with respect to the loss suggest that the Genetic headset has overfit. (c) The loss PBS and (d) the action PbS provide insight about which synopsis features are most useful. The top-five most-salient feature planes and the least-salient feature plane for the Top headset from one training run are shown.

Figure 8 shows saliency information for input synopsis features used by Penumbra. In order to validate that these saliency statistics are meaningful, the model was retrained 104 times, once with each feature removed (Hooker et al., 2018). Higher saliency is slightly correlated with decreased performance when a feature is removed. The correlation coefficient with the average change in accuracy is $-0.208$ for loss-PBS and $-0.206$ for action-PbS. Explanations for the low correlation include noise in the training process and the presence of closely-related features. Ultimately, the contribution of a feature during training is distinct from how well the model can do without that feature. Since some features are near-duplicates of others, removing one may simply increase dependence on another. Still, features with high saliency, such as the current stage (sense or move) and the location of the last capture, are likely to be the most important, and features with low saliency may be considered for removal. The appendix includes saliency statistics for each feature plane.

8 Discussion

Broader impact

DSMCP is more broadly applicable than some prior algorithms for imperfect information games, which are intractable in settings with large infostates and small amounts of shared knowledge (Brown et al., 2020). RBC and the related game Kriegspiel were motivated by uncertainty in warfare (Newman et al., 2016). While playing board games is not dangerous in itself, algorithms that account for uncertainty may become effective and consequential in the real world. In particular, since it focuses on exploiting weaknesses of other agents, DSMCP could be applied in harmful ways.

Future work

Planning in imperfect information games is an active area of research (Russell and Norvig, 2020), and RBC is a promising testing ground for such research. Penumbra would likely benefit from further hyperparameter tuning and potentially alternative corralled bandit algorithms (Arora et al., 2020). Modeling an opponent poorly could be catastrophic; algorithmic adjustments may lead to more-robust best-response strategies (Ponsen et al., 2011). How much is lost by collapsing infostates with synopses is unclear and deserves further investigation. Finally, the “bitter lesson” of machine learning (Sutton, 2019) suggests that a learned synopsis function may perform better.

Acknowledgements

Thanks to the Johns Hopkins University Applied Physics Laboratory for inventing such an intriguing game and for hosting RBC competitions. Thanks to Ryan Gardner for valuable correspondence. Thanks to Rosanne Liu, Joel Veness, Marc Lanctot, Zhe Zhao, and Zach Nussbaum for providing feedback on early drafts. Thanks to William Bernardoni for open sourcing high-quality baseline bots. Thanks to Solidmind for the song “Penumbra”, which is an excellent soundtrack for programming.

References

  • Adebayo et al. (2018) Julius Adebayo, Justin Gilmer, Michael Muelly, Ian Goodfellow, Moritz Hardt, and Been Kim. Sanity checks for saliency maps, 2018.
  • Arora et al. (2020) Raman Arora, Teodor V. Marinov, and Mehryar Mohri. Corralling stochastic bandit algorithms, 2020.
  • Bernardoni (2020) William Bernardoni. Baseline bots. https://github.com/wrbernardoni/Baseline-Bots, 2020.
  • Blowitski and Highley (2021) Kyle Blowitski and Timothy Highley. Checkpoint variations for deep q-learning in reconnaissance blind chess. J. Comput. Sci. Coll., 36(3), oct 2021.
  • Bowling and Veloso (2001) Michael Bowling and Manuela Veloso. Rational and convergent learning in stochastic games. IJCAI International Joint Conference on Artificial Intelligence, 08 2001.
  • Brown and Sandholm (2018) Noam Brown and Tuomas Sandholm. Superhuman ai for heads-up no-limit poker: Libratus beats top professionals. Science, 359(6374):418–424, 2018. ISSN 0036-8075. doi: 10.1126/science.aao1733. URL https://science.sciencemag.org/content/359/6374/418.
  • Brown and Sandholm (2019) Noam Brown and Tuomas Sandholm. Superhuman ai for multiplayer poker. Science, 365(6456):885–890, 2019. ISSN 0036-8075. doi: 10.1126/science.aay2400. URL https://science.sciencemag.org/content/365/6456/885.
  • Brown et al. (2018) Noam Brown, Adam Lerer, Sam Gross, and Tuomas Sandholm. Deep counterfactual regret minimization, 2018.
  • Brown et al. (2020) Noam Brown, Anton Bakhtin, Adam Lerer, and Qucheng Gong. Combining deep reinforcement learning and search for imperfect-information games, 2020.
  • Browne (2014) Cameron Browne. Bitboard methods for games. ICGA journal, 37:67–84, 06 2014. doi: 10.3233/ICG-2014-37202.
  • Ciancarini and Favini (2009) Paolo Ciancarini and Gian Piero Favini. Monte carlo tree search techniques in the game of kriegspiel. In IJCAI, pages 474–479, 2009. URL http://ijcai.org/Proceedings/09/Papers/086.pdf.
  • Coulom (2008) Rémi Coulom. Whole-history rating: A bayesian rating system for players of time-varying strength. In H. Jaap van den Herik, Xinhe Xu, Zongmin Ma, and Mark H. M. Winands, editors, Computers and Games, pages 113–124, Berlin, Heidelberg, 2008. Springer Berlin Heidelberg. ISBN 978-3-540-87608-3.
  • Cowling et al. (2012) Peter Cowling, Edward Powley, and Daniel Whitehouse. Information set monte carlo tree search. IEEE Transactions on Computational Intelligence and Ai in Games, 4:120–143, 06 2012. doi: 10.1109/TCIAIG.2012.2200894.
  • Fifty et al. (2020) Christopher Fifty, Ehsan Amid, Zhe Zhao, Tianhe Yu, Rohan Anil, and Chelsea Finn. Measuring and harnessing transference in multi-task learning, 2020.
  • Frank et al. (1998) I. Frank, D. Basin, and Hitoshi Matsubara. Finding optimal strategies for imperfect information games. In AAAI/IAAI, 1998.
  • Gardner et al. (2020) Ryan W. Gardner, Corey Lowman, Casey Richardson, Ashley J. Llorens, Jared Markowitz, Nathan Drenkow, Andrew Newman, Gregory Clark, Gino Perrotta, Robert Perrotta, Timothy Highley, Vlad Shcherbina, William Bernardoni, Mark Jordan, and Asen Asenov. The first international competition in machine reconnaissance blind chess. volume 123 of Proceedings of Machine Learning Research, pages 121–130, Vancouver, CA, 08–14 Dec 2020. PMLR. URL http://proceedings.mlr.press/v123/gardner20a.html.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2016. doi: 10.1109/cvpr.2016.90. URL http://dx.doi.org/10.1109/cvpr.2016.90.
  • Heinrich and Silver (2015) Johannes Heinrich and David Silver. Smooth uct search in computer poker. In Proceedings of the 24th International Conference on Artificial Intelligence, IJCAI’15, page 554–560. AAAI Press, 2015. ISBN 9781577357384.
  • Heinrich and Silver (2016) Johannes Heinrich and David Silver. Deep reinforcement learning from self-play in imperfect-information games, 2016.
  • Highley et al. (2020) Timothy Highley, Brendan Funk, and Laureen Okin. Dealing with uncertainty: A piecewisegrid agent for reconnaissance blind chess. J. Comput. Sci. Coll., 35(8):156–165, April 2020. ISSN 1937-4771.
  • Hooker et al. (2018) Sara Hooker, Dumitru Erhan, Pieter-Jan Kindermans, and Been Kim. A benchmark for interpretability methods in deep neural networks, 2018.
  • Kapishnikov et al. (2019) Andrei Kapishnikov, Tolga Bolukbasi, Fernanda Viégas, and Michael Terry. Xrai: Better attributions through regions, 2019.
  • Kocsis and Szepesvári (2006) Levente Kocsis and Csaba Szepesvári. Bandit based monte-carlo planning. volume 2006, pages 282–293, 09 2006. ISBN 978-3-540-45375-8. doi: 10.1007/11871842_29.
  • Kuleshov and Precup (2014) Volodymyr Kuleshov and Doina Precup. Algorithms for multi-armed bandit problems, 2014.
  • Lanctot et al. (2009) Marc Lanctot, Kevin Waugh, Martin Zinkevich, and Michael Bowling. Monte carlo sampling for regret minimization in extensive games. In Y. Bengio, D. Schuurmans, J. Lafferty, C. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems, volume 22, pages 1078–1086. Curran Associates, Inc., 2009.
  • Lanctot et al. (2012) Marc Lanctot, Richard Gibson, Neil Burch, Martin Zinkevich, and Michael Bowling. No-regret learning in extensive-form games with imperfect recall, 2012.
  • Lee et al. (2019) B. Lee, A. Jackson, T. Madams, Seth Troisi, and Derek Jones. Minigo: A case study in reproducing reinforcement learning research. In RML@ICLR, 2019.
  • Lisy et al. (2015) Viliam Lisy, Marc Lanctot, and Michael Bowling. Online monte carlo counterfactual regret minimization for search in imperfect information games. volume 1, 05 2015.
  • Liu et al. (2018) Rosanne Liu, Joel Lehman, Piero Molino, Felipe Petroski Such, Eric Frank, Alex Sergeev, and Jason Yosinski. An intriguing failing of convolutional neural networks and the coordconv solution, 2018.
  • Long et al. (2010) Jeffrey Long, Nathan R. Sturtevant, Michael Buro, and Timothy Furtak. Understanding the success of perfect information monte carlo sampling in game tree search. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, AAAI’10, page 134–140. AAAI Press, 2010.
  • Markowitz et al. (2018) Jared Markowitz, Ryan W. Gardner, and Ashley J. Llorens. On the complexity of reconnaissance blind chess, 2018.
  • Mnih et al. (2013) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning, 2013.
  • Moravčík et al. (2017) Matej Moravčík, Martin Schmid, Neil Burch, Viliam Lisý, Dustin Morrill, Nolan Bard, Trevor Davis, Kevin Waugh, Michael Johanson, and Michael Bowling. Deepstack: Expert-level artificial intelligence in heads-up no-limit poker. Science, 356(6337):508–513, Mar 2017. ISSN 1095-9203. doi: 10.1126/science.aam6960. URL http://dx.doi.org/10.1126/science.aam6960.
  • Neller and Lanctot (2013) T. Neller and Marc Lanctot. An introduction to counterfactual regret minimization. 2013.
  • Newman et al. (2016) Andrew J. Newman, Casey L. Richardson, Sean M. Kain, Paul G. Stankiewicz, Paul R. Guseman, Blake A. Schreurs, and Jeffrey A. Dunne. Reconnaissance blind multi-chess: an experimentation platform for ISR sensor fusion and resource management. In Ivan Kadar, editor, Signal Processing, Sensor/Information Fusion, and Target Recognition XXV, volume 9842, pages 62 – 81. International Society for Optics and Photonics, SPIE, 2016. doi: 10.1117/12.2228127. URL https://doi.org/10.1117/12.2228127.
  • Osborne and Rubinstein (1994) M.J. Osborne and A. Rubinstein. A Course in Game Theory. The MIT Press. MIT Press, 1994. ISBN 9780262650403. URL https://books.google.com/books?id=mnv1DAAAQBAJ.
  • Parker et al. (2006) Austin Parker, Dana Nau, and V. S. Subrahmanian. Overconfidence or paranoia? search in imperfect-information games. In Proceedings of the 21st National Conference on Artificial Intelligence - Volume 2, AAAI’06, page 1045–1050. AAAI Press, 2006. ISBN 9781577352815.
  • Parker et al. (2010) Austin Parker, Dana S. Nau, and V. S. Subrahmanian. Paranoia versus overconfidence in imperfect-information games. In Rina Dechter, Hector Geffner, and Joseph Y. Halpern, editors, Heuristics, Probabilities, and Causality: A Tribute to Judea Pearl, chapter 5, pages 63–87. College Publications, 2010.
  • Pineau et al. (2006) J. Pineau, G. Gordon, and S. Thrun. Anytime point-based approximations for large pomdps. Journal of Artificial Intelligence Research, 27:335–380, Nov 2006. ISSN 1076-9757. doi: 10.1613/jair.2078. URL http://dx.doi.org/10.1613/jair.2078.
  • Ponsen et al. (2011) Marc Ponsen, Steven de Jong, and Marc Lanctot. Computing approximate nash equilibria and robust best-responses using sampling. J. Artif. Int. Res., 42(1):575–605, September 2011. ISSN 1076-9757.
  • Richards (2012) Mark D. Richards. Reasoning and decisions in partially observable games. 2012.
  • Romstad et al. (2020) Tord Romstad, Marco Costalba, Joona Kiiski, and et al. Stockfish - open source chess engine. https://stockfishchess.org, 2020.
  • Rumelhart et al. (1986) D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning Internal Representations by Error Propagation, page 318–362. MIT Press, Cambridge, MA, USA, 1986. ISBN 026268053X.
  • Russell and Norvig (2020) S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Pearson, 2020. ISBN 9780134671932.
  • Savelyev (2020) Sergey Savelyev. Mastering reconnaissance blind chess with reinforcement learning. May 2020. URL https://www.sergeysav.com/projects/rbmc.
  • Schrittwieser et al. (2019) Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, Timothy Lillicrap, and David Silver. Mastering atari, go, chess and shogi by planning with a learned model, 2019.
  • Silver and Veness (2010) D. Silver and J. Veness. Monte-carlo planning in large pomdps. In NIPS, 2010.
  • Silver et al. (2016) D. Silver, Aja Huang, Chris J. Maddison, A. Guez, L. Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Vedavyas Panneershelvam, Marc Lanctot, S. Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and Demis Hassabis. Mastering the game of go with deep neural networks and tree search. Nature, 529:484–489, 2016.
  • Silver et al. (2017a) D. Silver, Julian Schrittwieser, K. Simonyan, Ioannis Antonoglou, Aja Huang, A. Guez, T. Hubert, L. Baker, Matthew Lai, A. Bolton, Yutian Chen, T. Lillicrap, F. Hui, L. Sifre, George van den Driessche, T. Graepel, and Demis Hassabis. Mastering the game of go without human knowledge. Nature, 550:354–359, 2017a.
  • Silver et al. (2017b) David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, and Demis Hassabis. Mastering chess and shogi by self-play with a general reinforcement learning algorithm, 2017b.
  • Smilkov et al. (2017) Daniel Smilkov, Nikhil Thorat, Been Kim, Fernanda Viégas, and Martin Wattenberg. Smoothgrad: removing noise by adding noise, 2017.
  • Sundararajan et al. (2017) Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks, 2017.
  • Sutton (2019) Richard S. Sutton. The bitter lesson. http://www.incompleteideas.net/IncIdeas/BitterLesson.html, 2019.
  • Tian et al. (2019) Yuandong Tian, Jerry Ma, Qucheng Gong, Shubho Sengupta, Zhuoyuan Chen, James Pinkerton, and C. Lawrence Zitnick. Elf opengo: An analysis and open reimplementation of alphazero, 2019.
  • Tomašev et al. (2020) Nenad Tomašev, Ulrich Paquet, Demis Hassabis, and Vladimir Kramnik. Assessing game balance with alphazero: Exploring alternative rule sets in chess, 2020.
  • Wagstaff et al. (2019) Edward Wagstaff, Fabian B. Fuchs, Martin Engelcke, Ingmar Posner, and Michael Osborne. On the limitations of representing functions on sets, 2019.
  • Wang et al. (2019) Yunbo Wang, B. Liu, Jiajun Wu, Yuke Zhu, S. S. Du, Li Fei-Fei, and J. Tenenbaum. Dualsmc: Tunneling differentiable filtering and planning under continuous pomdps. arXiv: Learning, 2019.
  • Waugh et al. (2009) Kevin Waugh, Martin Zinkevich, Michael Johanson, Morgan Kan, David Schnizlein, and Michael Bowling. A practical use of imperfect recall. 01 2009.
  • Wu (2020) David J. Wu. Accelerating self-play learning in go, 2020.
  • Zaheer et al. (2017) Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Ruslan Salakhutdinov, and Alexander Smola. Deep sets, 2017.
  • Zhang et al. (2020) Junjie Zhang, Lingqiao Liu, Peng Wang, and Chunhua Shen. To balance or not to balance: A simple-yet-effective approach for learning with long-tailed distributions, 2020.
  • Zinkevich et al. (2008) Martin Zinkevich, Michael Johanson, Michael Bowling, and Carmelo Piccione. Regret minimization in games with incomplete information. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems, volume 20, pages 1729–1736. Curran Associates, Inc., 2008. URL https://proceedings.neurips.cc/paper/2007/file/08d98638c6fcd194a4b1e6992063e944-Paper.pdf.
  • Zobrist (1990) A. Zobrist. A new hashing method with application for game playing. ICGA Journal, 13:69–73, 1990.

Appendix A Appendix

Table 5 lists the training hyperparameters and runtime hyperparameters used by PenumbraMixture. Table 6, Table 7, and Table 8 provide top-1 action, top-5 action, and winner accuracies, respectively, between each headset in the neural network. Figure 9 shows game length distributions for each headset.

The synopsis features were hand-designed. Many of them are natural given the rules of chess. Some of them are near duplicates of each other. Table 9 and Table 10 jointly provide brief descriptions of each synopsis feature plane. These tables also include saliency estimates averaged over five runs. The penultimate column orders the synopsis features by their per-bit saliency based on action gradients, and the final column reports the average difference of the policy head accuracies when the model was retrained without each feature.

Table 5: Hyperparameters used by PenumbraMixture

Symbol Parameter Value
b Batch size 256
c Exploration constant 2
d_sense Search depth for sense actions 6
d_move Search depth for move actions 12
F # of binary synopsis features 8×8×104
k Rejection sampling persistence 512
ℓ Limited state set size 128
m Bandit mixing constant 1
n_particles # of samples to track 4096
n_vl Virtual loss 1
n_batches Total minibatches of training 650000
n_width Network width; # features per layer 128
n_depth Network depth; # residual blocks 10
z Depth increase threshold ∞
κ Caution 0
ϕ Paranoia 0
ϵ Learning rate 0.0005
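
For readers who want to reproduce these settings, the sketch below bundles the Table 5 values into a single Python configuration object. The class and field names are illustrative and do not come from Penumbra's source code; only the values are taken from the table.

```python
# Hedged sketch: the Table 5 hyperparameters as one configuration object.
# Names are illustrative; only the values come from the table.
from dataclasses import dataclass
import math

@dataclass(frozen=True)
class PenumbraMixtureConfig:
    batch_size: int = 256                 # b
    exploration_constant: float = 2.0     # c
    sense_search_depth: int = 6           # d_sense
    move_search_depth: int = 12           # d_move
    synopsis_shape: tuple = (104, 8, 8)   # F: 8 x 8 x 104 binary features (channels first)
    rejection_persistence: int = 512      # k
    limited_state_set_size: int = 128     # ℓ
    bandit_mixing_constant: float = 1.0   # m
    n_particles: int = 4096               # samples tracked by the particle filter
    virtual_loss: int = 1                 # n_vl
    total_minibatches: int = 650_000      # n_batches
    network_width: int = 128              # features per layer
    network_depth: int = 10               # residual blocks
    depth_increase_threshold: float = math.inf  # z
    caution: float = 0.0                  # κ
    paranoia: float = 0.0                 # ϕ
    learning_rate: float = 0.0005         # ϵ

config = PenumbraMixtureConfig()
```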

A.1 2019 NeurIPS competition

Penumbra was originally created to compete in the 2019 reconnaissance blind chess competition hosted by the Conference on Neural Information Processing Systems (NeurIPS). However, it performed very poorly in that competition, winning fewer games than the random bot.

The program and underlying algorithm presented in this paper are largely the same as the originals. The main differences are that some hyperparameters were adjusted, the neural network was retrained with more data, and a key bug in the playout code was fixed: instead of choosing actions according to the policy from the neural network, the playout code erroneously always selected the last legal action. Fixing that single bug made a substantial difference.
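
To make the nature of that bug concrete, the following minimal sketch contrasts the erroneous selection with policy-based sampling. The function names, the `legal_actions` list, and the `policy_probs` mapping are hypothetical and are not taken from Penumbra's code.

```python
# Hedged sketch of the playout bug described above; names are hypothetical.
import random

def choose_playout_action_buggy(legal_actions, policy_probs):
    # Bug: the last legal action was always returned, ignoring the policy.
    return legal_actions[-1]

def choose_playout_action_fixed(legal_actions, policy_probs):
    # Fix: sample an action in proportion to the network's policy over legal actions.
    weights = [policy_probs[a] for a in legal_actions]
    return random.choices(legal_actions, weights=weights, k=1)[0]
```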

A.2 Comparison with Kriegspiel

A comparison between RBC and Kriegspiel chess (Ciancarini and Favini, 2009; Parker et al., 2010; Richards, 2012) may be worthwhile. Kriegspiel chess also introduces uncertainty about the opposing pieces but lacks an explicit sensing mechanism. Instead, information is gathered solely from captures, check notifications, and illegal move attempts. In Kriegspiel, illegal moves are rejected and the player is allowed to choose a new move with their increased information about the board state, which entangles the positional and informational aspects of the game. In contrast, sensing in RBC gives players direct control over the amount and character of the information they possess.

Another significant difference comes from the mechanics related to check. Capturing pieces and putting the opposing king into check have benefits in both games: capturing pieces leads to a material advantage, and check often precedes checkmate. In Kriegspiel, however, both capturing and giving check also provide the opponent with information. In RBC, while capturing does give the opponent information, putting their king into check does not, which makes sneak attacks more viable.

A.3 Games played

The games that were played in order to produce the results tables in this paper are available for download from https://github.com/w-hat/penumbra.

Table 6: Top-1 action accuracy (%). Rows give the headset; columns give the evaluation dataset.
Headset All Top StrangeFish LaSalle Dyn.Entropy Oracle StockyInfe. Marmot Genetic Zugzwang Trout Human Attacker Random
All 33.6 41.5 40.5 32.3 43.7 44.1 41.5 20.3 31.9 37.1 36.6 21.9 41.4 3.8
Top 30.5 43.1 44.5 32.3 43.9 44.4 40.3 19.4 31.6 22.4 26.0 18.5 15.6 3.3
StrangeFish 27.3 40.5 45.9 30.3 36.9 38.8 33.2 17.9 27.7 21.0 23.6 17.5 14.6 3.3
LaSalle 26.0 34.4 34.6 36.4 34.1 34.5 33.2 17.2 24.6 22.6 26.1 15.5 11.2 3.4
Dyn.Entropy 27.9 37.9 35.2 27.5 50.0 40.1 37.4 18.0 32.6 18.3 22.6 16.9 13.8 3.3
Oracle 28.8 38.4 36.3 29.1 41.2 49.3 35.1 17.2 29.5 19.9 26.6 17.0 11.9 3.4
StockyInfe. 28.8 38.1 33.9 30.7 42.9 38.6 45.2 17.9 29.3 18.4 23.1 16.7 11.6 3.3
Marmot 22.4 29.4 28.6 24.0 31.4 29.3 29.7 24.6 25.0 16.4 15.5 15.4 11.1 3.4
Genetic 24.0 32.5 30.3 24.1 39.8 35.0 32.3 16.9 40.4 15.1 15.7 15.1 7.8 3.4
Zugzwang 20.5 21.8 23.2 20.5 20.4 23.5 17.3 11.7 12.3 47.0 34.6 14.0 10.9 3.2
Trout 22.8 25.1 26.0 24.0 23.0 27.8 21.3 12.5 14.4 36.1 41.8 14.9 14.7 3.7
Human 23.8 30.1 30.6 24.9 31.4 30.1 28.4 16.8 24.1 24.7 24.5 24.9 12.5 3.3
Attacker 10.6 11.7 11.0 9.8 11.4 11.6 12.4 8.6 8.8 9.2 9.0 6.7 45.1 4.4
Random 14.0 16.7 16.2 14.0 16.5 17.8 16.8 9.4 11.4 15.7 16.5 10.4 10.4 4.5
Table 7: Top-5 action accuracy (%). Rows give the headset; columns give the evaluation dataset.
Headset All Top StrangeFish LaSalle Dyn.Entropy Oracle StockyInfe. Marmot Genetic Zugzwang Trout Human Attacker Random
All 62.0 72.3 71.0 66.6 74.9 75.8 72.5 51.3 65.3 65.5 61.8 48.1 59.1 18.2
Top 58.8 73.4 73.4 66.4 75.5 76.4 72.0 50.1 64.7 48.5 52.2 45.4 35.3 16.3
StrangeFish 56.8 72.2 74.3 64.7 72.6 74.2 67.5 48.5 62.1 46.5 50.1 43.8 41.4 15.7
LaSalle 56.8 68.8 68.4 68.2 69.8 70.6 68.3 48.7 58.9 51.5 56.6 43.6 30.8 16.8
Dyn.Entropy 55.5 69.1 66.9 60.9 76.7 72.3 69.6 47.2 64.4 38.4 47.2 42.7 41.4 16.6
Oracle 56.7 70.4 68.9 61.9 74.1 77.4 69.1 45.6 63.4 43.9 50.6 43.1 34.4 16.4
StockyInfe. 57.0 70.0 67.3 64.7 74.0 71.3 73.6 49.3 63.3 43.3 50.2 44.0 29.8 17.0
Marmot 53.4 65.0 64.0 58.7 68.6 66.1 65.2 55.4 59.3 43.0 46.2 42.6 32.2 16.2
Genetic 52.6 65.8 64.2 58.2 71.3 69.7 65.8 45.4 70.5 37.4 41.6 40.9 27.0 16.2
Zugzwang 42.8 45.0 45.4 46.0 42.1 47.5 42.4 32.5 32.8 71.6 57.9 35.3 29.0 15.4
Trout 49.3 54.5 54.4 53.9 53.7 57.2 52.5 38.4 42.2 62.9 63.4 38.7 40.9 18.1
Human 53.5 62.9 62.4 58.3 64.7 64.6 62.5 45.9 55.7 53.7 53.4 51.9 33.4 16.3
Attacker 35.2 39.5 38.8 34.7 40.8 40.2 39.7 31.1 33.8 33.0 32.4 28.4 61.9 20.0
Random 39.7 45.6 44.5 41.2 46.4 47.9 46.0 31.8 37.7 41.3 42.1 31.5 30.3 20.7
Table 8: Winner accuracy (%). Rows give the headset; columns give the evaluation dataset.
Headset All Top StrangeFish LaSalle Dyn.Entropy Oracle StockyInfe. Marmot Genetic Zugzwang Trout Human Attacker Random
All 74.8 73.4 76.6 68.4 74.6 76.3 67.8 79.3 74.1 76.7 76.7 72.1 79.6 91.3
Top 63.1 82.4 86.7 68.0 76.0 80.6 66.3 65.6 73.2 49.1 55.1 52.7 47.7 29.9
StrangeFish 64.1 82.1 86.6 67.5 76.2 80.6 65.2 65.8 72.7 50.1 55.4 55.8 60.9 37.3
LaSalle 69.9 77.3 80.7 69.7 76.0 78.8 67.1 73.5 75.9 65.1 68.3 62.4 71.9 60.7
Dyn.Entropy 67.8 80.4 85.0 69.4 78.9 80.8 67.0 71.7 75.6 58.5 61.1 61.5 65.0 47.3
Oracle 66.0 82.0 86.4 69.4 77.3 81.4 66.9 69.2 75.3 53.5 58.2 57.9 59.5 39.7
StockyInfe. 71.3 78.5 82.5 69.3 77.2 79.5 68.3 75.3 76.8 64.0 65.8 66.9 72.6 68.9
Marmot 70.6 67.7 70.1 64.5 70.9 72.3 65.7 80.8 72.9 73.9 73.4 69.2 77.0 72.6
Genetic 67.1 80.0 84.1 68.6 77.1 80.3 67.1 71.5 77.9 56.5 60.8 60.9 62.1 44.3
Zugzwang 68.3 54.6 54.0 59.7 61.4 60.0 61.7 76.1 64.4 80.1 78.8 72.5 78.5 88.9
Trout 69.7 58.6 59.1 60.9 63.6 64.4 63.4 77.9 69.0 77.9 79.8 72.4 78.6 87.3
Human 71.1 64.8 66.0 63.8 67.7 68.0 65.1 76.6 70.4 74.6 76.2 73.3 77.7 90.1
Attacker 63.7 47.1 46.1 55.6 53.8 50.9 56.5 71.4 59.5 75.0 73.4 71.5 80.1 92.7
Random 51.0 25.5 20.9 43.4 34.8 28.5 46.5 54.9 38.2 65.7 56.6 62.5 69.5 94.7
Figure 9: The historical game length distributions are shown for the data used to train each of the headsets. On average, the games from Attacker were the shortest, and the games from StockyInference were the longest.
Table 9: Synopsis feature descriptions, saliency estimates, and ablation study results.
# Description | Loss PbS | Action PbS | Action PbS (bit=0) | Action PbS (bit=1) | Action PbS rank | Ablation acc. diff. (%)
0 East side (constant) 3.24 0.86 0.80 0.91 34 0.27
1 West side (constant) 3.12 0.82 0.84 0.81 43 0.00
2 South side (constant) 3.24 0.86 0.78 0.94 33 -0.08
3 North side (constant) 3.22 0.82 0.89 0.75 44 0.27
4 Rank 1 (constant) 3.15 0.80 0.81 0.76 50 -0.03
5 Rank 8 (constant) 3.12 0.82 0.85 0.61 42 -0.07
6 A-file (constant) 3.04 0.78 0.81 0.53 59 0.18
7 H-file (constant) 3.08 0.80 0.83 0.58 51 0.15
8 Dark squares (constant) 3.08 0.82 0.81 0.82 45 0.21
9 Light squares (constant) 3.03 0.78 0.77 0.78 61 0.01
10 Stage (move or sense) 7.80 3.14 2.82 3.45 0 -0.19
11 Not own piece 5.40 1.43 2.66 1.13 8 -0.29
12 Own pawns 4.16 1.14 1.07 1.73 14 0.01
13 Own knights 3.68 0.93 0.91 1.63 22 0.09
14 Own bishops 3.46 0.89 0.87 1.63 27 0.02
15 Own rooks 3.67 0.94 0.93 1.12 21 0.03
16 Own queens 3.28 0.87 0.85 2.24 32 0.06
17 Own king 3.14 0.79 0.79 0.88 52 0.10
18 Definitely not opposing pieces 3.85 1.14 1.03 1.21 13 -0.10
19 Definitely opposing pawns 3.49 1.01 1.02 0.73 17 -0.08
20 Definitely opposing knights 3.30 0.93 0.93 0.61 23 -0.05
21 Definitely opposing bishops 3.21 0.88 0.88 0.59 29 -0.02
22 Definitely opposing rooks 3.04 0.81 0.82 0.38 47 0.02
23 Definitely opposing queens 3.15 0.85 0.85 0.60 35 -0.10
24 Definitely opposing king 3.60 0.92 0.91 2.27 26 0.04
25 Possibly not opposing pieces 5.22 1.54 1.34 1.56 5 -0.04
26 Possibly opposing pawns 3.50 0.92 0.92 0.93 24 0.06
27 Possibly opposing knights 2.97 0.77 0.77 0.81 67 0.07
28 Possibly opposing bishops 2.95 0.75 0.74 0.89 70 0.09
29 Possibly opposing rooks 3.01 0.75 0.76 0.63 69 -0.18
30 Possibly opposing queens 3.05 0.78 0.77 1.05 57 -0.07
31 Possibly opposing kings 4.86 1.48 1.43 2.64 7 -0.04
32 Last from 2.77 0.72 0.72 0.83 76 -0.11
33 Last to 3.28 0.96 0.96 1.40 19 0.02
34 Last own capture 3.10 0.83 0.83 1.17 40 0.07
35 Last opposing capture 8.04 2.83 2.82 6.51 1 -0.08
36 Definitely attackable 2.72 0.70 0.62 0.78 84 -0.06
37 Definitely attackable somehow 2.73 0.71 0.65 0.78 80 -0.02
38 Possibly attackable 3.02 0.81 0.71 0.92 48 0.19
39 Definitely doubly attackable 2.67 0.66 0.63 0.80 92 -0.11
40 Definitely doubly attackable somehow 2.66 0.69 0.67 0.80 88 0.14
41 Possibly doubly attackable 2.71 0.75 0.73 0.83 71 -0.26
42 Definitely attackable by pawns 3.54 0.92 0.92 2.38 25 0.13
43 Possibly attackable by pawns 3.11 0.78 0.78 0.95 58 -0.10
44 Definitely attackable by knights 2.91 0.72 0.71 0.84 77 0.24
45 Definitely attackable by bishops 2.60 0.64 0.61 0.80 95 0.15
46 Possibly attackable by bishops 2.60 0.68 0.64 0.85 89 -0.07
47 Definitely attackable by rooks 2.63 0.65 0.64 0.75 93 0.07
48 Possibly attackable by rooks 2.74 0.70 0.69 0.77 81 0.00
49 Possibly attackable without king 2.72 0.70 0.63 0.79 82 0.19
50 Possibly attackable without pawns 2.63 0.67 0.62 0.73 90 0.17
51 Definitely attackable by opponent 3.25 0.87 0.91 0.77 31 -0.03
Table 10: Synopsis feature descriptions, saliency estimates, and ablation study results (continued).
# Description | Loss PbS | Action PbS | Action PbS (bit=0) | Action PbS (bit=1) | Action PbS rank | Ablation acc. diff. (%)
52 Possibly attackable by opponent 3.15 0.84 0.90 0.81 37 0.06
53 Definitely doubly attackable by opp. 2.56 0.65 0.66 0.56 94 -0.01
54 Possibly doubly attackable by opp. 2.67 0.71 0.73 0.66 79 -0.13
55 Definitely attackable by opp. pawns 3.10 0.87 0.87 1.69 30 0.12
56 Possibly attackable by opp. pawns 2.84 0.77 0.77 1.10 62 0.30
57 Definitely attackable by opp. knights 2.78 0.69 0.70 0.59 87 0.21
58 Possibly attackable by opp. knights 2.14 0.52 0.51 0.55 102 0.08
59 Definitely attackable by opp. bishops 2.66 0.67 0.67 0.63 91 0.09
60 Possibly attackable by opp. bishops 2.16 0.53 0.52 0.56 101 -0.09
61 Definitely attackable by opp. rooks 2.77 0.70 0.72 0.48 83 -0.19
62 Possibly attackable by opp. rooks 2.10 0.51 0.53 0.47 103 0.20
63 Possibly attackable by opp. w/o king 2.55 0.64 0.63 0.64 96 -0.17
64 Possibly attackable by opp. w/o pawns 2.45 0.62 0.61 0.62 97 -0.13
65 Possibly safe opposing king 6.04 2.06 2.01 3.38 2 0.07
66 Squares the opponent may move to 2.40 0.60 0.60 0.60 98 0.01
67 Possible castle state for opponent 3.09 0.79 0.79 0.72 53 0.00
68 Known squares 4.94 1.52 1.67 1.45 6 0.13
69 Own king’s king-neighbors 3.10 0.78 0.77 0.93 56 0.14
70 Own king’s knight-neighbors 2.82 0.71 0.70 0.91 78 0.31
71 Definitely opp. knights near king 3.09 0.79 0.79 1.64 54 0.13
72 Possibly opp. knights near king 5.13 1.72 1.72 2.77 4 -0.01
73 Own king’s bishop-neighbors 2.74 0.69 0.68 0.86 85 -0.10
74 Definitely opp. bishops near king 3.04 0.79 0.79 0.89 55 0.23
75 Possibly opp. bishops near king 5.23 1.75 1.75 2.41 3 -0.11
76 Own king’s rook-neighbors 2.76 0.69 0.68 0.83 86 -0.13
77 Definitely opp. rooks near king 3.10 0.81 0.81 0.87 49 0.31
78 Possibly opp. rooks near king 4.45 1.40 1.40 1.55 10 0.05
79 All own pieces 5.26 1.36 1.09 2.47 11 -0.01
80 Definitely empty squares 3.69 0.96 1.05 0.84 20 -0.13
81 May castle eventually 3.11 0.81 0.81 1.26 46 0.24
82 Possibly may castle 3.05 0.77 0.77 0.63 68 0.05
83 Definitely may castle 3.04 0.77 0.77 0.87 66 0.12
84 Own queens’ rook-neighbors 2.20 0.54 0.53 0.63 100 0.04
85 Own queens’ bishop-neighbors 2.33 0.57 0.57 0.67 99 0.06
86 Previous definitely not opp. pieces 3.82 0.88 0.87 0.89 28 -0.30
87 Previous definitely opp. pawns 4.16 1.16 1.18 0.88 12 0.14
88 Previous definitely opp. knights 3.02 0.77 0.77 0.72 63 0.12
89 Previous definitely opp. bishops 2.92 0.73 0.73 0.70 75 -0.02
90 Previous definitely opp. rooks 3.60 1.01 1.02 0.56 16 0.05
91 Previous definitely opp. queens 3.93 1.11 1.11 1.00 15 0.22
92 Previous definitely opp. king 3.33 0.83 0.82 1.58 38 -0.04
93 Previous possibly not opp. pieces 4.40 1.43 1.21 1.47 9 -0.05
94 Previous possibly opp. pawns 3.27 0.83 0.82 0.92 39 0.21
95 Previous possibly opp. knights 3.04 0.78 0.78 0.73 60 0.22
96 Previous possibly opp. bishops 3.10 0.74 0.74 0.78 73 0.16
97 Previous possibly opp. rooks 2.94 0.77 0.79 0.45 64 0.10
98 Previous possibly opp. queens 3.02 0.74 0.74 0.81 74 -0.07
99 Previous possibly opp. king 3.14 0.83 0.82 1.15 41 0.04
100 Previous last from 2.85 0.75 0.74 0.85 72 -0.04
101 Previous last to 3.36 1.00 1.00 1.50 18 0.21
102 Previous own capture 3.05 0.84 0.84 1.13 36 -0.09
103 Previous opposing capture 2.93 0.77 0.77 1.05 65 0.05