
Deep Synoptic Monte Carlo Planning in Reconnaissance Blind Chess

Gregory Clark
ML Collective, Google
[email protected]
Abstract

This paper introduces deep synoptic Monte Carlo planning (DSMCP) for large imperfect information games. The algorithm constructs a belief state with an unweighted particle filter and plans via playouts that start at samples drawn from the belief state. The algorithm accounts for uncertainty by performing inference on “synopses,” a novel stochastic abstraction of information states. DSMCP is the basis of the program Penumbra, which won the official 2020 reconnaissance blind chess competition versus 33 other programs. This paper also evaluates algorithm variants that incorporate caution, paranoia, and a novel bandit algorithm. Furthermore, it audits the synopsis features used in Penumbra with per-bit saliency statistics.

1 Introduction

Choosing a Nash equilibrium strategy is rational when the opponent is able to identify and exploit suboptimal behavior (Bowling and Veloso, 2001). However, not all opponents are so responsive, and computing a Nash equilibrium is intractable for many games. This paper presents deep synoptic Monte Carlo planning (DSMCP), an algorithm for large imperfect information games that seeks a best-response strategy rather than a Nash equilibrium strategy.

When opponents use fixed policies, an imperfect information game may be viewed as a partially observable Markov decision process (POMDP) with the opponents as part of the environment. DSMCP treats playing against specific opponents as related offline reinforcement learning (RL) problems and exploits predictability. Importantly, the structure of having opponents with imperfect information is preserved in order to account for their uncertainty.

Figure 1: Within 98k historical reconnaissance blind chess (RBC) games between non-random players, the win percentage tends to decrease as the number of possible states increases. The games were replayed while tracking up to one million states. Each bucket is labeled with an inclusive upper bound. The median of the maximum number of possible states encountered during a game is 4,869.
Figure 2: Playing RBC well requires balancing risks and rewards. (a) On the left, Penumbra moved the white queen to g8. After sensing at d2, Black could infer that the white queen occupied one of 25 squares. That uncertainty allowed the white queen to survive and capture the black king on the next turn. (b) On the right, Penumbra moved the black queen to h2. In this case, the opponent detected and captured the black queen. The games are available online at https://rbc.jhuapl.edu/games/120174 and https://rbc.jhuapl.edu/games/124718.

DSMCP uses sampling to break the “curse of dimensionality” (Pineau et al., 2006) in three ways: sampling possible histories with a particle filter, sampling possible futures with upper confidence bound tree search (UCT) (Kocsis and Szepesvári, 2006), and sampling possible world states within each information state uniformly. It represents information states with a generally-applicable stochastic abstraction technique that produces a “synopsis” from sampled world states. This paper assesses DSMCP on reconnaissance blind chess (RBC), a large imperfect information chess variant.

2 Background

Significant progress has been made in recent years in both perfect and imperfect information settings. For example, using deep neural networks to guide UCT has enabled monumental achievements in abstract strategy games as well as computer games (Silver et al., 2016, 2017a, 2017b; Schrittwieser et al., 2019; Wu, 2020; Tomašev et al., 2020). This work employs deep learning in a similar fashion.

Recent advancements in imperfect information games are also remarkable. Several programs have reached superhuman performance in Poker (Moravčík et al., 2017; Brown and Sandholm, 2018, 2019; Brown et al., 2020). In particular, ReBeL (Brown et al., 2020) combines RL and search by converting imperfect information games into continuous state space perfect information games with public belief states as nodes. This approach is powerful, but it relies on public knowledge and fails to scale to games with hidden actions and substantial private information, such as RBC.

Information set search (Parker et al., 2006, 2010) is a limited-depth algorithm for imperfect information games that operates on information states according to a minimax rule. This algorithm was designed for and evaluated on Kriegspiel chess, which is comparable to RBC.

Partially observable Monte Carlo planning (POMCP) (Silver and Veness, 2010) achieves optimal policies for POMDPs by tracking approximate belief states with an unweighted particle filter and planning with a variant of UCT on a search tree of histories. In practice, POMCP can suffer from particle depletion, requiring a domain-specific workaround. This work combines an unweighted particle filter with a novel information state abstraction technique which increases sample quality and supports deep learning.

Smooth UCT (Heinrich and Silver, 2015) and information set Monte Carlo tree search (ISMCTS) (Cowling et al., 2012) may be viewed as multi-agent versions of POMCP. These two algorithms for playing extensive-form games build search trees (for each player) of information states. These two algorithms and DSMCP all perform playouts from determinized states that are accurate from the current player’s perspective, effectively granting the opponent extra information. Still, Smooth UCT approached a Nash equilibrium by incorporating a stochastic bandit algorithm into its tree search. DSMCP uses a similar bandit algorithm that mixes in a learned policy during early node visits.

While adapting perfect information algorithms has performed surprisingly well in some imperfect information settings (Long et al., 2010), the theoretical guarantees of variants of counterfactual regret minimization (CFR) (Neller and Lanctot, 2013; Brown et al., 2018) are enticing. Online outcome sampling (OOS) (Lisy et al., 2015) extends Monte Carlo counterfactual regret minimization (MCCFR) (Lanctot et al., 2009) by building its search tree incrementally and targeting playouts to relevant parts of the tree. OOS draws samples from the beginning of the game. MCCFR and OOS are theoretically guaranteed to converge to a Nash equilibrium strategy. Notably, CFR-based algorithms produce mixed strategies, while DSMCP relies on incidental stochasticity.

Neural fictitious self-play (NFSP) (Heinrich and Silver, 2016) is an RL algorithm for training two neural networks for imperfect information games. Experiments with NFSP employed compact observation embeddings of information states. DSMCP offers a generic technique for embedding information states in large games. Dual sequential Monte Carlo (DualSMC) (Wang et al., 2019) estimates belief states and plans in a continuous domain via sequential Monte Carlo with heuristics.

3 Reconnaissance blind chess

Reconnaissance blind chess (RBC) (Newman et al., 2016; Markowitz et al., 2018; Gardner et al., 2020) is a chess variant that incorporates uncertainty about the placement of the opposing pieces along with a private sensing mechanism. As shown in Figure 2, RBC players are often faced with thousands of possible game states, and reducing uncertainty increases the odds of winning.

Game rules

Pieces move in the same way in RBC as in chess. Players cannot directly observe the movement of the opposing pieces. However, at the beginning of each turn, players may view the ground truth of any 3×3 patch of the board. The information gained from the sensing action remains private to that player. Players are also informed of the location of all captures, but not the identity of capturing pieces. When a requested move is illegal, the move is substituted with the closest legal move and the player is notified of the substitution. For example, in Figure 2 (a), if Black attempted to move the rook from h8 to f8, the rook would capture the queen on g8 and stop there instead. Players are always able to track the placement of their own pieces. Capturing the opposing king wins the game, and players are not notified about check. Passing and moving into check are legal.
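
To make the sensing mechanic concrete, the sketch below extracts the ground truth of a 3×3 patch around a chosen square. It is a minimal illustration only, assuming the third-party python-chess package for board representation rather than the API used in the official competition.

```python
# Minimal sketch of an RBC sense action using the python-chess package
# (an assumption for illustration; not the official competition API).
import chess

def sense_result(board: chess.Board, center: chess.Square) -> dict:
    """Return the ground-truth contents of the 3x3 patch centered on `center`."""
    cf, cr = chess.square_file(center), chess.square_rank(center)
    patch = {}
    for df in (-1, 0, 1):
        for dr in (-1, 0, 1):
            f, r = cf + df, cr + dr
            if 0 <= f < 8 and 0 <= r < 8:
                sq = chess.square(f, r)
                patch[sq] = board.piece_at(sq)  # None means the square is empty
    return patch

# Example: sensing at d2 on the starting position reveals nine squares.
print(sense_result(chess.Board(), chess.D2))
```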

Official competition

This paper introduces the program Penumbra, the winner of the official 2020 RBC rating competition. In total, 34 programs competed to achieve the highest rating by playing public games. Ratings were computed with BayesElo (Coulom, 2008), and playing at least 100 games was required to be eligible to win. Figure 2 shows ground truth positions from the tournament in which Penumbra voluntarily put its queen in danger. Players were paired randomly, but the opponent’s identity was provided at the start of each game, which allowed programs to tailor their strategies to specific opponents. However, opponents were free to change their strategies at any point, so attempting to exploit others could backfire. Nonetheless, Penumbra sought to model and counter predictable opponents rather than focusing on finding a Nash equilibrium.

Other RBC programs

RBC programs have employed a variety of algorithms (Gardner et al., 2020) including Q-learning (Mnih et al., 2013), counterfactual regret minimization (CFR) (Zinkevich et al., 2008), online outcome sampling (OOS) (Lisy et al., 2015), and the Stockfish chess engine (Romstad et al., 2020). Another strong RBC program (Highley et al., 2020; Blowitski and Highley, 2021) maintains a probability distribution for each piece. Most RBC programs select sense actions and move actions in separate ways, while DSMCP unifies all action selection. Savelyev (2020) also applied UCT to RBC and modeled the root belief state with a distribution over 10,000 tracked positions. Input to a neural network consisted of the most-likely 100 positions, and storing a single training example required 3.5MB on average, large enough to hinder training. Penumbra circumvents this issue by representing training examples with compact synopses, each less than 1kB.

4 Terminology

Consider the two-player extensive-form game with agents $\mathcal{P}=\{\text{self},\text{opponent}\}$, actions $\mathcal{A}$, “ground truth” world states $\mathcal{X}$, and initial state $x_0\in\mathcal{X}$. Each time an action is taken, each agent $p\in\mathcal{P}$ is given an observation $\mathbf{o}_p\in\mathcal{O}$ that matches ($\sim$) the possible world states from $p$’s perspective. For simplicity, assume the game has deterministic actions such that each $a\in\mathcal{A}$ is a function $a:X\rightarrow\mathcal{X}$ defined on a subset of world states $X\subset\mathcal{X}$. Define $\mathcal{A}_x$ as the set of actions available from $x\in\mathcal{X}$.

An information state (infostate) $s\in\mathcal{S}$ for agent $p$ consists of all observations $p$ has received so far. (An infostate is equivalent to an information set, which is the set of all possible action histories from $p$’s perspective (Osborne and Rubinstein, 1994).) Let $\mathcal{X}_s\subset\mathcal{X}$ be the set of all world states that are possible from $p$’s perspective from $s$. In general, $\mathcal{X}_s$ contains less information than $s$ since some (sensing) actions may not affect the world state. Define a collection of limited-size world state sets $\mathcal{L}=\{L\subset\mathcal{X}_s : s\in\mathcal{S}, |L|\leq\ell\}$, given a constant $\ell$.

Let $\rho:\mathcal{X}\rightarrow\mathcal{P}$ indicate the agent to act in each world state. Assume that $\mathcal{A}_x=\mathcal{A}_y$ and $\rho(x)=\rho(y)$ for all $x,y\in\mathcal{X}_s$ and $s\in\mathcal{S}$. Then extend the definitions of actions available $\mathcal{A}_*$ and agent to act $\rho$ over sets of world states and over infostates in the natural way. A policy $\pi(a|s)$ is a distribution over actions given an infostate. A belief state $B(h)$ is a distribution over action histories. Creating a belief state from an infostate requires assumptions about the opponent’s action policy $\tau(a|s)$. Let $\mathcal{R}_p:\mathcal{X}\rightarrow\mathbb{R}$ map terminal states to the reward for player $p$. Then $(\mathcal{S},\mathcal{A},\mathcal{R}_{\text{self}},\tau,s_0)$ is a POMDP, where the opponent’s policy $\tau$ provides environment state transitions and $s_0$ is the initial infostate. In the rest of this paper, the word “state” refers to a world state unless otherwise specified.

5 Deep synoptic Monte Carlo planning

Figure 3: DSMCP approximates infostates with size-limited sets of possible states (circles). It tracks all possible states $X_t$ for each turn from its own perspective and constructs belief states $\hat{B}_t$ with approximate infostates from the opponent’s perspective. At the root of each playout, the initial approximate infostate for the opponent is sampled from $\hat{B}_t$, and the initial approximate infostate for itself is a random subset of $X_t$.
Algorithm 1 Bandit – Action selection with a stochastic multi-armed bandit
  Given: $c>0$, $m\geq 0$
  Input: policy $\pi$, visit counts $\vec{n}$, value totals $\vec{q}$
  $n\leftarrow\sum_{a}\vec{n}_a$
  if $e^{-mn}>\texttt{Uniform}([0,1])$ and $\neg\text{root}$ then
     return $a\leftarrow$ an action selected by policy $\pi$
  else
     return $a\leftarrow\operatorname{argmax}_a\left(\frac{\vec{q}_a}{\vec{n}_a}+c\pi_a\sqrt{\frac{\ln n}{\vec{n}_a}}\right)$
Algorithm 2 DrawSample – Sample selection with rejection
  Given: $k,\ell\in\mathbb{Z}^{+}$, synopsis function $\sigma$, policy $\hat{\tau}$
  Input: previous belief distribution $\hat{B}$, possible states $X$, visit counts $\mathbf{N}$, value totals $\mathbf{Q}$
  for 0 to $k$ do
     $J\leftarrow$ random sample from $\hat{B}$
     $a\leftarrow\texttt{Bandit}(\hat{\tau}(\sigma(J)),\mathbf{N}_J,\mathbf{Q}_J)$
     $I\leftarrow\{a(x):x\in J\}$
     $I\leftarrow$ random $\ell$ states from $I$ if $|I|>\ell$
     return $I$ if $I\cap X\neq\varnothing$
  return $\{$ random state from $X$ $\}$

Effective planning algorithms for imperfect information games must model agents’ choice of actions based on (belief states derived from) infostates, not on world states themselves. Deep synoptic Monte Carlo planning (DSMCP) approximates infostates with size-limited sets of possible world states in $\mathcal{L}$. It uses those approximations both to construct a belief state and as UCT nodes (Kocsis and Szepesvári, 2006). Figure 3 provides a high-level visualization of the algorithm.
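
As a concrete illustration of this approximation, the sketch below (hypothetical helper names, standard library only) subsamples a set of possible world states down to the size limit $\ell$, optionally keeping one designated state, mirroring how the playouts in Algorithm 3 always retain the playout’s ground-truth state.

```python
# Illustrative sketch of the size-limited infostate approximation
# (hypothetical helper, not Penumbra's code).
import random

def approximate_infostate(possible_states, ell=128, must_include=None):
    """Subsample a set of possible world states down to at most `ell` states."""
    states = set(possible_states)
    if len(states) <= ell:
        return states
    keep = {must_include} if must_include in states else set()
    keep |= set(random.sample(list(states - keep), ell - len(keep)))
    return keep
```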

Algorithm 3 ChooseAction – UCT playouts
  Given: $b,d,\ell,n_{\text{vl}},z\in\mathbb{Z}^{+}$, synopsis func. $\sigma$, policy $\pi$, opp. policy $\hat{\tau}$, value func. $\nu$
  Input: belief distribution $\hat{B}$, possible states $X$, visit counts $\mathbf{N}$, value totals $\mathbf{Q}$
  $\bar{\pi}\leftarrow$ average of $\pi\circ\sigma$ over $b$ samples from $\hat{B}$
  while time is left do
     $J\leftarrow$ random sample from $\hat{B}$
     $I\leftarrow$ random $\ell$ states in $X$ including one in $J$
     $a_0\leftarrow\texttt{Bandit}(\bar{\pi},\mathbf{N}_{\text{root}},\mathbf{Q}_{\text{root}})$
     for $t=0$ to $d$ do
        $x_t\leftarrow$ the one state in $I\cap J$
        $\mathbf{o}\leftarrow$ observations when $a_t$ is played on $x_t$
        $I\leftarrow\{a(x):x\in I, a\in\mathcal{A}_I, \mathbf{o}_{\text{self}}\sim a(x)\}$
        $J\leftarrow\{a(x):x\in J, a\in\mathcal{A}_J, \mathbf{o}_{\text{opp}}\sim a(x)\}$
        $I\leftarrow$ random $\ell$ states in $I$ including $a_t(x_t)$
        $J\leftarrow$ random $\ell$ states in $J$ including $a_t(x_t)$
        $K_t\leftarrow I$ and $\mu\leftarrow\pi$ if $I$ is to act
        $K_t\leftarrow J$ and $\mu\leftarrow\hat{\tau}$ if $J$ is to act
        $d \mathrel{+}= 1$ if $\mathbf{N}_{K_t,a_t}>z$
        for $i=0$ to $t$ do
           $\mathbf{N}_{K_i,a_i} \mathrel{+}= n_{\text{vl}}$
        $a_{t+1}\leftarrow\texttt{Bandit}(\mu(\sigma(K_t)),\mathbf{N}_{K_t},\mathbf{Q}_{K_t})$
        $q\leftarrow\nu(\sigma(K_t))$
        for $i=0$ to $t$ do
           $\mathbf{Q}_{K_i,a_i} \mathrel{+}= q$ if $\rho(K_t)=\rho(K_i)$ else $-q$
           $\mathbf{N}_{K_i,a_i} \mathrel{+}= 1-n_{\text{vl}}$
  return $a\leftarrow\operatorname{argmax}_a\mathbf{Q}_{\text{root}}$
Algorithm 4 PlayGame – DSMCP
  Given: $n_{\text{particles}}\in\mathbb{Z}^{+}$
  $\mathbf{N}_{*,*}\leftarrow\mathbf{0}$ // Visits $\forall$ (sample, action) $\in\mathcal{L}\times\mathcal{A}$
  $\mathbf{Q}_{*,*}\leftarrow\mathbf{0}$ // Values $\forall$ (sample, action) $\in\mathcal{L}\times\mathcal{A}$
  $X_0\leftarrow\{x_0\}$
  $\hat{B}_0\leftarrow\{X_0\}$
  while the game is not over do
     $t\leftarrow$ current turn
     $\mathbf{o}\leftarrow$ current observation
     // Track all possible world states
     $X_t\leftarrow\{a(x):x\in X_{t-1}, a\in\mathcal{A}_x, \mathbf{o}_{\text{self}}\sim a(x)\}$
     // Filter belief states with the new information
     for $i=t-1$ to 0 do
        $X_i\leftarrow\{x\in X_i : \exists a\in\mathcal{A}_x \text{ such that } a(x)\in X_{i+1}\}$
        $\hat{B}_i\leftarrow\{I\in\hat{B}_i : I\cap X_i\neq\varnothing\}$
     // Repopulate belief states with new particles
     while opponent to act or $|\hat{B}_t|<n_{\text{particles}}$ do
        $i\leftarrow$ smallest $i>0$ s.t. $|\hat{B}_i|<n_{\text{particles}}$
        $I\leftarrow\texttt{DrawSample}(\hat{B}_{i-1},X_i,\mathbf{N},\mathbf{Q})$
        $\hat{B}_i\leftarrow\hat{B}_i\cup\{I\}$
     if self to act then
        $a\leftarrow\texttt{ChooseAction}(\hat{B}_t,X_t,\mathbf{N},\mathbf{Q})$
        Perform action $a$

A bandit algorithm chooses an action during each node visit, as described in Algorithm 1. This bandit algorithm is similar to Smooth UCB (Heinrich and Silver, 2015) in that they both introduce stochasticity by mixing in a secondary policy. Smooth UCB empirically approached a Nash equilibrium utilizing the average policy according to action visits at each node. DSMCP mixes in a neural network’s policy ($\pi$) instead. The constant $c$ controls the level of exploration, and $m$ controls how the policy $\pi$ is mixed into the bandit algorithm. For example, taking $m=0$ always selects actions directly with $\pi$ without considering visit counts, and taking $m=\infty$ never mixes in $\pi$. As in Silver et al. (2016), $\pi$ provides per-action exploration values which guide the tree search.
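
The sketch below is a rough Python rendering of Algorithm 1 under assumed conventions (numpy arrays for the per-action visit and value totals, and a policy vector that sums to one); it is illustrative rather than Penumbra’s exact implementation.

```python
# Sketch of Algorithm 1. With probability exp(-m*n) the learned policy is
# sampled directly; otherwise a UCB1-style rule with policy priors is used.
import math
import random
import numpy as np

def bandit(policy, visits, values, c=2.0, m=1.0, at_root=False):
    n = visits.sum()
    if not at_root and random.random() < math.exp(-m * n):
        return int(np.random.choice(len(policy), p=policy))
    safe = np.maximum(visits, 1e-8)           # avoid dividing by zero
    scores = values / safe + c * policy * np.sqrt(math.log(max(n, 1)) / safe)
    return int(np.argmax(scores))
```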

Approximate belief states are constructed as subsets $\hat{B}\subset\mathcal{L}$, where each $L\in\hat{B}$ is a set of possible world states from the opponent’s perspective. This “second order” representation of belief states allows DSMCP to account for the opponent’s uncertainty. Infostates sampled with rejection (Algorithm 2) are used as the “particles” in a particle filter which models successive belief states. Sampling is guided by a neural network policy ($\hat{\tau}$) based on the identity of the opponent. To counter particle deprivation, if $k$ consecutive candidate samples are rejected as incompatible with the possible world states, then a singleton sample consisting of a randomly-chosen possible state is selected instead.
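
A simplified sketch of this rejection sampling is shown below; `step_particle` is a hypothetical callback that advances a sampled particle by one opponent action (chosen with the bandit rule) and returns the resulting set of world states.

```python
# Sketch of Algorithm 2's rejection sampling (hypothetical names). A candidate
# particle drawn from the previous belief state is advanced one step and kept
# only if it remains consistent with the tracked possible states X; after k
# failures a singleton fallback particle is used.
import random

def draw_sample(prev_belief, possible_states, step_particle, k=512, ell=128):
    for _ in range(k):
        particle = random.choice(prev_belief)      # a set of world states
        candidate = step_particle(particle)        # apply one sampled action
        if len(candidate) > ell:
            candidate = set(random.sample(list(candidate), ell))
        if candidate & possible_states:            # consistent with X?
            return candidate
    return {random.choice(list(possible_states))}
```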

The tree search, described in Algorithm 3, tracks an approximate infostate for each player while simulating playouts. Playouts are also guided by policy ($\pi$ and $\hat{\tau}$) and value ($\nu$) estimations from a neural network. A synopsis function $\sigma$ creates a fixed-size summary of each node as input for the network. The constant $b$ is the batch size for inference, $d$ is the search depth, $\ell$ is the size of approximate infostates, $n_{\text{vl}}$ is the virtual loss weight, and $z$ is a threshold for increasing search depth.

Algorithm 4 describes how to play an entire game, tracking all possible world states. Approximate belief states ($\hat{B}_t$) are constructed for each past turn by tracking $n_{\text{particles}}$ elements of $\mathcal{L}$ (from the opponent’s point of view) with an unweighted particle filter. Each time the agent receives a new observation, all of the (past) particles that are inconsistent with the observation are filtered out and replenished, starting with the oldest belief states.
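
The particle-filter maintenance step can be pictured with the simplified sketch below (hypothetical names; the full Algorithm 4 also re-filters the possible-state sets $X_i$ for earlier turns before pruning particles).

```python
# Simplified sketch of belief maintenance in Algorithm 4 (hypothetical names):
# drop particles that contradict the filtered possible-state sets, then
# replenish depleted belief states, oldest first, using DrawSample.
def filter_and_repopulate(beliefs, possible, draw_sample, n_particles=4096):
    # beliefs[i] is a list of particles (sets of world states) for turn i;
    # possible[i] is the set of all states still possible at turn i.
    for i, X in enumerate(possible):
        beliefs[i] = [J for J in beliefs[i] if J & X]
    for i in range(1, len(beliefs)):
        while len(beliefs[i]) < n_particles:
            beliefs[i].append(draw_sample(beliefs[i - 1], possible[i]))
    return beliefs
```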

5.1 Synopsis

# | White (sense : move) | Black (sense : move)
1 | g6 : e2e4 | e3 : h7h5
2 | g7 : d2d4 | f2 : f7f5
3 | g6 : e4f5 | e4 : h5h4
4 | d7 : f1e2 | g4 : b8c6
5 | g7 : e2h5 | g4 : h8h5
6 | b7 : d1h5 | g5 : g7g6
7 | e7 : h5e8 | g6 : d7d6
8 | g6 : g6e8 |
Figure 4: Penumbra played as White in this short game; each entry lists the sense square followed by the requested move. In the position shown on the left, Black just moved a knight from b8 to c6. From White’s perspective, the black pieces could be placed in 238 different ways. Figure 5 shows a set of synopsis bitboards for this information state.
Figure 5: This set of synopsis bitboards was used as input to the neural network before White’s sense on turn 5 of the game in Figure 4. The synopsis contains 104 bitboards. Each bitboard encodes 64 binary features of the possible state set that the synopsis summarizes. For example, bitboard #26 contains the possible locations of opposing pawns, and bitboard #27 contains the possible locations of opposing knights. An attentive reader may notice that the black pawn on h4 is missing from bitboard #26, which is due to subsampling to $\ell=128$ states before computing the bitboards. In this case, the true state was missing from the set of states used to create the synopsis. The features in each synopsis are only approximations of the infostates that they represent. The first 10 bitboards are constants, which provide information that is difficult for convolutions to construct otherwise (Liu et al., 2018).
(Figure 6 diagram: an 8×8×104 synopsis input feeds a tower of ten 8×8×128 residual blocks, which feeds 14 headsets; each headset produces Policy, Value, SoonWin, SoonLose, and PieceCount outputs.)
Figure 6: Penumbra’s network contains a shared tower with 10 residual blocks and 14 headsets. Each headset contains 5 heads for a total of 70 output heads. The residual blocks are shown with a double border, and they each contain two 3×3 convolutional layers and batch normalization. All of the convolutional layers in the headsets are 1×1 convolutions with the exception of the one residual block for each policy head. Each headset was trained on a separate subset of the data, as described in Table 1. The policy head provides logits for both sense and move actions.

One of the contributions of this paper is the methodology used to approximate and encode infostates. Games that consist of a fixed number of turns, such as poker, admit a naturally-compact infostate representation based on observations (Heinrich and Silver, 2015, 2016). However, perfect representations are not always practical. Game abstractions are often used to reduce computation and memory requirements. For example, imperfect recall is an effective abstraction when past actions are unnecessary for understanding the present situation (Waugh et al., 2009; Lanctot et al., 2012).

DSMCP employs a stochastic abstraction which represents infostates with sets of world states and then subsamples to a manageable cardinality ($\ell$). Finally, a permutation-invariant synopsis function $\sigma$ produces fixed-size summaries of the approximate infostates which are used for inference. An alternative is to run inference on “determinized” world states individually and then somehow aggregate the results. However, such aggregation can easily lead to strategy fusion (Frank et al., 1998). Other alternatives include evaluating states with a recurrent network (Rumelhart et al., 1986) one-at-a-time or using a permutation-invariant architecture (Zaheer et al., 2017; Wagstaff et al., 2019).

Given functions $g_i:\mathcal{X}\rightarrow\{0,1\}$ for $i=0,\dots,F$ that map states to binary features, define the $i^{\text{th}}$ component of a synopsis function $\sigma:\mathcal{L}\rightarrow\{0,1\}^{F}$ as

$\sigma_i(X)=g_i(x_0)*_i g_i(x_1)*_i\dots*_i g_i(x_\ell)$   (1)

where $X=\{x_0,x_1,\dots,x_\ell\}$ and $*_i$ is either the logical AND ($\land$) or the logical OR ($\lor$) operation. For example, if $g_i$ encodes whether an opposing knight can move to the d7 square of a chess board and $*_i=\land$, then $\sigma_i$ indicates that a knight can definitely move to d7. Figure 4 shows an example game, and Figure 5 shows an example output of Penumbra’s synopsis function, which consists of 104 bitboard feature planes each with 64 binary features. The appendix describes each feature plane.
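
The sketch below illustrates Equation 1 with generic binary features (hypothetical names): each feature is reduced across the sampled states with either AND (“true in every possible state”) or OR (“true in some possible state”).

```python
# Sketch of Equation 1 (hypothetical names): reduce each binary feature g_i
# over the sampled world states with AND or OR to form one synopsis bit.
def synopsis(states, features):
    """`features` is a list of (g, op) pairs, g: state -> 0/1, op in {"and", "or"}."""
    bits = []
    for g, op in features:
        values = [bool(g(x)) for x in states]
        bits.append(int(all(values) if op == "and" else any(values)))
    return bits

# Example with toy features on integer "states":
# bit 0 (OR): some state is even; bit 1 (AND): every state is positive.
print(synopsis({2, 3, 5}, [(lambda x: x % 2 == 0, "or"), (lambda x: x > 0, "and")]))
```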

5.2 Network architecture

Penumbra uses a residual neural network (He et al., 2016) as shown in Figure 6. The network contains 14 headsets, designed to model specific opponents and regularize each other as they are trained on different slices of data (Zhang et al., 2020). Each headset contains 5 heads: a policy head, a value head, two heads for predicting winning and losing within the next 5 actions, and a head for guessing the number of pieces of each type in the ground truth world state. The Top policy head and the All value head are used for planning as $\pi$ and $\nu$, respectively. The other heads (including the SoonWin, SoonLose, and PieceCount heads) provide auxiliary tasks for further regularization (Wu, 2020; Fifty et al., 2020). While playing against an opponent that is “recognized” (when a headset was trained on data from only that opponent), the policy head ($\hat{\tau}$) of the corresponding headset is used for the opponent’s moves while progressing the particle filter (Algorithm 2) and while constructing the UCT tree (Algorithm 3). When the opponent is unrecognized, the Top policy head is used by default.
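
For illustration, a rough PyTorch sketch of this architecture is given below; the layer shapes, head structure, and output sizes are assumptions read off Figure 6, not Penumbra’s code, and only the policy and value heads are sketched.

```python
# Rough PyTorch sketch of a shared residual tower with multiple headsets
# (assumed shapes; illustrative only).
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch=128):
        super().__init__()
        self.c1, self.b1 = nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch)
        self.c2, self.b2 = nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch)

    def forward(self, x):
        y = torch.relu(self.b1(self.c1(x)))
        return torch.relu(x + self.b2(self.c2(y)))

class Headset(nn.Module):
    def __init__(self, ch=128):
        super().__init__()
        # Policy: 8x8x65 map of logits, flattened; Value: scalar in [-1, 1].
        self.policy = nn.Sequential(nn.Conv2d(ch, 65, 1), nn.Flatten())
        self.value = nn.Sequential(nn.Conv2d(ch, 4, 1), nn.Flatten(),
                                   nn.Linear(4 * 64, 256), nn.ReLU(),
                                   nn.Linear(256, 1), nn.Tanh())

    def forward(self, x):
        return self.policy(x), self.value(x)

class PenumbraLikeNet(nn.Module):
    def __init__(self, n_headsets=14, blocks=10, ch=128):
        super().__init__()
        self.stem = nn.Conv2d(104, ch, 3, padding=1)
        self.tower = nn.Sequential(*[ResBlock(ch) for _ in range(blocks)])
        self.headsets = nn.ModuleList(Headset(ch) for _ in range(n_headsets))

    def forward(self, x, headset=0):
        h = self.tower(torch.relu(self.stem(x)))
        return self.headsets[headset](h)

# Usage: logits, value = PenumbraLikeNet()(torch.zeros(1, 104, 8, 8), headset=0)
```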

5.3 Training procedure

The network was trained on historical game data as described by Table 1. (The games were downloaded from rbmc.jhuapl.edu in June 2019 and rbc.jhuapl.edu in August 2020; additionally, 5,000 games were played locally by StockyInference.) The reported accuracies are averages over 5 training runs. The All headset was trained on all games, the Top headset was trained on games from the highest-rated players, the Human headset was trained on all games played by humans, and each of the other 11 headsets was trained to mimic a specific opponent.

10% of the games were used as validation data based on game filename hashes. Training examples were extracted from games multiple times since reducing possible state sets to \ell states is non-deterministic. A single step of vanilla stochastic gradient descent was applied to one headset at a time, alternating between headsets according to their training weights. See the appendix for hyperparameter settings and accuracy cross tables. Training and evaluation were run on four RTX 2080 Ti GPUs.
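
The alternation between headsets can be sketched as follows; this is one plausible reading of the schedule, and `loss_for_headset` and the per-headset data iterators are hypothetical stand-ins rather than Penumbra’s training code.

```python
# Sketch of alternating single SGD steps across headsets in proportion to
# their training weights (hypothetical names; PyTorch SGD for concreteness).
import random
import torch

def train(model, loss_for_headset, batches_by_headset, weights, n_steps, lr=0.0005):
    opt = torch.optim.SGD(model.parameters(), lr=lr)   # vanilla SGD
    names = list(weights)
    probs = [weights[name] for name in names]
    for _ in range(n_steps):
        name = random.choices(names, weights=probs, k=1)[0]
        batch = next(batches_by_headset[name])          # data slice for this headset
        loss = loss_for_headset(model, name, batch)     # hypothetical loss function
        opt.zero_grad()
        loss.backward()
        opt.step()
```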

Table 1: Summary of headset training data and resulting validation accuracies

Name | # Games | # Train Actions | # Validation Actions | Multiplicity | Training Weight | Stop Gradient | % Top-1 Accuracy | % Top-5 Accuracy | % Winner Accuracy
All | 85.6k | 16.8M | 1.8M | 4 | 15 | No | 33.6 | 62.0 | 74.8
Top | 49.5k | 16.0M | 1.6M | ~9.6 | 13 | No | 43.1 | 73.4 | 82.4
StrangeFish | 13.3k | 8.1M | 0.8M | ~20.7 | 10 | No | 45.9 | 74.3 | 86.6
LaSalle | 3.7k | 4.3M | 0.4M | 32 | 4 | No | 36.4 | 68.2 | 69.7
Dyn.Entropy | 7.8k | 7.8M | 0.8M | 32 | 7 | No | 50.0 | 76.7 | 78.9
Oracle | 20.3k | 17.6M | 1.9M | 32 | 9 | No | 49.3 | 77.4 | 81.4
StockyInfe. | 10.7k | 13.5M | 1.4M | 16 | 9 | No | 45.2 | 73.6 | 68.3
Marmot | 10.2k | 3.9M | 0.4M | 16 | 8 | No | 24.6 | 55.4 | 80.8
Genetic | 5.6k | 2.3M | 0.2M | 16 | 8 | No | 40.4 | 70.5 | 77.9
Zugzwang | 10.9k | 3.1M | 0.3M | ~12.0 | 5 | No | 47.0 | 71.6 | 80.1
Trout | 18.0k | 5.1M | 0.5M | 16 | 5 | No | 41.8 | 63.4 | 79.8
Human | 6.3k | 1.5M | 0.2M | 16 | 2 | No | 24.9 | 51.9 | 73.3
Attacker | 15.3k | 1.7M | 0.2M | 16 | 4 | No | 45.1 | 61.9 | 80.1
Random | 17.0k | 0.6M | 76k | 2 | 1 | Yes | 4.5 | 20.7 | 94.7

5.4 Implementation details

Penumbra plays RBC with DSMCP along with RBC-specific extensions. First, sense actions that are dominated by other sense actions are pruned from consideration. Second, Penumbra can detect some forced wins in the sense phase, the move phase, and during the opponent’s turn. This static analysis is applied at the root and to playouts; playouts are terminated as soon as a player could win, avoiding unnecessary neural network inference. The static analysis was also used to clean training games in which the losing player had sufficient information to find a forced win.

Piece placements are represented with bitboards (Browne, 2014), and the tree of approximate infostates is implemented with a hash table. Zobrist hashing (Zobrist, 1990) maintains hashes of piece placements incrementally. Hash table collisions are resolved by overwriting older entries. The tree was not implemented until after the competition, so fixed-depth playouts were used instead ($m=0$). Inference is done in batches of 256 during both training and online planning. The time used per action is approximately proportional to the time remaining. The program processes approximately 4,000 nodes per second, and it plays randomly when the number of possible states exceeds 9 million.
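
The incremental hashing can be sketched as follows; this is a generic Zobrist scheme with one random 64-bit key per (square, piece type), not Penumbra’s exact table layout.

```python
# Generic sketch of incremental Zobrist hashing for piece placements:
# moving a piece XORs its key off the old square and onto the new square in O(1).
import random

rng = random.Random(0)
ZOBRIST = [[rng.getrandbits(64) for _ in range(12)] for _ in range(64)]

def toggle(h, square, piece):
    """XOR a piece on or off a square; applying twice restores the hash."""
    return h ^ ZOBRIST[square][piece]

def apply_move(h, piece, from_sq, to_sq):
    return toggle(toggle(h, from_sq, piece), to_sq, piece)
```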

6 Experiments

This section presents the results of playing games between Penumbra and several publicly available RBC baselines (Gardner et al., 2020; Bernardoni, 2020). Each variant of Penumbra in Table 2 played 1000 games against each baseline, and each variant in Table 3 and Table 4 played 250 games against each baseline. Games with errors were ignored and replayed. The Elo ratings and 95% confidence intervals were computed with BayesElo (Coulom, 2008) and are all compatible. The scale was anchored with StockyInference at 1500 based on its rating during the competition.

Table 2 gives ratings of the baselines and five versions of Penumbra. PenumbraCache relied solely on the network policy for action selection in playouts ($m=0$), PenumbraTree built a UCT search tree ($m=\infty$), and PenumbraMixture mixed in the network policy during early node visits ($m=1$). The mixed strategy performed the best. PenumbraNetwork selected actions based on the network policy without performing any playouts. PenumbraSimple is the same as PenumbraMixture with the static analysis described in Section 5.4 disabled. PenumbraNetwork and PenumbraSimple serve as ablation studies; removing the search algorithm is detrimental while the effect of removing the static analysis is not statistically significant. Unexpectedly, Penumbra played the strongest against StockyInference when that program was unrecognized. So, in this case, modeling the opponent with a stronger policy outperformed modeling it more accurately.

Two algorithmic modifications that give the opponent an artificial advantage during planning were investigated. Table 3 reports the results of a grid search over “cautious” and “paranoid” variants of DSMCP. The caution parameter $\kappa$ specifies the percentage of playouts that use $\ell=4$ for the opponent instead of the higher default limit. Since each approximate infostate is guaranteed to contain the correct ground truth in playouts, reducing $\ell$ for the opponent gives the opponent higher-quality information, allowing the opponent to counter risky play more easily in the constructed UCT tree.

The paranoia parameter augments the exploration values in Algorithm 1 to incorporate the minimum value seen during the current playout. With paranoia $\phi$, actions are selected according to

$\operatorname{argmax}_a\left((1-\phi)\frac{\vec{q}_a}{\vec{n}_a}+\phi\vec{m}_a+c\pi_a\sqrt{\frac{\ln n}{\vec{n}_a}}\right)$   (2)

where $\vec{m}$ contains the minimum value observed for each action. This is akin to the notion of paranoia studied by Parker et al. (2006, 2010).
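
A sketch of the paranoid selection rule, under the same assumed conventions as the earlier bandit sketch, is shown below.

```python
# Sketch of Equation 2 (assumed conventions as before): blend each action's
# mean value with the minimum value observed for it during the current playout.
import numpy as np

def paranoid_bandit(policy, visits, values, min_values, c=2.0, phi=0.2):
    n = max(visits.sum(), 1)
    safe = np.maximum(visits, 1e-8)
    scores = ((1 - phi) * values / safe + phi * min_values
              + c * policy * np.sqrt(np.log(n) / safe))
    return int(np.argmax(scores))
```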

Table 2: Bot Elo scores
Bot | Recognized as | Elo score
PenumbraMixture | unrecognized | 1747 ± 11
PenumbraSimple | unrecognized | 1739 ± 10
PenumbraCache | unrecognized | 1727 ± 10
PenumbraTree | unrecognized | 1641 ± 9
StockyInference | Trout | 1610 ± 7
StockyInference | StrangeFi. | 1528 ± 8
StockyInference | Genetic | 1512 ± 8
StockyInference | StockyInf. | 1500 ± 8
StockyInference | unrecognized | 1474 ± 8
PenumbraNetwork | unrecognized | 1376 ± 9
AggressiveTree | unrecognized | 1134 ± 15
FullMonte | unrecognized | 1028 ± 20
Trout | Trout | 1005 ± 22
Trout | unrecognized | 997 ± 22
Table 3: Caution and paranoia grid search results
Paranoia \ Caution | 0% | 10% | 20% | 30%
0% | 1711±19 | 1714±19 | 1707±19 | 1702±19
10% | 1711±19 | 1705±19 | 1726±19 | 1695±18
20% | 1688±18 | 1700±19 | 1688±18 | 1670±18
30% | 1691±18 | 1683±18 | 1681±18 | 1666±18
Table 4: Exploration strategy grid search results
Bandit \ Exploration ratio c | 1 | 2 | 4
UCB1 | 1698±19 | 1686±18 | 1696±18
aVoP | 1696±18 | 1680±18 | 1695±18

Table 4 shows the results of a grid search over exploration constants and two bandit algorithms. UCB1 (Kuleshov and Precup, 2014) (with policy priors), which is used on the last line of Algorithm 1, is compared with “a variant of PUCT” (aVoP) (Silver et al., 2016; Tian et al., 2019; Lee et al., 2019), another popular bandit algorithm. This experiment used $\kappa=20\%$ and $\phi=20\%$. Figure 7 shows that Penumbra’s value head accounts for the uncertainty of the underlying infostate.

Figure 7: The mean historical win percentage and the mean network value assigned to (a) train and (b) validation synopses tend to decrease as the number of world states given to $\sigma$ increases.

7 Per-bit saliency

Saliency methods may be able to identify which of the synopsis feature planes are most important and which are least important. Gradients only provide local information, and some saliency methods fail basic sanity checks (Adebayo et al., 2018). Higher quality saliency information may be surfaced by integrating gradients over gradually-varied inputs (Sundararajan et al., 2017; Kapishnikov et al., 2019) and by smoothing gradients locally (Smilkov et al., 2017). Those saliency methods are not directly applicable to discrete inputs such as the synopses used in this work. So, this paper introduces a saliency method that aggregates gradient information across two separate dimensions: training examples and iterations. Per-batch saliency (PBS) averages the absolute value of gradients over random batches of test examples throughout training. Similarly, per-bit saliency (PbS) averages the absolute value of gradients over bits (with specific values) within batches of test examples throughout training. Gradients were taken both with respect to the loss and with respect to the action policy.
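
A minimal sketch of the per-batch statistic is shown below; a generic model and loss function are assumed, and the synopses are treated as float tensors so that input gradients are defined.

```python
# Sketch of per-batch saliency (PBS): average the absolute input gradient over
# a batch of test examples. Per-bit saliency (PbS) additionally groups the same
# gradients by each bit's value before averaging.
import torch

def per_batch_saliency(model, batch, targets, loss_fn):
    x = batch.clone().float().requires_grad_(True)
    loss = loss_fn(model(x), targets)
    loss.backward()
    return x.grad.abs().mean(dim=0)   # mean |gradient| per input feature
```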


Figure 8: (a) The loss per-batch saliency (PBS) and (b) the action per-bit saliency (PbS) are taken on test examples during training. These graphs show the saliency of feature plane #8, dark squares, for each headset in one training run. The large gradients with respect to the loss suggest that the Genetic headset has overfit. (c) The loss PBS and (d) the action PbS provide insight about which synopsis features are most useful. The top-five most-salient feature planes and the least-salient feature plane for the Top headset from one training run are shown.

Figure 8 shows saliency information for input synopsis features used by Penumbra. In order to validate that these saliency statistics are meaningful, the model was retrained 104 times, once with each feature removed (Hooker et al., 2018). Higher saliency is slightly correlated with decreased performance when a feature is removed. The correlation coefficient with the average change in accuracy is $-0.208$ for loss-PBS and $-0.206$ for action-PbS. Explanations for the low correlation include noise in the training process and the presence of closely-related features. Ultimately, the contribution of a feature during training is distinct from how well the model can do without that feature. Since some features are near-duplicates of others, removing one may simply increase dependence on another. Still, features with high saliency, such as the current stage (sense or move) and the location of the last capture, are likely to be the most important, and features with low saliency may be considered for removal. The appendix includes saliency statistics for each feature plane.

8 Discussion

Broader impact

DSMCP is more broadly applicable than some prior algorithms for imperfect information games, which are intractable in settings with large infostates and small amounts of shared knowledge (Brown et al., 2020). RBC and the related game Kriegspiel were motivated by uncertainty in warfare (Newman et al., 2016). While playing board games is not dangerous in itself, algorithms that account for uncertainty may become effective and consequential in the real world. In particular, since it focuses on exploiting weaknesses of other agents, DSMCP could be applied in harmful ways.

Future work

Planning in imperfect information games is an active area of research (Russell and Norvig, 2020), and RBC is a promising testing ground for such research. Penumbra would likely benefit from further hyperparameter tuning and potentially alternative corralled bandit algorithms (Arora et al., 2020). Modeling an opponent poorly could be catastrophic; algorithmic adjustments may lead to more-robust best-response strategies (Ponsen et al., 2011). How much is lost by collapsing infostates with synopses is unclear and deserves further investigation. Finally, the “bitter lesson” of machine learning (Sutton, 2019) suggests that a learned synopsis function may perform better.

Acknowledgements

Thanks to the Johns Hopkins University Applied Physics Laboratory for inventing such an intriguing game and for hosting RBC competitions. Thanks to Ryan Gardner for valuable correspondence. Thanks to Rosanne Liu, Joel Veness, Marc Lanctot, Zhe Zhao, and Zach Nussbaum for providing feedback on early drafts. Thanks to William Bernardoni for open sourcing high-quality baseline bots. Thanks to Solidmind for the song “Penumbra”, which is an excellent soundtrack for programming.

References

  • Adebayo et al. (2018) Julius Adebayo, Justin Gilmer, Michael Muelly, Ian Goodfellow, Moritz Hardt, and Been Kim. Sanity checks for saliency maps, 2018.
  • Arora et al. (2020) Raman Arora, Teodor V. Marinov, and Mehryar Mohri. Corralling stochastic bandit algorithms, 2020.
  • Bernardoni (2020) William Bernardoni. Baseline bots. https://github.com/wrbernardoni/Baseline-Bots, 2020.
  • Blowitski and Highley (2021) Kyle Blowitski and Timothy Highley. Checkpoint variations for deep q-learning in reconnaissance blind chess. J. Comput. Sci. Coll., 36(3), oct 2021.
  • Bowling and Veloso (2001) Michael Bowling and Manuela Veloso. Rational and convergent learning in stochastic games. IJCAI International Joint Conference on Artificial Intelligence, 08 2001.
  • Brown and Sandholm (2018) Noam Brown and Tuomas Sandholm. Superhuman ai for heads-up no-limit poker: Libratus beats top professionals. Science, 359(6374):418–424, 2018. ISSN 0036-8075. doi: 10.1126/science.aao1733. URL https://science.sciencemag.org/content/359/6374/418.
  • Brown and Sandholm (2019) Noam Brown and Tuomas Sandholm. Superhuman ai for multiplayer poker. Science, 365(6456):885–890, 2019. ISSN 0036-8075. doi: 10.1126/science.aay2400. URL https://science.sciencemag.org/content/365/6456/885.
  • Brown et al. (2018) Noam Brown, Adam Lerer, Sam Gross, and Tuomas Sandholm. Deep counterfactual regret minimization, 2018.
  • Brown et al. (2020) Noam Brown, Anton Bakhtin, Adam Lerer, and Qucheng Gong. Combining deep reinforcement learning and search for imperfect-information games, 2020.
  • Browne (2014) Cameron Browne. Bitboard methods for games. ICGA journal, 37:67–84, 06 2014. doi: 10.3233/ICG-2014-37202.
  • Ciancarini and Favini (2009) Paolo Ciancarini and Gian Piero Favini. Monte carlo tree search techniques in the game of kriegspiel. In IJCAI, pages 474–479, 2009. URL http://ijcai.org/Proceedings/09/Papers/086.pdf.
  • Coulom (2008) Rémi Coulom. Whole-history rating: A bayesian rating system for players of time-varying strength. In H. Jaap van den Herik, Xinhe Xu, Zongmin Ma, and Mark H. M. Winands, editors, Computers and Games, pages 113–124, Berlin, Heidelberg, 2008. Springer Berlin Heidelberg. ISBN 978-3-540-87608-3.
  • Cowling et al. (2012) Peter Cowling, Edward Powley, and Daniel Whitehouse. Information set monte carlo tree search. IEEE Transactions on Computational Intelligence and Ai in Games, 4:120–143, 06 2012. doi: 10.1109/TCIAIG.2012.2200894.
  • Fifty et al. (2020) Christopher Fifty, Ehsan Amid, Zhe Zhao, Tianhe Yu, Rohan Anil, and Chelsea Finn. Measuring and harnessing transference in multi-task learning, 2020.
  • Frank et al. (1998) I. Frank, D. Basin, and Hitoshi Matsubara. Finding optimal strategies for imperfect information games. In AAAI/IAAI, 1998.
  • Gardner et al. (2020) Ryan W. Gardner, Corey Lowman, Casey Richardson, Ashley J. Llorens, Jared Markowitz, Nathan Drenkow, Andrew Newman, Gregory Clark, Gino Perrotta, Robert Perrotta, Timothy Highley, Vlad Shcherbina, William Bernardoni, Mark Jordan, and Asen Asenov. The first international competition in machine reconnaissance blind chess. volume 123 of Proceedings of Machine Learning Research, pages 121–130, Vancouver, CA, 08–14 Dec 2020. PMLR. URL http://proceedings.mlr.press/v123/gardner20a.html.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2016. doi: 10.1109/cvpr.2016.90. URL http://dx.doi.org/10.1109/cvpr.2016.90.
  • Heinrich and Silver (2015) Johannes Heinrich and David Silver. Smooth uct search in computer poker. In Proceedings of the 24th International Conference on Artificial Intelligence, IJCAI’15, page 554–560. AAAI Press, 2015. ISBN 9781577357384.
  • Heinrich and Silver (2016) Johannes Heinrich and David Silver. Deep reinforcement learning from self-play in imperfect-information games, 2016.
  • Highley et al. (2020) Timothy Highley, Brendan Funk, and Laureen Okin. Dealing with uncertainty: A piecewisegrid agent for reconnaissance blind chess. J. Comput. Sci. Coll., 35(8):156–165, April 2020. ISSN 1937-4771.
  • Hooker et al. (2018) Sara Hooker, Dumitru Erhan, Pieter-Jan Kindermans, and Been Kim. A benchmark for interpretability methods in deep neural networks, 2018.
  • Kapishnikov et al. (2019) Andrei Kapishnikov, Tolga Bolukbasi, Fernanda Viégas, and Michael Terry. Xrai: Better attributions through regions, 2019.
  • Kocsis and Szepesvári (2006) Levente Kocsis and Csaba Szepesvári. Bandit based monte-carlo planning. volume 2006, pages 282–293, 09 2006. ISBN 978-3-540-45375-8. doi: 10.1007/11871842_29.
  • Kuleshov and Precup (2014) Volodymyr Kuleshov and Doina Precup. Algorithms for multi-armed bandit problems, 2014.
  • Lanctot et al. (2009) Marc Lanctot, Kevin Waugh, Martin Zinkevich, and Michael Bowling. Monte carlo sampling for regret minimization in extensive games. In Y. Bengio, D. Schuurmans, J. Lafferty, C. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems, volume 22, pages 1078–1086. Curran Associates, Inc., 2009.
  • Lanctot et al. (2012) Marc Lanctot, Richard Gibson, Neil Burch, Martin Zinkevich, and Michael Bowling. No-regret learning in extensive-form games with imperfect recall, 2012.
  • Lee et al. (2019) B. Lee, A. Jackson, T. Madams, Seth Troisi, and Derek Jones. Minigo: A case study in reproducing reinforcement learning research. In RML@ICLR, 2019.
  • Lisy et al. (2015) Viliam Lisy, Marc Lanctot, and Michael Bowling. Online monte carlo counterfactual regret minimization for search in imperfect information games. volume 1, 05 2015.
  • Liu et al. (2018) Rosanne Liu, Joel Lehman, Piero Molino, Felipe Petroski Such, Eric Frank, Alex Sergeev, and Jason Yosinski. An intriguing failing of convolutional neural networks and the coordconv solution, 2018.
  • Long et al. (2010) Jeffrey Long, Nathan R. Sturtevant, Michael Buro, and Timothy Furtak. Understanding the success of perfect information monte carlo sampling in game tree search. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, AAAI’10, page 134–140. AAAI Press, 2010.
  • Markowitz et al. (2018) Jared Markowitz, Ryan W. Gardner, and Ashley J. Llorens. On the complexity of reconnaissance blind chess, 2018.
  • Mnih et al. (2013) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning, 2013.
  • Moravčík et al. (2017) Matej Moravčík, Martin Schmid, Neil Burch, Viliam Lisý, Dustin Morrill, Nolan Bard, Trevor Davis, Kevin Waugh, Michael Johanson, and Michael Bowling. Deepstack: Expert-level artificial intelligence in heads-up no-limit poker. Science, 356(6337):508–513, Mar 2017. ISSN 1095-9203. doi: 10.1126/science.aam6960. URL http://dx.doi.org/10.1126/science.aam6960.
  • Neller and Lanctot (2013) T. Neller and Marc Lanctot. An introduction to counterfactual regret minimization. 2013.
  • Newman et al. (2016) Andrew J. Newman, Casey L. Richardson, Sean M. Kain, Paul G. Stankiewicz, Paul R. Guseman, Blake A. Schreurs, and Jeffrey A. Dunne. Reconnaissance blind multi-chess: an experimentation platform for ISR sensor fusion and resource management. In Ivan Kadar, editor, Signal Processing, Sensor/Information Fusion, and Target Recognition XXV, volume 9842, pages 62 – 81. International Society for Optics and Photonics, SPIE, 2016. doi: 10.1117/12.2228127. URL https://doi.org/10.1117/12.2228127.
  • Osborne and Rubinstein (1994) M.J. Osborne and A. Rubinstein. A Course in Game Theory. The MIT Press. MIT Press, 1994. ISBN 9780262650403. URL https://books.google.com/books?id=mnv1DAAAQBAJ.
  • Parker et al. (2006) Austin Parker, Dana Nau, and V. S. Subrahmanian. Overconfidence or paranoia? search in imperfect-information games. In Proceedings of the 21st National Conference on Artificial Intelligence - Volume 2, AAAI’06, page 1045–1050. AAAI Press, 2006. ISBN 9781577352815.
  • Parker et al. (2010) Austin Parker, Dana S. Nau, and V. S. Subrahmanian. Paranoia versus overconfidence in imperfect-information games. In Rina Dechter, Hector Geffner, and Joseph Y. Halpern, editors, Heuristics, Probabilities, and Causality: A Tribute to Judea Pearl, chapter 5, pages 63–87. College Publications, 2010.
  • Pineau et al. (2006) J. Pineau, G. Gordon, and S. Thrun. Anytime point-based approximations for large pomdps. Journal of Artificial Intelligence Research, 27:335–380, Nov 2006. ISSN 1076-9757. doi: 10.1613/jair.2078. URL http://dx.doi.org/10.1613/jair.2078.
  • Ponsen et al. (2011) Marc Ponsen, Steven de Jong, and Marc Lanctot. Computing approximate nash equilibria and robust best-responses using sampling. J. Artif. Int. Res., 42(1):575–605, September 2011. ISSN 1076-9757.
  • Richards (2012) Mark D. Richards. Reasoning and decisions in partially observable games. 2012.
  • Romstad et al. (2020) Tord Romstad, Marco Costalba, Joona Kiiski, and et al. Stockfish - open source chess engine. https://stockfishchess.org, 2020.
  • Rumelhart et al. (1986) D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning Internal Representations by Error Propagation, page 318–362. MIT Press, Cambridge, MA, USA, 1986. ISBN 026268053X.
  • Russell and Norvig (2020) S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Pearson, 2020. ISBN 9780134671932.
  • Savelyev (2020) Sergey Savelyev. Mastering reconnaissance blind chess with reinforcement learning. May 2020. URL https://www.sergeysav.com/projects/rbmc.
  • Schrittwieser et al. (2019) Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan, Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis, Thore Graepel, Timothy Lillicrap, and David Silver. Mastering atari, go, chess and shogi by planning with a learned model, 2019.
  • Silver and Veness (2010) D. Silver and J. Veness. Monte-carlo planning in large pomdps. In NIPS, 2010.
  • Silver et al. (2016) D. Silver, Aja Huang, Chris J. Maddison, A. Guez, L. Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Vedavyas Panneershelvam, Marc Lanctot, S. Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and Demis Hassabis. Mastering the game of go with deep neural networks and tree search. Nature, 529:484–489, 2016.
  • Silver et al. (2017a) D. Silver, Julian Schrittwieser, K. Simonyan, Ioannis Antonoglou, Aja Huang, A. Guez, T. Hubert, L. Baker, Matthew Lai, A. Bolton, Yutian Chen, T. Lillicrap, F. Hui, L. Sifre, George van den Driessche, T. Graepel, and Demis Hassabis. Mastering the game of go without human knowledge. Nature, 550:354–359, 2017a.
  • Silver et al. (2017b) David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, and Demis Hassabis. Mastering chess and shogi by self-play with a general reinforcement learning algorithm, 2017b.
  • Smilkov et al. (2017) Daniel Smilkov, Nikhil Thorat, Been Kim, Fernanda Viégas, and Martin Wattenberg. Smoothgrad: removing noise by adding noise, 2017.
  • Sundararajan et al. (2017) Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks, 2017.
  • Sutton (2019) Richard S. Sutton. The bitter lesson. http://www.incompleteideas.net/IncIdeas/BitterLesson.html, 2019.
  • Tian et al. (2019) Yuandong Tian, Jerry Ma, Qucheng Gong, Shubho Sengupta, Zhuoyuan Chen, James Pinkerton, and C. Lawrence Zitnick. Elf opengo: An analysis and open reimplementation of alphazero, 2019.
  • Tomašev et al. (2020) Nenad Tomašev, Ulrich Paquet, Demis Hassabis, and Vladimir Kramnik. Assessing game balance with alphazero: Exploring alternative rule sets in chess, 2020.
  • Wagstaff et al. (2019) Edward Wagstaff, Fabian B. Fuchs, Martin Engelcke, Ingmar Posner, and Michael Osborne. On the limitations of representing functions on sets, 2019.
  • Wang et al. (2019) Yunbo Wang, B. Liu, Jiajun Wu, Yuke Zhu, S. S. Du, Li Fei-Fei, and J. Tenenbaum. Dualsmc: Tunneling differentiable filtering and planning under continuous pomdps. arXiv: Learning, 2019.
  • Waugh et al. (2009) Kevin Waugh, Martin Zinkevich, Michael Johanson, Morgan Kan, David Schnizlein, and Michael Bowling. A practical use of imperfect recall. 01 2009.
  • Wu (2020) David J. Wu. Accelerating self-play learning in go, 2020.
  • Zaheer et al. (2017) Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Ruslan Salakhutdinov, and Alexander Smola. Deep sets, 2017.
  • Zhang et al. (2020) Junjie Zhang, Lingqiao Liu, Peng Wang, and Chunhua Shen. To balance or not to balance: A simple-yet-effective approach for learning with long-tailed distributions, 2020.
  • Zinkevich et al. (2008) Martin Zinkevich, Michael Johanson, Michael Bowling, and Carmelo Piccione. Regret minimization in games with incomplete information. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems, volume 20, pages 1729–1736. Curran Associates, Inc., 2008. URL https://proceedings.neurips.cc/paper/2007/file/08d98638c6fcd194a4b1e6992063e944-Paper.pdf.
  • Zobrist (1990) A. Zobrist. A new hashing method with application for game playing. ICGA Journal, 13:69–73, 1990.

Appendix A Appendix

Table 5 lists the training hyperparameters and runtime hyperparameters used by PenumbraMixture. Table 6, Table 7, and Table 8 provide top-1 action, top-5 action, and winner accuracies, respectively, between each headset in the neural network. Figure 9 shows game length distributions for each headset.

The synopsis features were hand-designed. Many of them are natural given the rules of chess. Some of them are near duplicates of each other. Table 9 and Table 10 jointly provide brief descriptions of each synopsis feature plane. These tables also include saliency estimates averaged over five runs. The penultimate column orders the synopsis features by their per-bit saliency based on action gradients, and the final column reports the average difference of the policy head accuracies when the model was retrained without each feature.

Table 5: Hyperparameters used by PenumbraMixture

Symbol Parameter Value
b Batch size 256
c Exploration constant 2
d_sense Search depth for sense actions 6
d_move Search depth for move actions 12
F # of binary synopsis features 8×8×104
k Rejection sampling persistence 512
ℓ Limited state set size 128
m Bandit mixing constant 1
n_particles # of samples to track 4096
n_vl Virtual loss 1
n_batches Total minibatches of training 650000
n_width Network width; # features per layer 128
n_depth Network depth; # residual blocks 10
z Depth increase threshold ∞
κ Caution 0
ϕ Paranoia 0
ϵ Learning rate 0.0005
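
For readers who want to reproduce these settings, the sketch below bundles the Table 5 values into a single Python configuration object. The class and field names are illustrative and do not come from Penumbra's source code; only the values are taken from the table.

```python
# Hedged sketch: the Table 5 hyperparameters as one configuration object.
# Names are illustrative; only the values come from the table.
from dataclasses import dataclass
import math

@dataclass(frozen=True)
class PenumbraMixtureConfig:
    batch_size: int = 256                 # b
    exploration_constant: float = 2.0     # c
    sense_search_depth: int = 6           # d_sense
    move_search_depth: int = 12           # d_move
    synopsis_shape: tuple = (104, 8, 8)   # F: 8 x 8 x 104 binary features (channels first)
    rejection_persistence: int = 512      # k
    limited_state_set_size: int = 128     # ℓ
    bandit_mixing_constant: float = 1.0   # m
    n_particles: int = 4096               # samples tracked by the particle filter
    virtual_loss: int = 1                 # n_vl
    total_minibatches: int = 650_000      # n_batches
    network_width: int = 128              # features per layer
    network_depth: int = 10               # residual blocks
    depth_increase_threshold: float = math.inf  # z
    caution: float = 0.0                  # κ
    paranoia: float = 0.0                 # ϕ
    learning_rate: float = 0.0005         # ϵ

config = PenumbraMixtureConfig()
```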

A.1 2019 NeurIPS competition

Penumbra was originally created to compete in the 2019 reconnaissance blind chess competition hosted by the Conference on Neural Information Processing Systems (NeurIPS). However, it performed very poorly in that competition, winning fewer games than the random bot.

The program and underlying algorithm presented in this paper are largely the same as the originals. The main differences are that some hyperparameters were adjusted, the neural network was retrained with more data, and a key bug in the playout code was fixed: instead of choosing actions according to the policy from the neural network, the playout code erroneously always selected the last legal action. Fixing that single bug made a substantial difference.
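
To make the nature of that bug concrete, the following minimal sketch contrasts the erroneous selection with policy-based sampling. The function names, the `legal_actions` list, and the `policy_probs` mapping are hypothetical and are not taken from Penumbra's code.

```python
# Hedged sketch of the playout bug described above; names are hypothetical.
import random

def choose_playout_action_buggy(legal_actions, policy_probs):
    # Bug: the last legal action was always returned, ignoring the policy.
    return legal_actions[-1]

def choose_playout_action_fixed(legal_actions, policy_probs):
    # Fix: sample an action in proportion to the network's policy over legal actions.
    weights = [policy_probs[a] for a in legal_actions]
    return random.choices(legal_actions, weights=weights, k=1)[0]
```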

A.2 Comparison with Kriegspiel

A comparison between RBC and Kriegspiel chess (Ciancarini and Favini, 2009; Parker et al., 2010; Richards, 2012) may be worthwhile. Kriegspiel chess also introduces uncertainty about the opposing pieces but lacks an explicit sensing mechanism. Instead, information is gathered solely from captures, check notifications, and illegal move attempts. In Kriegspiel, illegal moves are rejected and the player is allowed to choose a new move with their increased information about the board state, which entangles the positional and informational aspects of the game. In contrast, sensing in RBC gives players direct control over the amount and character of the information they possess.

Another significant difference comes from the mechanics related to check. Capturing pieces and putting the opposing king into check have benefits in both games: capturing pieces leads to a material advantage, and check often precedes checkmate. In Kriegspiel, however, both capturing and giving check also provide the opponent with information. In RBC, while capturing does give the opponent information, putting their king into check does not, which makes sneak attacks more viable.

A.3 Games played

The games that were played in order to produce the results tables in this paper are available for download from https://github.com/w-hat/penumbra.

Table 6: Top-1 action accuracy (%). Rows give the headset; columns give the evaluation dataset.
Headset All Top StrangeFish LaSalle Dyn.Entropy Oracle StockyInfe. Marmot Genetic Zugzwang Trout Human Attacker Random
All 33.6 41.5 40.5 32.3 43.7 44.1 41.5 20.3 31.9 37.1 36.6 21.9 41.4 3.8
Top 30.5 43.1 44.5 32.3 43.9 44.4 40.3 19.4 31.6 22.4 26.0 18.5 15.6 3.3
StrangeFish 27.3 40.5 45.9 30.3 36.9 38.8 33.2 17.9 27.7 21.0 23.6 17.5 14.6 3.3
LaSalle 26.0 34.4 34.6 36.4 34.1 34.5 33.2 17.2 24.6 22.6 26.1 15.5 11.2 3.4
Dyn.Entropy 27.9 37.9 35.2 27.5 50.0 40.1 37.4 18.0 32.6 18.3 22.6 16.9 13.8 3.3
Oracle 28.8 38.4 36.3 29.1 41.2 49.3 35.1 17.2 29.5 19.9 26.6 17.0 11.9 3.4
StockyInfe. 28.8 38.1 33.9 30.7 42.9 38.6 45.2 17.9 29.3 18.4 23.1 16.7 11.6 3.3
Marmot 22.4 29.4 28.6 24.0 31.4 29.3 29.7 24.6 25.0 16.4 15.5 15.4 11.1 3.4
Genetic 24.0 32.5 30.3 24.1 39.8 35.0 32.3 16.9 40.4 15.1 15.7 15.1 7.8 3.4
Zugzwang 20.5 21.8 23.2 20.5 20.4 23.5 17.3 11.7 12.3 47.0 34.6 14.0 10.9 3.2
Trout 22.8 25.1 26.0 24.0 23.0 27.8 21.3 12.5 14.4 36.1 41.8 14.9 14.7 3.7
Human 23.8 30.1 30.6 24.9 31.4 30.1 28.4 16.8 24.1 24.7 24.5 24.9 12.5 3.3
Attacker 10.6 11.7 11.0 9.8 11.4 11.6 12.4 8.6 8.8 9.2 9.0 6.7 45.1 4.4
Random 14.0 16.7 16.2 14.0 16.5 17.8 16.8 9.4 11.4 15.7 16.5 10.4 10.4 4.5
Table 7: Top-5 action accuracy (%). Rows give the headset; columns give the evaluation dataset.
Headset All Top StrangeFish LaSalle Dyn.Entropy Oracle StockyInfe. Marmot Genetic Zugzwang Trout Human Attacker Random
All 62.0 72.3 71.0 66.6 74.9 75.8 72.5 51.3 65.3 65.5 61.8 48.1 59.1 18.2
Top 58.8 73.4 73.4 66.4 75.5 76.4 72.0 50.1 64.7 48.5 52.2 45.4 35.3 16.3
StrangeFish 56.8 72.2 74.3 64.7 72.6 74.2 67.5 48.5 62.1 46.5 50.1 43.8 41.4 15.7
LaSalle 56.8 68.8 68.4 68.2 69.8 70.6 68.3 48.7 58.9 51.5 56.6 43.6 30.8 16.8
Dyn.Entropy 55.5 69.1 66.9 60.9 76.7 72.3 69.6 47.2 64.4 38.4 47.2 42.7 41.4 16.6
Oracle 56.7 70.4 68.9 61.9 74.1 77.4 69.1 45.6 63.4 43.9 50.6 43.1 34.4 16.4
StockyInfe. 57.0 70.0 67.3 64.7 74.0 71.3 73.6 49.3 63.3 43.3 50.2 44.0 29.8 17.0
Marmot 53.4 65.0 64.0 58.7 68.6 66.1 65.2 55.4 59.3 43.0 46.2 42.6 32.2 16.2
Genetic 52.6 65.8 64.2 58.2 71.3 69.7 65.8 45.4 70.5 37.4 41.6 40.9 27.0 16.2
Zugzwang 42.8 45.0 45.4 46.0 42.1 47.5 42.4 32.5 32.8 71.6 57.9 35.3 29.0 15.4
Trout 49.3 54.5 54.4 53.9 53.7 57.2 52.5 38.4 42.2 62.9 63.4 38.7 40.9 18.1
Human 53.5 62.9 62.4 58.3 64.7 64.6 62.5 45.9 55.7 53.7 53.4 51.9 33.4 16.3
Attacker 35.2 39.5 38.8 34.7 40.8 40.2 39.7 31.1 33.8 33.0 32.4 28.4 61.9 20.0
Random 39.7 45.6 44.5 41.2 46.4 47.9 46.0 31.8 37.7 41.3 42.1 31.5 30.3 20.7
Table 8: Winner accuracy (%). Rows give the headset; columns give the evaluation dataset.
Headset All Top StrangeFish LaSalle Dyn.Entropy Oracle StockyInfe. Marmot Genetic Zugzwang Trout Human Attacker Random
All 74.8 73.4 76.6 68.4 74.6 76.3 67.8 79.3 74.1 76.7 76.7 72.1 79.6 91.3
Top 63.1 82.4 86.7 68.0 76.0 80.6 66.3 65.6 73.2 49.1 55.1 52.7 47.7 29.9
StrangeFish 64.1 82.1 86.6 67.5 76.2 80.6 65.2 65.8 72.7 50.1 55.4 55.8 60.9 37.3
LaSalle 69.9 77.3 80.7 69.7 76.0 78.8 67.1 73.5 75.9 65.1 68.3 62.4 71.9 60.7
Dyn.Entropy 67.8 80.4 85.0 69.4 78.9 80.8 67.0 71.7 75.6 58.5 61.1 61.5 65.0 47.3
Oracle 66.0 82.0 86.4 69.4 77.3 81.4 66.9 69.2 75.3 53.5 58.2 57.9 59.5 39.7
StockyInfe. 71.3 78.5 82.5 69.3 77.2 79.5 68.3 75.3 76.8 64.0 65.8 66.9 72.6 68.9
Marmot 70.6 67.7 70.1 64.5 70.9 72.3 65.7 80.8 72.9 73.9 73.4 69.2 77.0 72.6
Genetic 67.1 80.0 84.1 68.6 77.1 80.3 67.1 71.5 77.9 56.5 60.8 60.9 62.1 44.3
Zugzwang 68.3 54.6 54.0 59.7 61.4 60.0 61.7 76.1 64.4 80.1 78.8 72.5 78.5 88.9
Trout 69.7 58.6 59.1 60.9 63.6 64.4 63.4 77.9 69.0 77.9 79.8 72.4 78.6 87.3
Human 71.1 64.8 66.0 63.8 67.7 68.0 65.1 76.6 70.4 74.6 76.2 73.3 77.7 90.1
Attacker 63.7 47.1 46.1 55.6 53.8 50.9 56.5 71.4 59.5 75.0 73.4 71.5 80.1 92.7
Random 51.0 25.5 20.9 43.4 34.8 28.5 46.5 54.9 38.2 65.7 56.6 62.5 69.5 94.7
Figure 9: The historical game length distributions are shown for the data used to train each of the headsets. On average, the games from Attacker were the shortest, and the games from StockyInference were the longest.
Table 9: Synopsis feature descriptions, saliency estimates, and ablation study results.
# Description | Loss PbS | Action PbS | Action PbS (bit=0) | Action PbS (bit=1) | Action PbS rank | Ablation acc. diff. (%)
0 East side (constant) 3.24 0.86 0.80 0.91 34 0.27
1 West side (constant) 3.12 0.82 0.84 0.81 43 0.00
2 South side (constant) 3.24 0.86 0.78 0.94 33 -0.08
3 North side (constant) 3.22 0.82 0.89 0.75 44 0.27
4 Rank 1 (constant) 3.15 0.80 0.81 0.76 50 -0.03
5 Rank 8 (constant) 3.12 0.82 0.85 0.61 42 -0.07
6 A-file (constant) 3.04 0.78 0.81 0.53 59 0.18
7 H-file (constant) 3.08 0.80 0.83 0.58 51 0.15
8 Dark squares (constant) 3.08 0.82 0.81 0.82 45 0.21
9 Light squares (constant) 3.03 0.78 0.77 0.78 61 0.01
10 Stage (move or sense) 7.80 3.14 2.82 3.45 0 -0.19
11 Not own piece 5.40 1.43 2.66 1.13 8 -0.29
12 Own pawns 4.16 1.14 1.07 1.73 14 0.01
13 Own knights 3.68 0.93 0.91 1.63 22 0.09
14 Own bishops 3.46 0.89 0.87 1.63 27 0.02
15 Own rooks 3.67 0.94 0.93 1.12 21 0.03
16 Own queens 3.28 0.87 0.85 2.24 32 0.06
17 Own king 3.14 0.79 0.79 0.88 52 0.10
18 Definitely not opposing pieces 3.85 1.14 1.03 1.21 13 -0.10
19 Definitely opposing pawns 3.49 1.01 1.02 0.73 17 -0.08
20 Definitely opposing knights 3.30 0.93 0.93 0.61 23 -0.05
21 Definitely opposing bishops 3.21 0.88 0.88 0.59 29 -0.02
22 Definitely opposing rooks 3.04 0.81 0.82 0.38 47 0.02
23 Definitely opposing queens 3.15 0.85 0.85 0.60 35 -0.10
24 Definitely opposing king 3.60 0.92 0.91 2.27 26 0.04
25 Possibly not opposing pieces 5.22 1.54 1.34 1.56 5 -0.04
26 Possibly opposing pawns 3.50 0.92 0.92 0.93 24 0.06
27 Possibly opposing knights 2.97 0.77 0.77 0.81 67 0.07
28 Possibly opposing bishops 2.95 0.75 0.74 0.89 70 0.09
29 Possibly opposing rooks 3.01 0.75 0.76 0.63 69 -0.18
30 Possibly opposing queens 3.05 0.78 0.77 1.05 57 -0.07
31 Possibly opposing kings 4.86 1.48 1.43 2.64 7 -0.04
32 Last from 2.77 0.72 0.72 0.83 76 -0.11
33 Last to 3.28 0.96 0.96 1.40 19 0.02
34 Last own capture 3.10 0.83 0.83 1.17 40 0.07
35 Last opposing capture 8.04 2.83 2.82 6.51 1 -0.08
36 Definitely attackable 2.72 0.70 0.62 0.78 84 -0.06
37 Definitely attackable somehow 2.73 0.71 0.65 0.78 80 -0.02
38 Possibly attackable 3.02 0.81 0.71 0.92 48 0.19
39 Definitely doubly attackable 2.67 0.66 0.63 0.80 92 -0.11
40 Definitely doubly attackable somehow 2.66 0.69 0.67 0.80 88 0.14
41 Possibly doubly attackable 2.71 0.75 0.73 0.83 71 -0.26
42 Definitely attackable by pawns 3.54 0.92 0.92 2.38 25 0.13
43 Possibly attackable by pawns 3.11 0.78 0.78 0.95 58 -0.10
44 Definitely attackable by knights 2.91 0.72 0.71 0.84 77 0.24
45 Definitely attackable by bishops 2.60 0.64 0.61 0.80 95 0.15
46 Possibly attackable by bishops 2.60 0.68 0.64 0.85 89 -0.07
47 Definitely attackable by rooks 2.63 0.65 0.64 0.75 93 0.07
48 Possibly attackable by rooks 2.74 0.70 0.69 0.77 81 0.00
49 Possibly attackable without king 2.72 0.70 0.63 0.79 82 0.19
50 Possibly attackable without pawns 2.63 0.67 0.62 0.73 90 0.17
51 Definitely attackable by opponent 3.25 0.87 0.91 0.77 31 -0.03
Table 10: Synopsis feature descriptions, saliency estimates, and ablation study results (continued).
# Description | Loss PbS | Action PbS | Action PbS (bit=0) | Action PbS (bit=1) | Action PbS rank | Ablation acc. diff. (%)
52 Possibly attackable by opponent 3.15 0.84 0.90 0.81 37 0.06
53 Definitely doubly attackable by opp. 2.56 0.65 0.66 0.56 94 -0.01
54 Possibly doubly attackable by opp. 2.67 0.71 0.73 0.66 79 -0.13
55 Definitely attackable by opp. pawns 3.10 0.87 0.87 1.69 30 0.12
56 Possibly attackable by opp. pawns 2.84 0.77 0.77 1.10 62 0.30
57 Definitely attackable by opp. knights 2.78 0.69 0.70 0.59 87 0.21
58 Possibly attackable by opp. knights 2.14 0.52 0.51 0.55 102 0.08
59 Definitely attackable by opp. bishops 2.66 0.67 0.67 0.63 91 0.09
60 Possibly attackable by opp. bishops 2.16 0.53 0.52 0.56 101 -0.09
61 Definitely attackable by opp. rooks 2.77 0.70 0.72 0.48 83 -0.19
62 Possibly attackable by opp. rooks 2.10 0.51 0.53 0.47 103 0.20
63 Possibly attackable by opp. w/o king 2.55 0.64 0.63 0.64 96 -0.17
64 Possibly attackable by opp. w/o pawns 2.45 0.62 0.61 0.62 97 -0.13
65 Possibly safe opposing king 6.04 2.06 2.01 3.38 2 0.07
66 Squares the opponent may move to 2.40 0.60 0.60 0.60 98 0.01
67 Possible castle state for opponent 3.09 0.79 0.79 0.72 53 0.00
68 Known squares 4.94 1.52 1.67 1.45 6 0.13
69 Own king’s king-neighbors 3.10 0.78 0.77 0.93 56 0.14
70 Own king’s knight-neighbors 2.82 0.71 0.70 0.91 78 0.31
71 Definitely opp. knights near king 3.09 0.79 0.79 1.64 54 0.13
72 Possibly opp. knights near king 5.13 1.72 1.72 2.77 4 -0.01
73 Own king’s bishop-neighbors 2.74 0.69 0.68 0.86 85 -0.10
74 Definitely opp. bishops near king 3.04 0.79 0.79 0.89 55 0.23
75 Possibly opp. bishops near king 5.23 1.75 1.75 2.41 3 -0.11
76 Own king’s rook-neighbors 2.76 0.69 0.68 0.83 86 -0.13
77 Definitely opp. rooks near king 3.10 0.81 0.81 0.87 49 0.31
78 Possibly opp. rooks near king 4.45 1.40 1.40 1.55 10 0.05
79 All own pieces 5.26 1.36 1.09 2.47 11 -0.01
80 Definitely empty squares 3.69 0.96 1.05 0.84 20 -0.13
81 May castle eventually 3.11 0.81 0.81 1.26 46 0.24
82 Possibly may castle 3.05 0.77 0.77 0.63 68 0.05
83 Definitely may castle 3.04 0.77 0.77 0.87 66 0.12
84 Own queens’ rook-neighbors 2.20 0.54 0.53 0.63 100 0.04
85 Own queens’ bishop-neighbors 2.33 0.57 0.57 0.67 99 0.06
86 Previous definitely not opp. pieces 3.82 0.88 0.87 0.89 28 -0.30
87 Previous definitely opp. pawns 4.16 1.16 1.18 0.88 12 0.14
88 Previous definitely opp. knights 3.02 0.77 0.77 0.72 63 0.12
89 Previous definitely opp. bishops 2.92 0.73 0.73 0.70 75 -0.02
90 Previous definitely opp. rooks 3.60 1.01 1.02 0.56 16 0.05
91 Previous definitely opp. queens 3.93 1.11 1.11 1.00 15 0.22
92 Previous definitely opp. king 3.33 0.83 0.82 1.58 38 -0.04
93 Previous possibly not opp. pieces 4.40 1.43 1.21 1.47 9 -0.05
94 Previous possibly opp. pawns 3.27 0.83 0.82 0.92 39 0.21
95 Previous possibly opp. knights 3.04 0.78 0.78 0.73 60 0.22
96 Previous possibly opp. bishops 3.10 0.74 0.74 0.78 73 0.16
97 Previous possibly opp. rooks 2.94 0.77 0.79 0.45 64 0.10
98 Previous possibly opp. queens 3.02 0.74 0.74 0.81 74 -0.07
99 Previous possibly opp. king 3.14 0.83 0.82 1.15 41 0.04
100 Previous last from 2.85 0.75 0.74 0.85 72 -0.04
101 Previous last to 3.36 1.00 1.00 1.50 18 0.21
102 Previous own capture 3.05 0.84 0.84 1.13 36 -0.09
103 Previous opposing capture 2.93 0.77 0.77 1.05 65 0.05