
Planning with Learned Binarized Neural Networks Benchmarks for MaxSAT Evaluation 2021

Buser Say Monash University
Melbourne, Australia
[email protected]
   Scott Sanner University of Toronto
Toronto, Canada
[email protected]
   Jo Devriendt KU Leuven
Leuven, Belgium
[email protected]
   Jakob Nordström University of Copenhagen
Copenhagen, Denmark
[email protected]
   Peter J. Stuckey Monash University
Melbourne, Australia
[email protected]
Abstract

This document provides a brief introduction to the learned automated planning problem, where the state transition function is in the form of a binarized neural network (BNN), presents a general MaxSAT encoding for this problem, and describes the four domains, namely Navigation, Inventory Control, System Administrator and Cellda, that are submitted as benchmarks for the MaxSAT Evaluation 2021.

Index Terms:
binarized neural networks, automated planning

I Introduction

Automated planning studies the reasoning side of acting in Artificial Intelligence, and automates the selection and ordering of actions to reach desired states of the world as best as possible [1]. An automated planning problem represents the dynamics of the real world using a model, which can either be manually encoded [2, 3, 4, 5, 6] or learned from data [7, 8, 9, 10]. In this document, we focus on the latter.

Automated planning with deep neural network (DNN) learned state transition functions is a two stage data-driven framework for learning and solving automated planning problems with unknown state transition functions [11, 12, 13]. The first stage of the framework learns the unknown state transition function from data as a DNN. The second stage of the framework plans optimally with respect to the learned DNN by solving an equivalent optimization problem (e.g., a mixed-integer programming model [11, 14, 15, 16], a 0–1 integer programming model [12, 17], a weighted partial MaxSAT model [12, 17], a constraint programming model [18], or a pseudo-Boolean optimization model [18]). In this document, we focus on the second stage of the data-driven framework where the learned DNN is a binarized neural network (BNN) [19].

The remainder of this document is organized as follows. We begin with the description of the learned automated planning problem and the binarized neural network (BNN). Then we present the weighted partial MaxSAT model of the general learned automated planning problem, and conclude with the description of the four learned automated planning domains, namely Navigation, Inventory Control, System Administrator and Cellda, that are submitted as benchmarks for the MaxSAT Evaluation 2021.

II Automated Planning with Learned Binarized Neural Network State Transitions

II-A Problem Definition

A fixed-horizon learned deterministic automated planning problem [11, 12, 18] is a tuple $\tilde{\Pi}=\langle S,A,C,\tilde{T},V,G,R,H\rangle$, where $S=\{s_{1},\dots,s_{n}\}$ and $A=\{a_{1},\dots,a_{m}\}$ are sets of state and action variables for positive integers $n,m\in\mathbb{Z}^{+}$ with domains $D_{s_{1}},\dots,D_{s_{n}}$ and $D_{a_{1}},\dots,D_{a_{m}}$ respectively. Moreover, $C:D_{s_{1}}\times\dots\times D_{s_{n}}\times D_{a_{1}}\times\dots\times D_{a_{m}}\rightarrow\{\mathit{true},\mathit{false}\}$ is the global function, $\tilde{T}:D_{s_{1}}\times\dots\times D_{s_{n}}\times D_{a_{1}}\times\dots\times D_{a_{m}}\rightarrow D_{s_{1}}\times\dots\times D_{s_{n}}$ is the learned state transition function, and $R:D_{s_{1}}\times\dots\times D_{s_{n}}\times D_{a_{1}}\times\dots\times D_{a_{m}}\rightarrow\mathbb{R}$ is the reward function. Finally, $V$ is a tuple of constants $\langle V_{1},\dots,V_{n}\rangle\in D_{s_{1}}\times\dots\times D_{s_{n}}$ denoting the initial values of all state variables, $G:D_{s_{1}}\times\dots\times D_{s_{n}}\rightarrow\{\mathit{true},\mathit{false}\}$ is the goal state function, and $H\in\mathbb{Z}^{+}$ is the planning horizon.

A solution to (i.e., a plan for) $\tilde{\Pi}$ is a tuple of values $\bar{A}^{t}=\langle\bar{a}^{t}_{1},\dots,\bar{a}^{t}_{m}\rangle\in D_{a_{1}}\times\dots\times D_{a_{m}}$ for all action variables $A$ over time steps $t\in\{1,\dots,H\}$ such that $\tilde{T}(\langle\bar{s}^{t}_{1},\dots,\bar{s}^{t}_{n},\bar{a}^{t}_{1},\dots,\bar{a}^{t}_{m}\rangle)=\langle\bar{s}^{t+1}_{1},\dots,\bar{s}^{t+1}_{n}\rangle$ and $C(\langle\bar{s}^{t}_{1},\dots,\bar{s}^{t}_{n},\bar{a}^{t}_{1},\dots,\bar{a}^{t}_{m}\rangle)=\mathit{true}$ hold for time steps $t\in\{1,\dots,H\}$, $V_{i}=\bar{s}^{1}_{i}$ for all $s_{i}\in S$, and $G(\langle\bar{s}^{H+1}_{1},\dots,\bar{s}^{H+1}_{n}\rangle)=\mathit{true}$. It has been shown that finding a feasible solution to $\tilde{\Pi}$ is NP-complete [18]. An optimal solution to (i.e., an optimal plan for) $\tilde{\Pi}$ is a solution such that the total reward $\sum_{t=1}^{H}R(\langle\bar{s}^{t+1}_{1},\dots,\bar{s}^{t+1}_{n},\bar{a}^{t}_{1},\dots,\bar{a}^{t}_{m}\rangle)$ is maximized.

We assume that the domains of action and state variables are binary unless otherwise stated (when the domain of a variable is not binary, e.g., in Inventory Control, we use the approximation $x\approx(-2^{m_{1}-1}x_{m_{1}}+\sum_{i=1}^{m_{1}-1}2^{i-1}x_{i})\cdot 10^{m_{2}}$ for integers $m_{1}\in\mathbb{Z}^{+}$ and $m_{2}\in\mathbb{Z}$). We further assume that the functions $C$, $G$, $R$ and the function $\tilde{T}$ are known, that the functions $C$ and $G$ can be equivalently represented by $J_{C}\in\mathbb{Z}^{+}$ and $J_{G}\in\mathbb{Z}^{+}$ linear constraints respectively, that the function $R$ is a linear expression, and that the function $\tilde{T}$ is a learned BNN [19].
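As a concrete illustration of this binarization, the following minimal Python sketch (with hypothetical helper names, not taken from the benchmark generator) encodes a bounded integer into $m_{1}$ binary variables and decodes it back using the approximation above.

```python
def decode(bits, m2=0):
    """bits[i] holds x_{i+1} in {0, 1}; the last bit x_{m1} carries weight -2^(m1-1)."""
    m1 = len(bits)
    value = -2 ** (m1 - 1) * bits[-1] + sum(2 ** i * b for i, b in enumerate(bits[:-1]))
    return value * 10 ** m2

def encode(x, m1, m2=0):
    """Two's-complement-style encoding of x / 10^m2 into m1 binary variables."""
    scaled = x // 10 ** m2
    assert -2 ** (m1 - 1) <= scaled < 2 ** (m1 - 1), "value out of range for m1 bits"
    return [(scaled >> i) & 1 for i in range(m1)]

# With m1 = 4 (as used for Inventory Control below), the inventory level 5
# is represented by the assignment [1, 0, 1, 0] since 1 + 4 = 5.
assert decode(encode(5, m1=4)) == 5
assert decode(encode(-3, m1=4)) == -3
```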

II-B Binarized Neural Networks

Binarized neural networks (BNNs) are DNNs with binarized weights and activation functions [19]. Given $L$ layers with layer width $W_{l}$ for each layer $l\in\{1,\dots,L\}$ and a set of neurons $J(l)=\{u_{1,l},\dots,u_{W_{l},l}\}$ per layer, a BNN is stacked in the following order.

Input Layer

The first layer consists of neurons $u_{i,1}\in J(1)$ that represent the domain of the learned state transition function $\tilde{T}$, where neurons $u_{1,1},\dots,u_{n,1}\in J(1)$ represent the state variables $S$ and neurons $u_{n+1,1},\dots,u_{n+m,1}\in J(1)$ represent the action variables $A$. During the training of the BNN, the values 0 and 1 of action and state variables are represented by $-1$ and $1$, respectively.

Batch Normalization Layers

For layers $l\in\{2,\dots,L\}$, Batch Normalization [20] sets the weighted sum of outputs at layer $l-1$, $\triangle_{j,l}=\sum_{i\in J(l-1)}w_{i,j,l}y_{i,l-1}$, to the input $x_{j,l}$ of neuron $u_{j,l}\in J(l)$ using the formula $x_{j,l}=\frac{\triangle_{j,l}-\mu_{j,l}}{\sqrt{\sigma^{2}_{j,l}+\epsilon_{j,l}}}\gamma_{j,l}+\beta_{j,l}$, where $y_{i,l-1}$ is the output of neuron $u_{i,l-1}\in J(l-1)$, and the parameters are the weight $w_{i,j,l}$, input mean $\mu_{j,l}$, input variance $\sigma^{2}_{j,l}$, numerical stability constant $\epsilon_{j,l}$, input scaling $\gamma_{j,l}$, and input bias $\beta_{j,l}$, all computed at training time.

Activation Layers

Given input $x_{j,l}$, the activation function computes the output $y_{j,l}$ of neuron $u_{j,l}\in J(l)$ at layer $l\in\{2,\dots,L\}$, which is $1$ if $x_{j,l}\geq 0$ and $-1$ otherwise. The last activation layer consists of neurons $u_{i,L}\in J(L)$ that represent the codomain of the learned state transition function $\tilde{T}$, such that $u_{1,L},\dots,u_{n,L}\in J(L)$ represent the state variables $S$.

The BNN is trained to learn the function $\tilde{T}$ from data that consists of measurements on the domain and codomain of the unknown state transition function $T:D_{s_{1}}\times\dots\times D_{s_{n}}\times D_{a_{1}}\times\dots\times D_{a_{m}}\rightarrow D_{s_{1}}\times\dots\times D_{s_{n}}$.
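To make the layer definitions above concrete, the following minimal numpy sketch (with assumed parameter names and array shapes, not taken from the benchmark generator) computes one batch normalization plus binarized activation step of such a BNN.

```python
import numpy as np

def bnn_layer(y_prev, w, mu, sigma2, eps, gamma, beta):
    """One BNN layer: y_prev are outputs of layer l-1 in {-1, +1};
    w are binarized weights in {-1, +1} with shape (W_{l-1}, W_l)."""
    delta = w.T @ y_prev                                      # weighted sums Delta_{j,l}
    x = (delta - mu) / np.sqrt(sigma2 + eps) * gamma + beta   # batch normalization
    return np.where(x >= 0, 1, -1)                            # binarized activation y_{j,l}

# Toy usage for a 3 -> 2 layer with arbitrary (hypothetical) learned parameters.
rng = np.random.default_rng(0)
y0 = np.array([1, -1, 1])                  # binarized state/action inputs of the first layer
w = rng.choice([-1, 1], size=(3, 2))
y1 = bnn_layer(y0, w, mu=np.zeros(2), sigma2=np.ones(2), eps=1e-5 * np.ones(2),
               gamma=np.ones(2), beta=np.zeros(2))
print(y1)  # a vector in {-1, +1}^2
```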

III The Weighted Partial MaxSAT Model

In this section, we present the weighted partial MaxSAT model [12, 17] of the learned automated planning problem.

III-A Decision Variables

The weighted partial MaxSAT model uses the following decision variables:

  • $X_{i,t}$ encodes whether action variable $a_{i}\in A$ is executed at time step $t\in\{1,\dots,H\}$ or not.

  • $Y_{i,t}$ encodes whether state variable $s_{i}\in S$ is true at time step $t\in\{1,\dots,H+1\}$ or not.

  • $Z_{i,l,t}$ encodes whether neuron $u_{i,l}\in J(l)$ in layer $l\in\{1,\dots,L\}$ is active at time step $t\in\{1,\dots,H\}$ or not.

III-B Parameters

The weighted partial MaxSAT model uses the following parameters:

  • $\bar{w}_{i,j,l}$ is the value of the learned BNN weight between neuron $u_{i,l-1}\in J(l-1)$ and neuron $u_{j,l}\in J(l)$ in layer $l\in\{2,\dots,L\}$.

  • $B(j,l)$ is the value of the bias for neuron $u_{j,l}\in J(l)$ in layer $l\in\{2,\dots,L\}$. Given the values of the learned parameters $\bar{\mu}_{j,l}$, $\bar{\sigma}^{2}_{j,l}$, $\bar{\epsilon}_{j,l}$, $\bar{\gamma}_{j,l}$ and $\bar{\beta}_{j,l}$, the bias is computed as $B(j,l)=\left\lceil\frac{\bar{\beta}_{j,l}\sqrt{\bar{\sigma}^{2}_{j,l}+\bar{\epsilon}_{j,l}}}{\bar{\gamma}_{j,l}}-\bar{\mu}_{j,l}\right\rceil$ (see the sketch after this list).

  • $r^{s}_{i}\in\mathbb{R}$ and $r^{a}_{i}\in\mathbb{R}$ are constants of the reward function $R$, which is of the form $\sum_{i=1}^{n}r^{s}_{i}s_{i}+\sum_{i=1}^{m}r^{a}_{i}a_{i}$.

  • $c^{s}_{i,j}\in\mathbb{Z}$, $c^{a}_{i,j}\in\mathbb{Z}$ and $c^{k}_{j}\in\mathbb{Z}$ are constants of the set of linear constraints that represent the global function $C$, where each linear constraint $j\in\{1,\dots,J_{C}\}$ is of the form $\sum_{i=1}^{n}c^{s}_{i,j}s_{i}+\sum_{i=1}^{m}c^{a}_{i,j}a_{i}\leq c^{k}_{j}$.

  • $g^{s}_{i,j}\in\mathbb{Z}$ and $g^{k}_{j}\in\mathbb{Z}$ are constants of the set of linear constraints that represent the goal state function $G$, where each linear constraint $j\in\{1,\dots,J_{G}\}$ is of the form $\sum_{i=1}^{n}g^{s}_{i,j}s_{i}\leq g^{k}_{j}$.
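The role of $B(j,l)$ is to fold the learned batch normalization parameters into a single integer offset on the weighted sum $\triangle_{j,l}$, which is what the activation constraint (7) below tests. A minimal sketch of its computation, with hypothetical parameter values:

```python
import math

# A minimal sketch (hypothetical parameter values): folding the learned batch
# normalization parameters into one integer offset lets the encoding replace
# the real-valued sign test on x_{j,l} with the integer test
# Delta_{j,l} + B(j,l) >= 0 over the weighted sum Delta_{j,l} (see constraint (7)).

def bias(beta, sigma2, eps, gamma, mu):
    """B(j, l) = ceil(beta * sqrt(sigma2 + eps) / gamma - mu)."""
    return math.ceil(beta * math.sqrt(sigma2 + eps) / gamma - mu)

print(bias(beta=0.3, sigma2=2.0, eps=1e-5, gamma=1.5, mu=-0.7))  # 1
```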

III-C Hard Clauses

The weighted partial MaxSAT model uses the following hard clauses.

Initial State Clauses

The following conjunction of hard clauses sets the initial value $V_{i}$ of each state variable $s_{i}\in S$.

$\bigwedge_{i=1}^{n}(\neg Y_{i,1}\vee V_{i})\wedge(Y_{i,1}\vee\neg V_{i})$   (1)

Goal State Clauses

The following conjunction of hard clauses encodes the set of linear constraints that represent the goal state function $G$.

$\bigwedge_{j=1}^{J_{G}}Card\bigl(\sum_{i=1}^{n}g^{s}_{i,j}Y_{i,H+1}\leq g^{k}_{j}\bigr)$   (2)

In the above notation, $Card$ produces the CNF encoding of a given linear constraint [21].
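As an illustration of what $Card$ stands for, the following minimal sketch uses the PySAT toolkit (with its optional pypblib backend; this is not the encoding of [21]) to compile a hypothetical linear goal constraint over the $Y_{i,H+1}$ variables into CNF clauses.

```python
from pysat.formula import IDPool
from pysat.pb import PBEnc  # pseudo-Boolean encoder; assumes the pypblib backend is installed

vpool = IDPool()
n, H = 3, 4
y = {i: vpool.id(('Y', i, H + 1)) for i in range(1, n + 1)}  # goal-step state literals

# Hypothetical goal constraint: 2*s_1 + s_2 + s_3 <= 2 at time step H + 1.
cnf = PBEnc.leq(lits=[y[1], y[2], y[3]], weights=[2, 1, 1], bound=2, vpool=vpool)
print(cnf.clauses)  # hard clauses to be added to the weighted partial MaxSAT model
```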

Global Clauses

The following conjunction of hard clauses encodes the set of linear constraints that represent the global function $C$.

$\bigwedge_{j=1}^{J_{C}}\bigwedge_{t=1}^{H}Card\bigl(\sum_{i=1}^{n}c^{s}_{i,j}Y_{i,t}+\sum_{i=1}^{m}c^{a}_{i,j}X_{i,t}\leq c^{k}_{j}\bigr)$   (3)

BNN Clauses

The following conjunction of hard clauses maps the input and the output of the BNN onto the state and action variables.

$\bigwedge_{i=1}^{n}\bigwedge_{t=1}^{H}(\neg Y_{i,t}\vee Z_{i,1,t})\wedge(Y_{i,t}\vee\neg Z_{i,1,t})$   (4)
$\bigwedge_{i=1}^{m}\bigwedge_{t=1}^{H}(\neg X_{i,t}\vee Z_{i+n,1,t})\wedge(X_{i,t}\vee\neg Z_{i+n,1,t})$   (5)
$\bigwedge_{i=1}^{n}\bigwedge_{t=1}^{H}(\neg Y_{i,t+1}\vee Z_{i,L,t})\wedge(Y_{i,t+1}\vee\neg Z_{i,L,t})$   (6)

Finally, the following conjunction of hard clauses encodes the activation function of each neuron in the learned BNN.

$\bigwedge_{l=2}^{L}\bigwedge_{u_{j,l}\in J(l)}\bigwedge_{t=1}^{H}Act\Bigl(\bigl(\sum_{u_{i,l-1}\in J(l-1)}\bar{w}_{i,j,l}(2Z_{i,l-1,t}-1)+B(j,l)\geq 0\bigr)=Z_{j,l,t}\Bigr)$   (7)

In the above notation, $Act$ produces the CNF encoding of a given biconditional constraint [17] by extending the CNF encoding of Cardinality Networks [22].
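The reduction behind $Act$ can be made explicit. Since every weight $\bar{w}_{i,j,l}$ and every term $2Z_{i,l-1,t}-1$ takes the value $+1$ or $-1$, the sum in (7) equals $2k-W_{l-1}$, where $k$ counts the inputs whose literal agrees with the sign of its weight; hence the neuron is active if and only if at least $\lceil(W_{l-1}-B(j,l))/2\rceil$ of those literals are true, which is exactly the kind of constraint Cardinality Networks [22] encode. A minimal sketch (with hypothetical helper names) that computes the agreeing literals and the threshold:

```python
from math import ceil

def act_threshold(weights, bias):
    """Return (literals, k): the neuron is active iff at least k literals are true.
    For weight +1 the literal is Z_i itself; for weight -1 it is the negation of Z_i."""
    W = len(weights)
    literals = [(i, w == 1) for i, w in enumerate(weights)]  # (input index, positive literal?)
    # sum_i w_i * (2*z_i - 1) + bias >= 0  <=>  2*agree - W + bias >= 0
    #                                      <=>  agree >= ceil((W - bias) / 2)
    k = ceil((W - bias) / 2)
    return literals, k

lits, k = act_threshold(weights=[1, -1, 1, 1, -1], bias=1)
print(k)  # 2: the neuron fires iff at least 2 of the 5 literals are true
```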

III-D Soft Clauses

The weighted partial MaxSAT model uses the following soft clauses.

Reward Clauses

The following conjunction of soft clauses encodes the reward function $R$.

$\bigwedge_{t=1}^{H}\Bigl(\bigwedge_{i=1}^{n}(r^{s}_{i},Y_{i,t+1})\wedge\bigwedge_{i=1}^{m}(r^{a}_{i},X_{i,t})\Bigr)$   (8)
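In practice, each pair in (8) becomes a unit soft clause whose weight is the corresponding reward constant. The following minimal sketch (using the PySAT toolkit with hypothetical reward constants; not the benchmark generator itself) shows how such soft clauses can be collected into a WCNF formula.

```python
from pysat.formula import WCNF, IDPool

vpool = IDPool()
H, n, m = 2, 1, 2
r_s, r_a = [3], [1, 1]  # hypothetical reward constants r^s_i and r^a_i
wcnf = WCNF()
for t in range(1, H + 1):
    for i in range(1, n + 1):
        wcnf.append([vpool.id(('Y', i, t + 1))], weight=r_s[i - 1])  # soft clause (r^s_i, Y_{i,t+1})
    for i in range(1, m + 1):
        wcnf.append([vpool.id(('X', i, t))], weight=r_a[i - 1])      # soft clause (r^a_i, X_{i,t})

# Hard clauses (1)-(7) would be appended with wcnf.append(clause) without a weight.
# A MaxSAT solver then maximizes the total weight of satisfied soft clauses.
wcnf.to_file('reward_only.wcnf')
```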

IV Benchmark Domain Descriptions

In this section, we provide detailed descriptions of the four learned automated planning problems, namely Navigation [23], Inventory Control [24], System Administrator [25] and Cellda [17]. The benchmark repository is available at https://github.com/saybuser/FD-SAT-Plan.

TABLE I: The BNN architectures of all four learned automated planning problems.
Problem BNN Structure
Discrete Navigation (N=3) 13:36:36:9
Discrete Navigation (N=4) 20:96:96:16
Discrete Navigation (N=5) 29:128:128:25
Inventory Control (N=2) 7:96:96:5
Inventory Control (N=4) 8:128:128:5
System Administrator (N=4) 16:128:128:12
System Administrator (N=5) 20:128:128:128:15
Cellda (policy=x-axis) 12:256:256:4
Cellda (policy=y-axis) 12:256:256:4

Navigation

The Navigation [23] task for an agent in a two-dimensional ($N$-by-$N$, where $N\in\mathbb{Z}^{+}$) maze is cast as an automated planning problem as follows.

  • The location of the agent is represented by $N^{2}$ state variables $S=\{s_{1},\dots,s_{N^{2}}\}$ where state variable $s_{i}$ represents whether the agent is located at position $i\in\{1,\dots,N^{2}\}$ or not.

  • The intended movement of the agent is represented by four action variables $A=\{a_{1},a_{2},a_{3},a_{4}\}$ where action variables $a_{1}$, $a_{2}$, $a_{3}$ and $a_{4}$ represent whether the agent attempts to move up, down, right or left, respectively.

  • Mutual exclusion on the intended movement of the agent is represented by the global function as follows (a CNF encoding sketch of this constraint is given at the end of this subsection).

    $C(\langle s_{1},\dots,a_{4}\rangle)=\begin{cases}\mathit{true},&\text{if }a_{1}+a_{2}+a_{3}+a_{4}\leq 1\\ \mathit{false},&\text{otherwise}\end{cases}$
  • The initial location of the agent is $s_{i}=V_{i}$ for all positions $i\in\{1,\dots,N^{2}\}$.

  • The final location of the agent is represented by the goal state function as follows.

    $G(\langle s_{1},\dots,s_{N^{2}}\rangle)=\begin{cases}\mathit{true},&\text{if }s_{i}=V^{\prime}_{i}\;\;\forall i\in\{1,\dots,N^{2}\}\\ \mathit{false},&\text{otherwise}\end{cases}$

    where $V^{\prime}_{i}$ denotes the goal location of the agent (i.e., $V^{\prime}_{i}=\mathit{true}$ if and only if position $i\in\{1,\dots,N^{2}\}$ is the final location, and $V^{\prime}_{i}=\mathit{false}$ otherwise).

  • The objective is to minimize the total number of intended movements by the agent, and is represented by the reward function as follows.

    $R(\langle s_{1},\dots,a_{4}\rangle)=a_{1}+a_{2}+a_{3}+a_{4}$
  • The next location of the agent is represented by the state transition function $T$ that is a complex function of state and action variables $s_{1},\dots,s_{N^{2}},a_{1},\dots,a_{4}$. The unknown function $T$ is approximated by a BNN $\tilde{T}$, and the details of $\tilde{T}$ are provided in Table I.

We submitted problems with $N=3,4,5$ over planning horizons $H=4,\dots,10$. Note that this automated planning problem is a deterministic version of its original from IPPC 2011 [23].
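As a concrete instance of the global clauses (3) for this domain, the following minimal sketch (using the PySAT toolkit rather than the encoding of [21]) compiles the mutual exclusion constraint $a_{1}+a_{2}+a_{3}+a_{4}\leq 1$ into CNF for every time step.

```python
from pysat.card import CardEnc, EncType
from pysat.formula import IDPool

vpool = IDPool()
H = 4
clauses = []
for t in range(1, H + 1):
    acts = [vpool.id(('X', i, t)) for i in range(1, 5)]  # up, down, right, left at step t
    amo = CardEnc.atmost(lits=acts, bound=1, vpool=vpool, encoding=EncType.pairwise)
    clauses += amo.clauses
print(len(clauses))  # 6 pairwise clauses per time step, 24 in total
```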

Inventory Control

Inventory Control [24] is the problem of managing inventory of a product with demand cycle length $N\in\mathbb{Z}^{+}$, and is cast as an automated planning problem as follows.

  • The inventory level of the product, phase of the demand cycle and whether demand is met or not are represented by three state variables $S=\{s_{1},s_{2},s_{3}\}$ where state variables $s_{1}$ and $s_{2}$ have non-negative integer domains.

  • Ordering some fixed amount of inventory is represented by an action variable $A=\{a_{1}\}$.

  • Meeting the demand is represented by the global function as follows.

    $C(\langle s_{1},s_{2},s_{3},a_{1}\rangle)=\begin{cases}\mathit{true},&\text{if }s_{3}=\mathit{true}\\ \mathit{false},&\text{otherwise}\end{cases}$
  • The inventory, the phase of the demand cycle and meeting the demand are set to their initial values $s_{i}=V_{i}$ for all $i\in\{1,2,3\}$.

  • Meeting the final demand is represented by the goal state function as follows.

    $G(\langle s_{1},s_{2},s_{3}\rangle)=\begin{cases}\mathit{true},&\text{if }s_{3}=\mathit{true}\\ \mathit{false},&\text{otherwise}\end{cases}$
  • The objective is to minimize the total storage cost and is represented by the reward function as follows.

    $R(\langle s_{1},s_{2},s_{3},a_{1}\rangle)=c\,s_{1}$

    where $c$ denotes the unit storage cost.

  • The next inventory level, the next phase of the demand cycle and whether the next demand is met or not are represented by the state transition function $T$ that is a complex function of state and action variables $s_{1},s_{2},s_{3},a_{1}$. The unknown function $T$ is approximated by a BNN $\tilde{T}$, and the details of $\tilde{T}$ are provided in Table I.

We submitted problems with two demand cycle lengths $N\in\{2,4\}$ over planning horizons $H=5,\dots,8$. The values of the parameters are chosen as $m_{1}=4$ and $m_{2}=0$.

System Administrator

System Administrator [25, 23] is the problem of maintaining a computer network of size $N$ and is cast as an automated planning problem as follows.

  • The age of computer $i\in\{1,\dots,N\}$, and whether computer $i\in\{1,\dots,N\}$ is running or not, are represented by $2N$ state variables $S=\{s_{1},\dots,s_{2N}\}$ where state variables $s_{1},\dots,s_{N}$ have non-negative integer domains.

  • Rebooting computers $i\in\{1,\dots,N\}$ is represented by $N$ action variables $A=\{a_{1},\dots,a_{N}\}$.

  • The bound on the number of computers that can be rebooted and the requirement that all computers must be running are represented by the global function as follows.

    $C(\langle s_{1},\dots,a_{N}\rangle)=\begin{cases}\mathit{true},&\text{if }\sum_{i=1}^{N}a_{i}\leq a^{max}\text{ and }s_{i}=\mathit{true}\;\;\forall i\in\{N+1,\dots,2N\}\\ \mathit{false},&\text{otherwise}\end{cases}$

    where $a^{max}$ is the maximum number of computers that can be rebooted at a given time step.

  • The age of computer $i\in\{1,\dots,N\}$, and whether computer $i\in\{1,\dots,N\}$ is running or not, are set to their initial values $s_{i}=V_{i}$ for all $i\in\{1,\dots,2N\}$.

  • The requirement that all computers must be running in the end is represented by the goal state function as follows.

    $G(\langle s_{1},\dots,s_{2N}\rangle)=\begin{cases}\mathit{true},&\text{if }s_{i}=\mathit{true}\;\;\forall i\in\{N+1,\dots,2N\}\\ \mathit{false},&\text{otherwise}\end{cases}$
  • The objective is to minimize the total number of reboots and is represented by the reward function as follows.

    $R(\langle s_{1},\dots,s_{2N},a_{1},\dots,a_{N}\rangle)=\sum_{i=1}^{N}a_{i}$
  • The next age of computer $i\in\{1,\dots,N\}$, and whether computer $i\in\{1,\dots,N\}$ will be running or not, are represented by the state transition function $T$ that is a complex function of state and action variables $s_{1},\dots,s_{2N},a_{1},\dots,a_{N}$. The unknown function $T$ is approximated by a BNN $\tilde{T}$, and the details of $\tilde{T}$ are provided in Table I.

We submitted problems with $N\in\{4,5\}$ computers over planning horizons $H=2,3,4$. The values of the parameters are chosen as $m_{1}=3$ and $m_{2}=0$.

Cellda

Influenced by the famous video game [26], Cellda [17] is the task of an agent who must escape from a two-dimensional ($N$-by-$N$, where $N\in\mathbb{Z}^{+}$) cell through a locked door by obtaining the key without getting hit by the enemy, and is cast as an automated planning problem as follows.

  • The location of the agent, the location of the enemy, whether the key is obtained or not and whether the agent is alive or not are represented by six state variables $S=\{s_{1},\dots,s_{6}\}$ where state variables $s_{1}$ and $s_{2}$ represent the horizontal and vertical locations of the agent, state variables $s_{3}$ and $s_{4}$ represent the horizontal and vertical locations of the enemy, state variable $s_{5}$ represents whether the key is obtained or not, and state variable $s_{6}$ represents whether the agent is alive or not. State variables $s_{1}$, $s_{2}$, $s_{3}$ and $s_{4}$ have positive integer domains.

  • The intended movement of the agent is represented by four action variables $A=\{a_{1},a_{2},a_{3},a_{4}\}$ where action variables $a_{1}$, $a_{2}$, $a_{3}$ and $a_{4}$ represent whether the agent intends to move up, down, right or left, respectively.

  • Mutual exclusion on the intended movement of the agent, the boundaries of the maze and the requirement that the agent must be alive are represented by the global function as follows.

    $C(\langle s_{1},\dots,a_{4}\rangle)=\begin{cases}\mathit{true},&\text{if }a_{1}+a_{2}+a_{3}+a_{4}\leq 1\text{ and }0\leq s_{i}<N\;\;\forall i\in\{1,2\}\text{ and }s_{6}=\mathit{true}\\ \mathit{false},&\text{otherwise}\end{cases}$
  • The location of the agent, the location of the enemy, whether the key is obtained or not, and whether the agent is alive or not are set to their initial values $s_{i}=V_{i}$ for all $i\in\{1,\dots,6\}$.

  • The goal location of the agent (i.e., the location of the door), the requirement that the agent must be alive in the end and the requirement that the key must be obtained are represented by the goal state function as follows.

    $G(\langle s_{1},\dots,s_{6}\rangle)=\begin{cases}\mathit{true},&\text{if }s_{1}=V^{\prime}_{1}\text{ and }s_{2}=V^{\prime}_{2}\text{ and }s_{5}=\mathit{true}\text{ and }s_{6}=\mathit{true}\\ \mathit{false},&\text{otherwise}\end{cases}$

    where $V^{\prime}_{1}$ and $V^{\prime}_{2}$ denote the goal location of the agent (i.e., the location of the door).

  • The objective is to minimize the total number of intended movements by the agent and is represented by the reward function as follows.

    $R(\langle s_{1},\dots,a_{4}\rangle)=a_{1}+a_{2}+a_{3}+a_{4}$
  • The next location of the agent, the next location of the enemy, whether the key will be obtained or not, and whether the agent will be alive or not, are represented by the state transition function $T$ that is a complex function of state and action variables $s_{1},\dots,s_{6},a_{1},\dots,a_{4}$. The unknown function $T$ is approximated by a BNN $\tilde{T}$, and the details of $\tilde{T}$ are provided in Table I.

We submitted problems with maze size $N=4$ over planning horizons $H=8,\dots,12$ with two different enemy policies. The values of the parameters are chosen as $m_{1}=2$ and $m_{2}=0$.

References

  • [1] D. Nau, M. Ghallab, and P. Traverso, Automated Planning: Theory & Practice.   San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2004.
  • [2] H. Kautz and B. Selman, “Planning as satisfiability,” in Proceedings of the Tenth European Conference on Artificial Intelligence, ser. ECAI’92, 1992, pp. 359–363.
  • [3] J. Hoffmann and B. Nebel, “The FF planning system: Fast plan generation through heuristic search,” in Journal of Artificial Intelligence Research, vol. 14.   USA: AI Access Foundation, 2001, pp. 253–302.
  • [4] M. Helmert, “The fast downward planning system,” in Journal of Artificial Intelligence Research, vol. 26.   USA: AI Access Foundation, 2006, pp. 191–246.
  • [5] F. Pommerening, G. Röger, M. Helmert, and B. Bonet, “LP-based heuristics for cost-optimal planning,” in Proceedings of the Twenty-Fourth International Conference on Automated Planning and Scheduling, ser. ICAPS’14.   AAAI Press, 2014, pp. 226–234.
  • [6] T. O. Davies, A. R. Pearce, P. J. Stuckey, and N. Lipovetzky, “Sequencing operator counts,” in Proceedings of the Twenty-Fifth International Conference on Automated Planning and Scheduling.   AAAI Press, 2015, pp. 61–69.
  • [7] W.-M. Shen and H. A. Simon, “Rule creation and rule learning through environmental exploration,” in Proceedings of the Eleventh International Joint Conference on Artificial Intelligence, ser. IJCAI’89.   San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1989, pp. 675–680.
  • [8] Y. Gil, “Acquiring domain knowledge for planning by experimentation,” Ph.D. dissertation, Carnegie Mellon University, USA, 1992.
  • [9] S. W. Bennett and G. F. DeJong, “Real-world robotics: Learning to plan for robust execution,” in Machine Learning, vol. 23, 1996, pp. 121–161.
  • [10] S. S. Benson, “Learning action models for reactive autonomous agents,” Ph.D. dissertation, Stanford University, Stanford, CA, USA, 1997.
  • [11] B. Say, G. Wu, Y. Q. Zhou, and S. Sanner, “Nonlinear hybrid planning with deep net learned transition models and mixed-integer linear programming,” in Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, ser. IJCAI’17, 2017, pp. 750–756.
  • [12] B. Say and S. Sanner, “Planning in factored state and action spaces with learned binarized neural network transition models,” in Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, ser. IJCAI’18, 2018, pp. 4815–4821.
  • [13] B. Say, “Optimal planning with learned neural network transition models,” Ph.D. dissertation, University of Toronto, Toronto, ON, Canada, 2020.
  • [14] B. Say, S. Sanner, and S. Thiébaux, “Reward potentials for planning with learned neural network transition models,” in Proceedings of the Twenty-Fifth International Conference on Principles and Practice of Constraint Programming, T. Schiex and S. de Givry, Eds.   Cham: Springer International Publishing, 2019, pp. 674–689.
  • [15] G. Wu, B. Say, and S. Sanner, “Scalable planning with deep neural network learned transition models,” Journal of Artificial Intelligence Research, vol. 68, pp. 571–606, 2020.
  • [16] B. Say, “A unified framework for planning with learned neural network transition models,” in Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence, 2021, pp. 5016–5024.
  • [17] B. Say and S. Sanner, “Compact and efficient encodings for planning in factored state and action spaces with learned binarized neural network transition models,” Artificial Intelligence, vol. 285, p. 103291, 2020.
  • [18] B. Say, J. Devriendt, J. Nordström, and P. Stuckey, “Theoretical and experimental results for planning with learned binarized neural network transition models,” in Proceedings of the Twenty-Sixth International Conference on Principles and Practice of Constraint Programming, H. Simonis, Ed.   Cham: Springer International Publishing, 2020, pp. 917–934.
  • [19] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, “Binarized neural networks,” in Proceedings of the Thirtieth International Conference on Neural Information Processing Systems.   USA: Curran Associates Inc., 2016, pp. 4114–4122.
  • [20] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proceedings of the Thirty-Second International Conference on International Conference on Machine Learning, ser. ICML.   JMLR.org, 2015, pp. 448–456.
  • [21] I. Abío and P. Stuckey, “Encoding linear constraints into SAT,” in Principles and Practice of Constraint Programming.   Springer Int Publishing, 2014, pp. 75–91.
  • [22] R. Asín, R. Nieuwenhuis, A. Oliveras, and E. Rodríguez-Carbonell, “Cardinality networks and their applications,” in International Conference on Theory and Applications of Satisfiability Testing, 2009, pp. 167–180.
  • [23] S. Sanner and S. Yoon, “International probabilistic planning competition,” 2011.
  • [24] T. Mann and S. Mannor, “Scaling up approximate value iteration with options: Better policies with fewer iterations,” in Proceedings of the Thirty-First International Conference on Machine Learning, ser. Machine Learning Research, E. P. Xing and T. Jebara, Eds., vol. 32.   Bejing, China: PMLR, 2014, pp. 127–135.
  • [25] C. Guestrin, D. Koller, and R. Parr, “Max-norm projections for factored MDPs,” in Seventeenth International Joint Conferences on Artificial Intelligence, 2001, pp. 673–680.
  • [26] Nintendo, “The legend of zelda,” 1986.