
Planning with Learned Binarized Neural Networks Benchmarks for MaxSAT Evaluation 2021

Buser Say Monash University
Melbourne, Australia
[email protected]
   Scott Sanner University of Toronto
Toronto, Canada
[email protected]
   Jo Devriendt KU Leuven
Leuven, Belgium
[email protected]
   Jakob Nordström University of Copenhagen
Copenhagen, Denmark
[email protected]
   Peter J. Stuckey Monash University
Melbourne, Australia
[email protected]
Abstract

This document provides a brief introduction to the learned automated planning problem, where the state transition function is in the form of a binarized neural network (BNN), presents a general MaxSAT encoding for this problem, and describes the four domains, namely Navigation, Inventory Control, System Administrator and Cellda, that are submitted as benchmarks for the MaxSAT Evaluation 2021.

Index Terms:
binarized neural networks, automated planning

I Introduction

Automated planning studies the reasoning side of acting in Artificial Intelligence, and automates the selection and ordering of actions to reach desired states of the world as best as possible [1]. An automated planning problem represents the dynamics of the real world using a model, which can either be manually encoded [2, 3, 4, 5, 6] or learned from data [7, 8, 9, 10]. In this document, we focus on the latter.

Automated planning with deep neural network (DNN) learned state transition functions is a two stage data-driven framework for learning and solving automated planning problems with unknown state transition functions [11, 12, 13]. The first stage of the framework learns the unknown state transition function from data as a DNN. The second stage of the framework plans optimally with respect to the learned DNN by solving an equivalent optimization problem (e.g., a mixed-integer programming model [11, 14, 15, 16], a 0–1 integer programming model [12, 17], a weighted partial MaxSAT model [12, 17], a constraint programming model [18], or a pseudo-Boolean optimization model [18]). In this document, we focus on the second stage of the data-driven framework where the learned DNN is a binarized neural network (BNN) [19].

The remainder of this document is organized as follows. We begin with the description of the learned automated planning problem and the binarized neural network (BNN). Then we present the weighted partial MaxSAT model of the general learned automated planning problem, and conclude with the description of the four learned automated planning domains, namely Navigation, Inventory Control, System Administrator and Cellda, that are submitted as benchmarks for the MaxSAT Evaluation 2021.

II Automated Planning with Learned Binarized Neural Network State Transitions

II-A Problem Definition

A fixed-horizon learned deterministic automated planning problem [11, 12, 18] is a tuple $\tilde{\Pi}=\langle S,A,C,\tilde{T},V,G,R,H\rangle$, where $S=\{s_{1},\dots,s_{n}\}$ and $A=\{a_{1},\dots,a_{m}\}$ are sets of state and action variables for positive integers $n,m\in\mathbb{Z}^{+}$ with domains $D_{s_{1}},\dots,D_{s_{n}}$ and $D_{a_{1}},\dots,D_{a_{m}}$ respectively. Moreover, $C:D_{s_{1}}\times\dots\times D_{s_{n}}\times D_{a_{1}}\times\dots\times D_{a_{m}}\rightarrow\{\mathit{true},\mathit{false}\}$ is the global function, $\tilde{T}:D_{s_{1}}\times\dots\times D_{s_{n}}\times D_{a_{1}}\times\dots\times D_{a_{m}}\rightarrow D_{s_{1}}\times\dots\times D_{s_{n}}$ is the learned state transition function, and $R:D_{s_{1}}\times\dots\times D_{s_{n}}\times D_{a_{1}}\times\dots\times D_{a_{m}}\rightarrow\mathbb{R}$ is the reward function. Finally, $V$ is a tuple of constants $\langle V_{1},\dots,V_{n}\rangle\in D_{s_{1}}\times\dots\times D_{s_{n}}$ denoting the initial values of all state variables, $G:D_{s_{1}}\times\dots\times D_{s_{n}}\rightarrow\{\mathit{true},\mathit{false}\}$ is the goal state function, and $H\in\mathbb{Z}^{+}$ is the planning horizon.

A solution to (i.e., a plan for) $\tilde{\Pi}$ is a tuple of values $\bar{A}^{t}=\langle\bar{a}^{t}_{1},\dots,\bar{a}^{t}_{m}\rangle\in D_{a_{1}}\times\dots\times D_{a_{m}}$ for all action variables $A$ over time steps $t\in\{1,\dots,H\}$ such that $\tilde{T}(\langle\bar{s}^{t}_{1},\dots,\bar{s}^{t}_{n},\bar{a}^{t}_{1},\dots,\bar{a}^{t}_{m}\rangle)=\langle\bar{s}^{t+1}_{1},\dots,\bar{s}^{t+1}_{n}\rangle$ and $C(\langle\bar{s}^{t}_{1},\dots,\bar{s}^{t}_{n},\bar{a}^{t}_{1},\dots,\bar{a}^{t}_{m}\rangle)=\mathit{true}$ hold for time steps $t\in\{1,\dots,H\}$, $V_{i}=\bar{s}^{1}_{i}$ for all $s_{i}\in S$, and $G(\langle\bar{s}^{H+1}_{1},\dots,\bar{s}^{H+1}_{n}\rangle)=\mathit{true}$. It has been shown that finding a feasible solution to $\tilde{\Pi}$ is NP-complete [18]. An optimal solution to (i.e., an optimal plan for) $\tilde{\Pi}$ is a solution such that the total reward $\sum_{t=1}^{H}R(\langle\bar{s}^{t+1}_{1},\dots,\bar{s}^{t+1}_{n},\bar{a}^{t}_{1},\dots,\bar{a}^{t}_{m}\rangle)$ is maximized.

We assume that the domains of action and state variables are binary unless otherwise stated (when the domain of a variable is not binary, e.g., in Inventory Control, we use the approximation $x\approx(-2^{m_{1}-1}x_{m_{1}}+\sum_{i=1}^{m_{1}-1}2^{i-1}x_{i})\cdot 10^{m_{2}}$ for integers $m_{1}\in\mathbb{Z}^{+}$ and $m_{2}\in\mathbb{Z}$). We further assume that the functions $C$, $G$, $R$ and the function $\tilde{T}$ are known, that the functions $C$ and $G$ can be equivalently represented by $J_{C}\in\mathbb{Z}^{+}$ and $J_{G}\in\mathbb{Z}^{+}$ linear constraints respectively, that the function $R$ is a linear expression, and that the function $\tilde{T}$ is a learned BNN [19].
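As a concrete illustration of this binarization, the following minimal Python sketch (with hypothetical helper names, not taken from the benchmark generator) encodes a bounded integer into $m_{1}$ binary variables and decodes it back using the approximation above.

```python
def decode(bits, m2=0):
    """bits[i] holds x_{i+1} in {0, 1}; the last bit x_{m1} carries weight -2^(m1-1)."""
    m1 = len(bits)
    value = -2 ** (m1 - 1) * bits[-1] + sum(2 ** i * b for i, b in enumerate(bits[:-1]))
    return value * 10 ** m2

def encode(x, m1, m2=0):
    """Two's-complement-style encoding of x / 10^m2 into m1 binary variables."""
    scaled = x // 10 ** m2
    assert -2 ** (m1 - 1) <= scaled < 2 ** (m1 - 1), "value out of range for m1 bits"
    return [(scaled >> i) & 1 for i in range(m1)]

# With m1 = 4 (as used for Inventory Control below), the inventory level 5
# is represented by the assignment [1, 0, 1, 0] since 1 + 4 = 5.
assert decode(encode(5, m1=4)) == 5
assert decode(encode(-3, m1=4)) == -3
```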

II-B Binarized Neural Networks

Binarized neural networks (BNNs) are DNNs with binarized weights and activation functions [19]. Given $L$ layers with layer width $W_{l}$ for each layer $l\in\{1,\dots,L\}$ and a set of neurons $J(l)=\{u_{1,l},\dots,u_{W_{l},l}\}$ per layer, a BNN is stacked in the following order.

Input Layer

The first layer consists of neurons $u_{i,1}\in J(1)$ that represent the domain of the learned state transition function $\tilde{T}$, where neurons $u_{1,1},\dots,u_{n,1}\in J(1)$ represent the state variables $S$ and neurons $u_{n+1,1},\dots,u_{n+m,1}\in J(1)$ represent the action variables $A$. During the training of the BNN, the values 0 and 1 of action and state variables are represented by $-1$ and $1$, respectively.

Batch Normalization Layers

For layers $l\in\{2,\dots,L\}$, Batch Normalization [20] sets the weighted sum of outputs at layer $l-1$, $\triangle_{j,l}=\sum_{i\in J(l-1)}w_{i,j,l}y_{i,l-1}$, to the input $x_{j,l}$ of neuron $u_{j,l}\in J(l)$ using the formula $x_{j,l}=\frac{\triangle_{j,l}-\mu_{j,l}}{\sqrt{\sigma^{2}_{j,l}+\epsilon_{j,l}}}\gamma_{j,l}+\beta_{j,l}$, where $y_{i,l-1}$ is the output of neuron $u_{i,l-1}\in J(l-1)$, and the parameters are the weight $w_{i,j,l}$, input mean $\mu_{j,l}$, input variance $\sigma^{2}_{j,l}$, numerical stability constant $\epsilon_{j,l}$, input scaling $\gamma_{j,l}$, and input bias $\beta_{j,l}$, all computed at training time.

Activation Layers

Given input $x_{j,l}$, the activation function computes the output $y_{j,l}$ of neuron $u_{j,l}\in J(l)$ at layer $l\in\{2,\dots,L\}$, which is $1$ if $x_{j,l}\geq 0$ and $-1$ otherwise. The last activation layer consists of neurons $u_{i,L}\in J(L)$ that represent the codomain of the learned state transition function $\tilde{T}$, such that $u_{1,L},\dots,u_{n,L}\in J(L)$ represent the state variables $S$.

The BNN is trained to learn the function $\tilde{T}$ from data that consists of measurements on the domain and codomain of the unknown state transition function $T:D_{s_{1}}\times\dots\times D_{s_{n}}\times D_{a_{1}}\times\dots\times D_{a_{m}}\rightarrow D_{s_{1}}\times\dots\times D_{s_{n}}$.
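To make the layer definitions above concrete, the following minimal numpy sketch (with assumed parameter names and array shapes, not taken from the benchmark generator) computes one batch normalization plus binarized activation step of such a BNN.

```python
import numpy as np

def bnn_layer(y_prev, w, mu, sigma2, eps, gamma, beta):
    """One BNN layer: y_prev are outputs of layer l-1 in {-1, +1};
    w are binarized weights in {-1, +1} with shape (W_{l-1}, W_l)."""
    delta = w.T @ y_prev                                      # weighted sums Delta_{j,l}
    x = (delta - mu) / np.sqrt(sigma2 + eps) * gamma + beta   # batch normalization
    return np.where(x >= 0, 1, -1)                            # binarized activation y_{j,l}

# Toy usage for a 3 -> 2 layer with arbitrary (hypothetical) learned parameters.
rng = np.random.default_rng(0)
y0 = np.array([1, -1, 1])                  # binarized state/action inputs of the first layer
w = rng.choice([-1, 1], size=(3, 2))
y1 = bnn_layer(y0, w, mu=np.zeros(2), sigma2=np.ones(2), eps=1e-5 * np.ones(2),
               gamma=np.ones(2), beta=np.zeros(2))
print(y1)  # a vector in {-1, +1}^2
```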

III The Weighted Partial MaxSAT Model

In this section, we present the weighted partial MaxSAT model [12, 17] of the learned automated planning problem.

III-A Decision Variables

The weighted partial MaxSAT model uses the following decision variables:

  • $X_{i,t}$ encodes whether action variable $a_{i}\in A$ is executed at time step $t\in\{1,\dots,H\}$ or not.

  • $Y_{i,t}$ encodes whether state variable $s_{i}\in S$ is true at time step $t\in\{1,\dots,H+1\}$ or not.

  • $Z_{i,l,t}$ encodes whether neuron $u_{i,l}\in J(l)$ in layer $l\in\{1,\dots,L\}$ is active at time step $t\in\{1,\dots,H\}$ or not.

III-B Parameters

The weighted partial MaxSAT model uses the following parameters:

  • $\bar{w}_{i,j,l}$ is the value of the learned BNN weight between neuron $u_{i,l-1}\in J(l-1)$ and neuron $u_{j,l}\in J(l)$ in layer $l\in\{2,\dots,L\}$.

  • $B(j,l)$ is the value of the bias for neuron $u_{j,l}\in J(l)$ in layer $l\in\{2,\dots,L\}$. Given the values of the learned parameters $\bar{\mu}_{j,l}$, $\bar{\sigma}^{2}_{j,l}$, $\bar{\epsilon}_{j,l}$, $\bar{\gamma}_{j,l}$ and $\bar{\beta}_{j,l}$, the bias is computed as $B(j,l)=\left\lceil\frac{\bar{\beta}_{j,l}\sqrt{\bar{\sigma}^{2}_{j,l}+\bar{\epsilon}_{j,l}}}{\bar{\gamma}_{j,l}}-\bar{\mu}_{j,l}\right\rceil$ (see the sketch after this list).

  • $r^{s}_{i}\in\mathbb{R}$ and $r^{a}_{i}\in\mathbb{R}$ are constants of the reward function $R$, which is of the form $\sum_{i=1}^{n}r^{s}_{i}s_{i}+\sum_{i=1}^{m}r^{a}_{i}a_{i}$.

  • $c^{s}_{i,j}\in\mathbb{Z}$, $c^{a}_{i,j}\in\mathbb{Z}$ and $c^{k}_{j}\in\mathbb{Z}$ are constants of the set of linear constraints that represent the global function $C$, where each linear constraint $j\in\{1,\dots,J_{C}\}$ is of the form $\sum_{i=1}^{n}c^{s}_{i,j}s_{i}+\sum_{i=1}^{m}c^{a}_{i,j}a_{i}\leq c^{k}_{j}$.

  • $g^{s}_{i,j}\in\mathbb{Z}$ and $g^{k}_{j}\in\mathbb{Z}$ are constants of the set of linear constraints that represent the goal state function $G$, where each linear constraint $j\in\{1,\dots,J_{G}\}$ is of the form $\sum_{i=1}^{n}g^{s}_{i,j}s_{i}\leq g^{k}_{j}$.
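The role of $B(j,l)$ is to fold the learned batch normalization parameters into a single integer offset on the weighted sum $\triangle_{j,l}$, which is what the activation constraint (7) below tests. A minimal sketch of its computation, with hypothetical parameter values:

```python
import math

# A minimal sketch (hypothetical parameter values): folding the learned batch
# normalization parameters into one integer offset lets the encoding replace
# the real-valued sign test on x_{j,l} with the integer test
# Delta_{j,l} + B(j,l) >= 0 over the weighted sum Delta_{j,l} (see constraint (7)).

def bias(beta, sigma2, eps, gamma, mu):
    """B(j, l) = ceil(beta * sqrt(sigma2 + eps) / gamma - mu)."""
    return math.ceil(beta * math.sqrt(sigma2 + eps) / gamma - mu)

print(bias(beta=0.3, sigma2=2.0, eps=1e-5, gamma=1.5, mu=-0.7))  # 1
```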

III-C Hard Clauses

The weighted partial MaxSAT model uses the following hard clauses.

Initial State Clauses

The following conjunction of hard clauses sets the initial value $V_{i}$ of each state variable $s_{i}\in S$.

$\bigwedge_{i=1}^{n}(\neg Y_{i,1}\vee V_{i})\wedge(Y_{i,1}\vee\neg V_{i})$   (1)

Goal State Clauses

The following conjunction of hard clauses encodes the set of linear constraints that represent the goal state function $G$.

$\bigwedge_{j=1}^{J_{G}}Card\bigl(\sum_{i=1}^{n}g^{s}_{i,j}Y_{i,H+1}\leq g^{k}_{j}\bigr)$   (2)

In the above notation, $Card$ produces the CNF encoding of a given linear constraint [21].
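As an illustration of what $Card$ stands for, the following minimal sketch uses the PySAT toolkit (with its optional pypblib backend; this is not the encoding of [21]) to compile a hypothetical linear goal constraint over the $Y_{i,H+1}$ variables into CNF clauses.

```python
from pysat.formula import IDPool
from pysat.pb import PBEnc  # pseudo-Boolean encoder; assumes the pypblib backend is installed

vpool = IDPool()
n, H = 3, 4
y = {i: vpool.id(('Y', i, H + 1)) for i in range(1, n + 1)}  # goal-step state literals

# Hypothetical goal constraint: 2*s_1 + s_2 + s_3 <= 2 at time step H + 1.
cnf = PBEnc.leq(lits=[y[1], y[2], y[3]], weights=[2, 1, 1], bound=2, vpool=vpool)
print(cnf.clauses)  # hard clauses to be added to the weighted partial MaxSAT model
```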

Global Clauses

The following conjunction of hard clauses encodes the set of linear constraints that represent the global function $C$.

$\bigwedge_{j=1}^{J_{C}}\bigwedge_{t=1}^{H}Card\bigl(\sum_{i=1}^{n}c^{s}_{i,j}Y_{i,t}+\sum_{i=1}^{m}c^{a}_{i,j}X_{i,t}\leq c^{k}_{j}\bigr)$   (3)

BNN Clauses

The following conjunction of hard clauses maps the input and the output of the BNN onto the state and action variables.

$\bigwedge_{i=1}^{n}\bigwedge_{t=1}^{H}(\neg Y_{i,t}\vee Z_{i,1,t})\wedge(Y_{i,t}\vee\neg Z_{i,1,t})$   (4)
$\bigwedge_{i=1}^{m}\bigwedge_{t=1}^{H}(\neg X_{i,t}\vee Z_{i+n,1,t})\wedge(X_{i,t}\vee\neg Z_{i+n,1,t})$   (5)
$\bigwedge_{i=1}^{n}\bigwedge_{t=1}^{H}(\neg Y_{i,t+1}\vee Z_{i,L,t})\wedge(Y_{i,t+1}\vee\neg Z_{i,L,t})$   (6)

Finally, the following conjunction of hard clauses encodes the activation function of each neuron in the learned BNN.

$\bigwedge_{l=2}^{L}\bigwedge_{u_{j,l}\in J(l)}\bigwedge_{t=1}^{H}Act\Bigl(\bigl(\sum_{u_{i,l-1}\in J(l-1)}\bar{w}_{i,j,l}(2Z_{i,l-1,t}-1)+B(j,l)\geq 0\bigr)=Z_{j,l,t}\Bigr)$   (7)

In the above notation, $Act$ produces the CNF encoding of a given biconditional constraint [17] by extending the CNF encoding of Cardinality Networks [22].
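The reduction behind $Act$ can be made explicit. Since every weight $\bar{w}_{i,j,l}$ and every term $2Z_{i,l-1,t}-1$ takes the value $+1$ or $-1$, the sum in (7) equals $2k-W_{l-1}$, where $k$ counts the inputs whose literal agrees with the sign of its weight; hence the neuron is active if and only if at least $\lceil(W_{l-1}-B(j,l))/2\rceil$ of those literals are true, which is exactly the kind of constraint Cardinality Networks [22] encode. A minimal sketch (with hypothetical helper names) that computes the agreeing literals and the threshold:

```python
from math import ceil

def act_threshold(weights, bias):
    """Return (literals, k): the neuron is active iff at least k literals are true.
    For weight +1 the literal is Z_i itself; for weight -1 it is the negation of Z_i."""
    W = len(weights)
    literals = [(i, w == 1) for i, w in enumerate(weights)]  # (input index, positive literal?)
    # sum_i w_i * (2*z_i - 1) + bias >= 0  <=>  2*agree - W + bias >= 0
    #                                      <=>  agree >= ceil((W - bias) / 2)
    k = ceil((W - bias) / 2)
    return literals, k

lits, k = act_threshold(weights=[1, -1, 1, 1, -1], bias=1)
print(k)  # 2: the neuron fires iff at least 2 of the 5 literals are true
```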

III-D Soft Clauses

The weighted partial MaxSAT model uses the following soft clauses.

Reward Clauses

The following conjunction of soft clauses encodes the reward function $R$.

$\bigwedge_{t=1}^{H}\Bigl(\bigwedge_{i=1}^{n}(r^{s}_{i},Y_{i,t+1})\wedge\bigwedge_{i=1}^{m}(r^{a}_{i},X_{i,t})\Bigr)$   (8)
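In practice, each pair in (8) becomes a unit soft clause whose weight is the corresponding reward constant. The following minimal sketch (using the PySAT toolkit with hypothetical reward constants; not the benchmark generator itself) shows how such soft clauses can be collected into a WCNF formula.

```python
from pysat.formula import WCNF, IDPool

vpool = IDPool()
H, n, m = 2, 1, 2
r_s, r_a = [3], [1, 1]  # hypothetical reward constants r^s_i and r^a_i
wcnf = WCNF()
for t in range(1, H + 1):
    for i in range(1, n + 1):
        wcnf.append([vpool.id(('Y', i, t + 1))], weight=r_s[i - 1])  # soft clause (r^s_i, Y_{i,t+1})
    for i in range(1, m + 1):
        wcnf.append([vpool.id(('X', i, t))], weight=r_a[i - 1])      # soft clause (r^a_i, X_{i,t})

# Hard clauses (1)-(7) would be appended with wcnf.append(clause) without a weight.
# A MaxSAT solver then maximizes the total weight of satisfied soft clauses.
wcnf.to_file('reward_only.wcnf')
```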

IV Benchmark Domain Descriptions

In this section, we provide detailed descriptions of the four learned automated planning problems, namely Navigation [23], Inventory Control [24], System Administrator [25] and Cellda [17]. The benchmark repository is available at https://github.com/saybuser/FD-SAT-Plan.

TABLE I: The BNN architectures of all four learned automated planning problems.
Problem BNN Structure
Discrete Navigation (N=3) 13:36:36:9
Discrete Navigation (N=4) 20:96:96:16
Discrete Navigation (N=5) 29:128:128:25
Inventory Control (N=2) 7:96:96:5
Inventory Control (N=4) 8:128:128:5
System Administrator (N=4) 16:128:128:12
System Administrator (N=5) 20:128:128:128:15
Cellda (policy=x-axis) 12:256:256:4
Cellda (policy=y-axis) 12:256:256:4

Navigation

The Navigation [23] task for an agent in a two-dimensional ($N$-by-$N$, where $N\in\mathbb{Z}^{+}$) maze is cast as an automated planning problem as follows.

  • The location of the agent is represented by $N^{2}$ state variables $S=\{s_{1},\dots,s_{N^{2}}\}$ where state variable $s_{i}$ represents whether the agent is located at position $i\in\{1,\dots,N^{2}\}$ or not.

  • The intended movement of the agent is represented by four action variables $A=\{a_{1},a_{2},a_{3},a_{4}\}$ where action variables $a_{1}$, $a_{2}$, $a_{3}$ and $a_{4}$ represent whether the agent attempts to move up, down, right or left, respectively.

  • Mutual exclusion on the intended movement of the agent is represented by the global function as follows (a CNF encoding sketch of this constraint is given at the end of this subsection).

    $C(\langle s_{1},\dots,a_{4}\rangle)=\begin{cases}\mathit{true},&\text{if }a_{1}+a_{2}+a_{3}+a_{4}\leq 1\\ \mathit{false},&\text{otherwise}\end{cases}$
  • The initial location of the agent is $s_{i}=V_{i}$ for all positions $i\in\{1,\dots,N^{2}\}$.

  • The final location of the agent is represented by the goal state function as follows.

    $G(\langle s_{1},\dots,s_{N^{2}}\rangle)=\begin{cases}\mathit{true},&\text{if }s_{i}=V^{\prime}_{i}\;\;\forall i\in\{1,\dots,N^{2}\}\\ \mathit{false},&\text{otherwise}\end{cases}$

    where $V^{\prime}_{i}$ denotes the goal location of the agent (i.e., $V^{\prime}_{i}=\mathit{true}$ if and only if position $i\in\{1,\dots,N^{2}\}$ is the final location, and $V^{\prime}_{i}=\mathit{false}$ otherwise).

  • The objective is to minimize the total number of intended movements by the agent, and is represented by the reward function as follows.

    $R(\langle s_{1},\dots,a_{4}\rangle)=a_{1}+a_{2}+a_{3}+a_{4}$
  • The next location of the agent is represented by the state transition function $T$ that is a complex function of state and action variables $s_{1},\dots,s_{N^{2}},a_{1},\dots,a_{4}$. The unknown function $T$ is approximated by a BNN $\tilde{T}$, and the details of $\tilde{T}$ are provided in Table I.

We submitted problems with $N=3,4,5$ over planning horizons $H=4,\dots,10$. Note that this automated planning problem is a deterministic version of its original from IPPC 2011 [23].
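As a concrete instance of the global clauses (3) for this domain, the following minimal sketch (using the PySAT toolkit rather than the encoding of [21]) compiles the mutual exclusion constraint $a_{1}+a_{2}+a_{3}+a_{4}\leq 1$ into CNF for every time step.

```python
from pysat.card import CardEnc, EncType
from pysat.formula import IDPool

vpool = IDPool()
H = 4
clauses = []
for t in range(1, H + 1):
    acts = [vpool.id(('X', i, t)) for i in range(1, 5)]  # up, down, right, left at step t
    amo = CardEnc.atmost(lits=acts, bound=1, vpool=vpool, encoding=EncType.pairwise)
    clauses += amo.clauses
print(len(clauses))  # 6 pairwise clauses per time step, 24 in total
```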

Inventory Control

Inventory Control [24] is the problem of managing inventory of a product with demand cycle length $N\in\mathbb{Z}^{+}$, and is cast as an automated planning problem as follows.

  • The inventory level of the product, phase of the demand cycle and whether demand is met or not are represented by three state variables $S=\{s_{1},s_{2},s_{3}\}$ where state variables $s_{1}$ and $s_{2}$ have non-negative integer domains.

  • Ordering some fixed amount of inventory is represented by an action variable $A=\{a_{1}\}$.

  • Meeting the demand is represented by the global function as follows.

    $C(\langle s_{1},s_{2},s_{3},a_{1}\rangle)=\begin{cases}\mathit{true},&\text{if }s_{3}=\mathit{true}\\ \mathit{false},&\text{otherwise}\end{cases}$
  • The inventory, the phase of the demand cycle and meeting the demand are set to their initial values $s_{i}=V_{i}$ for all $i\in\{1,2,3\}$.

  • Meeting the final demand is represented by the goal state function as follows.

    $G(\langle s_{1},s_{2},s_{3}\rangle)=\begin{cases}\mathit{true},&\text{if }s_{3}=\mathit{true}\\ \mathit{false},&\text{otherwise}\end{cases}$
  • The objective is to minimize the total storage cost and is represented by the reward function as follows.

    $R(\langle s_{1},s_{2},s_{3},a_{1}\rangle)=c\,s_{1}$

    where $c$ denotes the unit storage cost.

  • The next inventory level, the next phase of the demand cycle and whether the next demand is met or not are represented by the state transition function $T$ that is a complex function of state and action variables $s_{1},s_{2},s_{3},a_{1}$. The unknown function $T$ is approximated by a BNN $\tilde{T}$, and the details of $\tilde{T}$ are provided in Table I.

We submitted problems with two demand cycle lengths $N\in\{2,4\}$ over planning horizons $H=5,\dots,8$. The values of the parameters are chosen as $m_{1}=4$ and $m_{2}=0$.

System Administrator

System Administrator [25, 23] is the problem of maintaining a computer network of size $N$ and is cast as an automated planning problem as follows.

  • The age of computer $i\in\{1,\dots,N\}$, and whether computer $i\in\{1,\dots,N\}$ is running or not, are represented by $2N$ state variables $S=\{s_{1},\dots,s_{2N}\}$ where state variables $s_{1},\dots,s_{N}$ have non-negative integer domains.

  • Rebooting computers $i\in\{1,\dots,N\}$ is represented by $N$ action variables $A=\{a_{1},\dots,a_{N}\}$.

  • The bound on the number of computers that can be rebooted and the requirement that all computers must be running are represented by the global function as follows.

    $C(\langle s_{1},\dots,a_{N}\rangle)=\begin{cases}\mathit{true},&\text{if }\sum_{i=1}^{N}a_{i}\leq a^{max}\text{ and }s_{i}=\mathit{true}\;\;\forall i\in\{N+1,\dots,2N\}\\ \mathit{false},&\text{otherwise}\end{cases}$

    where $a^{max}$ is the maximum number of computers that can be rebooted at a given time step.

  • The age of computer $i\in\{1,\dots,N\}$, and whether computer $i\in\{1,\dots,N\}$ is running or not, are set to their initial values $s_{i}=V_{i}$ for all $i\in\{1,\dots,2N\}$.

  • The requirement that all computers must be running in the end is represented by the goal state function as follows.

    $G(\langle s_{1},\dots,s_{2N}\rangle)=\begin{cases}\mathit{true},&\text{if }s_{i}=\mathit{true}\;\;\forall i\in\{N+1,\dots,2N\}\\ \mathit{false},&\text{otherwise}\end{cases}$
  • The objective is to minimize the total number of reboots and is represented by the reward function as follows.

    $R(\langle s_{1},\dots,s_{2N},a_{1},\dots,a_{N}\rangle)=\sum_{i=1}^{N}a_{i}$
  • The next age of computer $i\in\{1,\dots,N\}$, and whether computer $i\in\{1,\dots,N\}$ will be running or not, are represented by the state transition function $T$ that is a complex function of state and action variables $s_{1},\dots,s_{2N},a_{1},\dots,a_{N}$. The unknown function $T$ is approximated by a BNN $\tilde{T}$, and the details of $\tilde{T}$ are provided in Table I.

We submitted problems with $N\in\{4,5\}$ computers over planning horizons $H=2,3,4$. The values of the parameters are chosen as $m_{1}=3$ and $m_{2}=0$.

Cellda

Influenced by the famous video game [26], Cellda [17] is the task of an agent who must escape from a two-dimensional ($N$-by-$N$, where $N\in\mathbb{Z}^{+}$) cell through a locked door by obtaining the key without getting hit by the enemy, and is cast as an automated planning problem as follows.

  • The location of the agent, the location of the enemy, whether the key is obtained or not and whether the agent is alive or not are represented by six state variables $S=\{s_{1},\dots,s_{6}\}$ where state variables $s_{1}$ and $s_{2}$ represent the horizontal and vertical locations of the agent, state variables $s_{3}$ and $s_{4}$ represent the horizontal and vertical locations of the enemy, state variable $s_{5}$ represents whether the key is obtained or not, and state variable $s_{6}$ represents whether the agent is alive or not. State variables $s_{1}$, $s_{2}$, $s_{3}$ and $s_{4}$ have positive integer domains.

  • The intended movement of the agent is represented by four action variables $A=\{a_{1},a_{2},a_{3},a_{4}\}$ where action variables $a_{1}$, $a_{2}$, $a_{3}$ and $a_{4}$ represent whether the agent intends to move up, down, right or left, respectively.

  • Mutual exclusion on the intended movement of the agent, the boundaries of the maze and the requirement that the agent must be alive are represented by the global function as follows.

    $C(\langle s_{1},\dots,a_{4}\rangle)=\begin{cases}\mathit{true},&\text{if }a_{1}+a_{2}+a_{3}+a_{4}\leq 1\text{ and }0\leq s_{i}<N\;\;\forall i\in\{1,2\}\text{ and }s_{6}=\mathit{true}\\ \mathit{false},&\text{otherwise}\end{cases}$
  • The location of the agent, the location of the enemy, whether the key is obtained or not, and whether the agent is alive or not are set to their initial values $s_{i}=V_{i}$ for all $i\in\{1,\dots,6\}$.

  • The goal location of the agent (i.e., the location of the door), the requirement that the agent must be alive in the end and the requirement that the key must be obtained are represented by the goal state function as follows.

    $G(\langle s_{1},\dots,s_{6}\rangle)=\begin{cases}\mathit{true},&\text{if }s_{1}=V^{\prime}_{1}\text{ and }s_{2}=V^{\prime}_{2}\text{ and }s_{5}=\mathit{true}\text{ and }s_{6}=\mathit{true}\\ \mathit{false},&\text{otherwise}\end{cases}$

    where $V^{\prime}_{1}$ and $V^{\prime}_{2}$ denote the goal location of the agent (i.e., the location of the door).

  • The objective is to minimize the total number of intended movements by the agent and is represented by the reward function as follows.

    $R(\langle s_{1},\dots,a_{4}\rangle)=a_{1}+a_{2}+a_{3}+a_{4}$
  • The next location of the agent, the next location of the enemy, whether the key will be obtained or not, and whether the agent will be alive or not, are represented by the state transition function $T$ that is a complex function of state and action variables $s_{1},\dots,s_{6},a_{1},\dots,a_{4}$. The unknown function $T$ is approximated by a BNN $\tilde{T}$, and the details of $\tilde{T}$ are provided in Table I.

We submitted problems with maze size $N=4$ over planning horizons $H=8,\dots,12$ with two different enemy policies. The values of the parameters are chosen as $m_{1}=2$ and $m_{2}=0$.

References

  • [1] D. Nau, M. Ghallab, and P. Traverso, Automated Planning: Theory & Practice.   San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2004.
  • [2] H. Kautz and B. Selman, “Planning as satisfiability,” in Proceedings of the Tenth European Conference on Artificial Intelligence, ser. ECAI’92, 1992, pp. 359–363.
  • [3] J. Hoffmann and B. Nebel, “The FF planning system: Fast plan generation through heuristic search,” in Journal of Artificial Intelligence Research, vol. 14.   USA: AI Access Foundation, 2001, pp. 253–302.
  • [4] M. Helmert, “The fast downward planning system,” in Journal of Artificial Intelligence Research, vol. 26.   USA: AI Access Foundation, 2006, pp. 191–246.
  • [5] F. Pommerening, G. Röger, M. Helmert, and B. Bonet, “LP-based heuristics for cost-optimal planning,” in Proceedings of the Twenty-Fourth International Conference on Automated Planning and Scheduling, ser. ICAPS’14.   AAAI Press, 2014, pp. 226–234.
  • [6] T. O. Davies, A. R. Pearce, P. J. Stuckey, and N. Lipovetzky, “Sequencing operator counts,” in Proceedings of the Twenty-Fifth International Conference on Automated Planning and Scheduling.   AAAI Press, 2015, pp. 61–69.
  • [7] W.-M. Shen and H. A. Simon, “Rule creation and rule learning through environmental exploration,” in Proceedings of the Eleventh International Joint Conference on Artificial Intelligence, ser. IJCAI’89.   San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1989, pp. 675–680.
  • [8] Y. Gil, “Acquiring domain knowledge for planning by experimentation,” Ph.D. dissertation, Carnegie Mellon University, USA, 1992.
  • [9] S. W. Bennett and G. F. DeJong, “Real-world robotics: Learning to plan for robust execution,” in Machine Learning, vol. 23, 1996, pp. 121–161.
  • [10] S. S. Benson, “Learning action models for reactive autonomous agents,” Ph.D. dissertation, Stanford University, Stanford, CA, USA, 1997.
  • [11] B. Say, G. Wu, Y. Q. Zhou, and S. Sanner, “Nonlinear hybrid planning with deep net learned transition models and mixed-integer linear programming,” in Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, ser. IJCAI’17, 2017, pp. 750–756.
  • [12] B. Say and S. Sanner, “Planning in factored state and action spaces with learned binarized neural network transition models,” in Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, ser. IJCAI’18, 2018, pp. 4815–4821.
  • [13] B. Say, “Optimal planning with learned neural network transition models,” Ph.D. dissertation, University of Toronto, Toronto, ON, Canada, 2020.
  • [14] B. Say, S. Sanner, and S. Thiébaux, “Reward potentials for planning with learned neural network transition models,” in Proceedings of the Twenty-Fifth International Conference on Principles and Practice of Constraint Programming, T. Schiex and S. de Givry, Eds.   Cham: Springer International Publishing, 2019, pp. 674–689.
  • [15] G. Wu, B. Say, and S. Sanner, “Scalable planning with deep neural network learned transition models,” Journal of Artificial Intelligence Research, vol. 68, pp. 571–606, 2020.
  • [16] B. Say, “A unified framework for planning with learned neural network transition models,” in Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence, 2021, pp. 5016–5024.
  • [17] B. Say and S. Sanner, “Compact and efficient encodings for planning in factored state and action spaces with learned binarized neural network transition models,” Artificial Intelligence, vol. 285, p. 103291, 2020.
  • [18] B. Say, J. Devriendt, J. Nordström, and P. Stuckey, “Theoretical and experimental results for planning with learned binarized neural network transition models,” in Proceedings of the Twenty-Sixth International Conference on Principles and Practice of Constraint Programming, H. Simonis, Ed.   Cham: Springer International Publishing, 2020, pp. 917–934.
  • [19] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, “Binarized neural networks,” in Proceedings of the Thirtieth International Conference on Neural Information Processing Systems.   USA: Curran Associates Inc., 2016, pp. 4114–4122.
  • [20] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proceedings of the Thirty-Second International Conference on International Conference on Machine Learning, ser. ICML.   JMLR.org, 2015, pp. 448–456.
  • [21] I. Abío and P. Stuckey, “Encoding linear constraints into SAT,” in Principles and Practice of Constraint Programming.   Springer Int Publishing, 2014, pp. 75–91.
  • [22] R. Asín, R. Nieuwenhuis, A. Oliveras, and E. Rodríguez-Carbonell, “Cardinality networks and their applications,” in International Conference on Theory and Applications of Satisfiability Testing, 2009, pp. 167–180.
  • [23] S. Sanner and S. Yoon, “International probabilistic planning competition,” 2011.
  • [24] T. Mann and S. Mannor, “Scaling up approximate value iteration with options: Better policies with fewer iterations,” in Proceedings of the Thirty-First International Conference on Machine Learning, ser. Machine Learning Research, E. P. Xing and T. Jebara, Eds., vol. 32.   Bejing, China: PMLR, 2014, pp. 127–135.
  • [25] C. Guestrin, D. Koller, and R. Parr, “Max-norm projections for factored MDPs,” in Seventeenth International Joint Conferences on Artificial Intelligence, 2001, pp. 673–680.
  • [26] Nintendo, “The legend of zelda,” 1986.