Planning with Learned Binarized Neural Networks: Benchmarks for MaxSAT Evaluation 2021
Abstract
This document provides a brief introduction to the learned automated planning problem, where the state transition function is in the form of a binarized neural network (BNN), presents a general MaxSAT encoding for this problem, and describes the four domains, namely: Navigation, Inventory Control, System Administrator and Cellda, that are submitted as benchmarks for MaxSAT Evaluation 2021.
Index Terms:
binarized neural networks, automated planning
I Introduction
Automated planning studies the reasoning side of acting in Artificial Intelligence, and automates the selection and ordering of actions to reach desired states of the world as well as possible [1]. An automated planning problem represents the dynamics of the real world using a model, which can either be manually encoded [2, 3, 4, 5, 6] or learned from data [7, 8, 9, 10]. In this document, we focus on the latter.
Automated planning with deep neural network (DNN) learned state transition functions is a two stage data-driven framework for learning and solving automated planning problems with unknown state transition functions [11, 12, 13]. The first stage of the framework learns the unknown state transition function from data as a DNN. The second stage of the framework plans optimally with respect to the learned DNN by solving an equivalent optimization problem (e.g., a mixed-integer programming model [11, 14, 15, 16], a 0–1 integer programming model [12, 17], a weighted partial MaxSAT model [12, 17], a constraint programming model [18], or a pseudo-Boolean optimization model [18]). In this document, we focus on the second stage of the data-driven framework where the learned DNN is a binarized neural network (BNN) [19].
The remainder of this document is organized as follows. We begin with the description of the learned automated planning problem and the binarized neural network (BNN). Then we present the weighted partial MaxSAT model of the general learned automated planning problem, and conclude with the description of four learned automated planning domains, namely: Navigation, Inventory Control, System Administrator and Cellda, that are submitted as benchmarks for MaxSAT Evaluation 2021.
II Automated Planning with Learned Binarized Neural Network State Transitions
II-A Problem Definition
A fixed-horizon learned deterministic automated planning problem [11, 12, 18] is a tuple $\Pi = \langle S, A, C, \tilde{T}, R, V, G, H \rangle$, where $S = \{s_1, \dots, s_n\}$ and $A = \{a_1, \dots, a_m\}$ are sets of state and action variables for positive integers $n$ and $m$ with domains $D_{s_1}, \dots, D_{s_n}$ and $D_{a_1}, \dots, D_{a_m}$ respectively. Moreover, $C: D_{s_1} \times \dots \times D_{s_n} \times D_{a_1} \times \dots \times D_{a_m} \to \{\mathit{true}, \mathit{false}\}$ is the global function, $\tilde{T}: D_{s_1} \times \dots \times D_{s_n} \times D_{a_1} \times \dots \times D_{a_m} \to D_{s_1} \times \dots \times D_{s_n}$ is the learned state transition function, and $R: D_{s_1} \times \dots \times D_{s_n} \times D_{a_1} \times \dots \times D_{a_m} \to \mathbb{R}$ is the reward function. Finally, $V = \langle V_{s_1}, \dots, V_{s_n} \rangle$ is a tuple of constants denoting the initial values of all state variables, $G: D_{s_1} \times \dots \times D_{s_n} \to \{\mathit{true}, \mathit{false}\}$ is the goal state function, and $H \in \mathbb{Z}^{+}$ is the planning horizon.
A solution to (i.e., a plan for) $\Pi$ is a tuple of values $\bar{A}^{t} = \langle \bar{a}^{t}_{1}, \dots, \bar{a}^{t}_{m} \rangle$ for all action variables over time steps $t \in \{1, \dots, H\}$ such that $\tilde{T}(\bar{s}^{t}_{1}, \dots, \bar{s}^{t}_{n}, \bar{a}^{t}_{1}, \dots, \bar{a}^{t}_{m}) = \langle \bar{s}^{t+1}_{1}, \dots, \bar{s}^{t+1}_{n} \rangle$ and $C(\bar{s}^{t}_{1}, \dots, \bar{s}^{t}_{n}, \bar{a}^{t}_{1}, \dots, \bar{a}^{t}_{m}) = \mathit{true}$ hold for all time steps $t \in \{1, \dots, H\}$, where $\langle \bar{s}^{1}_{1}, \dots, \bar{s}^{1}_{n} \rangle = V$ and $G(\bar{s}^{H+1}_{1}, \dots, \bar{s}^{H+1}_{n}) = \mathit{true}$. It has been shown that finding a feasible solution to $\Pi$ is NP-complete [18]. An optimal solution to (i.e., an optimal plan for) $\Pi$ is a solution such that the total reward $\sum_{t=1}^{H} R(\bar{s}^{t+1}_{1}, \dots, \bar{s}^{t+1}_{n}, \bar{a}^{t}_{1}, \dots, \bar{a}^{t}_{m})$ is maximized.
We assume that the domains of action and state variables are binary unless otherwise stated¹, the functions $C$, $G$ and $R$ are known, the functions $C$ and $G$ can be equivalently represented by sets of linear constraints, the function $R$ is a linear expression, and the function $\tilde{T}$ is a learned BNN [19]. ¹When the domain of a variable is not binary (e.g., see Inventory Control), it can be approximated over a bounded integer range using a set of binary variables.
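For illustration only (the names and structure below are ours, not taken from the benchmark repository), the fixed-horizon semantics above can be summarized as a short Python sketch that checks the feasibility of a candidate plan and accumulates its total reward:

```python
# Illustrative sketch of the fixed-horizon plan semantics; the callables stand in
# for the global function C, learned transition function, reward R and goal G.
from dataclasses import dataclass
from typing import Callable, List, Tuple

State = Tuple[bool, ...]   # values of the n state variables
Action = Tuple[bool, ...]  # values of the m action variables

@dataclass
class PlanningProblem:
    C: Callable[[State, Action], bool]    # global function
    T: Callable[[State, Action], State]   # learned state transition function
    R: Callable[[State, Action], float]   # reward of the next state and the action
    V: State                              # initial values of all state variables
    G: Callable[[State], bool]            # goal state function
    H: int                                # planning horizon

def evaluate_plan(p: PlanningProblem, plan: List[Action]) -> float:
    """Return the total reward of a feasible plan; raise if the plan is infeasible."""
    assert len(plan) == p.H
    state, total = p.V, 0.0
    for action in plan:
        if not p.C(state, action):        # the global function must hold at every step
            raise ValueError("global constraint violated")
        state_next = p.T(state, action)   # follow the learned transition function
        total += p.R(state_next, action)
        state = state_next
    if not p.G(state):                    # the goal must hold at the end of the horizon
        raise ValueError("goal not reached")
    return total
```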
II-B Binarized Neural Networks
Binarized neural networks (BNNs) are DNNs with binarized weights and activation functions [19]. Given $L$ layers with layer width $W_{l}$ of layer $l \in \{1, \dots, L\}$, and a set of neurons $J(l) = \{u_{1,l}, \dots, u_{W_{l},l}\}$ per layer, the BNN is stacked in the following order.
Input Layer
The first layer consists of neurons $u_{i,1} \in J(1)$ that represent the domain of the learned state transition function $\tilde{T}$, where neurons $u_{1,1}, \dots, u_{n,1}$ represent the state variables $S$ and neurons $u_{n+1,1}, \dots, u_{n+m,1}$ represent the action variables $A$. During the training of the BNN, the values $\mathit{false}$ and $\mathit{true}$ of action and state variables are represented by $-1$ and $1$, respectively.
Batch Normalization Layers
For layers $l \in \{2, \dots, L\}$, Batch Normalization [20] sets the weighted sum of outputs at layer $l-1$, $\Delta_{j,l} = \sum_{i=1}^{W_{l-1}} w_{i,j,l}\, y_{i,l-1}$, to the input $x_{j,l}$ of neuron $u_{j,l} \in J(l)$ using the formula $x_{j,l} = \frac{\Delta_{j,l} - \mu_{j,l}}{\sqrt{\sigma^{2}_{j,l} + \epsilon_{j,l}}}\, \gamma_{j,l} + \beta_{j,l}$, where $y_{i,l-1}$ is the output of neuron $u_{i,l-1}$, and the parameters are the weight $w_{i,j,l}$, input mean $\mu_{j,l}$, input variance $\sigma^{2}_{j,l}$, numerical stability constant $\epsilon_{j,l}$, input scaling $\gamma_{j,l}$, and input bias $\beta_{j,l}$, all computed at training time.
Activation Layers
Given its input $x_{j,l}$, the activation function computes the output $y_{j,l}$ of neuron $u_{j,l}$ at layer $l$, which is $1$ if $x_{j,l} \geq 0$ and $-1$ otherwise. The last activation layer consists of neurons $u_{i,L} \in J(L)$ that represent the codomain of the learned state transition function $\tilde{T}$, such that neurons $u_{1,L}, \dots, u_{n,L}$ represent the state variables $S$.
The BNN is trained to learn the function $\tilde{T}$ from data that consists of measurements on the domain and codomain of the unknown state transition function $T$.
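As a concrete (and purely illustrative) view of the layer stacking described above, the following NumPy sketch performs one forward pass through a BNN with binarized weights, batch normalization and the sign activation; the dictionary keys and shapes are our own assumptions rather than the format of the trained benchmark networks.

```python
# Minimal sketch of a BNN forward pass: binarized weights, batch normalization
# with the learned parameters, and a sign activation producing values in {-1, +1}.
import numpy as np

def bnn_forward(y, layers):
    """y: input vector in {-1, +1}^(n+m); layers: list of per-layer parameter dicts."""
    for p in layers:
        delta = p["W"].T @ y                                  # weighted sum; W entries in {-1, +1}
        x = (delta - p["mu"]) / np.sqrt(p["var"] + p["eps"])  # batch normalization
        x = p["gamma"] * x + p["beta"]
        y = np.where(x >= 0.0, 1.0, -1.0)                     # binarized activation
    return y  # the first n entries encode the predicted next state
```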
III The Weighted Partial MaxSAT Model
In this section, we present the weighted partial MaxSAT model [12, 17] of the learned automated planning problem.
III-A Decision Variables
The weighted partial MaxSAT model uses the following decision variables:
• $X_{a,t}$ encodes whether action variable $a \in A$ is executed at time step $t \in \{1, \dots, H\}$ or not.
• $Y_{s,t}$ encodes whether state variable $s \in S$ is true at time step $t \in \{1, \dots, H+1\}$ or not.
• $Z_{j,l,t}$ encodes whether neuron $u_{j,l} \in J(l)$ in layer $l \in \{1, \dots, L\}$ is active at time step $t \in \{1, \dots, H\}$ or not.
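These three families of decision variables must be mapped to consecutive CNF variable identifiers before any clauses can be written. One possible indexing scheme is sketched below; the function and parameter names are ours, not the benchmark generator's.

```python
# Sketch of a variable map from the three families of decision variables
# (actions X, states Y, BNN neurons Z) to consecutive DIMACS variable ids.
from itertools import count

def build_varmap(n_states, n_actions, widths, H):
    """widths[l] is the number of neurons in BNN layer l (0-indexed)."""
    ids = count(1)  # DIMACS variable ids start at 1
    X = {(a, t): next(ids) for t in range(1, H + 1) for a in range(n_actions)}
    Y = {(s, t): next(ids) for t in range(1, H + 2) for s in range(n_states)}
    Z = {(j, l, t): next(ids)
         for t in range(1, H + 1)
         for l, width in enumerate(widths)
         for j in range(width)}
    return X, Y, Z
```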
III-B Parameters
The weighted partial MaxSAT model uses the following parameters:
• $w_{i,j,l} \in \{-1, 1\}$ is the value of the learned BNN weight between neuron $u_{i,l-1}$ and neuron $u_{j,l}$ in layer $l$.
• $B_{j,l}$ is the value of the bias for neuron $u_{j,l}$ in layer $l$. Given the values of the learned parameters $\mu_{j,l}$, $\sigma^{2}_{j,l}$, $\epsilon_{j,l}$, $\gamma_{j,l}$ and $\beta_{j,l}$, the bias is obtained by folding the batch normalization of neuron $u_{j,l}$ into a single threshold on the weighted sum of its inputs (a derivation is sketched after this list).
• $r_{a}$ for $a \in A$ and $r_{s}$ for $s \in S$ are the constants of the reward function $R$, which is in the form of the linear expression $\sum_{a \in A} r_{a} X_{a,t} + \sum_{s \in S} r_{s} Y_{s,t+1}$ at each time step $t \in \{1, \dots, H\}$.
• The coefficients and right-hand sides of the set of linear constraints that represent the global function $C$, where each linear constraint is a linear inequality over the decision variables $X_{a,t}$ and $Y_{s,t}$ with constant coefficients and a constant right-hand side.
• The coefficients and right-hand sides of the set of linear constraints that represent the goal state function $G$, where each linear constraint is a linear inequality over the decision variables $Y_{s,H+1}$ with constant coefficients and a constant right-hand side.
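To see why a single bias constant suffices, note that (assuming $\gamma_{j,l} > 0$; the inequality flips for $\gamma_{j,l} < 0$) the activation condition $x_{j,l} \geq 0$ of neuron $u_{j,l}$ reduces to a threshold on the weighted sum of its inputs:

$\gamma_{j,l}\, \frac{\sum_{i=1}^{W_{l-1}} w_{i,j,l}\, y_{i,l-1} - \mu_{j,l}}{\sqrt{\sigma^{2}_{j,l} + \epsilon_{j,l}}} + \beta_{j,l} \;\geq\; 0 \quad \Longleftrightarrow \quad \sum_{i=1}^{W_{l-1}} w_{i,j,l}\, y_{i,l-1} \;\geq\; \mu_{j,l} - \frac{\beta_{j,l}\, \sqrt{\sigma^{2}_{j,l} + \epsilon_{j,l}}}{\gamma_{j,l}},$

and the right-hand side of the latter inequality can serve as the bias $B_{j,l}$ (the benchmark encoding may additionally round this threshold so that it is integer-valued).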
III-C Hard Clauses
The weighted partial MaxSAT model uses the following hard clauses.
Initial State Clauses
The following conjunction of hard clauses sets each state variable $s \in S$ to its initial value $V_{s}$ at the first time step.
$\bigwedge_{s \in S: V_{s} = \mathit{true}} Y_{s,1} \;\wedge\; \bigwedge_{s \in S: V_{s} = \mathit{false}} \neg Y_{s,1}$   (1)
Goal State Clauses
The following conjunction of hard clauses encodes the set of linear constraints that represent the goal state function $G$ over the state variables at time step $H+1$.
$\bigwedge_{c \in G} \mathit{CNF}(c)$   (2)
In the above notation, $\mathit{CNF}(c)$ produces the CNF encoding of a given linear constraint $c$ (here, a linear constraint over the decision variables $Y_{s,H+1}$) [21].
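Reference [21] describes several ways of compiling linear constraints into CNF. Purely as an illustration (and not necessarily the encoding used to generate the benchmarks), the sketch below emits the clauses of the standard sequential-counter encoding for the special case of an at-most-$k$ cardinality constraint:

```python
# Illustration only: Sinz's sequential-counter CNF encoding of an at-most-k
# cardinality constraint over the literals in lits (positive DIMACS ids).
def at_most_k(lits, k, next_var):
    """Return (clauses, next_var), where next_var is the first unused variable id."""
    n = len(lits)
    if k >= n:
        return [], next_var                      # trivially satisfied
    if k == 0:
        return [[-x] for x in lits], next_var    # every literal must be false
    # s[i][j] is true when at least j+1 of the first i+1 literals are true
    s = [[next_var + i * k + j for j in range(k)] for i in range(n - 1)]
    next_var += (n - 1) * k
    clauses = [[-lits[0], s[0][0]]]
    clauses += [[-s[0][j]] for j in range(1, k)]
    for i in range(1, n - 1):
        clauses.append([-lits[i], s[i][0]])
        clauses.append([-s[i - 1][0], s[i][0]])
        for j in range(1, k):
            clauses.append([-lits[i], -s[i - 1][j - 1], s[i][j]])
            clauses.append([-s[i - 1][j], s[i][j]])
        clauses.append([-lits[i], -s[i - 1][k - 1]])
    clauses.append([-lits[n - 1], -s[n - 2][k - 1]])
    return clauses, next_var
```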
Global Clauses
The following conjunction of hard clauses encodes the set of linear constraints that represent the global function $C$ at every time step, where each linear constraint $c \in C$ is instantiated over the decision variables $X_{a,t}$ and $Y_{s,t}$ of time step $t$.
$\bigwedge_{t=1}^{H} \bigwedge_{c \in C} \mathit{CNF}(c)$   (3)
BNN Clauses
The following conjunction of hard clauses maps the input and the output of the BNN onto the state and action variables.
$\bigwedge_{t=1}^{H} \bigwedge_{i=1}^{n} \left( Y_{s_{i},t} \leftrightarrow Z_{i,1,t} \right)$   (4)
$\bigwedge_{t=1}^{H} \bigwedge_{j=1}^{m} \left( X_{a_{j},t} \leftrightarrow Z_{n+j,1,t} \right)$   (5)
$\bigwedge_{t=1}^{H} \bigwedge_{i=1}^{n} \left( Y_{s_{i},t+1} \leftrightarrow Z_{i,L,t} \right)$   (6)
Finally, the following conjunction of hard clauses encodes the activation function of each neuron in the learned BNN: for every layer $l \in \{2, \dots, L\}$, neuron $u_{j,l} \in J(l)$ and time step $t \in \{1, \dots, H\}$, the decision variable $Z_{j,l,t}$ is constrained to be true if and only if the weighted sum of the binarized outputs of layer $l-1$ is at least the bias $B_{j,l}$; this threshold condition is compiled into CNF using cardinality networks [22].
III-D Soft Clauses
The weighted partial MaxSAT model uses the following soft clauses.
Reward Clauses
The following conjunction of soft clauses encodes the reward function $R$. For each time step $t \in \{1, \dots, H\}$, each action variable $a \in A$ contributes the unit soft clause $(X_{a,t})$ with weight $r_{a}$ if $r_{a} > 0$ and $(\neg X_{a,t})$ with weight $-r_{a}$ if $r_{a} < 0$, and each state variable $s \in S$ analogously contributes the unit soft clause $(Y_{s,t+1})$ or $(\neg Y_{s,t+1})$ with weight $|r_{s}|$, so that maximizing the total weight of satisfied soft clauses maximizes the total reward.
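To make the overall output concrete, the following sketch (an illustrative file writer, not the benchmark generator itself) shows how the hard clauses of Section III-C and the weighted soft clauses above can be combined in the DIMACS WCNF format, where hard clauses carry the reserved "top" weight:

```python
# Illustrative writer for the (pre-2022) DIMACS WCNF format: hard clauses carry
# the reserved "top" weight and each soft clause carries its own positive weight.
def write_wcnf(path, n_vars, hard, soft):
    """hard: list of clauses; soft: list of (weight, clause) pairs; clauses are lists of ints."""
    top = sum(weight for weight, _ in soft) + 1   # any weight larger than the soft total
    with open(path, "w") as f:
        f.write(f"p wcnf {n_vars} {len(hard) + len(soft)} {top}\n")
        for clause in hard:
            f.write(f"{top} " + " ".join(map(str, clause)) + " 0\n")
        for weight, clause in soft:
            f.write(f"{weight} " + " ".join(map(str, clause)) + " 0\n")
```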
IV Benchmark Domain Descriptions
In this section, we provide detailed descriptions of four learned automated planning problems, namely: Navigation [23], Inventory Control [24], System Administrator [25] and Cellda [17].² ²The repository: https://github.com/saybuser/FD-SAT-Plan
TABLE I: Structures of the learned BNN transition functions (layer widths from input to output).

| Problem | BNN Structure |
|---|---|
| Discrete Navigation (3-by-3 maze) | 13:36:36:9 |
| Discrete Navigation (4-by-4 maze) | 20:96:96:16 |
| Discrete Navigation (5-by-5 maze) | 29:128:128:25 |
| Inventory Control (demand cycle length 2) | 7:96:96:5 |
| Inventory Control (demand cycle length 4) | 8:128:128:5 |
| System Administrator (4 computers) | 16:128:128:12 |
| System Administrator (5 computers) | 20:128:128:128:15 |
| Cellda (policy = x-axis) | 12:256:256:4 |
| Cellda (policy = y-axis) | 12:256:256:4 |
Navigation
The Navigation [23] task for an agent in a two-dimensional ($N$-by-$N$, where $N$ is a positive integer) maze is cast as an automated planning problem as follows.
• The location of the agent is represented by one state variable per maze position, where the state variable for position $\langle x, y \rangle$ represents whether the agent is located at that position or not.
• The intended movement of the agent is represented by four action variables that represent whether the agent attempts to move up, down, right or left, respectively.
• Mutual exclusion on the intended movement of the agent (i.e., at most one of the four movement actions may be executed at each time step) is represented by the global function $C$; see the clause-level sketch after this description.
• The initial values $V$ set the initial location of the agent, i.e., the state variable of the initial position is true and the state variables of all other positions are false.
• The final location of the agent is represented by the goal state function $G$, which requires the state variable of a position to be true at the end of the planning horizon if and only if that position is the goal location of the agent.
• The objective is to minimize the total number of intended movements by the agent and is represented by the reward function $R$, which assigns a negative unit reward to each executed movement action.
• The next location of the agent is represented by the state transition function $T$, which is a complex function of the state and action variables. The unknown function $T$ is approximated by a BNN $\tilde{T}$, and the details of $\tilde{T}$ are provided in Table I.
We submitted problems with three maze sizes (see Table I) over a range of planning horizons $H$. Note that this automated planning problem is a deterministic version of its original from IPPC 2011 [23].
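As an example of how such a global constraint surfaces at the clause level, the sketch below (with hypothetical variable identifiers) adds the pairwise at-most-one clauses over the four movement actions at every time step:

```python
# Illustration with hypothetical variable ids: pairwise at-most-one clauses over
# the four Navigation movement actions, one group of clauses per time step.
from itertools import combinations

def movement_mutex_clauses(X, H):
    """X[(a, t)] is the CNF variable id of movement action a at time step t."""
    clauses = []
    for t in range(1, H + 1):
        step_vars = [X[(a, t)] for a in ("up", "down", "right", "left")]
        for u, v in combinations(step_vars, 2):
            clauses.append([-u, -v])   # the two actions cannot both be executed
    return clauses
```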
Inventory Control
Inventory Control [24] is the problem of managing the inventory of a product with a demand cycle of fixed length, and is cast as an automated planning problem as follows.
• The inventory level of the product, the phase of the demand cycle and whether the demand is met or not are represented by three state variables, where the inventory level and the phase of the demand cycle have non-negative integer domains.
• Ordering some fixed amount of inventory is represented by a single action variable.
• The requirement that the demand is met is represented by the global function $C$.
• The inventory level, the phase of the demand cycle and whether the demand is met or not are set to their initial values by $V$.
• Meeting the final demand is represented by the goal state function $G$.
• The objective is to minimize the total storage cost and is represented by the reward function $R$, which charges a constant unit storage cost for the inventory held at each time step.
• The next inventory level, the next phase of the demand cycle and whether the next demand is met or not are represented by the state transition function $T$, which is a complex function of the state and action variables. The unknown function $T$ is approximated by a BNN $\tilde{T}$, and the details of $\tilde{T}$ are provided in Table I.
We submitted problems with two demand cycle lengths (see Table I) over a range of planning horizons $H$. The values of the remaining problem parameters are fixed per instance.
System Administrator
System Administrator [25, 23] is the problem of maintaining a computer network of a given size and is cast as an automated planning problem as follows.
• The age of each computer, and whether each computer is running or not, are represented by state variables, where the age variables have non-negative integer domains.
• Rebooting each computer is represented by an action variable.
• The bound on the number of computers that can be rebooted and the requirement that all computers must be running are represented by the global function $C$, where the maximum number of computers that can be rebooted at a given time step is a constant parameter.
• The age of each computer, and whether each computer is running or not, are set to their initial values by $V$.
• The requirement that all computers must be running at the end of the planning horizon is represented by the goal state function $G$.
• The objective is to minimize the total number of reboots and is represented by the reward function $R$.
• The next age of each computer, and whether each computer will be running or not, are represented by the state transition function $T$, which is a complex function of the state and action variables. The unknown function $T$ is approximated by a BNN $\tilde{T}$, and the details of $\tilde{T}$ are provided in Table I.
We submitted problems with two network sizes (see Table I) over a range of planning horizons $H$. The values of the remaining problem parameters are fixed per instance.
Cellda
Influenced by the famous video game [26], Cellda [17] is the task of an agent who must escape from a two-dimensional cell through a locked door by obtaining the key without getting hit by the enemy, and is cast as an automated planning problem as follows.
• The location of the agent, the location of the enemy, whether the key is obtained or not and whether the agent is alive or not are represented by six state variables, where two state variables represent the horizontal and vertical locations of the agent, two state variables represent the horizontal and vertical locations of the enemy, one state variable represents whether the key is obtained or not, and one state variable represents whether the agent is alive or not. The four location variables have positive integer domains.
• The intended movement of the agent is represented by four action variables that represent whether the agent intends to move up, down, right or left, respectively.
• Mutual exclusion on the intended movement of the agent, the boundaries of the cell and the requirement that the agent must be alive are represented by the global function $C$.
• The location of the agent, the location of the enemy, whether the key is obtained or not, and whether the agent is alive or not are set to their initial values by $V$.
• The goal location of the agent (i.e., the location of the door), the requirement that the agent must be alive at the end of the planning horizon and the requirement that the key must be obtained are represented by the goal state function $G$.
• The objective is to minimize the total number of intended movements by the agent and is represented by the reward function $R$.
• The next location of the agent, the next location of the enemy, whether the key will be obtained or not, and whether the agent will be alive or not are represented by the state transition function $T$, which is a complex function of the state and action variables. The unknown function $T$ is approximated by a BNN $\tilde{T}$, and the details of $\tilde{T}$ are provided in Table I.
We submitted problems with a fixed cell size over a range of planning horizons $H$ and with two different enemy policies (see Table I). The values of the remaining problem parameters are fixed per instance.
References
- [1] D. Nau, M. Ghallab, and P. Traverso, Automated Planning: Theory & Practice. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2004.
- [2] H. Kautz and B. Selman, “Planning as satisfiability,” in Proceedings of the Tenth European Conference on Artificial Intelligence, ser. ECAI’92, 1992, pp. 359–363.
- [3] J. Hoffmann and B. Nebel, “The FF planning system: Fast plan generation through heuristic search,” in Journal of Artificial Intelligence Research, vol. 14. USA: AI Access Foundation, 2001, pp. 253–302.
- [4] M. Helmert, “The fast downward planning system,” in Journal of Artificial Intelligence Research, vol. 26. USA: AI Access Foundation, 2006, pp. 191–246.
- [5] F. Pommerening, G. Röger, M. Helmert, and B. Bonet, “LP-based heuristics for cost-optimal planning,” in Proceedings of the Twenty-Fourth International Conference on Automated Planning and Scheduling, ser. ICAPS’14. AAAI Press, 2014, pp. 226–234.
- [6] T. O. Davies, A. R. Pearce, P. J. Stuckey, and N. Lipovetzky, “Sequencing operator counts,” in Proceedings of the Twenty-Fifth International Conference on Automated Planning and Scheduling. AAAI Press, 2015, pp. 61–69.
- [7] W.-M. Shen and H. A. Simon, “Rule creation and rule learning through environmental exploration,” in Proceedings of the Eleventh International Joint Conference on Artificial Intelligence, ser. IJCAI’89. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1989, pp. 675–680.
- [8] Y. Gil, “Acquiring domain knowledge for planning by experimentation,” Ph.D. dissertation, Carnegie Mellon University, USA, 1992.
- [9] S. W. Bennett and G. F. DeJong, “Real-world robotics: Learning to plan for robust execution,” in Machine Learning, vol. 23, 1996, pp. 121–161.
- [10] S. S. Benson, “Learning action models for reactive autonomous agents,” Ph.D. dissertation, Stanford University, Stanford, CA, USA, 1997.
- [11] B. Say, G. Wu, Y. Q. Zhou, and S. Sanner, “Nonlinear hybrid planning with deep net learned transition models and mixed-integer linear programming,” in Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, ser. IJCAI’17, 2017, pp. 750–756.
- [12] B. Say and S. Sanner, “Planning in factored state and action spaces with learned binarized neural network transition models,” in Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, ser. IJCAI’18, 2018, pp. 4815–4821.
- [13] B. Say, “Optimal planning with learned neural network transition models,” Ph.D. dissertation, University of Toronto, Toronto, ON, Canada, 2020.
- [14] B. Say, S. Sanner, and S. Thiébaux, “Reward potentials for planning with learned neural network transition models,” in Proceedings of the Twenty-Fifth International Conference on Principles and Practice of Constraint Programming, T. Schiex and S. de Givry, Eds. Cham: Springer International Publishing, 2019, pp. 674–689.
- [15] G. Wu, B. Say, and S. Sanner, “Scalable planning with deep neural network learned transition models,” Journal of Artificial Intelligence Research, vol. 68, pp. 571–606, 2020.
- [16] B. Say, “A unified framework for planning with learned neural network transition models,” in Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence, 2021, pp. 5016–5024.
- [17] B. Say and S. Sanner, “Compact and efficient encodings for planning in factored state and action spaces with learned binarized neural network transition models,” Artificial Intelligence, vol. 285, p. 103291, 2020.
- [18] B. Say, J. Devriendt, J. Nordström, and P. Stuckey, “Theoretical and experimental results for planning with learned binarized neural network transition models,” in Proceedings of the Twenty-Sixth International Conference on Principles and Practice of Constraint Programming, H. Simonis, Ed. Cham: Springer International Publishing, 2020, pp. 917–934.
- [19] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, “Binarized neural networks,” in Proceedings of the Thirtieth International Conference on Neural Information Processing Systems. USA: Curran Associates Inc., 2016, pp. 4114–4122.
- [20] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proceedings of the Thirty-Second International Conference on International Conference on Machine Learning, ser. ICML. JMLR.org, 2015, pp. 448–456.
- [21] I. Abío and P. Stuckey, “Encoding linear constraints into SAT,” in Principles and Practice of Constraint Programming. Springer Int Publishing, 2014, pp. 75–91.
- [22] R. Asín, R. Nieuwenhuis, A. Oliveras, and E. Rodríguez-Carbonell, “Cardinality networks and their applications,” in International Conference on Theory and Applications of Satisfiability Testing, 2009, pp. 167–180.
- [23] S. Sanner and S. Yoon, “International probabilistic planning competition,” 2011.
- [24] T. Mann and S. Mannor, “Scaling up approximate value iteration with options: Better policies with fewer iterations,” in Proceedings of the Thirty-First International Conference on Machine Learning, ser. Machine Learning Research, E. P. Xing and T. Jebara, Eds., vol. 32. Bejing, China: PMLR, 2014, pp. 127–135.
- [25] C. Guestrin, D. Koller, and R. Parr, “Max-norm projections for factored MDPs,” in Seventeenth International Joint Conferences on Artificial Intelligence, 2001, pp. 673–680.
- [26] Nintendo, “The legend of zelda,” 1986.