Reinforcement Learning for Task Specifications with Action-Constraints
Abstract
In this paper, we use concepts from supervisory control theory of discrete event systems to propose a method to learn optimal control policies for a finite-state Markov Decision Process (MDP) in which (only) certain sequences of actions are deemed unsafe (respectively safe). We assume that the set of action sequences that are deemed unsafe and/or safe is given in terms of a finite-state automaton, and we propose a supervisor that disables a subset of actions at every state of the MDP so that the constraints on action sequences are satisfied. We then present a version of the Q-learning algorithm for learning optimal policies in the presence of non-Markovian action-sequence and state constraints, where we use the development of reward machines to handle the state constraints. We illustrate the method using an example that captures the utility of automata-based methods for non-Markovian state and action specifications for reinforcement learning and show the results of simulations in this setting.
I Introduction
In reinforcement learning (RL) [1], an agent learns its optimal behaviour from experience gained through interaction with the environment, guided by the rewards that its actions fetch at each state. Safe reinforcement learning is a paradigm wherein the idea is to not violate a set of constraints during the process of learning [2]. In many applications, the constraints may be given in terms of sequences of actions. For example, consider an agent that can move in the four cardinal directions, traversing a grid-like warehouse environment. In this case, the safety requirement corresponding to one-way traffic can be specified as never taking two consecutive right turns, which would result in a U-turn. In several scheduling applications, if scheduling an item is interpreted as an action, then constraints on the sequence of actions arise naturally. For example, in real-time scheduling of home appliances, the dryer cannot be scheduled before the washer [3, 4]. In this paper, we are interested in a method for obtaining optimal control policies for such action-constrained scenarios.
Oftentimes, the system states and constraints are affected by uncontrollable events. For example, some actuators of a robot can break down due to faults. The task specification for an agent should be robust to such uncontrollable events. The subject of uncontrollable events altering the state of the system is well studied in the theory of supervisory control of discrete event systems [5, 6]. A Discrete Event System (DES) is a discrete-state system, where the state changes at discrete time instants due to the occurrence of events. If we associate a (not necessarily unique) label to each event of the DES, then its behaviour can be described by the language it generates using the alphabet defined by the set of labels. In such a setting, the desired behaviour of the DES is also specified by a language specification. A supervisory policy enforces a desired language specification by disabling a subset of events known as the controllable events. The complementary set of events that cannot be disabled by the supervisor are called uncontrollable events. It is important to note that the supervisory policy cannot force an event to occur; it can only disable a controllable event. As such, a supervisory policy aims for minimum intervention in the operation of the system, acting only when a particular event can lead to the violation of a specification. In this paper, we interpret the uncontrollable events as (uncontrollable) actions, which can then be used to specify an overall action-constraint for the system by taking the possibility of uncontrollable events into account. We limit our analysis to those action-constraints that can be represented by a finite automaton.
The environment in the Reinforcement Learning (RL) paradigm is typically modeled as a Markov Decision Process (MDP). If all the events of a Discrete Event System (DES) are controllable, then there are parallels between supervisory control of a DES and the theory of synthesizing policies for an MDP. Fundamentally, both cases involve solving a sequential decision-making problem in which the state evolution in the model is Markovian in nature. On the other hand, a key difference between the two is that the action space of an MDP does not, generally, consider actions to be uncontrollable. This is because even if such aspects do exist in the system, they can be absorbed in the MDP model by introducing a probability distribution for the next state that accounts for the possibility of occurrence of an uncontrollable action. Such an approach works because the objective in the MDP setting is to find an optimal policy that minimizes an expected cost. However, if uncontrollable actions appear in the safety specification, then they cannot be abstracted by a probability distribution anymore and must appear explicitly in the MDP model. In such scenarios, the ‘safety’ part of the safe RL paradigm naturally invites the results from supervisory control theory. Our main contributions in this paper are the following:
-
1.
To the best of our knowledge, this paper is the first work that bridges the areas of supervisory control theory (SCT) and reinforcement learning. Such an association permits us to directly use an important theoretical result from SCT to determine, offline and before training, whether there exists a policy that satisfies the action-constraints in the presence of uncontrollable actions. If no such policy exists, then another important result in SCT, informally, characterizes the closest behaviour to the given constraints that a supervisory policy can enforce. These results are relevant in safe reinforcement learning for complex systems in which correctly formulating the constraints is a task in itself.
-
2.
We present a version of the Q-learning algorithm to learn optimal control policies for an underlying finite-state MDP in which:
-
(a)
a subset of actions might be uncontrollable, and
-
(b)
the task specification or the safety constraints for the system are given in terms of sequences of actions modeled by a finite state automaton. The constraints can be given in terms of sequences that are safe and/or unsafe. This generality simplifies the problem formulation to some extent.
The adapted Q-learning algorithm also integrates the work on reward machines [7], which supports the specification of the reward as a function of state.
-
3.
Experimental results for an illustrative example that captures the utility of automata-based methods for non-Markovian state and action specifications for reinforcement learning are presented.
The rest of the paper is organized as follows. Section II presents some notations and definitions that we will be using in the rest of the paper. It also presents the relevant results from supervisory control theory. Section III discusses supervisor synthesis. We will largely be working with finite automata in this section and use the development for supervisor synthesis to propose a version of the Q-learning algorithm [8] that can be used to learn an optimal policy in the presence of state and action constraints. In Section IV, we present an example to illustrate the use of automata-based methods to handle non-Markovian state and action-constraints. We use the proposed method for the action-constraints and reward machines [7] for the state constraints. We show the results of the experiments and observe that the Q-learning algorithm adapted to this setting helps the agent learn to complete the task optimally 100% of the time. We conclude the paper with Section V.
II Motivation, Notations and Definitions
In a reinforcement learning problem, the environment is modeled as a Markov Decision Process $\mathcal{M} = \langle S, A, R, P, \gamma \rangle$, where $S$ and $A$ denote the finite sets of states and actions, respectively, $R$ denotes the reward function, $P$ is the transition probability distribution and $\gamma$ is the discount factor. In the paradigm of safe reinforcement learning, the objective is to learn optimal policies while satisfying a set of constraints. In this paper, we consider safety constraints that are given as a set of sequences of actions of the MDP. Next, we discuss the example gridworld in Fig. 1 to motivate the development.
Consider a manufacturing facility layout as shown in Fig. 1. When a work piece appears at the input location $I$, robot $R_1$ picks it up and delivers it to the buffer zone $B$. Robot $R_2$ picks the item from $B$ and delivers it to the processing facility $O$. The buffer can only store one item at a time; so it has two states, Empty ($E$) and Full ($F$). The state of each robot is determined by the position that it occupies in the grid and by a binary variable indicating whether it has picked up an item or not. There are six controllable actions for each robot $R_i$, $i \in \{1, 2\}$: four of them correspond to its movement in the grid and two correspond to picking and dropping of an item ($p_i$ and $d_i$). Additionally, we assume that robot $R_i$ can either be Working ($W$) or Down ($D$). There are two uncontrollable actions, $f_i$ and $r_i$, denoting the fault and rectification actions which take $R_i$ from $W$ to $D$ and vice versa, respectively. We call them uncontrollable because they are caused by external factors/agents not modeled in this set up. We can also assume that an uncontrollable action makes an item appear at $I$. The objective is to design an optimal policy to satisfy the specifications below:
-
1.
$R_1$ delivers an item to $B$ only if $B$ is in $E$.
-
2.
$R_2$ picks an item from $B$ only if $B$ is in $F$ (thereby driving $B$ to $E$).
-
3.
$R_1$ and $R_2$ deliver a picked item in no more than $k$ steps.
The above specifications ‘look’ okay but are not robust to uncontrollable actions. To see this, consider a string of actions $s$ generated by the system. Formally, the specification in item 3 says that for a substring $s'$ of $s$ that starts with $p_i$ and ends with the corresponding $d_i$, it must be that $|s'| \leq k$, where $|s'|$ denotes the size of $s'$, and $p_i$ and $d_i$ denote picking and dropping of an item by $R_i$, respectively. Consider the scenario in which $R_1$ picks an item at $I$ when the buffer is full, with the assumption that the buffer will be emptied by $R_2$ in the next step. Such an action is consistent with item 1 of the specification. At this point, if the uncontrollable fault action $f_2$ occurs, then the satisfaction of item 3 of the specification will entirely depend on the occurrence of the uncontrollable rectification action $r_2$. As such, the specifications above are not robust to uncontrollable actions. That is, the occurrence of an uncontrollable action resulted in the generation of a string of actions that was not in the specification. In what follows, we discuss developments from supervisory control theory which facilitate a formal analysis of such aspects.
An automaton is a 5-tuple $G = \langle X, \Sigma, f, x_0, X_m \rangle$ [6]. Here $X$ and $\Sigma$ denote the set of states and (labelled) actions, respectively. The initial state is $x_0 \in X$, and $X_m \subseteq X$ is called the set of marked states. The states can be marked for different reasons and we discuss more about this later in the paper. The function $f: X \times \Sigma \to X$ is the state transition function, where for $x, x' \in X$ and $\sigma \in \Sigma$, $f(x, \sigma) = x'$ implies there is an action labeled $\sigma$ from state $x$ to state $x'$. In general, the state transition function may not be defined for all state-action pairs. The active action function $\Gamma: X \to 2^{\Sigma}$ identifies the set of all actions $\sigma$ for which $f(x, \sigma)$ is defined.
Let $\Sigma^*$ denote the set of all finite strings of elements of $\Sigma$, including the empty string $\varepsilon$. We extend $f$ to strings in the usual way and write $x_0 \xrightarrow{s} x$ if there is a string of actions $s$ from $x_0$ to state $x$, that is, if $x$ is reachable from $x_0$. Given $G$, the set of all admissible strings of actions is denoted by $\mathcal{L}(G)$ and is referred to as the language generated by $G$ over the alphabet $\Sigma$. Formally, for the initial state $x_0$:
$\mathcal{L}(G) = \{\, s \in \Sigma^* : f(x_0, s) \text{ is defined} \,\}$.   (1)
The marked language, $\mathcal{L}_m(G)$, generated by $G$ is $\mathcal{L}_m(G) = \{\, s \in \mathcal{L}(G) : f(x_0, s) \in X_m \,\}$.
A string $s_1$ is a prefix of a string $s$ if $s = s_1 s_2$ for some $s_2 \in \Sigma^*$. If $s$ is an admissible string in $\mathcal{L}(G)$, then so are all its prefixes. We define the prefix closure of a language $K \subseteq \Sigma^*$ to be the language $\overline{K}$ defined as $\overline{K} = \{\, s_1 \in \Sigma^* : s_1 s_2 \in K \text{ for some } s_2 \in \Sigma^* \,\}$.
We can use the above notation to interpret the MDP $\mathcal{M}$ as a language generator over an alphabet $\Sigma$, where $\Sigma$ denotes a set of labels and $\ell: A \to \Sigma$ denotes a labelling function that associates a label to each action. For simplicity, we assume that each action is uniquely labeled and, with some abuse of notation, use $\Sigma$ as a proxy for the set of actions. Then the function $f_{\mathcal{M}}: S \times \Sigma \to S$ is the state transition function, where for $s, s' \in S$ and $\sigma \in \Sigma$, $f_{\mathcal{M}}(s, \sigma) = s'$ means that there is an action labeled $\sigma$ such that $P(s' \mid s, \sigma) > 0$. The language $\mathcal{L}(\mathcal{M})$ generated by $\mathcal{M}$ can be defined appropriately.
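To make the preceding definitions concrete, the following Python sketch (our own illustration; none of this code appears in the paper) encodes an automaton with a partial transition function as a dictionary and tests membership of a string in $\mathcal{L}(G)$ and $\mathcal{L}_m(G)$. The state and action names are hypothetical.

```python
# Minimal sketch of an automaton G = <X, Sigma, f, x0, Xm> with a partial
# transition function, and membership tests for L(G) and Lm(G).

class Automaton:
    def __init__(self, transitions, x0, marked=frozenset()):
        self.f = dict(transitions)   # (state, action) -> next state (partial)
        self.x0 = x0
        self.Xm = set(marked)

    def active(self, x):
        """Active action function Gamma(x): actions defined at state x."""
        return {a for (q, a) in self.f if q == x}

    def run(self, string):
        """State reached by `string` from x0, or None if some transition
        along the way is undefined (i.e., the string is not in L(G))."""
        x = self.x0
        for a in string:
            if (x, a) not in self.f:
                return None
            x = self.f[(x, a)]
        return x

    def in_language(self, string):         # string in L(G)?
        return self.run(string) is not None

    def in_marked_language(self, string):  # string in Lm(G)?
        return self.run(string) in self.Xm


# Example: the string 'rr' drives this toy automaton to its marked state.
G = Automaton({("x0", "r"): "x1", ("x1", "r"): "xm",
               ("x0", "l"): "x0", ("x1", "l"): "x0"},
              x0="x0", marked={"xm"})
assert G.in_language(["r", "l", "r"]) and G.in_marked_language(["r", "r"])
```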
A policy $\pi$ in the context of RL is a probability distribution $\pi(\cdot \mid s)$ over the set of actions at a given state $s$. At each time step $t$, the agent is in a particular state, say $s_t$. It selects an action $a_t$ according to $\pi(\cdot \mid s_t)$ and moves to a next state $s_{t+1}$ with probability $P(s_{t+1} \mid s_t, a_t)$. It also receives a reward for this transition. The process then repeats from $s_{t+1}$. The agent’s goal is to find a policy that maximizes the expected discounted future reward from every state in $S$. Next we introduce the notion of supervisory control, which trims the set of candidate policies to only those that do not violate the safety specification.
The set of actions ($\Sigma$) is partitioned into the set of controllable actions ($\Sigma_c$) and uncontrollable actions ($\Sigma_{uc}$). More specifically, $\Sigma_c$ (resp. $\Sigma_{uc}$) denotes the set of actions that can (resp. cannot) be disabled by the supervisor. Formally, a supervisor $\mathcal{S}: \mathcal{L}(\mathcal{M}) \to 2^{\Sigma}$ is a function from the language generated by $\mathcal{M}$ to the power set of $\Sigma$. We use $\mathcal{S}/\mathcal{M}$ to denote the supervised MDP (SMDP) in which supervisor $\mathcal{S}$ is controlling the MDP $\mathcal{M}$. Note that the supervisor only disables an action; it does not choose which action should be taken at any time instant. That (optimal) choice is determined by the policy $\pi$. The uncontrollable actions are always enabled by the supervisor.
We are interested in evaluating an optimal policy that satisfies the constraints specified in terms of a set of sequences of actions. The first question that is of interest to us is whether such a policy exists. We can directly use the controllability condition from the supervisory control theory literature to answer this question [5, 6]:
Theorem 1
Let $K \subseteq \mathcal{L}(\mathcal{M})$ be a desired language specification over the alphabet $\Sigma$ denoting the safety constraints for the MDP $\mathcal{M}$. There is a supervisor $\mathcal{S}$ such that $\mathcal{L}(\mathcal{S}/\mathcal{M}) = \overline{K}$ if and only if $\overline{K}\,\Sigma_{uc} \cap \mathcal{L}(\mathcal{M}) \subseteq \overline{K}$.
Proof:
See Section 3.4.1 of [6]. ∎
The above condition states that there is a supervisor that enforces the desired specification if and only if the occurrence of an uncontrollable action in $\Sigma_{uc}$ after a prefix of $K$ results in a string that is also a prefix of $K$. For the gridworld example in Fig. 1, the uncontrollable action $f_2$ resulted in a string from which the system can generate a string in which item 3 of the specification is not satisfied. Therefore, it violated the controllability condition.
If the condition in Theorem 1 is not satisfied, then there does not exist a supervisor that can enforce the given specification. In such a case, the best that the supervisor can do is enforce the supremal controllable sublanguage $K^{\uparrow}$ of $K$, which is the union of all controllable sublanguages of $K$ [9]: $K^{\uparrow} = \bigcup \{\, K' \subseteq K : \overline{K'}\,\Sigma_{uc} \cap \mathcal{L}(\mathcal{M}) \subseteq \overline{K'} \,\}$.
Without going into the formal analysis, the supremal controllable sublanguage (the modified specification) for the gridworld example would involve $R_1$ not picking an item unless the buffer $B$ is empty.
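When both the plant language $\mathcal{L}(\mathcal{M})$ and the prefix-closed specification $\overline{K}$ are represented by finite automata (e.g., using the `Automaton` sketch above), the controllability condition of Theorem 1 can be checked by a traversal of their synchronous product. The following is a hedged sketch of that standard test, our own illustration rather than the paper's code.

```python
from collections import deque

def is_controllable(spec, plant, uncontrollable):
    """Check the condition of Theorem 1 over automata representations:
    from every jointly reachable pair of states, any uncontrollable action
    feasible in the plant must also be feasible in the specification.
    `spec` recognizes the prefix-closed specification K-bar and `plant`
    generates L(M); both are Automaton instances with partial transitions."""
    frontier = deque([(spec.x0, plant.x0)])
    visited = {(spec.x0, plant.x0)}
    while frontier:
        xs, xp = frontier.popleft()
        for a in plant.active(xp):
            if a in uncontrollable and a not in spec.active(xs):
                return False              # an uncontrollable action escapes K-bar
            if a in spec.active(xs):      # follow strings allowed by both
                nxt = (spec.f[(xs, a)], plant.f[(xp, a)])
                if nxt not in visited:
                    visited.add(nxt)
                    frontier.append(nxt)
    return True
```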
In this paper, we will assume, in different ways, that the given specification can be modeled as a finite automaton and that it is controllable. An automaton is represented by a directed graph (state transition diagram) whose nodes correspond to the states of the automaton, with an edge from $x$ to $x'$ labeled $\sigma$ for each triple $(x, \sigma, x')$ such that $x' = f(x, \sigma)$.
We use the concept of reward machines to handle the state constraints [7]. Given a set of propositional symbols $\mathcal{P}$, a set of (environment) states $S$, and a set of actions $A$, a Reward Machine (RM) is a tuple $\langle U, u_0, \delta_u, \delta_r \rangle$, where $U$ is a finite set of states, $u_0 \in U$ is an initial state, $\delta_u: U \times 2^{\mathcal{P}} \to U$ is the state-transition function and $\delta_r$ is the reward-transition function. An MDP with a Reward Machine (MDPRM) is a tuple $\langle S, A, P, \gamma, \mathcal{P}, L, U, u_0, \delta_u, \delta_r \rangle$, where $S$, $A$, $P$ and $\gamma$ are defined as in an MDP, $\mathcal{P}$ is a set of propositional symbols, $L: S \to 2^{\mathcal{P}}$ is a labelling function, and $U$, $u_0$, $\delta_u$ and $\delta_r$ are defined as in an RM. The operation of an MDPRM is as follows. At every decision epoch of $\mathcal{M}$, there is a transition in the RM also. If the current state of the RM is $u$ and an action $a$ taken from state $s$ in $\mathcal{M}$ leads to state $s'$, then $\delta_u(u, L(s'))$ is the next state of the RM, and the RM outputs a reward function $\delta_r(u, L(s'))$ which determines the reward in $\mathcal{M}$ when action $a$ is taken from $s$. The main idea in handling state constraints using RMs is to associate barrier rewards with state sequences that violate the desired specification. Reward machines can be used to model any Markovian reward function. They can also be used to express any non-Markovian reward function as long as the state history on which the reward depends can be distinguished by different elements of a finite set of regular expressions over $2^{\mathcal{P}}$.
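The following minimal sketch shows how a reward machine can be stepped alongside the MDP. It is our own illustration: the reward-transition function is simplified to return a scalar rather than a full reward function over $S \times A \times S$, and the dictionary layout is an assumption.

```python
class RewardMachine:
    """Sketch of an RM <U, u0, delta_u, delta_r>.  Labels are frozensets of
    propositional symbols, i.e., elements of 2^P.  Illustrative only."""
    def __init__(self, delta_u, delta_r, u0):
        self.delta_u = delta_u   # dict: (u, label) -> next RM state
        self.delta_r = delta_r   # dict: (u, label) -> reward (scalar here)
        self.u0 = u0
        self.u = u0

    def reset(self):
        self.u = self.u0

    def step(self, label):
        """Advance the RM on the label L(s') of the new environment state and
        return the reward for the underlying MDP transition."""
        reward = self.delta_r.get((self.u, label), 0.0)
        self.u = self.delta_u.get((self.u, label), self.u)
        return reward
```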
III Reinforcement Learning with Action-Constraints
Consider an MDP $\mathcal{M}$. The action-constraints can be given in terms of the sequences of actions that are safe to perform or by sequences that are unsafe. We propose methods for supervisor synthesis for both cases in this section.
Consider an automaton $G_s = \langle X_s, \Sigma, f_s, x_{0,s}, \emptyset \rangle$ that models the sequences of actions that are safe, with an empty set of marked states. We construct an automaton $\hat{G}_s$ from $G_s$ by appropriately adding a (marked) state $x_v$ to flag the occurrence of an action violating the specification. The formal construction is as follows:
-
1.
$\hat{G}_s$ has one additional state: $\hat{X}_s = X_s \cup \{x_v\}$, $x_v \notin X_s$.
-
2.
The transition function $\hat{f}_s$ for $\hat{G}_s$ is identical to that of $G_s$ for the state-action pairs for which $f_s$ is defined. For all $x \in X_s$ and $\sigma \in \Gamma_s(x)$: $\hat{f}_s(x, \sigma) = f_s(x, \sigma)$.
-
3.
For the state-action pairs for which $f_s$ is not defined, we have, for all $x \in X_s$ and $\sigma \notin \Gamma_s(x)$: $\hat{f}_s(x, \sigma) = x_v$.
-
4.
If $\hat{G}_s$ reaches $x_v$, it stays there: for all $\sigma \in \Sigma$, $\hat{f}_s(x_v, \sigma) = x_v$.
At every decision epoch of $\mathcal{M}$, there is a transition in $\hat{G}_s$ also. If the current state of $\hat{G}_s$ is $x$ and an action $\sigma$ is taken in $\mathcal{M}$, then $\hat{f}_s(x, \sigma)$ is the next state of $\hat{G}_s$. Now, if an action is taken in $\mathcal{M}$ for which $f_s$ is not defined, then it means that it violates the safety constraint. If $f_s$ is not defined for $x$ and $\sigma$, then as per item 3 in the above construction, the state of $\hat{G}_s$ will transition to $x_v$. The state $x_v$ of $\hat{G}_s$ indicates that an action constraint has been violated. A supervisor for this case will simply disable actions that lead to a transition to the state $x_v$. If there is an uncontrollable action that can result in a transition to $x_v$, then it means that the specification does not satisfy the controllability condition of Theorem 1 and there does not exist a policy that can enforce the given specification. The automaton in Fig. 4 is an example of this type, in which the modeled safe sequence is the sequence of actions that are to be performed when an uncontrollable action happens.
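A possible implementation of the construction in items 1-4 above, reusing the `Automaton` sketch from Section II, is given next; this is our own illustration and the violation-state name is hypothetical.

```python
def complete_with_violation_state(G_s, Sigma, x_v="x_v"):
    """Build G_hat_s from the safe-sequence automaton G_s: every undefined
    (state, action) pair is redirected to a new absorbing, marked state x_v
    that flags a violation of the action constraint (items 1-4 above)."""
    states = {q for (q, _) in G_s.f} | set(G_s.f.values()) | {G_s.x0}
    f_hat = dict(G_s.f)                      # item 2: keep defined transitions
    for x in states:
        for a in Sigma:
            f_hat.setdefault((x, a), x_v)    # item 3: undefined -> x_v
    for a in Sigma:
        f_hat[(x_v, a)] = x_v                # item 4: x_v is absorbing
    return Automaton(f_hat, G_s.x0, marked={x_v})
```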
Next, we consider the case in which the constraints are given in terms of the sequences of actions that are unsafe (one way of handling this situation would be to evaluate the sequences of actions that are safe and then proceed with the earlier method; we assume that the set of safe sequences can be represented by an automaton). Recall that $\mathcal{L}_m(G)$ denotes the marked language generated by an automaton $G$. We assume that the set of strings of actions that are unsafe is given as an automaton $G_u = \langle X_u, \Sigma, f_u, x_{0,u}, X_{m,u} \rangle$ for which $\mathcal{L}_m(G_u)$ is the set of unsafe strings. We then modify $G_u$ to obtain an automaton $\hat{G}_u$ as follows:
-
1.
The state set, alphabet, initial state and marked states of $\hat{G}_u$ are the same as those of $G_u$: $\hat{X}_u = X_u$, $\hat{\Sigma} = \Sigma$, $\hat{x}_{0,u} = x_{0,u}$, $\hat{X}_{m,u} = X_{m,u}$.
-
2.
For all $x \in X_u$ and $\sigma \in \Gamma_u(x)$: $\hat{f}_u(x, \sigma) = f_u(x, \sigma)$.
-
3.
For all $x \in X_u$ and $\sigma \notin \Gamma_u(x)$: $\hat{f}_u(x, \sigma) = x_{0,u}$.
That is, $\hat{G}_u$ can be constructed from $G_u$ by adding additional outgoing arcs from every state in $X_u$ to the initial state, corresponding to actions in $\Sigma$ that do not already appear as an outgoing arc in $G_u$. This modification ensures that for every action in $\Sigma$, a corresponding transition is defined in $\hat{G}_u$. The operation of $\hat{G}_u$ is the same as that of $\hat{G}_s$. At every decision epoch of $\mathcal{M}$, there is a transition in $\hat{G}_u$. If the current state of $\hat{G}_u$ is $x$ and an action $\sigma$ is taken in $\mathcal{M}$, then $\hat{f}_u(x, \sigma)$ is the next state of $\hat{G}_u$, where $\hat{f}_u$ is the transition function of $\hat{G}_u$. For simplicity, we label every marked state by $x_m$. The supervisor will disable every controllable action which results in a next state that is marked ($x_m$). Suppose $\mathcal{M}$ has generated a string $s$ that has not resulted in a transition to $x_m$ in $\hat{G}_u$; then it means that $s$ is a prefix of an admissible string. Following $s$, if there is an uncontrollable action that results in a transition to $x_m$, then it means that the specification does not satisfy the controllability condition. Fig. 3 shows an automaton which models the specification in which two consecutive lefts or rights ($ll$ or $rr$) are deemed unsafe.
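Analogously, the modification of $G_u$ and the resulting supervision rule can be sketched as follows, again on top of the `Automaton` class above; this is our own illustration, and the supervisor simply returns the set of controllable actions to disable at the current pair of automaton states.

```python
def complete_unsafe_automaton(G_u, Sigma):
    """Build G_hat_u from the unsafe-sequence automaton G_u: actions that are
    undefined at a state are redirected to the initial state (items 1-3
    above), so that every action always has a transition defined."""
    states = {q for (q, _) in G_u.f} | set(G_u.f.values()) | {G_u.x0}
    f_hat = dict(G_u.f)
    for x in states:
        for a in Sigma:
            f_hat.setdefault((x, a), G_u.x0)
    return Automaton(f_hat, G_u.x0, marked=set(G_u.Xm))


def disabled_actions(A_s, x, A_u, y, controllable):
    """Supervisor sketch: disable every controllable action whose next state
    is the violation state of A_s (= G_hat_s) or a marked state of A_u
    (= G_hat_u).  Uncontrollable actions are never disabled."""
    return {a for a in controllable
            if A_s.f[(x, a)] in A_s.Xm or A_u.f[(y, a)] in A_u.Xm}
```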
Remark: We have modeled the sequences of actions that are unsafe by “marking” some states as unsafe. The marked states have a completely different interpretation in supervisory control theory (SCT), where they are often associated with desirable features, such as signifying the end of an execution with the automaton “recognizing” the string to be in its language. In fact, in SCT, if $\overline{\mathcal{L}_m(G)} = \mathcal{L}(G)$, then we say that $G$ is non-blocking, which means that the system can always reach a target (marked) state from any state in its reachable set. Deadlock freedom is an instance of the non-blocking property. We will need some simple modifications in the method discussed above if we have to tackle both interpretations simultaneously.
Algorithm 1 shows the pseudocode for Supervised Q-learning for non-Markovian action and state constraints. It receives as input the automata $\hat{G}_s$ and $\hat{G}_u$ describing the action constraints, the set of propositional symbols $\mathcal{P}$, the labelling function $L$, the discount factor $\gamma$, and the reward machine over $2^{\mathcal{P}}$. The goal is to learn an optimal policy given the state and action constraints.
The algorithm learns one q-value function per combination of states of the automata and the reward machine. That is, it will learn $|\hat{X}_s| \cdot |X_u| \cdot |U|$ q-value functions in total, where $|\cdot|$ denotes the size of the set argument. These q-functions are stored in $Q$, and $Q_{x, y, u}$ corresponds to the q-value function for states $x$ and $y$ of the two automata, and $u$ of the reward machine.
After initializing $Q$ in step 3, the algorithm has two nested loops: one over the number of episodes being run (step 4), and the other over the length of each episode (step 6). Step 10 of the algorithm selects an action under supervision, wherein only those actions that do not lead to the violation state $x_v$ or a marked state $x_m$ in either of the two automata are considered for selection by the epsilon-greedy method. Thereafter, the agent executes the action (step 11), evaluates the reward function through the transition in the reward machine (steps 12 and 13) and then updates the q-value function (steps 15-19) using the standard q-learning rule. Note that the maximization step is over $Q_{x', y', u'}$, since those q-values would be used for the selection of the next action $a'$ after $\hat{G}_s$, $\hat{G}_u$ and the reward machine have transitioned to $x'$, $y'$ and $u'$. Action $a'$ is to be picked so that $x_v$ is not reached in $\hat{G}_s$ and $x_m$ is not reached in $\hat{G}_u$.
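Algorithm 1 itself is not reproduced here, so the following is only a hedged sketch of one episode consistent with the description above. The gym-like `env.reset()`/`env.step(a)` interface, the labelling function `L`, and the dictionary layout of `Q` are all assumptions; the reward machine and automata follow the sketches given earlier.

```python
import random
from collections import defaultdict

def supervised_q_episode(env, A_s, A_u, rm, L, Sigma, Sigma_c, Q,
                         alpha=0.1, gamma=0.9, eps=0.1, max_steps=60):
    """One episode of supervised Q-learning (sketch in the spirit of
    Algorithm 1).  A_s and A_u are the completed action-constraint automata,
    rm is the reward machine, and Q[(x, y, u)][(s, a)] stores the q-values."""

    def allowed_at(x, y):
        # Supervision: uncontrollable actions are always enabled; controllable
        # actions are kept only if they avoid the violation/marked states.
        return [a for a in Sigma
                if a not in Sigma_c
                or (A_s.f[(x, a)] not in A_s.Xm and A_u.f[(y, a)] not in A_u.Xm)]

    s, x, y, u = env.reset(), A_s.x0, A_u.x0, rm.u0
    for _ in range(max_steps):
        acts = allowed_at(x, y)
        if random.random() < eps:                      # epsilon-greedy selection
            a = random.choice(acts)
        else:
            a = max(acts, key=lambda b: Q[(x, y, u)][(s, b)])
        s2 = env.step(a)                               # environment transition
        x2, y2 = A_s.f[(x, a)], A_u.f[(y, a)]          # automata transitions
        u2 = rm.delta_u.get((u, L(s2)), u)             # reward-machine transition
        r = rm.delta_r.get((u, L(s2)), 0.0)
        # Q-learning update: maximize over the q-function indexed by the
        # *next* automaton/RM states, restricted to supervised actions.
        target = r + gamma * max((Q[(x2, y2, u2)][(s2, b)]
                                  for b in allowed_at(x2, y2)), default=0.0)
        Q[(x, y, u)][(s, a)] += alpha * (target - Q[(x, y, u)][(s, a)])
        s, x, y, u = s2, x2, y2, u2

# Q could be initialized as Q = defaultdict(lambda: defaultdict(float)).
```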
IV Illustrative Example
Setting: Consider the pick-up and delivery grid world in Fig. 2. The agent starts from the position S, picks up the items labeled 1, 2 and 3, and returns back to S. The agent unloads the items in a last-in-first-out order and the unloading sequence must be 3, 2, 1. Therefore, the nominal pick-up of the items is in sequence from item 1 to item 3. The objective of the agent is to finish this process in the minimum number of steps.
State and action space: The state of the agent is composed of its position in the grid, its orientation and an indicator of the items that the agent has picked up. The position of the agent in the grid can be denoted by the coordinates of the respective pixel it occupies, with the bottom-leftmost corner of the grid denoting the origin. We can use a set of four symbols, one per cardinal direction, to describe its orientation and a 3-bit binary variable to denote the items that the agent has picked up. The agent’s movement in the grid is governed by four actions, $l$, $r$, $f$ and $b$, denoting movement in the four directions: left, right, forward and backward, respectively. The actions $r$ and $l$ change the orientation of the agent by 90 degrees clockwise and anticlockwise, respectively. In addition, they also lead to a change in position by one pixel forward in the direction of the new orientation. For example, if $r$ and $l$ are feasible actions from a position $(x, y)$ at which the agent faces north, then $r$ takes the agent to $(x+1, y)$ facing east and $l$ takes it to $(x-1, y)$ facing west.
The actions $f$ and $b$ do not change the orientation and only result in a change of position by one pixel forward or backward, respectively, relative to the current orientation of the agent. If $f$ and $b$ are feasible actions from a position $(x, y)$ at which the agent faces north, then $f$ takes the agent to $(x, y+1)$ and $b$ takes it to $(x, y-1)$, with the orientation unchanged in both cases.
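The movement dynamics just described can be summarized by a small helper. This is our own sketch; the symbols used for the orientations are hypothetical.

```python
# Sketch of the movement dynamics: position is an (x, y) pixel, orientation
# is one of the four cardinal directions.  Illustrative only.
HEADING = {'N': (0, 1), 'E': (1, 0), 'S': (0, -1), 'W': (-1, 0)}
RIGHT_OF = {'N': 'E', 'E': 'S', 'S': 'W', 'W': 'N'}
LEFT_OF = {v: k for k, v in RIGHT_OF.items()}

def move(pos, heading, action):
    """Apply one of the movement actions l, r, f, b."""
    if action == 'r':        # turn 90 degrees clockwise, then step forward
        heading = RIGHT_OF[heading]
    elif action == 'l':      # turn 90 degrees anticlockwise, then step forward
        heading = LEFT_OF[heading]
    dx, dy = HEADING[heading]
    if action == 'b':        # step backward; heading unchanged
        dx, dy = -dx, -dy
    return (pos[0] + dx, pos[1] + dy), heading

# e.g. from (2, 3) facing 'N': move((2, 3), 'N', 'r') == ((3, 3), 'E')
```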
We use $d_1$, $d_2$ and $d_3$ ($p_1$, $p_2$ and $p_3$) for the actions denoting dropping (respectively, picking) of the respective items by the agent. We assume that the agent cannot reliably carry all three items together, because of which it can accidentally drop an item when it is carrying all three of them. For simplicity, we assume that it can accidentally drop only item 1. We model this by associating an uncontrollable action $u_1$ with this accidental drop. The illustration can easily be extended to account for the dropping of either of the other items.
Action Constraints
-
1.
The agent cannot make a U-turn. We interpret this as two consecutive rights or lefts. That is, the set $\{ll, rr\}$ denotes the illegal substrings of the string of actions. The automaton, with initial state $x_{0,u}$, in Fig. 3 describes the illegal sequences of actions corresponding to the set of strings $\{ll, rr\}$. For ease of exposition, the actions $f$ and $b$ are not shown in the automaton; they can be added with a self loop around every state. (A possible encoding of this automaton is sketched after this list.)
-
2.
In order for the agent to satisfy the constraint on the unloading sequence, if the agent accidentally drops the first item, then it must also immediately drop the other two and pick all three in sequence again. That is, $u_1$ must be followed by the string $d_3\, d_2\, p_1\, p_2\, p_3$. Fig. 4 shows the automaton that constrains the agent to take this action sequence when the uncontrollable action $u_1$ happens (for the case when the agent can drop either of the other two items, we can introduce two more uncontrollable actions $u_2$ and $u_3$ with analogous remedial strings).
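Since the figures are not reproduced here, the following is only a possible encoding of the no-U-turn automaton of Fig. 3, reusing the `Automaton` sketch from Section II; the state names are hypothetical and the figure itself may differ in detail.

```python
# Possible encoding of the no-U-turn constraint (cf. Fig. 3): two rights (or
# two lefts) with no opposite turn in between drive the automaton to its
# marked state; f and b are added as self-loops, as noted in the text.
transitions = {('x0', 'l'): 'xl', ('x0', 'r'): 'xr',
               ('xl', 'r'): 'xr', ('xr', 'l'): 'xl',
               ('xl', 'l'): 'x_bad', ('xr', 'r'): 'x_bad'}
for state in ('x0', 'xl', 'xr'):
    for a in ('f', 'b'):                 # self-loops for the other actions
        transitions[(state, a)] = state

A_uturn = Automaton(transitions, x0='x0', marked={'x_bad'})
assert A_uturn.in_marked_language(['l', 'f', 'r', 'r'])   # two rights, no left between
```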
Fig. 5 shows the reward machine for imposing the specification of sequential pickup of the items. The labeling function generates a 3-bit binary number corresponding to the items that the agent has picked up; it also uses the symbol S to indicate whether the agent has reached location S in the grid. The RM starts at its initial state. It transitions to a penalty state if the items are not picked up in the desired sequence and thereafter obtains a fixed barrier reward at every step. The per-step reward increases after the agent picks up item 1, and increases again after it picks up item 2. It increases further after the agent picks up all three items. Note that the RM only handles the sequential pickup initially; the uncontrollable drop of item 1 after the agent picks up all three items is handled by the automaton of Fig. 4. Once all three items are picked up in sequence, the RM stays in that state until the agent reaches S.
Simulation Results: The simulation was run for 500,000 iterations, where each iteration consisted of one episode of a maximum of 60 steps. The learning hyperparameters were set to fixed values; additionally, $\epsilon$ was decayed by 5% after every period of 10,000 iterations. The single-step reward is defined as in Fig. 5 in accordance with the items picked and their order. Fig. 6 shows the average score, i.e., the average accumulated reward over an episode, as training progressed. Fig. 7 shows the average percentage of times the task was completed. All data shown are averages over intervals of 10,000 iterations during training. At the end of training, $\epsilon$ has decayed to its final value, and the resulting $\epsilon$-greedy policy completes the task around 95% of the time, as shown in Fig. 7. Upon inference, i.e., upon setting $\epsilon = 0$ post training, the agent completes the task optimally 100% of the time.
V Conclusion
In this paper, we borrowed concepts from the supervisory control theory of discrete event systems to develop a method to learn optimal control policies for a finite-state Markov Decision Process in which (only) certain sequences of actions are deemed unsafe (respectively safe), while accounting for the possibility that a subset of actions might be uncontrollable. We assumed that the constraints can be modeled using a finite automaton, and a natural future direction of research is to develop such methods for a more general class of constraints.
References
- [1] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT press, 2018.
- [2] J. García and F. Fernández, “A comprehensive survey on safe reinforcement learning,” Journal of Machine Learning Research, vol. 16, no. 1, pp. 1437–1480, 2015.
- [3] K. C. Sou, J. Weimer, H. Sandberg, and K. H. Johansson, “Scheduling smart home appliances using mixed integer linear programming,” in 2011 50th IEEE Conference on Decision and Control and European Control Conference. IEEE, 2011, pp. 5144–5149.
- [4] R. Kaur, C. Schaye, K. Thompson, D. C. Yee, R. Zilz, R. Sreenivas, and R. B. Sowers, “Machine learning and price-based load scheduling for an optimal iot control in the smart and frugal home,” Energy and AI, vol. 3, p. 100042, 2021.
- [5] P. J. Ramadge and W. M. Wonham, “The control of discrete event systems,” Proceedings of the IEEE, vol. 77, no. 1, pp. 81–98, 1989.
- [6] C. G. Cassandras and S. Lafortune, Introduction to discrete event systems. Springer Science & Business Media, 2009.
- [7] R. T. Icarte, T. Klassen, R. Valenzano, and S. McIlraith, “Using reward machines for high-level task specification and decomposition in reinforcement learning,” in International Conference on Machine Learning. PMLR, 2018, pp. 2107–2116.
- [8] C. J. Watkins and P. Dayan, “Q-learning,” Machine learning, vol. 8, no. 3-4, pp. 279–292, 1992.
- [9] W. M. Wonham and P. J. Ramadge, “On the supremal controllable sublanguage of a given language,” SIAM Journal on Control and Optimization, vol. 25, no. 3, pp. 637–659, 1987.