
Reinforcement Learning for Task Specifications with Action-Constraints

Arun Raman, Keerthan Shagrithaya, Shalabh Bhatnagar
Department of Computer Science and Automation, Indian Institute of Science, Bangalore, India
{arunraman, keerthans, shalabh}@iisc.ac.in
*The work of the first author was supported by the C. V. Raman postdoctoral fellowship. The work of the second and third authors was supported by the J. C. Bose fellowship of the third author.
Abstract

In this paper, we use concepts from supervisory control theory of discrete event systems to propose a method to learn optimal control policies for a finite-state Markov Decision Process (MDP) in which (only) certain sequences of actions are deemed unsafe (respectively safe). We assume that the set of action sequences that are deemed unsafe and/or safe is given in terms of a finite-state automaton, and propose a supervisor that disables a subset of actions at every state of the MDP so that the constraints on action sequences are satisfied. We then present a version of the Q-learning algorithm for learning optimal policies in the presence of non-Markovian action-sequence and state constraints, where we use the development of reward machines to handle the state constraints. We illustrate the method using an example that captures the utility of automata-based methods for non-Markovian state and action specifications for reinforcement learning, and present the results of simulations in this setting.

I Introduction

In reinforcement learning (RL) [1], an agent learns optimal behaviour from experience gained through interaction with the environment, guided by the rewards that its actions fetch at each state. Safe reinforcement learning is a paradigm in which the aim is to not violate a set of constraints during the process of learning [2]. In many applications, the constraints may be given in terms of sequences of actions. For example, consider an agent that can move in the four cardinal directions and traverses a grid-like warehouse environment. Here, the safety requirement corresponding to one-way traffic can be specified as never taking two consecutive right turns, which would amount to a U-turn. In several scheduling applications, if scheduling an item is interpreted as an action, then constraints on the sequence of actions arise naturally. For example, in real-time scheduling of home appliances, the dryer cannot be scheduled before the washer [3, 4]. In this paper, we are interested in a method for obtaining optimal control policies in such action-constrained scenarios.

Oftentimes, the system states and constraints are affected by uncontrollable events. For example, some actuators of a robot can break down due to faults. The task specification for an agent should be robust to such uncontrollable events. The subject of uncontrollable events altering the state of the system is well studied in the theory of supervisory control of discrete event systems [5, 6]. A Discrete Event System (DES) is a discrete-state system whose state changes at discrete time instants due to the occurrence of events. If we associate a (not necessarily unique) label with each event of the DES, then its behaviour can be described by the language it generates over the alphabet defined by the set of labels. In such a setting, the desired behaviour of the DES is also specified as a language. A supervisory policy enforces a desired language specification by disabling a subset of events known as the controllable events. The complementary set of events, which cannot be disabled by the supervisor, are called uncontrollable events. It is important to note that a supervisory policy cannot force an event to occur; it can only disable a controllable event. As such, a supervisory policy aims for minimum intervention in the operation of the system, intervening only when a particular event can lead to the violation of a specification. In this paper, we interpret the uncontrollable events as (uncontrollable) actions, which can then be used to specify an overall action-constraint for the system that takes the possibility of uncontrollable events into account. We limit our analysis to those action-constraints that can be represented by a finite automaton.

The environment in the Reinforcement Learning (RL) paradigm is typically modeled as a Markov Decision Process (MDP). If all the events of a Discrete Event System (DES) are controllable, then there are parallels between supervisory control of a DES and the theory of synthesizing policies for an MDP. Fundamentally, both involve solving a sequential decision-making problem in which the state evolution of the model is Markovian. A key difference between the two, however, is that the action space of an MDP does not, in general, consider actions to be uncontrollable. This is because even if such aspects exist in the system, they can be absorbed into the MDP model by introducing a probability distribution for the next state that accounts for the possible occurrence of an uncontrollable action. Such an approach works because the objective in the MDP setting is to find an optimal policy that minimizes an expected cost. However, if uncontrollable actions appear in the safety specification, then they can no longer be abstracted by a probability distribution and must appear explicitly in the MDP model. In such scenarios, the 'safety' part of the safe RL paradigm naturally invites results from supervisory control theory. Our main contributions in this paper are the following:

  1. To the best of our knowledge, this paper is the first work that bridges the areas of supervisory control theory (SCT) and reinforcement learning. Such an association permits us to directly use an important theoretical result from SCT: to determine, offline and before training, whether there exists a policy that satisfies the action-constraints in the presence of uncontrollable actions. If no such policy exists, then another important result in SCT, informally, characterizes the closest approximation of the given constraints that a supervisory policy can enforce. These results are relevant in safe reinforcement learning for complex systems in which correctly formulating the constraints is a task in itself.

  2. We present a version of the Q-learning algorithm to learn optimal control policies for an underlying finite-state MDP in which:

     (a) a subset of actions might be uncontrollable, and

     (b) the task specification or the safety constraints for the system are given in terms of sequences of actions modeled by a finite-state automaton. The constraints can be given in terms of sequences that are safe and/or unsafe. This generality simplifies the problem formulation to some extent.

     The adapted Q-learning algorithm also integrates the work on reward machines [7], which supports reward function specification as a function of state.

  3. Experimental results are presented for an illustrative example that captures the utility of automata-based methods for non-Markovian state and action specifications for reinforcement learning.

The rest of the paper is organized as follows. Section II presents notations and definitions that we use in the rest of the paper, along with the relevant results from supervisory control theory. Section III discusses supervisor synthesis. We work largely with finite automata in this section and use the development for supervisor synthesis to propose a version of the Q-learning algorithm [8] that can be used to learn an optimal policy in the presence of state and action constraints. In Section IV, we present an example to illustrate the use of automata-based methods to handle non-Markovian state and action constraints. We use the proposed method for the action-constraints and reward machines [7] for the state constraints. We present the results of the experiments and observe that the Q-learning algorithm adapted to this setting enables the agent to learn to complete the task optimally 100% of the time. We conclude the paper with Section V.

II Motivation, Notations and Definitions

In a reinforcement learning problem, the environment is modeled as a Markov Decision Process $\mathcal{M}=(S,A,r,p,\gamma)$ where $S$ and $A$ denote the finite sets of states and actions, respectively, $r:S\times A\times S\rightarrow\mathbb{R}$ denotes the reward function, $p(s_{t+1}|s_{t},a_{t})$ is the transition probability distribution and $\gamma\in(0,1)$ is the discount factor. In the paradigm of safe reinforcement learning, the objective is to learn optimal policies while satisfying a set of constraints. In this paper, we consider safety constraints that are given as a set of sequences of actions of the MDP. Next, we discuss the example gridworld in Fig. 1 to motivate the development.

Figure 1: A pick-up and delivery grid world

Consider a manufacturing facility layout as shown in Fig. 1. When a work piece appears at location ⓘ, the robot $R_{1}$ picks it up and delivers it to the buffer zone $B$. The robot $R_{2}$ picks the item from $B$ and delivers it to the processing facility o. The buffer $B$ can only store one item at a time; so it has two states, Empty ($E$) and Full ($F$). The robot state is determined by the position that it occupies in the grid and by a binary variable indicating whether it has picked up an item or not. There are six controllable actions for each robot: four correspond to movement in the grid and two correspond to picking and dropping an item ($p_{i},d_{i}$), $i\in\{1,2\}$. Additionally, we assume that robot $R_{2}$ can either be Working ($W$) or Down ($D$). There are two uncontrollable actions $\{\phi,\rho\}$ denoting the fault and rectification actions, which take $R_{2}$ from $W$ to $D$ and vice versa, respectively. We call them uncontrollable because they are caused by external factors/agents not modeled in this set up. We can also assume that an uncontrollable action makes an item appear at ⓘ. The objective is to design an optimal policy to satisfy the specifications below:

  1. $R_{1}$ delivers an item to $B$ only if $B$ is in $E$.

  2. $R_{2}$ picks an item from $B$ only if $B$ is in $F$ (thereby driving $B$ to $E$).

  3. $R_{1}$ and $R_{2}$ deliver a picked item in no more than 5 steps.

The above specifications 'look' okay but are not robust to uncontrollable actions. To see this, consider a string of actions $\sigma$. Formally, the specification in item 3 says that for a substring $p_{1}\sigma d_{1}$, it must be that $|\sigma|\leq 5$, where $|\sigma|$ denotes the length of $\sigma$, and $p_{1}$ and $d_{1}$ denote the picking and dropping of an item by $R_{1}$, respectively. Consider the scenario in which $R_{1}$ takes the action $p_{1}$ when the buffer is full, with the assumption that it will be emptied by $R_{2}$ in the next step. Such an action is consistent with item 1 of the specification. At this point, if the uncontrollable action $\phi$ occurs, then the satisfaction of item 3 of the specification will depend entirely on the occurrence of the uncontrollable rectification action $\rho$. As such, the specifications above are not robust to uncontrollable actions: the occurrence of an uncontrollable action can result in the generation of a string of actions that is not in the specification. In what follows, we discuss developments from supervisory control theory which facilitate a formal analysis of such aspects.

An automaton is a 5-tuple $G=(Q,\Sigma,\delta,q_{0},Q_{m})$ [6]. Here $Q$ and $\Sigma$ denote the set of states and (labelled) actions, respectively. The initial state is $q_{0}\in Q$, and $Q_{m}\subseteq Q$ is called the set of marked states. States can be marked for different reasons, and we discuss this further later in the paper. The function $\delta:\Sigma\times Q\rightarrow Q$ is the state transition function, where for $q\in Q,\sigma\in\Sigma$, $\delta(\sigma,q)=\widehat{q}$ implies there is an action labeled $\sigma$ from state $q$ to state $\widehat{q}$. In general, the state transition function $\delta$ may not be defined for all state-action pairs. The active action function $\Gamma:Q\rightarrow 2^{\Sigma}$ identifies the set of all actions $\sigma\in\Sigma$ for which $\delta(\sigma,q)$ is defined.

Let $\Sigma^{*}$ denote the set of all finite strings of elements of $\Sigma$, including the empty string $\epsilon$. We write $\delta^{*}(\omega,q)=\widehat{q}$ if there is a string of actions $\omega\in\Sigma^{*}$ from $q$ to state $\widehat{q}$, that is, if $\widehat{q}$ is reachable from $q$. Given $G$, the set of all admissible strings of actions is denoted by $L(G)$ and is referred to as the language generated by $G$ over the alphabet $\Sigma$. Formally, for the initial state $q_{0}$:

$$L(G)=\{\omega\in\Sigma^{*}:\delta^{*}(\omega,q_{0})\text{ is defined}\}. \qquad (1)$$

The marked language, $L_{m}(G)\subseteq L(G)$, generated by $G$ is

$$L_{m}(G)=\{\omega\in\Sigma^{*}:\delta^{*}(\omega,q_{0})\in Q_{m}\}.$$

A string $u$ is a prefix of a string $v\in\Sigma^{*}$ if $v=uw$ for some $w\in\Sigma^{*}$. If $v$ is an admissible string in $G$, then so are all its prefixes. We define the prefix closure of $L\subseteq\Sigma^{*}$ to be the language $\overline{L}$ defined as:

$$\overline{L}=\{u:uv\in L\text{ for some }v\in\Sigma^{*}\}.$$
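To make the notation concrete, the following Python sketch (illustrative, not from the paper; all names are our own) encodes an automaton with a partial transition function and tests membership of a string in $L(G)$ and $L_{m}(G)$.

class Automaton:
    """A finite automaton G = (Q, Sigma, delta, q0, Qm) with a partial transition map."""
    def __init__(self, states, alphabet, delta, q0, marked):
        self.states = states        # Q
        self.alphabet = alphabet    # Sigma
        self.delta = delta          # dict: (sigma, q) -> q_hat, partial
        self.q0 = q0                # initial state
        self.marked = marked        # Qm

    def active(self, q):
        """Gamma(q): the actions defined at state q."""
        return {sigma for (sigma, p) in self.delta if p == q}

    def run(self, word):
        """Return delta*(word, q0), or None if the word is not admissible."""
        q = self.q0
        for sigma in word:
            if (sigma, q) not in self.delta:
                return None
            q = self.delta[(sigma, q)]
        return q

    def in_language(self, word):          # word in L(G)?
        return self.run(word) is not None

    def in_marked_language(self, word):   # word in Lm(G)?
        return self.run(word) in self.marked

# Example: 'a' is allowed from state 0, 'b' from state 1, and state 1 is marked.
G = Automaton({0, 1}, {'a', 'b'}, {('a', 0): 1, ('b', 1): 0}, 0, {1})
assert G.in_language(['a', 'b', 'a']) and G.in_marked_language(['a'])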

We can use the above notation to interpret the MDP $\mathcal{M}$ as a language generator $(S,A,\Sigma,\mathcal{L},r,p,\gamma)$, where $\Sigma$ denotes a set of labels and $\mathcal{L}:A\rightarrow\Sigma$ denotes a labelling function that associates a label with each action. For simplicity, we assume that each action is uniquely labeled and, with some abuse of notation, use $\Sigma$ as a proxy for the set of actions. Then the function $\delta:\Sigma\times S\rightarrow S$ is the state transition function, where for $q\in S,\sigma\in\Sigma$, $\delta(\sigma,q)=\widehat{q}$ means that there is an action $a\in A$ labeled $\sigma$ such that $p(\widehat{q}|q,a)>0$. The language generated can be defined appropriately.

A policy $\pi(a|s)$ in the context of RL is a probability distribution over the set of actions $a\in A$ at a given state $s\in S$. At each time step $t$, the agent is in a particular state, say $s_{t}$. It selects an action $a_{t}$ according to $\pi(\cdot|s_{t})$ and moves to a next state $s_{t+1}$ with probability $p(s_{t+1}|s_{t},a_{t})$. It also receives a reward $r(s_{t+1},a_{t},s_{t})$ for this transition. The process then repeats from $s_{t+1}$. The agent's goal is to find a policy $\pi^{*}$ that maximizes the expected discounted future reward from every state in $S$. Next we introduce the notion of supervisory control, which trims the set of candidate policies to only those that do not violate the safety specification.

The set of actions $\Sigma$ is partitioned into the set of controllable actions $\Sigma_{c}$ and the set of uncontrollable actions $\Sigma_{u}$. More specifically, $\Sigma_{c}$ (resp. $\Sigma_{u}$) denotes the set of actions that can (resp. cannot) be disabled by the supervisor. Formally, a supervisor $\mathcal{S}:L(\mathcal{M})\rightarrow 2^{\Sigma}$ is a function from the language generated by $\mathcal{M}$ to the power set of $\Sigma$. We use $\mathcal{S}/\mathcal{M}$ to denote the supervised MDP (SMDP) in which supervisor $\mathcal{S}$ is controlling the MDP $\mathcal{M}$. Note that the supervisor only disables actions; it does not choose which action should be taken at any time instant. That (optimal) choice is determined by $\pi(a|s)$. The uncontrollable actions are always enabled by the supervisor.

We are interested in evaluating an optimal policy $\pi^{*}$ that satisfies constraints specified in terms of a set of sequences of actions. The first question of interest is whether such a policy exists. We can directly use the controllability condition from the supervisory control theory literature to answer this question [5, 6]:

Theorem 1

Let $K$ be a desired language specification over the alphabet $\Sigma$ denoting the safety constraints for the MDP $\mathcal{M}$. There is a supervisor $\mathcal{S}$ such that $L(\mathcal{S}/\mathcal{M})=\overline{K}$ if and only if $\overline{K}\Sigma_{u}\cap L(\mathcal{M})\subseteq\overline{K}$.

Proof:

See Section 3.4.1 of [6]. ∎

The above condition states that there is a supervisor that enforces the desired specification $K$ if and only if the occurrence of an uncontrollable action of $\mathcal{M}$ from a prefix of $K$ results in a string that is also a prefix of $K$. For the gridworld example in Fig. 1, the uncontrollable action $\phi$ produced a string from which a string violating item 3 of the specification can be generated; it therefore violated the controllability condition.
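The controllability condition of Theorem 1 can be checked offline. The sketch below (our illustration, assuming the prefix-closed specification $\overline{K}$ and the plant language $L(\mathcal{M})$ are each represented by an Automaton object as in the earlier sketch, and that sigma_u is the set of uncontrollable actions) explores the reachable part of their synchronous product; the specification is controllable if and only if no uncontrollable action admissible in the plant is blocked by the specification at a reachable product state.

from collections import deque

def is_controllable(spec, plant, sigma_u):
    """Check that K-bar Sigma_u intersected with L(M) stays inside K-bar."""
    seen = {(spec.q0, plant.q0)}
    frontier = deque(seen)
    while frontier:
        q_spec, q_plant = frontier.popleft()
        for sigma in plant.active(q_plant):
            if sigma in sigma_u and sigma not in spec.active(q_spec):
                return False                    # an uncontrollable action is blocked
            if sigma in spec.active(q_spec):    # follow moves allowed by both
                nxt = (spec.delta[(sigma, q_spec)], plant.delta[(sigma, q_plant)])
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append(nxt)
    return True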

If the condition in Theorem 1 is not satisfied, then there does not exist a supervisor that can enforce the given specification. In such a case, the best that the supervisor can do is enforce the supremal controllable sublanguage of $K$, which is the union of all controllable sublanguages of $K$ [9]:

$$K^{\uparrow}=\bigcup_{i}\{K_{i}:(K_{i}\subseteq K)\wedge(\overline{K}_{i}\Sigma_{u}\cap L(G)\subseteq\overline{K}_{i})\}.$$

Without going into the formal analysis, the supremal controllable sublanguage (the modified specification) for the gridworld example would involve $R_{1}$ not picking an item unless $B$ is empty.

In this paper, we will assume, in different ways, that the given specification can be modeled as a finite automaton and that it is controllable. An automaton is represented by a directed graph (state transition diagram) whose nodes correspond to the states of the automaton, with an edge from $q$ to $\widehat{q}$ labeled $\sigma$ for each triple $(\sigma,q,\widehat{q})$ such that $\widehat{q}=\delta(\sigma,q)$.

We use the concept of reward machines to handle the state constraints [7]. Given a set of propositional symbols $\mathcal{P}$, a set of (environment) states $S$, and a set of actions $A$, a Reward Machine (RM) is a tuple $\mathcal{R}_{\mathcal{P}SA}=\langle U,u_{0},\delta_{u},\delta_{r}\rangle$ where $U$ is a finite set of states, $u_{0}\in U$ is an initial state, $\delta_{u}:U\times 2^{\mathcal{P}}\rightarrow U$ is the state-transition function and $\delta_{r}:U\times U\rightarrow[S\times A\times S\rightarrow\mathbb{R}]$ is the reward-transition function. An MDP with a Reward Machine (MDPRM) is a tuple $T=\langle S,A,p,\gamma,\mathcal{P},L,U,u_{0},\delta_{u},\delta_{r}\rangle$, where $S,A,p$, and $\gamma$ are defined as in an MDP, $\mathcal{P}$ is a set of propositional symbols, $L:S\rightarrow 2^{\mathcal{P}}$ is a labelling function, and $U,u_{0},\delta_{u}$, and $\delta_{r}$ are defined as in an RM. The operation of an MDPRM is as follows. At every decision epoch of $\mathcal{M}$, there is a transition in $\mathcal{R}_{\mathcal{P}SA}$ as well. If the current state of $\mathcal{R}_{\mathcal{P}SA}$ is $u$ and taking action $a$ from $s$ in $\mathcal{M}$ leads to $s^{\prime}$, then the next state of $\mathcal{R}_{\mathcal{P}SA}$ is $u^{\prime}=\delta_{u}(u,L(s^{\prime}))$ and it outputs a reward function $\delta_{r}(u,u^{\prime})$ which determines the reward in $\mathcal{M}$ for this transition. The main idea in handling state constraints using RMs is to associate barrier rewards with state sequences that violate the desired specification. Reward machines can be used to model any Markovian reward function. They can also be used to express any non-Markovian reward function as long as the state histories in $S^{*}$ on which the reward depends can be distinguished by different elements of a finite set of regular expressions over $\mathcal{P}$.
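The following minimal sketch (our illustration; names are not from [7]) encodes a reward machine with $\delta_{u}$ and $\delta_{r}$ as dictionaries and advances it by one decision epoch.

class RewardMachine:
    """An RM <U, u0, delta_u, delta_r> with dictionary-encoded transition functions."""
    def __init__(self, u0, delta_u, delta_r):
        self.u0 = u0
        self.delta_u = delta_u   # dict: (u, frozenset(label)) -> u'
        self.delta_r = delta_r   # dict: (u, u') -> callable (s, a, s') -> reward

    def step(self, u, label, s, a, s_next):
        """One decision epoch: return the next RM state and the reward it emits."""
        u_next = self.delta_u[(u, frozenset(label))]
        reward = self.delta_r[(u, u_next)](s, a, s_next)
        return u_next, reward

# Example: reward 1 on the step where proposition 'goal' first holds, 0 otherwise.
rm = RewardMachine(
    u0='u0',
    delta_u={('u0', frozenset()): 'u0', ('u0', frozenset({'goal'})): 'u1',
             ('u1', frozenset()): 'u1', ('u1', frozenset({'goal'})): 'u1'},
    delta_r={('u0', 'u0'): lambda s, a, sn: 0.0,
             ('u0', 'u1'): lambda s, a, sn: 1.0,
             ('u1', 'u1'): lambda s, a, sn: 0.0},
)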

III Reinforcement Learning with Action-Constraints

Consider an MDP $\mathcal{M}=(S,A,\Sigma,L,r,p,\gamma)$. The action-constraints can be given in terms of sequences of actions that are safe to perform or in terms of sequences that are unsafe. We propose methods for supervisor synthesis for both cases in this section.

Consider an automaton $G_{s}^{\prime}=(Q_{s}^{\prime},\Sigma,\delta_{s}^{\prime},q_{0s},\emptyset)$, with an empty set of marked states, that models the sequences of actions that are safe. We construct an automaton $G_{s}=(Q_{s},\Sigma,\delta_{s},q_{0s},\{s_{a}\})$ from $G_{s}^{\prime}$ by adding a (marked) state $s_{a}$ that flags the occurrence of an action violating the specification. The formal construction is as follows:

  1. $G_{s}$ has one additional state: $Q_{s}=Q_{s}^{\prime}\cup\{s_{a}\}$, $Q_{m}=\{s_{a}\}$.

  2. The transition function of $G_{s}$ is identical to that of $G_{s}^{\prime}$ for the state-action pairs for which $\delta_{s}^{\prime}$ is defined: $\forall s\in Q_{s}^{\prime},\forall\sigma\in\Gamma(s)$: $\delta_{s}(\sigma,s)=\delta_{s}^{\prime}(\sigma,s)$.

  3. For the state-action pairs for which $\delta_{s}^{\prime}$ is not defined, we have: $\forall s\in Q_{s}^{\prime},\forall\sigma\in(\Sigma-\Gamma(s))$: $\delta_{s}(\sigma,s)=s_{a}$.

  4. If $G_{s}$ reaches $s_{a}$, it stays there: $\forall\sigma\in\Sigma$: $\delta_{s}(\sigma,s_{a})=s_{a}$.

At every decision epoch of $\mathcal{M}$, there is a transition in $G_{s}$ as well. If the current state of $G_{s}$ is $s$ and an action $a$ is taken in $\mathcal{M}$, then $\delta_{s}(a,s)$ is the next state of $G_{s}$. If an action $\omega$ is taken in $\mathcal{M}$ for which $\delta_{s}^{\prime}(\omega,s)$ is not defined, then the action violates the safety constraint and, as per item 3 in the above construction, the state of $G_{s}$ transitions to $s_{a}$. The state $s_{a}$ of $G_{s}$ thus indicates that an action constraint has been violated. A supervisor for this case simply disables actions that lead to a transition to the state $s_{a}$. If there is an uncontrollable action that can result in a transition to $s_{a}$, then the specification does not satisfy the controllability condition of Theorem 1 and there does not exist a policy that can enforce the given specification. Automaton $G_{s}$ in Fig. 4 is an example in which $d_{2}d_{3}p_{1}p_{2}p_{3}$ is the sequence of actions that are to be performed when an uncontrollable action $d_{u}$ happens.
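The construction of $G_{s}$ from $G_{s}^{\prime}$ amounts to completing the transition function with an absorbing flag state. A possible sketch, reusing the Automaton class from Section II (our own illustrative code):

def complete_with_sink(g_prime, sink='s_a'):
    """Build G_s from G_s': route every undefined (action, state) pair to the marked sink s_a."""
    delta = dict(g_prime.delta)                    # keep the transitions of G_s'
    for q in g_prime.states:
        for sigma in g_prime.alphabet - g_prime.active(q):
            delta[(sigma, q)] = sink               # item 3: undefined moves flag a violation
    for sigma in g_prime.alphabet:
        delta[(sigma, sink)] = sink                # item 4: s_a is absorbing
    return Automaton(g_prime.states | {sink},
                     g_prime.alphabet, delta, g_prime.q0, {sink})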

Next, we consider the case in which the constraints are given in terms of the sequences of actions that are unsafe. (One way of handling this situation would be to evaluate the sequences of actions that are safe and then proceed with the earlier method; we assume that the set of safe sequences can be represented by an automaton.) Recall that $L_{m}(\mathcal{G})\subseteq L(\mathcal{G})$ denotes the marked language generated by an automaton $\mathcal{G}$. We assume that the set of strings of actions that are unsafe, $S_{un}$, is given as an automaton $H_{1}=(Q_{h}^{\prime},\Sigma_{h}^{\prime},\delta_{h}^{\prime},q_{0h}^{\prime},Q_{mh}^{\prime})$ for which $L_{m}(H_{1})=S_{un}$. We then modify $H_{1}$ to obtain an automaton $H=(Q_{h},\Sigma_{h},\delta_{h},q_{0h},Q_{mh})$ as follows:

  1. $Q_{h}=Q_{h}^{\prime}$, $\Sigma_{h}=\Sigma$, $q_{0h}=q_{0h}^{\prime}$, $Q_{mh}=Q_{mh}^{\prime}$.

  2. $\forall q\in Q_{h}^{\prime}$, $\forall\sigma\in\Gamma_{h}^{\prime}(q)$: $\delta_{h}(\sigma,q)=\delta_{h}^{\prime}(\sigma,q)$.

  3. $\forall q\in Q_{h}^{\prime}$, $\forall\sigma\in\Sigma-\Gamma_{h}^{\prime}(q)$: $\delta_{h}(\sigma,q)=q_{0h}$.

That is, $H$ can be constructed from $H_{1}$ by adding outgoing arcs from every state in $H_{1}$ to the initial state, corresponding to the actions in $\Sigma$ that do not already appear as an outgoing arc at that state in $H_{1}$. This modification ensures that for every action in $\mathcal{M}$, a corresponding transition is defined in $H$. The operation of $H$ is the same as that of $G_{s}$. At every decision epoch of $\mathcal{M}$, there is a transition in $H$. If the current state of $H$ is $s$ and an action $a$ is taken in $\mathcal{M}$, then $\delta_{h}(a,s)$ is the next state of $H$, where $\delta_{h}$ is the transition function of $H$. For simplicity, we label every $q\in Q_{mh}$ by $s_{a}$. The supervisor disables every controllable action which results in a next state that is marked ($s_{a}$). Suppose $\mathcal{M}$ has generated a string $\sigma$ that has not resulted in a transition to $s_{a}$ in $H$; then $\sigma$ is a prefix of an admissible string. Following $\sigma$, if there is an uncontrollable action that results in a transition to $s_{a}$, then the specification does not satisfy the controllability condition. The automaton $H$ in Fig. 3 models the specification in which strings containing two consecutive lefts or rights ($\{ll,rr,lrr,rlrll,\ldots\}$) are deemed unsafe.
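The construction of $H$ from $H_{1}$ can be sketched in the same style (again our illustrative code, assuming $H_{1}$ is an Automaton as in the earlier sketch); undefined moves return to the initial state rather than to a sink, so $H$ keeps scanning for unsafe suffixes.

def complete_with_reset(h1, full_alphabet):
    """Build H from H_1: route every undefined (action, state) pair back to the initial state."""
    delta = dict(h1.delta)
    for q in h1.states:
        for sigma in full_alphabet - h1.active(q):
            delta[(sigma, q)] = h1.q0              # item 3: reset on actions not in H_1
    return Automaton(h1.states, full_alphabet, delta, h1.q0, h1.marked)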

Remark: We have modeled the sequences of actions that are unsafe by "marking" some states as unsafe. Marked states have a completely different interpretation in supervisory control theory (SCT), where they are often associated with desirable features, such as signifying the end of an execution with the automaton "recognizing" the string to be in its language. In fact, in SCT, if $\overline{L}_{m}(G)=L(G)$, then we say that $G$ is non-blocking, which means that the system can always reach a target (marked) state from any state in its reachable set. Deadlock freedom is an instance of the non-blocking property. Some simple modifications of the method discussed above are needed if both interpretations have to be handled simultaneously.

Algorithm 1 shows the pseudocode for Supervised Q-learning for non-Markovian action and state constraints. It receives as input the automata $G_{s}^{\prime}$ and $H_{1}$ describing the action constraints, the set of propositional symbols $\mathcal{P}$, the labelling function $L$, the discount factor $\gamma$, and the reward machine $\mathcal{R}$ over $\mathcal{P}$. The goal is to learn an optimal policy given the state and action constraints.

Algorithm 1 Supervised Q-learning for non-Markovian action and state constraints
1:  Input: $G_{s}^{\prime}$, $H_{1}$, $\mathcal{P}$, $L$, $\gamma$, $\mathcal{R}=\langle U,\delta_{u},\delta_{r},u_{0}\rangle$
2:  Construct $G_{s}$ and $H$ from $G_{s}^{\prime}$ and $H_{1}$ respectively
3:  $\widetilde{Q}\leftarrow$ InitializeQValueFunction()
4:  for $l=0$ to num_episodes do
5:     $u_{j}\leftarrow u_{0}$; $q_{s}\leftarrow q_{0s}$; $q_{h}\leftarrow q_{0h}$; $s\leftarrow$ EnvInitialState()
6:     for $t=0$ to length_episode do
7:        if EnvDeadEnd($s$) then
8:           break
9:        end if
10:       $a\leftarrow$ GetActionEpsilonGreedy($\widetilde{q}_{q_{s},q_{h},u_{j}}$, $s$) such that $(\delta_{s}(a,q_{s})\neq s_{a})\wedge(\delta_{h}(a,q_{h})\neq s_{a})$
11:       $s^{\prime}\leftarrow$ EnvExecuteAction($s,a$)
12:       $u_{k}\leftarrow\delta_{u}(u_{j},L(s^{\prime}))$
13:       $r\leftarrow\delta_{r}(u_{j},u_{k})$
14:       $q_{m}\leftarrow\delta_{s}(a,q_{s})$; $q_{n}\leftarrow\delta_{h}(a,q_{h})$
15:       if EnvDeadEnd($s^{\prime}$) then
16:          $\widetilde{q}_{q_{s},q_{h},u_{j}}(s,a)\leftarrow\widetilde{q}_{q_{s},q_{h},u_{j}}(s,a)+\alpha\, r(s,a,s^{\prime})$
17:       else
18:          $\widetilde{q}_{q_{s},q_{h},u_{j}}(s,a)\leftarrow\widetilde{q}_{q_{s},q_{h},u_{j}}(s,a)+\alpha\,[r(s,a,s^{\prime})+\gamma\max_{a^{\prime}}\widetilde{q}_{q_{m},q_{n},u_{k}}(s^{\prime},a^{\prime})-\widetilde{q}_{q_{s},q_{h},u_{j}}(s,a)]$
19:       end if
20:       $q_{s}\leftarrow\delta_{s}(a,q_{s})$; $q_{h}\leftarrow\delta_{h}(a,q_{h})$
21:       $u_{j}\leftarrow\delta_{u}(u_{j},L(s^{\prime}))$; $s\leftarrow s^{\prime}$
22:     end for
23:  end for

The algorithm learns one q-value function per state of the automata and the reward machine. That is, it learns $|Q_{s}|\times|Q_{h}|\times|U|$ q-value functions in total, where $|\cdot|$ denotes the size of the set argument. These q-functions are stored in $\widetilde{Q}$, and $\widetilde{q}_{q_{s},q_{h},u_{j}}\in\widetilde{Q}$ corresponds to the q-value function for states $q_{s}\in Q_{s}$ and $q_{h}\in Q_{h}$ of the two automata and $u_{j}\in U$ of the reward machine.

After initializing $\widetilde{Q}$ in step 3, the algorithm has two nested loops: one over the number of episodes (step 4), and the other over the length of each episode (step 6). Step 10 selects an action $a$ under supervision: only those actions that do not lead to the state $s_{a}$ in either of the two automata are considered for selection by the epsilon-greedy method. Thereafter, the agent executes the action (step 11), evaluates the reward function through the transition in the reward machine (steps 12 and 13) and then updates the q-value function (steps 15-19) using the standard q-learning rule. Note that the maximization step is over $\widetilde{q}_{q_{m},q_{n},u_{k}}$ since those q-values would be used for the selection of $a^{\prime}$ after $G_{s}$, $H$ and $\mathcal{R}$ have transitioned. Action $a^{\prime}$ is to be picked so that $s_{a}$ is not reached in $G_{s}$ and $H$.
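As an illustration of step 10, the supervised epsilon-greedy selection can be sketched as follows (our own code, assuming $G_{s}$ and $H$ are Automaton objects whose marked states play the role of $s_{a}$, and that the q-values are stored as nested dictionaries keyed by the automaton and RM states):

import random

def supervised_epsilon_greedy(q_table, q_s, q_h, u_j, s, actions, g_s, h, epsilon):
    """Pick an action at s, restricted to those the supervisor does not disable."""
    allowed = [a for a in actions
               if g_s.delta[(a, q_s)] not in g_s.marked
               and h.delta[(a, q_h)] not in h.marked]
    if not allowed:   # every controllable action is disabled at this configuration
        raise RuntimeError("supervisor disabled every candidate action")
    if random.random() < epsilon:
        return random.choice(allowed)
    q_values = q_table[(q_s, q_h, u_j)][s]     # dict: action -> estimated value
    return max(allowed, key=lambda a: q_values.get(a, 0.0))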

IV Illustrative Example

Figure 2: A pick-up and delivery grid world

Setting: Consider the pick-up and delivery grid world in Fig. 2. The agent starts from the position S, picks up the items labeled 1, 2 and 3, and returns to S. The agent unloads the items in a last-in-first-out order and the unloading sequence must be 3-2-1. Therefore, the nominal pick-up of the items is in sequence from 1 to 3. The objective of the agent is to finish this process in the minimum number of steps.

State and action space: The state of the agent is composed of its position in the grid, its orientation, and an indicator of the items that the agent has picked up. The position of the agent in the grid is denoted by the coordinate of the pixel it occupies, with the bottom-leftmost corner of the grid denoting the origin. We use the set of symbols $\{\uparrow,\downarrow,\leftarrow,\rightarrow\}$ to describe its orientation and a 3-bit binary variable to denote the items that the agent has picked up. The agent's movement in the grid is governed by four actions $l,r,f$ and $b$ denoting movement in four directions: left, right, forward and backward, respectively. The actions $r$ and $l$ change the orientation of the agent by 90 degrees clockwise and anticlockwise, respectively. In addition, they also lead to a change in position by one pixel forward in the direction of the new orientation. For example, if $l$ and $r$ are feasible actions from a position $(x,y,\uparrow)$, then:

$$(x,y,\uparrow)\xrightarrow{l}(x-1,y,\leftarrow),$$
$$(x,y,\uparrow)\xrightarrow{r}(x+1,y,\rightarrow).$$

The actions $f$ and $b$ do not change the orientation and only result in a change of position by one pixel forward or backward, respectively, relative to the current orientation of the agent. If $f$ and $b$ are feasible actions from a position $(x,y,\uparrow)$, then we have:

$$(x,y,\uparrow)\xrightarrow{f}(x,y+1,\uparrow),$$
$$(x,y,\uparrow)\xrightarrow{b}(x,y-1,\uparrow).$$
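A possible encoding of these movement dynamics (our illustration; pick/drop actions, grid boundaries and obstacles are omitted):

HEADINGS = ['N', 'E', 'S', 'W']                     # up, right, down, left
STEP = {'N': (0, 1), 'E': (1, 0), 'S': (0, -1), 'W': (-1, 0)}

def move(x, y, heading, action):
    """Apply one of l, r, f, b to a pose (x, y, heading) as in the transitions above."""
    i = HEADINGS.index(heading)
    if action == 'l':
        heading = HEADINGS[(i - 1) % 4]             # turn 90 degrees anticlockwise
    elif action == 'r':
        heading = HEADINGS[(i + 1) % 4]             # turn 90 degrees clockwise
    elif action == 'b':
        dx, dy = STEP[heading]
        return x - dx, y - dy, heading              # one pixel backward, same heading
    dx, dy = STEP[heading]                          # l, r and f all advance one pixel
    return x + dx, y + dy, heading                  # along the (possibly new) heading

# e.g. move(2, 2, 'N', 'l') == (1, 2, 'W'), matching (x, y, up) --l--> (x-1, y, left)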

We use $d_{1}$, $d_{2}$ and $d_{3}$ (respectively $p_{1}$, $p_{2}$ and $p_{3}$) for the actions denoting dropping (respectively picking) of the corresponding items by the agent. We assume that the agent cannot reliably carry all three items together, because of which it can accidentally drop an item when it is carrying all three of them. For simplicity, we assume that it can accidentally drop only item 1. We model this by an uncontrollable action $d_{u}$. The illustration can easily be extended to account for the dropping of either of the other items.

Action Constraints

  1. The agent cannot make a U-turn. We interpret this as two consecutive rights or lefts. That is, the set $\{ll,rr\}$ denotes the illegal substrings of the string of actions. The automaton $H$, with initial state 0, in Fig. 3 describes the illegal sequences of actions corresponding to the set of strings $\{ll,rr\}$. For ease of exposition, the actions $d_{i}$ and $p_{i}$ are not shown in $H$; they can be added with a self loop around every state.

  2. In order for the agent to satisfy the constraint on the unloading sequence, if the agent accidentally drops the first item then it must also immediately drop the other two and pick all three in sequence again. That is, $d_{u}$ must be followed by the string $d_{2}d_{3}p_{1}p_{2}p_{3}$. Fig. 4 shows the automaton $G_{s}$ that constrains the agent to take the action sequence $d_{2}d_{3}p_{1}p_{2}p_{3}$ when the uncontrollable action $d_{u}$ happens. (For the case when the agent can drop either of the other two items, we can introduce two more uncontrollable actions $d_{u2}$ and $d_{u3}$ with remedial strings $d_{1}d_{3}p_{1}p_{2}p_{3}$ and $d_{1}d_{2}p_{1}p_{2}p_{3}$, respectively.)

Figure 3: An automaton $H$ with initial state 0 that accepts strings equivalent to a U-turn by the agent; for example, the strings $\{ll,rr,lrfbrr,\ldots\}$.

Fig. 5 shows the reward machine for imposing the specification of sequential pickup of items, in which $\mathcal{P}=\{0,1,\framebox{S}\}$. The labeling function generates a 3-bit binary number corresponding to the items that the agent has picked up. It uses the symbol S to indicate whether the agent has reached location S in the grid. The RM starts at state $u_{0}$. It transitions to state $u_{1}$ if the items are not picked up in the desired sequence and thereafter obtains a reward of $-20$ at every step. The reward for every step increases from $-10$ to $-8$ (from $-8$ to $-6$) after the agent picks up item 1 (respectively, item 2). It further increases to $-1$ after the agent picks up all three items. Note that the RM only handles the sequential pickup initially; the uncontrollable drop of item 1 after the agent picks up all three items is handled by the automaton $G_{s}$. Once all three items are picked up in sequence, the RM stays at $u_{4}$ until the agent reaches S.
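As a hedged, partial encoding of this reward machine using the RewardMachine sketch from Section II (we represent the label simply as the 3-bit pickup string plus the symbol 'S', and show only the nominal pickup path $u_{0}\rightarrow u_{2}\rightarrow u_{3}\rightarrow u_{4}\rightarrow u_{5}$ with the per-step rewards described above; the violation state $u_{1}$ and the self-loops are omitted):

const = lambda c: (lambda s, a, sn: c)           # constant per-step reward
delta_u = {
    ('u0', frozenset({'000'})): 'u0',            # nothing picked up yet
    ('u0', frozenset({'100'})): 'u2',            # item 1 picked up first
    ('u2', frozenset({'110'})): 'u3',            # then item 2
    ('u3', frozenset({'111'})): 'u4',            # then item 3
    ('u4', frozenset({'111', 'S'})): 'u5',       # carrying all three, back at S
}
delta_r = {
    ('u0', 'u0'): const(-10), ('u0', 'u2'): const(-8),
    ('u2', 'u3'): const(-6),  ('u3', 'u4'): const(-1),
    ('u4', 'u5'): const(0),
}
rm_example = RewardMachine('u0', delta_u, delta_r)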

Simulation Results: The simulation was run for 500,000 iterations, where each iteration consisted of one episode with a maximum of 60 steps. The hyperparameters were set as $\alpha=0.1$, $\epsilon=0.25$ and $\gamma=0.9$. Additionally, $\epsilon$ was decayed by 5% after every period of 10,000 iterations. The single-step reward is defined as in Fig. 5, in accordance with the items picked and their order. Fig. 6 shows the average score, i.e., the average accumulated reward over an episode, as training progressed. Fig. 7 shows the average percentage of times the task was completed. All data shown are averages over intervals of 10,000 iterations during training. At the end of training, $\epsilon$ reaches a value of 0.019. This $\epsilon$-greedy policy completes the task around 95% of the time, as shown in Fig. 7. Upon inference, i.e., upon setting $\epsilon=0$ after training, the agent completes the task optimally 100% of the time.

Figure 4: An automaton $G_{s}$ with initial state $w_{0}$ describing the sequence of safe actions when an uncontrollable action $d_{u}$ happens.
Figure 5: Reward machine for the pick-up and delivery grid world example.
Figure 6: Average score (y-axis) versus training epoch in thousands (x-axis), averaged across intervals of 10,000 iterations during training.
Figure 7: Percentage of times the task was completed (y-axis) versus training epoch in thousands (x-axis), averaged across intervals of 10,000 iterations during training.

V Conclusion

In this paper, we borrowed concepts from the supervisory control theory of discrete event systems to develop a method to learn optimal control policies for a finite-state Markov Decision Process in which (only) certain sequences of actions are deemed unsafe (respectively safe), while accounting for the possibility that a subset of actions might be uncontrollable. We assumed that the constraints can be modeled using a finite automaton; a natural future direction of research is to develop such methods for a more general class of constraints.

References

  • [1] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction.   MIT press, 2018.
  • [2] J. García and F. Fernández, "A comprehensive survey on safe reinforcement learning," Journal of Machine Learning Research, vol. 16, no. 1, pp. 1437–1480, 2015.
  • [3] K. C. Sou, J. Weimer, H. Sandberg, and K. H. Johansson, “Scheduling smart home appliances using mixed integer linear programming,” in 2011 50th IEEE Conference on Decision and Control and European Control Conference.   IEEE, 2011, pp. 5144–5149.
  • [4] R. Kaur, C. Schaye, K. Thompson, D. C. Yee, R. Zilz, R. Sreenivas, and R. B. Sowers, “Machine learning and price-based load scheduling for an optimal iot control in the smart and frugal home,” Energy and AI, vol. 3, p. 100042, 2021.
  • [5] P. J. Ramadge and W. M. Wonham, “The control of discrete event systems,” Proceedings of the IEEE, vol. 77, no. 1, pp. 81–98, 1989.
  • [6] C. G. Cassandras and S. Lafortune, Introduction to discrete event systems.   Springer Science & Business Media, 2009.
  • [7] R. T. Icarte, T. Klassen, R. Valenzano, and S. McIlraith, “Using reward machines for high-level task specification and decomposition in reinforcement learning,” in International Conference on Machine Learning.   PMLR, 2018, pp. 2107–2116.
  • [8] C. J. Watkins and P. Dayan, “Q-learning,” Machine learning, vol. 8, no. 3-4, pp. 279–292, 1992.
  • [9] W. M. Wonham and P. J. Ramadge, “On the supremal controllable sublanguage of a given language,” SIAM Journal on Control and Optimization, vol. 25, no. 3, pp. 637–659, 1987.