
Reinforcement Learning for Task Specifications with Action-Constraints

Arun Raman, Keerthan Shagrithaya, Shalabh Bhatnagar
Department of Computer Science and Automation, Indian Institute of Science, Bangalore, India
{arunraman, keerthans, shalabh}@iisc.ac.in
*The work of the first author was supported by the C. V. Raman postdoctoral fellowship. The work of the second and third authors was supported by the J. C. Bose fellowship of the third author.
Abstract

In this paper, we use concepts from supervisory control theory of discrete event systems to propose a method to learn optimal control policies for a finite-state Markov Decision Process (MDP) in which (only) certain sequences of actions are deemed unsafe (respectively safe). We assume that the set of action sequences that are deemed unsafe and/or safe is given in terms of a finite-state automaton, and propose a supervisor that disables a subset of actions at every state of the MDP so that the constraints on action sequences are satisfied. We then present a version of the Q-learning algorithm for learning optimal policies in the presence of non-Markovian action-sequence and state constraints, where we use the development of reward machines to handle the state constraints. We illustrate the method using an example that captures the utility of automata-based methods for non-Markovian state and action specifications for reinforcement learning, and present the results of simulations in this setting.

I Introduction

In reinforcement learning (RL) [1], an agent learns optimal behaviour from experience gained through interaction with the environment, guided by the rewards that its actions fetch at each state. Safe reinforcement learning is a paradigm in which the aim is to not violate a set of constraints during the process of learning [2]. In many applications, the constraints may be given in terms of sequences of actions. For example, consider an agent that can move in the four cardinal directions and traverses a grid-like warehouse environment. Here, the safety requirement corresponding to one-way traffic can be specified as never taking two consecutive right turns, which would amount to a U-turn. In several scheduling applications, if scheduling an item is interpreted as an action, then constraints on the sequence of actions arise naturally. For example, in real-time scheduling of home appliances, the dryer cannot be scheduled before the washer [3, 4]. In this paper, we are interested in a method for obtaining optimal control policies in such action-constrained scenarios.

Oftentimes, the system states and constraints are affected by uncontrollable events. For example, some actuators of a robot can break down due to faults. The task specification for an agent should be robust to such uncontrollable events. The subject of uncontrollable events altering the state of the system is well studied in the theory of supervisory control of discrete event systems [5, 6]. A Discrete Event System (DES) is a discrete-state system whose state changes at discrete time instants due to the occurrence of events. If we associate a (not necessarily unique) label with each event of the DES, then its behaviour can be described by the language it generates over the alphabet defined by the set of labels. In such a setting, the desired behaviour of the DES is also specified as a language. A supervisory policy enforces a desired language specification by disabling a subset of events known as the controllable events. The complementary set of events, which cannot be disabled by the supervisor, are called uncontrollable events. It is important to note that a supervisory policy cannot force an event to occur; it can only disable a controllable event. As such, a supervisory policy aims for minimum intervention in the operation of the system, intervening only when a particular event can lead to the violation of a specification. In this paper, we interpret the uncontrollable events as (uncontrollable) actions, which can then be used to specify an overall action-constraint for the system that takes the possibility of uncontrollable events into account. We limit our analysis to those action-constraints that can be represented by a finite automaton.

The environment in the Reinforcement Learning (RL) paradigm is typically modeled as a Markov Decision Process (MDP). If all the events of a Discrete Event System (DES) are controllable, then there are parallels between supervisory control of a DES and the theory of synthesizing policies for an MDP. Fundamentally, both involve solving a sequential decision-making problem in which the state evolution of the model is Markovian. A key difference between the two, however, is that the action space of an MDP does not, in general, consider actions to be uncontrollable. This is because even if such aspects exist in the system, they can be absorbed into the MDP model by introducing a probability distribution for the next state that accounts for the possible occurrence of an uncontrollable action. Such an approach works because the objective in the MDP setting is to find an optimal policy that minimizes an expected cost. However, if uncontrollable actions appear in the safety specification, then they can no longer be abstracted by a probability distribution and must appear explicitly in the MDP model. In such scenarios, the 'safety' part of the safe RL paradigm naturally invites results from supervisory control theory. Our main contributions in this paper are the following:

  1. To the best of our knowledge, this paper is the first work that bridges the areas of supervisory control theory (SCT) and reinforcement learning. Such an association permits us to directly use an important theoretical result from SCT: to determine, offline and before training, whether there exists a policy that satisfies the action-constraints in the presence of uncontrollable actions. If no such policy exists, then another important result in SCT, informally, characterizes the closest approximation of the given constraints that a supervisory policy can enforce. These results are relevant in safe reinforcement learning for complex systems in which correctly formulating the constraints is a task in itself.

  2. We present a version of the Q-learning algorithm to learn optimal control policies for an underlying finite-state MDP in which:

     (a) a subset of actions might be uncontrollable, and

     (b) the task specification or the safety constraints for the system are given in terms of sequences of actions modeled by a finite-state automaton. The constraints can be given in terms of sequences that are safe and/or unsafe. This generality simplifies the problem formulation to some extent.

     The adapted Q-learning algorithm also integrates the work on reward machines [7], which supports reward function specification as a function of state.

  3. Experimental results are presented for an illustrative example that captures the utility of automata-based methods for non-Markovian state and action specifications for reinforcement learning.

The rest of the paper is organized as follows. Section II presents notations and definitions that we use in the rest of the paper, along with the relevant results from supervisory control theory. Section III discusses supervisor synthesis. We work largely with finite automata in this section and use the development for supervisor synthesis to propose a version of the Q-learning algorithm [8] that can be used to learn an optimal policy in the presence of state and action constraints. In Section IV, we present an example to illustrate the use of automata-based methods to handle non-Markovian state and action constraints. We use the proposed method for the action-constraints and reward machines [7] for the state constraints. We present the results of the experiments and observe that the Q-learning algorithm adapted to this setting enables the agent to learn to complete the task optimally 100% of the time. We conclude the paper with Section V.

II Motivation, Notations and Definitions

In a reinforcement learning problem, the environment is modeled as a Markov Decision Process $\mathcal{M}=(S,A,r,p,\gamma)$ where $S$ and $A$ denote the finite sets of states and actions, respectively, $r:S\times A\times S\rightarrow\mathbb{R}$ denotes the reward function, $p(s_{t+1}|s_{t},a_{t})$ is the transition probability distribution and $\gamma\in(0,1)$ is the discount factor. In the paradigm of safe reinforcement learning, the objective is to learn optimal policies while satisfying a set of constraints. In this paper, we consider safety constraints that are given as a set of sequences of actions of the MDP. Next, we discuss the example gridworld in Fig. 1 to motivate the development.

Figure 1: A pick-up and delivery grid world

Consider a manufacturing facility layout as shown in Fig. 1. When a work piece appears at location ⓘ, the robot $R_{1}$ picks it up and delivers it to the buffer zone $B$. The robot $R_{2}$ picks the item from $B$ and delivers it to the processing facility o. The buffer $B$ can only store one item at a time; so it has two states, Empty ($E$) and Full ($F$). The robot state is determined by the position that it occupies in the grid and by a binary variable indicating whether it has picked up an item or not. There are six controllable actions for each robot: four correspond to movement in the grid and two correspond to picking and dropping an item ($p_{i},d_{i}$), $i\in\{1,2\}$. Additionally, we assume that robot $R_{2}$ can either be Working ($W$) or Down ($D$). There are two uncontrollable actions $\{\phi,\rho\}$ denoting the fault and rectification actions, which take $R_{2}$ from $W$ to $D$ and vice versa, respectively. We call them uncontrollable because they are caused by external factors/agents not modeled in this set up. We can also assume that an uncontrollable action makes an item appear at ⓘ. The objective is to design an optimal policy to satisfy the specifications below:

  1. $R_{1}$ delivers an item to $B$ only if $B$ is in $E$.

  2. $R_{2}$ picks an item from $B$ only if $B$ is in $F$ (thereby driving $B$ to $E$).

  3. $R_{1}$ and $R_{2}$ deliver a picked item in no more than 5 steps.

The above specifications 'look' okay but are not robust to uncontrollable actions. To see this, consider a string of actions $\sigma$. Formally, the specification in item 3 says that for a substring $p_{1}\sigma d_{1}$, it must be that $|\sigma|\leq 5$, where $|\sigma|$ denotes the length of $\sigma$, and $p_{1}$ and $d_{1}$ denote the picking and dropping of an item by $R_{1}$, respectively. Consider the scenario in which $R_{1}$ takes the action $p_{1}$ when the buffer is full, with the assumption that it will be emptied by $R_{2}$ in the next step. Such an action is consistent with item 1 of the specification. At this point, if the uncontrollable action $\phi$ occurs, then the satisfaction of item 3 of the specification will depend entirely on the occurrence of the uncontrollable rectification action $\rho$. As such, the specifications above are not robust to uncontrollable actions: the occurrence of an uncontrollable action can result in the generation of a string of actions that is not in the specification. In what follows, we discuss developments from supervisory control theory which facilitate a formal analysis of such aspects.

An automaton is a 5-tuple $G=(Q,\Sigma,\delta,q_{0},Q_{m})$ [6]. Here $Q$ and $\Sigma$ denote the set of states and (labelled) actions, respectively. The initial state is $q_{0}\in Q$, and $Q_{m}\subseteq Q$ is called the set of marked states. States can be marked for different reasons, and we discuss this further later in the paper. The function $\delta:\Sigma\times Q\rightarrow Q$ is the state transition function, where for $q\in Q,\sigma\in\Sigma$, $\delta(\sigma,q)=\widehat{q}$ implies there is an action labeled $\sigma$ from state $q$ to state $\widehat{q}$. In general, the state transition function $\delta$ may not be defined for all state-action pairs. The active action function $\Gamma:Q\rightarrow 2^{\Sigma}$ identifies the set of all actions $\sigma\in\Sigma$ for which $\delta(\sigma,q)$ is defined.

Let $\Sigma^{*}$ denote the set of all finite strings of elements of $\Sigma$, including the empty string $\epsilon$. We write $\delta^{*}(\omega,q)=\widehat{q}$ if there is a string of actions $\omega\in\Sigma^{*}$ from $q$ to state $\widehat{q}$, that is, if $\widehat{q}$ is reachable from $q$. Given $G$, the set of all admissible strings of actions is denoted by $L(G)$ and is referred to as the language generated by $G$ over the alphabet $\Sigma$. Formally, for the initial state $q_{0}$:

$$L(G)=\{\omega\in\Sigma^{*}:\delta^{*}(\omega,q_{0})\text{ is defined}\}. \qquad (1)$$

The marked language, $L_{m}(G)\subseteq L(G)$, generated by $G$ is

$$L_{m}(G)=\{\omega\in\Sigma^{*}:\delta^{*}(\omega,q_{0})\in Q_{m}\}.$$

A string $u$ is a prefix of a string $v\in\Sigma^{*}$ if $v=uw$ for some $w\in\Sigma^{*}$. If $v$ is an admissible string in $G$, then so are all its prefixes. We define the prefix closure of $L\subseteq\Sigma^{*}$ to be the language $\overline{L}$ defined as:

$$\overline{L}=\{u:uv\in L\text{ for some }v\in\Sigma^{*}\}.$$
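To make the notation concrete, the following Python sketch (illustrative, not from the paper; all names are our own) encodes an automaton with a partial transition function and tests membership of a string in $L(G)$ and $L_{m}(G)$.

class Automaton:
    """A finite automaton G = (Q, Sigma, delta, q0, Qm) with a partial transition map."""
    def __init__(self, states, alphabet, delta, q0, marked):
        self.states = states        # Q
        self.alphabet = alphabet    # Sigma
        self.delta = delta          # dict: (sigma, q) -> q_hat, partial
        self.q0 = q0                # initial state
        self.marked = marked        # Qm

    def active(self, q):
        """Gamma(q): the actions defined at state q."""
        return {sigma for (sigma, p) in self.delta if p == q}

    def run(self, word):
        """Return delta*(word, q0), or None if the word is not admissible."""
        q = self.q0
        for sigma in word:
            if (sigma, q) not in self.delta:
                return None
            q = self.delta[(sigma, q)]
        return q

    def in_language(self, word):          # word in L(G)?
        return self.run(word) is not None

    def in_marked_language(self, word):   # word in Lm(G)?
        return self.run(word) in self.marked

# Example: 'a' is allowed from state 0, 'b' from state 1, and state 1 is marked.
G = Automaton({0, 1}, {'a', 'b'}, {('a', 0): 1, ('b', 1): 0}, 0, {1})
assert G.in_language(['a', 'b', 'a']) and G.in_marked_language(['a'])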

We can use the above notation to interpret the MDP $\mathcal{M}$ as a language generator $(S,A,\Sigma,\mathcal{L},r,p,\gamma)$, where $\Sigma$ denotes a set of labels and $\mathcal{L}:A\rightarrow\Sigma$ denotes a labelling function that associates a label with each action. For simplicity, we assume that each action is uniquely labeled and, with some abuse of notation, use $\Sigma$ as a proxy for the set of actions. Then the function $\delta:\Sigma\times S\rightarrow S$ is the state transition function, where for $q\in S,\sigma\in\Sigma$, $\delta(\sigma,q)=\widehat{q}$ means that there is an action $a\in A$ labeled $\sigma$ such that $p(\widehat{q}|q,a)>0$. The language generated can be defined appropriately.

A policy $\pi(a|s)$ in the context of RL is a probability distribution over the set of actions $a\in A$ at a given state $s\in S$. At each time step $t$, the agent is in a particular state, say $s_{t}$. It selects an action $a_{t}$ according to $\pi(\cdot|s_{t})$ and moves to a next state $s_{t+1}$ with probability $p(s_{t+1}|s_{t},a_{t})$. It also receives a reward $r(s_{t+1},a_{t},s_{t})$ for this transition. The process then repeats from $s_{t+1}$. The agent's goal is to find a policy $\pi^{*}$ that maximizes the expected discounted future reward from every state in $S$. Next we introduce the notion of supervisory control, which trims the set of candidate policies to only those that do not violate the safety specification.

The set of actions $\Sigma$ is partitioned into the set of controllable actions $\Sigma_{c}$ and the set of uncontrollable actions $\Sigma_{u}$. More specifically, $\Sigma_{c}$ (resp. $\Sigma_{u}$) denotes the set of actions that can (resp. cannot) be disabled by the supervisor. Formally, a supervisor $\mathcal{S}:L(\mathcal{M})\rightarrow 2^{\Sigma}$ is a function from the language generated by $\mathcal{M}$ to the power set of $\Sigma$. We use $\mathcal{S}/\mathcal{M}$ to denote the supervised MDP (SMDP) in which supervisor $\mathcal{S}$ is controlling the MDP $\mathcal{M}$. Note that the supervisor only disables actions; it does not choose which action should be taken at any time instant. That (optimal) choice is determined by $\pi(a|s)$. The uncontrollable actions are always enabled by the supervisor.

We are interested in evaluating an optimal policy $\pi^{*}$ that satisfies constraints specified in terms of a set of sequences of actions. The first question of interest is whether such a policy exists. We can directly use the controllability condition from the supervisory control theory literature to answer this question [5, 6]:

Theorem 1

Let $K$ be a desired language specification over the alphabet $\Sigma$ denoting the safety constraints for the MDP $\mathcal{M}$. There is a supervisor $\mathcal{S}$ such that $L(\mathcal{S}/\mathcal{M})=\overline{K}$ if and only if $\overline{K}\Sigma_{u}\cap L(\mathcal{M})\subseteq\overline{K}$.

Proof:

See Section 3.4.1 of [6]. ∎

The above condition states that there is a supervisor that enforces the desired specification $K$ if and only if the occurrence of an uncontrollable action of $\mathcal{M}$ from a prefix of $K$ results in a string that is also a prefix of $K$. For the gridworld example in Fig. 1, the uncontrollable action $\phi$ produced a string from which a string violating item 3 of the specification can be generated; it therefore violated the controllability condition.
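The controllability condition of Theorem 1 can be checked offline. The sketch below (our illustration, assuming the prefix-closed specification $\overline{K}$ and the plant language $L(\mathcal{M})$ are each represented by an Automaton object as in the earlier sketch, and that sigma_u is the set of uncontrollable actions) explores the reachable part of their synchronous product; the specification is controllable if and only if no uncontrollable action admissible in the plant is blocked by the specification at a reachable product state.

from collections import deque

def is_controllable(spec, plant, sigma_u):
    """Check that K-bar Sigma_u intersected with L(M) stays inside K-bar."""
    seen = {(spec.q0, plant.q0)}
    frontier = deque(seen)
    while frontier:
        q_spec, q_plant = frontier.popleft()
        for sigma in plant.active(q_plant):
            if sigma in sigma_u and sigma not in spec.active(q_spec):
                return False                    # an uncontrollable action is blocked
            if sigma in spec.active(q_spec):    # follow moves allowed by both
                nxt = (spec.delta[(sigma, q_spec)], plant.delta[(sigma, q_plant)])
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append(nxt)
    return True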

If the condition in Theorem 1 is not satisfied, then there does not exist a supervisor that can enforce the given specification. In such a case, the best that the supervisor can do is enforce the supremal controllable sublanguage of $K$, which is the union of all controllable sublanguages of $K$ [9]:

$$K^{\uparrow}=\bigcup_{i}\{K_{i}:(K_{i}\subseteq K)\wedge(\overline{K}_{i}\Sigma_{u}\cap L(G)\subseteq\overline{K}_{i})\}.$$

Without going into the formal analysis, the supremal controllable sublanguage (the modified specification) for the gridworld example would involve $R_{1}$ not picking an item unless $B$ is empty.

In this paper, we will assume, in different ways, that the given specification can be modeled as a finite automaton and that it is controllable. An automaton is represented by a directed graph (state transition diagram) whose nodes correspond to the states of the automaton, with an edge from $q$ to $\widehat{q}$ labeled $\sigma$ for each triple $(\sigma,q,\widehat{q})$ such that $\widehat{q}=\delta(\sigma,q)$.

We use the concept of reward machines to handle the state constraints [7]. Given a set of propositional symbols $\mathcal{P}$, a set of (environment) states $S$, and a set of actions $A$, a Reward Machine (RM) is a tuple $\mathcal{R}_{\mathcal{P}SA}=\langle U,u_{0},\delta_{u},\delta_{r}\rangle$ where $U$ is a finite set of states, $u_{0}\in U$ is an initial state, $\delta_{u}:U\times 2^{\mathcal{P}}\rightarrow U$ is the state-transition function and $\delta_{r}:U\times U\rightarrow[S\times A\times S\rightarrow\mathbb{R}]$ is the reward-transition function. An MDP with a Reward Machine (MDPRM) is a tuple $T=\langle S,A,p,\gamma,\mathcal{P},L,U,u_{0},\delta_{u},\delta_{r}\rangle$, where $S,A,p$, and $\gamma$ are defined as in an MDP, $\mathcal{P}$ is a set of propositional symbols, $L:S\rightarrow 2^{\mathcal{P}}$ is a labelling function, and $U,u_{0},\delta_{u}$, and $\delta_{r}$ are defined as in an RM. The operation of an MDPRM is as follows. At every decision epoch of $\mathcal{M}$, there is a transition in $\mathcal{R}_{\mathcal{P}SA}$ as well. If the current state of $\mathcal{R}_{\mathcal{P}SA}$ is $u$ and taking action $a$ from $s$ in $\mathcal{M}$ leads to $s^{\prime}$, then the next state of $\mathcal{R}_{\mathcal{P}SA}$ is $u^{\prime}=\delta_{u}(u,L(s^{\prime}))$ and it outputs a reward function $\delta_{r}(u,u^{\prime})$ which determines the reward in $\mathcal{M}$ for this transition. The main idea in handling state constraints using RMs is to associate barrier rewards with state sequences that violate the desired specification. Reward machines can be used to model any Markovian reward function. They can also be used to express any non-Markovian reward function as long as the state histories in $S^{*}$ on which the reward depends can be distinguished by different elements of a finite set of regular expressions over $\mathcal{P}$.
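The following minimal sketch (our illustration; names are not from [7]) encodes a reward machine with $\delta_{u}$ and $\delta_{r}$ as dictionaries and advances it by one decision epoch.

class RewardMachine:
    """An RM <U, u0, delta_u, delta_r> with dictionary-encoded transition functions."""
    def __init__(self, u0, delta_u, delta_r):
        self.u0 = u0
        self.delta_u = delta_u   # dict: (u, frozenset(label)) -> u'
        self.delta_r = delta_r   # dict: (u, u') -> callable (s, a, s') -> reward

    def step(self, u, label, s, a, s_next):
        """One decision epoch: return the next RM state and the reward it emits."""
        u_next = self.delta_u[(u, frozenset(label))]
        reward = self.delta_r[(u, u_next)](s, a, s_next)
        return u_next, reward

# Example: reward 1 on the step where proposition 'goal' first holds, 0 otherwise.
rm = RewardMachine(
    u0='u0',
    delta_u={('u0', frozenset()): 'u0', ('u0', frozenset({'goal'})): 'u1',
             ('u1', frozenset()): 'u1', ('u1', frozenset({'goal'})): 'u1'},
    delta_r={('u0', 'u0'): lambda s, a, sn: 0.0,
             ('u0', 'u1'): lambda s, a, sn: 1.0,
             ('u1', 'u1'): lambda s, a, sn: 0.0},
)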

III Reinforcement Learning with Action-Constraints

Consider an MDP $\mathcal{M}=(S,A,\Sigma,L,r,p,\gamma)$. The action-constraints can be given in terms of sequences of actions that are safe to perform or in terms of sequences that are unsafe. We propose methods for supervisor synthesis for both cases in this section.

Consider an automaton $G_{s}^{\prime}=(Q_{s}^{\prime},\Sigma,\delta_{s}^{\prime},q_{0s},\emptyset)$, with an empty set of marked states, that models the sequences of actions that are safe. We construct an automaton $G_{s}=(Q_{s},\Sigma,\delta_{s},q_{0s},\{s_{a}\})$ from $G_{s}^{\prime}$ by adding a (marked) state $s_{a}$ that flags the occurrence of an action violating the specification. The formal construction is as follows:

  1. $G_{s}$ has one additional state: $Q_{s}=Q_{s}^{\prime}\cup\{s_{a}\}$, $Q_{m}=\{s_{a}\}$.

  2. The transition function of $G_{s}$ is identical to that of $G_{s}^{\prime}$ for the state-action pairs for which $\delta_{s}^{\prime}$ is defined: $\forall s\in Q_{s}^{\prime},\forall\sigma\in\Gamma(s)$: $\delta_{s}(\sigma,s)=\delta_{s}^{\prime}(\sigma,s)$.

  3. For the state-action pairs for which $\delta_{s}^{\prime}$ is not defined, we have: $\forall s\in Q_{s}^{\prime},\forall\sigma\in(\Sigma-\Gamma(s))$: $\delta_{s}(\sigma,s)=s_{a}$.

  4. If $G_{s}$ reaches $s_{a}$, it stays there: $\forall\sigma\in\Sigma$: $\delta_{s}(\sigma,s_{a})=s_{a}$.

At every decision epoch of $\mathcal{M}$, there is a transition in $G_{s}$ as well. If the current state of $G_{s}$ is $s$ and an action $a$ is taken in $\mathcal{M}$, then $\delta_{s}(a,s)$ is the next state of $G_{s}$. If an action $\omega$ is taken in $\mathcal{M}$ for which $\delta_{s}^{\prime}(\omega,s)$ is not defined, then the action violates the safety constraint and, as per item 3 in the above construction, the state of $G_{s}$ transitions to $s_{a}$. The state $s_{a}$ of $G_{s}$ thus indicates that an action constraint has been violated. A supervisor for this case simply disables actions that lead to a transition to the state $s_{a}$. If there is an uncontrollable action that can result in a transition to $s_{a}$, then the specification does not satisfy the controllability condition of Theorem 1 and there does not exist a policy that can enforce the given specification. Automaton $G_{s}$ in Fig. 4 is an example in which $d_{2}d_{3}p_{1}p_{2}p_{3}$ is the sequence of actions that are to be performed when an uncontrollable action $d_{u}$ happens.
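The construction of $G_{s}$ from $G_{s}^{\prime}$ amounts to completing the transition function with an absorbing flag state. A possible sketch, reusing the Automaton class from Section II (our own illustrative code):

def complete_with_sink(g_prime, sink='s_a'):
    """Build G_s from G_s': route every undefined (action, state) pair to the marked sink s_a."""
    delta = dict(g_prime.delta)                    # keep the transitions of G_s'
    for q in g_prime.states:
        for sigma in g_prime.alphabet - g_prime.active(q):
            delta[(sigma, q)] = sink               # item 3: undefined moves flag a violation
    for sigma in g_prime.alphabet:
        delta[(sigma, sink)] = sink                # item 4: s_a is absorbing
    return Automaton(g_prime.states | {sink},
                     g_prime.alphabet, delta, g_prime.q0, {sink})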

Next, we consider the case in which the constraints are given in terms of the sequences of actions that are unsafe. (One way of handling this situation would be to evaluate the sequences of actions that are safe and then proceed with the earlier method; we assume that the set of safe sequences can be represented by an automaton.) Recall that $L_{m}(\mathcal{G})\subseteq L(\mathcal{G})$ denotes the marked language generated by an automaton $\mathcal{G}$. We assume that the set of strings of actions that are unsafe, $S_{un}$, is given as an automaton $H_{1}=(Q_{h}^{\prime},\Sigma_{h}^{\prime},\delta_{h}^{\prime},q_{0h}^{\prime},Q_{mh}^{\prime})$ for which $L_{m}(H_{1})=S_{un}$. We then modify $H_{1}$ to obtain an automaton $H=(Q_{h},\Sigma_{h},\delta_{h},q_{0h},Q_{mh})$ as follows:

  1. $Q_{h}=Q_{h}^{\prime}$, $\Sigma_{h}=\Sigma$, $q_{0h}=q_{0h}^{\prime}$, $Q_{mh}=Q_{mh}^{\prime}$.

  2. $\forall q\in Q_{h}^{\prime}$, $\forall\sigma\in\Gamma_{h}^{\prime}(q)$: $\delta_{h}(\sigma,q)=\delta_{h}^{\prime}(\sigma,q)$.

  3. $\forall q\in Q_{h}^{\prime}$, $\forall\sigma\in\Sigma-\Gamma_{h}^{\prime}(q)$: $\delta_{h}(\sigma,q)=q_{0h}$.

That is, $H$ can be constructed from $H_{1}$ by adding outgoing arcs from every state in $H_{1}$ to the initial state, corresponding to the actions in $\Sigma$ that do not already appear as an outgoing arc at that state in $H_{1}$. This modification ensures that for every action in $\mathcal{M}$, a corresponding transition is defined in $H$. The operation of $H$ is the same as that of $G_{s}$. At every decision epoch of $\mathcal{M}$, there is a transition in $H$. If the current state of $H$ is $s$ and an action $a$ is taken in $\mathcal{M}$, then $\delta_{h}(a,s)$ is the next state of $H$, where $\delta_{h}$ is the transition function of $H$. For simplicity, we label every $q\in Q_{mh}$ by $s_{a}$. The supervisor disables every controllable action which results in a next state that is marked ($s_{a}$). Suppose $\mathcal{M}$ has generated a string $\sigma$ that has not resulted in a transition to $s_{a}$ in $H$; then $\sigma$ is a prefix of an admissible string. Following $\sigma$, if there is an uncontrollable action that results in a transition to $s_{a}$, then the specification does not satisfy the controllability condition. The automaton $H$ in Fig. 3 models the specification in which strings containing two consecutive lefts or rights ($\{ll,rr,lrr,rlrll,\ldots\}$) are deemed unsafe.
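The construction of $H$ from $H_{1}$ can be sketched in the same style (again our illustrative code, assuming $H_{1}$ is an Automaton as in the earlier sketch); undefined moves return to the initial state rather than to a sink, so $H$ keeps scanning for unsafe suffixes.

def complete_with_reset(h1, full_alphabet):
    """Build H from H_1: route every undefined (action, state) pair back to the initial state."""
    delta = dict(h1.delta)
    for q in h1.states:
        for sigma in full_alphabet - h1.active(q):
            delta[(sigma, q)] = h1.q0              # item 3: reset on actions not in H_1
    return Automaton(h1.states, full_alphabet, delta, h1.q0, h1.marked)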

Remark: We have modeled the sequences of actions that are unsafe by "marking" some states as unsafe. Marked states have a completely different interpretation in supervisory control theory (SCT), where they are often associated with desirable features, such as signifying the end of an execution with the automaton "recognizing" the string to be in its language. In fact, in SCT, if $\overline{L}_{m}(G)=L(G)$, then we say that $G$ is non-blocking, which means that the system can always reach a target (marked) state from any state in its reachable set. Deadlock freedom is an instance of the non-blocking property. Some simple modifications of the method discussed above are needed if both interpretations have to be handled simultaneously.

Algorithm 1 shows the pseudocode for Supervised Q-learning for non-Markovian action and state constraints. It receives as input the automata $G_{s}^{\prime}$ and $H_{1}$ describing the action constraints, the set of propositional symbols $\mathcal{P}$, the labelling function $L$, the discount factor $\gamma$, and the reward machine $\mathcal{R}$ over $\mathcal{P}$. The goal is to learn an optimal policy given the state and action constraints.

Algorithm 1 Supervised Q-learning for non-Markovian action and state constraints
1:  Input: $G_{s}^{\prime}$, $H_{1}$, $\mathcal{P}$, $L$, $\gamma$, $\mathcal{R}=\langle U,\delta_{u},\delta_{r},u_{0}\rangle$
2:  Construct $G_{s}$ and $H$ from $G_{s}^{\prime}$ and $H_{1}$ respectively
3:  $\widetilde{Q}\leftarrow$ InitializeQValueFunction()
4:  for $l=0$ to num_episodes do
5:     $u_{j}\leftarrow u_{0}$; $q_{s}\leftarrow q_{0s}$; $q_{h}\leftarrow q_{0h}$; $s\leftarrow$ EnvInitialState()
6:     for $t=0$ to length_episode do
7:        if EnvDeadEnd($s$) then
8:           break
9:        end if
10:       $a\leftarrow$ GetActionEpsilonGreedy($\widetilde{q}_{q_{s},q_{h},u_{j}}$, $s$) such that $(\delta_{s}(a,q_{s})\neq s_{a})\wedge(\delta_{h}(a,q_{h})\neq s_{a})$
11:       $s^{\prime}\leftarrow$ EnvExecuteAction($s,a$)
12:       $u_{k}\leftarrow\delta_{u}(u_{j},L(s^{\prime}))$
13:       $r\leftarrow\delta_{r}(u_{j},u_{k})$
14:       $q_{m}\leftarrow\delta_{s}(a,q_{s})$; $q_{n}\leftarrow\delta_{h}(a,q_{h})$
15:       if EnvDeadEnd($s^{\prime}$) then
16:          $\widetilde{q}_{q_{s},q_{h},u_{j}}(s,a)\leftarrow\widetilde{q}_{q_{s},q_{h},u_{j}}(s,a)+\alpha\, r(s,a,s^{\prime})$
17:       else
18:          $\widetilde{q}_{q_{s},q_{h},u_{j}}(s,a)\leftarrow\widetilde{q}_{q_{s},q_{h},u_{j}}(s,a)+\alpha\,[r(s,a,s^{\prime})+\gamma\max_{a^{\prime}}\widetilde{q}_{q_{m},q_{n},u_{k}}(s^{\prime},a^{\prime})-\widetilde{q}_{q_{s},q_{h},u_{j}}(s,a)]$
19:       end if
20:       $q_{s}\leftarrow\delta_{s}(a,q_{s})$; $q_{h}\leftarrow\delta_{h}(a,q_{h})$
21:       $u_{j}\leftarrow\delta_{u}(u_{j},L(s^{\prime}))$; $s\leftarrow s^{\prime}$
22:     end for
23:  end for

The algorithm learns one q-value function per state of the automata and the reward machine. That is, it learns $|Q_{s}|\times|Q_{h}|\times|U|$ q-value functions in total, where $|\cdot|$ denotes the size of the set argument. These q-functions are stored in $\widetilde{Q}$, and $\widetilde{q}_{q_{s},q_{h},u_{j}}\in\widetilde{Q}$ corresponds to the q-value function for states $q_{s}\in Q_{s}$ and $q_{h}\in Q_{h}$ of the two automata and $u_{j}\in U$ of the reward machine.

After initializing $\widetilde{Q}$ in step 3, the algorithm has two nested loops: one over the number of episodes (step 4), and the other over the length of each episode (step 6). Step 10 selects an action $a$ under supervision: only those actions that do not lead to the state $s_{a}$ in either of the two automata are considered for selection by the epsilon-greedy method. Thereafter, the agent executes the action (step 11), evaluates the reward function through the transition in the reward machine (steps 12 and 13) and then updates the q-value function (steps 15-19) using the standard q-learning rule. Note that the maximization step is over $\widetilde{q}_{q_{m},q_{n},u_{k}}$ since those q-values would be used for the selection of $a^{\prime}$ after $G_{s}$, $H$ and $\mathcal{R}$ have transitioned. Action $a^{\prime}$ is to be picked so that $s_{a}$ is not reached in $G_{s}$ and $H$.
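As an illustration of step 10, the supervised epsilon-greedy selection can be sketched as follows (our own code, assuming $G_{s}$ and $H$ are Automaton objects whose marked states play the role of $s_{a}$, and that the q-values are stored as nested dictionaries keyed by the automaton and RM states):

import random

def supervised_epsilon_greedy(q_table, q_s, q_h, u_j, s, actions, g_s, h, epsilon):
    """Pick an action at s, restricted to those the supervisor does not disable."""
    allowed = [a for a in actions
               if g_s.delta[(a, q_s)] not in g_s.marked
               and h.delta[(a, q_h)] not in h.marked]
    if not allowed:   # every controllable action is disabled at this configuration
        raise RuntimeError("supervisor disabled every candidate action")
    if random.random() < epsilon:
        return random.choice(allowed)
    q_values = q_table[(q_s, q_h, u_j)][s]     # dict: action -> estimated value
    return max(allowed, key=lambda a: q_values.get(a, 0.0))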

IV Illustrative Example

Figure 2: A pick-up and delivery grid world

Setting: Consider the pick-up and delivery grid world in Fig. 2. The agent starts from the position S, picks up the items labeled 1, 2 and 3, and returns to S. The agent unloads the items in a last-in-first-out order and the unloading sequence must be 3-2-1. Therefore, the nominal pick-up of the items is in sequence from 1 to 3. The objective of the agent is to finish this process in the minimum number of steps.

State and action space: The state of the agent is composed of its position in the grid, its orientation, and an indicator of the items that the agent has picked up. The position of the agent in the grid is denoted by the coordinate of the pixel it occupies, with the bottom-leftmost corner of the grid denoting the origin. We use the set of symbols $\{\uparrow,\downarrow,\leftarrow,\rightarrow\}$ to describe its orientation and a 3-bit binary variable to denote the items that the agent has picked up. The agent's movement in the grid is governed by four actions $l,r,f$ and $b$ denoting movement in four directions: left, right, forward and backward, respectively. The actions $r$ and $l$ change the orientation of the agent by 90 degrees clockwise and anticlockwise, respectively. In addition, they also lead to a change in position by one pixel forward in the direction of the new orientation. For example, if $l$ and $r$ are feasible actions from a position $(x,y,\uparrow)$, then:

$$(x,y,\uparrow)\xrightarrow{l}(x-1,y,\leftarrow),$$
$$(x,y,\uparrow)\xrightarrow{r}(x+1,y,\rightarrow).$$

The actions $f$ and $b$ do not change the orientation and only result in a change of position by one pixel forward or backward, respectively, relative to the current orientation of the agent. If $f$ and $b$ are feasible actions from a position $(x,y,\uparrow)$, then we have:

$$(x,y,\uparrow)\xrightarrow{f}(x,y+1,\uparrow),$$
$$(x,y,\uparrow)\xrightarrow{b}(x,y-1,\uparrow).$$
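A possible encoding of these movement dynamics (our illustration; pick/drop actions, grid boundaries and obstacles are omitted):

HEADINGS = ['N', 'E', 'S', 'W']                     # up, right, down, left
STEP = {'N': (0, 1), 'E': (1, 0), 'S': (0, -1), 'W': (-1, 0)}

def move(x, y, heading, action):
    """Apply one of l, r, f, b to a pose (x, y, heading) as in the transitions above."""
    i = HEADINGS.index(heading)
    if action == 'l':
        heading = HEADINGS[(i - 1) % 4]             # turn 90 degrees anticlockwise
    elif action == 'r':
        heading = HEADINGS[(i + 1) % 4]             # turn 90 degrees clockwise
    elif action == 'b':
        dx, dy = STEP[heading]
        return x - dx, y - dy, heading              # one pixel backward, same heading
    dx, dy = STEP[heading]                          # l, r and f all advance one pixel
    return x + dx, y + dy, heading                  # along the (possibly new) heading

# e.g. move(2, 2, 'N', 'l') == (1, 2, 'W'), matching (x, y, up) --l--> (x-1, y, left)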

We use $d_{1}$, $d_{2}$ and $d_{3}$ (respectively $p_{1}$, $p_{2}$ and $p_{3}$) for the actions denoting dropping (respectively picking) of the corresponding items by the agent. We assume that the agent cannot reliably carry all three items together, because of which it can accidentally drop an item when it is carrying all three of them. For simplicity, we assume that it can accidentally drop only item 1. We model this by an uncontrollable action $d_{u}$. The illustration can easily be extended to account for the dropping of either of the other items.

Action Constraints

  1. The agent cannot make a U-turn. We interpret this as two consecutive rights or lefts. That is, the set $\{ll,rr\}$ denotes the illegal substrings of the string of actions. The automaton $H$, with initial state 0, in Fig. 3 describes the illegal sequences of actions corresponding to the set of strings $\{ll,rr\}$. For ease of exposition, the actions $d_{i}$ and $p_{i}$ are not shown in $H$; they can be added with a self loop around every state.

  2. In order for the agent to satisfy the constraint on the unloading sequence, if the agent accidentally drops the first item then it must also immediately drop the other two and pick all three in sequence again. That is, $d_{u}$ must be followed by the string $d_{2}d_{3}p_{1}p_{2}p_{3}$. Fig. 4 shows the automaton $G_{s}$ that constrains the agent to take the action sequence $d_{2}d_{3}p_{1}p_{2}p_{3}$ when the uncontrollable action $d_{u}$ happens. (For the case when the agent can drop either of the other two items, we can introduce two more uncontrollable actions $d_{u2}$ and $d_{u3}$ with remedial strings $d_{1}d_{3}p_{1}p_{2}p_{3}$ and $d_{1}d_{2}p_{1}p_{2}p_{3}$, respectively.)

Figure 3: An automaton $H$ with initial state 0 that accepts strings equivalent to a U-turn by the agent; for example, the strings $\{ll,rr,lrfbrr,\ldots\}$.

Fig. 5 shows the reward machine for imposing the specification of sequential pickup of items, in which $\mathcal{P}=\{0,1,\framebox{S}\}$. The labeling function generates a 3-bit binary number corresponding to the items that the agent has picked up. It uses the symbol S to indicate whether the agent has reached location S in the grid. The RM starts at state $u_{0}$. It transitions to state $u_{1}$ if the items are not picked up in the desired sequence and thereafter obtains a reward of $-20$ at every step. The reward for every step increases from $-10$ to $-8$ (from $-8$ to $-6$) after the agent picks up item 1 (respectively, item 2). It further increases to $-1$ after the agent picks up all three items. Note that the RM only handles the sequential pickup initially; the uncontrollable drop of item 1 after the agent picks up all three items is handled by the automaton $G_{s}$. Once all three items are picked up in sequence, the RM stays at $u_{4}$ until the agent reaches S.
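As a hedged, partial encoding of this reward machine using the RewardMachine sketch from Section II (we represent the label simply as the 3-bit pickup string plus the symbol 'S', and show only the nominal pickup path $u_{0}\rightarrow u_{2}\rightarrow u_{3}\rightarrow u_{4}\rightarrow u_{5}$ with the per-step rewards described above; the violation state $u_{1}$ and the self-loops are omitted):

const = lambda c: (lambda s, a, sn: c)           # constant per-step reward
delta_u = {
    ('u0', frozenset({'000'})): 'u0',            # nothing picked up yet
    ('u0', frozenset({'100'})): 'u2',            # item 1 picked up first
    ('u2', frozenset({'110'})): 'u3',            # then item 2
    ('u3', frozenset({'111'})): 'u4',            # then item 3
    ('u4', frozenset({'111', 'S'})): 'u5',       # carrying all three, back at S
}
delta_r = {
    ('u0', 'u0'): const(-10), ('u0', 'u2'): const(-8),
    ('u2', 'u3'): const(-6),  ('u3', 'u4'): const(-1),
    ('u4', 'u5'): const(0),
}
rm_example = RewardMachine('u0', delta_u, delta_r)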

Simulation Results: The simulation was run for 500,000 iterations, where each iteration consisted of one episode with a maximum of 60 steps. The hyperparameters were set as $\alpha=0.1$, $\epsilon=0.25$ and $\gamma=0.9$. Additionally, $\epsilon$ was decayed by 5% after every period of 10,000 iterations. The single-step reward is defined as in Fig. 5, in accordance with the items picked and their order. Fig. 6 shows the average score, i.e., the average accumulated reward over an episode, as training progressed. Fig. 7 shows the average percentage of times the task was completed. All data shown are averages over intervals of 10,000 iterations during training. At the end of training, $\epsilon$ reaches a value of 0.019. This $\epsilon$-greedy policy completes the task around 95% of the time, as shown in Fig. 7. Upon inference, i.e., upon setting $\epsilon=0$ after training, the agent completes the task optimally 100% of the time.

Figure 4: An automaton $G_{s}$ with initial state $w_{0}$ describing the sequence of safe actions when an uncontrollable action $d_{u}$ happens.
Figure 5: Reward machine for the pick-up and delivery grid world example.
Figure 6: Average score (y-axis) versus training epoch in thousands (x-axis), averaged across intervals of 10,000 iterations during training.
Figure 7: Percentage of times the task was completed (y-axis) versus training epoch in thousands (x-axis), averaged across intervals of 10,000 iterations during training.

V Conclusion

In this paper, we borrowed concepts from the supervisory control theory of discrete event systems to develop a method to learn optimal control policies for a finite-state Markov Decision Process in which (only) certain sequences of actions are deemed unsafe (respectively safe), while accounting for the possibility that a subset of actions might be uncontrollable. We assumed that the constraints can be modeled using a finite automaton; a natural future direction of research is to develop such methods for a more general class of constraints.

References

  • [1] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction.   MIT press, 2018.
  • [2] J. García and F. Fernández, "A comprehensive survey on safe reinforcement learning," Journal of Machine Learning Research, vol. 16, no. 1, pp. 1437–1480, 2015.
  • [3] K. C. Sou, J. Weimer, H. Sandberg, and K. H. Johansson, “Scheduling smart home appliances using mixed integer linear programming,” in 2011 50th IEEE Conference on Decision and Control and European Control Conference.   IEEE, 2011, pp. 5144–5149.
  • [4] R. Kaur, C. Schaye, K. Thompson, D. C. Yee, R. Zilz, R. Sreenivas, and R. B. Sowers, “Machine learning and price-based load scheduling for an optimal iot control in the smart and frugal home,” Energy and AI, vol. 3, p. 100042, 2021.
  • [5] P. J. Ramadge and W. M. Wonham, “The control of discrete event systems,” Proceedings of the IEEE, vol. 77, no. 1, pp. 81–98, 1989.
  • [6] C. G. Cassandras and S. Lafortune, Introduction to discrete event systems.   Springer Science & Business Media, 2009.
  • [7] R. T. Icarte, T. Klassen, R. Valenzano, and S. McIlraith, “Using reward machines for high-level task specification and decomposition in reinforcement learning,” in International Conference on Machine Learning.   PMLR, 2018, pp. 2107–2116.
  • [8] C. J. Watkins and P. Dayan, “Q-learning,” Machine learning, vol. 8, no. 3-4, pp. 279–292, 1992.
  • [9] W. M. Wonham and P. J. Ramadge, “On the supremal controllable sublanguage of a given language,” SIAM Journal on Control and Optimization, vol. 25, no. 3, pp. 637–659, 1987.