
Using Reinforcement Learning to Herd a Robotic Swarm to a Target Distribution

Zahi Kakish$^{1}$, Karthik Elamvazhuthi$^{2}$, Spring Berman$^{1}$

$^{1}$Arizona State University, Tempe AZ 85281, USA
[email protected], [email protected]
$^{2}$University of California, Los Angeles CA 90095, USA
[email protected]
Abstract

In this paper, we present a reinforcement learning approach to designing a control policy for a “leader” agent that herds a swarm of “follower” agents, via repulsive interactions, as quickly as possible to a target probability distribution over a strongly connected graph. The leader control policy is a function of the swarm distribution, which evolves over time according to a mean-field model in the form of an ordinary difference equation. The dependence of the policy on agent populations at each graph vertex, rather than on individual agent activity, simplifies the observations required by the leader and enables the control strategy to scale with the number of agents. Two Temporal-Difference learning algorithms, SARSA and Q-Learning, are used to generate the leader control policy based on the follower agent distribution and the leader’s location on the graph. A simulation environment corresponding to a grid graph with 4 vertices was used to train and validate the control policies for follower agent populations ranging from 10 to 1000. Finally, the control policies trained on 100 simulated agents were used to successfully redistribute a physical swarm of 10 small robots to a target distribution among 4 spatial regions.

keywords:
swarm robotics, graph theory, mean-field model, reinforcement learning

1 Introduction

We present two Temporal-Difference learning algorithms [1] for generating a control policy that guides a mobile agent, referred to as a leader, to herd a swarm of autonomous follower agents to a target distribution among a small set of states. This leader-follower control approach can be used to redistribute a swarm of low-cost robots with limited capabilities and information using a single robot with sophisticated sensing, localization, computation, and planning capabilities, in scenarios where the leader lacks a model of the swarm dynamics. Such a control strategy is useful for many applications in swarm robotics, including exploration, environmental monitoring, inspection tasks, disaster response, and targeted drug delivery at the micro-nanoscale.

There has been a considerable amount of work on leader-follower multi-agent control schemes in which the leader has an attractive effect on the followers [2, 3]. Several recent works have presented models for herding robotic swarms using leaders that have a repulsive effect on the swarm [4, 5, 6]. Using such models, analytical controllers for herding a swarm have been constructed for the case when there is a single leader [5, 6] and multiple leaders [4]. The controllers designed in these works are not necessarily optimal for a given performance metric. To design optimal control policies for a herding model, the authors in [7] consider a reinforcement learning (RL) approach. While existing herding models are suitable for the objective of confining a swarm to a small region in space, many applications require a swarm to cover an area according to some target probability density. If the robots do not have spatial localization capabilities, then the controllers developed in [2, 3, 4, 5, 6, 7] cannot be applied for such coverage problems. Moreover, these models are not suitable for herding large swarms using RL-based control approaches, since such approaches would not scale well with the number of robots. This loss of scalability is due to the fact that the models describe individual agents, which may not be necessary since robot identities are not important for many swarm applications.

In this paper, we consider a mean-field or macroscopic model that describes the swarm of follower agents as a probability distribution over a graph, which represents the configuration space of each agent. Previous work has utilized similar mean-field models to design a set of control policies that is implemented on each robot in a swarm in order to drive the entire swarm to a target distribution, e.g. for problems in spatial coverage and task allocation [8]. In this prior work, all the robots must be reprogrammed with a new set of control policies if the target distribution is changed. In contrast, our approach can achieve new target swarm distributions via redesign of the control policy of a single leader agent, while the control policies of the swarm agents remain fixed. The follower agents switch stochastically out of their current location on the graph whenever the leader is at their location; in this way, the leader has a “repulsive” effect on the followers. The transition rates out of each location are common to all the followers, and are therefore independent of the agents’ identities. Using the mean-field model, herding objectives for the swarm are framed in terms of the distribution of the followers over the graph. The objective is to compute leader control policies that are functions of the agent distribution, rather than the individual agents’ states, which makes the control policies scalable with the number of agents.

We apply RL-based approaches to the mean-field model to construct leader control policies that minimize the time required for the swarm of follower agents to converge to a user-defined target distribution. The RL-based control policies are not hindered by curse-of-dimensionality issues that arise in classical optimal control approaches. Additionally, RL-based approaches can more easily accommodate the stochastic nature of the follower agent transitions on the graph. There is prior work on RL-based control approaches for mean-field models of swarms in which each agent can localize itself in space and a state-dependent control policy can be assigned to each agent directly [9, 10, 11]. However, to our knowledge, there is no existing work on RL-based approaches applied to mean-field models for herding a swarm using a leader agent. Our approach provides an RL-based framework for designing scalable strategies to control swarms of resource-constrained robots using a single leader robot, and it can be extended to other types of swarm control objectives.

2 Methodology

2.1 Problem Statement

We first define some notation from graph theory and matrix analysis that we use to formally state our problem. We denote by $\mathcal{G}=(\mathcal{V},\mathcal{E})$ a directed graph with a set of $M$ vertices, $\mathcal{V}=\{1,...,M\}$, and a set of $N_{\mathcal{E}}$ edges, $\mathcal{E}\subset\mathcal{V}\times\mathcal{V}$, where $e=(i,j)\in\mathcal{E}$ if there is an edge from vertex $i\in\mathcal{V}$ to vertex $j\in\mathcal{V}$. We define a source map $\sigma:\mathcal{E}\rightarrow\mathcal{V}$ and a target map $\tau:\mathcal{E}\rightarrow\mathcal{V}$ for which $\sigma(e)=i$ and $\tau(e)=j$ whenever $e=(i,j)\in\mathcal{E}$. Given a vector $X\in\mathbb{R}^{M}$, $X_{i}$ refers to the $i^{th}$ coordinate value of $X$. For a matrix $A\in\mathbb{R}^{M\times N}$, $A^{ij}$ refers to the element in the $i^{th}$ row and $j^{th}$ column of $A$.

We consider a finite swarm of $N$ follower agents and a single leader agent. The locations of the leader and followers evolve on a graph, $\mathcal{G}=(\mathcal{V},\mathcal{E})$, where $\mathcal{V}=\{1,...,M\}$ is a finite set of vertices and $\mathcal{E}=\{(i,j)\ |\ i,j\in\mathcal{V}\}$ is a set of edges that define the pairs of vertices between which agents can transition. The vertices in $\mathcal{V}$ represent a set of spatial locations obtained by partitioning the agents’ environment. We will assume that the graph $\mathcal{G}=(\mathcal{V},\mathcal{E})$ is strongly connected and that there is a self-edge $(i,i)\in\mathcal{E}$ at every vertex $i\in\mathcal{V}$. We assume that the leader agent can count the number of follower agents at each vertex in the graph. The follower agents at a location $v$ only decide to move to an adjacent location if the leader agent is currently at location $v$ and is in a particular behavioral state. In other words, the presence of the leader repels the followers at the leader’s location. The leader agent does not have a model of the follower agents’ behavior.

The leader agent performs a sequence of transitions from one location (vertex) to another. The leader’s location at time $k\in\mathbb{Z}_{+}$ is denoted by $\ell_{1}(k)\in\mathcal{V}$. In addition to the spatial state $\ell_{1}(k)$, the leader has a behavioral state at each time $k$, defined as $\ell_{2}(k)\in\{0,1\}$. The location of each follower agent $i\in\{1,...,N\}$ is defined by a discrete-time Markov chain (DTMC) $X_{i}(k)$ that evolves on the state space $\mathcal{V}$ according to the conditional probabilities

\mathbb{P}(X_{i}(k+1)=\tau(e) \mid X_{i}(k)=\sigma(e)) = u_{e}(k)    (1)

For each $v\in\mathcal{V}$ and each $e\in\mathcal{E}$ such that $\sigma(e)=v\neq\tau(e)$, $u_{e}(k)$ is given by

u_{e}(k)=\begin{cases}\beta_{e} & \text{if } \ell_{1}(k)=\sigma(e) \text{ and } \ell_{2}(k)=1,\\ 0 & \text{if } \ell_{1}(k)=\sigma(e) \text{ and } \ell_{2}(k)=0,\\ 0 & \text{if } \ell_{1}(k)\neq\sigma(e),\end{cases}    (2)

where $\beta_{e}$ are positive parameters such that $\sum_{e\in\mathcal{E},\, v=\sigma(e)\neq\tau(e)}\beta_{e}<1$. Additionally, for each $v\in\mathcal{V}$, $u_{(v,v)}(k)$ is given by

u_{(v,v)}(k)=1-\sum_{\substack{e\in\mathcal{E}\\ v=\sigma(e)\neq\tau(e)}}u_{e}(k)    (3)
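To make the follower dynamics concrete, the following Python sketch samples one follower transition according to Equations (1)–(3). The edge list, the uniform value of $\beta_{e}$, and the function name are our own illustrative choices (the vertex labeling is not necessarily that of Figure 2), not the authors' implementation.

import random

# Illustrative sketch of one follower transition, Equations (1)-(3).
def follower_step(x, leader_vertex, leader_state, edges, beta):
    """Sample a follower's next vertex given its current vertex x."""
    out_edges = [e for e in edges if e[0] == x and e[1] != x]
    # Eq. (2): a follower is repelled only if the leader is at x with l_2 = 1.
    repelled = (leader_vertex == x and leader_state == 1)
    probs = [beta[e] if repelled else 0.0 for e in out_edges]
    stay_prob = 1.0 - sum(probs)                      # Eq. (3): self-edge
    targets = [e[1] for e in out_edges] + [x]
    return random.choices(targets, weights=probs + [stay_prob])[0]

# Example on a 2x2 grid (illustrative vertex labeling), beta_e = 0.1 everywhere:
edges = [(1, 2), (2, 1), (1, 3), (3, 1), (2, 4), (4, 2), (3, 4), (4, 3),
         (1, 1), (2, 2), (3, 3), (4, 4)]
beta = {e: 0.1 for e in edges if e[0] != e[1]}
next_vertex = follower_step(x=3, leader_vertex=3, leader_state=1,
                            edges=edges, beta=beta)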

For each vertex $v\in\mathcal{V}$, we define a set of possible actions $A_{v}$ taken by the leader when it is located at $v$:

A_{v}=\bigcup_{\substack{e\in\mathcal{E}\\ v=\sigma(e)}}\{e\}\times\{0,1\}    (4)

The leader transitions between states in $\mathcal{V}\times\{0,1\}$ according to the conditional probabilities

\mathbb{P}(\ell_{1}(k+1)=\tau(e),\ \ell_{2}(k+1)=d \mid \ell_{1}(k)=\sigma(e))=1    (5)

if $p(k)$, the action taken by the leader at time $k$ when it is at vertex $v$, is given by $p(k)=(e,d)\in A_{v}$.
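The structure of the leader action set in Equation (4) and its deterministic transition in Equation (5) can be sketched as follows; the edge list and helper names are hypothetical, chosen only for illustration.

# Sketch of the leader action set A_v (Eq. (4)) and leader transition (Eq. (5)).
edges = [(1, 2), (2, 1), (1, 3), (3, 1), (2, 4), (4, 2), (3, 4), (4, 3),
         (1, 1), (2, 2), (3, 3), (4, 4)]

def leader_actions(v, edges):
    """A_v: every (edge, d) pair with the edge leaving vertex v, d in {0, 1}."""
    return [(e, d) for e in edges if e[0] == v for d in (0, 1)]

def leader_step(action):
    """Eq. (5): the leader moves along the chosen edge with probability 1 and
    adopts the chosen behavioral state d."""
    (src, dst), d = action
    return dst, d

A_3 = leader_actions(3, edges)                  # includes ((3, 3), 1): stay and repel
location, behavior = leader_step(((3, 1), 0))   # move to vertex 1, passive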

The fraction, or empirical distribution, of follower agents that are at location $v\in\mathcal{V}$ at time $k$ is given by $\frac{1}{N}\sum_{i=1}^{N}\chi_{v}(X_{i}(k))$, where $\chi_{v}(w)=1$ if $w=v$ and $0$ otherwise. Our goal is to learn a policy that navigates the leader between vertices using the actions $p(k)$ such that the follower agents are redistributed (“herded”) from their initial empirical distribution $\frac{1}{N}\sum_{i=1}^{N}\chi_{v}(X_{i}(0))$ among the vertices to a desired empirical distribution $\frac{1}{N}\sum_{i=1}^{N}\chi_{v}(X_{i}(T))$ at some final time $T$, where $T$ is as small as possible. Since the identities of the follower agents are not important, we aim to construct a control policy for the leader that is a function of the current empirical distribution $\frac{1}{N}\sum_{i=1}^{N}\chi_{v}(X_{i}(k))$, rather than the individual agent states $X_{i}(k)$. However, $\frac{1}{N}\sum_{i=1}^{N}\chi_{v}(X_{i}(k))$ is not a state variable of the DTMC. In order to treat $\frac{1}{N}\sum_{i=1}^{N}\chi_{v}(X_{i}(k))$ as the state, we consider the mean-field limit of this quantity as $N\rightarrow\infty$. Let $\mathcal{P}(\mathcal{V})=\{Y\in\mathbb{R}^{M}_{\geq 0};\ \sum_{v=1}^{M}Y_{v}=1\}$ be the simplex of probability densities on $\mathcal{V}$. When $N\rightarrow\infty$, the empirical distribution $\frac{1}{N}\sum_{i=1}^{N}\chi_{v}(X_{i}(k))$ converges to a deterministic quantity $\hat{S}(k)\in\mathcal{P}(\mathcal{V})$, which evolves according to the following mean-field model, a system of difference equations that define the discrete-time Kolmogorov Forward Equation:

\hat{S}(k+1)=\sum_{e\in\mathcal{E}}u_{e}(k)\,B_{e}\,\hat{S}(k),\qquad \hat{S}(0)=\hat{S}^{0}\in\mathcal{P}(\mathcal{V}),    (6)

where $B_{e}$ are matrices whose entries are given by

B_{e}^{ij}=\begin{cases}1 & \text{if } i=\tau(e),\ j=\sigma(e),\\ 0 & \text{otherwise.}\end{cases}

The random variable $X_{i}(k)$ is related to the solution of the difference equation (6) by the relation $\mathbb{P}(X_{i}(k)=v)=\hat{S}_{v}(k)$.
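As an illustration of how the mean-field model (6) propagates the follower distribution, the short sketch below performs one update for a leader that is repelling followers at a single vertex. The variable names, edge list, and uniform $\beta=0.1$ are our own assumptions, not code from the paper.

import numpy as np

# Illustrative sketch of one step of the mean-field model, Equation (6).
def mean_field_step(S_hat, edges, u):
    """Return S_hat(k+1) = sum over edges of u_e(k) * B_e * S_hat(k)."""
    S_next = np.zeros(len(S_hat))
    for (src, dst) in edges:
        # B_e has a single 1 in row tau(e) = dst and column sigma(e) = src, so
        # the term u_e * B_e * S_hat moves a u_e fraction of the mass at src to dst.
        S_next[dst - 1] += u[(src, dst)] * S_hat[src - 1]
    return S_next

edges = [(1, 2), (2, 1), (1, 3), (3, 1), (2, 4), (4, 2), (3, 4), (4, 3),
         (1, 1), (2, 2), (3, 3), (4, 4)]
u = {e: 0.0 for e in edges}
u[(3, 1)] = u[(3, 4)] = 0.1                 # Eq. (2): leader at vertex 3 with l_2 = 1, beta = 0.1
u[(1, 1)] = u[(2, 2)] = u[(4, 4)] = 1.0     # Eq. (3): no leader present, all mass stays
u[(3, 3)] = 1.0 - 0.1 - 0.1                 # Eq. (3) at the leader's vertex

S_hat = np.array([0.4, 0.1, 0.1, 0.4])      # initial distribution (14)
S_next = mean_field_step(S_hat, edges, u)   # -> [0.41, 0.1, 0.08, 0.41]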

We formulate an optimization problem that minimizes the number of time steps $k$ required for the follower agents to converge to $\hat{S}_{target}$, the target distribution. In this optimization problem, the reward function is defined as

R(k)=-\,\mathbb{E}\,\|\hat{S}(k)-\hat{S}_{target}\|^{2}.    (7)
Problem 2.1.

Given a target follower agent distribution $\hat{S}_{target}$, devise a leader control policy $\pi:\mathcal{P}(\mathcal{V})\times\mathcal{V}\rightarrow A$ that drives the follower agent distribution to $\hat{S}(T)=\hat{S}_{target}$, where the final time $T$ is as small as possible, by maximizing the total reward $\sum_{k=1}^{T}R(k)$. The leader action at time $k$ when it is at vertex $v\in\mathcal{V}$ is defined as $p(k)=\pi(\hat{S}(k),\ell_{1}(k))\in A_{v}$ for all $k\in\{1,...,T\}$, where $A=\cup_{v\in\mathcal{V}}A_{v}$.

2.2 Design of Leader Control Policies using Temporal-Difference Methods

Two Temporal-Difference (TD) learning methods [1], SARSA and Q-Learning, were adapted to generate an optimal leader control policy. These methods’ use of bootstrapping provides the flexibility needed to accommodate the stochastic nature of the follower agents’ transitions between vertices. Additionally, TD methods are model-free approaches, which are suitable for our control objective since the leader does not have a model of the followers’ behavior. We compare the two methods to identify their advantages and disadvantages when applied to our swarm herding problem. Our approach is based on the mean-field model (6) in the sense that the leader learns a control policy using its observations of the population fractions of followers at all vertices in the graph.

Sutton and Barto [1] provide a formulation of the two TD algorithms that we utilize. Let $S$ denote the state of the environment, defined later in this section; $A$ denote the action set of the leader, defined as the set $A_{v}$ in Equation (4); and $Q(S,A)$ denote the state-action value function. We define $\alpha\in[0,1]$ and $\gamma\in[0,1]$ as the learning rate and the discount factor, respectively. The policy used by the leader is determined by a state-action pair $(S,A)$. $R$ denotes the reward for the implemented policy’s transition from the current to the next state-action pair and is defined in Equation (7). In the SARSA algorithm, an on-policy method, the state-action value function is updated according to:

Q(S,A)\leftarrow Q(S,A)+\alpha\left[R+\gamma Q(S',A')-Q(S,A)\right]    (8)

where the update is dependent on the current state-action pair $(S,A)$ and the next state-action pair $(S',A')$ determined by enacting the policy. In the Q-Learning algorithm, an off-policy method, the state-action value function is updated according to:

Q(S,A)\leftarrow Q(S,A)+\alpha\left[R+\gamma\max_{a}Q(S',a)-Q(S,A)\right]    (9)

Whereas the SARSA algorithm update (8) requires knowing the next action $A'$ taken by the policy, the Q-Learning update (9) does not require this information.
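A minimal sketch of the tabular updates (8) and (9) is given below, assuming a dictionary-based Q table and the learning rate and discount factor reported in Section 3; the function names are ours, not the authors' code.

from collections import defaultdict

# Minimal sketch of the tabular TD updates, Equations (8) and (9).
alpha, gamma = 0.3, 0.9
Q = defaultdict(float)            # Q[(state, action)] -> value, default 0

def sarsa_update(s, a, r, s_next, a_next):
    # Eq. (8): on-policy target uses the action actually taken next.
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])

def q_learning_update(s, a, r, s_next, actions_next):
    # Eq. (9): off-policy target uses the greedy action at the next state.
    best = max(Q[(s_next, a)] for a in actions_next)
    Q[(s, a)] += alpha * (r + gamma * best - Q[(s, a)])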

Both algorithms use a discretization of the observed state $S$ and represent the state-action value function $Q$ in tabular form as a multi-dimensional matrix, indexed by the leader actions and states. The state $S$ is defined as a vector that contains a discretized form of the population fraction of follower agents at each location $v\in\mathcal{V}$ and the location $\ell_{1}(k)\in\mathcal{V}$ of the leader agent. The leader’s spatial state $\ell_{1}(k)$ must be taken into account because the leader’s possible actions depend on its current location on the graph. Since the population fractions of follower agents are continuous values, we convert them into discrete integer quantities that serve as a discrete function approximation of the continuous population fractions. Instead of defining $F_{v}$ as the integer count of followers at location $v$, which could be very large, we reduce the dimensionality of the state space by discretizing the follower population fractions into $D$ intervals and scaling them up to integers between 1 and $D$:

F_{v}=\text{round}\!\left(\frac{D}{N}\sum_{i=1}^{N}\chi_{v}(X_{i}(k))\right),\quad F_{v}\in[1,\ldots,D],\ v\in\mathcal{V}.    (10)

For example, suppose $D=10$. Then a follower population fraction of 0.24 at location $v$ would have a corresponding state value $S_{v}=2$. Using a larger value of $D$ provides a finer classification of agent populations, but at the cost of increasing the size of the state $S$. Given these definitions, the state vector $S$ is defined as:

S_{env}=[F_{1},\ldots,F_{M},\,\ell_{1}]    (11)

The state vector $S_{env}$ contains many states that are inapplicable to the learning process. For example, the state vector for a $2\times 2$ grid graph with $D=10$ has $10\times 10\times 10\times 10\times 4$ possible variations, but only $10\times 10\times 10\times 4$ are applicable, since these satisfy the constraint that the follower population fractions at all vertices must sum to 1 (note that the sum $\sum_{v}F_{v}$ may differ slightly from $D$ due to the rounding used in Equation (10)). The state $S_{env}$ is used as the state $S$ in the state-action value functions (8) and (9).
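The state construction of Equations (10)–(11) can be illustrated with the short sketch below, which maps raw follower counts and the leader's vertex to a discretized state tuple; the function name and input format are our own assumptions.

# Illustrative sketch of the state construction of Equations (10)-(11).
def discretize_state(counts, leader_vertex, D=10):
    """counts[v-1] = number of followers at vertex v; returns S_env as a tuple."""
    N = sum(counts)
    F = [int(round(D * c / N)) for c in counts]   # Eq. (10)
    return tuple(F) + (leader_vertex,)            # Eq. (11)

# Example: 100 followers distributed as [40, 10, 10, 40], leader at vertex 3:
S_env = discretize_state([40, 10, 10, 40], leader_vertex=3, D=10)
# S_env == (4, 1, 1, 4, 3)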

The leader’s control policy for both value functions (8) and (9) is the following $\epsilon$-greedy policy, where $X\in[0,1]$ is a uniform random variable and $\epsilon$ is a threshold parameter that determines the degree of state exploration during training:

\pi(S_{env})=\operatorname*{arg\,max}_{A}Q(S_{env},A)\quad\text{if } X>\epsilon    (12)

If $X\leq\epsilon$, the leader instead selects an action uniformly at random from its available action set.
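A minimal sketch of this $\epsilon$-greedy selection, written against the dictionary-based Q table used in the TD sketch above, is given below; the names are ours and the default $\epsilon$ value is arbitrary.

import random

# Minimal sketch of the epsilon-greedy action selection in Equation (12).
def epsilon_greedy(Q, s_env, actions, epsilon=0.1):
    """Return a greedy action with probability 1 - epsilon, else a random one."""
    if random.random() > epsilon:
        return max(actions, key=lambda a: Q[(s_env, a)])
    return random.choice(actions)   # exploratory action (shown with an epsilon label in [16])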

3 Simulation Results

An OpenAI Gym environment [12] was created in order to design, simulate, and visualize our leader-based herding control policies [13]. This open-source virtual environment can be easily modified to simulate swarm controllers for different numbers of agents and graph vertices, making it a suitable environment for training leader control policies using our model-free approaches. The simulated controllers can then be implemented in physical robot experiments. Figure 1 shows the simulated environment for a scenario with 100 follower agents, represented by the blue $\times$ symbols, that are herded by a leader, shown as a red circle, over a $2\times 2$ grid. The OpenAI Gym environment does not store the individual positions of each follower agent within a grid cell; instead, each cell is associated with an agent count. The renderer disperses agents randomly within a cell based on the cell’s current agent count. The agent count for a grid cell is updated whenever an agent enters or leaves the cell according to the DTMC (1), and the environment is re-rendered. Recording the agent counts in each cell rather than their individual positions significantly reduces memory allocation and computational time when training the leader control policy on scenarios with large numbers of agents.
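The count-based bookkeeping described above can be sketched as follows: each cell stores only an agent count, and the counts at the leader's cell are redistributed by sampling the followers' transitions in aggregate. This is our own illustration, not the gym-herding implementation [13]; the grid labeling and $\beta$ value are assumptions.

import numpy as np

# Sketch of a count-based update: followers at the leader's cell are
# redistributed by sampling the DTMC (1) in aggregate rather than per agent.
rng = np.random.default_rng()

def step_counts(counts, leader_vertex, leader_state, neighbors, beta=0.1):
    """counts: dict vertex -> agent count. Returns the updated counts."""
    counts = dict(counts)
    if leader_state != 1:
        return counts                      # passive leader: no follower moves
    v = leader_vertex
    out = neighbors[v]                     # vertices reachable from v
    probs = [beta] * len(out) + [1.0 - beta * len(out)]   # Eqs. (2)-(3)
    moved = rng.multinomial(counts[v], probs)
    for w, m in zip(out, moved[:-1]):
        counts[w] += m
    counts[v] = moved[-1]                  # followers that stayed
    return counts

# Example on a 2x2 grid (illustrative labeling), leader repelling at vertex 3:
neighbors = {1: [2, 3], 2: [1, 4], 3: [1, 4], 4: [2, 3]}
counts = {1: 40, 2: 10, 3: 10, 4: 40}
counts = step_counts(counts, leader_vertex=3, leader_state=1, neighbors=neighbors)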

Figure 1: Visual rendering of a simulated scenario in our OpenAI Gym environment for iterations $k=0$ and $k=50$. The environment simulates a strongly connected $2\times 2$ grid graph such as the one shown in Figure 2. The leader (red circle) moves between grid cells in a horizontal or vertical direction; it may not move diagonally. Follower agents (blue $\times$ symbols) are randomly distributed in each cell. The borders of each cell are represented by the grid lines. The histogram to the right of each grid shows both the target (red) and current (blue) agent population fractions at each vertex at iteration $k$.
Figure 2: The bidirected grid graph $\mathcal{G}$ used in our simulated scenario. The leader agent (red $\times$ symbol) is at vertex 3. The movement options for the leader are Left to vertex 2 or Up to vertex 1. The leader can also Stay at vertex 3, where its presence triggers follower agents at that vertex to probabilistically transition to vertex 1 or vertex 2.

The graph $\mathcal{G}$ that models the environment in Figure 1, with each vertex of $\mathcal{G}$ corresponding to a grid cell, was defined as the $2\times 2$ grid graph in Figure 2. In this graph, agents transition along edges in either a horizontal or vertical direction, or they can stay at their current vertex. The leader's action set is thus defined as:

A=[\,\textit{Left},\ \textit{Right},\ \textit{Up},\ \textit{Down},\ \textit{Stay}\,]    (13)

Using the graph in Figure 2, we trained and tested a leader control policy for follower agent populations of $N=10$ to $100$ at 10-agent increments. Both the SARSA and Q-Learning paradigms were applied and trained on 5000 episodes with 5000 iterations each. In every episode, the initial distribution $\hat{S}_{initial}$ and target distribution $\hat{S}_{target}$ of the follower agents were defined as:

\hat{S}_{initial}=\begin{bmatrix}0.4 & 0.1 & 0.1 & 0.4\end{bmatrix}^{T}    (14)
\hat{S}_{target}=\begin{bmatrix}0.1 & 0.4 & 0.4 & 0.1\end{bmatrix}^{T}    (15)

The initial leader location, $\ell_{1}$, was randomized to allow many possible permutations of states $S_{env}$ for training. During training, an episode completes once the distribution of $N$ follower agents reaches a specified terminal state. Instead of defining the terminal state as the exact target distribution $\hat{S}_{target}$, which becomes more difficult to reach as $N$ increases due to the stochastic nature of the followers’ transitions, we define this state as a distribution that is sufficiently close to $\hat{S}_{target}$. The learning rate and discount factor were set to $\alpha=0.3$ and $\gamma=0.9$, respectively. The follower agent transition rate $\beta_{e}$ was defined as the same value $\beta$ for all edges $e$ in the graph and was set to $\beta=0.025$, $0.05$, or $0.1$. We use the mean squared error (MSE) to measure the difference between the current follower distribution and $\hat{S}_{target}$. The terminal state is reached when the MSE decreases below a threshold value $\mu$. We trained our algorithms on threshold values of $\mu=0.0005$, $0.001$, $0.0025$, and $0.005$.
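The terminal-state test can be illustrated with the short sketch below, which checks whether the MSE between the current follower distribution and the target (15) has dropped below the threshold $\mu$; the function name is ours.

import numpy as np

# Sketch of the terminal-state test: an episode ends once the MSE between the
# current follower distribution and the target drops below mu.
S_target = np.array([0.1, 0.4, 0.4, 0.1])          # Eq. (15)

def reached_terminal_state(S_current, S_target, mu=0.0025):
    mse = np.mean((S_current - S_target) ** 2)
    return mse < mu

# The initial distribution (14) is far from terminal:
# mean of (0.3^2, 0.3^2, 0.3^2, 0.3^2) = 0.09 >= 0.0025
print(reached_terminal_state(np.array([0.4, 0.1, 0.1, 0.4]), S_target))  # False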

Figure 3: Number of iterations until convergence to $\hat{S}_{target}$ (plotted on a log scale) versus the MSE threshold value $\mu$ for leader control policies that were learned using Q-Learning (left) and SARSA (right) with $\beta=0.05$ and $N=100$ follower agents. Each circle on the plots marks the mean number of iterations until convergence over 1000 test runs of a leader policy in the simulated grid graph environment in Figure 2. The shaded regions indicate the range of $\pm 1$ standard deviation about the mean number of iterations (blue for $D=10$; orange for $D=20$).
Figure 4: Number of iterations until convergence to $\hat{S}_{target}$ versus number of follower agents $N$ for leader control policies that were learned using Q-Learning (left column) and SARSA (right column) with $D=20$ and $\mu=0.0025$. Each circle on the plots marks the mean number of iterations until convergence over 1000 test runs of a leader policy in the simulated grid graph environment in Figure 2 with the same value of $N$ that the policy was trained on. The plot for each $\beta$ value in the top two figures is reproduced individually in the three figures below them, along with shaded regions that indicate the range of $\pm 1$ standard deviation about the mean number of iterations.

After training the leader control policies on each follower agent population size $N$, the policies were tested on scenarios with the same environment and value of $N$. The policy for each scenario was run 1000 times to evaluate its performance. The policies were compared for terminal states that corresponded to the four different MSE threshold values $\mu$, and were given 1000 iterations to converge within the prescribed MSE threshold of the target distribution (15) from the initial distribution (14).

Figure 3 compares the performance of leader control policies that were designed using each TD algorithm as a function of the tested values of $\mu$. The leader control policies were trained on $N=100$ follower agents, using the parameters $\beta=0.05$ and $D=10$ or $20$, and tested in simulations with $N=100$. The plots show that for both policies, the mean number of iterations required to converge to $\hat{S}_{target}$ decreases as the threshold $\mu$ increases for constant $D$, and at low values of $\mu$, the mean number of iterations decreases when $D$ is increased. In addition, as $\mu$ increases, the variance in the number of iterations decreases (note the log scale of the $y$-axis in the plots) or remains approximately constant, except for the $D=20$ case of SARSA.

Figure 4 compares the performance of leader control policies that were designed using each algorithm as a function of $N$, where the leader policies were tested in simulations with the same value of $N$ that they were trained on. The other parameters used for training were $\mu=0.0025$, $D=20$, and $\beta=0.025$, $0.05$, or $0.1$. The figures show that raising $\beta$ from $0.05$ to $0.1$ does not significantly affect the mean number of iterations until convergence, while decreasing $\beta$ from $0.05$ to $0.025$ results in a higher mean number of iterations. This effect is evident for both the Q-Learning and SARSA trained leader control policies for $N>50$. Both leader control policies result in similar numbers of iterations for convergence at each agent population size. Therefore, both the Q-Learning and SARSA training algorithms yield comparable performance for these scenarios.

The results in Figure 4 show that as $N$ increases above 50 agents, the mean number of iterations until convergence decreases slightly or remains approximately constant for all $\beta$ values and for $\mu=0.0025$. Moreover, from Figure 3, we see that MSE threshold values $\mu<0.0025$ for $N=100$ result in a higher number of iterations than the $N=100$ case in Figure 4. This trend may be due to differences in the magnitude of the smallest possible change in MSE over an iteration $k$ relative to the MSE threshold $\mu$ for different values of $N$. For example, for $N=10$, the similarity in iteration counts for all four MSE thresholds $\mu$ can be attributed to the fact that the change in the MSE due to a transition of one agent, corresponding to a change in population fraction of $1/N=1/10$, is much higher than the four MSE thresholds (i.e., $(1/10)^{2}>0.005$, $0.0025$, $0.001$, and $0.0005$). Compare this to the iteration count for $N=50$, which would have a corresponding change in MSE of $(1/50)^{2}$; this quantity is much smaller than $0.005$ and $0.0025$, but not much smaller than $0.001$ and $0.0005$. The iteration counts for $N=100$ are much lower, since $(1/100)^{2}$ is much smaller than all four MSE thresholds.

Finally, Figure 5 compares the performance of leader control policies that were designed using each algorithm as a function of $N$, where the leader policies were trained with $N=10$, $100$, or $1000$ follower agents and tested in simulations with $N=10$ to $100$ (at 10-agent increments) and $N=1000$ agents. This was done to evaluate the robustness of the policies trained on the three agent populations to changes in $N$. The other parameters used for training were $\mu=0.0025$, $D=20$, and $\beta=0.025$, $0.05$, or $0.1$. As the plots in Figure 5 show, policies trained on the smallest population, $N=10$, yield an increased mean number of iterations until convergence when applied to populations $N>10$. The reverse effect is observed, in general, for policies that are trained on higher values of $N$ than they are tested on. An exception is the case where the policies are trained on $N=100$ or $1000$ and tested on $N=10$, which produces much higher numbers of iterations than the policies that are both trained and tested on $N=10$. This is likely a result of the large variance, and hence greater uncertainty, in the time evolution of such a small agent population. The lower amount of uncertainty in the time evolution of large swarms may make it easier for leader policies that are trained on large values of $N$ to control the distribution of a given follower agent population than policies that are trained on smaller values of $N$. We thus hypothesize that training a leader agent with the mean-field model (6) instead of the DTMC model would lead to improved performance in terms of a lower training time, since the policy would only need to be trained on one value of $N$, and fewer iterations until convergence to the target distribution.

Figure 5: Number of iterations until convergence to $\hat{S}_{target}$ versus number of follower agents $N$ (plotted on a log scale) for leader control policies that were trained using Q-Learning (left column) and SARSA (right column) with $D=20$; $\mu=0.0025$; $\beta=0.025$, $0.05$, or $0.1$; and $N=10$, $100$, or $1000$ agents. Each circle on the plots marks the mean number of iterations until convergence over 1000 test runs of a leader policy in the simulated grid graph environment in Figure 2.

4 Experimental Results

We also conducted experiments to verify that our herding approach is effective in a real-world environment with physical constraints on robot dynamics and inter-robot spacing. Two of the leader control policies that were generated in the simulated environment were tested on a swarm of small differential-drive robots in the Robotarium [14], a remotely accessible swarm robotics testbed that provides an experimental platform for users to validate swarm algorithms and controllers. Experiments are set up in the Robotarium with MATLAB or Python scripts. The robots move to target locations on the testbed surface using a position controller and avoid collisions with one another through the use of barrier certificates [15], a modification to the robots’ controllers that ensures particular safety constraints are satisfied. To implement this collision-avoidance strategy, the robots’ positions and orientations in a global coordinate frame are measured from images taken from multiple VICON motion capture cameras.

A video recording of our experiments is shown in [16]. The environment was represented as a $2\times 2$ grid, as in the simulations, and $N=10$ robots were used as follower agents. The leader agent, shown as the blue circle, and the boundaries of the four grid cells were projected onto the surface of the testbed using an overhead projector. As in the simulations, at each iteration $k$, the leader moves from one grid cell to another depending on the action prescribed by its control policy. Both the SARSA and Q-Learning leader control policies trained with $N=100$ follower agents, $D=10$, $\mu=0.0025$, and $\beta=0.1$ were implemented, and [16] shows the performance of both control policies. In the video, the leader is red if it is executing the Stay action and blue if it is executing any of the other actions in the set $A$ (i.e., a movement action). The current iteration $k$ and leader action are displayed at the top of the video frames. Actions that display $\epsilon$ next to them signify a random action as specified in (12). Each control policy was able to achieve the exact target distribution (15). The SARSA policy took 59 iterations to reach this distribution, while the Q-Learning policy took 23 iterations.

5 Conclusion and Future Work

We have presented two Temporal-Difference learning approaches to designing a leader-follower control policy for herding a swarm of agents as quickly as possible to a target distribution over a graph. We demonstrated the effectiveness of the leader control policy in simulations and physical robot experiments for a range of swarm sizes $N$, illustrating the scalability of the control policy with $N$, and investigated the effect of $N$ on the convergence time to the target distribution. However, these approaches do not scale well with the graph size due to the computational limitations of tabular TD approaches and, in particular, our discretization of the system state into population fraction intervals. Our implementation requires a matrix with $D^{M}\times M\times|A|$ state-action values to train the leader control policy. For our $2\times 2$ grid graph with $|A|=5$ possible leader actions and $D=20$ intervals, this is $20^{4}\times 4\times 5 = 3{,}200{,}000$ values.
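A quick back-of-the-envelope check of this table size (our own arithmetic, using the values quoted above):

# Q-table size for the 2x2 grid graph: D^M * M * |A| state-action values.
D, M, num_actions = 20, 4, 5
table_entries = D**M * M * num_actions
print(table_entries)   # 3200000, i.e., 3.2 million entries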

To address this issue, our future work focuses on designing leader control policies using the mean-field model rather than the DTMC model for training, as suggested at the end of Section 3. In this approach, the leader policies would be trained on follower agent population fractions that are computed from solutions of the discrete-time mean-field model (6), rather than from discrete numbers of agents that transition between locations according to a DTMC. The leader control policies can also be modified to use function approximators such as neural networks for our training algorithm, allowing for utilization of modern deep reinforcement learning techniques. Neural network function approximators provide a more practical approach than tabular methods to improve the scalability of the leader control policy with graph size, in addition to swarm size. In addition, the control policies could be implemented on a swarm robotic testbed in a decentralized manner, in which each follower robot avoids collisions with other robots based on its local sensor information.

Acknowledgment. Many thanks to Dr. Sean Wilson at the Georgia Tech Research Institute for running the robot experiments on the Robotarium.

References

  • [1] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT Press, 2018.
  • [2] Meng Ji, Giancarlo Ferrari-Trecate, Magnus Egerstedt, and Annalisa Buffa. Containment control in mobile networks. IEEE Transactions on Automatic Control, 53(8):1972–1975, 2008.
  • [3] Mehran Mesbahi and Magnus Egerstedt. Graph theoretic methods in multiagent networks, volume 33. Princeton University Press, 2010.
  • [4] Alyssa Pierson and Mac Schwager. Controlling noncooperative herds with robotic herders. IEEE Transactions on Robotics, 34(2):517–525, 2017.
  • [5] Karthik Elamvazhuthi, Sean Wilson, and Spring Berman. Confinement control of double integrators using partially periodic leader trajectories. In American Control Conference, pages 5537–5544, 2016.
  • [6] Aditya A Paranjape, Soon-Jo Chung, Kyunam Kim, and David Hyunchul Shim. Robotic herding of a flock of birds using an unmanned aerial vehicle. IEEE Transactions on Robotics, 34(4):901–915, 2018.
  • [7] Clark Kendrick Go, Bryan Lao, Junichiro Yoshimoto, and Kazushi Ikeda. A reinforcement learning approach to the shepherding task using SARSA. In International Joint Conference on Neural Networks, pages 3833–3836, 2016.
  • [8] Karthik Elamvazhuthi and Spring Berman. Mean-field models in swarm robotics: A survey. Bioinspiration & Biomimetics, 15(1):015001, 2019.
  • [9] Adrian Šošić, Abdelhak M Zoubir, and Heinz Koeppl. Reinforcement learning in a continuum of agents. Swarm Intelligence, 12(1):23–51, 2018.
  • [10] Maximilian Hüttenrauch, Adrian Šošić, and Gerhard Neumann. Deep reinforcement learning for swarm systems. Journal of Machine Learning Research, 20(54):1–31, 2019.
  • [11] Yaodong Yang, Rui Luo, Minne Li, Ming Zhou, Weinan Zhang, and Jun Wang. Mean field multi-agent reinforcement learning. In International Conference on Machine Learning, pages 5567–5576, 2018.
  • [12] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
  • [13] Zahi Kakish. Herding OpenAI Gym Environment, 2019. https://github.com/acslaboratory/gym-herding.
  • [14] S. Wilson, P. Glotfelter, L. Wang, S. Mayya, G. Notomista, M. Mote, and M. Egerstedt. The Robotarium: Globally impactful opportunities, challenges, and lessons learned in remote-access, distributed control of multirobot systems. IEEE Control Systems Magazine, 40(1):26–44, 2020.
  • [15] L. Wang, A. D. Ames, and M. Egerstedt. Safety barrier certificates for collisions-free multirobot systems. IEEE Transactions on Robotics, 33(3):661–674, 2017.
  • [16] Z. Kakish, K. Elamvazhuthi, and S. Berman. Using reinforcement learning to herd a robotic swarm to a target distribution, 2020. Autonomous Collective Systems Laboratory YouTube channel, https://youtu.be/py3Pe24YDjE.