
Payoff Mechanism Design for Coordination
in Multi-Agent Task Allocation Games

Shinkyu Park and Julian Barreiro-Gomez
Park’s work was supported by funding from King Abdullah University of Science and Technology (KAUST). Barreiro-Gomez’s work was supported by the Center on Stability, Instability, and Turbulence (SITE) and Tamkeen under the NYU Abu Dhabi Research Institute grant CG002. Park is with Electrical and Computer Engineering, King Abdullah University of Science and Technology (KAUST), Thuwal 23955, Saudi Arabia ([email protected]). Barreiro-Gomez is with the NYUAD Research Institute, New York University Abu Dhabi, PO Box 129188, Abu Dhabi, United Arab Emirates ([email protected]).
Abstract

We investigate a multi-agent decision-making problem where a large population of agents is responsible for carrying out a set of assigned tasks. The amount of jobs in each task varies over time governed by a dynamical system model. Each agent needs to select one of the available strategies to take on one or more tasks. Since each strategy allows an agent to perform multiple tasks at a time, possibly at distinct rates, the strategy selection of the agents needs to be coordinated. We formulate the problem using the population game formalism and refer to it as the task allocation game. We discuss the design of a decision-making model that incentivizes the agents to coordinate in the strategy selection process.

As key contributions, we propose a method to find a payoff-driven decision-making model, and discuss how the model allows the strategy selection of the agents to be responsive to the amount of remaining jobs in each task while asymptotically attaining the optimal strategies. Leveraging analytical tools from feedback control theory, we derive technical conditions that the model needs to satisfy, which are used to construct a numerical approach to compute the model. We validate our solution through simulations to highlight how the proposed approach coordinates the agents in task allocation games.

I Introduction

We investigate task allocation games to study coordination in repeated strategic interactions in a large population of agents. Consider that there is a finite number of tasks to be carried out by the agents. We quantify the amount of jobs remaining in each task with a positive variable, and every agent can select one of the available strategies at a time to take on one or more tasks. The main objective is to design a decentralized decision-making model that allows the agents to coordinate and minimize remaining jobs in all tasks.

Task allocation games are relevant to engineering applications. For instance, in multi-robot resource retrieval [1, 2], a team of multiple robots is tasked with searching and collecting target resources across partitioned areas in a given environment. Each task can be defined as collecting resources from an area, and the strategy selection refers to taking on one of the tasks. In target tracking applications [3], a group of mobile units with heterogeneous sensing capabilities is deployed to collect data about the states of multiple targets of interest. Based on the type of equipped sensors, each mobile unit can collect different sets of data on the targets’ states. A task is defined as collecting data on a portion of the target states, and a strategy specifies which pair of available sensors a mobile unit equips. In both scenarios, the amount of resources to collect and the data to gather vary depending on the past strategy selection of the agents and also on environmental changes and target dynamics.

To design a model for the agent strategy selection in such engineering applications, we investigate task allocation in dynamically changing environments. Multi-agent task allocation problems have been widely studied across various research communities [4, 5, 6, 7, 8, 9, 10]. A game-theoretic approach to the problem using replicator dynamics is investigated in [8]. The authors of [5, 6] use hedonic games to study the coordination of multiple agents in task allocation. Applications of population game approaches to task allocation in swarm robotics [9] and to the control of a water distribution system [10] have also been discussed. Also relevant to the task allocation game we investigate in this work, whose formalism is defined on a state space, the state-based potential game has been studied in [11], and the design of state-based games for distributed optimization is proposed in [12].

A majority of existing works assume that the environment underlying the game is static and aim to find the optimal task allocation. In contrast, we study the design of a decision-making model under which the agents can repeatedly switch among multiple tasks to minimize remaining jobs in the tasks. We adopt the population game formalism [13] to state the problem and to study the decision-making model design. The model prescribes how the agents take on a given set of tasks and how they should switch among the tasks by revising their strategy selection to asymptotically attain optimality. We consider that each agent in the population is given a set of $n$ strategies to carry out assigned tasks, where we denote the agents’ strategy profile – the distribution of the agents’ strategy selection – by a non-negative vector $x=(x_{1},\cdots,x_{n})$. Remaining jobs associated with $m$ tasks are denoted by a non-negative vector $q=(q_{1},\cdots,q_{m})$, for which a dynamic model describes how $q$ changes – both growth due to environmental changes and reduction by the agents – based on the agents’ strategy selection $x$.

Based on the evolutionary dynamics framework [13], we specify a decentralized decision-making model that allows individual agents to revise their strategy selection based on a payoff vector $p=(p_{1},\cdots,p_{n})$, where each $p_{i}$ is the payoff an agent receives when it selects the $i$-th strategy. As the main contribution, we design a payoff mechanism that defines how $p$ should depend on $q$ to encourage the agents to select the tasks with more jobs to perform and to asymptotically attain the minimum of a given cost $c(q)$. Applying convergence analysis tools [14, 15, 16, 17, 18, 19] from the population games literature that are based on passivity theory, we establish conditions under which the agents’ strategy profile converges and asymptotically attains the optimal profile. We use these conditions to compute the payoff mechanism.

The paper is organized as follows. In Section II, we explain the task allocation game formulation and the main problem we address in this paper. In Section III, we present the main result on the payoff mechanism design and analysis on convergence of the agent strategy revision process to the optimal strategy profile. In Section IV, we present simulation results to illustrate our main contribution. We conclude the paper with a summary and future plans in Section V.

II Problem Description

Consider a large population of agents that are assigned $m$ tasks and are given $n$ strategies to carry out the tasks. (The number of tasks is not necessarily the same as the number of available strategies, i.e., possibly $m\neq n$.) We associate each task $j\in\{1,\cdots,m\}$ with a variable $q_{j}\geq 0$ which quantifies the amount of jobs remaining in the task. Let $x_{i}\geq 0$ denote the portion of the agents selecting strategy $i\in\{1,\cdots,n\}$, and let the fixed positive number $M$ be the mass of the agent population, satisfying $M=\sum_{i=1}^{n}x_{i}$. (Considering the population state $x$ as a control input to (2), the population mass $M$ can be interpreted as a limit on the control input.) Each agent selects one of the strategies at a time based on a payoff vector $p=(p_{1},\cdots,p_{n})$.

Let $\mathbb{R}_{+}^{n}$ be the set of all $n$-dimensional vectors with non-negative entries, and let $\mathbb{X}_{M}$ be the space of all feasible states $x=(x_{1},\cdots,x_{n})$ of the population, defined as $\mathbb{X}_{M}=\{x\in\mathbb{R}_{+}^{n}\,|\,\sum_{i=1}^{n}x_{i}=M\}$. Given a matrix $G\in\mathbb{R}^{n\times m}$, we represent $G$ using its column and row vectors as follows:

G=\begin{pmatrix}G_{1}^{\text{col}}&\cdots&G_{m}^{\text{col}}\end{pmatrix}=\begin{pmatrix}G_{1}^{\text{row}}\\ \vdots\\ G_{n}^{\text{row}}\end{pmatrix}.   (1)

II-A Task Allocation Games

To investigate the task allocation problem, we formalize it as a large population game in which the agents select strategies to perform jobs in the assigned tasks, quantified by $q=(q_{1},\cdots,q_{m})$. The vector $q$ varies over time based on the agents’ strategy selection and changes in the environment. Hence, each agent needs to evaluate and adaptively select a strategy based on $q$.

Given $x(t)$ and $q(t)$, at each time $t$, the following ordinary differential equation describes the rate of change of $q(t)$:

\dot{q}(t)=-\underbrace{\mathcal{F}(q(t),x(t))}_{\text{reduction rate}}+\underbrace{w}_{\text{growth rate}},\quad q(0)=q_{0}\in\mathbb{R}^{m}_{+},   (2)

where $\mathcal{F}:\mathbb{R}_{+}^{m}\times\mathbb{R}_{+}^{n}\to\mathbb{R}_{+}^{m}$ is a continuously differentiable mapping that defines the reduction rate, which quantifies how fast the agents adopting strategy profile $x$ reduce $q$, and the constant vector $w=(w_{1},\cdots,w_{m})\in\mathbb{R}^{m}_{+}$ represents the growth rate of $q$ due to environmental changes. (To have the reduction rate mapping defined for any population mass $M$, we define the domain of $\mathcal{F}$ as $\mathbb{R}_{+}^{m}\times\mathbb{R}_{+}^{n}$.) To ensure that the positive orthant $\mathbb{R}_{+}^{m}$ is forward-invariant for $q(t)$ under (2), each component $\mathcal{F}_{i}$ of $\mathcal{F}=(\mathcal{F}_{1},\cdots,\mathcal{F}_{m})$ satisfies $\mathcal{F}_{i}(q,x)\leq w_{i}$ if $q_{i}=0$. For notational convenience, let us define $\mathbb{O}$ as the set of stationary points of (2), i.e.,

\mathbb{O}=\{(q,x)\in\mathbb{R}_{+}^{m}\times\mathbb{X}_{M}\,|\,\mathcal{F}(q,x)=w\}.   (3)

We make the following assumption on the mapping $\mathcal{F}$.

Assumption 1

The reduction rate $\mathcal{F}_{i}$ for each task $i$ depends only on its associated variable $q_{i}(t)$ and the agent strategy selection $x(t)$, and it increases with $q_{i}(t)$. For instance, in the resource retrieval application discussed in Section I, when there is a larger volume of resources spread out across the areas, the robots need to travel a shorter distance on average to locate and retrieve the resources; hence, for a fixed strategy profile $x$, the variable $q_{i}(t)$ decreases at a faster rate. We formalize these assumptions as $\frac{\partial\mathcal{F}_{j}}{\partial q_{i}}(q,x)=0$ if $i\neq j$ and $\frac{\partial\mathcal{F}_{i}}{\partial q_{i}}(q,x)>0$. $\square$

According to Assumption 1, we represent the reduction rate as $\mathcal{F}(q,x)=(\mathcal{F}_{1}(q_{1},x),\cdots,\mathcal{F}_{m}(q_{m},x))$, where for fixed $x$, each $\mathcal{F}_{i}$ is an increasing function of $q_{i}$.

Remark 1

Suppose that given $x$ in $\mathbb{X}_{M}$, there is $q$ in $\mathbb{R}_{+}^{m}$ satisfying $\mathcal{F}(q,x)=w$. By Assumption 1, such $q$ is unique. $\square$

The following examples illustrate how the dynamic game model (2) can be adopted in control systems applications.

Example 1 (Multi-Robot Resource Collection [1])

Let $m=n$ and $\mathcal{F}=(\mathcal{F}_{1},\cdots,\mathcal{F}_{n})$ be defined as

\mathcal{F}_{i}(q_{i},x_{i})=R_{i}\frac{\exp(\alpha_{i}q_{i})-1}{\exp(\alpha_{i}q_{i})+1}x_{i}^{\beta_{i}},   (4)

where $R_{i}$, $\alpha_{i}$, and $\beta_{i}$ are positive constants. The parameter $R_{i}$ represents the maximum reduction rate associated with strategy $i$, and $\alpha_{i}$ and $\beta_{i}$ are coefficients specifying how the reduction rate $\mathcal{F}_{i}$ depends on $q_{i}$ and $x_{i}$, respectively. Note that each function $\mathcal{F}_{i}$ satisfies $\mathcal{F}_{i}(0,x)=0$ and Assumption 1. Here, $m=n$ and only the agents selecting strategy $i$ can reduce the variable $q_{i}$ associated with task $i$.
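To make the model concrete, below is a minimal Python sketch of the reduction rate (4) and the task dynamics (2) for Example 1; the parameter values mirror the simulation setup in Section IV, and the function names are our own.

```python
import numpy as np

# Parameters matching the simulation setup in Section IV (m = n = 4).
R = np.full(4, 3.5)        # maximum reduction rates R_i
alpha = np.full(4, 0.05)   # coefficients alpha_i
beta = np.full(4, 1.0)     # coefficients beta_i
w = np.array([0.05, 0.25, 1.00, 2.00])  # growth rates w_i

def reduction_rate(q, x):
    """F_i(q_i, x_i) in (4): only strategy i reduces task i."""
    return R * (np.exp(alpha * q) - 1.0) / (np.exp(alpha * q) + 1.0) * x ** beta

def q_dot(q, x):
    """Task dynamics (2): growth rate minus reduction rate."""
    return -reduction_rate(q, x) + w
```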

Example 2 (Heterogeneous Sensor Scheduling [3])

We adopt the model (2) as an abstract description of how mobile units’ sensor scheduling affects the uncertainty reduction in estimating the states of multiple targets. Let $m<n$ and $\mathcal{F}=(\mathcal{F}_{1},\cdots,\mathcal{F}_{m})$ be defined as

\mathcal{F}_{i}(q_{i},x)=\sum_{j\in\mathbb{N}_{i}}R_{i}\frac{\exp(\alpha_{i}q_{i})-1}{\exp(\alpha_{i}q_{i})+1}x_{j}^{\beta_{i}},   (5)

where $\mathbb{N}_{i}$ denotes the set of strategies (available sensor configurations of a mobile unit) that can collect data on the state of the $i$-th target. The parameters $R_{i}$, $\alpha_{i}$, $\beta_{i}$ have the same interpretation as in Example 1. Unlike the previous example, the strategies are defined to allow the agents to reduce multiple task-associated variables of $q$.

Example 3 (Water Distribution Control [14, 20, 16])

There are $m$ reservoirs, each of which is assigned a respective maximum water level $\bar{l}_{i}$ for $i$ in $\{1,\cdots,m\}$. Denote by $(l_{1}(t),\cdots,l_{m}(t))$ the water levels of the reservoirs at each time $t$. Let $w$ be a constant outflow specifying water demands by consumers and $(x_{1}(t),\cdots,x_{n}(t))$ be controllable inflows, where $n$ does not necessarily coincide with the number $m$ of reservoirs. The simplified dynamics for the water levels can be defined as $\dot{l}_{i}(t)=\mathcal{F}_{i}(\bar{l}_{i}-l_{i}(t),x(t))-w$ with $\mathcal{F}_{i}(0,x)=0$ to ensure that each reservoir cannot hold water above its maximum level: for instance, $\mathcal{F}_{i}(\bar{l}_{i}-l_{i},x)=\frac{\bar{l}_{i}-l_{i}}{\bar{l}_{i}}x_{i}$. By defining $q_{i}(t)=\bar{l}_{i}-l_{i}(t)$ as the remaining space in each reservoir $i$, we can derive the dynamic model as

\dot{q}_{i}(t)=-\frac{1}{\bar{l}_{i}}q_{i}(t)x_{i}(t)+w.
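As a small illustration of this derivation, the following sketch encodes the remaining-space dynamics of Example 3; the reservoir levels and demand chosen here are illustrative assumptions of ours.

```python
import numpy as np

l_bar = np.array([10.0, 8.0, 12.0])  # illustrative maximum water levels (assumed values)
w = 0.5                              # constant consumer demand (outflow)

def q_dot(q, x):
    """Remaining-space dynamics with F_i(q_i, x) = (q_i / l_bar_i) x_i, as in Example 3."""
    return -(q / l_bar) * x + w
```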

II-B Agent Strategy Revision Model

Our model is based on the evolutionary dynamics framework [13] in which the strategy revision protocol $\varrho_{i}^{\theta}:\mathbb{R}^{n}\to\mathbb{R}_{+}$ determines an agent’s strategy revision based on the payoff vector $p\in\mathbb{R}^{n}$, where $\theta=(\theta_{1},\cdots,\theta_{n})\in\mathbb{X}_{M}$ is a parameter of the protocol. We adopt the Kullback-Leibler Divergence Regularized Learning (KLD-RL) protocol [21, 22] to define $\varrho_{i}^{\theta}(p)$ as

\varrho_{i}^{\theta}(p)=\frac{\theta_{i}\exp(\eta^{-1}p_{i})}{\sum_{l=1}^{n}\theta_{l}\exp(\eta^{-1}p_{l})},   (6)

where $\eta>0$. The protocol $\varrho_{i}^{\theta}(p)$ describes the probability of an agent switching to strategy $i$ given $p$ and $\theta$. Note that the smaller the value of $\eta$, the more the strategy revision depends on the value of $p$.

Each agent is given an opportunity to revise its strategy selection at each jump time of an independent and identically distributed Poisson process, and uses the protocol to select a new strategy or keep its current strategy selection. Since the strategy revision of individual agents depends only on the payoff vector and takes place independently of each other, their decision-making is decentralized and the coordination among them occurs implicitly through their decision-making model. Based on discussions in [13, Chapter 4], as the number of agents in the population tends to infinity, the following ordinary differential equation describes how each component of $x(t)=(x_{1}(t),\cdots,x_{n}(t))$ evolves over time:

\dot{x}_{i}(t)=\mathcal{V}_{i}^{\theta}(p(t),x(t))=\sum_{j=1}^{n}x_{j}(t)\varrho_{i}^{\theta}(p(t))-x_{i}(t)\sum_{j=1}^{n}\varrho_{j}^{\theta}(p(t)).   (7)

We refer to (7) as the Evolutionary Dynamics Model (EDM).
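The following is a minimal sketch of the revision protocol (6) and the EDM vector field (7); the softmax computation is shifted for numerical stability, and the function names are ours.

```python
import numpy as np

def kld_rl_protocol(p, theta, eta):
    """Revision protocol (6): switch probabilities given payoffs p and parameter theta."""
    z = p / eta
    z -= z.max()                      # shift for numerical stability; does not change the ratio
    num = theta * np.exp(z)
    return num / num.sum()

def edm_vector_field(x, p, theta, eta):
    """EDM (7): x_dot_i = sum_j x_j * rho_i(p) - x_i * sum_j rho_j(p)."""
    rho = kld_rl_protocol(p, theta, eta)
    return x.sum() * rho - x * rho.sum()
```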

Note that at an equilibrium state $(p^{\ast},x^{\ast})$ of the EDM (7) under the KLD-RL protocol (6), if $\theta=x^{\ast}$, the following implication holds:

x_{i}^{\ast}>0\implies p_{i}^{\ast}=\max_{1\leq j\leq n}p_{j}^{\ast}.   (8)

Eq. (8) means that every agent receives the highest payoff at $(p^{\ast},x^{\ast})$ if the parameter $\theta$ of (6) is the same as $x^{\ast}$.

Figure 1: Graphs depicting trajectories $\|q(t)\|_{2},\,t\geq 0$ determined by (2) and (7) using Example 1. The parameters of (2) are defined as $m=n=4$, $M=1$, $R_{i}=3.5$, $\alpha_{i}=0.05$, $\beta_{i}=1$, $w=(0.05,0.25,1.00,2.00)$, and those of (6) as $\theta=x^{\ast}$, $\eta=0.001$, where $(q^{\ast},x^{\ast})\in\mathbb{O}$ is the equilibrium state minimizing $\max_{1\leq i\leq 4}q_{i}$. In (a), the dotted black line is the minimum 2-norm achievable when the payoff mechanism $p=Gq$ is optimally designed, and the blue line represents the trajectory $\|q(t)\|_{2},\,t\geq 0$ when the population state is determined by (7) with $p=q$. In (b), the blue and orange lines represent the trajectories $\|q(t)\|_{2},\,t\geq 0$ when the population state is determined by (7) and when it is fixed to $x^{\ast}$, respectively.

Given the protocol $\varrho_{i}^{\theta}$ as in (6), we aim to design a payoff mechanism for the agents to asymptotically adopt the optimal strategy profile that minimizes a given cost $c(q)$. For instance, in Example 1, if we design the payoff mechanism as $p=q$, the robots would select strategy $i$ to take on task $i$ and asymptotically minimize $\lim_{t\to\infty}\max_{1\leq i\leq m}q_{i}(t)$, as discussed in [1]. However, in many applications, such a one-to-one correspondence between tasks and available strategies may not exist, and depending on the cost we want to minimize, such a simple payoff mechanism would not be the best design choice, as we illustrate in Figure 1.

In addition, since the payoff mechanism depends on the vector $q(t)$, the mechanism incentivizes the agents to take on the tasks with larger $q_{i}(t)$. Hence, compared to other models that directly control the population state $x(t)$ to the optimal state $x^{\ast}$ (for instance, the model proposed in [9]), our strategy revision model is more responsive to changes in $q(t)$ and hence reduces the task-associated variables $q(t)$ at a faster rate, as we depict in Figure 1.

Two examples of the cost function we consider are

  • (square of) the 2-norm of $q$: $c(q)=\sum_{i=1}^{m}q_{i}^{2}$, and

  • the $\infty$-norm of $q$: $c(q)=\max_{1\leq i\leq m}q_{i}$.

For the payoff mechanism design, we consider a linear model defined by a matrix $G\in\mathbb{R}^{n\times m}$ as follows:

p=Gq.   (9)

Our main problem is to find the matrix $G$ that allows the agents to asymptotically minimize the cost $c(q(t))$. We formally state the problem as follows.

Problem 1

Given the dynamic model (2) of the task allocation game and the EDM (7), compute the payoff matrix $G$ under which the cost $c(q(t))$ is asymptotically minimized.

III Payoff Matrix Design

Figure 2: Feedback model in the task allocation game.

By interconnecting the dynamic model of the game (2), the payoff mechanism (9), and the EDM (7) with (6) as its revision protocol, as illustrated in Figure 2, we can write the state equation of the resulting closed-loop model as follows:

\begin{cases}\dot{q}(t)=-\mathcal{F}(q(t),x(t))+w\\ p(t)=Gq(t)\end{cases}   (10a)

\dot{x}_{i}(t)=M\frac{\theta_{i}\exp(\eta^{-1}p_{i}(t))}{\sum_{l=1}^{n}\theta_{l}\exp(\eta^{-1}p_{l}(t))}-x_{i}(t).   (10b)

Given an initial condition $(q(0),x(0))\in\mathbb{R}_{+}^{m}\times\mathbb{X}_{M}$, we assume the closed-loop model (10) has a unique solution. Let $\mathbb{S}$ be the set of equilibrium states of (10). The proper design of $G$ should ensure that the following two conditions hold.

  1. (R1) The state $(q(t),x(t))$ converges to the stationary points of (10a), i.e., it holds that $\lim_{t\to\infty}\inf_{(r,z)\in\mathbb{O}}\left(\|q(t)-r\|_{2}+\|x(t)-z\|_{2}\right)=0$.

  2. (R2) When the closed-loop model (10) reaches an equilibrium state $(q^{\ast},x^{\ast})$, it attains the minimum cost, i.e., $c(q^{\ast})=\inf_{(q,x)\in\mathbb{O}}c(q)$.

We adopt passivity tools [15, 17] to find technical conditions under which (R1) and (R2) are attained and use the conditions to design the payoff matrix $G$. The critical step in the convergence analysis for (R1) is establishing passivity for both (10a) and (10b) by finding a so-called $\delta$-antistorage function for (10a) and a $\delta$-storage function for (10b). (We refer to [17, Definition 10] and [17, Definition 12], respectively, for the formal definitions of the passivity notions for (10a) and (10b).) Then, by constructing a Lyapunov function from the two storage functions, we establish convergence results for (10).

To proceed, by [22, Lemma 3], (10b) is $\delta$-passive and has the $\delta$-storage function $\mathcal{S}^{\theta}:\mathbb{R}^{n}\times\mathbb{X}_{M}\to\mathbb{R}_{+}$ given by

\mathcal{S}^{\theta}(p,x)=\max_{z\in\mathbb{X}_{M}}\left(p^{T}z-\eta\mathcal{D}(z\,\|\,\theta)\right)-\left(p^{T}x-\eta\mathcal{D}(x\,\|\,\theta)\right),   (11)

where $\mathcal{D}(\cdot\,\|\,\cdot)$ is the KL divergence. Note that $\mathcal{S}^{\theta}$ satisfies

\mathcal{S}^{\theta}(p,x)=0\Leftrightarrow\mathcal{V}^{\theta}(p,x)=0\Leftrightarrow\nabla_{x}^{T}\mathcal{S}^{\theta}(p,x)\,\mathcal{V}^{\theta}(p,x)=0   (12a)

\mathcal{S}^{\theta}(p(t),x(t))-\mathcal{S}^{\theta}(p(t_{0}),x(t_{0}))\leq\int_{t_{0}}^{t}\dot{p}^{T}(\tau)\dot{x}(\tau)\,\mathrm{d}\tau,\quad\forall t\geq t_{0}\geq 0   (12b)

for any payoff vector trajectory $p(t),\,t\geq 0$. The mapping $\mathcal{V}^{\theta}=(\mathcal{V}_{1}^{\theta},\cdots,\mathcal{V}_{n}^{\theta})$ is the vector field of the EDM (7).
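For reference, the maximization in (11) admits a closed-form solution because the maximizer of $p^{T}z-\eta\mathcal{D}(z\,\|\,\theta)$ over $\mathbb{X}_{M}$ has the softmax form targeted by (10b); the sketch below assumes that form (the function and variable names are ours) and evaluates $\mathcal{S}^{\theta}$ numerically, assuming $\theta$ has strictly positive entries.

```python
import numpy as np

def kl_divergence(x, theta):
    """D(x || theta) = sum_i x_i log(x_i / theta_i), with the convention 0 log 0 = 0."""
    ratio = np.where(x > 0, x / theta, 1.0)
    return np.sum(np.where(x > 0, x * np.log(ratio), 0.0))

def storage(p, x, theta, eta, M=1.0):
    """delta-storage function (11) for the KLD-RL EDM."""
    z = p / eta
    z -= z.max()                           # shift for numerical stability
    weights = theta * np.exp(z)
    z_star = M * weights / weights.sum()   # assumed maximizer of p^T z - eta*D(z||theta) over X_M
    max_val = p @ z_star - eta * kl_divergence(z_star, theta)
    return max_val - (p @ x - eta * kl_divergence(x, theta))
```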

The dynamic game model (2) is qualified as $\delta$-antipassive [17] if there is a $\delta$-antistorage function $\mathcal{L}:\mathbb{R}_{+}^{m}\times\mathbb{X}_{M}\to\mathbb{R}_{+}$ satisfying the following two conditions:

\mathcal{L}(q,x)=0\Leftrightarrow\mathcal{F}(q,x)=w\Leftrightarrow\nabla_{q}^{T}\mathcal{L}(q,x)\left(\mathcal{F}(q,x)-w\right)=0   (13a)

\mathcal{L}(q(t),x(t))-\mathcal{L}(q(t_{0}),x(t_{0}))\leq-\int_{t_{0}}^{t}\dot{q}^{T}(\tau)G^{T}\dot{x}(\tau)\,\mathrm{d}\tau,\quad\forall t\geq t_{0}\geq 0,   (13b)

where (13b) needs to hold for any given population state trajectory $x(t),\,t\geq 0$. According to (13a), the function $\mathcal{L}(q,x)$ can be used to measure how far the state $(q,x)$ is from the equilibrium of (10a). By their respective definitions [17], both $\mathcal{S}^{\theta}$ and $\mathcal{L}$ need to be continuously differentiable.

Recall $\mathbb{O}$ given as in (3). For $(q^{\ast},x^{\ast})\in\mathbb{O}$ satisfying

x_{i}^{\ast}>0\implies p_{i}^{\ast}=\max_{1\leq j\leq n}p_{j}^{\ast},\quad\forall i\in\{1,\cdots,n\}   (14)

with $p^{\ast}=Gq^{\ast}$, let us assign $\theta=x^{\ast}$ for (10b). We can establish the following lemma.

Lemma 1

If the dynamic game model (10a) is $\delta$-antipassive, then given that $q(t),\,t\geq 0$ is bounded, the state $(q(t),x(t))$ of the closed-loop model (10) converges to $\mathbb{S}$. Also, $(q^{\ast},x^{\ast})$ is an equilibrium state of (10) for all $\eta>0$.

The proof of the lemma is given in the Appendix. Resorting to Lemma 1, to meet the requirements (R1) and (R2), we need to construct the payoff matrix $G$ in such a way that (10a) becomes $\delta$-antipassive and the state $(q^{\ast},x^{\ast})\in\mathbb{O}$ minimizing $c(q)$ is an equilibrium state of (10). The following theorem states the technical conditions on $G$ that ensure (R1) and (R2). To state the theorem, we define a continuously differentiable mapping $g:\mathbb{R}_{+}^{m}\to\mathbb{R}_{+}^{n}$ that maps any $q\in\mathbb{R}_{+}^{m}$ to $y=g(q)$ satisfying $\mathcal{F}(q,y)=w$. (We remark that $g(q)$ does not necessarily belong to $\mathbb{X}_{M}$.) We interpret $g(q)$ as the strategy profile that attains the equilibrium state for a given $q$ when there is no limit on the population mass $M$. The statement of the theorem holds if such $g$ exists.

Theorem 1

Let us define

h_{i}(q,x)=(\mathcal{F}_{i}(q_{i},x)-w_{i})\,(x-g(q)),\quad i\in\{1,\cdots,m\}

and let $(q^{\ast},x^{\ast})$ be the stationary point of (10a) attaining the minimum cost $\inf_{(q,x)\in\mathbb{O}}c(q)$. Suppose the matrix $G$ satisfies

G\nabla_{x}\mathcal{F}(q,x)=\nabla_{x}^{T}\mathcal{F}(q,x)\,G^{T},\quad\forall(q,x)\in\mathbb{R}_{+}^{m}\times\mathbb{R}_{+}^{n}   (15a)

h_{i}^{T}(q,x)\,G_{i}^{\text{col}}>0,\quad\forall(q,x)\notin\mathbb{O},\,\forall i\in\{1,\cdots,m\}   (15b)

(G_{i}^{\text{row}}-G_{j}^{\text{row}})\,x_{i}^{\ast}q^{\ast}\geq 0,\quad\forall i,j\in\{1,\cdots,n\},   (15c)

where $G_{i}^{\text{col}}$ and $G_{i}^{\text{row}}$ are the column and row vectors of $G$ defined as in (1), respectively. Then, the dynamic game model (10a) is $\delta$-antipassive and $(q^{\ast},x^{\ast})$ is an equilibrium state of (10) with $\theta=x^{\ast}$ for any $\eta>0$.

The proof of the theorem is given in the Appendix. Under the condition (15b), whenever $q_{i}(t)$ is increasing, i.e., $\dot{q}_{i}(t)=-\mathcal{F}_{i}(q_{i}(t),x(t))+w_{i}>0$, the matrix $G$ incentivizes the agents to revise their strategies toward $g(q)$, which is the strategy profile required to drive the rate $\dot{q}(t)$ to zero. In other words, $G$ is designed to encourage the agents to select strategies that reduce the rate $\dot{q}(t)$.

Proposition 1

Let $(q^{\ast},x^{\ast})$ be the stationary point of (10a) attaining the minimum cost $\inf_{(q,x)\in\mathbb{O}}c(q)$. Consider the closed-loop model (10) for which $\theta=x^{\ast}$ and the payoff matrix $G$ satisfies (15). As the parameter $\eta$ of (10b) increases, $(q^{\ast},x^{\ast})$ becomes the unique equilibrium state of (10). In other words, it holds that $\lim_{\eta\to\infty}\sup_{(\bar{q},\bar{x})\in\mathbb{S}}\mathcal{D}(\bar{x}\,\|\,x^{\ast})=0$, where $\mathbb{S}$ is the set of equilibrium states of (10).

The proof of the proposition is provided in the Appendix. In conjunction with Lemma 1 and Theorem 1, Proposition 1 implies that as $\eta$ becomes sufficiently large, the state trajectory $(q(t),x(t)),\,t\geq 0$ converges to near the optimal state $(q^{\ast},x^{\ast})$. According to (6), we note that a smaller $\eta$ is desired to make the agent strategy revision responsive to changes in $p(t)$ and also in $q(t)$. Hence, a good practice is to use a small $\eta$ at the beginning of the task allocation game and, if needed, as $\dot{q}(t)$ goes to zero, to gradually increase the value of $\eta$ to ensure that $x(t)$ converges to $x^{\ast}$.

IV Simulations

We use Examples 1 and 2 to illustrate our main results and discuss how the cost function and the parameters of the dynamic model (2) affect the payoff matrix design. In both examples, we select the fixed parameters $M=1$, $R_{i}=3.5$, $\alpha_{i}=0.05$, and $\beta_{i}=1$ for (4) and (5), and $\eta=0.001$ for (10b). (We select $\eta=0.001$ because all population state trajectories in the simulations converge to the optimal $x^{\ast}$ with this small positive $\eta$.) We use two different cost functions $c(q)$ for Example 1 and two distinct growth rates $w$ for Example 2.

IV-A Computation of $G$

We explain the steps to compute $G$. First, note that (15a) is satisfied if $G$ has the following structure:

  1. For Example 1, $G_{ij}=0$ if $i\neq j$.

  2. For Example 2, $G_{ij}=0$ if $i\notin\mathbb{N}_{j}$ and $G_{ij}=G_{j}$ otherwise, where $G_{j}$ is a real number.

Then, we find $(q^{\ast},x^{\ast})\in\mathbb{O}$ that minimizes the cost function $c(q)$ by solving the following optimization problem:

\min_{(q,x)\in\mathbb{R}_{+}^{m}\times\mathbb{X}_{M}}c(q)\quad\text{subject to}\quad\mathcal{F}(q,x)=w.   (16)

Note that since $\mathcal{F}$ is a nonlinear mapping, the optimization can be non-convex, and the solution we find is locally optimal.
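As an illustration of this step, the following sketch solves (16) for Example 1 with the 2-norm cost using a generic constrained solver; since the problem may be non-convex, the returned point is only a local optimum, and the initial guess is an arbitrary choice of ours.

```python
import numpy as np
from scipy.optimize import minimize

M = 1.0
R, alpha, beta = 3.5, 0.05, 1.0
w = np.array([0.05, 0.25, 1.00, 2.00])

def F(q, x):
    """Reduction rate (4) for Example 1 (m = n = 4)."""
    return R * (np.exp(alpha * q) - 1.0) / (np.exp(alpha * q) + 1.0) * x ** beta

def cost(v):
    """c(q) = sum_i q_i^2, with v = (q, x)."""
    return np.sum(v[:4] ** 2)

constraints = [
    {"type": "eq", "fun": lambda v: F(v[:4], v[4:]) - w},   # F(q, x) = w
    {"type": "eq", "fun": lambda v: np.sum(v[4:]) - M},     # x in X_M
]
v0 = np.concatenate([np.full(4, 50.0), np.full(4, M / 4)])  # arbitrary initial guess
res = minimize(cost, v0, bounds=[(0.0, None)] * 8, constraints=constraints, method="SLSQP")
q_star, x_star = res.x[:4], res.x[4:]
```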

Once we find $(q^{\ast},x^{\ast})$, we compute the matrix $G$ satisfying (15), for which we first need to find the mapping $g$. Instead of explicitly finding $g$, we draw random samples $\{(q_{s},x_{s})\}_{s=1}^{S}\subset\mathbb{R}_{+}^{m}\times\mathbb{X}_{M}$ and find $y_{s}\in\mathbb{R}_{+}^{n}$ that minimizes $\|\mathcal{F}(q_{s},y_{s})-w\|_{2}^{2}$ for each sample $(q_{s},x_{s})$. Note that assuming $\nabla_{x}\mathcal{F}(q,x)$ has full rank at $(q_{s},y_{s})$, which is the case in both examples, the minimizer $y_{s}$ satisfies $\mathcal{F}(q_{s},y_{s})=w$.

As the last step, the design of $G$ can be formulated as the following linear program:

\min_{G\in\mathbb{R}^{n\times m}}1   (17)

\text{subject to }(\mathcal{F}_{i}(q_{s,i},x_{s})-w_{i})\,(x_{s}-y_{s})^{T}G_{i}^{\text{col}}>0,\quad\forall i\in\{1,\cdots,m\},~\forall s\in\{1,\cdots,S\}

\qquad\qquad\;(G_{i}^{\text{row}}-G_{j}^{\text{row}})\,x_{i}^{\ast}q^{\ast}\geq 0,\quad\forall i,j\in\{1,\cdots,n\},

where $q_{s,i}$ is the $i$-th element of $q_{s}=(q_{s,1},\cdots,q_{s,m})$. Since we evaluate the condition (15b) using a finite number of sampled points $\{(q_{s},x_{s})\}_{s=1}^{S}$, we obtain an approximate solution satisfying (15) only at the sampled points. However, as the sample size $S$ tends to infinity, the solution $G$ is more likely to satisfy (15) over the entire state space $\mathbb{R}_{+}^{m}\times\mathbb{X}_{M}$.
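A sketch of this sampling-based computation for Example 1, where $G$ is diagonal by the structure in step 1: here the map $g$ has a closed-form inverse of (4), the strict inequality in (15b) is approximated with a normalized margin (our own numerical choice), and $q^{\ast}$, $x^{\ast}$ are placeholder values standing in for the solution of (16).

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
m, M = 4, 1.0
R, alpha, beta = 3.5, 0.05, 1.0
w = np.array([0.05, 0.25, 1.00, 2.00])
q_star = np.array([70.0, 75.0, 90.0, 110.0])   # placeholder for the solution of (16)
x_star = np.array([0.02, 0.08, 0.31, 0.59])    # placeholder for the solution of (16)

def F(q, x):
    return R * (np.exp(alpha * q) - 1.0) / (np.exp(alpha * q) + 1.0) * x ** beta

def g_map(q):
    """Closed-form inverse of (4): the y with F(q, y) = w (valid for q > 0)."""
    return (w * (np.exp(alpha * q) + 1.0) / (R * (np.exp(alpha * q) - 1.0))) ** (1.0 / beta)

S, eps = 500, 1e-3
A_ub, b_ub = [], []
for _ in range(S):  # condition (15b) at sampled points, for diagonal G = diag(g)
    q_s = rng.uniform(0.5, 150.0, size=m)
    x_s = rng.dirichlet(np.ones(m)) * M
    a = (F(q_s, x_s) - w) * (x_s - g_map(q_s))   # coefficient multiplying g_i in (15b)
    for i in range(m):
        row = np.zeros(m)
        row[i] = -a[i] / (abs(a[i]) + 1e-12)     # normalized margin: sign(a_i) g_i >= eps
        A_ub.append(row)
        b_ub.append(-eps)
for i in range(m):  # condition (15c) for diagonal G: x_i^* (g_i q_i^* - g_j q_j^*) >= 0
    for j in range(m):
        row = np.zeros(m)
        row[i] -= x_star[i] * q_star[i]
        row[j] += x_star[i] * q_star[j]
        A_ub.append(row)
        b_ub.append(0.0)

res = linprog(c=np.zeros(m), A_ub=np.array(A_ub), b_ub=np.array(b_ub), bounds=[(0.0, 1.0)] * m)
G = np.diag(res.x)  # a feasible payoff matrix; cf. the diagonal designs in Section IV-B
```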

IV-B Simulation results for Example 1 ($m=4$, $n=4$)

Figure 3: Graphs depicting the trajectories of the (a) population state $x(t)$ and (b) task-associated vector $q(t)$ derived by the closed-loop model (10) in Example 1 using the cost function $c(q)=\sum_{i=1}^{4}q_{i}^{2}$.

Figure 4: Graphs depicting the trajectories of the (a) population state $x(t)$ and (b) task-associated vector $q(t)$ derived by the closed-loop model (10) in Example 1 using the cost function $c(q)=\max_{1\leq i\leq 4}q_{i}$.

Using the methods explained in Section IV-A, we compute the optimal state $(q^{\ast},x^{\ast})$ minimizing i) $c(q)=\sum_{i=1}^{m}q_{i}^{2}$ and ii) $c(q)=\max_{1\leq i\leq m}q_{i}$, where we use the fixed growth rate $w=(0.05,0.25,1.00,2.00)$ in both cases. Then, we design the payoff matrix $G$ using (17) as follows.

  i. For $c(q)=\sum_{i=1}^{m}q_{i}^{2}$,

     G=\begin{pmatrix}1.00&0.00&0.00&0.00\\ 0.00&0.66&0.00&0.00\\ 0.00&0.00&0.48&0.00\\ 0.00&0.00&0.00&0.40\end{pmatrix}.

  ii. For $c(q)=\max_{1\leq i\leq m}q_{i}$,

     G=\begin{pmatrix}1.00&0.00&0.00&0.00\\ 0.00&1.00&0.00&0.00\\ 0.00&0.00&1.00&0.00\\ 0.00&0.00&0.00&1.00\end{pmatrix}.

Note that when we use the $\infty$-norm to define the cost $c(q)$, the optimal design of $G$ incentivizes the agents equally in proportion to the remaining jobs $q$. On the other hand, when the 2-norm is used, given that the values of $q_{1}(t),\cdots,q_{4}(t)$ are equal, the payoff matrix $G$ assigns the highest payoff to strategy 1 and the lowest payoff to strategy 4. Recall that under the pre-selected growth rate $w$, task 1 has the lowest growth rate and task 4 has the highest, and hence maintaining a low $q_{1}(t)$ is easier – it requires fewer agents – than maintaining a low $q_{4}(t)$. Hence, under the 2-norm cost function, the agents prioritize carrying out the tasks with lower growth rates.

Figures 3 and 4 depict the resulting trajectories of $x(t)$ and $q(t)$. Notice that the population states at the equilibrium in the two cases are similar; however, the trajectories of $q(t)$ are different and, hence, so are the costs evaluated along the trajectories, as we discussed in Figure 1. We observe that there is a large variation in the agent strategy revision at the beginning of the simulations as the agents repeatedly switch among the strategies to reduce the $q_{i}(t)$ with larger values.
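Trajectories like those in Figures 3 and 4 can be reproduced by integrating the closed-loop model (10) directly; the sketch below uses the 2-norm design of $G$ above, while the vector $\theta$ (which should equal $x^{\ast}$) and the initial condition are placeholders of ours.

```python
import numpy as np
from scipy.integrate import solve_ivp

M, eta = 1.0, 0.001
R, alpha, beta = 3.5, 0.05, 1.0
w = np.array([0.05, 0.25, 1.00, 2.00])
G = np.diag([1.00, 0.66, 0.48, 0.40])   # payoff matrix designed for c(q) = sum_i q_i^2
theta = np.full(4, 0.25)                # placeholder; should be set to x_star from (16)

def F(q, x):
    return R * (np.exp(alpha * q) - 1.0) / (np.exp(alpha * q) + 1.0) * x ** beta

def closed_loop(t, s):
    q, x = s[:4], s[4:]
    p = G @ q                            # payoff mechanism (9)
    z = p / eta
    z -= z.max()                         # shift to avoid overflow in exp
    target = M * theta * np.exp(z) / (theta * np.exp(z)).sum()
    return np.concatenate([-F(q, x) + w, target - x])   # (10a) and (10b)

q0, x0 = np.full(4, 50.0), np.full(4, 0.25)              # placeholder initial condition
sol = solve_ivp(closed_loop, (0.0, 100.0), np.concatenate([q0, x0]), max_step=0.1)
q_traj, x_traj = sol.y[:4], sol.y[4:]
```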

IV-C Simulation results for Example 2 ($m=4$, $n=6$)

Figure 5: Graphs depicting the trajectories of the (a) population state $x(t)$ and (b) task-associated vector $q(t)$ derived by the closed-loop model (10) in Example 2 using the growth rate $w=(0.5,1.0,1.5,2.0)$.

Figure 6: Graphs depicting the trajectories of the (a) population state $x(t)$ and (b) task-associated vector $q(t)$ derived by the closed-loop model (10) in Example 2 using the growth rate $w=(0.1,0.5,1.0,2.0)$.

We consider 4 target states and 4 types of sensors, each of which can measure a single state of the target. Each mobile unit can be equipped with two types of sensors, and we define a strategy based on the pair of sensors employed on a mobile unit. According to this strategy definition, we can define the set $\mathbb{N}_{i}$ in (5) as the strategies that can measure the $i$-th target state: $\mathbb{N}_{1}=\{1,2,3\}$, $\mathbb{N}_{2}=\{1,4,5\}$, $\mathbb{N}_{3}=\{2,4,6\}$, and $\mathbb{N}_{4}=\{3,5,6\}$. We use the square of the 2-norm to define the cost function, i.e., $c(q)=\sum_{i=1}^{m}q_{i}^{2}$.

We design $G$ with two distinct growth rates as follows.

  i. For $w=(0.5,1.0,1.5,2.0)$,

     G=\begin{pmatrix}1.00&0.81&0.00&0.00\\ 1.00&0.00&0.72&0.00\\ 1.00&0.00&0.00&0.67\\ 0.00&0.81&0.72&0.00\\ 0.00&0.81&0.00&0.67\\ 0.00&0.00&0.72&0.67\end{pmatrix}.

  ii. For $w=(0.1,0.5,1.0,2.0)$,

     G=\begin{pmatrix}1.68&0.99&0.00&0.00\\ 1.68&0.00&0.80&0.00\\ 1.68&0.00&0.00&0.67\\ 0.00&0.99&0.80&0.00\\ 0.00&0.99&0.00&0.67\\ 0.00&0.00&0.80&0.67\end{pmatrix}.

By comparing the above two payoff matrices, we can infer that the optimal $G$ assigns higher payoffs to the strategies whose associated tasks have smaller growth rates. Figures 5 and 6 depict the resulting trajectories of $x(t)$ and $q(t)$. Notably, as the growth rate of the 4-th target state becomes relatively higher than those of the other target states, more agents adopt strategies 3, 5, and 6, which can be used to measure the 4-th state. Similar to the simulation results in Section IV-B, we observe a large variation in the agent strategy revision at the beginning of the simulations as the agents responsively revise their strategies based on the values of $q_{i}(t)$.

V Conclusions

We investigated the design of the payoff mechanism in task allocation games. The mechanism determines payoffs $p$ given the vector $q$ that quantifies the amount of jobs in the tasks assigned to the agents, and the payoffs incentivize the agents to repeatedly revise their strategy selection. We discussed how to design the payoff matrix $G$ using passivity tools to ensure that the agents asymptotically attain the optimal strategy profile. Using numerical examples, we demonstrated how our results can be used to design $G$ and how the parameters of the dynamic game model affect the optimal design of $G$. As future directions, we plan to consider the design of nonlinear payoff mechanisms $p=G(q)$ and to explore the idea of learning the dynamic model and computing $G$ alongside the model learning.

-A Proof of Lemma 1

From (12b) and (13b), we can derive

\frac{\mathrm{d}}{\mathrm{d}t}\left(\mathcal{S}^{\theta}(p(t),x(t))+\mathcal{L}(q(t),x(t))\right)=\nabla_{x}^{T}\mathcal{S}^{\theta}(p(t),x(t))\,\mathcal{V}^{\theta}(p(t),x(t))+\nabla_{q}^{T}\mathcal{L}(q(t),x(t))\left(-\mathcal{F}(q(t),x(t))+w\right)\leq 0,   (18)

where by (12a) and (13a), the equality holds if and only if $\mathcal{S}^{\theta}(p(t),x(t))=\mathcal{L}(q(t),x(t))=0$. Therefore, by LaSalle’s invariance principle [23], if $q(t)$ is bounded, then the trajectory $(q(t),x(t)),\,t\geq 0$ converges to the set of equilibrium states of (10).

By the definition of the KLD-RL protocol [21, 22], with $\theta=x^{\ast}$, $(q^{\ast},x^{\ast})$ is an equilibrium state of (10) for any $\eta>0$ if it holds that $x^{\ast}\in\operatorname*{arg\,max}_{z\in\mathbb{X}_{M}}(z^{T}Gq^{\ast})$, which can be validated by (14). ∎

-B Proof of Theorem 1

We define a $\delta$-antistorage function $\mathcal{L}:\mathbb{R}_{+}^{m}\times\mathbb{X}_{M}\to\mathbb{R}_{+}$ using a line integral along a curve $s$ from $g(q)$ to $x$:

\mathcal{L}(q,x)=\int_{g(q)}^{x}G\left(\mathcal{F}(q,s)-w\right)\bullet\mathrm{d}s.   (19)

Recall that $g(q)\in\mathbb{R}_{+}^{n}$ is such that $\mathcal{F}(q,g(q))=w$. By evaluating the line integral along the curve given by $s(\tau)=\tau x+(1-\tau)g(q),~\tau\in[0,1]$, we can derive

\mathcal{L}(q,x)=(x-g(q))^{T}G\int_{0}^{1}\left(\mathcal{F}(q,s(\tau))-w\right)\mathrm{d}\tau.   (20)

Note that for every $\tau$ in $(0,1)$, it holds that

(x-g(q))^{T}G\left(\mathcal{F}(q,s(\tau))-w\right)=\frac{1}{\tau}\sum_{i=1}^{m}(s(\tau)-g(q))^{T}G_{i}^{\text{col}}\left(\mathcal{F}_{i}(q_{i},s(\tau))-w_{i}\right).   (21)

Hence, under (15b), we can verify that $\mathcal{L}(q,x)\geq 0$, where the equality holds if and only if $\mathcal{F}(q,x)=w$. In what follows, we show that $\mathcal{L}$ satisfies (13a) and (13b). To begin with, using (15a), we can establish

\nabla_{x}\mathcal{L}(q,x)=G\int_{0}^{1}\left(\mathcal{F}(q,s(\tau))-w\right)\mathrm{d}\tau+\int_{0}^{1}\nabla_{z}^{T}\mathcal{F}(q,z)\Big|_{z=s(\tau)}\tau\,\mathrm{d}\tau\,G^{T}(x-g(q))=G\int_{0}^{1}\frac{\mathrm{d}}{\mathrm{d}\tau}\big(\tau\mathcal{F}(q,s(\tau))\big)\,\mathrm{d}\tau-Gw=G\left(\mathcal{F}(q,x)-w\right).   (22)

Hence, we can derive

\nabla_{x}^{T}\mathcal{L}(q,x)\dot{x}=(\mathcal{F}(q,x)-w)^{T}G^{T}\dot{x}=-\dot{p}^{T}\dot{x}.   (23)

Similarly, the partial derivative of $\mathcal{L}(q,x)$ with respect to $q$ can be computed as

\nabla_{q}\mathcal{L}(q,x)=-\nabla_{q}^{T}g(q)\int_{0}^{1}G\left(\mathcal{F}(q,s(\tau))-w\right)\mathrm{d}\tau+\int_{0}^{1}\bigg(\nabla_{q}^{T}G\mathcal{F}(q,s(\tau))+\nabla_{q}^{T}g(q)\nabla_{z}^{T}G\mathcal{F}(q,z)\Big|_{z=s(\tau)}(1-\tau)\bigg)\mathrm{d}\tau\,(x-g(q))=\int_{0}^{1}\nabla_{q}^{T}\mathcal{F}(q,s(\tau))\,\mathrm{d}\tau\,G^{T}(x-g(q))+\nabla_{q}^{T}g(q)\underbrace{\int_{0}^{1}G\bigg(\frac{\mathrm{d}}{\mathrm{d}\tau}\mathcal{F}(q,s(\tau))(1-\tau)+w\bigg)\mathrm{d}\tau}_{=0}.   (24)

Therefore, we can derive

\nabla_{q}^{T}\mathcal{L}(q,x)\dot{q}=(x-g(q))^{T}\int_{0}^{1}\nabla_{q}G\mathcal{F}(q,s(\tau))\,\mathrm{d}\tau\,(-\mathcal{F}(q,x)+w).

Then, the time derivative of $\mathcal{L}$ becomes

\frac{\mathrm{d}}{\mathrm{d}t}\mathcal{L}(q,x)=(x-g(q))^{T}\int_{0}^{1}\nabla_{q}G\mathcal{F}(q,s(\tau))\,\mathrm{d}\tau\,(-\mathcal{F}(q,x)+w)-\dot{p}^{T}\dot{x}.   (25)

Hence, for $\mathcal{L}$ to satisfy (13b), it suffices to show that the following inequality holds:

(x-g(q))^{T}\int_{0}^{1}\nabla_{q}G\mathcal{F}(q,s(\tau))\,\mathrm{d}\tau\,(-\mathcal{F}(q,x)+w)\leq 0.   (26)

By Assumption 1, we can rewrite the left-hand side of (26) as

(x-g(q))^{T}\int_{0}^{1}\nabla_{q}G\mathcal{F}(q,s(\tau))\,\mathrm{d}\tau\,(-\mathcal{F}(q,x)+w)=\sum_{i=1}^{m}\int_{0}^{1}\frac{\partial\mathcal{F}_{i}}{\partial q_{i}}(q_{i},s(\tau))\,\mathrm{d}\tau\,(x-g(q))^{T}G_{i}^{\text{col}}\left(-\mathcal{F}_{i}(q_{i},x)+w_{i}\right),

where $G_{i}^{\text{col}}$ is the $i$-th column vector of $G$ defined as in (1). Consequently, by Assumption 1, the condition (15b) ensures (26), where the equality holds if and only if $(q,x)\in\mathbb{O}$. Hence, in conjunction with the fact that $\mathcal{L}(q,x)=0\Leftrightarrow\mathcal{F}(q,x)=w$, we can validate that (13a) holds.

Recall that, according to (8), for $(q^{\ast},x^{\ast})$ minimizing the cost function $c(q)$ to be an equilibrium state of the closed-loop model (10), it suffices to show that

x_{i}^{\ast}>0\implies p_{i}^{\ast}=\max_{1\leq j\leq n}p_{j}^{\ast}\implies G_{i}^{\text{row}}q^{\ast}=\max_{1\leq j\leq n}G_{j}^{\text{row}}q^{\ast}\implies(G_{i}^{\text{row}}-G_{j}^{\text{row}})q^{\ast}\geq 0   (27)

holds for all $i,j$ in $\{1,\cdots,n\}$, where $G_{i}^{\text{row}}$ is the $i$-th row vector of $G$ defined as in (1). Hence, the condition (15c) ensures that $(q^{\ast},x^{\ast})$ is an equilibrium state of (10). This completes the proof. ∎

-C Proof of Proposition 1

First of all, according to Lemma 1 and (15c), $(q^{\ast},x^{\ast})$ is an equilibrium state of (10). By the definition of the KLD-RL protocol [21, 22], with $\theta=x^{\ast}$, for any other equilibrium state $(\bar{q},\bar{x})$ of (10), it holds that

(\bar{x}-x^{\ast})^{T}G\bar{q}\geq\eta\,\mathcal{D}(\bar{x}\,\|\,x^{\ast}).   (28)

We prove the statement of the proposition by showing that, for any fixed positive constant $\epsilon$, when $\eta$ is sufficiently large, there is no equilibrium state $(\bar{q},\bar{x})$ of (10) satisfying $\mathcal{D}(\bar{x}\,\|\,x^{\ast})\geq\epsilon$.

By contradiction, for each $\eta>0$, suppose there is an equilibrium state $(\bar{q},\bar{x})$ for which $\mathcal{D}(\bar{x}\,\|\,x^{\ast})\geq\epsilon$ holds. When $\eta$ is sufficiently large, for $(\bar{q},\bar{x})$ to be an equilibrium state of (10), $(\bar{x}-x^{\ast})^{T}G\bar{q}$ needs to be large enough to satisfy (28). Note that $(\bar{x}-x^{\ast})^{T}G\bar{q}=\sum_{i=1}^{m}(\bar{x}-x^{\ast})^{T}G_{i}^{\text{col}}\bar{q}_{i}$. Let $i$ be an index for which $(\bar{x}-x^{\ast})^{T}G_{i}^{\text{col}}\bar{q}_{i}$ becomes arbitrarily large, and so does $\bar{q}_{i}$, as $\eta$ increases. According to (15b), when $q=\bar{q}$ and $x=x^{\ast}$, it holds that $g(\bar{q})=\bar{x}$ and hence we have

(\mathcal{F}_{i}(\bar{q}_{i},x^{\ast})-w_{i})\,(x^{\ast}-\bar{x})^{T}G_{i}^{\text{col}}>0.   (29)

By Assumption 1 and by the fact that $(q^{\ast},x^{\ast})$ is an equilibrium state of (10), as $\bar{q}_{i}$ becomes arbitrarily large, $\mathcal{F}_{i}(\bar{q}_{i},x^{\ast})>w_{i}$ holds, in which case, by (29), it holds that $(\bar{x}-x^{\ast})^{T}G_{i}^{\text{col}}<0$. However, this contradicts the requirement that $(\bar{x}-x^{\ast})^{T}G_{i}^{\text{col}}\bar{q}_{i}$ takes a large positive value. This completes the proof. ∎

References

  • [1] S. Park, Y. D. Zhong, and N. E. Leonard, “Multi-robot task allocation games in dynamically changing environments,” in 2021 IEEE International Conference on Robotics and Automation (ICRA), 2021.
  • [2] L. Parker, “Alliance: an architecture for fault tolerant multirobot cooperation,” IEEE Transactions on Robotics and Automation, vol. 14, no. 2, pp. 220–240, 1998.
  • [3] C. Yang, L. Kaplan, E. Blasch, and M. Bakich, “Optimal placement of heterogeneous sensors for targets with Gaussian priors,” IEEE Transactions on Aerospace and Electronic Systems, vol. 49, no. 3, pp. 1637–1653, 2013.
  • [4] L. Parker and F. Tang, “Building multirobot coalitions through automated task solution synthesis,” Proceedings of the IEEE, vol. 94, no. 7, pp. 1289–1305, 2006.
  • [5] I. Jang, H.-S. Shin, and A. Tsourdos, “Anonymous hedonic game for task allocation in a large-scale multiple agent system,” IEEE Transactions on Robotics, vol. 34, no. 6, pp. 1534–1548, 2018.
  • [6] W. Saad, Z. Han, T. Basar, M. Debbah, and A. Hjorungnes, “Hedonic coalition formation for distributed task allocation among wireless agents,” IEEE Transactions on Mobile Computing, vol. 10, no. 9, 2011.
  • [7] H.-L. Choi, L. Brunet, and J. P. How, “Consensus-based decentralized auctions for robust task allocation,” IEEE Transactions on Robotics, vol. 25, no. 4, pp. 912–926, 2009.
  • [8] S. Amaya and A. Mateus, “Tasks allocation for rescue robotics: A replicator dynamics approach,” in Artificial Intelligence and Soft Computing, L. Rutkowski, R. Scherer, M. Korytkowski, W. Pedrycz, R. Tadeusiewicz, and J. M. Zurada, Eds.   Cham: Springer International Publishing, 2019, pp. 609–621.
  • [9] S. Berman, A. Halasz, M. A. Hsieh, and V. Kumar, “Optimized stochastic policies for task allocation in swarms of robots,” IEEE Transactions on Robotics, vol. 25, no. 4, pp. 927–937, 2009.
  • [10] A. Pashaie, L. Pavel, and C. J. Damaren, “A population game approach for dynamic resource allocation problems,” International Journal of Control, vol. 90, no. 9, pp. 1957–1972, 2017.
  • [11] J. R. Marden, “State based potential games,” Automatica, vol. 48, no. 12, pp. 3075–3088, 2012.
  • [12] N. Li and J. R. Marden, “Designing games for distributed optimization,” IEEE Journal of Selected Topics in Signal Processing, vol. 7, no. 2, pp. 230–242, 2013.
  • [13] W. H. Sandholm, Population Games and Evolutionary Dynamics.   MIT Press, 2011.
  • [14] E. Ramirez-Llanos and N. Quijano, “A population dynamics approach for the water distribution problem,” International Journal of Control, vol. 83, pp. 1947–1964, 2010.
  • [15] M. J. Fox and J. S. Shamma, “Population games, stable games, and passivity,” Games, vol. 4, pp. 561–583, Oct. 2013.
  • [16] J. Barreiro-Gomez, C. Ocampo-Martinez, N. Quijano, and J. M. Maestre, “Non-centralized control for flow-based distribution networks: A game-theoretical insight,” Journal of The Franklin Institute, vol. 354, pp. 5771–5796, 2017.
  • [17] S. Park, N. C. Martins, and J. S. Shamma, “From population games to payoff dynamics models: A passivity-based approach,” in 2019 IEEE 58th Conference on Decision and Control (CDC), 2019.
  • [18] S. Park, J. S. Shamma, and N. C. Martins, “Passivity and evolutionary game dynamics,” in 2018 IEEE Conference on Decision and Control (CDC), 2018, pp. 3553–3560.
  • [19] M. Arcak and N. C. Martins, “Dissipativity tools for convergence to Nash equilibria in population games,” IEEE Transactions on Control of Network Systems, vol. 8, no. 1, pp. 39–50, 2021.
  • [20] N. Quijano, C. Ocampo-Martinez, J. Barreiro-Gomez, G. Obando, A. Pantoja, and E. Mojica-Nava, “The role of population games and evolutionary dynamics in distributed control systems: The advantages of evolutionary game theory,” IEEE Control Systems Magazine, vol. 37, no. 1, pp. 70–97, 2017.
  • [21] S. Park and N. E. Leonard, “KL divergence regularized learning model for multi-agent decision making,” in 2021 American Control Conference (ACC), 2021, pp. 4509–4514.
  • [22] ——, “Learning with delayed payoffs in population games using Kullback-Leibler divergence regularization,” arXiv.org, 2023.
  • [23] H. K. Khalil, Nonlinear Systems; 3rd ed.   Upper Saddle River, NJ: Prentice-Hall, 2002.