Payoff Mechanism Design for Coordination
in Multi-Agent Task Allocation Games
Abstract
We investigate a multi-agent decision-making problem where a large population of agents is responsible for carrying out a set of assigned tasks. The amount of jobs in each task varies over time, governed by a dynamical system model. Each agent needs to select one of the available strategies to take on one or more tasks. Since each strategy allows an agent to perform multiple tasks at a time, possibly at distinct rates, the agents' strategy selections need to be coordinated. We formulate the problem using the population game formalism and refer to it as the task allocation game. We discuss the design of a decision-making model that incentivizes the agents to coordinate in the strategy selection process.
As key contributions, we propose a method to find a payoff-driven decision-making model, and discuss how the model allows the strategy selection of the agents to be responsive to the amount of remaining jobs in each task while asymptotically attaining the optimal strategies. Leveraging analytical tools from feedback control theory, we derive technical conditions that the model needs to satisfy, which are used to construct a numerical approach to compute the model. We validate our solution through simulations to highlight how the proposed approach coordinates the agents in task allocation games.
I Introduction
We investigate task allocation games to study coordination in repeated strategic interactions in a large population of agents. Consider that there is a finite number of tasks to be carried out by the agents. We quantify the amount of jobs remaining in each task with a positive variable, and every agent can select one of the available strategies at a time to take on one or more tasks. The main objective is to design a decentralized decision-making model that allows the agents to coordinate and minimize remaining jobs in all tasks.
Task allocation games are relevant to engineering applications. For instance, in multi-robot resource retrieval [1, 2], a team of robots is tasked with searching for and collecting target resources across partitioned areas of a given environment. Each task can be defined as collecting resources from an area, and selecting a strategy corresponds to taking on one of the tasks. In target tracking applications [3], a group of mobile units with heterogeneous sensing capabilities is deployed to collect data about the states of multiple targets of interest. Based on the type of equipped sensors, each mobile unit can collect different sets of data on the targets' states. A task is defined as collecting data on a portion of the target states, and a strategy specifies which pair of available sensors a mobile unit carries. In both scenarios, the amount of resources to collect and the data to gather vary depending on the agents' past strategy selections as well as on environmental changes and target dynamics.
To design a model for the agents' strategy selection in such engineering applications, we investigate task allocation in dynamically changing environments. Multi-agent task allocation problems have been widely studied across various research communities [4, 5, 6, 7, 8, 9, 10]. A game-theoretic approach to the problem using replicator dynamics is investigated in [8]. The authors of [5, 6] use hedonic games to study the coordination of multiple agents in task allocation. Applications of population-game approaches to task allocation in swarm robotics [9] and to the control of a water distribution system [10] have also been discussed. In addition, relevant to the task allocation game that we investigate in this work, whose formalism is defined on a state space, state-based potential games have been studied in [11], and the design of state-based games to solve distributed optimization problems is proposed in [12].
A majority of existing works assume that the environment underlying the game is static and aim to find the optimal task allocation. In contrast, we study the design of a decision-making model under which the agents can repeatedly switch among multiple tasks to minimize the remaining jobs in the tasks. We adopt the population game formalism [13] to state the problem and to study the decision-making model design. The model prescribes how the agents take on a given set of tasks and how they should switch among the tasks by revising their strategy selection to asymptotically attain optimality. We consider that each agent in the population is given a set of strategies to carry out the assigned tasks, and we denote the agents' strategy profile – the distribution of the agents' strategy selection – by a non-negative vector. The remaining jobs associated with the tasks are denoted by another non-negative vector, and a dynamic model describes how this vector changes – both its growth due to environmental changes and its reduction by the agents – based on the agents' strategy selection.
Based on the evolutionary dynamics framework [13], we specify a decentralized decision-making model that allows individual agents to revise their strategy selection based on a payoff vector, where each entry is the payoff an agent receives when it selects the corresponding strategy. As the main contribution, we design a payoff mechanism that defines how the payoff vector should depend on the remaining jobs so as to encourage the agents to select the tasks with more jobs to perform and to asymptotically attain the minimum of a given cost. Applying convergence analysis tools [14, 15, 16, 17, 18, 19] from the population games literature that are based on passivity theory, we establish conditions under which the agents' strategy profile converges and asymptotically attains the optimal profile. We use these conditions to compute the payoff mechanism.
The paper is organized as follows. In Section II, we explain the task allocation game formulation and the main problem we address in this paper. In Section III, we present the main result on the payoff mechanism design and an analysis of the convergence of the agents' strategy revision process to the optimal strategy profile. In Section IV, we present simulation results to illustrate our main contribution. We conclude the paper with a summary and future plans in Section V.
II Problem Description
Consider a large population of agents that are assigned tasks and are given strategies to carry out the tasks. (The number of tasks is not necessarily the same as the number of available strategies.) We associate each task with a variable that quantifies the amount of jobs remaining in the task. Let each entry of the population state denote the portion of the agents selecting the corresponding strategy, and let a fixed positive number be the mass of the agent population. (Considering the population state as a control input to (2), the population mass can be interpreted as a limit on the control input.) Each agent selects one of the strategies at a time based on the payoff vector.
Let the non-negative orthant be the set of all vectors with non-negative entries, and let the space of feasible population states be the set of non-negative strategy profiles whose entries sum to the population mass. Given a matrix, we represent it using its column and row vectors as follows:
(1)
II-A Task Allocation Games
To investigate the task allocation problem, we formalize it as a large population game in which the agents select strategies to perform the jobs remaining in the assigned tasks. The vector of remaining jobs varies over time based on the agents' strategy selection and changes in the environment. Hence, each agent needs to evaluate the remaining jobs and adaptively select a strategy accordingly.
Given and , at each time , the following ordinary differential equation describes the rate of change of .
(2)
where the reduction rate is a continuously differentiable mapping that quantifies how fast the agents adopting a given strategy profile reduce the remaining jobs, and the constant vector represents the growth rate of the remaining jobs due to environmental changes. (To have the reduction rate mapping defined for any population mass, we define its domain accordingly.) To ensure that the positive orthant is forward-invariant under (2), each component of the reduction rate is required to vanish when the remaining jobs in its associated task reach zero. For notational convenience, let us define the set of stationary points of (2) as
(3)
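To make the structure of (2) concrete, the sketch below integrates the task dynamics numerically. Since the symbols of (2) and the expression (4) are not reproduced above, the notation here (q for remaining jobs, x for the strategy profile, w for the growth rate) and the saturating form of the reduction rate are our own illustrative assumptions rather than the paper's exact expressions.

```python
import numpy as np

def reduction_rate(x, q, r, alpha):
    """Illustrative reduction rate s(x, q): strategy i works only on task i
    (as in Example 1), at a rate that grows with x_i and saturates in q_i.
    This specific saturating form is an assumption standing in for the elided
    expression (4); it vanishes at q_i = 0 and satisfies Assumption 1."""
    return r * x * (1.0 - np.exp(-alpha * q))

def q_dot(q, x, w, r, alpha):
    """Task dynamics (2): growth rate w minus the agents' reduction rate."""
    return w - reduction_rate(x, q, r, alpha)

# Forward-Euler rollout under a fixed strategy profile (hypothetical numbers).
q = np.array([5.0, 3.0, 1.0])                    # remaining jobs per task
x = np.array([0.5, 0.3, 0.2])                    # strategy profile with mass m = 1
w = np.array([0.2, 0.1, 0.05])                   # environmental growth rates
r, alpha = np.ones(3), np.full(3, 0.8)
dt = 0.01
for _ in range(2000):
    q = np.maximum(q + dt * q_dot(q, x, w, r, alpha), 0.0)
print(q)                                         # approaches a stationary point of (2)
```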
We make the following assumption on the mapping .
Assumption 1
The reduction rate for each task depends only on its associated task variable and the agents' strategy selection, and it increases as the task variable does. For instance, in the resource retrieval application discussed in Section I, when there is a larger volume of resources spread out across the areas, the robots need to travel a shorter distance on average to locate and retrieve the resources; hence, given a fixed strategy profile, the task variable decreases at a faster rate. We formalize this assumption as monotonicity of each reduction rate in its associated task variable.
According to Assumption 1, we represent the reduction rate as , where for fixed , each is an increasing function of .
Remark 1
Suppose that, for a given strategy profile in the feasible state space, there is a task vector satisfying the stationarity condition. By Assumption 1, it is unique.
The following examples illustrate how the dynamic game model (2) can be adopted in control systems applications.
Example 1 (Multi-Robot Resource Collection [1])
Let and be defined as
(4)
where the remaining parameters are positive constants: one represents the maximum reduction rate associated with each strategy, and the other two are coefficients specifying how the reduction rate depends on the task variable and on the strategy profile, respectively. Note that each function satisfies the forward-invariance condition stated after (2) as well as Assumption 1. Here, the number of strategies equals the number of tasks, and only the agents selecting a given strategy can reduce the variable associated with its corresponding task.
Example 2 (Heterogeneous Sensor Scheduling [3])
We adopt the model (2) as an abstract description of how mobile units’ sensor scheduling affects the uncertainty reduction in estimating states of multiple targets. Let and be defined as
(5)
where the indicated set denotes the strategies (available sensor configurations of a mobile unit) that can collect data on the state of the corresponding target. The parameters have the same interpretation as in Example 1. Unlike the previous example, the strategies are defined so that the agents can reduce multiple task-associated variables at once.
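Because each strategy in Example 2 can work on several tasks at once, the reduction rate for a task aggregates the contributions of all strategies that cover it. The sketch below encodes such a coverage structure explicitly; the coverage map and the rate form are hypothetical stand-ins, since (5) and the set definitions are not reproduced above.

```python
import numpy as np

# Hypothetical coverage map: strategy j (a sensor pair) works on the tasks in cover[j].
cover = {0: [0, 1], 1: [0, 2], 2: [1, 3], 3: [2, 3]}

def reduction_rate_multi(x, q, r, alpha):
    """Example-2-style reduction rate: every strategy covering task i contributes
    to reducing q_i (an illustrative stand-in for the elided expression (5))."""
    s = np.zeros_like(q)
    for j, tasks in cover.items():
        for i in tasks:
            s[i] += r[i] * x[j] * (1.0 - np.exp(-alpha[i] * q[i]))
    return s

x = np.full(4, 0.25)                    # uniform strategy profile, mass m = 1
q = np.array([2.0, 1.0, 3.0, 0.5])      # remaining jobs for the four tasks
print(reduction_rate_multi(x, q, r=np.ones(4), alpha=np.ones(4)))
```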
Example 3 (Water Distribution Control [14, 20, 16])
There are multiple reservoirs, each of which is assigned a respective maximum water level. Denote by a time-varying vector the water levels of the reservoirs. Let a constant outflow specify the water demands of consumers and let the inflows be controllable, where the number of inflows does not necessarily coincide with the number of reservoirs. The simplified dynamics for the water levels can be defined so that each reservoir cannot hold water above its maximum level. By defining the remaining space in each reservoir as the difference between its maximum and current water levels, we can derive a dynamic model of the form (2).
II-B Agent Strategy Revision Model
Our model is based on the evolutionary dynamics framework [13] in which the strategy revision protocol determines an agent’s strategy revision based on the payoff vector , where is a parameter of the protocol. We adopt the Kullback-Leibler Divergence Regularized Learning (KLD-RL) protocol [21, 22] to define as
(6)
The protocol describes the probability of an agent switching to a given strategy based on the payoff vector and the protocol parameters. Note that the smaller the regularization parameter, the more the strategy revision depends on the payoff values.
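Since the expression (6) is not reproduced above, the sketch below implements a KL-divergence-regularized choice rule of the kind used in [21, 22] – a softmax weighted by a reference distribution with regularization weight eta – as an assumed stand-in for the exact protocol. Consistent with the remark above, a smaller eta makes the choice probabilities depend more strongly on the payoffs.

```python
import numpy as np

def kld_rl_choice(p, theta, eta):
    """Assumed KLD-RL choice probabilities: the maximizer of p'y - eta * KL(y || theta)
    over the probability simplex, i.e., y_i proportional to theta_i * exp(p_i / eta)."""
    z = theta * np.exp((p - np.max(p)) / eta)    # shift by max(p) for numerical stability
    return z / np.sum(z)

p = np.array([1.0, 0.5, 0.2])                    # hypothetical payoff vector
theta = np.full(3, 1.0 / 3.0)                    # reference distribution (protocol parameter)
print(kld_rl_choice(p, theta, eta=0.1))          # concentrates on the high-payoff strategy
```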
Each agent is given an opportunity to revise its strategy selection at each jump time of an independent and identically distributed Poisson process, and uses the protocol to select a new strategy or keep its current strategy selection. Since the strategy revision of individual agents only depends on the payoff vector and takes place independently of each other, their decision-making is decentralized and the coordination among them occurs implicitly through their decision-making model. Based on discussions in [13, Chapter 4], as the number of agents in the population tends to infinity, the following ordinary differential equation describes how each component of evolves over time.
(7)
We refer to (7) as the Evolutionary Dynamics Model (EDM).
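Below is a minimal sketch of the mean dynamics (7), reusing kld_rl_choice from the previous sketch: revision opportunities arrive at unit rate, and revising agents redistribute according to the protocol, so the strategy profile is pulled toward the population mass times the choice distribution. The exact right-hand side of (7) is not reproduced above, so the unit revision rate here is an assumption.

```python
def edm_rhs(x, p, theta, eta, m=1.0):
    """Assumed mean dynamics of the strategy profile under the KLD-RL protocol:
    dx/dt = m * rho(p, x) - x, where rho is the choice rule of the protocol."""
    return m * kld_rl_choice(p, theta, eta) - x
```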
Note that at an equilibrium state of the EDM (7) under the KLD-RL protocol (6), if , the following implication holds:
(8)
Eq. (8) means that every agent receives the highest payoff at if the parameter of (6) is the same as .
Given the protocol as in (6), we aim to design a payoff mechanism for the agents to asymptotically adopt the optimal strategy profile that minimizes a given cost . For instance, in Example 1, if we design the payoff mechanism as , the robots would select strategy to take on task and asymptotically minimize , as discussed in [1]. However, in many applications, such one-to-one correspondence between tasks and available strategies may not exist, and depending on the cost we want to minimize, such a simple payoff mechanism would not be the best design choice as we illustrate in Figure 1.
In addition, since the payoff mechanism depends on the vector , the mechanism would incentivize the agents to take on the tasks with larger . Hence, compared to other models that directly control the population state to the optimal state (for instance, the model proposed in [9]), our strategy revision model is more responsive to changes of and hence reduces the task-associated variables at a faster rate as we depict in Figure 1.
Two examples of the cost function we consider are the following (a short code rendering is given after the list):
• the (square of the) 2-norm of the remaining-job vector, and
• the -norm of the remaining-job vector.
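For concreteness, the two candidate costs can be written as below. The index of the second norm is not specified above; the infinity-norm is used here purely as an illustrative assumption.

```python
import numpy as np

def cost_two_norm_squared(q):
    """Square of the 2-norm of the remaining-job vector."""
    return float(np.dot(q, q))

def cost_sup_norm(q):
    """Assumed second cost: the infinity-norm (largest remaining-job entry)."""
    return float(np.max(np.abs(q)))
```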
For the payoff mechanism design, we consider a linear model defined by a matrix as follows:
(9)
Our main problem is to find the matrix that allows the agents to asymptotically minimize the cost. We formally state the problem as follows.
III Payoff Matrix Design
By interconnecting the dynamic model of the game (2), the payoff mechanism (9), and the EDM (7) with (6) as its revision protocol, as illustrated in Figure 2, we can write the state equation of the resulting closed-loop model as follows:
(10a)
(10b)
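Putting the pieces together, the sketch below integrates the closed-loop model (10) by combining the task dynamics (2), the payoff mechanism (9) – taken here as p = G q, an assumed rendering of the linear model since (9) is not reproduced above – and the mean dynamics (7). It reuses q_dot and edm_rhs from the earlier sketches, and the parameter values are hypothetical.

```python
import numpy as np
from scipy.integrate import solve_ivp

def closed_loop_rhs(t, z, G, w, r, alpha, theta, eta, m, n_tasks):
    """State z stacks (q, x); payoffs are generated by the linear mechanism p = G q."""
    q, x = z[:n_tasks], z[n_tasks:]
    p = G @ q                                    # payoff mechanism (9), assumed form
    dq = q_dot(q, x, w, r, alpha)                # task dynamics (2)
    dx = edm_rhs(x, p, theta, eta, m)            # strategy revision dynamics (7)
    return np.concatenate([dq, dx])

n = 3
G = np.eye(n)                                    # placeholder payoff matrix (not a designed one)
z0 = np.concatenate([np.array([5.0, 3.0, 1.0]), np.full(n, 1.0 / n)])
sol = solve_ivp(closed_loop_rhs, (0.0, 50.0), z0,
                args=(G, np.array([0.2, 0.1, 0.05]), np.ones(n), np.ones(n),
                      np.full(n, 1.0 / n), 0.1, 1.0, n))
print(sol.y[:, -1])                              # (q, x) at the final time
```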
Given an initial condition, we assume the closed-loop model (10) has a unique solution. Let the equilibrium set denote the set of equilibrium states of (10). The proper design of the payoff matrix should ensure that the following two conditions hold: (R1) the state trajectory of (10) converges to the set of equilibrium states, and (R2) the state minimizing the given cost is an equilibrium state of (10).
We adopt passivity tools [15, 17] to find technical conditions under which (R1) and (R2) are attained and use these conditions to design the payoff matrix. The critical step in the convergence analysis (R1) is establishing passivity for both (10a) and (10b) by finding a so-called δ-antistorage function for (10a) and a δ-storage function for (10b). (We refer to [17, Definition 10] and [17, Definition 12], respectively, for the formal definitions of the passivity notions used for (10a) and (10b).) Then, by constructing a Lyapunov function from the two storage functions, we establish convergence results for (10).
To proceed, by [22, Lemma 3], (10b) is δ-passive and has the δ-storage function given by
(11)
where is the KL divergence. Note that satisfies
(12a)
(12b)
for any payoff vector trajectory . The mapping is the vector field of the EDM (7).
The dynamic game model (2) is qualified as δ-antipassive [17] if there is a δ-antistorage function satisfying the following two conditions:
(13a)
(13b)
where (13b) needs to hold for any given population state trajectory . According to (13a), the function can be used to measure how far the state is from the equilibrium of (10a). By their respective definitions [17], both and need to be continuously differentiable.
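Schematically, the convergence argument combines the two storage functions into one Lyapunov candidate. Writing σ(t) for the common supply rate that appears in (12b) and (13b) (whose exact expressions are not reproduced above), the interconnection argument takes the following generic form; this is a sketch rather than the paper's precise statement.

```latex
% Generic passivity-interconnection sketch; the precise supply rate sigma and
% storage functions follow (11)-(13) and [17].
\begin{align*}
  \dot{\mathcal{S}}(t) &\le \sigma(t)
      && \text{($\delta$-passivity of the EDM (10b), cf.\ (12b))} \\
  \dot{\mathcal{L}}(t) &\le -\sigma(t)
      && \text{($\delta$-antipassivity of the game (10a), cf.\ (13b))} \\
  \Rightarrow\;\; \tfrac{d}{dt}\bigl(\mathcal{S}(t)+\mathcal{L}(t)\bigr) &\le 0.
\end{align*}
```

Hence the sum of the two storage functions is non-increasing along trajectories of (10), and invariance-type arguments (e.g., LaSalle's principle, cf. [23]) yield convergence to the equilibrium set.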
Recall given as in (3). For satisfying
(14)
with , let us assign for (10b). We can establish the following lemma.
Lemma 1
The proof of the lemma is given in the Appendix. Resorting to Lemma 1, to meet the requirements (R1) and (R2), we need to construct the payoff matrix in such a way that (10a) becomes δ-antipassive and the state minimizing the cost is an equilibrium state of (10). The following theorem states the technical conditions on the payoff matrix that ensure (R1) and (R2). To state the theorem, we define a continuously differentiable mapping that maps any task vector to a strategy profile satisfying the stationarity condition. (We remark that this profile does not necessarily belong to the feasible state space; we interpret it as the strategy profile that attains the equilibrium for a given task vector when there is no limit on the population mass.) The statement of the theorem holds if such a mapping exists.
Theorem 1
Let us define
and let be the stationary point of (10a) attaining the minimum cost . Suppose the matrix satisfies
(15a)
(15b)
(15c)
where the column and row vectors of the payoff matrix are defined as in (1), respectively. The dynamic game model (10a) is δ-antipassive, and the cost-minimizing stationary point is an equilibrium state of (10), with the protocol reference of (6) set to this point, for any positive regularization parameter.
The proof of the theorem is given in the Appendix. Under the condition (15b), whenever the remaining jobs are increasing, the payoff matrix incentivizes the agents to revise their strategies toward the profile required to drive the growth rate of the remaining jobs to zero. In other words, the payoff matrix is designed to encourage the agents to select strategies that reduce the rate of change of the remaining jobs.
Proposition 1
Let be the stationary point of (10a) attaining the minimum cost . Consider the closed-loop model (10) for which and the payoff matrix satisfies (15). As the parameter of (10b) increases, becomes the unique equilibrium state of (10). In other words, it holds that , where is the set of equilibrium states of (10).
The proof of the proposition is provided in the Appendix. In conjunction with Lemma 1 and Theorem 1, Proposition 1 implies that as the regularization parameter becomes sufficiently large, the state trajectory converges to a neighborhood of the optimal state. According to (6), a smaller parameter is desired to make the agents' strategy revision responsive to changes in the payoffs and in the remaining jobs. Hence, a good practice is to use a smaller parameter at the beginning of the task allocation game and, if needed, as the remaining jobs approach zero, to gradually increase its value to ensure that the population state converges to the optimal profile.
IV Simulations
We use Examples 1 and 2 to illustrate our main results and discuss how the cost function and the parameters of the dynamic model (2) affect the payoff matrix design. In both examples, we select fixed parameters for (4) and (5) and a fixed regularization parameter for (10b). (We select a small positive regularization parameter because all population state trajectories in the simulations converge to the optimal state with this choice.) We use two different cost functions for Example 1 and two distinct growth rates for Example 2.
IV-A Computation of
We explain the steps to compute the payoff matrix. First, note that (15a) is satisfied if the matrix has the following structure:
Then, we find the stationary point that minimizes the cost function by solving the following optimization problem.
(16)
Note that since the reduction rate is a nonlinear mapping, the optimization can be non-convex, and the solution we find is only locally optimal.
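Since (16) is not reproduced above, the sketch below solves an assumed rendering of it with a generic nonlinear solver: minimize the cost over pairs of a strategy profile and a task vector, subject to the stationarity condition of (2), the mass constraint on the profile, and non-negativity. It reuses reduction_rate from the sketch in Section II; as noted above, the problem can be non-convex, so only a local minimizer is returned.

```python
import numpy as np
from scipy.optimize import minimize

def solve_optimal_stationary_point(cost, w, r, alpha, n, m=1.0):
    """Assumed rendering of (16): minimize cost(q) over z = (x, q) subject to
    stationarity of (2) (w = s(x, q)), sum(x) = m, and x, q >= 0."""
    constraints = [
        {"type": "eq", "fun": lambda z: np.sum(z[:n]) - m},
        {"type": "eq", "fun": lambda z: w - reduction_rate(z[:n], z[n:], r, alpha)},
    ]
    z0 = np.concatenate([np.full(n, m / n), np.ones(n)])
    res = minimize(lambda z: cost(z[n:]), z0, method="SLSQP",
                   bounds=[(0.0, None)] * (2 * n), constraints=constraints)
    return res.x[:n], res.x[n:]                  # (x_star, q_star), a local minimizer

x_star, q_star = solve_optimal_stationary_point(
    cost=lambda q: float(np.dot(q, q)), w=np.array([0.2, 0.1, 0.05]),
    r=np.ones(3), alpha=np.ones(3), n=3)
```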
Once we find the optimal stationary point, we compute the matrix satisfying (15), for which we first need the mapping defined before Theorem 1. Instead of finding the mapping explicitly, we draw random samples of the task vector and, for each sample, find the strategy profile that minimizes the residual of the stationarity condition. Note that, assuming full rank of the reduction rate mapping at the sampled point, which is the case in both examples, the minimizer satisfies the stationarity condition exactly.
As the last step, the design of the payoff matrix can be formulated as the following linear program:
(17)
where the decision variables are the elements of the payoff matrix. Since we evaluate the condition (15b) using a finite number of sampled points, we obtain an approximate solution satisfying (15) only at the sampled points. However, as the sample size grows, the solution is more likely to satisfy (15) over the entire state space.
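Since (15) and (17) are not reproduced above, the sketch below only illustrates the overall recipe described in this subsection: sample task vectors, compute for each a profile attaining stationarity (the role of the mapping defined before Theorem 1), turn the design conditions evaluated at the samples into linear constraints on the entries of the payoff matrix, and solve the resulting linear program with a generic solver. The constraint builder is a hypothetical surrogate for (15b)-(15c), not the paper's conditions, and the l1-type objective is likewise an assumption; it reuses q_star and x_star from the previous sketch.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)

def chi(q, w, r, alpha):
    """Stationarity-attaining profile for a sampled q under the one-task-per-strategy
    rate of the earlier sketch (solves w = s(x, q) for x); it may exceed the mass
    limit, consistent with the remark before Theorem 1."""
    return w / (r * (1.0 - np.exp(-alpha * q)) + 1e-9)

def design_G(q_star, x_star, w, r, alpha, n, n_samples=200, margin=1e-3):
    """Hypothetical LP surrogate for (17): at each sample, favor the stationarity-
    attaining profile over the uniform one (stand-in for (15b)); at q_star, give the
    most-used strategy of x_star the largest payoff (stand-in for (15c)); minimize
    the l1 size of G. Decision variables: vec(G) followed by auxiliary bounds t."""
    A_ub, b_ub = [], []
    for _ in range(n_samples):
        q = rng.uniform(0.1, 5.0, size=n)
        d = chi(q, w, r, alpha) - np.full(n, 1.0 / n)
        A_ub.append(np.concatenate([-np.outer(d, q).ravel(), np.zeros(n * n)]))
        b_ub.append(-margin)                       # enforces d' G q >= margin
    i_best = int(np.argmax(x_star))
    for j in range(n):                             # (G q_star)_j <= (G q_star)_{i_best}
        row = np.zeros(n * n)
        row[j * n:(j + 1) * n] += q_star
        row[i_best * n:(i_best + 1) * n] -= q_star
        A_ub.append(np.concatenate([row, np.zeros(n * n)]))
        b_ub.append(0.0)
    eye = np.eye(n * n)                            # |G_ij| <= t_ij via two inequalities
    A_ub += list(np.hstack([eye, -eye])) + list(np.hstack([-eye, -eye]))
    b_ub += [0.0] * (2 * n * n)
    c = np.concatenate([np.zeros(n * n), np.ones(n * n)])
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(None, None)] * (n * n) + [(0.0, None)] * (n * n))
    return res.x[:n * n].reshape(n, n) if res.success else None  # None if infeasible

G = design_G(q_star, x_star, w=np.array([0.2, 0.1, 0.05]),
             r=np.ones(3), alpha=np.ones(3), n=3)
```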
IV-B Simulation results for Example 1 ()
Using the methods explained in Section IV-A, we compute the optimal state minimizing each of the two cost functions i) and ii), where we use the same fixed growth rate for both cases. Then, we design the payoff matrix using (17) as follows.
i. For the first cost function,
ii. For the second cost function,
Note that when one of the two norms is used to define the cost, the optimal design of the payoff matrix incentivizes the agents in proportion to the remaining jobs in each task, with equal weights across the tasks. When the other norm is used, given equal amounts of remaining jobs across the tasks, the payoff matrix assigns the highest payoff to the strategy associated with the task with the lowest growth rate and the lowest payoff to the strategy associated with the task with the highest growth rate. Recall that, under the pre-selected growth rate, maintaining a low level of remaining jobs in the task with the lowest growth rate is easier – it requires fewer agents – than in the task with the highest growth rate. Hence, under the latter cost function, the agents prioritize the tasks with lower growth rates.
Figures 3 and 4 depict the resulting trajectories for the two cost functions. Notice that the population states at the equilibrium in the two cases are similar; however, the trajectories of the remaining jobs are different and, hence, so are the costs evaluated along the trajectories, as we discussed for Figure 1. We observe a large variation in the agents' strategy revision at the beginning of the simulations as the agents repeatedly switch among the strategies to reduce the task variables with larger values.
IV-C Simulation results for Example 2 ()
We consider that there are 4 target states and 4 types of sensors, each of which can measure a single state of the target. Each mobile unit can be equipped with two types of sensors, and we define a strategy by the pair of sensors employed on a mobile unit. According to the strategy definition, we can define each set in (5) as the strategies that can measure the corresponding target state. We use the square of the 2-norm of the remaining jobs to define the cost function.
We design with two distinct growth rates as follows.
i. For the first growth rate,
ii. For the second growth rate,
By comparing the above two payoff matrices, we can infer that the optimal design assigns higher payoffs to strategies whose associated growth rates are smaller. Figures 5 and 6 depict the resulting trajectories for the two growth rates. Notably, as the growth rate of a target state becomes relatively higher than those of the other target states, more agents adopt the strategies that can be used to measure that state. Similar to the simulation results in Section IV-B, we observe a large variation in the agents' strategy revision at the beginning of the simulations as the agents responsively revise their strategies based on the remaining jobs.
V Conclusions
We investigated the design of the payoff mechanism in task allocation games. The mechanism determines payoffs based on the vector that quantifies the amount of jobs remaining in the tasks assigned to the agents, and the payoffs incentivize the agents to repeatedly revise their strategy selection. We discussed how to design the payoff matrix using passivity tools to ensure that the agents asymptotically attain the optimal strategy profile. Using numerical examples, we demonstrated how our results can be used to design the payoff matrix and how the parameters of the dynamic game model affect the optimal design. For future directions, we plan to consider the design of nonlinear payoff mechanisms and to explore the idea of learning the dynamic model and computing the payoff matrix alongside the model learning.
-A Proof of Lemma 1
-B Proof of Theorem 1
We define a δ-antistorage function using a line integral along a curve connecting the two points:
(19)
Recall that is such that . By evaluating the line integral along the curve given by , we can derive
(20)
Note that for every in , it holds that
(21)
Hence, under (15b), we can verify that where the equality holds if and only if . In what follows, we show that satisfies (13a) and (13b). To begin with, using (15a), we can establish
(22)
Hence, we can derive
(23)
Similarly, the partial derivative of with respect to can be computed as
(24)
Therefore, we can derive
Then, the time derivative of becomes
(25)
Hence, for to satisfy (13b), it suffices to show that the following inequality holds.
(26)
By Assumption 1, we can rewrite the above equation as
where is the -th column vector of defined as in (1). Consequently, by Assumption 1, the condition (15b) ensures (26), where the equality holds if and only if . Hence, in conjunction with the fact that , we can validate that (13a) holds.
Recall that, according to (8), for the state minimizing the cost function to be an equilibrium state of the closed-loop model (10), it suffices to show that
(27)
holds for all strategies, where the row vectors of the payoff matrix are defined as in (1). Hence, the condition (15c) ensures that the cost-minimizing state is an equilibrium state of (10). This completes the proof. ∎
-C Proof of Proposition 1
First of all, according to Lemma 1 and (15c), is an equilibrium point of (10). By the definition of the KLD-RL protocol [21, 22], with , for any other equilibrium state of (10), it holds that
(28)
We prove the statement of the proposition by showing that for any fixed positive constant , when is sufficiently large, there is no equilibrium state of (10) satisfying .
By contradiction, for each , suppose there is an equilibrium state for which holds. When is sufficiently large, for to be an equilibrium state of (10), needs to be large enough to satisfy (28). Note that . Let be an index for which becomes arbitrarily large and so does as increases. According to (15b), when and , it holds that and hence we have
(29)
By Assumption 1 and by the fact that is an equilibrium state of (10), as becomes arbitrarily large, holds in which case by (29), it holds that . However, this contradicts the requirement that takes a large positive value. This completes the proof. ∎
References
- [1] S. Park, Y. D. Zhong, and N. E. Leonard, “Multi-robot task allocation games in dynamically changing environments,” in 2021 IEEE International Conference on Robotics and Automation (ICRA), 2021.
- [2] L. Parker, “Alliance: an architecture for fault tolerant multirobot cooperation,” IEEE Transactions on Robotics and Automation, vol. 14, no. 2, pp. 220–240, 1998.
- [3] C. Yang, L. Kaplan, E. Blasch, and M. Bakich, “Optimal placement of heterogeneous sensors for targets with Gaussian priors,” IEEE Transactions on Aerospace and Electronic Systems, vol. 49, no. 3, pp. 1637–1653, 2013.
- [4] L. Parker and F. Tang, “Building multirobot coalitions through automated task solution synthesis,” Proceedings of the IEEE, vol. 94, no. 7, pp. 1289–1305, 2006.
- [5] I. Jang, H.-S. Shin, and A. Tsourdos, “Anonymous hedonic game for task allocation in a large-scale multiple agent system,” IEEE Transactions on Robotics, vol. 34, no. 6, pp. 1534–1548, 2018.
- [6] W. Saad, Z. Han, T. Basar, M. Debbah, and A. Hjorungnes, “Hedonic coalition formation for distributed task allocation among wireless agents,” IEEE Transactions on Mobile Computing, vol. 10, no. 9, 2011.
- [7] H.-L. Choi, L. Brunet, and J. P. How, “Consensus-based decentralized auctions for robust task allocation,” IEEE Transactions on Robotics, vol. 25, no. 4, pp. 912–926, 2009.
- [8] S. Amaya and A. Mateus, “Tasks allocation for rescue robotics: A replicator dynamics approach,” in Artificial Intelligence and Soft Computing, L. Rutkowski, R. Scherer, M. Korytkowski, W. Pedrycz, R. Tadeusiewicz, and J. M. Zurada, Eds. Cham: Springer International Publishing, 2019, pp. 609–621.
- [9] S. Berman, A. Halasz, M. A. Hsieh, and V. Kumar, “Optimized stochastic policies for task allocation in swarms of robots,” IEEE Transactions on Robotics, vol. 25, no. 4, pp. 927–937, 2009.
- [10] A. Pashaie, L. Pavel, and C. J. Damaren, “A population game approach for dynamic resource allocation problems,” International Journal of Control, vol. 90, no. 9, pp. 1957–1972, 2017.
- [11] J. R. Marden, “State based potential games,” Automatica, vol. 48, no. 12, pp. 3075–3088, 2012.
- [12] N. Li and J. R. Marden, “Designing games for distributed optimization,” IEEE Journal of Selected Topics in Signal Processing, vol. 7, no. 2, pp. 230–242, 2013.
- [13] W. H. Sandholm, Population Games and Evolutionary Dynamics. MIT Press, 2011.
- [14] E. Ramirez-Llanos and N. Quijano, “A population dynamics approach for the water distribution problem,” International Journal of Control, vol. 83, pp. 1947–1964, 2010.
- [15] M. J. Fox and J. S. Shamma, “Population games, stable games, and passivity,” Games, vol. 4, pp. 561–583, Oct. 2013.
- [16] J. Barreiro-Gomez, C. Ocampo-Martinez, N. Quijano, and J. M. Maestre, “Non-centralized control for flow-based distribution networks: A game-theoretical insight,” Journal of The Franklin Institute, vol. 354, pp. 5771–5796, 2017.
- [17] S. Park, N. C. Martins, and J. S. Shamma, “From population games to payoff dynamics models: A passivity-based approach,” in 2019 IEEE 58th Conference on Decision and Control (CDC), 2019.
- [18] S. Park, J. S. Shamma, and N. C. Martins, “Passivity and evolutionary game dynamics,” in 2018 IEEE Conference on Decision and Control (CDC), 2018, pp. 3553–3560.
- [19] M. Arcak and N. C. Martins, “Dissipativity tools for convergence to Nash equilibria in population games,” IEEE Transactions on Control of Network Systems, vol. 8, no. 1, pp. 39–50, 2021.
- [20] N. Quijano, C. Ocampo-Martinez, J. Barreiro-Gomez, G. Obando, A. Pantoja, and E. Mojica-Nava, “The role of population games and evolutionary dynamics in distributed control systems: The advantages of evolutionary game theory,” IEEE Control Systems Magazine, vol. 37, no. 1, pp. 70–97, 2017.
- [21] S. Park and N. E. Leonard, “KL divergence regularized learning model for multi-agent decision making,” in 2021 American Control Conference (ACC), 2021, pp. 4509–4514.
- [22] ——, “Learning with delayed payoffs in population games using Kullback-Leibler divergence regularization,” arXiv preprint, 2023.
- [23] H. K. Khalil, Nonlinear Systems, 3rd ed. Upper Saddle River, NJ: Prentice-Hall, 2002.