© 2024 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Reinforcement Learning of Multi-robot Task Allocation for Multi-object Transportation with Infeasible Tasks
Abstract
Multi-object transport using multi-robot systems has the potential for diverse practical applications such as delivery services owing to its efficient individual and scalable cooperative transport. However, allocating transportation tasks of objects with unknown weights remains challenging. Moreover, the presence of infeasible tasks (untransportable objects) can lead to robot stoppage (deadlock). This paper proposes a framework for dynamic task allocation that involves storing task experiences for each task in a scalable manner with respect to the number of robots. First, these experiences are broadcasted from the cloud server to the entire robot system. Subsequently, each robot learns the exclusion levels for each task based on those task experiences, enabling it to exclude infeasible tasks and reset its task priorities. Finally, individual transportation, cooperative transportation, and the temporary exclusion of tasks considered infeasible are achieved. The scalability and versatility of the proposed method were confirmed through numerical experiments with an increased number of robots and objects, including unlearned weight objects. The effectiveness of the temporary deadlock avoidance was also confirmed by introducing additional robots within an episode. The proposed method enables the implementation of task allocation strategies that are feasible for different numbers of robots and various transport tasks without prior consideration of feasibility.
I INTRODUCTION
In recent years, multi-robot transportation tasks have attracted considerable attention in various fields such as delivery services, factory logistics, search and rescue, and precision agriculture. Systems in which multiple robots are controlled via a cloud server to execute various transportation tasks within facilities have been developed. These multi-robot systems achieve efficient transportation because each robot can cover a wide area independently [1] (Fig. 1 (a)). In addition, scalability is realized by enabling multiple robots to cooperate with each other when transporting objects that individual robots find infeasible [2, 3, 4, 5] (Fig. 1 (c)). Furthermore, distributed control improves the resilience of robots by facilitating the seamless addition of robots to the system.
Multi-robot task allocation (MRTA) is important for achieving multiple transportation tasks with multi-robot systems [1, 6]. Garkely et al. (2004) [1] categorized MRTA approaches into centralized algorithms [7, 8], distributed algorithms [9], and hybrid algorithms that combine centralized and distributed approaches, such as auction-based methods [10, 11, 12].

Recently, several approaches based on multi-agent reinforcement learning (MARL) [13] have been proposed [14]. However, MARL remains challenging because of the partial observability of each robot and the simultaneous learning of policies for each robot [13, 15]; therefore, the approach involving centralized training and decentralized execution (CTDE) is often used [16, 17]. The CTDE approach has demonstrated scalability for systems with various numbers of robots that need to perform diverse tasks [14, 18]. Furthermore, cooperative actions can be performed by learning communication between robots [14, 19].
Naturally, there are objects that cannot be transported (infeasible tasks) by all robots (Fig. 1 (b)), and there may be no prior information about them. In conventional methods [1], the cost and task completion probability for each task must be specified explicitly. In addition, when the method [14] based on MARL is applied to task allocation in an environment in which infeasible tasks exist, all robots are connected to the infeasible tasks, causing a stoppage (deadlock) in the robot movements.
To avoid deadlocks, it is necessary to exclude infeasible tasks from task allocation. However, when additional cooperative robots are introduced, the tasks become feasible, as shown in Fig. 1 (d), and it is desired to release the exclusion. Therefore, a mechanism is required to temporarily exclude the allocation of tasks that have already been allocated.
In this paper, we propose a framework for dynamic task allocation that enables a multi-robot system to continue executing tasks even when there are infeasible tasks among the transportation tasks. Specifically, a cloud server stores the task experiences in a scalable manner with respect to the number of robots and broadcasts them to the entire robot system. In our proposed method, each robot learns the exclusion level for each object based on the task experience and other information. The task priorities [14] are reset using an output gate based on the task experience and exclusion levels. Consequently, individual transportation, cooperative transportation, and the temporary exclusion of tasks that were considered as infeasible were achieved. Finally, we validated the performance of our proposed method in terms of success rate and transportation time by performing numerical experiments with a larger number of robots and objects compared with the training experiments, as well as with untrained objects and a varying number of robots.
The contributions of this study are as follows:
-
•
We propose a task alloation framework with infeasible tasks comprising task experiences broadcasted from the cloud server and task exlusion levels learned by structured policy models.
-
•
The proposed framework differs from conventional MRTA approaches that require the specification of the cost and task completion probability, in that it can temporarily exclude infeasible tasks without prior information until additional robots are introduced.
-
•
We confirm that the proposed method successfully completes feasible tasks while excluding infeasible tasks, even in numerical experiments that differ from the training experiments, including an episode where additional robots are introduced within the episode.
The remainder of this paper is structured as follows: Section II presents related work, Section III explains the problem setting, Section IV introduces the proposed method, Section V evaluates the performance of the proposed method through numerical experiments, and Section VI concludes the paper.
II Related Work
II-A Combinatorial Optimization
Task allocation can be formulated as a combinatorial optimization problem that aims to minimize the total travel distance, and can be solved using methods such as the Hungarian algorithm [7] or integer linear programming [8]. For these methods, it is necessary to provide the required resources (number of robots) for each task. However, obtaining this information in advance may not always be possible. In addition, while the tasks need to be feasible, there are cases in which the allocated resources are insufficient to execute tasks.
II-B Bio-inspired Approach
Metaheuristic methods [20, 21, 22], which were inspired by biological systems and natural processes, such as the division of labor in social insects, have been used to solve MRTA problems [6]. A commonly used approach is the threshold model [20, 21], in which each robot selects tasks using activation thresholds and a stimulus associated with each task based on local information. These methods are flexible and can be adapted to different conditions with varying number of robots and tasks. However, infeasible tasks may be allocated to robots, which can lead to decreased efficiency.
II-C Auction Approach
The auction algorithm is a commonly used method for task allocation among multi-robots [1], and has been studied using both centralized and decentralized approaches. With the centralized approach [23], an auctioneer allocates tasks to bidders. In contrast, Choi et al. (2009) [11] proposed a decentralized auction-based algorithm without an auctioneer. With this method, tasks are allocated based on a consensus algorithm that involves local communication among bidders (robots). However, their method focused on problems in which a single robot could execute each task.
Braquet and Bakolas (2021) [12] addressed a problem similar to that in our research. It focuses on the cases in which multiple robots are required for each task. Their method employed a consensus algorithm that is similar to that of Choi et al. (2009) [11] to estimate the lists of selected tasks, winning bids, and completed allocations. The robots allocate tasks to the robot with the highest bid among the unallocated robots based on the list of completed allocations. However, this method requires the probability of completing each task, which becomes computationally challenging when dealing with objects for which the resources required for completion are unknown.
II-D MARL Approach
Previous studies [24, 25, 26, 27, 28, 29] have focused on task allocation problems using MARL. These methods formulate task allocation as a Markov decision process and learn policies using learning algorithms, such as MADDPG [16]. However, these methods are designed for a fixed number of agents and tasks, rendering them ineffective under different conditions. To address this issue, policy models that only obtain neighboring agents and tasks have been utilized [14, 18]. In particular, a framework for task allocation in the presence of an unknown number of robots required for cooperative transportation was proposed in previous studies [14]. This proposed framework utilizes dynamic priorities for each task and global robot communication. However, if the tasks do not have sufficient resources (number of robots), the robots may encounter a deadlock scenario until the introduction of additional robots.
Our proposed framework, which is similar to the approach proposed by Shibata et al. (2022) [14], uses dynamic priorities, but in contrast to their method, it incorporates the learning of dynamic exclusion levels for each task by leveraging task experiences broadcasted from the cloud server. Dynamic exclusion levels are used in the output gate, which temporarily resets dynamic priorities. Therefore, if a robot encounters a deadlock with a specific object, the priority of that task is reset. In addition, when more cooperative robots are introduced, the reset for that object is released, enabling the robots to again be transported cooperatively.
III Problem Setting
Consider the task of transporting different objects with varying weights using robots to the goals associated with the tasks (Fig. 1). However, the weights of the objects are unknown and there may be objects that cannot be transported. In this study, the transport task aims to efficiently transport all feasible objects using all robots as quickly as possible.
The study focuses on the allocation of objects to robots. Therefore, the robot allocated to object reaches it within the shortest possible time. The number of robots allocated to object , denoted by , determines the ability of robots to carry objects. If is greater than the weight of object , a sufficient number of robots are connected to the object. Subsequently, the object is transported to the goal . Here,
represents the set of robots that have actually connected to object , where is the position of robot , is the position of object . If distance between robots and objects is smaller than small positive constant , the robots are able to connect to the object.
This study made the following assumptions:
-
•
The cloud server can always obtain the latest information (global information) about the robots and objects.
-
•
When cooperating with other robots is necessary, each robot can obtain global information via the cloud server.
IV Method
The weights of the objects are unknown, and there may be objects that cannot be transported; hence, the simple allocation of tasks to the robots in advance would result in a deadlock. Therefore, we propose the framework shown in Fig. 2. The cloud server holds global information, such as the task experience for the entire robot system. In addition, each robot has a task exclusion level for each object in addition to the task priority . The task exclusion level enables the temporary resetting of the task priority for each object based on the broadcasted from the cloud server. Consequently, deadlock avoidance was enabled.

IV-A Task Experience
The cloud server stores the overall task experience for object in a scalable manner based on the number of robots. can be expressed as follows:
(1) |
Here, represents the number of robots connected to object at current time , and represents the velocity of object . In addition, is a step function with a threshold of , and is a positive constant.
IV-B Dynamic Task Exclusion
To obtain exclusion levels that are scalable with the number of objects , robot learns a static policy network model that outputs the target exclusion level for neighboring objects as follows:
(2) |
Robot sets the current exclusion levels for other objects. The target exclusion levels for all objects are set as follows:
Each robot utilizes partial observables , denoted as follows:
(3) | |||||
which contains nearest robots , objects , and task experience for the nearest objects.
Furthermore, for objects that are outside the nearest neighbors of robot , a mechanism is introduced to share the exclusion levels with other robots via the cloud server at time when . The timing was obtained by learning a static policy network model, which is denoted as follows:
(4) |
Based on the above, robot updates the exclusion levels for all objects using a consensus protocol via the cloud server:
(5) |
Here, is a positive constant,
is a step function with a threshold of for variable .
IV-C Integration with Dynamic Task Priority
Each robot sequentially allocates objects using dynamic priorities [14].
Similar to the exclusion levels, to achieve scalable learning with a number of objects , robot sets the target priorities of all objects as follows:
where is the current priority of robot for object ; in addition, robot learns the target priorities for neighboring objects using a policy network model as follows:
(6) |
Cooperation with robots that are not located nearby is required for handling heavy objects. Therefore, similar to the exclusion level, when the required timing is , robot updates the priorities for all objects using a consensus protocol via a cloud server:
(7) |
Here, denotes a positive constant. In addition, the timing for sharing priorities with other robots is obtained by learning a static policy network as follows:
(8) |
The exclusion level in Equation (5) is integrated with priority in Equation (7) using the output gate, as shown in Fig. 3. The integration is expressed as follows:
(9) |
Here, is a step function with the threshold of a small positive constant . If object satisfies and , transportation is determined to be infeasible with the current resources, and its priority is reset.
Finally, robot selects the object with the highest priority among priorities after the output gate in Equation (9):
Furthermore, the priorities of the objects that have reached their goals or are being transported by other robots are set to . In addition, if an object is already allocated and is being transported, the update of all priorities is stopped.

IV-D Policy Optimization
V Numerical Experiment
In this section, we present the results of numerical experiments on the transportation tasks of multiple objects. The performance of the proposed method was evaluated under various settings by varying the number of robots and objects.
V-A Simulation Setup
Starting positions of the participants were randomly selected, and the target positions of each object were fixed for each simulation (Fig. 4). The carrying capacity of each robot was set to . By the inclusion of robots cooperating to transport the same object, it becomes possible to transport an object with a weight of .

In the training experiments (Training in Table I), robots and objects were used. The weights of the objects were set to three types, including objects that cannot be transported by robots, with weights . Among the six objects, three objects had a weight of , whereas the remaining three objects were randomly chosen from weights of and with a probability.
Experiment | Robot | Object | Object weight | ||||
---|---|---|---|---|---|---|---|
Training | |||||||
Validation 1 | |||||||
Validation 2 | |||||||
Validation 3 |
We used the MADDPG code [30] and set the simulation parameters listed in Table II. In the proposed method, the numbers of neighboring robots and objects for each robot were set to . The other parameters were set as follows: , , , .
Parameter | Value |
---|---|
Sampling period [s] | 1.0 |
Number of steps per episode | 300 |
Number of episodes | 50,000 |
Number of hidden layers (critic) | 4 |
Number of hidden layers (actor) | 4 |
Activation function (hidden) | RelU |
Activation function (output, critic) | tanh |
Activation function (output, actor) | linear |
Optimizer (critic) | Adam |
Optimizer (actor) | SGD |
Discount factor | 0.99 |
Batch size | 1024 |
The validation experiments (refer to validations 1 and 2 in Table I) used robots and objects. Among these, two objects have an untransportable weight of , whereas the remaining eight objects have weights of either or . The proportion of objects with weights of or was selected using probabilities, and episodes were run, with each episode consisting of 1,000 steps.
To evaluate the scalability and versatility of the proposed method, we used the following two criteria:
-
•
Success rate: The percentage of episodes in which all objects reached their goals within the total number of steps.
-
•
Transportation time: The average number of steps (seconds) until all transportable objects reached their goals for the episodes in which they succeed in transportation.
Furthermore, in an additional validation experiment (refer to Validation 3 in Table I), the number of robots was varied from to during each episode to evaluate the effectiveness of the proposed method.
V-B Training Performance
The training performance of the proposed method was evaluated based on the average cumulative reward over training runs. The results were compared with those of a method that used only the dynamic priority mechanism [14] without DE and ), as shown in Fig. 5. Here, DE in the figure refers to dynamic exclusion.
The average cumulative reward of the method using only the dynamic priority mechanism [14] was low because the robots became stuck in a deadlock while gathering around objects that could not be transported.
However, the proposed method achieved high cumulative rewards and did not cause deadlocks. It is believed that the output gate in Equation (9), which is controlled by the dynamic exclusion in Equation (5), effectively avoids deadlock.

V-C Scalability and Versatility
Fig. 7 shows the validation results obtained when increasing the number of robots and objects compared to the learning phase, with the weights of the objects remaining the same. For comparison, the results excluding the dynamic exclusion (DE) and inputs from the proposed method are shown in red and yellow, respectively. When using only the dynamic priority mechanism [14] without DE and , there were cases in which objects could not be transported because of deadlock occurrences with infeasible objects, resulting in a success rate below . However, the proposed method achieved successful transportation in shorter times than the other methods under all conditions. Furthermore, the proposed method reduced the transportation time when there were fewer heavy objects. This indicates that both individual and cooperative transportation occurred simultaneously, depending on the composition of the objects. These results demonstrated the scalability of the proposed method.
Fig. 7 shows the validation results obtained when increasing the number of robots and objects compared to the learning phase, including the case where there are unlearned weight values for the objects. Even for objects with unlearned weights (), the proposed method successfully completed the transport tasks. This demonstrates the versatility of the proposed method.


In Fig. 7 and 7, the policies obtained from the learning of methods removing either the DE or input from the proposed method resulted in a significant decrease in the success rate. For the method without DE (shown by the red bars in the figures), even when the ratio of weights to is zero (i.e., all objects except for infeasible ones can be transported individually with weight ), deadlock occurs because of the infeasible objects, and the success rate does not reach 100 %. For the method without input (shown by the yellow bars in the figures), when the ratio of weights to was , the success rate reached . However, when heavy objects are transported cooperatively, the success rate decreases to . From these results, it can be inferred that the removal of DE results in the loss of deadlock avoidance functionality, and the removal of the input resulted in the loss of cooperative transportation capability.
V-D Effectiveness of Temporary Priority Exclusions



We confirmed the effect of temporarily resetting the priority using output gates through an experiment (Validation 3) in which robots were added within an episode. Fig. 8 shows the results sampled during a s episode. In Fig. 8 (a)–(c), cooperating robots were able to transport objects weighing up to before encountering a deadlock with objects weighing . In Fig. 8 (d)–(f), with additional robots, all robots cooperate to transport all objects weighing .
Fig. 9 shows the number of robots connected to objects as , and the values of the output gates for each robot . The results of the output gates are shown in Fig. 9 (c); evidently, the priority of objects weighing was temporarily excluded. Consequently, they were able to transport objects weighing and to the goal early without encountering a deadlock. Furthermore, after the addition of robots, the output gates were released, allowing the robots to collaboratively transport objects weighing .
Fig. 10 shows the task experiences and for objects and with a weight of . After robots were added at s, the current task experiences were recalculated with and the values decreased. Consequently, the temporary exclusions were released. Normalization and recalculation of task experiences using current number of robots enables the proposed method to handle objects with a weight of .
VI CONCLUSIONS
In this study, we propose a framework for dynamic task allocation in multiple transportation tasks where the weights of the objects are unknown and there are infeasible objects. In the proposed method, multi-robots sequentially select objects using a dynamic task allocation approach. To achieve scalability in terms of the number of robots and objects, we learn the timing for utilizing global information and the target priorities of neighboring tasks based on local observations. We calculated dynamic priorities using an consensus protocol. We also constructed target exclusion levels and consensus protocols to temporarily reset priorities using an output gate, thereby avoiding deadlocks caused by infeasible objects. This construction enables task allocation strategies to be executable without prior consideration of the feasibility of the task. By performing numerical experiments, we confirmed the efficiency of task allocation through both individual and cooperative transportation, which can be achieved even with an increasing number of robots and objects (scalability), the ability to allocate tasks to unlearned objects (versatility), and the effectiveness of the temporary prevention of deadlocks while the number of robots was insufficient to execute the tasks.
Our proposed method assumes the use of a cloud server to obtain the required global information. In the future, we plan to implement the proposed method to validate its feasibility in real-world environments.
References
- [1] B. P. Gerkey and M. J. Matarić, “A formal analysis and taxonomy of task allocation in multi-robot systems,” The International journal of robotics research, vol. 23, no. 9, pp. 939–954, 2004.
- [2] N. Lissandrini, C. K. Verginis, P. Roque, A. Cenedese, and D. V. Dimarogonas, “Decentralized nonlinear mpc for robust cooperative manipulation by heterogeneous aerial-ground robots,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2020, pp. 1531–1536.
- [3] F. Bertoncelli, F. Ruggiero, and L. Sabattini, “Characterization of grasp configurations for multi-robot object pushing,” in 2021 International Symposium on Multi-Robot and Multi-Agent Systems (MRS). IEEE, 2021, pp. 38–46.
- [4] Y. Liu, F. Zhang, P. Huang, and X. Zhang, “Analysis, planning and control for cooperative transportation of tethered multi-rotor uavs,” Aerospace Science and Technology, vol. 113, p. 106673, 2021.
- [5] M. Doakhan, M. Kabganian, and A. Azimi, “Cooperative payload transportation with real-time formation control of multi-quadrotors in the presence of uncertainty,” Journal of the Franklin Institute, vol. 360, no. 2, pp. 1284–1307, 2023.
- [6] H. Chakraa, F. Guérin, E. Leclercq, and D. Lefebvre, “Optimization techniques for multi-robot task allocation problems: Review on the state-of-the-art,” Robotics and Autonomous Systems, p. 104492, 2023.
- [7] L. Liu and D. A. Shell, “Assessing optimal assignment under uncertainty: An interval-based algorithm,” The International Journal of Robotics Research, vol. 30, no. 7, pp. 936–953, 2011.
- [8] L. Sabattini, V. Digani, C. Secchi, and C. Fantuzzi, “Optimized simultaneous conflict-free task assignment and path planning for multi-agv systems,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2017, pp. 1083–1088.
- [9] A. Kimmel and K. Bekris, “Decentralized multi-agent path selection using minimal information,” in Distributed Autonomous Robotic Systems, N.-Y. Chong and Y.-J. Cho, Eds. Tokyo: Springer Japan, 2016, pp. 341–356.
- [10] M. B. Dias, R. Zlot, N. Kalra, and A. Stentz, “Market-based multirobot coordination: A survey and analysis,” Proceedings of the IEEE, vol. 94, no. 7, pp. 1257–1270, 2006.
- [11] H.-L. Choi, L. Brunet, and J. P. How, “Consensus-based decentralized auctions for robust task allocation,” IEEE transactions on robotics, vol. 25, no. 4, pp. 912–926, 2009.
- [12] M. Braquet and E. Bakolas, “Greedy decentralized auction-based task allocation for multi-agent systems,” IFAC-PapersOnLine, vol. 54, no. 20, pp. 675–680, 2021.
- [13] Y. Yang and J. Wang, “An overview of multi-agent reinforcement learning from game theoretical perspective,” arXiv preprint arXiv:2011.00583, 2020.
- [14] K. Shibata, T. Jimbo, T. Odashima, K. Takeshita, and T. Matsubara, “Learning locally, communicating globally: Reinforcement learning of multi-robot task allocation for cooperative transport,” IFAC-PapersOnLine, vol. 56, no. 2, pp. 11 436–11 443, 2023.
- [15] M. Wen, J. Kuba, R. Lin, W. Zhang, Y. Wen, J. Wang, and Y. Yang, “Multi-agent reinforcement learning is a sequence modeling problem,” in Advances in Neural Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, Eds., vol. 35. Curran Associates, Inc., 2022, pp. 16 509–16 521.
- [16] R. Lowe, Y. WU, A. Tamar, J. Harb, O. Pieter Abbeel, and I. Mordatch, “Multi-agent actor-critic for mixed cooperative-competitive environments,” in Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds., vol. 30. Curran Associates, Inc., 2017.
- [17] J. Foerster, G. Farquhar, T. Afouras, N. Nardelli, and S. Whiteson, “Counterfactual multi-agent policy gradients,” in Proceedings of the AAAI conference on artificial intelligence, vol. 32, no. 1, 2018.
- [18] C. D. Hsu, H. Jeong, G. J. Pappas, and P. Chaudhari, “Scalable reinforcement learning policies for multi-agent control,” 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4785–4791, 2020.
- [19] T. S. Dahl, M. J. Mataric, and G. S. Sukhatme, “Adaptive spatio-temporal organization in groups of robots,” in IEEE/RSJ International Conference on Intelligent Robots and Systems, vol. 1. IEEE, 2002, pp. 1044–1049.
- [20] G. Theraulaz, E. Bonabeau, and J. Denuebourg, “Response threshold reinforcements and division of labour in insect societies,” Proceedings of the Royal Society of London. Series B: Biological Sciences, vol. 265, no. 1393, pp. 327–332, 1998.
- [21] M. J. Krieger and J.-B. Billeter, “The call of duty: Self-organised task allocation in a population of up to twelve mobile robots,” Robotics and Autonomous Systems, vol. 30, no. 1-2, pp. 65–84, 2000.
- [22] C. J. Watkins and P. Dayan, “Q-learning,” Machine learning, vol. 8, pp. 279–292, 1992.
- [23] A. M. Kwasnica, J. O. Ledyard, D. Porter, and C. DeMartini, “A new and improved design for multiobject iterative auctions,” Management science, vol. 51, no. 3, pp. 419–434, 2005.
- [24] Y.-T. Tian, M. Yang, X.-Y. Qi, and Y.-M. Yang, “Multi-robot task allocation for fire-disaster response based on reinforcement learning,” in 2009 International Conference on Machine Learning and Cybernetics, vol. 4. IEEE, 2009, pp. 2312–2317.
- [25] X. Zhao, Q. Zong, B. Tian, B. Zhang, and M. You, “Fast task allocation for heterogeneous unmanned aerial vehicles through reinforcement learning,” Aerospace Science and Technology, vol. 92, pp. 588–594, 2019.
- [26] Y. Wang, H. Liu, W. Zheng, Y. Xia, Y. Li, P. Chen, K. Guo, and H. Xie, “Multi-objective workflow scheduling with deep-q-network-based multi-agent reinforcement learning,” IEEE access, vol. 7, pp. 39 974–39 982, 2019.
- [27] H. Qie, D. Shi, T. Shen, X. Xu, Y. Li, and L. Wang, “Joint optimization of multi-uav target assignment and path planning based on multi-agent reinforcement learning,” IEEE access, vol. 7, pp. 146 264–146 272, 2019.
- [28] H. Tang, A. Wang, F. Xue, J. Yang, and Y. Cao, “A novel hierarchical soft actor-critic algorithm for multi-logistics robots task allocation,” Ieee Access, vol. 9, pp. 42 568–42 582, 2021.
- [29] T. Niwa, K. Shibata, and T. Jimbo, “Multi-agent reinforcement learning and individuality analysis for cooperative transportation with obstacle removal,” in Distributed Autonomous Robotic Systems: 15th International Symposium. Springer, 2022, pp. 202–213.
- [30] R. Lowe, Y. WU, A. Tamar, J. Harb, O. Pieter Abbeel, and I. Mordatch, “Maddpg algorithm,” github. [Online]. Available:https://github.com/openai/maddpg [Accessed: 3-Nov-2021].