The Bandit Whisperer: Communication Learning for Restless Bandits

Yunfan Zhao^{1, 2}\equalcontribTonghan Wang¹\equalcontrib, Dheeraj Nagaraj³, Aparna Taneja³, Milind Tambe^{1, 3}

Abstract

Applying Reinforcement Learning (RL) to Restless Multi-Arm Bandits (RMABs) offers a promising avenue for addressing allocation problems with resource constraints and temporal dynamics. However, classic RMAB models largely overlook the challenges of (systematic) data errors—a common occurrence in real-world scenarios due to factors like varying data collection protocols and intentional noise for differential privacy. We demonstrate that conventional RL algorithms used to train RMABs can struggle to perform well in such settings. To solve this problem, we propose the first communication learning approach in RMABs, where we study which arms, when involved in communication, are most effective in mitigating the influence of such systematic data errors. In our setup, the arms receive Q-function parameters from similar arms as messages to guide behavioral policies, steering Q-function updates. We learn communication strategies by considering the joint utility of messages across all pairs of arms and using a Q-network architecture that decomposes the joint utility. Both theoretical and empirical evidence validate the effectiveness of our method in significantly improving RMAB performance across diverse problems.

1 Introduction

Restless Multi-Arm Bandits (RMABs) have been successfully applied to a range of multi-agent constrained resource allocation problems, such as healthcare, online advertising, and anti-poaching (Ruiz-Hernández, Pinar-Pérez, and Delgado-Gómez 2020; Lee, Lavieri, and Volk 2019; Hodge and Glazebrook 2015; Mate et al. 2022; Modi, Mary, and Moy 2019; Zhao, Krishnamachari, and Liu 2008; Bagheri and Scaglione 2015; Tripathi and Modiano 2019). In classical RMABs, there are a fixed number of arms, each may be described as an MDP with two actions (active and passive). At each timestep, constrained by the global budget, the RMAB strategically selects a subset of arms to play the active action with the objective of maximizing long-term accumulated rewards. Although finding an exact RMAB solution is PSPACE-hard (Papadimitriou and Tsitsiklis 1999), the whittle index solution is known to be near optimal (Whittle 1988; Weber and Weiss 1990) in this situation. Beyond classical RMABs, important applications of RMABs involve more complex systems with multiple actions, continuous state spaces, and arms opting in and out of the system. Recent studies have explored a variety of methods to address these challenges (Hawkins 2003; Killian, Perrault, and Tambe 2021; Verloop 2016; Zayas-Caban, Jasin, and Wang 2019; Ghosh et al. 2023), with deep Reinforcement Learning (RL) emerging as a particularly effective approach, achieving state-of-the-art results across various RMAB settings (Zhao et al. 2023; Xiong and Li 2023; Wang et al. 2023; Killian et al. 2022).

Unfortunately, these previous studies consider RMAB models where the reward data can be collected from every arm reliably. In real-world scenarios modeled via RMABs, such as maternal healthcare (Biswas et al. 2021; Verma et al. 2024; Behari et al. 2024) and SIS epidemic model concerning COVID intervention (Yaesoubi and Cohen 2011), each arm represents a different entity, which could be a beneficiary or a geographical region, where data collection protocols are potentially different. Different arms may be affected by noise caused by systematic errors that arise due to the following practical reasons. (1) Variability in data collection and handling protocols. There may be differences in data collection in different arms (entities), e.g., differences in the frequency and quality of data collected, in worker awareness or standards for data preprocessing, cleaning, and digitization. This variability represents sampling bias (Paulus et al. 2023) and may cause systematic errors (Dubrow 2022) in some arms. For example, in maternal care, specific data collection approaches may cause systematic overestimation of expected delivery dates (Fulcher et al. 2020). (2) Data integrity and privacy. For example, when data is gathered through surveys, topical or political bias (Paulus et al. 2023) can introduce systematic errors in some arms due to influence of authorities or impacted populations (e.g., keeping the number of COVID cases low (Paulus et al. 2023)). Additionally, driven by the principle of differential privacy, noise might be intentionally added to the data (Dwork, Roth et al. 2014).

Motivated by these scenarios, in this paper we study the RMAB model where some arms are affected by noise in their observed rewards at some states, with these noisy arms and rewards not known a priori. The challenge of applying existing RL methods to this setting is that systematic errors might create false optima, preventing the exploration of states with true high rewards and leading to suboptimal actions selected by the RMAB. For example, the aforementioned systematic overestimation errors of women’s delivery dates can lead to suboptimal resource allocation, thereby reducing the number of deliveries in healthcare facilities (Fulcher et al. 2020).

We propose the first communication learning approach in RMABs to deal with systematic reward errors, enabling arms to learn to communicate useful information to improve local Q-learning. We formally model the communication learning problem as a Multi-Agent MDP (Boutilier 1996), which is in addition to and distinct from per-arm MDP without considering communication. In this framework, each arm has a binary communication action. Choosing to communicate allows an arm to receive the local Q function parameters from a similar arm. Using these parameters, the receiving arm establishes a behavior policy to collect experience samples for updating its local Q function. This behavior policy may enable exploration and help escape false optima. In this way, the communication reward can be calculated as the difference in RMAB return with and without communication. We then adopt Q-learning for communication learning. Since RMABs optimize a centralized objective involving return from all the arms, the communication reward captures the joint influence of communication of all arms. To solve this problem and learn the individual communication strategy of each arm, we propose to use a decomposed Q-network architecture for communication learning that factorizes the joint communication Q function into local communication Q function.

To demonstrate the effectiveness our method, we rigorously compare the sample complexity required to learn a near-optimal per-arm Q function when learning with and without communication. We show theoretically that there exists RMABs where communication from non-noisy to noisy arms can reduce the Q estimation errors exponentially with respect to the number of states. Interestingly, further studies reveal that communication from noisy to non-noisy arms can also be beneficial, provided that the behavior policy, on the receiver arms’ MDP, achieves reasonable coverage over the state-action space and does not decorrelate too slowly from the initial states.

Empirically, we validate our method on synthetic problems, as well as the ARMMAN maternal healthcare (Verma et al. 2024) and the SIS epidemic intervention problem (Yaesoubi and Cohen 2011) built upon real-world data. Results show that our method significantly outperforms both the non-communicative learning approach and the approach with fixed communication strategies, achieving performance levels comparable to those learning in noise-free environments. Results are robust across problems, resources budgets, and noise levels. Furthermore, visualization of the learned communication strategies supports our theoretical findings that communication from noisy arms to non-noisy arms, and vice versa, can be helpful.

Related works. To the best of our knowledge, this is the first work that introduces communication into RMABs. This is perhaps surprising given that communication is an important topic that has been extensively studied in other multi-agent problems. In Dec-POMDPs (Oliehoek, Amato et al. 2016; Kwak et al. 2010; Nair et al. 2004), previous works explore decentralized communication learning (Sukhbaatar, Fergus et al. 2016; Hoshen 2017; Jiang and Lu 2018; Singh, Jain, and Sukhbaatar 2019; Wang et al. 2019b), the emergence of natural language (Foerster et al. 2016; Das et al. 2017; Lazaridou, Peysakhovich, and Baroni 2017; Mordatch and Abbeel 2018; Zhao et al. 2024b; Kang, Wang, and de Melo 2020), and implicit communication by coordination graphs (Böhmer, Kurin, and Whiteson 2020; Wang et al. 2021; Kang et al. 2022; Yang et al. 2022), influence on each other (Jaques et al. 2019; Wang et al. 2019a), and attention mechanisms (Huang, Mordatch, and Pathak 2020; Kurin et al. 2020; Dong et al. 2022, 2023). Different from these works, we consider communication under noisy returns and in the RMAB setting where the central planner policy constrained by resource budgets determines the influence of communication.

In the literature of cooperative multi-agent reinforcement learning (Yu et al. 2022; Wen et al. 2022; Kuba et al. 2021; Wang et al. 2020a; Christianos, Schäfer, and Albrecht 2020; Peng et al. 2021; Wang et al. 2020c; Jiang et al. 2019; Wang et al. 2020b; Nagaraj et al. 2023), value decomposition methods (Sunehag et al. 2018; Rashid et al. 2018; Son et al. 2019; Li et al. 2021) have recently witnessed great success in addressing challenges such as partial observability and learning scalability. These methods decompose the joint Q-value conditioned on true states and joint actions into per agent local Q functions based on local action-observation history and local actions. In this work, we adopt a linear value decomposition framework (Sunehag et al. 2018) for the communication Q network to solve the unique problems of learning with the presence of noisy arms.

2 Preliminaries

2.1 RMABs with Noisy Arms

We study multi-action Restless Multi-Arm Bandits (RMABs) with $N$ arms. Each arm $i\in[N]$ is associated with a Markov Decision Process $\mathcal{M}_{i}=(\mathcal{S}_{i},\mathcal{A}_{i},\mathcal{C}_{i},T_{i},R_{i},\beta,\bm{z}_{i})$ . Here, $\mathcal{S}_{i}$ is the state space that can be either discrete or continuous. The reward at a state is given by a function $R_{i}:\mathcal{S}_{i}\rightarrow\mathbb{R}$ , and $\beta\in[0,1)$ is a discount factor for future rewards. We assume that there is a set of noisy arms $\mathcal{N}_{\epsilon}\subset[N]$ , and the rewards of a noisy arm $i\in\mathcal{N}_{\epsilon}$ are unreliable and affected by noise $\epsilon_{i}(s_{i})$ at some states $s_{i}\in\mathcal{S}_{\epsilon,i}\subset\mathcal{S}_{i}$ . We consider systematic errors: $\tilde{R}_{i}(s_{i})=R_{i}(s_{i})+\epsilon_{i}(s_{i})$ , where $\epsilon_{i}(s_{i})$ is sampled from a distribution $\mathcal{E}_{i}$ and fixed during training and testing. The identity of noisy arms $\mathcal{N}_{\epsilon}$ , the affected states $\mathcal{S}_{\epsilon,i}$ , the reward noise $\epsilon_{i}$ , and its distribution $\mathcal{E}_{i}$ are not known a priori.

$\mathcal{A}_{i}$ is a finite set of discrete actions. Each action $a\in\mathcal{A}_{i}$ has a cost $\mathcal{C}_{i}(a)$ . Following the standard bandit assumption that an arm can be ”not pulled” at no cost, we set $\mathcal{C}_{i}(0)=0$ . The probability of transitioning from one state to another given an action is specified by the function $T_{i}:\mathcal{S}_{i}\times\mathcal{A}_{i}\times\mathcal{S}_{i}\rightarrow[0,1]$ . In line with previous work (Zhao et al. 2023; Elmachtoub, Gupta, and Zhao 2023; Elmachtoub et al. 2023; Zhao et al. 2022) we assume that $\mathcal{S}_{i},\mathcal{A}_{i}$ , and $\mathcal{C}_{i}$ are the same for all arms $i\in[N]$ and omit the subscript $i$ for simplicity. Each arm has a feature vector $\bm{z}_{i}\in\mathbb{R}^{m}$ that provides useful information about the transition dynamics of arm $i$ . We define the arm with the most similar features with arm $i$ as $\nu_{i}=\arg\min_{j}\|z_{j}-z_{i}\|_{2}$ .

We consider a constrained global resource setting where, at every timestep $t\in[H]$ , where $H$ is the horizon, the total cost of actions taken is no greater than a given budget $B$ . Under this constraint, the central planner selects one action for each of the $N$ arms at each timestep. Formally, the central planner has a policy $\pi$ to maximize its expected discounted return $G(\pi)=\mathbb{E}_{\pi}[\sum_{t=1}^{H}\beta^{t}\sum_{i\in[N]}R_{i}(s_{i}^{t})]$ , where $s_{i}^{t}$ is the state of arm $i$ at time step $t$ . $\pi$ is typically computed by solving the constrained Bellman equation (Hawkins 2003; Killian, Perrault, and Tambe 2021; Killian et al. 2022):

		$\displaystyle J(\bm{s})=\sum_{i=1}^{N}R_{i}\left(s_{i}\right)+\beta\,\max_{\bm{a}}\mathbb{E}_{\bm{s}^{\prime}}\left[J\left(\bm{s}^{\prime}\right)\mid\bm{s},\bm{a}\right]$
		$\displaystyle\text{ s.t. }\sum_{i=1}^{N}\mathcal{C}(a_{i})\leq B.$		(1)

2.2 Reinforcement Learning in RMABs

Deep reinforcement learning has been proven to be efficient in solving the RMAB problem (Avrachenkov and Borkar 2022; Killian et al. 2022; Zhao et al. 2024a). To learn arm policies, we take the Lagrangian relaxation of the budget constraint (Hawkins 2003), defining

\displaystyle J(s,\lambda)=\frac{\lambda B}{1-\beta}+\sum_{i=1}^{N}\max_{j\in[A]}\left\{Q_{i}(s_{i},j)\right\},

(2)

where $Q_{i}$ is arm $i$ ’s Q function estimating its expected discount return of an action at a state. We use deep Q-learning and parameterize the action-value function $Q_{i}(s_{i},a_{i};\theta_{i})$ of each arm $i$ with $\theta_{i}$ , which is updated by the TD-loss (Sutton and Barto 2018; Killian, Perrault, and Tambe 2021):

	$\displaystyle\mathcal{L}_{\texttt{TD}}(\theta_{i})=$	$\displaystyle\mathbb{E}_{(s_{i},a_{i},s_{i}^{\prime})\sim\mathcal{D}}\Big{[}\Big{(}R_{i}(s_{i})-\lambda\mathcal{C}(a_{i})$
		$\displaystyle+\beta\max_{a}Q_{i}(s_{i}^{\prime},a;\theta_{i})-Q_{i}(s_{i},a_{i};\tilde{\theta}_{i})\Big{)}^{2}\Big{]},$		(3)

where $Q_{i}(\cdot;\tilde{\theta}_{i})$ is a target network whose parameters are periodically copied from $Q_{i}$ . The expectation means that the samples come from a replay buffer $\mathcal{D}$ .

To infer a central planner policy that adheres to the budget constraint, previous research (Killian et al. 2022) constructs a knapsack problem using $Q_{i}(s_{i},a_{i})$ as values and $\mathcal{C}(a_{i})$ as weights, within the constraints defined in the optimization problem 2.1. The solution to this knapsack problem identifies the joint action that maximizes the sum of the learned $Q_{i}$ values while conforming to the budget constraint. In this case, we denote the planner policy by $\pi(\bm{\theta})$ to highlight its dependency on the joint parameters of individual Q-functions $\bm{\theta}=(\theta_{1},\cdots,\theta_{N})$ .

3 RMAB Learning with Communication

In this section, we introduce our method of learning communication to share useful knowledge among arms. We first define the communication problem as a Multi-Agent MDP (Boutilier 1996) and address the following problems: (1) What are the messages? (Sec. 3.1) How do they affect local Q learning? (Sec. 3.2) (2) How to generate reward signals that measure the influence of communication? (Sec. 3.3) (3) How to use the communication reward to learn communication strategies? (Sec. 3.4) (4) Why and under what conditions can communication help learning? (Sec. 3.2 and 4)

3.1 Define the Communication Learning Problem

Akin to traditional deep RMAB learning algorithms (Fu et al. 2019; Biswas et al. 2021; Killian et al. 2022; Avrachenkov and Borkar 2022), our approach involves each arm learning its local Q networks $Q_{i}(s_{i},a_{i};\theta_{i})$ to predict the expected return of taking action $a_{i}$ in state $s_{i}$ . The central planner then selects centralized actions based on the local Q values.

When the data source is noisy, the local Q functions may be inaccurate, leading to the central planner choosing suboptimal actions. Figure 3 in Appendix C.3 shows such an example. To address this issue, we introduce a mechanism that allows each arm to learn a communication policy that diminishes the impact of noisy data and enhances the overall effectiveness of the central planner.

We define the communication learning problem within the framework of a Multi-Agent MDP (Boutilier 1996).

Definition 1 (The Communication MDP).

The communication MDP in the RMAB setting is modelled by a multi-agent MDP $\mathcal{M}^{\texttt{c}}=(\mathcal{S}^{\texttt{c}},I,\mathcal{A}^{\texttt{c}},T^{\texttt{c}},r^{\texttt{c}})$ . $I$ is the set of players, which in our case is the set of arms. $\mathcal{S}^{\texttt{c}}=\Theta$ is the state space, where $\Theta$ is the joint parameter space of Q networks. $\bm{\theta}=(\theta_{1},\cdots,\theta_{N})\in\mathcal{S}^{\texttt{c}}$ corresponds to joint Q function parameters. $\mathcal{A}^{\texttt{c}}=\{0,1\}^{I}$ is the action space. $a_{i}^{\texttt{c}}=1$ means arm $i$ decides to ask Q network parameters from the arm with the most similar features, $\nu_{i}$ , as the message. If $a_{i}^{\texttt{c}}=0$ , arm $i$ will not receive a message. The joint communication action $\bm{a}^{\texttt{c}}$ leads to a transition in $\bm{\theta}$ (Eq.6) and a reward $r^{\texttt{c}}$ (Eq. 7).

The definition of this Communication MDP is associated with the underlying RMAB in the sense that the Communication MDP solution affects the action selection of the central planner. After every $K$ learning epochs of $Q_{i}$ in the MDP $\mathcal{M}_{i}$ , parameters of arms’ Q functions are used as the state for this Communication MDP $\mathcal{M}^{\texttt{c}}$ . We assume that each arm $i$ in the system has the chance to communicate with the arm with the most similar features, denoted by arm $\nu_{i}$ . This results in $N$ possible communication channels (i.e. arm $i$ will receive a message if $a_{i}^{\texttt{c}}=1$ ). In this way, the set of players in $\mathcal{M}^{\texttt{c}}$ is the set of arms $[N]$ . In addition to selecting actions in the arm MDP $\mathcal{M}_{i}$ , each arm also selects actions in this Communication MDP $\mathcal{M}^{\texttt{c}}$ . These communication actions determine the communication channel activation pattern, which remains frozen for the next $K$ learning epochs. We will discuss the details in the following subsections.

An alternative design option is to adopt a denser communication scheme, i.e., an arm can receive messages from multiple other arms. As formally detailed in Proposition 4 in Sec. 4, under certain assumptions regarding concentratability (Munos 2003; Chen and Jiang 2019), our sparse communication scheme is shown to effectively learn near-optimal value functions with the same errors as the dense scheme. Therefore, we favor the sparse scheme due to its lower demands on communication bandwidth.

3.2 Communication Actions and Transition Dynamics

We now describe how the communication action $\bm{a}^{\texttt{c}}$ induces a state transition in the space $\mathcal{S}^{\texttt{c}}$ . Specifically, when $a_{i}^{\texttt{c}}$ is 1, arm $i$ receives a message from arm $\nu_{i}$ , which is the parameters of the local Q network of arm $\nu_{i}$ , $\theta_{\nu_{i}}$ . Using this message, arm $i$ formulates a behavior policy $\pi_{{\nu_{i}}}$ as follows:

\displaystyle\pi_{{\nu_{i}}}(\cdot|s_{i})=\texttt{SoftMax}\left(Q_{\nu_{i}}(s_{i},\cdot)\right).

(4)

Using behavior policy $\pi_{{\nu_{i}}}$ , arm $i$ navigates in its arm MDP $\mathcal{M}_{i}$ , collecting transition samples in a replay buffer $\mathcal{D}_{\pi_{\nu_{i}}}$ . This buffer is then used to update the local Q-network. Therefore, the TD-loss for arm $i$ with the communication channel $i$ activated is given by:

	$\displaystyle\mathcal{L}^{\texttt{c}}_{\texttt{TD}}(\theta_{i}\|a_{i}^{\texttt{c}}\textup{{=}}1)=$	$\displaystyle\mathbb{E}_{(s_{i},a_{i},s_{i}^{\prime})\sim\mathcal{D}_{\pi_{\nu_{i}}}}\Big{[}\Big{(}R_{i}(s_{i})-\lambda c_{a_{i}}$
		$\displaystyle+\beta\max_{a}Q_{i}(s_{i}^{\prime},a;\theta_{i})-Q_{i}(s_{i},a_{i};\tilde{\theta}_{i})\Big{)}^{2}\Big{]}.$

Alternatively, if arm $i$ opts not to receive communication, it engages in the $\epsilon$ -greedy strategy to explore the MDP $\mathcal{M}_{i}$ and generate the replay buffer. Therefore,

\displaystyle\mathcal{L}^{\texttt{c}}_{\texttt{TD}}(\theta_{i}|a_{i}^{\texttt{c}}\textup{{=}}0)=\mathcal{L}_{\texttt{TD}}(\theta_{i}).

(5)

As such, the updates to local Q parameters are dependent on the communication action:

\displaystyle\theta_{i}^{\prime}(a_{i}^{\texttt{c}})=\theta_{i}-\alpha\nabla_{\theta_{i}}\mathcal{L}^{\texttt{c}}_{\texttt{TD}}(\theta_{i}|a_{i}^{\texttt{c}}),

(6)

where $\alpha$ is the learning rate. Eq. 6 specifies the state transition dynamics of the Communication MDP.

3.3 Communication Reward

Lemma 5 and 6 in Sec. 4 reveal that some seemingly reasonable fixed communication patterns, like those from a non-noisy arm to a noisy arm, can lead to an exponential higher sample complexity for learning a near-optimal Q function in the noisy arm. In fact, as shown in Proposition 2 of Sec. 4, the influence of communication is determined by the state-action space coverage and the mixing time of the behavior policy $\pi_{\nu_{i}}$ on the receiver arms.

These findings highlight the intricate and dynamic nature of the communication learning problem in RMAB. In practice, to adaptively decide the optimal communication action, we propose that each arm learns a communication Q function $Q_{i}^{\texttt{c}}$ . This approach encounters obstacles in the RMAB setting, where clear, separate reward signals for each $Q_{i}^{\texttt{c}}$ are absent. The central planner policy instead bases action decisions on the updated Q-functions of all arms and the budget. The collaborative contribution of the joint communication action $\bm{a}^{\texttt{c}}$ to the learning objective is:

\displaystyle r^{\texttt{c}}(\bm{\theta},\bm{a}^{\texttt{c}})=G_{T}\left(\pi(\bm{\theta}^{\prime}(\bm{a}^{\texttt{c}}))\right)-G_{T}\left(\pi(\bm{\theta}^{\prime}(\bm{0}))\right).

(7)

Here, $G_{T}(\pi)=\frac{1}{T}\sum_{k=1}^{T}[\sum_{t=1}^{H}\beta^{t}\sum_{i\in[N]}R_{i}(s_{i}^{k,t})]$ is the empirical return under policy $\pi$ averaged over $T$ episodes, where $s_{i}^{k,t}$ is arm $i$ state at timestep $t$ of episode $k$ . $\pi(\bm{\theta}^{\prime}(\bm{a}^{\texttt{c}}))$ and $\pi(\bm{\theta}^{\prime}(\bm{0}))$ are the planner policies with the arm Q-functions updated with and without messages, respectively. The communication reward is defined as the difference in the planner’s return in these two cases.

3.4 Decomposed Communication Q Function

With the definition of the communication reward (Eq. 7), the simplest strategy of communication learning is to estimate a joint communication action-value function $Q^{\texttt{c}}(\bm{\theta},\bm{a}^{\texttt{c}})$ . However, the question is that there are $N$ communication channels, and even though our communication action is binary, there are $O(2^{N})$ possible joint actions $\bm{a}^{\texttt{c}}$ . With this exponential number of actions, it is challenging to conduct deep Q-learning, which requires a maximization operation over the whole action space.

We propose to solve this problem by value decomposition (Sunehag et al. 2018). Specifically, we represent the global communication Q-function as a summation of the channel-level, individual communication Q-functions:

\displaystyle Q^{\texttt{c}}(\bm{\theta},\bm{a}^{\texttt{c}};\bm{\xi})=\sum\nolimits_{i\in[N]}Q_{i}^{\texttt{c}}(\bm{\theta},a_{i}^{\texttt{c}};\xi_{i}).

(8)

Here, $\xi_{i}$ is the parameters of the channel-level communication Q-function, and $\bm{\xi}$ is the joint parameter consisting of all $\xi_{i}$ . Such a decomposition satisfies an important property:

\displaystyle\arg\max_{\bm{a}^{\texttt{c}}}Q^{\texttt{c}}=\left(\arg\max_{a_{1}^{\texttt{c}}}Q_{1}^{\texttt{c}},\cdots,\arg\max_{a_{N}^{\texttt{c}}}Q_{N}^{\texttt{c}}\right),

(9)

which is because Eq. 8 allows $\partial Q^{\texttt{c}}/\partial Q_{i}^{\texttt{c}}\geq 0$ . This means that the joint action that maximizes the $Q^{\texttt{c}}$ can be obtained by independently selecting the local communication action that maximize $Q_{i}$ . As a whole, network $Q^{\texttt{c}}(\cdot;\bm{\xi})$ is a standard Q-network conditioned on states $\bm{\theta}$ and actions $\bm{a}^{\texttt{c}}$ , allowing it to be trained using joint rewards $r^{\texttt{c}}$ (Eq. 7). On a finer level, it has a decomposable structure simplifying the $\arg\max$ calculation. Therefore, $Q^{\texttt{c}}(\cdot;\bm{\xi})$ can be trained by the following TD-loss:

	$\displaystyle\mathcal{L}^{\texttt{d}}_{\texttt{TD}}(\bm{\xi})=\mathbb{E}_{(\bm{\theta},\bm{a}^{\texttt{c}},\bm{\theta}^{\prime})\sim\mathcal{D}}\Big{[}\Big{(}r^{\texttt{c}}(\bm{\theta},\bm{a}^{\texttt{c}})$
	$\displaystyle+\beta\sum\nolimits_{i\in[N]}\max_{a}Q_{i}^{\texttt{c}}(\bm{\theta}^{\prime},a;\tilde{\xi}_{i})-\sum\nolimits_{i\in[N]}Q_{i}^{\texttt{c}}(\bm{\theta},a_{i}^{\texttt{c}};\xi_{i})\Big{)}^{2}\Big{]}.$

$Q_{i}^{\texttt{c}}(\cdot;\tilde{\xi}_{i})$ is a target network with parameters periodically copied from $Q_{i}^{\texttt{c}}$ , and $\mathcal{D}$ is a replay buffer. In this way, we implicitly achieve credit assignment, because we use a joint reward $r^{\texttt{c}}$ for training but get a set of Q functions for individual arms, which can recover the best joint action locally.

4 Theoretical Analysis

We begin by investigating the influence of communication and establishing the specific conditions under which communication can reduce the sample complexity for the receiver arm in learning a near-optimal Q function. We define the minimum state-action occupancy probability,

\displaystyle\mu_{\texttt{min}}:=\min_{s,a}\mu_{\pi_{\nu_{i}}}(s,a),

(10)

as the lowest probability of any state-action pair under the stationary distribution $\mu_{\pi_{\nu_{i}}}$ induced by the behavior policy $\pi_{\nu_{i}}$ on the receiver MDP, where $\nu_{i}$ is the arm with the most similar feature to arm $i$ . We further define the mixing time (Hsu, Kontorovich, and Szepesvári 2015; Li et al. 2020),

\displaystyle t_{\texttt{mix}}=\min\left\{t|\max_{s_{0},a_{0}}\sup_{s,a}|P^{t}(s,a|s_{0},a_{0})-\mu_{\pi_{\nu_{i}}}(s,a)|\leq\frac{1}{4}\right\},

(11)

where $P^{t}$ is the state-action distribution at timestep $t$ given the initial state and action. The mixing time $t_{\texttt{mix}}$ captures how fast the behavior policy decorrelates from the initial state. We now show the condition under which communication can be helpful.

Proposition 2 (Useful Communication).

For the Q-learning of arm $i$ , when

\displaystyle\mu_{\texttt{min}}>\frac{2(1-\beta)^{2}}{|\mathcal{S}||\mathcal{A}|},\ \ \text{and}\ \ t_{\texttt{mix}}\leq\frac{1}{\epsilon_{e}^{2}(1-\beta)^{4}},

(12)

with a probability at least $(1-\delta)^{2}$ for any $0<\delta<1$ , compared to learning without communication, choosing to receive communication and using behavior policy $\pi_{\nu_{i}}$ leads to better sample complexity in the worst case for learning $\epsilon_{e}$ -optimal Q function such that $\|Q_{i}(s,a)-Q_{i}^{*}(s,a)\|<\epsilon_{e},\forall(s,a)$ .

The Proof is detailed in Appendix C.2 Proposition 2 shows that communication can accelerate arm $i$ ’s Q-learning as long as $\pi_{\nu_{i}}$ visits every state-action pair with a small probability and is not too slowly disentangling from the initial state. This finding is solely dependent upon the characteristics of the behavior policy in the receiver MDP $\mathcal{M}_{i}$ and does not require the sender or the receiver to be non-noisy or noisy.

We then analyze the effectiveness of a sparse communication strategy. Specifically, we compare our setup where each arm $i$ receives messages from an arm $\nu_{i}$ against a dense communication scheme where arm $i$ receives messages from $d$ other arms. In the dense communication scheme, the behavior policy $\pi_{\bar{\nu}_{i}}$ is the average of policies from all senders. Specifically, we collect the same number of samples using each sender’s policy and amalgamate them into a unified experience replay buffer.

The following assumption on concentratability asserts that both $\pi_{\nu_{i}}$ and $\pi_{\bar{\nu}_{i}}$ provide a reasonable coverage over the state-action space.

Assumption 3 (Concentratability Coefficient).

There exists $C_{\nu_{i}}<\infty$ , $C_{\bar{\nu}_{i}}<\infty$ s.t.

\displaystyle\frac{v(s,a)}{\mu_{\bar{\nu}_{i}}(s,a)}\leq C_{\bar{\nu}_{i}}\text{ and }\frac{v(s,a)}{\mu_{\nu_{i}}(s,a)}\leq C_{\nu_{i}}\text{, for }\forall(s,a)\in(\mathcal{S},\mathcal{A}),

for any admissible distribution $v$ that can be generated by following some policy for some timesteps in the MDP $\mathcal{M}_{i}$ . Here $\mu_{\nu_{i}}$ and $\mu_{\bar{\nu}_{i}}$ are the stationary distribution of $\pi_{\nu_{i}}$ and $\pi_{\bar{\nu}_{i}}$ on $\mathcal{M}_{i}$ , respectively.

Under this assumption, we can prove that the sparse communication scheme can also learn a near-optimal Q function as under the dense scheme. The key observation that underpins the proof is $\frac{1}{d}\mu_{\nu_{i}}(s,a)\leq\mu_{\bar{\nu}_{i}}(s,a)$ , where $d<N$ is the maximum number of other arms that can send messages to arm $i$ . The convergence of the communication learning under the dense scheme implies that $C_{\bar{\nu}_{i}}$ is finite, which further indicates the finiteness of $C_{\nu_{i}}$ and thus the convergence of the communication learning of the sparse scheme.

Proposition 4 (Sparse Communication).

For a dense communication strategy, there exists a sparse communication strategy which approximates the optimization step in Equation (6) faithfully. Additionally, both these two strategies can learn near-optimal value functions, and the sparse communication scheme requires at most $O(d)$ times more samples to achieve this.

Detailed proof is deferred to Appendix C.1. We next show that a communication channel from a non-noisy arm to a noisy arm can be either helpful (Lemma 5) or detrimental (Lemma 6), depending on the specific structures of the sender and receiver MDPs.

Lemma 5.

There exists a family of RMABs where, without communication from non-noisy arms to noisy arms, the error in Q values of a noisy arm could be $\max_{s,a}|Q_{i}^{*}(s,a)-Q_{i}(s,a)|=O\left((\frac{1}{\beta})^{|\mathcal{S}|}\right)$ even after $O\left((\frac{1}{\epsilon})^{|\mathcal{S}|}\right)$ epochs. $\epsilon\in(0,1)$ is the exploration rate in $\epsilon$ -greedy exploration scheme. By contrast, this Q error can be reduced to 0 with communication from non-noisy arms to noisy arms.

Lemma 6.

There exists a family of RMABs where, with communication from a non-noisy arm to a noisy arm, we need $O\left((\frac{1}{\epsilon})^{|\mathcal{S}|}\right)$ times more samples to learn a near-optimal Q function in the noisy arm than without communication.

The proofs of Lemma 5 and 6 can be found in Appendix C.3. The observation that the errors in the Q function and the sample complexity of Q learning can increase exponentially with the number of states in the worst case scenarios highlights the need to move away from predefined communication channels. Instead, communication Q-learning provides a more adaptable and effective approach.

5 Experiments

Refer to caption — Figure 1: Performance (interquartile mean (Agarwal et al. 2021) and standard error of return over 200 random seeds) of our method, baselines, and ablations in three environments with different numbers of arms $N$ and resource budgets $B$ . Communication learning starts at epoch 200. We present results on two additional baseline in Tables 1 and 2 (see Appendix B for more ablations).

We present experimental evaluations and ablation studies to investigate: (1) The learning performance of our communication method compared to baselines; (2) The characteristics of learnt communication strategies; (3) The robustness of our method in terms of resource budgets, the number of arms, noise types and levels. We discuss additional experimental details and hyperparameters in Appendix B.

We evaluate our methods on the following domains. In all domains, 80% of the arms are noisy, and noisy arms observe reward drawn from a perturbed reward function $\tilde{R}(s)=R(s)+\epsilon(s)$ , where $\epsilon\sim\mathcal{N}(0,1)$ . In ablation studies (Tables 2 and 3), we test various numbers of noisy arms and noise type.

Synthetic: We consider a synthetic RMAB with continuous states. For state $s_{i}\in[0,1]$ and action $a\in\{0,1\}$ of arm $i$ , the next state $s_{i}^{\prime}$ is determined by: $s_{i}^{\prime}=clip\left(s_{i}+\mathcal{N}(\mu_{ia},\sigma_{ia}),0,1\right)$ . Here transition parameters are sampled uniformly, specifically $\mu_{i1}=-\mu_{i0}\in[-0.2,0.2]$ , $\sigma_{i1}=\sigma_{i0}\in[0.1,0.2]$ .

SIS Epidemic Model (Yaesoubi and Cohen 2011; Killian et al. 2022; Zhao et al. 2024a): Each arm $p$ represents a subpopulation in a geographic region, with the number of uninfected individuals being the state $s_{p}$ . We implement a budget-constrained intervention framework with three actions: $a_{0}$ represents no intervention, $a_{1}$ requires weekly COVID testing, and $a_{2}$ involves distributing face masks. Actions change the disease transmission parameters in a targeted manner, reflecting their real-world public-health impacts. The reward noise is due to variability in data collection, which may be caused by poor COVID testing equipment (Embrett et al. 2022; Schmitt-Grohé, Teoh, and Uribe 2020) and influence of authorities or impacted populations (keeping the COVID numbers low) (Paulus et al. 2023).

ARMMAN: The dataset is collected by ARMMAN (ARMMAN 2019), an NGO in India aimed at improving health awareness for a million expectant and new mothers via automated voice messaging programs. Each arm represents a cluster of mothers in a geographical location. Actions are sending a health worker to visit mothers in a location to improve engagement. States, continuous valued in [0,1], represent the average engagement of mothers in a location (Danassis et al. 2023). The transition dynamics is established empirically from the data (data usage and consent are in Appendix A.2). Systematic reward errors are present due to difference in infrastructure such as cellular network and electricity (Kumar et al. 2022) (e.g. a geographic location may have poor cellular signals or frequent power outage, causing a systematic underestimation of mothers’ engagement).

Method	100%	110%	120%	100%	110%	120%	100%	110%	120%
	Synthetic (N=21, b=15)			SIS (N=32, B=16)			ARMMAN (N=48, B=20)
$\mathtt{No\ Noise}$	100%	100%	100%	100%	100%	100%	100%	100%	100%
$\mathtt{Comm}$	75%	53%	38%	63%	55%	51%	77%	72%	68%
$\mathtt{No\ Comm}$	62%	39%	24%	55%	45%	42%	63%	59%	52%
$\mathtt{Fixed/Comm}$	21%	6%	12%	56%	42%	46%	24%	15%	27%
$\mathtt{Nearest\ Neighbor\ Comm}$	44%	20%	3%	32%	22%	23%	46%	43%	31%
$\mathtt{Random\ Comm}$	-24%	-45%	-58%	22%	13%	9%	-22%	-26%	-35%

Table 1: Average returns (100 seeds) with increased noise, normalized with respect to

\mathtt{No\ Noise}

and

\mathtt{Comm}

(ours) before communication (see Remark 7 in Appendix B). We test Gaussian Noise

\mathcal{N}(0,\sigma)

x\%

noise level means

\sigma=x/100

Method	15	16	17	22	24	26	34	36	38
	Synthetic (N=21, B=15)			SIS (N=32, B=16)			ARMMAN (N=48, B=20)
$\mathtt{No\ Noise}$	100%	100%	100%	100%	100%	100%	100%	100%	100%
$\mathtt{Comm}$	78%	78%	75%	69%	71%	63%	82%	80%	77%
$\mathtt{No\ Comm}$	67%	66%	62%	61%	61%	55%	68%	59%	63%
$\mathtt{Fixed/Comm}$	63%	45%	21%	65%	62%	56%	43%	49%	24%
$\mathtt{Nearest\ Neighbor\ Comm}$	42%	47%	44%	11%	29%	32%	47%	37%	46%
$\mathtt{Random\ Comm}$	-46%	-32%	-24%	-8%	17%	22%	-60%	-30%	-22%

Table 2: Performance (return averaged across 100 random seeds) with numbers of noisy arms, normalized with respect to

\mathtt{No\ Noise}

and

\mathtt{Comm}

(ours) before communication starts (see Remark 7 in Appendix B).

Learning Performance. Figure 1 shows the experimental results in $\mathtt{Synthetic}$ , $\mathtt{SIS}$ , and $\mathtt{ARMMAN}$ where we compare our method $\mathtt{Comm\ (Ours)}$ against various baselines and ablations. $\mathtt{No\ Comm}$ is an ablation where the RMAB with noisy arms is trained without communication. $\mathtt{Fixed\ Comm}$ involves a preset, fixed communication pattern where at each timestep, each noisy arm receives information from a similar ( $\ell_{2}$ distance of features), noise-free arm. $\mathtt{Fixed\ Comm}$ uses oracle information about which arms are noisy, which is not known by other methods. $\mathtt{Nearest\ Neighbor\ Comm}$ is a baseline strategy where an arm always receives communication from the nearest neighbor in feature space, regardless of whether the sender/receiver is a noisy arm. We include two additional baselines in Tables 1 and 2.

Our method significantly outperforms $\mathtt{No\ Comm}$ , $\mathtt{Fixed\ Comm}$ , and $\mathtt{Nearest\ neighbor\ Comm}$ . Notably, the $\mathtt{Fixed\ Comm}$ strategy, despite the oracle information, often performs worse than $\mathtt{No\ Comm}$ . This observation aligns with our theoretical analysis in Proposition 2 showing that communication channels from non-noisy to noisy arms are not always helpful. Our method consistently outperforms across various number of arms, budgets, and domains. For instance, in the Synthetic environment with N=21 and B=15, our method achieves a return of approximately 10 at epoch 600, compared to about 8 for the no communication (No Comm) baseline, representing an 20 percentage point improvement. Similarly, in the ARMMAN environment with N=48 and B=20, our method reaches a return of 15, outperforming the No Comm baseline’s return of approximately 12.5, which is a 20 percentage point increase. These results demonstrate the effectiveness of our communication algorithm in handling systematic errors and enhancing learning in noisy RMAB environments.

Communication Pattern. Figure 2 illustrates how our learnt communication strategy evolves. We make several observations. First, as the number of learning epochs increase, we observe more information are sent from non-noisy arms to noisy arms. During the early stages of learning, when arms are primarily exploring, communication from noisy arms to non-noisy arms can be beneficial, as shown in Proposition 2. As the learning progresses, the information from noisy sources may become less helpful as the arms have explored sufficiently. As a result, the proportion of communication from non-noisy arms increases. This observation is consistent across the three experimental environments.

Second, notice the percentage of noise-free receivers do not drop to zero and noise-free senders are not at 100% even at the end of training, indicating value of communication from noisy arms. Finally, the differing rates of noise-free arms as senders and receivers across domains indicate that our communication approach adapts to the domain.

Robustness. Tables 1 and 2 show the performance of different methods under various settings (see Appendix B for more ablations). In $\mathtt{No\ Noise}$ , all arms receive accurate, true reward signals without any noise. Arms learn individually on their own data without any communication. In $\mathtt{Random\ Comm}$ , each arm receives communication from a randomly chosen arm. The results demonstrate that our method is robust under increasing noise levels, numbers of noisy arms, and different noise distributions. In contrast, the performance of baselines $\mathtt{No\ Comm}$ and $\mathtt{Fixed\ Comm}$ changes significantly when the environment changes. In particular, on Synthetic, when the number of noisy arms increases from 15 to 17 (Table 2), our $\mathtt{Comm}$ method has only 3% drop in performance, while $\mathtt{Fixed\ Comm}$ has 42 percentage points drop in performance, despite that $\mathtt{Fixed\ Comm}$ uses ground truth information about which arms are noisy that is not known to our $\mathtt{Comm}$ method. Similar trends can be seen on SIS and ARMMAN. These results demonstrate that, compared to a fixed communication protocol with access to oracle information, our approach for communication learning provides better robustness. Additionally, the performance of $\mathtt{No\ Comm}$ supports the argument that conventional RL methods struggle in noisy RMABs.

6 Closing Remarks

The study of RMABs in environments with noisy and unreliable data sources underscores the need for innovative solutions in RL methods. We introduce a novel communication learning algorithm to alleviate the influence of noisy data so that the central planner can make better coordination decisions. Theoretically, we prove that communication can help reduce errors in value function approximation if the behavioral policy’s state-action occupancy probability is lower bounded by a suitable constant. Empirically, we show that our approach significantly improves RMAB performance.

Acknowledgments

The work is supported by Harvard HDSI funding

References

Agarwal et al. (2021) Agarwal, R.; Schwarzer, M.; Castro, P. S.; Courville, A. C.; and Bellemare, M. 2021. Deep reinforcement learning at the edge of the statistical precipice. Advances in neural information processing systems, 34: 29304–29320.
ARMMAN (2019) ARMMAN. 2019. Assessing the Impact of Mobile-based Intervention on Health Literacy among Pregnant Women in Urban India. https://armman.org/wp-content/uploads/2019/09/Sion-Study-Abstract.pdf. Accessed: 2022-08-12.
Avrachenkov and Borkar (2022) Avrachenkov, K. E.; and Borkar, V. S. 2022. Whittle index based Q-learning for restless bandits with average reward. Automatica, 139: 110186.
Bagheri and Scaglione (2015) Bagheri, S.; and Scaglione, A. 2015. The restless multi-armed bandit formulation of the cognitive compressive sensing problem. IEEE Transactions on Signal Processing, 63(5): 1183–1198.
Behari et al. (2024) Behari, N.; Zhang, E.; Zhao, Y.; Taneja, A.; Nagaraj, D.; and Tambe, M. 2024. A Decision-Language Model (DLM) for Dynamic Restless Multi-Armed Bandit Tasks in Public Health. arXiv preprint arXiv:2402.14807.
Biswas et al. (2021) Biswas, A.; Aggarwal, G.; Varakantham, P.; and Tambe, M. 2021. Learn to intervene: An adaptive learning policy for restless bandits in application to preventive healthcare. arXiv preprint arXiv:2105.07965.
Boehmer et al. (2024) Boehmer, N.; Zhao, Y.; Xiong, G.; Rodriguez-Diaz, P.; Cibrian, P. D. C.; Ngonzi, J.; Boatin, A.; and Tambe, M. 2024. Optimizing vital sign monitoring in resource-constrained maternal care: An rl-based restless bandit approach. arXiv preprint arXiv:2410.08377.
Böhmer, Kurin, and Whiteson (2020) Böhmer, W.; Kurin, V.; and Whiteson, S. 2020. Deep coordination graphs. In International Conference on Machine Learning, 980–991. PMLR.
Bottou (2010) Bottou, L. 2010. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010: 19th International Conference on Computational StatisticsParis France, August 22-27, 2010 Keynote, Invited and Contributed Papers, 177–186. Springer.
Boutilier (1996) Boutilier, C. 1996. Planning, learning and coordination in multiagent decision processes. In TARK, volume 96, 195–210. Citeseer.
Bukharin et al. (2024) Bukharin, A.; Li, Y.; Yu, Y.; Zhang, Q.; Chen, Z.; Zuo, S.; Zhang, C.; Zhang, S.; and Zhao, T. 2024. Robust multi-agent reinforcement learning via adversarial regularization: Theoretical foundation and stable algorithms. Advances in Neural Information Processing Systems, 36.
Chen and Jiang (2019) Chen, J.; and Jiang, N. 2019. Information-theoretic considerations in batch reinforcement learning. In International Conference on Machine Learning, 1042–1051. PMLR.
Christianos, Schäfer, and Albrecht (2020) Christianos, F.; Schäfer, L.; and Albrecht, S. 2020. Shared experience actor-critic for multi-agent reinforcement learning. Advances in neural information processing systems, 33: 10707–10717.
Danassis et al. (2023) Danassis, P.; Verma, S.; Killian, J. A.; Taneja, A.; and Tambe, M. 2023. Limited resource allocation in a non-Markovian world: the case of maternal and child healthcare. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, 5950–5958.
Das et al. (2017) Das, A.; Kottur, S.; Moura, J. M.; Lee, S.; and Batra, D. 2017. Learning cooperative visual dialog agents with deep reinforcement learning. In Proceedings of the IEEE International Conference on Computer Vision, 2951–2960.
Dong et al. (2021) Dong, H.; Wang, T.; Liu, J.; Han, C.; and Zhang, C. 2021. Birds of a feather flock together: A close look at cooperation emergence via multi-agent rl. arXiv preprint arXiv:2104.11455.
Dong et al. (2022) Dong, H.; Wang, T.; Liu, J.; and Zhang, C. 2022. Low-rank modular reinforcement learning via muscle synergy. Advances in Neural Information Processing Systems, 35: 19861–19873.
Dong et al. (2023) Dong, H.; Zhang, J.; Wang, T.; and Zhang, C. 2023. Symmetry-aware robot design with structured subgroups. In International Conference on Machine Learning, 8334–8355. PMLR.
Dubrow (2022) Dubrow, J. K. 2022. Local data and upstream reporting as sources of error in the administrative data undercount of Covid 19. International Journal of Social Research Methodology, 25(4): 471–476.
Dütting et al. (2024) Dütting, P.; Feng, Z.; Narasimhan, H.; Parkes, D. C.; and Ravindranath, S. S. 2024. Optimal auctions through deep learning: Advances in differentiable economics. Journal of the ACM, 71(1): 1–53.
Dwork, Roth et al. (2014) Dwork, C.; Roth, A.; et al. 2014. The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science, 9(3–4): 211–407.
Elmachtoub, Gupta, and Zhao (2023) Elmachtoub, A.; Gupta, V.; and Zhao, Y. 2023. Balanced off-policy evaluation for personalized pricing. In International Conference on Artificial Intelligence and Statistics, 10901–10917. PMLR.
Elmachtoub et al. (2023) Elmachtoub, A. N.; Lam, H.; Zhang, H.; and Zhao, Y. 2023. Estimate-then-optimize versus integrated-estimationoptimization: A stochastic dominance perspective. arXiv preprint arXiv:2304.06833.
Embrett et al. (2022) Embrett, M.; Sim, S. M.; Caldwell, H. A.; Boulos, L.; Yu, Z.; Agarwal, G.; Cooper, R.; Aj, A. J. G.; Bielska, I. A.; Chishtie, J.; et al. 2022. Barriers to and strategies to address COVID-19 testing hesitancy: a rapid scoping review. BMC public health, 22(1): 750.
Foerster et al. (2016) Foerster, J.; Assael, I. A.; de Freitas, N.; and Whiteson, S. 2016. Learning to communicate with deep multi-agent reinforcement learning. In Advances in Neural Information Processing Systems, 2137–2145.
Fu et al. (2019) Fu, J.; Nazarathy, Y.; Moka, S.; and Taylor, P. G. 2019. Towards q-learning the whittle index for restless bandits. In 2019 Australian & New Zealand Control Conference (ANZCC), 249–254. IEEE.
Fulcher et al. (2020) Fulcher, I.; Hedt, K.; Marealle, S.; Tibaijuka, J.; Abdalla, O.; Hofmann, R.; Layer, E.; Mitchell, M.; and Hedt-Gauthier, B. 2020. Errors in estimated gestational ages reduce the likelihood of health facility deliveries: results from an observational cohort study in Zanzibar. BMC Health Services Research, 20: 1–10.
Ghosh et al. (2023) Ghosh, A.; Nagaraj, D.; Jain, M.; and Tambe, M. 2023. Indexability is Not Enough for Whittle: Improved, Near-Optimal Algorithms for Restless Bandits. In Proceedings of the 2023 International Conference on Autonomous Agents and Multiagent Systems, 1294–1302.
Guo et al. (2022) Guo, J.; Chen, Y.; Hao, Y.; Yin, Z.; Yu, Y.; and Li, S. 2022. Towards comprehensive testing on the robustness of cooperative multi-agent reinforcement learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 115–122.
Hartford, Wright, and Leyton-Brown (2016) Hartford, J. S.; Wright, J. R.; and Leyton-Brown, K. 2016. Deep learning for predicting human strategic behavior. Advances in neural information processing systems, 29.
Hawkins (2003) Hawkins, J. T. 2003. A Langrangian decomposition approach to weakly coupled dynamic optimization problems and its applications. Ph.D. thesis, Massachusetts Institute of Technology.
Hodge and Glazebrook (2015) Hodge, D. J.; and Glazebrook, K. D. 2015. On the asymptotic optimality of greedy index heuristics for multi-action restless bandits. Advances in Applied Probability, 47(3): 652–667.
Hoshen (2017) Hoshen, Y. 2017. Vain: Attentional multi-agent predictive modeling. In Advances in Neural Information Processing Systems, 2701–2711.
Hossain et al. (2024) Hossain, S.; Wang, T.; Lin, T.; Chen, Y.; Parkes, D. C.; and Xu, H. 2024. Multi-Sender Persuasion: A Computational Perspective. In Forty-first International Conference on Machine Learning.
Hsu, Kontorovich, and Szepesvári (2015) Hsu, D. J.; Kontorovich, A.; and Szepesvári, C. 2015. Mixing time estimation in reversible markov chains from a single sample path. Advances in neural information processing systems, 28.
Huang, Mordatch, and Pathak (2020) Huang, W.; Mordatch, I.; and Pathak, D. 2020. One policy to control them all: Shared modular policies for agent-agnostic control. In International Conference on Machine Learning, 4455–4464. PMLR.
Jaques et al. (2019) Jaques, N.; Lazaridou, A.; Hughes, E.; Gulcehre, C.; Ortega, P.; Strouse, D.; Leibo, J. Z.; and De Freitas, N. 2019. Social influence as intrinsic motivation for multi-agent deep reinforcement learning. In International conference on machine learning, 3040–3049. PMLR.
Jiang et al. (2019) Jiang, J.; Dun, C.; Huang, T.; and Lu, Z. 2019. Graph Convolutional Reinforcement Learning. In International Conference on Learning Representations.
Jiang and Lu (2018) Jiang, J.; and Lu, Z. 2018. Learning attentional communication for multi-agent cooperation. In Advances in Neural Information Processing Systems, 7254–7264.
Jin et al. (2021) Jin, C.; Liu, Q.; Wang, Y.; and Yu, T. 2021. V-Learning–A Simple, Efficient, Decentralized Algorithm for Multiagent RL. arXiv preprint arXiv:2110.14555.
Kang, Wang, and de Melo (2020) Kang, Y.; Wang, T.; and de Melo, G. 2020. Incorporating pragmatic reasoning communication into emergent language. Advances in neural information processing systems, 33: 10348–10359.
Kang et al. (2022) Kang, Y.; Wang, T.; Yang, Q.; Wu, X.; and Zhang, C. 2022. Non-linear coordination graphs. Advances in Neural Information Processing Systems, 35: 25655–25666.
Killian et al. (2021) Killian, J. A.; Biswas, A.; Shah, S.; and Tambe, M. 2021. Q-Learning Lagrange Policies for Multi-Action Restless Bandits. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, KDD ’21, 871–881. New York, NY, USA: Association for Computing Machinery. ISBN 9781450383325.
Killian, Perrault, and Tambe (2021) Killian, J. A.; Perrault, A.; and Tambe, M. 2021. Beyond ”To Act or Not to Act”: Fast Lagrangian Approaches to General Multi-Action Restless Bandits. In AAMAS, 710–718. UK: AAMAS.
Killian et al. (2022) Killian, J. A.; Xu, L.; Biswas, A.; and Tambe, M. 2022. Restless and uncertain: Robust policies for restless bandits via deep multi-agent reinforcement learning. In Uncertainty in Artificial Intelligence, 990–1000. PMLR.
Kuba et al. (2021) Kuba, J. G.; Chen, R.; Wen, M.; Wen, Y.; Sun, F.; Wang, J.; and Yang, Y. 2021. Trust Region Policy Optimisation in Multi-Agent Reinforcement Learning. In International Conference on Learning Representations.
Kumar et al. (2022) Kumar, S. K. A.; Ihita, G. V.; Chaudhari, S.; and Arumugam, P. 2022. A survey on rural internet connectivity in India. In 2022 14th International Conference on COMmunication Systems & NETworkS (COMSNETS), 911–916. IEEE.
Kuo et al. (2020) Kuo, K.; Ostuni, A.; Horishny, E.; Curry, M. J.; Dooley, S.; Chiang, P.-y.; Goldstein, T.; and Dickerson, J. P. 2020. Proportionnet: Balancing fairness and revenue for auction design with deep learning. arXiv preprint arXiv:2010.06398.
Kurin et al. (2020) Kurin, V.; Igl, M.; Rocktäschel, T.; Boehmer, W.; and Whiteson, S. 2020. My Body is a Cage: the Role of Morphology in Graph-Based Incompatible Control. In International Conference on Learning Representations.
Kwak et al. (2010) Kwak, J.-y.; Yang, R.; Yin, Z.; Taylor, M. E.; and Tambe, M. 2010. Teamwork and coordination under model uncertainty in DEC-POMDPs. In Workshops at the Twenty-Fourth AAAI Conference on Artificial Intelligence.
Lazaridou, Peysakhovich, and Baroni (2017) Lazaridou, A.; Peysakhovich, A.; and Baroni, M. 2017. Multi-agent cooperation and the emergence of (natural) language. In Proceedings of the International Conference on Learning Representations (ICLR).
Lee, Lavieri, and Volk (2019) Lee, E.; Lavieri, M. S.; and Volk, M. 2019. Optimal screening for hepatocellular carcinoma: A restless bandit model. Manufacturing & Service Operations Management, 21(1): 198–212.
Li et al. (2021) Li, C.; Wang, T.; Wu, C.; Zhao, Q.; Yang, J.; and Zhang, C. 2021. Celebrating diversity in shared multi-agent reinforcement learning. Advances in Neural Information Processing Systems, 34: 3991–4002.
Li et al. (2020) Li, G.; Wei, Y.; Chi, Y.; Gu, Y.; and Chen, Y. 2020. Sample complexity of asynchronous Q-learning: Sharper analysis and variance reduction. Advances in neural information processing systems, 33: 7031–7043.
Li et al. (2019) Li, S.; Wu, Y.; Cui, X.; Dong, H.; Fang, F.; and Russell, S. 2019. Robust multi-agent reinforcement learning via minimax deep deterministic policy gradient. In Proceedings of the AAAI conference on artificial intelligence, volume 33, 4213–4220.
Littman (1994) Littman, M. L. 1994. Markov games as a framework for multi-agent reinforcement learning. In Machine learning proceedings 1994, 157–163. Elsevier.
Liu and Lai (2024) Liu, G.; and Lai, L. 2024. Efficient adversarial attacks on online multi-agent reinforcement learning. Advances in Neural Information Processing Systems, 36.
Mate et al. (2022) Mate, A.; Madaan, L.; Taneja, A.; Madhiwalla, N.; Verma, S.; Singh, G.; Hegde, A.; Varakantham, P.; and Tambe, M. 2022. Field Study in Deploying Restless Multi-Armed Bandits: Assisting Non-profits in Improving Maternal and Child Health. Proceedings of the AAAI Conference on Artificial Intelligence, 36: 12017–12025.
Modi, Mary, and Moy (2019) Modi, N.; Mary, P.; and Moy, C. 2019. Transfer restless multi-armed bandit policy for energy-efficient heterogeneous cellular network. EURASIP Journal on Advances in Signal Processing, 2019(1): 1–19.
Mordatch and Abbeel (2018) Mordatch, I.; and Abbeel, P. 2018. Emergence of grounded compositional language in multi-agent populations. In Thirty-Second AAAI Conference on Artificial Intelligence.
Munos (2003) Munos, R. 2003. Error bounds for approximate policy iteration. In Proceedings of the Twentieth International Conference on International Conference on Machine Learning, 560–567.
Nagaraj et al. (2023) Nagaraj, D. M.; Kowshik, S. S.; Agarwal, N.; Netrapalli, P.; and Jain, P. 2023. Multi-user reinforcement learning with low rank rewards. In International Conference on Machine Learning, 25627–25659. PMLR.
Nair et al. (2004) Nair, R.; Tambe, M.; Roth, M.; and Yokoo, M. 2004. Communications for improving policy computation in distributed POMDPs. In Proceedings of the Third International Joint Conference on Autonomous Agents and Multiagent Systems, 2004. AAMAS 2004., 1098–1105. IEEE.
Nakhleh et al. (2021) Nakhleh, K.; Ganji, S.; Hsieh, P.-C.; Hou, I.; Shakkottai, S.; et al. 2021. NeurWIN: Neural Whittle index network for restless bandits via deep RL. Advances in Neural Information Processing Systems, 34: 828–839.
Oliehoek, Amato et al. (2016) Oliehoek, F. A.; Amato, C.; et al. 2016. A concise introduction to decentralized POMDPs, volume 1. Springer.
Papadimitriou and Tsitsiklis (1999) Papadimitriou, C. H.; and Tsitsiklis, J. N. 1999. The Complexity of Optimal Queuing Network Control. Mathematics of Operations Research, 24(2): 293–305.
Paulus et al. (2023) Paulus, D.; de Vries, G.; Janssen, M.; and Van de Walle, B. 2023. Reinforcing data bias in crisis information management: The case of the Yemen humanitarian response. International Journal of Information Management, 72: 102663.
Peng et al. (2021) Peng, B.; Rashid, T.; Schroeder de Witt, C.; Kamienny, P.-A.; Torr, P.; Böhmer, W.; and Whiteson, S. 2021. Facmac: Factored multi-agent centralised policy gradients. Advances in Neural Information Processing Systems, 34: 12208–12221.
Peri et al. (2021) Peri, N.; Curry, M.; Dooley, S.; and Dickerson, J. 2021. Preferencenet: Encoding human preferences in auction design with deep learning. Advances in Neural Information Processing Systems, 34: 17532–17542.
Polyak and Juditsky (1992) Polyak, B. T.; and Juditsky, A. B. 1992. Acceleration of stochastic approximation by averaging. SIAM journal on control and optimization, 30(4): 838–855.
Qin et al. (2022) Qin, R.-J.; Chen, F.; Wang, T.; Yuan, L.; Wu, X.; Kang, Y.; Zhang, Z.; Zhang, C.; and Yu, Y. 2022. Multi-Agent Policy Transfer via Task Relationship Modeling. In Deep Reinforcement Learning Workshop NeurIPS 2022.
Rahme et al. (2021) Rahme, J.; Jelassi, S.; Bruna, J.; and Weinberg, S. M. 2021. A Permutation-Equivariant Neural Network Architecture For Auction Design. In Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Thirty-Third Conference on Innovative Applications of Artificial Intelligence, IAAI 2021, The Eleventh Symposium on Educational Advances in Artificial Intelligence, EAAI 2021, Virtual Event, February 2-9, 2021, 5664–5672. AAAI Press.
Rashid et al. (2018) Rashid, T.; Samvelyan, M.; Witt, C. S.; Farquhar, G.; Foerster, J.; and Whiteson, S. 2018. QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning. In International Conference on Machine Learning, 4292–4301.
Ravindranath et al. (2021) Ravindranath, S. S.; Feng, Z.; Li, S.; Ma, J.; Kominers, S. D.; and Parkes, D. C. 2021. Deep learning for two-sided matching. arXiv preprint arXiv:2107.03427.
Ravindranath, Jiang, and Parkes (2023) Ravindranath, S. S.; Jiang, Y.; and Parkes, D. C. 2023. Data Market Design through Deep Learning. In Oh, A.; Naumann, T.; Globerson, A.; Saenko, K.; Hardt, M.; and Levine, S., eds., Advances in Neural Information Processing Systems, volume 36, 6662–6689. Curran Associates, Inc.
Robbins and Monro (1951) Robbins, H.; and Monro, S. 1951. A stochastic approximation method. The annals of mathematical statistics, 400–407.
Ruiz-Hernández, Pinar-Pérez, and Delgado-Gómez (2020) Ruiz-Hernández, D.; Pinar-Pérez, J. M.; and Delgado-Gómez, D. 2020. Multi-machine preventive maintenance scheduling with imperfect interventions: A restless bandit approach. Computers & Operations Research, 119: 104927.
Schmitt-Grohé, Teoh, and Uribe (2020) Schmitt-Grohé, S.; Teoh, K.; and Uribe, M. 2020. COVID-19: testing inequality in New York City. Technical report, National Bureau of Economic Research.
Seow et al. (2024) Seow, R.; Zhao, Y.; Wood, D.; Tambe, M.; and Gonzalez, C. 2024. Improving the prediction of individual engagement in recommendations using cognitive models. arXiv preprint arXiv:2408.16147.
Shapley (1953) Shapley, L. S. 1953. Stochastic games. Proceedings of the national academy of sciences, 39(10): 1095–1100.
Shi et al. (2024) Shi, L.; Mazumdar, E.; Chi, Y.; and Wierman, A. 2024. Sample-Efficient Robust Multi-Agent Reinforcement Learning in the Face of Environmental Uncertainty. arXiv preprint arXiv:2404.18909.
Singh, Jain, and Sukhbaatar (2019) Singh, A.; Jain, T.; and Sukhbaatar, S. 2019. Learning when to Communicate at Scale in Multiagent Cooperative and Competitive Tasks. In Proceedings of the International Conference on Learning Representations (ICLR).
Son et al. (2019) Son, K.; Kim, D.; Kang, W. J.; Hostallero, D. E.; and Yi, Y. 2019. Qtran: Learning to factorize with transformation for cooperative multi-agent reinforcement learning. In International conference on machine learning, 5887–5896. PMLR.
Song, Wang, and Zhang (2019) Song, X.; Wang, T.; and Zhang, C. 2019. Convergence of Multi-Agent Learning with a Finite Step Size in General-Sum Games. In Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, 935–943.
Sukhbaatar, Fergus et al. (2016) Sukhbaatar, S.; Fergus, R.; et al. 2016. Learning multiagent communication with backpropagation. Advances in neural information processing systems, 29.
Sunehag et al. (2018) Sunehag, P.; Lever, G.; Gruslys, A.; Czarnecki, W. M.; Zambaldi, V.; Jaderberg, M.; Lanctot, M.; Sonnerat, N.; Leibo, J. Z.; Tuyls, K.; et al. 2018. Value-decomposition networks for cooperative multi-agent learning based on team reward. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, 2085–2087. International Foundation for Autonomous Agents and Multiagent Systems.
Sutton and Barto (2018) Sutton, R. S.; and Barto, A. G. 2018. Reinforcement learning: An introduction. MIT press.
Tacchetti et al. (2019) Tacchetti, A.; Strouse, D.; Garnelo, M.; Graepel, T.; and Bachrach, Y. 2019. A neural architecture for designing truthful and efficient auctions. arXiv preprint arXiv:1907.05181.
Taylor et al. (2019) Taylor, A.; Dusparic, I.; Guériau, M.; and Clarke, S. 2019. Parallel transfer learning in multi-agent systems: What, when and how to transfer? In 2019 International Joint Conference on Neural Networks (IJCNN), 1–8. IEEE.
Tripathi and Modiano (2019) Tripathi, V.; and Modiano, E. 2019. A Whittle Index Approach to Minimizing Functions of Age of Information. In 2019 57th Annual Allerton Conference on Communication, Control, and Computing (Allerton), 1160–1167. Allerton: Allerton.
Verloop (2016) Verloop, I. M. 2016. Asymptotically optimal priority policies for indexable and nonindexable restless bandits. The Annals of Applied Probability, 26(4): 1947–1995.
Verma et al. (2024) Verma, S.; Zhao, Y.; Shah, S.; Boehmer, N.; Taneja, A.; and Tambe, M. 2024. Group Fairness in Predict-Then-Optimize Settings for Restless Bandits.
Wang et al. (2023) Wang, K.; Verma, S.; Mate, A.; Shah, S.; Taneja, A.; Madhiwalla, N.; Hegde, A.; and Tambe, M. 2023. Scalable decision-focused learning in restless multi-armed bandits with application to maternal and child health. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, 12138–12146.
Wang et al. (2020a) Wang, T.; Dong, H.; Lesser, V.; and Zhang, C. 2020a. ROMA: Multi-Agent Reinforcement Learning with Emergent Roles. In Proceedings of the 37th International Conference on Machine Learning.
Wang et al. (2024) Wang, T.; Duetting, P.; Ivanov, D.; Talgam-Cohen, I.; and Parkes, D. C. 2024. Deep Contract Design via Discontinuous Networks. Advances in Neural Information Processing Systems, 36.
Wang et al. (2020b) Wang, T.; Gupta, T.; Mahajan, A.; Peng, B.; Whiteson, S.; and Zhang, C. 2020b. RODE: Learning Roles to Decompose Multi-Agent Tasks. In International Conference on Learning Representations.
Wang, Jiang, and Parkes (2024) Wang, T.; Jiang, Y.; and Parkes, D. C. 2024. GemNet: Menu-Based, strategy-proof multi-bidder auctions through deep learning. In Proceedings of the 25th ACM Conference on Economics and Computation.
Wang et al. (2019a) Wang, T.; Wang, J.; Wu, Y.; and Zhang, C. 2019a. Influence-Based Multi-Agent Exploration. In International Conference on Learning Representations.
Wang et al. (2019b) Wang, T.; Wang, J.; Zheng, C.; and Zhang, C. 2019b. Learning Nearly Decomposable Value Functions Via Communication Minimization. In International Conference on Learning Representations.
Wang et al. (2021) Wang, T.; Zeng, L.; Dong, W.; Yang, Q.; Yu, Y.; and Zhang, C. 2021. Context-aware sparse deep coordination graphs. arXiv preprint arXiv:2106.02886.
Wang et al. (2019c) Wang, Y.; Dong, K.; Chen, X.; and Wang, L. 2019c. Q-learning with UCB Exploration is Sample Efficient for Infinite-Horizon MDP. In International Conference on Learning Representations.
Wang et al. (2020c) Wang, Y.; Han, B.; Wang, T.; Dong, H.; and Zhang, C. 2020c. DOP: Off-policy multi-agent decomposed policy gradients. In International conference on learning representations.
Weber and Weiss (1990) Weber, R. R.; and Weiss, G. 1990. On an index policy for restless bandits. Journal of applied probability, 27(3): 637–648.
Wen et al. (2022) Wen, M.; Kuba, J.; Lin, R.; Zhang, W.; Wen, Y.; Wang, J.; and Yang, Y. 2022. Multi-agent reinforcement learning is a sequence modeling problem. Advances in Neural Information Processing Systems, 35: 16509–16521.
Whittle (1988) Whittle, P. 1988. Restless bandits: Activity allocation in a changing world. Journal of applied probability, 25(A): 287–298.
Wu et al. (2022) Wu, S.; Wang, T.; Wu, X.; Zhang, J.; Hu, Y.; Fan, C.; and Zhang, C. 2022. Model and Method: Training-Time Attack for Cooperative Multi-Agent Reinforcement Learning. In Deep Reinforcement Learning Workshop NeurIPS 2022.
Wu et al. (2023) Wu, Y.; McMahan, J.; Zhu, X.; and Xie, Q. 2023. Reward poisoning attacks on offline multi-agent reinforcement learning. In Proceedings of the aaai conference on artificial intelligence, volume 37, 10426–10434.
Xie et al. (2020) Xie, Q.; Chen, Y.; Wang, Z.; and Yang, Z. 2020. Learning zero-sum simultaneous-move markov games using function approximation and correlated equilibrium. In Conference on learning theory, 3674–3682. PMLR.
Xiong and Li (2023) Xiong, G.; and Li, J. 2023. Finite-Time Analysis of Whittle Index based Q-Learning for Restless Multi-Armed Bandits with Neural Network Function Approximation. In Thirty-seventh Conference on Neural Information Processing Systems.
Yaesoubi and Cohen (2011) Yaesoubi, R.; and Cohen, T. 2011. Generalized Markov models of infectious disease spread: A novel framework for developing dynamic health policies. European journal of operational research, 215(3): 679–687.
Yang et al. (2022) Yang, Q.; Dong, W.; Ren, Z.; Wang, J.; Wang, T.; and Zhang, C. 2022. Self-organized polynomial-time coordination graphs. In International Conference on Machine Learning, 24963–24979. PMLR.
Yu et al. (2022) Yu, C.; Velu, A.; Vinitsky, E.; Gao, J.; Wang, Y.; Bayen, A.; and Wu, Y. 2022. The surprising effectiveness of ppo in cooperative multi-agent games. Advances in Neural Information Processing Systems, 35: 24611–24624.
Zayas-Caban, Jasin, and Wang (2019) Zayas-Caban, G.; Jasin, S.; and Wang, G. 2019. An asymptotically optimal heuristic for general nonstationary finite-horizon restless multi-armed, multi-action bandits. Advances in Applied Probability, 51(3): 745–772.
Zhang et al. (2024) Zhang, E.; Zhao, S.; Wang, T.; Hossain, S.; Gasztowtt, H.; Zheng, S.; Parkes, D. C.; Tambe, M.; and Chen, Y. 2024. Position: Social Environment Design Should be Further Developed for AI-based Policy-Making. In Forty-first International Conference on Machine Learning.
Zhang et al. (2020) Zhang, K.; Sun, T.; Tao, Y.; Genc, S.; Mallya, S.; and Basar, T. 2020. Robust multi-agent reinforcement learning with model uncertainty. Advances in neural information processing systems, 33: 10571–10583.
Zhao, Krishnamachari, and Liu (2008) Zhao, Q.; Krishnamachari, B.; and Liu, K. 2008. On myopic sensing for multi-channel opportunistic access: structure, optimality, and performance. IEEE Transactions on Wireless Communications, 7(12): 5431–5440.
Zhao et al. (2023) Zhao, Y.; Behari, N.; Hughes, E.; Zhang, E.; Nagaraj, D.; Tuyls, K.; Taneja, A.; and Tambe, M. 2023. Towards Zero Shot Learning in Restless Multi-armed Bandits. AAMAS.
Zhao et al. (2024a) Zhao, Y.; Behari, N.; Hughes, E.; Zhang, E.; Nagaraj, D.; Tuyls, K.; Taneja, A.; and Tambe, M. 2024a. Towards a Pretrained Model for Restless Bandits via Multi-arm Generalization. IJCAI.
Zhao et al. (2024b) Zhao, Y.; Boehmer, N.; Taneja, A.; and Tambe, M. 2024b. Towards Foundation-model-based Multiagent System to Accelerate AI for Social Impact. arXiv preprint arXiv:2412.07880.
Zhao et al. (2022) Zhao, Y.; Pan, Q.; Choromanski, K.; Jain, D.; and Sindhwani, V. 2022. Implicit Two-Tower Policies. arXiv preprint arXiv:2208.01191.

Appendix A Ethics and Real-world ARMMAN Dataset

We test our method in a purely simulated environment; any potential consideration of real-world deployment would require comprehensive field testing. All work in simulation was conducted in collaboration with ARMMAN’s team of ethics, and any future consideration of deployment would require a real-world evaluation together with the domain partner, which may reveal further challenges to be addressed.

A.1 Secondary Analysis

Our experiment falls into the category of secondary analysis of the data shared by ARMMAN. This paper does not involve the deployment of the proposed algorithm or any other baselines to the service call program. As noted earlier, the experiments are secondary analysis with approval from the ARMMAN ethics board.

A.2 Consent and Data Usage

Consent is obtained from every beneficiary enrolling in the NGO’s mobile health program. The data collected through the program is owned by the NGO and only the NGO is allowed to share data. In our experiments, we use anonymized call listenership logs to calculate empirical transition probabilities. No personally identifiable information (PII) is available to us. The data exchange and usage were regulated by clearly defined exchange protocols including anonymization, read-access only to researchers, restricted use of the data for research purposes only, and approval by ARMMAN’s ethics review committee.

Appendix B Additional Experiment Details and Results

Method	Gaussian	Uniform	Mixture	Gaussian	Uniform	Mixture	Gaussian	Uniform	Mixture
	Synthetic (N=21, B=15)			SIS (N=32, B=16)			ARMMAN (N=48, B=20)
$\mathtt{No\ Noise}$	100%	100%	100%	100%	100%	100%	100%	100%	100%
$\mathtt{Comm}$	75%	97%	79%	63%	73%	56%	77%	100%	80%
$\mathtt{No\ Comm}$	62%	92%	71%	55%	66%	47%	63%	93%	75%
$\mathtt{Fixed/Comm}$	21%	15%	16%	56%	29%	27%	24%	37%	35%
$\mathtt{NN\ Comm}$	44%	89%	62%	32%	48%	20%	46%	89%	72%
$\mathtt{Random\ Comm}$	-24%	21%	1%	22%	24%	-1%	-22%	35%	28%

Table 3: Performance (return averaged across 100 random seeds) with various noise distributions, normalized with respect to

\mathtt{No\ Noise}

and

\mathtt{Comm}

(ours) before communication starts. Gaussian noise is

\mathcal{N}(0,1)

, Uniform noise is

U(-0.5,0.5)

, and Mixture noise has 50% chance of being the Gaussian and 50% chance of being the Uniform. We abbreviate

\mathtt{Nearest\ Neighbor\ Comm}

\mathtt{NN\ Comm}

Our experimental settings include a synthetic setting, an epidemic modeling scenario, and a maternal healthcare intervention task, each designed to test the efficacy of our model under different conditions. In all experimental settings considered, to generate features, we randomly generate projection matrices and then project parameters that describe the ground truth transition dynamics (unknown to the our learning algorithms) into features. The features have the same dimension as the number of parameters that describes the transition dynamics. In Table 3, we present results on noise drawn from different distributions, including Gaussian distribution, Uniform distribution, and Mixture distribution.

In Table 4, we present hyperparameters used. We use categorical DQN to learn per-arm Q-values. The Q-network has tanh activation for all hidden layers and identity activation for output layers.

hyperparameter	value
Number of atoms in Categorical DQN	51
Q-network learning rate	5e-04
Number of steps per epoch	20
DQN initial exploration epsilon	0.3
DQN exploration epsilon decay rate	0.999
Number of hidden layers in Q-network	2
Number of neurons per hidden layer	16
Target-Q updating frequency in epochs	10
DQN buffer size in samples per arm	12000
VDN-network learning rate	5e-3

Table 4: Hyperparameter values.

Our method is robust to both biased and unbiased reward errors. In Synthetic, with biased noise $\mathcal{N}(z,1)$ across varying $z$ , our method consistently outperforms baselines, as shown in Table 5.

Method	$\mathcal{N}(0,1)$	$\mathcal{N}(0.1,1)$	$\mathcal{N}(0.2,1)$
	Synthetic (N=21, B=15)
$\mathtt{Comm}$	75%	64%	56%
$\mathtt{No\ Comm}$	62%	50%	43%
$\mathtt{Fixed/Comm}$	21%	29%	29%
$\mathtt{NN\ Comm}$	44%	33%	24%

Table 5: Performance (return averaged across 100 random seeds) with biased noise

\mathcal{N}(z,1)

across varying

z

, normalized with respect to

\mathtt{No\ Noise}

and

\mathtt{Comm}

(ours) before communication starts. We abbreviate

\mathtt{Nearest\ Neighbor\ Comm}

\mathtt{NN\ Comm}

Remark 7.

Notice that in Tables 1, 2, 3, and 5, we normalize the returns of all algorithms under consideration by

\frac{\text{return}-\text{return($\mathtt{Comm}$ at epoch 200)}}{\text{return($\mathtt{No\ Noise}$ at epoch 200})-\text{return($\mathtt{Comm}$ at epoch 200})}.

Communication learning starts at epoch 200.

Appendix C Proofs

C.1 Proposition 4

See 4

Proof.

Suppose we consider dense communication, where the arm $i$ communicates with $d$ similar arms $\nu_{i,1},\dots,\nu_{i,d}$ . In this case, the arm collects data from each of the policies $\mathcal{D}_{\pi_{\nu_{i,1}}},\dots,\mathcal{D}_{\pi_{\nu_{i,d}}}$ . We call this mixture distribution $\bar{\mathcal{D}}_{i}$ . Now, consider the TD-loss for this dense communication:

	$\displaystyle\mathcal{L}^{\texttt{c,dense}}_{\texttt{TD}}(\theta_{i}\|a_{i}^{\texttt{c}}\textup{{=}}1)=\mathbb{E}_{(s_{i},a_{i},s_{i}^{\prime})\sim\bar{\mathcal{D}}_{i}}\Big{[}\Big{(}R(s_{i})-\lambda c_{a_{i}}$
	$\displaystyle+\beta\max_{a}Q_{i}(s_{i}^{\prime},a;\theta_{i})-Q_{i}(s_{i},a_{i};\tilde{\theta}_{i})\Big{)}^{2}\Big{]}.$		(13)

Now, suppose that we pick a random variable $J$ uniformly at random from $\{1,\dots,d\}$ and collect data only from the arm $\nu_{i,J}$ . Then, we define:

	$\displaystyle\mathcal{L}^{\texttt{c,sparse}}_{\texttt{TD}}(\theta_{i}\|a_{i}^{\texttt{c}}\textup{{=}}1)=\mathbb{E}_{(s_{i},a_{i},s_{i}^{\prime})\sim\mathcal{D}_{\nu_{i,J}}}\Big{[}\Big{(}R(s_{i})-\lambda c_{a_{i}}$
	$\displaystyle+\beta\max_{a}Q_{i}(s_{i}^{\prime},a;\theta_{i})-Q_{i}(s_{i},a_{i};\tilde{\theta}_{i})\Big{)}^{2}\Big{]}.$		(14)

Then, we have:

	$\displaystyle\mathcal{L}^{\texttt{c,dense}}_{\texttt{TD}}(\theta_{i}\|a_{i}^{\texttt{c}}\textup{{=}}1)$	$\displaystyle=\mathbb{E}[\mathcal{L}^{\texttt{c,sparse}}_{\texttt{TD}}(\theta_{i}\|a_{i}^{\texttt{c}}\textup{{=}}1)]$
	$\displaystyle\nabla\mathcal{L}^{\texttt{c,dense}}_{\texttt{TD}}(\theta_{i}\|a_{i}^{\texttt{c}}\textup{{=}}1)$	$\displaystyle=\mathbb{E}[\nabla\mathcal{L}^{\texttt{c,sparse}}_{\texttt{TD}}(\theta_{i}\|a_{i}^{\texttt{c}}\textup{{=}}1)]$		(15)

Now consider the gradient descent (GD) in Equation (6) with $\mathcal{L}^{\texttt{c,dense}}_{\texttt{TD}}(\theta_{i}|a_{i}^{\texttt{c}}\textup{{=}}1)$ . When this loss is replaced with $\mathbb{E}[\mathcal{L}^{\texttt{c,sparse}}_{\texttt{TD}}(\theta_{i}|a_{i}^{\texttt{c}}\textup{{=}}1)]$ , it becomes stochastic gradient descent (SGD) for $\mathcal{L}^{\texttt{c,dense}}_{\texttt{TD}}(\theta_{i}|a_{i}^{\texttt{c}}\textup{{=}}1)$ . SGD is widely deployed to optimize loss functions in deep learning since GD is much more expensive (Bottou 2010). It can be shown that for a large class of loss functions, SGD converges to the same optimal solution as GD (Robbins and Monro 1951; Polyak and Juditsky 1992). Notice that even in our case, we reduce the computational complexity per-step by a factor of $O(d)$ .

We now analyze the total sample complexity. (Chen and Jiang 2019) proves that, if we have $n$ training samples, with probability at least $1-\delta$ , the value function after $k$ iterations satisfies $v^{*}-v^{k}\leq\epsilon_{e}V_{max}$ , and

\displaystyle n=O\left(\frac{C\log\frac{\left|\mathcal{F}\right|}{\delta}}{\epsilon_{e}^{2}(1-\beta)^{4}}\right).

(16)

Here, $\mathcal{F}$ is the class of candidate value function. When comparing the sample complexity under $\mu_{\nu_{i}}$ and $\mu_{\bar{\nu}_{i}}$ , we have the same function class, so the sample complxity depends on the concentratability coefficient. We have

\displaystyle\frac{v(s,a)}{\mu_{\nu_{i}}(s,a)}\geq\frac{v(s,a)}{d\mu_{\bar{\nu}_{i}}(s,a)},\ \ \forall(s,a)

(17)

for all admissible $v$ .

Suppose that

\displaystyle C_{\bar{\nu}_{i}}=\frac{v_{b}(s_{b},a_{b})}{\mu_{\bar{\nu}_{i}}(s_{b},a_{b})},

(18)

we have

$\displaystyle\frac{1}{d}C_{\bar{\nu}_{i}}$	$\displaystyle=\frac{1}{d}\frac{v_{b}(s_{b},a_{b})}{\mu_{\bar{\nu}_{i}}(s_{b},a_{b})}$	(19)
	$\displaystyle\leq\frac{v_{b}(s_{b},a_{b})}{\mu_{\nu_{i}}(s_{b},a_{b})}$	(20)
	$\displaystyle\leq C_{\nu_{i}}.$	(21)

Therefore, the sample complexity of the sparse communication scheme is at most $d$ times greater than that of the dense communication scheme, and these two schemes can both learn $\epsilon_{e}$ -optimal value functions. ∎

C.2 Proposition 2

See 2

Let’s first consider the case where a noise-free arm $i$ learns without communication, i.e., the arm conducts independent Q-learning. The best known sample-complexity result about model-free Q-learning is provided by (Wang et al. 2019c). Using a UCB exploration strategy, it can be proved that for any $1/2<\beta<1$ , $\epsilon>0$ , $\delta>0$ , with a probability $1-\delta$ , the number of timestep $t$ such that the policy is not $\epsilon$ -optimal at state $s_{t}$ is at most

\displaystyle t_{\texttt{ind}}=\tilde{\mathcal{O}}\left(\frac{|\mathcal{S}||\mathcal{A}|\text{ln}1/\delta}{\epsilon^{2}(1-\beta)^{7}}\right).

(22)

Here, $\tilde{\mathcal{O}}$ suppresses logarithmic factors of $1/\epsilon$ , $1/(1-\beta)$ , and $|\mathcal{S}||\mathcal{A}|$ .

Eq. 22 gives the number of samples needed to get a good estimation of $Q_{i}$ in the worst case scenario. We now compare it to the case with communication.

Without communication, arm $i$ uses a behavior policy $\pi_{\nu_{i}}$ to collect samples for the Q update. This falls in the range of asynchronous Q-learning. We know that (Li et al. 2020) has proven that, for any $0<\delta<1$ and $0<\epsilon<1/(1-\delta)$ , with a probability of at least $1-\delta$ , the number of timesteps where the Q value is not $\epsilon$ -optimal is at most

\displaystyle t_{\texttt{comm}}=\tilde{\mathcal{O}}\left(\text{ln}\frac{1}{\delta}\frac{1}{\mu_{\texttt{min}}}\left(\frac{1}{\epsilon^{2}(1-\beta)^{5}}+\frac{t_{\texttt{mix}}}{1-\beta}\right)\right).

(23)

Again, $\tilde{\mathcal{O}}$ suppresses logarithmic factors of $1/\epsilon$ , $1/(1-\beta)$ , and $|\mathcal{S}||\mathcal{A}|$ .

C.3 Noise-free to noisy communication channels

See 5

Proof.

For any number of $n>0$ and $\beta>0$ , consider the following MDP (see also Fig. 3). The state set is $\mathcal{S}=\{s_{0},s_{1},\cdots,s_{n}\}$ , and the action set is $\{a_{0},a_{1}\}$ . Taking $a_{0}$ and $a_{1}$ at $s_{k}$ , where $k$ is an even number, deterministically lead to state $s_{k+2}$ and $s_{k+1}$ , respectively. We assume that $s_{n+2}=s_{n}$ . For state $s_{k}$ where $k$ is an odd number, both actions leads back to the state itself. In the noise-free arm, the reward is 0 for all states except for state $s_{n}$ , where $R(s_{n})=C$ . In the noisy arm, all states $s_{k}$ with an odd $k$ are affected by noise, which is a uniform random number in the range (0,1).

We consider Q-learning with the $\epsilon$ -greedy exploration strategy. All Q values are initialized to 0. In cases where the Q values of two actions at a state are the same, we break the tie randomly. We observe that, for a sufficiently large reward at $s_{n}$ where $R(s_{n})=C>\left(\frac{1}{\beta}\right)^{n/2-2}$ , the Q-function before visiting $s_{n}$ cannot be optimal. This is because, in the noisy arm, the best return of visiting an odd-number state is $\beta/(1-\beta)$ , while the return of visiting $s_{n}$ is $C\beta^{n/2-1}/(1-\beta)$ .

We explore the environment and update Q functions in $K=O(|\mathcal{S}|)$ epochs. Use $A_{k}$ to denote the event that $s_{n}$ is visited in the $k^{\text{th}}$ trajectory, and use $\bar{A}_{1:k-1}$ to denote the event that $s_{n}$ has not been visited in the first $k-1$ trajectories. We calculate the probability that $s_{n}$ is not explored in $K$ epochs, denoted by $Pr(\bar{A}_{1:K})$ .

In the noisy arm, we have that

\displaystyle Pr(\bar{A}_{1:K})=Pr(\bar{A}_{1:1})Pr(\bar{A}_{1:2}|\bar{A}_{1:1})\cdots Pr(\bar{A}_{1:K}|\bar{A}_{1:K-1}).

(24)

For the first term,

\displaystyle Pr(\bar{A}_{1:1})=1-P(A_{1})=1-\left(\frac{1}{2}\right)^{\frac{n}{2}},

because the only possible case where $A_{1}$ happens is if $a_{0}$ is selected at each state.

For the following terms, when $s_{n}$ is not visited in preceding epochs, there is at least one other state $s_{x}$ , where $x$ is an odd number, that would have been visited. In this case, we have $Q(s_{x-1},a_{1})>Q(s_{x-1},a_{0})$ because $R(s_{x})$ is strictly positive. Therefore, the probability that state $s_{x+1}$ could be visited from $x-1$ decreases to $\epsilon/2$ . It follows that

\displaystyle P(\bar{A}_{1:k}|\bar{A}_{1:k-1})=1-\left(\frac{1}{2}\right)^{\frac{n}{2}-n_{k}}\frac{\epsilon}{2}^{n_{k}}=1-\epsilon^{n_{k}}\left(\frac{1}{2}\right)^{\frac{n}{2}}.

Here, $n_{k}$ is the number of odd-number states that have been visited in the first $k-1$ epochs. Since $1\geq n_{k}\leq\frac{n}{2}$ , we have that

\displaystyle 1-\epsilon\left(\frac{1}{2}\right)^{\frac{n}{2}}<P(\bar{A}_{1:k}|\bar{A}_{1:k-1})<1-\left(\frac{\epsilon}{2}\right)^{\frac{n}{2}}

(25)

Combining Eq. 24 and Eq. 25, we get

		$\displaystyle Pr(\bar{A}_{1:K})\in\Bigg{(}\left(1-\left(\frac{1}{2}\right)^{\frac{n}{2}}\right)\left(1-\epsilon\left(\frac{1}{2}\right)^{\frac{n}{2}}\right)^{K-1},$
		$\displaystyle\left(1-\left(\frac{1}{2}\right)^{\frac{n}{2}}\right)\left(1-\left(\frac{\epsilon}{2}\right)^{\frac{n}{2}}\right)^{K-1}\Bigg{)}$		(26)

for a noisy arm. Let’s look at the upper bound. If we want the probability of finding $s_{n}$ to be sufficiently large, e.g., larger than $1-a$ where $a$ is a constant, then we need $O\left(\left(\frac{1}{\epsilon}\right)^{\frac{n}{2}}\right)$ samples. Here we use the approximation $(1-x)^{r}\approx 1-rx$ for small positive number $x$ and large $r$ . This suggests that using an exponentially large number of samples may still result in an exponentially large error in Q estimation, as the estimation of $Q(s_{n},a_{0})$ is 0 if $s_{n}$ is not visited, while the true value is $\frac{C}{1-\beta}\geq\left(\frac{1}{\beta}\right)^{n/2-2}\frac{1}{1-\beta}=O\left(\left(\frac{1}{\beta}\right)^{|\mathcal{S}|}\right)$ .

In the non-noisy arm, even if an odd number state is visited, the corresponding Q values along the trajectory do not change. Therefore, given that $s_{n}$ is not explored in the first $k-1$ epochs, the probability of $s_{n}$ being explored in epoch $k$ is $\frac{1}{2}^{\frac{n}{2}}$ . It follows that, for the noise-free arm, the probability of $s_{n}$ not being visited in $K$ trajectories is:

\displaystyle Pr(\bar{A}_{1:K})=\left(1-\left(\frac{1}{2}\right)^{\frac{n}{2}}\right)^{K},

(27)

which is smaller than the probability in the noisy arm (Eq. C.3). Therefore, communication of Q value from the non-noisy arm (after visiting $s_{n}$ ) can help exploration in the noisy arm, thus accelerating learning in this RMAB.

∎

See 6

Proof.

We consider a RMAB setting similar to Lemma 5. The only difference is that for all the odd-number states, the reward in the noise-free arm is 1, and the reward at $s_{n}$ is 0.

In the noisy arm, Eq. C.3 still holds, and the probability of $s_{n}$ being visited in $K$ trajectories is larger than

\displaystyle\left(\frac{1}{2}\right)^{\frac{n}{2}}+K\left(\frac{\epsilon}{2}\right)^{\frac{n}{2}}

(28)

Here we use the approximation $(1-x)^{r}\approx 1-rx$ for small positive number $x$ and large $r$ .

If we use the Q function of the non-noisy arm to establish the behavior policy, for every $s_{k}$ , where $k$ is an even number smaller than $n$ , we have $Q(s_{k},a_{1})>Q(s_{k},a_{0})$ . Therefore, the probability that $s_{n}$ is visited in $K$ epochs is

\displaystyle O\left(K\left(\frac{\epsilon}{2}\right)^{\frac{n}{2}}\right).

(29)

Comparing these two probabilities gives us the result. ∎

Appendix D Additional Related Works

To the best of knowledge, our work provides the first method to transfer knowledge among agents in RMAB. This setting is different from multi-agent transfer learning in Dec-POMDPs (Qin et al. 2022; Taylor et al. 2019). From the perspective of learning stability, our approaches provides additional robustness for learning in noisy environments, while robust multi-agent learning recently draws great research interests in Dec-POMDPs (Wu et al. 2023, 2022; Liu and Lai 2024; Zhang et al. 2020; Li et al. 2019; Guo et al. 2022; Bukharin et al. 2024; Shi et al. 2024). This paper also sits within the growing body of work integrating machine learning with game and economic theory and practice, e.g., (Hartford, Wright, and Leyton-Brown 2016; Kuo et al. 2020; Ravindranath, Jiang, and Parkes 2023; Wang et al. 2024; Ravindranath et al. 2021; Tacchetti et al. 2019; Peri et al. 2021; Rahme et al. 2021; Hossain et al. 2024; Wang, Jiang, and Parkes 2024; Dütting et al. 2024; Zhang et al. 2024; Dong et al. 2021; Song, Wang, and Zhang 2019).

RMAB have been widely used to model multiple interacting agents (Jaques et al. 2019; Yu et al. 2022; Shapley 1953; Littman 1994; Jin et al. 2021; Xie et al. 2020; Behari et al. 2024; Seow et al. 2024; Boehmer et al. 2024) For binary action RMABs, Whittle index based Q-learning approaches have been investigated (Nakhleh et al. 2021; Xiong and Li 2023; Fu et al. 2019; Avrachenkov and Borkar 2022). Recent works also studied deep RL approaches for multi-action RMABs (Killian et al. 2021; Zhao et al. 2024a; Killian et al. 2022). However, existing works on multi-action RMABs largely overlook the challenge of data errors, which is common in real-world scenarios. Our novel communication learning approach enables arms to mitigate the influence of systematic data errors.

	$\displaystyle\mathcal{L}^{\texttt{c,dense}}_{\texttt{TD}}(\theta_{i}\|a_{i}^{\texttt{c}}\textup{{=}}1)$	$\displaystyle=\mathbb{E}[\mathcal{L}^{\texttt{c,sparse}}_{\texttt{TD}}(\theta_{i}\|a_{i}^{\texttt{c}}\textup{{=}}1)]$
	$\displaystyle\nabla\mathcal{L}^{\texttt{c,dense}}_{\texttt{TD}}(\theta_{i}\|a_{i}^{\texttt{c}}\textup{{=}}1)$	$\displaystyle=\mathbb{E}[\nabla\mathcal{L}^{\texttt{c,sparse}}_{\texttt{TD}}(\theta_{i}\|a_{i}^{\texttt{c}}\textup{{=}}1)]$		(15)