On the Role of Emergent Communication for Social Learning
in Multi-Agent Reinforcement Learning

Seth Karten Siva Kailas Huao Li Katia Sycara

Abstract

Explicit communication among humans is key to coordinating and learning. Social learning, which uses cues from experts, can greatly benefit from the usage of explicit communication to align heterogeneous policies, reduce sample complexity, and solve partially observable tasks. Emergent communication, a type of explicit communication, studies the creation of an artificial language to encode a high task-utility message directly from data. However, in most cases, emergent communication sends insufficiently compressed messages with little or null information, which also may not be understandable to a third-party listener. This paper proposes an unsupervised method based on the information bottleneck to capture both referential complexity and task-specific utility to adequately explore sparse social communication scenarios in multi-agent reinforcement learning (MARL). We show that our model is able to i) develop a natural-language-inspired lexicon of messages that is independently composed of a set of emergent concepts, which span the observations and intents with minimal bits, ii) develop communication to align the action policies of heterogeneous agents with dissimilar feature models, and iii) learn a communication policy from watching an expert’s action policy, which we term ‘social shadowing’.

Multi-Agent Reinforcement Learning, Emergent Communication, Social Learning, Concept Whitening, Information Theory, Sparse Communication

1 INTRODUCTION

Social learning (jaques2019social; ndousse2021emergent) agents analyze cues from direct observation of other agents (novice or expert) in the same environment to learn an action policy from others. However, observing expert actions may not be sufficient to coordinate with other agents. Rather, by learning to communicate, agents can better model the intent of other agents, leading to better coordination. In humans, explicit communication for coordination assumes a common communication substrate to convey abstract concepts and beliefs directly (mirsky2020penny), which may not be available for new partners. To align complex beliefs, heterogeneous agents must learn a message policy that translates from one theory of mind (li2022theory) to another to synchronize coordination. Especially when there is complex information to process and share, new agent partners need to learn to communicate to work with other agents.

Emergent communication studies the creation of artificial language. Often phrased as a Lewis game, speakers and listeners learn a set of tokens to communicate complex observations (lewis1969convention). However, in multi-agent reinforcement learning (MARL), agents suffer from partial observability and non-stationarity (due to unaligned value functions) (papoudakis2019dealing), which aims to be solved with decentralized learning through communication. In the MARL setup, agents, as speakers and listeners, learn a set of tokens to communicate observations, intentions, coordination, or other experiences which help facilitate solving tasks (karten2022sparse; karten2022inter). Agents learn to communicate effectively through a backpropagation signal from their task performance (foerster2016learning; lowe2017multi; lazaridou2016multi; commnet; ic3net). This has been found useful for applications in human-agent teaming (karten2022inter; marathe2018bidirectional; lake2019human; lazaridou2020emergent), multi-robot navigation (benSparseDiscrete), and coordination in complex games such as StarCraft II (samvelyan2019starcraft). Communication quality has been shown to have a strong relationship with task performance (marlow2018does), leading to a multitude of work attempting to increase the representational capacity by decreasing the convergence rates (EcclesBiases; MA_autoencoder; karten2022sparse; wang2020learning; tucker2022towards). Yet these methods still create degenerate communication protocols (karten2022inter; karten2022sparse; benSparseDiscrete), which are uninterpretable due to joined concepts or null (lack of) information, which causes performance degradation.

In this work, we investigate the challenges of learning a messaging lexicon to prepare emergent communication for social learning (EC4SL) scenarios. We study the following hypotheses: H1) EC4SL will learn faster through structured concepts in messages leading to higher-quality solutions, H2) EC4SL aligns the policies of expert heterogeneous agents, and H3) EC4SL enables social shadowing, where an agent learns a communication policy while only observing an expert agent’s action policy. By learning a communication policy, the agent is encouraged to develop a more structured understanding of intent, leading to better coordination. The setting is very realistic among humans and many computer vision and RL frameworks may develop rich feature spaces for a specific solo task, but have not yet interacted with other agents, which may lead to failure without alignment.

We enable a compositional emergent communication paradigm, which exhibits clustering and informativeness properties. We show theoretically and through empirical results that compositional language enables independence properties among tokens with respect to referential information. Additionally, when combined with contrastive learning, our method outperforms competing methods that only ground communication on referential information. We show that contrastive learning is an optimal critic for communication, reducing sample complexity for the unsupervised emergent communication objective. In addition to the more human-like format, compositional communication is able to create variable-length messages, meaning that we are not limited to sending insufficiently compressed messages with little information, increasing the quality of each communication.

In order to test our hypotheses, we show the utility of our method in multi-agent settings with a focus on teams of agents, high-dimensional pixel data, and expansions to heterogeneous teams of agents of varying skill levels. Social learning requires agents to explore to observe and learn from expert cues. We interpolate between this form of social learning and imitation learning, which learns action policies directly from examples. We introduce a ’social shadowing’ learning approach where we use first-person observations, rather than third-person observations, to encourage the novice to learn latently or conceptually how to communicate and develop an understanding of intent for better coordination. The social shadowing episodes are alternated with traditional MARL during training. Contrastive learning, which works best with positive examples, is apt for social shadowing. Originally derived to enable lower complexity emergent lexicons, we find that the contrastive learning objective is apt for agents to develop internal models and relationships of the task through social shadowing.

The idea is to enable a shared emergent communication substrate (with minimal bandwidth) to enable future coordination with novel partners. Our contributions are deriving an optimal critic for a communication policy and showing that the information bottleneck helps extend communication to social learning scenarios. In real-world tasks such as autonomous driving or robotics, humans do not necessarily learn from scratch. Rather they explore with conceptually guided information from expert mentors. In particular, having structured emergent messages reduces sample complexity, and contrastive learning can help novice agents learn from experts. Emergent communication can also align heterogeneous agents, a social task that has not been previously studied.

2 RELATED WORK

2.1 Multi-Agent Signaling

Implicit communication conveys information to other agents that is not intentionally communicated (grupen2022multi). Implicit signaling conveys information to other agents based on one’s observable physical position (grupen2022multi). Implicit signaling may be a form of implicit communication such as through social cues (jaques2019social; ndousse2021emergent) or explicit communication such as encoded into the MDP through “cheap talk” (sokota2022communicating). Unlike implicit signaling, explicit signaling is a form of positive signaling (li2021learning) that seeks to directly influence the behavior of other agents in the hopes that the new information will lead to active listening. Multi-agent emergent communication is a type of explicit signaling which deliberately shares information. Symbolic communication, a subset of explicit communication, seeks to send a subset of pre-defined messages. However, these symbols must be defined by an expert and do not scale to particularly complex observations and a large number of agents. Emergent communication aims to directly influence other agents with a learned subset of information, which allows for scalability and interpretability by new agents.

2.2 Emergent Communication

Several methodologies currently exist to increase the informativeness of emergent communication. With discrete and clustered continuous communication, the number of observed distinct communication tokens is far below the number permissible (discreteComm). As an attempt to increase the emergent “vocabulary” and decrease the data required to converge to an informative communication “language”, work has added a bias loss to emit distinct tokens in different situations (EcclesBiases). More recent work has found that the sample efficiency can be further improved by grounding communication in observation space with a supervised reconstruction loss (MA_autoencoder). Information-maximizing autoencoders aim to maximize the state reconstruction accuracy for each agent. However, grounding communication in observations has been found to easily satisfy these input-based objectives while still requiring a myriad more samples to explore to find a task-specific communication space (karten2022sparse). Thus, it is necessary to use task-specific information to communicate informatively. This will enable learned compression for task completion rather than pure compression for input recovery. Other work aims to use the information bottleneck (tishby2015deep) to decrease the entropy of messages (wang2020learning). In our work, we use contrastive learning to increase representation similarity with future goals, which we show optimally optimizes the Q-function for messages.

2.3 Natural Language Inspiration

The properties of the tokens in emergent communication directly affect their informative ability. As a baseline, continuous communication tokens can represent maximum information but lack human-interpretable properties. Discrete 1-hot (binary vector) tokens allow for a finite vocabulary, but each token contains the same magnitude of information, with equal orthogonal distance to each other token. Similar to word embeddings in natural language, discrete prototypes are an effort to cluster similar information together from continuous vectors (discreteComm). Building on the continuous word embedding properties, VQ-VIB (tucker2022towards), an information-theoretic observation grounding based on VQ-VAE properties (van2017neural), uses variational properties to provide word embedding properties for continuous emergent tokens. Like discrete prototypes, they exhibit a clustering property based on similar information but are more informative. However, each of these message types determines a single token for communication. Tokens are stringed together to create emergent “sentences”.

3 Preliminaries

We formulate our setup as a decentralized, partially observable Markov Decision Process with communication (Dec-POMDP-Comm). Formally, our problem is defined by the tuple, $\langle\mathcal{S},\mathcal{A},\mathcal{M},\mathcal{T},\mathcal{R},\mathcal{O},\Omega,\gamma\rangle$ . We define $\mathcal{S}$ as the set of states, $\mathcal{A}^{i}\,,\,i\in[1,N]$ as the set of actions, which includes task-specific actions, and $\mathcal{M}^{i}$ as the set of communications for $N$ agents. $\mathcal{T}$ is the transition between states due to the multi-agent joint action space $\mathcal{T}:\mathcal{S}\times\mathcal{A}^{1},...,\mathcal{A}^{N}\to\mathcal{S}$ . $\Omega$ defines the set of observations in our partially observable setting. Partial observability requires communication to complete the tasks successfully. $\mathcal{O}^{i}:\mathcal{M}^{1},...,\mathcal{M}^{N}\times\hat{\mathcal{S}}\to\Omega$ maps the communications and local state, $\hat{\mathcal{S}}$ , to a distribution of observations for each agent. $\mathcal{R}$ defines the reward function and $\gamma$ defines the discount factor.

3.1 Architecture

The policy network is defined by three stages: Observation Encoding, Communication, and Action Decoding. The best observation encoding and action decoding architecture is task-dependent, i.e., using multi-layer perceptrons (MLPs), CNNs (lecun1995convolutional), GRUs (chung2014empirical), or transformer (vaswani2017attention) layers are best suited to different inputs. The encoder transforms observation and any sequence or memory information into an encoding $H$ . The on-policy reinforcement learning training uses REINFORCE (williams1992simple) or a decentralized version of MAPPO (yu2021surprising) as specified by our experiments.

Our work focuses on the communication stage, which can be divided into three substages: message encoding, message passing (often considered sparse communication), and message decoding. We use the message passing from (karten2022sparse). For message decoding, we build on a multi-headed attention framework, which allows an agent to learn which messages are most important (graphMA). Our compositional communication framework defines the message encoding, as described in section 4.

3.2 Objective

Mutual information, denoted as $I(X;Y)$ , looks to measure the relationship between random variables,

\displaystyle I(X;Y)=\mathds{E}_{p(x,y)}\left[\log\frac{p(x|y)}{p(x)}\right]=\mathds{E}_{p(x,y)}\left[\log\frac{p(y|x)}{p(y)}\right]

which is often measured through Kullback-Leibler divergence (kullback1997information), $I(X;Y)=D_{KL}(p(x,y)||p(x)\otimes p(y))$ . The message encoding substage can be defined as an information bottleneck problem, which defines a trade-off between the complexity of information (compression, $I(X,\hat{X})$ ) and the preserved relevant information (utility, $I(\hat{X},Y)$ ). The deep variational information bottleneck defines a trade-off between preserving useful information and compression (alemi2017deep; tishby2015deep). We assume that our observation and memory/sequence encoder provides an optimal representation $H^{i}$ suitable for sharing relevant observation and intent/coordination information. We hope to recover a representation $Y^{i}$ , which contains the sufficient desired outputs.

In our scenario, the information bottleneck is a trade-off between the complexity of information $I(H^{i};M^{i})$ (representing the encoded information exactly) and representing the relevant information $I(M^{j\neq i};Y^{i})$ , which is signaled from our contrastive objective. In our setup, the relevant information flows from other agents through communication, signaling a combination of the information bottleneck and a Lewis game. We additionally promote complexity through our compositional independence objective, $I(M_{1}^{i};\ldots;M_{L}^{i}|H^{i})$ . This is formulated by the following Lagrangian,

	$\displaystyle\mathcal{L}(\ p(m^{i}\|h^{i})\ )=\$	$\displaystyle\beta_{u}\hat{I}(M^{j\neq i};Y^{i})\ -\beta_{c}\hat{I}(H^{i};M^{i})$
		$\displaystyle-\beta_{I}\hat{I}(M_{1}^{i};\ldots;M_{L}^{i}\|H^{i})$

where the bounds on mutual information $\hat{I}$ are defined in equations 1, 2, and 10. Overall, our objective is,

	$\displaystyle J(\theta)=\max\limits_{\pi}\mathds{E}\left[\sum_{t\in T}\sum_{i\in N}\gamma_{t}\mathcal{R}(s_{t},a_{t})+\mathcal{L}(\ p(m_{t}\|h_{t})\ )\right]$
	$\displaystyle\text{s.t.}(a_{t},m_{t},h_{t})\sim\pi^{i},s_{t}\sim\mathcal{T}(s_{t-1})$

4 Complexity through Compositional Communication

We aim to satisfy the complexity objective, $I(H^{i},M^{i})$ , through compositional communication. In order to induce complexity in our communication, we want the messages to be as non-random as possible. That is, informative with respect to the input hidden state $h$ . In addition, we want each token within the message to share as little information as possible with the preceding tokens. Thus, each additional token adds only informative content. Each token has a fixed length in bits $W$ . The total sequence is limited by a fixed limit, $\sum_{l}^{L}W_{l}\leq S$ , of $S$ bits and a total of $L$ tokens.

We use a variational message generation setup, which maps the encoded hidden state $h$ to a message $m$ ; that is, we are modeling the posterior, $\pi_{m}^{i}(m_{l}|h)$ . We limit the vocabulary size to $K$ tokens, $e_{j}\in\mathds{R}^{D},j\in[1,K]\subset\mathds{N}$ , where each token has dimensionality $D$ and $l\in[1,L]\subset\mathds{N}$ . Each token $m_{l}$ is sampled from a categorical posterior distribution,

\pi_{m}^{i}(m_{l}=e_{k}|h)=\begin{cases}1&\text{for }k=\operatorname*{arg\,min}\limits_{j}||m_{l}-e_{j}||_{2}\\ 0&\text{otherwise}\end{cases}

such that the message $m_{l}$ is mapped to the nearest neighbor $e_{j}$ . A set of these tokens makes a message $m$ . To satisfy the complexity objective, we want to use $m^{i}$ to well-represent $h^{i}$ and consist of independently informative $m^{i}_{l}$ .

4.1 Independent Information

We derive an upper bound for the interaction information between all tokens.

Proposition 4.1.

For the interaction information between all tokens, the following upper bound holds: $I(m_{1};\ldots;m_{L}|h)\leq\mathds{E}_{h\sim p(h)}\left[D_{KL}\left(q(\hat{m}|h)||\pi^{i}_{m}(m_{1}|h)\otimes\cdots\otimes\pi^{i}_{m}(m_{L}|h)\right)\right]$ .

The proof is in Appendix A.1.

Refer to caption — Figure 1: By using contrastive learning, our method seeks similar representations between the state-message pair and future states while creating dissimilar representations with random states. Thus satisfying the utility objective of the information bottleneck. The depicted agents are blind and cannot see other cars.

Since we want the mutual information to be minimized in our objective, we minimize,

		$\displaystyle\hat{I}(m_{1};\ldots;m_{L}\|h)=$		(1)
		$\displaystyle\mathds{E}_{h\sim p(h)}\left[D_{KL}\left(q(\hat{m}\|h)\|\|\pi^{i}_{m}(m_{1}\|h)\otimes\cdots\otimes\pi^{i}_{m}(m_{L}\|h)\right)\right]$		(1)

4.2 Input-Oriented Information

In order to induce complexity in the compositional messages, we additionally want to minimize the mutual information $I(H;M)$ between the composed message $\hat{m}$ and the encoded information $h$ . We derive an upper bound on the mutual information that we use as a Lagrangian term to minimize.

Proposition 4.2.

For the mutual information between the composed message and encoded information, the following upper bound holds: $I(H;M)\leq\sum_{l}^{L}\mathds{E}_{h\sim p(h)}\left[D_{KL}\left(q(m_{l}|h)||z(m_{l}))\right)\right]$ .

The proof is in Appendix A.1. Thus, we have our Lagrangian term,

\displaystyle\hat{I}(H^{i},M^{i})=\sum_{l}^{L}\mathds{E}_{h\sim p(h)}\left[D_{KL}\left(q(m_{l}|h)||z(m_{l}))\right)\right]

(2)

Conditioning on the input or observation data is a decentralized training objective.

4.3 Sequence Length

Compositional communication necessitates an adaptive limit on the total length of the sequence.

Corollary 4.3.

Repeat tokens, $w$ , are redundant and can be removed.

Suppose one predicts two arbitrary tokens, $w_{k}$ and $w_{l}$ . Given equation 1, it follows that there is low or near-zero mutual information between $w_{k}$ and $w_{l}$ .

A trivial issue is that the message generator will predict every available token as to follow the unique token objective. Since the tokens are imbued with input-oriented information (equation 2), the predicted tokens will be based on relevant referential details. Thus, it follows that tokens containing irrelevant information will not be chosen.

A nice optimization objective that follows from corollary 4.3 is that one can use self-supervised learning with an end-of-sequence (EOS) token to limit the variable total length of compositional message sequences.

H(m_{\texttt{EOS}},m_{l})=-\pi(m_{\texttt{EOS}})\log(\pi(m_{l}))

(3)

4.4 Message Generation Architecture

Now, we can define the pipeline for message generation. The idea is to create an architecture that can generate features to enable independent message tokens. We expand each compressed token into the space of the hidden state $h$ (1-layer linear expansion) since each token has a natural embedding in $\mathbf{R}^{|h|}$ . Then, we perform attention using a softmin to help minimize similarity with previous tokens and sample the new token from a variational distribution. See algorithm 1 for complete details. During execution, we can generate messages directly due to equation 1, resolving any computation time lost from sequential compositional message generation.

Algorithm 1 Compositional Message Gen.

(h_{t})

T\leftarrow\texttt{num\_tokens}

m=\textbf{0}

{

T\times d_{m}

d_{m}\leftarrow\texttt{token\_size}

}

Q\leftarrow\texttt{Q\_MLP}(h_{t})

V\leftarrow\texttt{V\_MLP}(h_{t})

5: for

i\leftarrow 1\text{ to }T

K\leftarrow\texttt{K\_MLP}(m)

\hat{h}=\texttt{softmin}(\frac{Q^{\intercal}\texttt{mean}(K,1)}{\sqrt{d_{k}}})^{\intercal}V

m_{i}\sim\mathcal{N}(\hat{h};\mu,\sigma)

9: end for

10: return

m

5 Utility through Contrastive Learning

First, note that our Markov Network is as follows: $H^{j}\rightarrow M^{j}\rightarrow Y^{i}\leftarrow H^{i}$ . Continue to denote $i$ as the agent identification and $j$ as the agent ID such that $j\neq i$ . We aim to satisfy the utility objective of the information bottleneck, $I(M^{j};Y^{i})$ , through contrastive learning as shown in figure 1.

Proposition 5.1.

Utility mutual information is lower bounded by the contrastive NCE-binary objective, $I(M,Y)\geq\log\sigma(f(s,m,s_{f}^{+}))+\log\sigma(1-f(s,m,s_{f}^{-}))$ .

The proof is in Appendix A.1.

This result shows a need for gradient information to flow backward across agents along communication edge connections.

6 Experiments and Results

We condition on inputs, especially rich information (such as pixel data), and task-specific information. When evaluating an artificial language in MARL, we are interested in referential tasks, in which communication is required to complete the task. With regard to intent-grounded communication, we study ordinal tasks, which require coordination information between agents to complete successfully. Thus, we consider tasks with a team of agents to foster messaging that communicates coordination information that also includes their observations. To test H1, structuring emergent messages enables lower complexity, we test our methodology and analyze the input-oriented information and utility capabilities. Next, we analyze the ability of heterogeneous agents to understand differing communication policies (H2)). Finally, we consider the effect of social shadowing (H3), in which agents solely learn a communication policy from an expert agent’s action policy. We additionally analyze the role of offline reinforcement learning for emergent communication in combination with online reinforcement learning to further learn emergent communication alongside an action policy. We evaluate each scenario over 10 seeds.

6.1 Environments

Blind Traffic Junction

We consider a benchmark that requires both referential and ordinal capabilities within a team of agents. The blind traffic junction environment (ic3net) requires multiple agents to navigate a junction without any observation of other agents. Rather, they only observe their own state location. Ten agents must coordinate to traverse through the lanes without colliding into agents within their lane or in the junction. Our training uses REINFORCE (williams1992simple).

Pascal VOC Game

We further evaluate the complexity of compositional communication with a Pascal VOC (everingham2010pascal). This is a two-agent referential game similar to the Cifar game (MA_autoencoder) but requires the prediction of multiple classes. During each episode, each agent observes a random image from the Pascal VOC dataset containing exactly two unique labels. Each agent must encode information given only the raw pixels from the original image such that the other agent can recognize the two class labels in the original image. An agent receives a reward of 0.25 per correctly chosen class label and will receive a total reward of 1 if both agents guess all labels correctly. See figure 2. Our training uses heterogeneous agents trained with PPO (modified from MAPPO (yu2021surprising) repository). For simplicity of setup, we consider images with exactly two unique labels from a closed subset of size five labels of the original set of labels from the Pascal VOC data. Furthermore, these images must be of size $375\times 500$ pixels. Thus, the resultant dataset comprised 534 unique images from the Pascal VOC dataset.

6.2 Baselines

To evaluate our methodology, we compare our method to the following baselines: (1) no-comm, where agents do not communicate; (2) rl-comm, which uses a baseline communication method learned solely through policy loss (ic3net); (3) ae-comm, which uses an autoencoder to ground communication in input observations (MA_autoencoder); (4) VQ-VIB, which uses a variational autoencoder to ground discrete communication in input observations and a mutual information objective to ensure low entropy communication (tucker2022towards).

Table 1: Beta ablation: Messages are naturally sparse in bits due to the complexity loss. Redundancy measures the capacity for a bijection between the size of the set of unique tokens and the enumerated observations and intents. Min redundancy is 1.0 (a bijection). Lower is better.

$\beta$	Success	Message Size in Bits	Redundancy
0.1	1.0	64	1.0
0.01	.996	69.52	1.06
0.001	.986	121.66	2.06
0	.976	147.96	2.31
non-compositional	.822	512	587

6.3 Input-Oriented Information Results

We provide an ablation of the loss parameter $\beta$ in table 1 in the blind traffic junction scenario. When $\beta=0$ , we use our compositional message paradigm without our derived loss terms. We find that higher complexity and independence losses increase sample complexity. When $\beta=1$ , the model was unable to converge. However, when there is no regularization loss, the model performs worse (with no guarantees about referential representation). We attribute this to the fact that our independence criteria learns a stronger causal relationship. There are fewer spurious features that may cause an agent to take an incorrect action.

In order to understand the effect of the independent concept representation, we analyze the emergent language’s capacity for redundancy. A message token $m_{l}$ is redundant if there exists another token $m_{k}$ that represents the same information. With our methodology, the emergent ‘language’ converges to the exact number of observations and intents required to solve the task. With a soft discrete threshold, the independent information loss naturally converges to a discrete number of tokens in the vocabulary. Our $\beta$ ablation in table 1 yields a bijection between each token in the vocabulary and the possible emergent concepts, i.e., the enumerated observations and intents. Thus for $\beta=0.1$ , there is no redundancy.

Sparse Communication

In corollary 4.3, we assume that there is no mutual information between tokens. In practice, the loss may only be near-zero. Our empirical results yield independence loss around $1e-4$ . In table 1, the size of the messages is automatically compressed to the smallest size to represent the information. Despite a trivially small amount of mutual information between tokens, our compositional method is able to reduce the message size in bits by 2.3x using our derived regularization, for a total of an 8x reduction in message size over non-compositional methods such as ae-comm. Since the base unit for the token is a 32-bit float, we note that each token in the message may be further compressed. We observe that each token uses three significant digits, which may further compress tokens to 10 bits each for a total message length of 20 bits.

6.4 Communication Utility Results

Due to coordination in MARL, grounding communication in referential features is not enough. Finding the communication utility requires grounding messages in ordinal information. Overall, figure 3 shows that our compositional, contrastive method outperforms all methods focused on solely input-oriented communication grounding. In the blind traffic junction, our method yields a higher average task success rate and is able to achieve it with a lower sample complexity. Training with the contrastive update tends to spike to high success but not converge, often many episodes before convergence, which leaves area for training improvement. That is, the contrastive update begins to find aligned latent spaces early in training, but it cannot adapt the methodology quickly enough to converge. The exploratory randomness of most of the early online data prevents exploitation of the high utility $f^{+}$ examples. This leaves further room for improvement for an adaptive contrastive loss term.

Regularization loss convergence

After convergence to high task performance, the autoencoder loss increases in order to represent the coordination information. This follows directly from the information bottleneck, where there exists a tradeoff between utility and complexity. However, communication, especially referential communication, should have an overlap between utility and complexity. Thus, we should seek to make the complexity loss more convex. Our compositional communication complexity loss does not converge before task performance convergence. While the complexity loss tends to spike in the exploratory phase, the normalized value is very small. Interestingly, the method eventually converges as the complexity loss converges below a normalized 0.3. Additionally, the contrastive loss tends to decrease monotonically and converges after the task performance converges, showing a very smooth decrease. The contrastive $f^{-}$ loss decreases during training, which may account for success spikes prior to convergence. The method is able to converge after only a moderate decrease in the $f^{+}$ loss. This implies empirical evidence that the contrastive loss is an optimal critic for messaging. See figure 3.

6.5 Heterogeneous Alignment Through Communication

In order to test the heterogeneous alignment ability of our methodology to learn higher-order concepts from high-dimensional data, we analyze the performance on the Pascal VOC game. We compare our methodology against ae-comm to show that concepts should consist of independent information directly from task signal rather than compression to reconstruct inputs. That is, we show an empirical result on pixel data to verify the premise of the information bottleneck. Our methodology significantly outperforms the observation-grounded ae-comm baseline, as demonstrated by figure 4. The ae-comm methodology, despite using autoencoders to learn observation-grounded communication, performs only slightly better than no-comm. On the other hand, our methodology is able to outperform both baselines significantly. It is important to note that based on figure 4, our methodology is able to guess more than two of the four labels correctly across the two agents involved, while the baseline methodologies struggle to guess exactly two of thew four labels consistently. This can be attributed to our framework being able to learn compositional concepts that are much more easily discriminated due to mutual independence.

6.6 Social Shadowing

Critics of emergent communication may point to the increased sample complexity due to the dual communication and action policy learning. In the social shadowing scenario, heterogeneous agents can learn to generate a communication policy without learning the action policy of the watched expert agents. To enable social shadowing, the agent will alternate between a batch of traditional MARL (no expert) and (1st-person) shadowing an expert agent performing the task in its trajectory. The agent only uses the contrastive objective to update its communication policy during shadowing. In figure 5, the agent that performs social shadowing is able to learn the action policy with almost half the sample complexity required by the online reinforcement learning agent. Our results show that the structured latent space of the emergent communication learns socially benevolent coordination. This tests our hypothesis that by learning communication to understand the actions of other agents, one can enable lower sample complexity coordination. Thus, it mitigates the issues of solely observing actions.

7 Discussion

By using our framework to better understand the intent of others, agents can learn to communicate to align policies and coordinate. Any referential-based setup can be performed with a supervised loss, as indicated by the instant satisfaction of referential objectives. Even in the Pascal VOC game, which appears to be a purely referential objective, our results show that intelligent compression is not the only objective of referential communication. The emergent communication paradigm must enable an easy-to-discriminate space for the game. In multi-agent settings, the harder challenge is to enable coordination through communication. Using contrastive communication as an optimal critic aims to satisfy this, and has shown solid improvements. Since contrastive learning benefits from good examples, this method is even more powerful when there is access to examples from expert agents. In this setting, the communication may be bootstrapped, since our optimal critic has examples with strong signals from the ’social shadowing’ episodes.

Additionally, we show that the minimization of our independence objective enables tokens that contain minimal overlapping information with other tokens. Preventing trivial communication paradigms enables higher performance. Each of these objectives is complementary, so they are not trivially minimized during training, which is a substantial advantage over comparative baselines. Unlike prior work, this enables the benefits of training with reinforcement learning in multi-agent settings.

In addition to lower sample complexity, the mutual information regularization yields additional benefits, such as small messages, which enables the compression aspect of sparse communication. From a qualitative point of view, the independent information also yields discrete emergent concepts, which can be further made human-interpretable by a post-hoc analysis (yeh2021human). This is a step towards white-box machine learning in multi-agent settings. The interpretability of this learned white-box method could be useful in human-agent teaming as indicated by prior work (karten2022inter). The work here will enable further results in decision-making from high-dimensional data with emergent concepts. The social scenarios described are a step towards enabling a zero-shot communication policy. This work will serve as future inspiration for using emergent communication to enable ad-hoc teaming with both agents and humans.

References

Agarwal et al. (2020) Agarwal, A., Kumar, S., Sycara, K., and Lewis, M. Learning transferable cooperative behavior in multi-agent teams. In Proceedings of the 19th International Conference on Autonomous Agents and MultiAgent Systems, pp. 1741–1743, 2020.
Alemi et al. (2017) Alemi, A. A., Fischer, I., Dillon, J. V., and Murphy, K. Deep variational information bottleneck. ICLR, 2017.
Chung et al. (2014) Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
Eccles et al. (2019) Eccles, T., Bachrach, Y., Lever, G., Lazaridou, A., and Graepel, T. Biases for emergent communication in multi-agent reinforcement learning. Advances in neural information processing systems, 32, 2019.
Everingham et al. (2010) Everingham, M., Van Gool, L., Williams, C. K., Winn, J., and Zisserman, A. The pascal visual object classes (voc) challenge. International journal of computer vision, 88(2):303–338, 2010.
Eysenbach et al. (2022) Eysenbach, B., Zhang, T., Salakhutdinov, R., and Levine, S. Contrastive learning as goal-conditioned reinforcement learning. arXiv preprint arXiv:2206.07568, 2022.
Foerster et al. (2016) Foerster, J. N., Assael, Y. M., de Freitas, N., and Whiteson, S. Learning to communicate with deep multi-agent reinforcement learning. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pp. 2145–2153, 2016.
Freed et al. (2020) Freed, B., James, R., Sartoretti, G., and Choset, H. Sparse discrete communication learning for multi-agent cooperation through backpropagation. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 7993–7998, 2020. doi: 10.1109/IROS45743.2020.9341079.
Grupen et al. (2022) Grupen, N. A., Lee, D. D., and Selman, B. Multi-agent curricula and emergent implicit signaling. In Proceedings of the 21st International Conference on Autonomous Agents and Multiagent Systems, pp. 553–561, 2022.
Hjelm et al. (2018) Hjelm, R. D., Fedorov, A., Lavoie-Marchildon, S., Grewal, K., Bachman, P., Trischler, A., and Bengio, Y. Learning deep representations by mutual information estimation and maximization. In International Conference on Learning Representations, 2018.
Jaques et al. (2019) Jaques, N., Lazaridou, A., Hughes, E., Gulcehre, C., Ortega, P., Strouse, D., Leibo, J. Z., and De Freitas, N. Social influence as intrinsic motivation for multi-agent deep reinforcement learning. In International conference on machine learning, pp. 3040–3049. PMLR, 2019.
Karten et al. (2022) Karten, S., Tucker, M., Kailas, S., and Sycara, K. Towards true lossless sparse communication in multi-agent systems. arXiv preprint arXiv:2212.00115, 2022.
Karten et al. (2023) Karten, S., Tucker, M., Li, H., Kailas, S., Lewis, M., and Sycara, K. Interpretable learned emergent communication for human-agent teams. IEEE Transactions on Cognitive and Developmental Systems, pp. 1–1, 2023. doi: 10.1109/TCDS.2023.3236599.
Kullback (1997) Kullback, S. Information theory and statistics. Courier Corporation, 1997.
Lake et al. (2019) Lake, B. M., Linzen, T., and Baroni, M. Human few-shot learning of compositional instructions. arXiv preprint arXiv:1901.04587, 2019.
Lazaridou & Baroni (2020) Lazaridou, A. and Baroni, M. Emergent multi-agent communication in the deep learning era. arXiv preprint arXiv:2006.02419, 2020.
Lazaridou et al. (2016) Lazaridou, A., Peysakhovich, A., and Baroni, M. Multi-agent cooperation and the emergence of (natural) language. arXiv preprint arXiv:1612.07182, 2016.
LeCun et al. (1995) LeCun, Y., Bengio, Y., et al. Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks, 3361(10):1995, 1995.
Lewis (1969) Lewis, D. Convention. Harvard University Press, Cambridge, MA, 1969.
Li et al. (2022) Li, H., Oguntola, I., Hughes, D., Lewis, M., and Sycara, K. Theory of mind modeling in search and rescue teams. In 2022 31st IEEE International Conference on Robot and Human Interactive Communication (RO-MAN), pp. 483–489. IEEE, 2022.
Li et al. (2021) Li, S., Zhou, Y., Allen, R., and Kochenderfer, M. J. Learning emergent discrete message communication for cooperative reinforcement learning. arXiv preprint arXiv:2102.12550, 2021.
Lin et al. (2021) Lin, T., Huh, J., Stauffer, C., Lim, S. N., and Isola, P. Learning to ground multi-agent communication with autoencoders. Advances in Neural Information Processing Systems, 34, 2021.
Lowe et al. (2017) Lowe, R., Wu, Y., Tamar, A., Harb, J., Abbeel, P., and Mordatch, I. Multi-agent actor-critic for mixed cooperative-competitive environments. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 6382–6393, 2017.
Ma & Collins (2018) Ma, Z. and Collins, M. Noise contrastive estimation and negative sampling for conditional models: Consistency and statistical efficiency. In EMNLP, 2018.
Marathe et al. (2018) Marathe, A. R., Schaefer, K. E., Evans, A. W., and Metcalfe, J. S. Bidirectional communication for effective human-agent teaming. In International Conference on Virtual, Augmented and Mixed Reality, pp. 338–350. Springer, 2018.
Marlow et al. (2018) Marlow, S. L., Lacerenza, C. N., Paoletti, J., Burke, C. S., and Salas, E. Does team communication represent a one-size-fits-all approach?: A meta-analysis of team communication and performance. Organizational behavior and human decision processes, 144:145–170, 2018.
Mirsky et al. (2020) Mirsky, R., Macke, W., Wang, A., Yedidsion, H., and Stone, P. A penny for your thoughts: The value of communication in ad hoc teamwork. Good Systems-Published Research, 2020.
Ndousse et al. (2021) Ndousse, K. K., Eck, D., Levine, S., and Jaques, N. Emergent social learning via multi-agent reinforcement learning. In International Conference on Machine Learning, pp. 7991–8004. PMLR, 2021.
Papoudakis et al. (2019) Papoudakis, G., Christianos, F., Rahman, A., and Albrecht, S. V. Dealing with non-stationarity in multi-agent deep reinforcement learning. arXiv preprint arXiv:1906.04737, 2019.
Poole et al. (2019) Poole, B., Ozair, S., Van Den Oord, A., Alemi, A., and Tucker, G. On variational bounds of mutual information. In International Conference on Machine Learning, pp. 5171–5180. PMLR, 2019.
Samvelyan et al. (2019) Samvelyan, M., Rashid, T., De Witt, C. S., Farquhar, G., Nardelli, N., Rudner, T. G., Hung, C.-M., Torr, P. H., Foerster, J., and Whiteson, S. The starcraft multi-agent challenge. arXiv preprint arXiv:1902.04043, 2019.
Singh et al. (2018) Singh, A., Jain, T., and Sukhbaatar, S. Learning when to communicate at scale in multiagent cooperative and competitive tasks. In International Conference on Learning Representations, 2018.
Sokota et al. (2022) Sokota, S., De Witt, C. A. S., Igl, M., Zintgraf, L. M., Torr, P., Strohmeier, M., Kolter, Z., Whiteson, S., and Foerster, J. Communicating via markov decision processes. In International Conference on Machine Learning, pp. 20314–20328. PMLR, 2022.
Sukhbaatar et al. (2016) Sukhbaatar, S., Fergus, R., et al. Learning multiagent communication with backpropagation. Advances in neural information processing systems, 29:2244–2252, 2016.
Tishby & Zaslavsky (2015) Tishby, N. and Zaslavsky, N. Deep learning and the information bottleneck principle. In 2015 ieee information theory workshop (itw), pp. 1–5. IEEE, 2015.
Tucker et al. (2021) Tucker, M., Li, H., Agrawal, S., Hughes, D., Sycara, K., Lewis, M., and Shah, J. A. Emergent discrete communication in semantic spaces. Advances in Neural Information Processing Systems, 34, 2021.
Tucker et al. (2022) Tucker, M., Shah, J., Levy, R., and Zaslavsky, N. Towards human-agent communication via the information bottleneck principle. arXiv preprint arXiv:2207.00088, 2022.
Van Den Oord et al. (2017) Van Den Oord, A., Vinyals, O., et al. Neural discrete representation learning. Advances in neural information processing systems, 30, 2017.
Vaswani et al. (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. Advances in neural information processing systems, 30, 2017.
Wang et al. (2020) Wang, R., He, X., Yu, R., Qiu, W., An, B., and Rabinovich, Z. Learning efficient multi-agent communication: An information bottleneck approach. In International Conference on Machine Learning, pp. 9908–9918. PMLR, 2020.
Williams (1992) Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3):229–256, 1992.
Yeh et al. (2021) Yeh, C.-K., Kim, B., and Ravikumar, P. Human-centered concept explanations for neural networks. In Neuro-Symbolic Artificial Intelligence: The State of the Art, pp. 337–352. IOS Press, 2021.
Yu et al. (2021) Yu, C., Velu, A., Vinitsky, E., Wang, Y., Bayen, A., and Wu, Y. The surprising effectiveness of ppo in cooperative, multi-agent games. arXiv preprint arXiv:2103.01955, 2021.

Appendix A Appendix

A.1 Proofs

Proposition 4.1 For the interaction information between all tokens, the following upper bound holds: $I(m_{1};\ldots;m_{L}|h)\leq\mathds{E}_{h\sim p(h)}\left[D_{KL}\left(q(\hat{m}|h)||\pi^{i}_{m}(m_{1}|h)\otimes\cdots\otimes\pi^{i}_{m}(m_{L}|h)\right)\right]$ .

Proof.

Starting with the independent information objective, we want to minimize the interaction information,

		$\displaystyle I(m_{1};\ldots;m_{L}\|h)=$
		$\displaystyle\int\ldots\int f_{m}(m_{1},\ldots,m_{L},h)dh\ d{m_{1}}\ldots d{m_{L}}$

which defines the conditional mutual information between each token and,

\displaystyle f_{m}(*)=p(h)p(m_{1};\ldots;m_{L}|h)\log\frac{p(m_{1};\ldots;m_{L}|h)}{\prod_{l}^{L}p(m_{L}|h)}

(4)

Let $\pi_{m}^{i}(m_{l}|h)$ be a variational approximation of $p(m_{l}|h)$ , which is defined by our message encoder network. Given that each token should provide unique information, we assume independence between $m_{l}$ . Thus, it follows that our compositional message is a vector, $m=[m_{1},\ldots,m_{L}]$ , and is jointly Gaussian. Moreover, we can define $q(\hat{m}|h)$ as a variational approximation to $p(m|h)=p(m_{1};\ldots,m_{L}|h)$ . We can model $q$ with a network layer and define its loss as $||\hat{m}-m||_{2}$ . Thus, transforming equation 4 into variational form, we have,

\displaystyle g_{m}(m_{1},\ldots,m_{L},h)=p(h)q(\hat{m}|h)\log\frac{q(\hat{m}|h)}{\prod_{l}^{L}\pi_{m}^{i}(m_{l}|h)}

Since Kullback-Leibler divergence $D_{KL}$ is non-negative,

D_{KL}\left(q(\hat{m}|h)||\pi^{i}_{m}(m_{1}|h)\otimes\cdots\otimes\pi^{i}_{m}(m_{L}|h)\right)\geq 0,

it follows that

\int q(\hat{m}|h)\log q(\hat{m}|h)d\hat{m}\geq\int q(\hat{m}|h)\log\prod_{l}^{L}\pi^{i}_{m}(m_{l}|h)d\hat{m}

Thus, we can bound our interaction information,

		$\displaystyle I(m_{1};\ldots;m_{L}\|h)\leq\int\ldots\int g_{m}(*)dhd{m_{1}}\ldots d{m_{L}}$
		$\displaystyle=\mathds{E}_{h\sim p(h)}\left[D_{KL}\left(q(\hat{m}\|h)\|\|\pi^{i}_{m}(m_{1}\|h)\otimes\cdots\otimes\pi^{i}_{m}(m_{L}\|h)\right)\right]$

∎

Proposition 4.2 For the mutual information between the composed message and encoded information, the following upper bound holds: $I(H;M)\leq\sum_{l}^{L}\mathds{E}_{h\sim p(h)}\left[D_{KL}\left(q(m_{l}|h)||z(m_{l}))\right)\right]$ .

Proof.

By definition of mutual information between the composed messages $M$ and the encoded observations $H$ , we have,

\displaystyle I(H;M)=\int\int p(h)p(\hat{m}|h)\log\frac{p(\hat{m}|h)}{p(\hat{m})}d\hat{m}\ dh

Substituting $q(\hat{m}|h)$ for $p(\hat{m}|h)$ , the same KL Divergence identity, and defining a Gaussian approximation $z(\hat{m})$ of the marginal distribution $p(\hat{m})$ , it follows that,

\displaystyle I(H;M)\leq\int\int p(h)q(\hat{m}|h)\log\frac{q(\hat{m}|h)}{z(\hat{m})}d\hat{m}\ dh

In expectation of equation 1, we have,

q(\hat{m}|h)=q(\hat{m}|h)=\prod_{l}^{L}\pi^{i}_{m}(m_{l}|h).

This implies that, for $\hat{m}=[m_{1},\ldots,m_{L}]$ , there is probabilistic independence between $m_{j},m_{k},j\neq k$ . Thus, expanding, it follows that,

	$\displaystyle I(H;M)$	$\displaystyle\leq\sum_{l}^{L}\int\int p(h)q(m_{l}\|h)\log\frac{q(m_{l}\|h)}{z(m_{l})}dm_{l}\ dh$
		$\displaystyle=\sum_{l}^{L}\mathds{E}_{h\sim p(h)}\left[D_{KL}\left(q(m_{l}\|h)\|\|z(m_{l}))\right)\right]$

where $z(m_{l})$ is a standard Gaussian. ∎

Proposition 5.1. Utility mutual information is lower bounded by the contrastive NCE-binary objective, $I(M,Y)\geq\log\sigma(f(s,m,s_{f}^{+}))+\log\sigma(1-f(s,m,s_{f}^{-}))$ .

Proof.

We suppress the reliance on $h$ since this is directly passed through. By definition of mutual information, we have,

\displaystyle I(M^{j};Y^{i})=\int\int p(m)\pi_{R^{+}}(y|m)\log\frac{\pi_{R^{+}}(y|m)}{\pi_{R^{-}}(y)}dm\,dy

Our network model learns $\pi_{R^{+}}(y|m)$ from rolled-out trajectories, $R^{+}$ , using our policy. The prior of our network state, $\pi_{R^{-}}(y)$ , can be modeled from rolling out a random trajectory, $R-$ . Unfortunately, it is intractable to model $\pi_{R^{+}}(y|m)$ and $\pi_{R^{-}}(y)$ directly during iterative learning, but we can sample $y^{+}\sim\pi_{R^{+}}(y|m)$ and $y^{-}\sim\pi_{R^{-}}(y)$ directly from our network during training.

It has been shown that $\log p(y|m)$ provides a bound on mutual information (poole2019variational),

\displaystyle I(M^{j};Y^{i})\geq\mathds{E}\left[\frac{1}{K}\sum_{k=1}^{K}\log\pi_{R^{+}}(y_{k}|m_{k})+\log\pi_{R^{-}}(y_{k})\right]

(5)

with the expectation over $\prod_{l}p(m_{l},y_{l})$ . However, we need a tractable understanding of the information $Y$ .

Lemma A.1.

$\pi_{R^{-}}(y)=p(s^{\prime}=s_{f}^{-}|y)$ .

In the information bottleneck, $Y$ represents the desired outcome. In our setup, $y$ is coordination information that helps create the desired output, such as any action $a^{-}$ . This implies, $y\implies a^{-}$ . Since the transition is known, it follows that $a^{-}\implies s_{f}^{-}$ , a random future state. Thus, we have, $\pi_{R^{-}}(y)=p(s^{\prime}=s_{f}^{-}|y)$ .

Lemma A.2.

$\pi_{R^{+}}(y|m)=p(s^{\prime}=s_{f}^{+}|y,m)$ .

This is similar to the proof for lemma A.5, but requires assumptions on messages $m$ from the emergent language. We note that when $m$ is random, the case defaults to lemma A.5. Thus, we assume we have at least input-oriented information in $m$ given sufficiently satisfying equation 2. Given a sufficient emergent language, it follows that $y\implies a^{+}$ , where $a^{+}$ is an intention action based on $m$ . Similarly, since the transition is known, $a^{+}\implies s_{f}^{+}$ , a desired goal state along the trajectory. Thus, we have, $\pi_{R^{+}}(y|m)=p(s^{\prime}=s_{f}^{+}|y,m)$ .

Recall the following (as shown in (eysenbach2022contrastive)), which we have adapted to our communication objective,

Proposition A.3 (rewards $\rightarrow$ probabilities).

The Q-function for the goal-conditioned reward function $r_{g}(s_{t},m_{t})=(1-\gamma)p(s^{\prime}=s_{g}|y_{t})$ is equivalent to the probability of state $s_{g}$ under the discounted state occupancy measure:

Q_{s_{g}}^{\pi}(s,m)=p^{\pi}(s_{f}^{+}=s_{g}|y)

(6)

and

Lemma A.4.

The critic function that optimizes equation 8 is a Q-function for the goal-conditioned reward function up to a multiplicative constant $\frac{1}{p(s_{f})}$ : $\exp(f^{*}(s,m,s_{f})=\frac{1}{p(s_{f})}Q_{s_{f}}^{\pi}(s,m)$ .

The critic function $f(s,m,s_{f})=y^{\intercal}\texttt{enc}(s_{f})$ represents the similarity between the encoding $y=\texttt{enc}(s,m)$ and the encoding of the future rollout $s_{f}$ .

Given lemmas A.5 A.6 A.8 and proposition A.7, it follows that equation 8 is the NCE-binary (ma2018noise) (InfoMAX (hjelm2018learning)) objective,

\hat{I}(M^{j},Y^{i})=\log\left(\sigma(f(s,m,s_{f}^{+}))\right)+\log\left(1-\sigma(f(s,m,s_{f}^{-}))\right)

(7)

which lower bounds the mutual information, $I(M^{j},Y^{i})\geq\hat{I}(M^{j},Y^{i})$ . The critic function is unbounded, so we constrain it to $[0,1]$ with the sigmoid function, $\sigma(*)$ . We suppress the reliance on $h$ since this is directly passed through. By definition of mutual information, we have,

\displaystyle I(M^{j};Y^{i})=\int\int p(m)\pi_{R^{+}}(y|m)\log\frac{\pi_{R^{+}}(y|m)}{\pi_{R^{-}}(y)}dm\,dy

It has been shown that $\log p(y|m)$ provides a bound on mutual information (poole2019variational),

\displaystyle I(M^{j};Y^{i})\geq\mathds{E}\left[\frac{1}{K}\sum_{k=1}^{K}\log\pi_{R^{+}}(y_{k}|m_{k})+\log\pi_{R^{-}}(y_{k})\right]

(8)

with the expectation over $\prod_{l}p(m_{l},y_{l})$ . However, we need a tractable understanding of the information $Y$ .

Lemma A.5.

$\pi_{R^{-}}(y)=p(s^{\prime}=s_{f}^{-}|y)$ .

Lemma A.6.

$\pi_{R^{+}}(y|m)=p(s^{\prime}=s_{f}^{+}|y,m)$ .

Recall the following (as shown in (eysenbach2022contrastive)), which we have adapted to our communication objective,

Proposition A.7 (rewards $\rightarrow$ probabilities).

Q_{s_{g}}^{\pi}(s,m)=p^{\pi}(s_{f}^{+}=s_{g}|y)

(9)

and

Lemma A.8.

The critic function $f(s,m,s_{f})=y^{\intercal}\texttt{enc}(s_{f})$ represents the similarity between the encoding $y=\texttt{enc}(s,m)$ and the encoding of the future rollout $s_{f}$ .

Given lemmas A.5 A.6 A.8 and proposition A.7, it follows that equation 8 is the NCE-binary (ma2018noise) (InfoMAX (hjelm2018learning)) objective,

\hat{I}(M^{j},Y^{i})=\log\left(\sigma(f(s,m,s_{f}^{+}))\right)+\log\left(1-\sigma(f(s,m,s_{f}^{-}))\right)

(10)

which lower bounds the mutual information, $I(M^{j},Y^{i})\geq\hat{I}(M^{j},Y^{i})$ . The critic function is unbounded, so we constrain it to $[0,1]$ with the sigmoid function, $\sigma(*)$ . ∎

On the Role of Emergent Communication for Social Learning in Multi-Agent Reinforcement Learning

Abstract

1 INTRODUCTION

2 RELATED WORK

2.1 Multi-Agent Signaling

2.2 Emergent Communication

2.3 Natural Language Inspiration

3 Preliminaries

3.1 Architecture

3.2 Objective

4 Complexity through Compositional Communication

4.1 Independent Information

Proposition 4.1.

4.2 Input-Oriented Information

Proposition 4.2.

4.3 Sequence Length

Corollary 4.3.

4.4 Message Generation Architecture

5 Utility through Contrastive Learning

Proposition 5.1.

6 Experiments and Results

6.1 Environments

Blind Traffic Junction

Pascal VOC Game

6.2 Baselines

6.3 Input-Oriented Information Results

Sparse Communication

6.4 Communication Utility Results

Regularization loss convergence

6.5 Heterogeneous Alignment Through Communication

6.6 Social Shadowing

7 Discussion

References

Appendix A Appendix

A.1 Proofs

Proof.

Proof.

Proof.

Lemma A.1.

Lemma A.2.

Proposition A.3 (rewards →\rightarrow probabilities).

Lemma A.4.

Lemma A.5.

Lemma A.6.

Proposition A.7 (rewards →\rightarrow probabilities).

Lemma A.8.

On the Role of Emergent Communication for Social Learning
in Multi-Agent Reinforcement Learning

Proposition A.3 (rewards $\rightarrow$ probabilities).

Proposition A.7 (rewards $\rightarrow$ probabilities).