
NeLLCom-X: A Comprehensive Neural-Agent Framework
to Simulate Language Learning and Group Communication

Yuchen Lian   Tessa Verhoef   Arianna Bisazza
 
Faculty of Electronic and Information Engineering, Xi’an Jiaotong University
Leiden Institute of Advanced Computer Science, Leiden University
{y.lian, t.verhoef}@liacs.leidenuniv.nl
Center for Language and Cognition, University of Groningen
[email protected]
Shared senior authorship.
Abstract

Recent advances in computational linguistics include simulating the emergence of human-like languages with interacting neural network agents, starting from sets of random symbols. The recently introduced NeLLCom framework Lian et al. (2023) allows agents to first learn an artificial language and then use it to communicate, with the aim of studying the emergence of specific linguistic properties. We extend this framework (NeLLCom-X) by introducing more realistic role-alternating agents and group communication in order to investigate the interplay between language learnability, communication pressures, and group size effects. We validate NeLLCom-X by replicating key findings from prior research simulating the emergence of a word-order/case-marking trade-off. Next, we investigate how interaction affects linguistic convergence and emergence of the trade-off. The novel framework facilitates future simulations of diverse linguistic aspects, emphasizing the importance of interaction and group dynamics in language evolution.

1 Introduction

Human language can be viewed as a complex adaptive dynamical system Fitch (2007); Steels (2000); Beckner et al. (2009), in which individual behaviours of language users drive linguistic emergence and change at the population level. Languages are shaped by the brains of individuals who are learning them Christiansen and Chater (2008); Kirby et al. (2014) and novel conventions and meanings are negotiated during interaction and language use Fusaroli and Tylén (2012); Namboodiripad et al. (2016); Garrod et al. (2007). The effect of these mechanisms on linguistic patterns has been studied extensively, and it is recognized that language systems do not spring from the mind of a single individual, but are the result of constant reinterpretation and filtering through populations of human minds. As such, language users are not mere passive learners, but unconsciously and gradually contribute to language change.

Recently, this interactive and dynamic property of human language was recognized as a key factor to improve AI Mikolov et al. (2018), leading to a large interest in simulating the emergence of human-like languages with neural network agents Havrylov and Titov (2017); Kottur et al. (2017); Lazaridou et al. (2017); Lazaridou and Baroni (2020). Typically, a pair of agents is simulated where a speaking agent tries to help a listener recover an intended meaning by generating a message the listener can interpret. Early frameworks have been progressively expanded to display important aspects of human language and communication, like generational transmission Li and Bowling (2019); Chaabouni et al. (2019); Lian et al. (2021); Chaabouni et al. (2022), group interaction Tieleman et al. (2019); Chaabouni et al. (2022); Rita et al. (2022); Michel et al. (2023); Kim and Oh (2021) and other aspects Galke and Raviv (2024). Within this body of work, most studies start from sets of random symbols, with a strong focus on tracking the emergence of human-like language properties such as compositionality Chaabouni et al. (2020, 2022); Li and Bowling (2019); Conklin and Smith (2022) or principles of lexical organization like Zipf’s law of abbreviation Rita et al. (2020).

Figure 1: Overview of the NeLLCom-X framework.

However, neural agent emergent communication frameworks could also be a valuable tool to simulate the evolution of more specific aspects of language. Studies with human participants have addressed many other aspects such as specific syntactic patterns like word order or morphology Saldana et al. (2021b); Culbertson et al. (2012); Christensen et al. (2016); Motamedi et al. (2022), a tendency to reduce dependency lengths Fedzechkina et al. (2018); Saldana et al. (2021a), colexification patterns and the role of iconicity or metaphor in the emergence of new meanings Karjus et al. (2021); Verhoef et al. (2015, 2016, 2022); Tamariz et al. (2018), and combinatorial organisation of basic building blocks Roberts and Galantucci (2012); Verhoef (2012); Verhoef et al. (2014). What most of these studies have in common is that participants are asked to learn and/or interact with pre-defined artificial languages specifically designed by the experimenters to study the linguistic property of interest. However, the existing neural-agent communication frameworks (often based on EGG Kharitonov et al. (2019)), do not enable training agents on pre-defined languages. A different body of work has studied the learnability by neural networks of various types of artificial languages (Lupyan and Christiansen, 2002; Wang and Eisner, 2016; Bisazza et al., 2021; White and Cotterell, 2021; Hopkins, 2022; Kallini et al., 2024). This paradigm has led to important insights, revealing inductive biases of neural models, but is limited to studying learnability in a passive supervised learning setting, unlike the dynamic and interactive setting in which human language has evolved.

A framework combining agent communication with the ability to learn pre-defined artificial languages was recently introduced by Lian et al. (2023). In NeLLCom (Neural agent Language Learning and Communication), agents are first trained on an initial language through Supervised Learning, followed by a communication phase in which a speaking and listening agent continue learning together through Reinforcement Learning by optimizing a shared communicative reward.

In this paper, we extend NeLLCom with group interaction with the aim of studying the interplay between learnability of specific pre-defined languages, communication pressures, and group size effects under the same framework. To this end, we first extend the vanilla NeLLCom agent to act as both listener and speaker (i.e. role alternation, cf. Figure 1), which was identified as an important gap in the emergent communication literature by Galke et al. (2022). Then, we design a procedure to let such ‘full-fledged’ agents interact in pairs with either similar or different initial language exposure, or in groups of various sizes. With the extended framework, NeLLCom-X, we replicate the key findings of Lian et al. (2023) and additionally show that (i) pairs of agents trained on different initial languages quickly adapt their utterances towards a mutually understandable language, (ii) languages used by agents in larger groups become more optimized and less redundant, and (iii) a word-order/case-marking trade-off emerges not only in individual speakers, but also at the group level.

We release NeLLCom-X to promote simulations of other language aspects where interaction and group dynamics are expected to play a key role: https://github.com/Yuchen-Lian/NeLLCom-X

2 Related Work

Role-alternating agents

Initially, most work on emergent communication modeled agents to fulfill separate, complementary roles (i.e. one agent always speaks, the other always listens). Human language users are, of course, able to take both roles. When listing a set of "design features" of human language, Hockett (1960) referred to interchangeability as the ability of language speakers to reproduce any linguistic message they can understand. In experiments with humans communicating via artificial languages, participants also usually take turns being the speaker and listener Kirby et al. (2015); Namboodiripad et al. (2016); Roberts and Galantucci (2012); Verhoef et al. (2015, 2022). Therefore, Galke et al. (2022) named role alternation as a missing key ingredient to close the gap between outcomes of simulations and findings from human language evolution data.

Exceptions to this trend include the role-alternating architectures of Kottur et al. (2017), Harding Graesser et al. (2019), and Taillandier et al. (2023). Recently, Michel et al. (2023) proposed a method to couple a speaker and listener among a group of speaking and listening agents. Through what they call "partitioning", the listener part is only trained to adapt to its associated speaker, while the listener parameters are frozen during communication with other speakers. Hence, the speaking and listening parts of an agent are only tied softly, i.e. there is no "physical" link via shared modules. While workable, this partitioning seems less realistic in terms of cognitive plausibility and communication, as human listeners continually refine their understanding during all kinds of interactions (speaking as well as listening). What all these studies have in common is their focus on protocols emerging from scratch, i.e. starting from random symbols, which does not allow for simulations with pre-defined languages. Closer to our goal, Chaabouni et al. (2019) train agents on artificial languages and observe them drift in a simple iterated learning setup that does not model communication success. They use sequence-to-sequence networks that can function both as speaker and listener by representing both utterances and meanings as sequences and merging meaning and word embeddings into a single weight matrix, tied between input and output.

We combine elements of the above techniques to design agents that can learn artificial languages and use them to interact in a realistic manner.

Group communication

Natural languages typically have more than two speakers, and language structure is shaped by properties of the population. According to the Linguistic Niche hypothesis, for example, languages used by larger communities tend to be simpler than those used in smaller, more isolated groups Wray and Grace (2007); Lupyan and Dale (2010). Similarly, experiments with human participants have shown that interactions in larger groups can result in more systematic languages Raviv et al. (2019). Various emergent communication simulations have been designed to investigate group effects, revealing the emergence of natural language phenomena. Tieleman et al. (2019), for example, found that representations emerging in groups are less idiosyncratic and more symbolic. They model a population of community autoencoders, and since the identities of the encoder and decoder are not revealed within a pair, the emerging representations develop in such a way that all decoders can use them to successfully reconstruct the input, resulting in a simpler language, as also found in humans. Michel et al. (2023) found that larger agent groups develop more compositional languages. Harding Graesser et al. (2019) investigated various language contact scenarios with populations of agents that first developed distinct languages within their own groups, and observed the emergence of simpler ‘creole’ languages, resembling findings from human language contact. Kim and Oh (2021) vary the connectivity between agents in groups, and find the spontaneous emergence of linguistic dialects in large groups of over a hundred agents having only local interactions. Again, none of these frameworks support training agents on pre-defined languages, limiting the extent to which they can be applied to specific human-like linguistic features.

In this work, we showcase how NeLLCom-X agents can interact in groups using artificial languages that were specifically designed to study the emergence of word-order/case-marking patterns.

3 NeLLCom-X

We summarize the original NeLLCom framework Lian et al. (2023) and then explain how we extend it with role alternation and group communication.

3.1 Original Framework

NeLLCom agents exchange simple meanings using pre-defined artificial languages. To achieve this, the framework combines: (i) a supervised learning (SL) phase, during which agents are taught a language with specific properties, and (ii) a reinforcement learning (RL) phase, during which agent pairs interact via a meaning reconstruction game.

Meanings are triplets m = {A, a, p} representing simple scenes with an action, agent, and patient, respectively (e.g. praise, fox, crow). An artificial language defines a mapping between any given meaning m and utterance u, which is a variable-length sequence of symbols from a fixed-size vocabulary (e.g. ‘Fox praises crow’). According to the language design, the same meaning may be expressed by different utterances, and vice versa, the same utterance may signal different meanings.

The speaking function S: m ↦ u is implemented by a linear-to-RNN network, whereas the listening function L: u ↦ m is implemented by a symmetric RNN-to-linear network. (To make the two networks fully symmetric, we slightly modify the original listener architecture of Lian et al. (2023) by adding a meaning embedding layer before the final softmax; preliminary experiments show no visible effect on the results.) The sequential components are implemented as a single-layer Gated Recurrent Unit Chung et al. (2014). In both directions, meanings are represented by unordered tuples instead of sequences to avoid any ordering bias, differently from Chaabouni et al. (2019), who represent meanings as sequences.
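To make the architecture concrete, below is a minimal PyTorch sketch of one possible speaker/listener pair using the dimensions reported in Appendix A (16-dim GRU, 8-dim meaning embeddings, 16-dim word embeddings). Class names and layer composition are ours; the released implementation may differ in details such as special tokens and decoding.

```python
import torch
import torch.nn as nn

VOCAB_SIZE, MEANING_SYMBOLS = 19, 18   # 19 words (10 entities + 8 actions + 'mk'); 18 meaning symbols
WORD_DIM, MEANING_DIM, HIDDEN = 16, 8, 16

class Speaker(nn.Module):
    """Linear-to-RNN: unordered meaning tuple -> word sequence."""
    def __init__(self):
        super().__init__()
        self.meaning_emb = nn.Embedding(MEANING_SYMBOLS, MEANING_DIM)
        self.to_hidden = nn.Linear(3 * MEANING_DIM, HIDDEN)
        self.word_emb = nn.Embedding(VOCAB_SIZE, WORD_DIM)
        self.gru = nn.GRU(WORD_DIM, HIDDEN, batch_first=True)
        self.out = nn.Linear(HIDDEN, VOCAB_SIZE)

    def forward(self, meaning, prev_words):
        # meaning: (B, 3) ids for {action, agent, patient}; prev_words: (B, T) word ids
        h0 = torch.tanh(self.to_hidden(self.meaning_emb(meaning).flatten(1))).unsqueeze(0)
        out, _ = self.gru(self.word_emb(prev_words), h0)
        return self.out(out)            # (B, T, VOCAB_SIZE) word logits

class Listener(nn.Module):
    """RNN-to-linear: word sequence -> meaning-symbol predictions."""
    def __init__(self):
        super().__init__()
        self.word_emb = nn.Embedding(VOCAB_SIZE, WORD_DIM)
        self.gru = nn.GRU(WORD_DIM, HIDDEN, batch_first=True)
        self.meaning_emb = nn.Linear(HIDDEN, MEANING_DIM)   # added for symmetry (see note above)
        self.out = nn.Linear(MEANING_DIM, MEANING_SYMBOLS)

    def forward(self, utterance):
        _, h = self.gru(self.word_emb(utterance))            # h: (1, B, HIDDEN)
        return self.out(self.meaning_emb(h.squeeze(0)))      # (B, MEANING_SYMBOLS) logits
        # in practice, one prediction (or mask) per role: action / agent / patient
```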

The SL phase minimizes the cross-entropy loss of the predicted words given the meaning (speaker) or the predicted meaning tuple given the utterance (listener) with respect to a gold-standard dataset D = {(m, u)}. The RL phase maximizes a shared reward r(m, û) evaluated by the listener’s prediction L(û) given the speaker-generated utterance û = S(m). More details on the SL and RL procedures, the respective training objectives, and network architectures are given in Appendix A.

Crucially, each agent in the original NeLLCom can either function as listener (utterance-to-meaning) or as speaker (meaning-to-utterance), but not as both (see Figure 1). While this minimal setup was sufficient to simulate the emergence of the word-order/case-marking trade-off Lian et al. (2023), it does not allow for role alternation, a missing key ingredient for realistic simulations of emergent communication Galke et al. (2022) and a necessary condition to simulate group communication.

3.2 Full-fledged Agent

To realize a full-fledged agent α that can speak and listen while interacting with other agents, we pair two networks α_i = (N_i^S, N_i^L) using two strategies: parameter sharing and self-play (Fig. 1).

Parameter sharing

A common practice in NLP is tying the weights of the embedding (input) and softmax (output) layers to maximize performance and reduce the number of parameters in large language models Press and Wolf (2017). Chaabouni et al. (2019) applied this technique to their sequence-to-sequence utterance↔meaning architecture. However, in our setup, listening and speaking are implemented by two separate, symmetric networks. We therefore tie the input embedding of the speaking network to the output embedding of the listening network, X(N_i^S) = O(N_i^L) (both representing meanings). Likewise, we tie the input embedding of the listener to the output embedding of the speaker, X(N_i^L) = O(N_i^S) (both representing words). Because of these shared parameters, the speaker training process also affects the listener, and vice versa. To balance listener and speaker optimization during supervised learning, we alternate between the two after each epoch. (As verified in preliminary experiments, results are similar whether the last epoch is a listening or a speaking one.)
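As an illustration, the tying scheme can be expressed with the standard PyTorch weight-sharing idiom, assuming the Speaker/Listener classes sketched above; this is a sketch, not the released implementation.

```python
class Agent:
    """Full-fledged agent: a speaker and a listener with tied embeddings."""
    def __init__(self):
        self.speaker, self.listener = Speaker(), Listener()
        # meanings: X(N^S) = O(N^L), both of shape (MEANING_SYMBOLS, MEANING_DIM)
        self.speaker.meaning_emb.weight = self.listener.out.weight
        # words: X(N^L) = O(N^S), both of shape (VOCAB_SIZE, WORD_DIM); requires WORD_DIM == HIDDEN
        self.listener.word_emb.weight = self.speaker.out.weight
```

With this tying, any gradient update to the listener's output meaning layer also moves the speaker's input meaning embedding, and likewise for the word embeddings.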

Self-play

Even when word and meaning representations are shared, the rest of the speaking and listening networks remain disjoint, potentially causing the speaking and listening abilities to drift in different directions. As discussed in Section 2, a realistic full-fledged agent should be able to understand itself at any moment. To ensure this, we let the agent’s speaking network send messages to its own listening network while optimizing the shared communicative reward r, a procedure known as self-play in the emergent communication literature (Lowe et al., 2020; Lazaridou et al., 2020). In Section 6.1, we show empirically that self-play is indeed necessary to preserve the agents’ self-understanding while their language evolves in interaction.

3.3 Interactive Communication

Given the new full-fledged agent definition, communication becomes possible between two or more role-alternating agents. We introduce the notion of turn to denote a minimal communication session where RL weight updates take place between an agent’s speaker and either its own listener or another agent’s listener:

\mathsf{self\_turn}(\alpha_i) = \mathsf{RL}(N^{\mathcal{S}}_i, N^{\mathcal{L}}_i)   (1)
\mathsf{inter\_turn}(\alpha_i, \alpha_j) = \mathsf{RL}(N^{\mathcal{S}}_i, N^{\mathcal{L}}_j)   (2)

For example, in our experiments, a turn corresponds to 10 batches of 32 meanings. Note that interaction can involve agents that were trained on the same language, or on different initial languages, as we will show in Section 6.

Input: set of SL-trained agents Agents,
       edges in the connectivity graph G,
       n_rounds, σ
 1  for r = 1 : n_rounds do
 2      comm_turns = shuffle(G)
 3      for turn_i in comm_turns do
 4          i_spk, i_lst = turn_i
 5          α_spk = Agents[i_spk];  α_lst = Agents[i_lst]
 6          inter_turn(α_spk, α_lst)
 7          for α in {α_spk, α_lst} do
 8              α.activation += 1
 9              if α.activation >= σ then
10                  self_turn(α)
11                  α.activation = 0
Algorithm 1: Group Communication

Turn scheduling

During group communication, a connectivity graph G is used to define which agents can communicate with one another and which cannot. Within G, a node i represents an agent and a directed edge (i, j) represents a connection whereby α_i can speak to α_j, but not necessarily vice versa. Turn scheduling then proceeds as shown in Algorithm 1: before each turn, an edge (i, j) is sampled without replacement from G. Then α_i and α_j perform an inter_turn of the meaning reconstruction game, with α_i acting as the speaker and α_j as the listener. Interactive turns are interleaved with self-play turns at fixed intervals, i.e. every time an agent has participated in σ inter_turns, it performs one self_turn. Once all edges in G have been sampled, a communication round is complete. In this work, we only consider a setup where all agents can interact with all other agents (G is a complete directed graph). We leave an exploration of more complex configurations, such as those studied by Harding Graesser et al. (2019); Kim and Oh (2021); Michel et al. (2023), to future work. We set σ = 10 in all interactive experiments, unless specified otherwise. Interaction between two agents follows the same procedure as group communication.
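For concreteness, the scheduling of Algorithm 1 could be rendered in Python as in the following sketch; inter_turn and self_turn are placeholder callables standing for one RL turn (10 batches of 32 meanings in our experiments), not the released implementation.

```python
import random

def group_communication(agents, edges, n_rounds, inter_turn, self_turn, sigma=10):
    """Schedule interactive and self-play turns over a directed connectivity graph."""
    activation = [0] * len(agents)                   # inter_turns since the last self_turn
    for _ in range(n_rounds):
        comm_turns = list(edges)
        random.shuffle(comm_turns)                   # sample edges without replacement
        for i_spk, i_lst in comm_turns:
            inter_turn(agents[i_spk], agents[i_lst])   # agent i speaks, agent j listens
            for i in (i_spk, i_lst):
                activation[i] += 1
                if activation[i] >= sigma:           # one self_turn every sigma inter_turns
                    self_turn(agents[i])
                    activation[i] = 0

# In this work the graph is complete and directed:
# edges = [(i, j) for i in range(n_agents) for j in range(n_agents) if i != j]
```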

4 Experimental Setup

As our use case, we adopt the same artificial languages as Lian et al. (2023). These simple verb-final languages vary in their use of word order and/or case marking to denote subject and object, and were originally proposed by Fedzechkina et al. (2017) to study the existence of an effort-informativeness trade-off in human learners.

Artificial languages

The meaning space includes 10 entities and 8 actions, resulting in a total of 10 × (10-1) × 8 = 720 possible meanings. Utterances can be either SOV or OSV. The order profile of a language is defined by the proportion of SOV, e.g. 100% fixed, 80% dominant, 50% maximally flexible order. Objects are optionally followed by a special token ‘mk’ while subjects are never marked. To simplify the vocabulary learning problem, each meaning item corresponds to exactly one word, leading to a vocabulary size of 10 + 8 + 1 = 19. Two example languages are shown in Table 1.

language   | properties                   | possible utterances
100s+0m    | 100% SOV; 0% marker          | Tom Jerry chase
80s+100m   | 80/20% SOV/OSV; 100% marker  | Tom Jerry mk chase
           |                              | Jerry mk Tom chase
Table 1: Two example languages with varying order and marking proportions, and corresponding utterances for the meaning m = {A: chase, a: tom, p: jerry}.
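As an illustration, a gold utterance for a meaning under languages like those in Table 1 could be sampled as in the sketch below; the word forms and the independent sampling of order and marking are simplifications of the actual grammar definition.

```python
import random

def sample_utterance(meaning, p_sov, p_marker):
    """Sample a verb-final utterance for a meaning under given order/marking proportions."""
    action, agent, patient = meaning                     # e.g. ('chase', 'tom', 'jerry')
    obj = [patient] + (['mk'] if random.random() < p_marker else [])
    if random.random() < p_sov:                          # SOV with probability p_sov, else OSV
        return [agent] + obj + [action]
    return obj + [agent] + [action]

print(sample_utterance(('chase', 'tom', 'jerry'), p_sov=0.8, p_marker=1.0))
# e.g. ['tom', 'jerry', 'mk', 'chase']  or  ['jerry', 'mk', 'tom', 'chase']
```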

Evaluation

Following Lian et al. (2023), agents are evaluated on a held-out set of meanings unseen during any training phase. The SL phase is evaluated by listening/speaking accuracy computed against the gold dataset D, while the RL phase is evaluated by meaning reconstruction accuracy, or communication success. In NeLLCom-X, communication success denotes two different aspects: self-understanding when measured between the same agent’s speaker and listener network, or interactive communication success when measured between a speaking agent and a different listener agent:

acc_{self}(m, \alpha_i) = acc\big(m, \mathcal{L}_{\alpha_i}(\mathcal{S}_{\alpha_i}(m))\big)   (3)
acc_{inter}(m, \alpha_i, \alpha_j) = acc\big(m, \mathcal{L}_{\alpha_j}(\mathcal{S}_{\alpha_i}(m))\big)   (4)

where acc(m, m̂) is 1 iff the entire meaning is matched. Interactive success is not symmetric.
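In code, the two measures could be computed as in this sketch; speak() and listen() are assumed wrappers around the speaker and listener networks and are not part of the released API.

```python
def acc(meaning, predicted):
    return int(meaning == predicted)        # 1 iff the full {action, agent, patient} triplet matches

def acc_self(meaning, agent):
    return acc(meaning, agent.listen(agent.speak(meaning)))

def acc_inter(meaning, speaker_agent, listener_agent):
    # not symmetric: speaker_agent produces, listener_agent interprets
    return acc(meaning, listener_agent.listen(speaker_agent.speak(meaning)))
```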

Production preferences

Besides accuracy, our main goal is to observe how the properties of a given language evolve throughout communication. This is done by recording the proportion of markers and different orders in a set of utterances generated by an agent for a held-out meaning set, after filtering out utterances that are not recognized by the initial grammar. When the focus is on the trade-off, rather than on a specific word order, we measure order entropy. Production preferences can be aggregated over an individual agent, a group, or the entire population.
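A sketch of how these statistics could be aggregated from a set of generated utterances that the initial grammar can still parse; the parsing into order and marking categories is assumed.

```python
import math
from collections import Counter

def production_preferences(parsed):
    """parsed: list of (order, marked) pairs, with order in {'sov', 'osv'} and marked a bool."""
    n = len(parsed)
    marker_prop = sum(marked for _, marked in parsed) / n
    counts = Counter(order for order, _ in parsed)
    order_entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return marker_prop, order_entropy        # entropy 0 = fixed order, 1 = maximally flexible
```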

5 Replicating the Trade-off with Full-fledged Agents

Before moving to interactive communication, we validate the new NeLLCom-X framework through a replication of Lian et al. (2023)’s main findings. The simple speaker-listener communication setup of NeLLCom can be seen as a speaker-internal monitoring mechanism predicting utterance understandability Ferreira (2019). Here, we compare NeLLCom results to those of NeLLCom-X full-fledged agents engaging only in self-play. We use SL to train two sets of agents on the exact same languages as Lian et al. (2023): 100s+67m for fixed order and 50s+67m for flexible order. Then, every agent performs 60 self_turn iterations, causing its production preferences to drift.

After SL, our agents have successfully learnt both languages but no regularization happens, as expected. By contrast, the results of self-play averaged over each 50-agent set indicate that both languages progressively lose markers. Crucially, the fixed-order language does so faster than the flexible one, where markers are often necessary for agent/patient disambiguation. In sum, self-play in NeLLCom-X results in very similar trends as the simple NeLLCom setup, confirming the emergence of a human-like order/marking trade-off Fedzechkina et al. (2017). Detailed replication results are provided in Appendix B. Here, we report communication success during self-play and production preferences at the end of self-play for the flexible language (Figure 2, top row). Self-understanding increases through RL, leading to a much more informative language, while production preferences reveal that this stems from an overall decrease in order entropy, with the marking proportion remaining almost the same on average (solid circle). While some agents approach the optimal points of fixed-order/no-marking (bottom-left corner) or flexible-order/full-marking (top-right), the large variability in production preferences suggests many agents settle on less optimized, redundant languages, as also found by Lian et al. (2023).

Figure 2: Two populations of 50 agents engaging in self-play (no interaction) after having learned two flexible-order, optional-marker languages: one with 67% (50s+67m, top row) and one with 50% marking (50s+50m, bottom row). Left column: average communication success across self-play turns. Right column: production preferences (marker use by order entropy); solid diamonds mark the initial language; each empty circle denotes a full-fledged agent at the end of self-play; solid circles are the average of all agents, with error bars showing standard deviation.

Initial marking proportion

We reconsider here a language design choice of Lian et al. (2023), who in turn inherited it from the human study of Fedzechkina et al. (2017). It was recently found that human learners exposed to a fixed-order language with 75% marking tend to regularize by increasing marker use, even though this makes the language less efficient Tal et al. (2022). Similarly, the dominant proportion (67%) of marked utterances in our initial languages may push the agents to prefer marking even when it is a redundant strategy. Hence, we propose that a more balanced distribution of 50% markers and 50/50% word order may be a better choice to reveal the intrinsic preferences of the learners, if there are any, without biasing them to regularize markers. Results in Figure 2 (bottom row) show that this language has overall lower communicative success, as expected given the higher number of ambiguous sentences. However, success increases substantially during communication, while production preferences reveal a larger variability in solutions, including those with more fixed order and fewer markers. We use this more neutral combination as the default language in all remaining experiments.

6 Interactive Communication Results

This section presents our main results: in Section 6.1 we focus on pairwise interaction and show how NeLLCom-X can be used to simulate communication between speakers of different languages, which was not possible in the original framework; in Section 6.2 we move to group communication and study the effect of group dynamics on communication success and production preferences. Training details for this section are given in Appendix C.

6.1 Speakers of Different Languages

We study a simple setup with two full-fledged agents interacting with each other in both directions, α_base ↔ α_other. The first (α_b, for base) is always trained on the neutral language 50s+50m, while the second (α_o, for other) is trained on one of four languages with different properties. If interaction works, we expect (i) agent pairs to negotiate a mutually understandable language and (ii) α_b’s language to drift in different directions according to its interlocutor. For production preferences, we are interested here in the specific word order of the evolving languages, so we plot the proportion of markers against the proportion of SOV instead of order entropy.

Figure 3: Interactive communication between different-language speakers. The first agent is always trained on 50s+50m (α_b); rows show the pairs α_b ↔ α_b, α_b ↔ 80s+20m, α_b ↔ 20s+20m, α_b ↔ 50s+80m, and α_b ↔ 80s+50m. Left column: communicative success per turn. Right column: marker use by order (SOV/all). Each experiment is repeated with 50 agent pairs.

The communication success plots in Figure 3 (left column) show faster convergence and higher final accuracy when α_o has a stronger order preference. As for production preferences (Figure 3, right column), in the control setting where two neutral agents interact with each other, most agents move towards either side of the plot, representing order regularization. A larger portion of agents regularize towards OSV rather than SOV, which was also observed by Lian et al. (2023) and might be due to OSV being the order in which the disambiguating marker appears earlier. Marking decreases only slightly on average. The next two settings involve initial languages with few markers and different order preferences but equally low order entropy (20s+20m and 80s+20m). As shown by the highly symmetric trends, these pairs strongly converge by regularizing towards the dominant order of α_o and further reducing markers. The fourth setting involves a language where marking is widespread and informative due to high order entropy (50s+80m). Here, α_b shows on average a similar order regularization as in the control setting α_b ↔ α_b, but with a marking increase instead of a decrease. Finally, when involving a dominant-order language with no clear marking preference (80s+50m), agents strongly regularize the dominant order, with a majority of them reducing marker use.

Taken together, these results demonstrate that (i) pairs of different-language agents succeed in negotiating a mutually understandable language in most cases, and (ii) the evolution of an agent’s language strongly depends on whom they interact with, thereby matching the expectations for a realistic simulation of interactive communication.

Impact of self-play during interactions

As explained in Section 3.3, each agent performs a turn of self-play after completing σ = 10 turns of interactive communication, based on preliminary experiments. We compare this to a setup where no self-play is performed during interaction (σ = ∞), in the case where two agents start from a state of poor mutual understanding due to limited marking and strongly diverging order preferences (80s+20m vs. 20s+20m). As shown in Figure 4, disabling self-play leads to extremely low self-understanding even though communication between the two agents is successful. To explain this result, we inspect the production preferences of individual agent pairs and find that many regularize their language in opposite directions (e.g. dominant SOV vs. dominant OSV, both with no markers), indicating a total decoupling of the speaking and listening ability. Thus, we confirm that embedding tying alone does not allow for a realistic interaction simulation, making self-play necessary in our framework.

Figure 4: Impact of self-play during interaction in pairs of agents speaking 80s+20m and 20s+20m respectively, with self-play (σ = 10, left) and without self-play (σ = ∞, right). Each experiment is repeated with 20 agent pairs, and the average communication success per turn is shown.

6.2 Effect of Group Size

Here we move back to a setup where all agents are trained on the same neutral and unstable initial language (50s+50m), but this time they interact in groups of different sizes (2, 4, 8, 20) using the standard self-play frequency (σ = 10). To make results comparable, we ensure the total number of interactive turns per agent is the same (≈200) in all setups, by setting comm_round to 100, 34, 15, and 6 respectively. A total of 200 agents are trained in each group-size setting (100 runs of groups of 2, 50 of groups of 4, 25 of groups of 8, and 10 of groups of 20). See all group-specific training details in Appendix C. In this paper, we only consider fully connected communication graphs and fix the total number of trained agents to enable comparison. We leave an exploration of other group communication factors, such as density and connectivity, to future work.
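As a sanity check of these numbers: in a complete directed graph over n agents, each round contains n(n-1) inter_turns and each agent takes part in 2(n-1) of them, so the number of rounds needed for roughly 200 interactive turns per agent can be computed as follows.

```python
import math

for n in (2, 4, 8, 20):
    turns_per_agent_per_round = 2 * (n - 1)            # speaks to and listens from every other agent once
    rounds = math.ceil(200 / turns_per_agent_per_round)
    print(n, rounds)                                   # 2 -> 100, 4 -> 34, 8 -> 15, 20 -> 6
```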

Figure 5: Interactive communication in groups of same-language speakers (50s+50m), for groups of 2, 4, 8, and 20 agents. Left column: communicative success per turn. Right column: group-level production preferences, i.e. marker use by order entropy (each point is a group), and Spearman’s correlation ρ between marker use and order entropy.

Figure 5 (left column) shows similar learning curves for all group sizes, demonstrating that communication is successful even in larger groups. In all cases, interactive and self-communication test accuracy starts low (25%), but agents collaborate and end up between 60% and 80% success at inter_turn = 100.

For production preferences, we plot the proportion of marking by order entropy, as we are again interested in order flexibility rather than the specific order chosen by the agents (Figure 5, right column). Here, each circle denotes the average production preferences of an entire group, as opposed to those of a single agent. When comparing results across different group sizes, we see that the variability observed in self-playing agents (Section 5), including less optimal and redundant strategies, gets smaller as group size increases. The average entropy in groups of 8 and 20 is also lower than in groups of 4 or 2. In the group setting, an agent’s choice to use a marker does not only depend on its own order entropy but on that of the entire group. As a measure of the order/marking trade-off at the group level, we therefore calculate Spearman’s correlation (ρ) between order entropy and marker use, both computed over all (categorizable) utterances produced by all agents in a group. As shown in Figure 5, ρ steadily increases with group size, from relatively weak (0.32) in pairs to strong (0.73) in groups of 20. This confirms that pairs, like self-playing agents, still often settle on redundant strategies, while larger groups develop more optimized languages in which stronger order consistency at the group level leads to a drop in marker use, confirming the emergence of the trade-off also at the group level. (Even when trained for much longer, the results of pairs remain similar, suggesting they indeed settle on less optimized solutions that are not overcome simply by more interactions, e.g. 200 rounds, ρ = 0.33; see Appendix D.)
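For reference, the group-level correlation could be computed as in this sketch, where each data point is one group's order entropy and marker proportion; the exact aggregation over categorizable utterances is assumed, not reproduced.

```python
from scipy.stats import spearmanr

def group_tradeoff(groups):
    """groups: one (order_entropy, marker_proportion) pair per group of a given size."""
    entropies, marker_props = zip(*groups)
    rho, _p_value = spearmanr(entropies, marker_props)
    return rho
```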

7 Discussion and conclusion

We introduced NeLLCom-X, a framework for simulating neural agent language learning and communication in groups, starting from pre-defined languages. Agents in this framework display the cognitively plausible property of interchangeability Hockett (1960), by which anything they can understand, they can say and vice versa, while also having the ability to align to other individuals. We replicated an earlier finding by Lian et al. (2023) and showed that a word-order/case-marking trade-off still appears with the adjusted full-fledged agent architecture. Subsequently, we simulated interactions between agents trained on different languages. We found that pairs quickly adapt their utterances towards a mutually understandable language and that the neutral language drifts in different directions depending on the preferences of the other agent. Moreover, agents converge on a shared language faster, and reach higher accuracy in cases where one of the two agents has a stronger word order preference. We then assessed the effect of performing self-play during interactive communication and found it necessary to ensure our full-fledged agents continue to understand themselves, while also realistically adapting to other individuals. Lastly, we studied group dynamics and found that NeLLCom-X agents manage to establish a successful communication system even in larger groups (up to size 20). Moreover, we generally see a larger entropy reduction in the languages developed by larger groups as compared to the languages used by pairs of agents. This finding aligns with previous work on group-level emergent communication, where it was shown that groups developed less idiosyncratic languages than pairs Tieleman et al. (2019) as well as with human experiments which demonstrated more systematic languages to emerge in larger groups Raviv et al. (2019). In our simulations, pairs and smaller groups sometimes settle on less optimized and partly still redundant solutions, while large groups end up with more efficient communication systems.

In the future, NeLLCom-X can be used to study the influence of learning and group dynamics on many other language universals. We plan to keep refining the framework to allow studying different connectivities between the agents, multilingual populations and generational transmission of emerged languages to new agents.

Limitations

Although the use of miniature artificial languages in our work allows for easily interpretable results due to abstractions and simplifications that are hard to achieve with natural human languages, the languages used currently are very small. This may limit the possibility of drawing conclusions beyond proof-of-concept demonstrations. Future work should increase the size and complexity of the languages to see if results hold on a larger scale and compare to patterns found in real human languages, such as those reported by Levshina et al. (2023).

The meanings in our simulations are also strongly abstracted away from reality. While our design is well suited for an investigation of the word-order/case-marking trade-off, future simulations may need a less constrained meaning space, possibly using images to represent meanings.

All experiments conducted so far with NeLLCom-X use the same neural agent architecture (GRU), but we know that different architectures exhibit different inductive biases Kuribayashi et al. (2024) or memory constraints and these factors may influence the findings. Different types of neural learners, however, can be easily plugged into NeLLCom-X.

Interaction between individuals in groups is not the only population factor that shapes language, but linguistic structure is shaped by both interaction and learning Kirby et al. (2015). Especially when languages are learned and transmitted to subsequent generations repeatedly, even small inductive biases may have a large effect on emerging properties Thompson et al. (2016). We therefore plan to augment NeLLCom-X with iterated learning so that new agents learn from the utterances of others and become teachers to agents in the next generation.

Finally, our agents are interacting in groups with multiple individuals, but they currently do not have any awareness of agent identities. A more realistic simulation should take into account that individuals know who they are interacting with, which becomes even more important when different network structures and connectivities will be explored.

Acknowledgements

Arianna Bisazza acknowledges the support of the Dutch Research Council (NWO) within the InDeep project (NWA.1292.19.399) and the Talent Programme (VI.Vidi.221C.009).

References

  • Beckner et al. (2009) Clay Beckner, Nick C Ellis, Richard Blythe, John Holland, Joan Bybee, Jinyun Ke, Morten H Christiansen, Diane Larsen-Freeman, William Croft, and Tom Schoenemann. 2009. Language is a complex adaptive system: Position paper. Language Learning, 59:1–26.
  • Bisazza et al. (2021) Arianna Bisazza, Ahmet Üstün, and Stephan Sportel. 2021. On the difficulty of translating free-order case-marking languages. Transactions of the Association for Computational Linguistics, 9:1233–1248.
  • Chaabouni et al. (2020) Rahma Chaabouni, Eugene Kharitonov, Diane Bouchacourt, Emmanuel Dupoux, and Marco Baroni. 2020. Compositionality and generalization in emergent languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4427–4442.
  • Chaabouni et al. (2019) Rahma Chaabouni, Eugene Kharitonov, Alessandro Lazaric, Emmanuel Dupoux, and Marco Baroni. 2019. Word-order biases in deep-agent emergent communication. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5166–5175, Florence, Italy. Association for Computational Linguistics.
  • Chaabouni et al. (2022) Rahma Chaabouni, Florian Strub, Florent Altché, Eugene Tarassov, Corentin Tallec, Elnaz Davoodi, Kory Wallace Mathewson, Olivier Tieleman, Angeliki Lazaridou, and Bilal Piot. 2022. Emergent communication at scale. In International Conference on Learning Representations.
  • Christensen et al. (2016) Peer Christensen, Riccardo Fusaroli, and Kristian Tylén. 2016. Environmental constraints shaping constituent order in emerging communication systems: Structural iconicity, interactive alignment and conventionalization. Cognition, 146:67–80.
  • Christiansen and Chater (2008) Morten H Christiansen and Nick Chater. 2008. Language as shaped by the brain. Behavioral and brain sciences, 31(5):489–509.
  • Chung et al. (2014) Junyoung Chung, Çaglar Gülçehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR, abs/1412.3555.
  • Conklin and Smith (2022) Henry Conklin and Kenny Smith. 2022. Compositionality with variation reliably emerges in neural networks. In The Eleventh International Conference on Learning Representations.
  • Culbertson et al. (2012) Jennifer Culbertson, Paul Smolensky, and Géraldine Legendre. 2012. Learning biases predict a word order universal. Cognition, 122(3):306–329.
  • Fedzechkina et al. (2018) Maryia Fedzechkina, Becky Chu, and T Florian Jaeger. 2018. Human information processing shapes language change. Psychological science, 29(1):72–82.
  • Fedzechkina et al. (2017) Maryia Fedzechkina, Elissa L. Newport, and T. Florian Jaeger. 2017. Balancing effort and information transmission during language acquisition: Evidence from word order and case marking. Cognitive Science, 41(2):416–446.
  • Ferreira (2019) Victor S. Ferreira. 2019. A mechanistic framework for explaining audience design in language production. Annual Review of Psychology, 70(1):29–51. PMID: 30231000.
  • Fitch (2007) W Tecumseh Fitch. 2007. An invisible hand. Nature, 449(7163):665–667.
  • Fusaroli and Tylén (2012) Riccardo Fusaroli and Kristian Tylén. 2012. Carving language for social coordination: A dynamical approach. Interaction studies, 13(1):103–124.
  • Galke et al. (2022) Lukas Galke, Yoav Ram, and Limor Raviv. 2022. Emergent communication for understanding human language evolution: What’s missing? In Emergent Communication Workshop at ICLR 2022.
  • Galke and Raviv (2024) Lukas Galke and Limor Raviv. 2024. Emergent communication and learning pressures in language models: a language evolution perspective. arXiv preprint arXiv:2403.14427.
  • Garrod et al. (2007) Simon Garrod, Nicolas Fay, John Lee, Jon Oberlander, and Tracy MacLeod. 2007. Foundations of representation: where might graphical symbol systems come from? Cognitive science, 31(6):961–987.
  • Harding Graesser et al. (2019) Laura Harding Graesser, Kyunghyun Cho, and Douwe Kiela. 2019. Emergent linguistic phenomena in multi-agent communication games. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3700–3710, Hong Kong, China. Association for Computational Linguistics.
  • Havrylov and Titov (2017) Serhii Havrylov and Ivan Titov. 2017. Emergence of language with multi-agent games: Learning to communicate with sequences of symbols. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, page 2146–2156. Curran Associates Inc.
  • Hockett (1960) Charles F Hockett. 1960. The origin of speech. Scientific American, 203(3):88–97.
  • Hopkins (2022) Mark Hopkins. 2022. Towards more natural artificial languages. In Proceedings of the 26th Conference on Computational Natural Language Learning (CoNLL), pages 85–94, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.
  • Kallini et al. (2024) Julie Kallini, Isabel Papadimitriou, Richard Futrell, Kyle Mahowald, and Christopher Potts. 2024. Mission: Impossible language models. arXiv preprint arXiv:2401.06416.
  • Karjus et al. (2021) Andres Karjus, Richard A Blythe, Simon Kirby, Tianyu Wang, and Kenny Smith. 2021. Conceptual similarity and communicative need shape colexification: An experimental study. Cognitive Science, 45(9):e13035.
  • Kharitonov et al. (2019) Eugene Kharitonov, Rahma Chaabouni, Diane Bouchacourt, and Marco Baroni. 2019. EGG: a toolkit for research on emergence of lanGuage in games. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations, pages 55–60, Hong Kong, China. Association for Computational Linguistics.
  • Kim and Oh (2021) Jooyeon Kim and Alice Oh. 2021. Emergent communication under varying sizes and connectivities. Advances in Neural Information Processing Systems, 34:17579–17591.
  • Kirby et al. (2014) Simon Kirby, Tom Griffiths, and Kenny Smith. 2014. Iterated learning and the evolution of language. Current opinion in neurobiology, 28:108–114.
  • Kirby et al. (2015) Simon Kirby, Monica Tamariz, Hannah Cornish, and Kenny Smith. 2015. Compression and communication in the cultural evolution of linguistic structure. Cognition, 141:87–102.
  • Kottur et al. (2017) Satwik Kottur, José Moura, Stefan Lee, and Dhruv Batra. 2017. Natural language does not emerge ‘naturally’ in multi-agent dialog. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2962–2967.
  • Kuribayashi et al. (2024) Tatsuki Kuribayashi, Ryo Ueda, Ryo Yoshida, Yohei Oseki, Ted Briscoe, and Timothy Baldwin. 2024. Emergent word order universals from cognitively-motivated language models. arXiv preprint arXiv:2402.12363.
  • Lazaridou and Baroni (2020) Angeliki Lazaridou and Marco Baroni. 2020. Emergent multi-agent communication in the deep learning era. arXiv preprint arXiv:2006.02419v2.
  • Lazaridou et al. (2018) Angeliki Lazaridou, Karl Moritz Hermann, Karl Tuyls, and Stephen Clark. 2018. Emergence of linguistic communication from referential games with symbolic and pixel input. In International Conference on Learning Representations.
  • Lazaridou et al. (2017) Angeliki Lazaridou, Alexander Peysakhovich, and Marco Baroni. 2017. Multi-agent cooperation and the emergence of (natural) language. In International Conference on Learning Representations.
  • Lazaridou et al. (2020) Angeliki Lazaridou, Anna Potapenko, and Olivier Tieleman. 2020. Multi-agent communication meets natural language: Synergies between functional and structural language learning. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7663–7674, Online. Association for Computational Linguistics.
  • Levshina et al. (2023) Natalia Levshina, Savithry Namboodiripad, Marc Allassonnière-Tang, Mathew Kramer, Luigi Talamo, Annemarie Verkerk, Sasha Wilmoth, Gabriela Garrido Rodriguez, Timothy Michael Gupton, Evan Kidd, et al. 2023. Why we need a gradient approach to word order. Linguistics, 61(4):825–883.
  • Li and Bowling (2019) Fushan Li and Michael Bowling. 2019. Ease-of-teaching and language structure from emergent communication. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.
  • Lian et al. (2021) Yuchen Lian, Arianna Bisazza, and Tessa Verhoef. 2021. The effect of efficient messaging and input variability on neural-agent iterated language learning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 10121–10129.
  • Lian et al. (2023) Yuchen Lian, Arianna Bisazza, and Tessa Verhoef. 2023. Communication drives the emergence of language universals in neural agents: Evidence from the word-order/case-marking trade-off. Transactions of the Association for Computational Linguistics, 11:1033–1047.
  • Lowe et al. (2020) Ryan Lowe, Abhinav Gupta, Jakob Foerster, Douwe Kiela, and Joelle Pineau. 2020. On the interaction between supervision and self-play in emergent communication. In International Conference on Learning Representations.
  • Lupyan and Christiansen (2002) Gary Lupyan and Morten H Christiansen. 2002. Case, word order, and language learnability: Insights from connectionist modeling. In Proceedings of the Twenty-Fourth Annual Conference of the Cognitive Science Society, pages 596–601. Routledge.
  • Lupyan and Dale (2010) Gary Lupyan and Rick Dale. 2010. Language structure is partly determined by social structure. PloS one, 5(1):e8559.
  • Michel et al. (2023) Paul Michel, Mathieu Rita, Kory Wallace Mathewson, Olivier Tieleman, and Angeliki Lazaridou. 2023. Revisiting populations in multi-agent communication. In The Eleventh International Conference on Learning Representations.
  • Mikolov et al. (2018) Tomas Mikolov, Armand Joulin, and Marco Baroni. 2018. A roadmap towards machine intelligence. In Computational Linguistics and Intelligent Text Processing: 17th International Conference, pages 29–61. Springer.
  • Motamedi et al. (2022) Yasamin Motamedi, Lucie Wolters, Danielle Naegeli, Simon Kirby, and Marieke Schouwstra. 2022. From improvisation to learning: How naturalness and systematicity shape language evolution. Cognition, 228:105206.
  • Namboodiripad et al. (2016) Savithry Namboodiripad, Daniel Lenzen, Ryan Lepic, and Tessa Verhoef. 2016. Measuring conventionalization in the manual modality. Journal of Language Evolution, 1(2):109–118.
  • Press and Wolf (2017) Ofir Press and Lior Wolf. 2017. Using the output embedding to improve language models. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 157–163, Valencia, Spain. Association for Computational Linguistics.
  • Raviv et al. (2019) Limor Raviv, Antje Meyer, and Shiri Lev-Ari. 2019. Larger communities create more systematic languages. Proceedings of the Royal Society B, 286(1907):20191262.
  • Rita et al. (2020) Mathieu Rita, Rahma Chaabouni, and Emmanuel Dupoux. 2020. “lazimpa”: Lazy and impatient neural agents learn to communicate efficiently. In Proceedings of the 24th Conference on Computational Natural Language Learning, pages 335–343.
  • Rita et al. (2022) Mathieu Rita, Corentin Tallec, Paul Michel, Jean-Bastien Grill, Olivier Pietquin, Emmanuel Dupoux, and Florian Strub. 2022. Emergent communication: Generalization and overfitting in lewis games. In Advances in Neural Information Processing Systems.
  • Roberts and Galantucci (2012) Gareth Roberts and Bruno Galantucci. 2012. The emergence of duality of patterning: Insights from the laboratory. Language and cognition, 4(4):297–318.
  • Saldana et al. (2021a) Carmen Saldana, Yohei Oseki, and Jennifer Culbertson. 2021a. Cross-linguistic patterns of morpheme order reflect cognitive biases: An experimental study of case and number morphology. Journal of Memory and Language, 118:104204.
  • Saldana et al. (2021b) Carmen Saldana, Kenny Smith, Simon Kirby, and Jennifer Culbertson. 2021b. Is regularization uniform across linguistic levels? comparing learning and production of unconditioned probabilistic variation in morphology and word order. Language Learning and Development, 17(2):158–188.
  • Steels (1997) Luc Steels. 1997. The synthetic modeling of language origins. Evolution of communication, 1(1):1–34.
  • Steels (2000) Luc Steels. 2000. Language as a complex adaptive system. In International Conference on Parallel Problem Solving from Nature, pages 17–26. Springer.
  • Taillandier et al. (2023) Valentin Taillandier, Dieuwke Hupkes, Benoît Sagot, Emmanuel Dupoux, and Paul Michel. 2023. Neural agents struggle to take turns in bidirectional emergent communication. In The Eleventh International Conference on Learning Representations.
  • Tal et al. (2022) Shira Tal, Kenny Smith, Jennifer Culbertson, Eitan Grossman, and Inbal Arnon. 2022. The impact of information structure on the emergence of differential object marking: an experimental study. Cognitive Science, 46(3):e13119.
  • Tamariz et al. (2018) Mónica Tamariz, Seán G Roberts, J Isidro Martínez, and Julio Santiago. 2018. The interactive origin of iconicity. Cognitive Science, 42(1):334–349.
  • Thompson et al. (2016) Bill Thompson, Simon Kirby, and Kenny Smith. 2016. Culture shapes the evolution of cognition. Proceedings of the National Academy of Sciences, 113(16):4530–4535.
  • Tieleman et al. (2019) Olivier Tieleman, Angeliki Lazaridou, Shibl Mourad, Charles Blundell, and Doina Precup. 2019. Shaping representations through communication: community size effect in artificial learning systems. arXiv preprint arXiv:1912.06208.
  • Verhoef (2012) Tessa Verhoef. 2012. The origins of duality of patterning in artificial whistled languages. Language and cognition, 4(4):357–380.
  • Verhoef et al. (2014) Tessa Verhoef, Simon Kirby, and Bart De Boer. 2014. Emergence of combinatorial structure and economy through iterated learning with continuous acoustic signals. Journal of Phonetics, 43:57–68.
  • Verhoef et al. (2016) Tessa Verhoef, Simon Kirby, and Bart De Boer. 2016. Iconicity and the emergence of combinatorial structure in language. Cognitive science, 40(8):1969–1994.
  • Verhoef et al. (2015) Tessa Verhoef, Seán G Roberts, and Mark Dingemanse. 2015. Emergence of systematic iconicity: Transmission, interaction and analogy. In Proceedings of the 37th Annual Meeting of the Cognitive Science Society (CogSci 2015), pages 2481–2486. Cognitive Science Society.
  • Verhoef et al. (2022) Tessa Verhoef, Esther Walker, and Tyler Marghetis. 2022. Interaction dynamics affect the emergence of compositional structure in cultural transmission of space-time mappings. In Proceedings of the 44th Annual Meeting of the Cognitive Science Society, pages 2133–2139. Cognitive Science Society.
  • Wang and Eisner (2016) Dingquan Wang and Jason Eisner. 2016. The galactic dependencies treebanks: Getting more data by synthesizing new languages. Transactions of the Association for Computational Linguistics, 4:491–505.
  • White and Cotterell (2021) Jennifer C. White and Ryan Cotterell. 2021. Examining the inductive bias of neural language models with artificial languages. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 454–463, Online. Association for Computational Linguistics.
  • Williams (1992) Ronald J Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3):229–256.
  • Wray and Grace (2007) Alison Wray and George W Grace. 2007. The consequences of talking to strangers: Evolutionary corollaries of socio-cultural influences on linguistic form. Lingua, 117(3):543–578.

Appendix A More Details about NeLLCom

We list here additional details on the original NeLLCom framework Lian et al. (2023) that also apply to our extended NeLLCom-X framework.

Speaker and Listener Architectures

Both speaking and listening networks have a single 16-dimensional GRU layer. The shared meaning embeddings are 8-dimensional and the shared word embeddings are 16-dimensional. The maximum utterance length for the speaking decoder is set to 10 words.

Supervised Language Learning

During supervised learning, the speaker learns the mapping from meaning inputs to utterances, and vice versa for the listener. Dataset D is composed of meaning-utterance pairs (m, u), where u is the gold standard generated for m by a predefined grammar. Given a training sample (m, u), the speaker’s parameters θ_S and the listener’s parameters θ_L are optimized by minimizing the cross-entropy loss of the predicted words and the predicted meaning tuples, respectively:

$Loss^{sup}_{(\mathcal{S})} = -\sum_{i=1}^{I}\log p_{\theta_{\mathcal{S}}}(w^{i}\,|\,w^{<i},m)$   (5)
$Loss^{sup}_{(\mathcal{L})} = -\big(\log p_{\theta_{\mathcal{L}}}(A\,|\,u) + \log p_{\theta_{\mathcal{L}}}(a\,|\,u) + \log p_{\theta_{\mathcal{L}}}(p\,|\,u)\big)$   (6)

where $w^{1}\dots w^{I}$ are the words composing utterance $u$, whereas $A$, $a$, $p$ are respectively the action, agent and patient of meaning $m$.
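As an illustration of Equations 5 and 6, the two supervised losses could be computed as below (a sketch with hypothetical function names and tensor shapes; the actual implementation may organize batching differently):

```python
import torch.nn.functional as F

def speaker_sup_loss(word_logits, gold_words):
    """Eq. 5: summed negative log-likelihood of the gold words w^1 ... w^I.
    word_logits: (batch, I, vocab) decoder scores; gold_words: (batch, I) word ids."""
    return F.cross_entropy(word_logits.flatten(0, 1), gold_words.flatten(), reduction="sum")

def listener_sup_loss(action_logits, agent_logits, patient_logits, A, a, p):
    """Eq. 6: one cross-entropy term per element of the meaning tuple m = (A, a, p).
    Each *_logits tensor is (batch, n_classes); A, a, p are (batch,) class ids."""
    return (F.cross_entropy(action_logits, A, reduction="sum")
            + F.cross_entropy(agent_logits, a, reduction="sum")
            + F.cross_entropy(patient_logits, p, reduction="sum"))
```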

Communicative Reward Optimization

Communication is implemented by a meaning reconstruction game following common practice in the artificial agent communication literature (e.g. Steels, 1997; Lazaridou et al., 2018). The speaker generates an utterance $\hat{u}$ given a meaning $m$, and the listener needs to reconstruct meaning $m$ given $\hat{u}$. The policy-based algorithm REINFORCE Williams (1992) is used to maximize a shared reward $r^{\mathcal{L}}(m,\hat{u})$, defined as the log-likelihood of $m$ given $\hat{u}$ according to the listener's model:

$r^{\mathcal{L}}(m,\hat{u}) = \sum_{e\,\in\,m=\{A,a,p\}} \log p_{\theta_{\mathcal{L}}}(e\,|\,\hat{u})$   (7)

Thus, the communication loss becomes:

$Loss^{comm}_{(\mathcal{S},\mathcal{L})} = -\,r^{\mathcal{L}}(m,\hat{u}) \cdot \sum_{i=1}^{I}\log p_{\theta_{\mathcal{S}}}(w^{i}\,|\,w^{<i},m)$   (8)
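A sketch of how this reward and loss could be computed is shown below (speaker side only; how the listener is updated from the same reward is an implementation choice we leave out here). Function and variable names are ours, not those of the released code.

```python
def reinforce_comm_loss(speaker_logprobs, listener_logprobs_of_m):
    """Eqs. 7-8 (sketch).
    speaker_logprobs:       (I,) log p(w^i | w^<i, m) of the words sampled to form u_hat
    listener_logprobs_of_m: (3,) log p(e | u_hat) for each element e in m = {A, a, p}"""
    reward = listener_logprobs_of_m.sum()              # Eq. 7: r^L(m, u_hat)
    # REINFORCE treats the reward as a constant with respect to the speaker's parameters
    return -reward.detach() * speaker_logprobs.sum()   # Eq. 8
```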
Figure 6: Replicating the results from Lian et al. (2023) with NeLLCom-X full-fledged self-communicating agents on the fixed-order (a: 100s+67m) and flexible-order (c: 50s+67m) languages, compared with the new, more neutral initial languages containing 50% markers (b: 100s+50m; d: 50s+50m). Columns: (1) communicative success, (2) order use, (3) conditional marker use, (4) marker use.
Figure 7: Replicating the results of Lian et al. (2023): supervised learning followed by self-communication with NeLLCom-X full-fledged agents. Panels: (a) NeLLCom-X self-communication, (b) Lian et al. (2023) communication, (c) humans Fedzechkina et al. (2017), (d) flex-mk67: individual production patterns during self-communication. All results are averaged over 50 random seeds.

Appendix B Replicating NeLLCom Results with NeLLCom-X Full-fledged Agents

B.1 Training details for the replication

For this replication (discussed in Section 5), we keep the training configuration as consistent as possible with Lian et al. (2023). Specifically, we split the data into 66.7% training and 20% testing. The testing proportion differs from the 33.3% used in NeLLCom because we want to match the test set size used for interactive communication in this work. All entities and actions are required to appear at least once in the training set. The default Adam optimizer is applied with a learning rate of 0.01. Both SL and $\mathsf{self\_turn}$ iterate 60 times (the 66.7% train set results in 480 samples, i.e. 15 batches of 32 samples per turn, slightly different from the 10 batches per turn used during interactive communication). Each replication setup is repeated with 50 random seeds.
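For reference, the hyper-parameters above can be summarized in a single configuration object (a purely illustrative summary; the names are ours and do not correspond to the released code):

```python
# Hypothetical summary of the replication setup described above.
REPLICATION_CONFIG = dict(
    train_fraction=0.667,   # 66.7% training split (480 meaning-utterance pairs)
    test_fraction=0.20,     # 20% test split, matching the interactive-communication test size
    optimizer="Adam",
    learning_rate=0.01,
    batch_size=32,
    batches_per_turn=15,    # 480 / 32 = 15 batches per turn
    sl_turns=60,            # supervised learning turns
    self_turns=60,          # self-communication turns
    n_seeds=50,             # each setup repeated with 50 random seeds
)
```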

B.2 Results

Fixed-order self-communication

Starting from the initial marker proportion (66.7%), fixed-order language learners start to drop the marker (50% at round 60) during self-communication, while maintaining high understandability (95%) (Figure 6 (a1) and (a4)). This aligns with the results of Lian et al. (2023).

Flexible-order self-communication

The self-communication accuracy in the flexible-order language (Figure 6 (c1)) starts from a relatively low success rate, as expected, but increases with more communication rounds. In particular, agents exceed the communication success they had achieved on new meanings at the end of SL, and finally reach a much higher accuracy (around 75%) by the end of self-communication.

The average ordering and marking proportions also show that flexible-order self-communication results in a pattern very similar to the one found by Lian et al. (2023): (i) the average word order production (Figure 6 (c2)) shows a strong preference for OSV; (ii) although the overall marking system ends with a marker proportion similar to the initial condition (Figure 6 (c4)), i.e., the proportion of with-marker utterances is twice that of no-marker utterances, we see a clear shift to conditional marking (Figure 6 (c3)) with an asymmetric use of markers: at round 60, the marker proportion on utterances with OSV order (70%) remains close to the initial proportion (66.7%), while the proportion of marker use with SOV drops to 35%. This order preference and asymmetric marking system align with the flexible-order language results of Lian et al. (2023).

Figure 7(d) shows the production preferences of individual agents, whose distributions of utterance-type usage diverge over time, similar to the independent speaker and listener communication results in Lian et al. (2023).

Uncertainty vs. Effort

Lian et al. (2023) found that agents balanced uncertainty and effort in a similar way to human participants in an artificial language learning task Fedzechkina et al. (2017). To evaluate whether a similar uncertainty-effort trade-off emerges with our full-fledged agents, we apply the same measures to both the fixed and flexible languages in Figure 7(a). Besides the results from our new framework, we also reproduce the independent listener-speaker communication results from Lian et al. (2023) (Figure 7(b)) and the human results from Fedzechkina et al. (2017) (Figure 7(c)) for comparison.

For the fixed-order language, the clear drop in average effort matches both Lian et al. (2023) and Fedzechkina et al. (2017). Among the 50 agents, only one significantly increases its use of markers, ending at around 3.8 words per utterance. The others reduce marker use, and two agents even end at 3.0 and 3.05 words per utterance, which means almost no markers are produced. For the flexible-order language, uncertainty is reduced slightly less strongly than in the human results, which was also the case in Lian et al. (2023).

group size | # comm_edges | # comm_rounds | # repeated groups
2 | $2 = 2\times(2-1)$ | $100 = \lceil 100/(2-1)\rceil$ | $100 = 200/2$
4 | $12 = 4\times(4-1)$ | $34 = \lceil 100/(4-1)\rceil$ | $50 = 200/4$
8 | $56 = 8\times(8-1)$ | $15 = \lceil 100/(8-1)\rceil$ | $25 = 200/8$
20 | $380 = 20\times(20-1)$ | $6 = \lceil 100/(20-1)\rceil$ | $10 = 200/20$
Table 2: Number of communication edges, number of rounds, and number of repeated groups for each group-size setting. These settings were selected to ensure a fair comparison (i.e. a similar amount of computation) across different group sizes.

50% marking in initial language

As described in Section 5, the initial proportion of marker use of 67%, which was used in Lian et al. (2023) and inherited from Fedzechkina et al. (2017), may bias the agents to regularize towards more marker use, settling on more redundant languages. We therefore switched to the more neutral value of 50% markers in the initial language. In Figure 6, the self-communication results of this new setting can be directly compared to the original setup. As expected, markers are dropped more rapidly in the fixed-order 50% marker language than in the 67% marker language (Figure 6 (a3) versus (b3)). In the flexible-order languages, agents trained on the 67% marker language mostly kept using the marker, even though they also developed a clear preference for one word order, resulting in redundant strategies. With 50% markers in the initial language, however, agents drop the marker when they develop a word order preference, despite being trained on a flexible word order language (Figure 6 (c3) versus (d3)).

Appendix C Training Details for Interactive Communication Experiments

We explain here the detailed setup for the main experiments discussed in Section 6.1 and Section 6.2. This setup was determined based on preliminary experiments to yield optimal results in terms of learning accuracy (during SL) and communication success (during RL).

Data splits

We first split the data into 80/20% training/test. The test split is used throughout the whole training. We resample 66.7% of the meanings from the first train set (resulting in 480 meaning-utterance pairs) for the SL training phase. All entities and actions are required to appear at least once in the training set.

Then, for each communication turn, 50% of the meanings are sampled from the first train set (resulting in 320 meanings) and used as the training samples for that RL turn. Because interactive communication is always preceded by SL, agents have already learnt the mapping between words and the entities and actions in the meaning space. Thus, we do not enforce the all-seen-entities/actions rule in RL sampling.
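One simple way to implement these two sampling steps is sketched below, assuming meanings are objects with `action`, `agent` and `patient` attributes; the rejection loop for the all-seen rule is our assumption, not necessarily how NeLLCom-X enforces it.

```python
import random

def sl_sample(pairs, entities, actions, frac=0.667, rng=random):
    """Resample the SL training set, re-drawing until every entity and action is seen at least once."""
    while True:
        sample = rng.sample(pairs, int(len(pairs) * frac))
        seen_entities = {x for m, _ in sample for x in (m.agent, m.patient)}
        seen_actions = {m.action for m, _ in sample}
        if seen_entities >= set(entities) and seen_actions >= set(actions):
            return sample

def rl_turn_sample(train_meanings, frac=0.5, rng=random):
    """Per-turn RL sample; the all-seen rule is not enforced here (agents already know the lexicon)."""
    return rng.sample(train_meanings, int(len(train_meanings) * frac))
```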

Communication turns and rounds

During interactive communication, the RL learning rate is set to 0.005. For each communication turn, 1 epoch is applied, corresponding to 10 batches of 32 meanings. We fix the total number of $\mathsf{inter\_turn}$s per agent to (approximately) 200, counting both speaking and listening. The total number of rounds is then computed as:

$comm\_rounds = \left\lceil\dfrac{200 \cdot group\_size}{2 \cdot |comm\_edges|}\right\rceil,$

or, simplified for fully connected communication graphs:

$comm\_rounds = \left\lceil\dfrac{100}{group\_size-1}\right\rceil.$

For a group of 2, a communication round includes 2 communication edges to be sampled: $\mathcal{G}_{g2}=\{\alpha_{0}\to\alpha_{1},\alpha_{1}\to\alpha_{0}\}$. For a group of 4, a communication round includes $12=4\times(4-1)$ communication edges: $\mathcal{G}_{g4}=\{\alpha_{0}\to\alpha_{1},\alpha_{0}\to\alpha_{2},\alpha_{0}\to\alpha_{3},\alpha_{1}\to\alpha_{0},\alpha_{1}\to\alpha_{2},\alpha_{1}\to\alpha_{3},\alpha_{2}\to\alpha_{0},\alpha_{2}\to\alpha_{1},\alpha_{2}\to\alpha_{3},\alpha_{3}\to\alpha_{0},\alpha_{3}\to\alpha_{1},\alpha_{3}\to\alpha_{2}\}$. Similarly, $|\mathcal{G}_{g8}|=8\times(8-1)=56$ and $|\mathcal{G}_{g20}|=20\times(20-1)=380$. As for self-play, each agent performs $200/\sigma$ self-play turns in total during interaction, that is $200/10=20$ in the standard case where $\sigma=10$.
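This bookkeeping can be reproduced with a few lines of code (a sketch with illustrative names; cf. Table 2 for the resulting values):

```python
import math
from itertools import permutations

def communication_schedule(group_size, turns_per_agent=200, sigma=10):
    """Fully connected directed communication graph plus the round and self-play counts described above."""
    edges = list(permutations(range(group_size), 2))  # group_size * (group_size - 1) speaker->listener pairs
    comm_rounds = math.ceil(turns_per_agent * group_size / (2 * len(edges)))
    # equivalently, for fully connected graphs: ceil(100 / (group_size - 1))
    self_play_turns = turns_per_agent // sigma        # 200 / sigma = 20 in the standard case
    repeated_groups = 200 // group_size               # 200 trained agents per setup (cf. Table 2)
    return edges, comm_rounds, self_play_turns, repeated_groups

# e.g. group_size=4 gives 12 edges, 34 rounds, 20 self-play turns and 50 groups
```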

Number of random seeds

In Section 6.1, we repeat each language combination experiment with 50 pairs of agents (i.e. 100 random seeds). In Section 6.2, we set the total number of trained agents to 200 in each setup (i.e. number of groups $= 200/group\_size$). The details of rounds and repeated groups are listed in Table 2.

Figure 8: Interactive communication in pairs of same-language speakers (50s+50m): communicative success per turn (left column) and marker use by order entropy (right column), after 100 rounds (top row) and 200 rounds (bottom row). Production preferences (right column) do not change much when training for 200 rounds instead of 100.

Appendix D Additional Group Experiments

Figure 8 shows the effect of longer training on the production preferences of pairs of same-language speakers (50s+50m). Production preferences (right column) do not change much after 100 additional rounds (bottom row), and the correlation $\rho$ increases only marginally from 0.32 to 0.33.