
Proc. of the 24th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2025), May 19–23, 2025, Detroit, Michigan, USA. A. El Fallah Seghrouchni, Y. Vorobeychik, S. Das, A. Nowe (eds.). © 2025.

Affiliations: Polytechnique Montreal, Montreal, Canada; University of Konstanz, Konstanz, Germany.

Bridging Swarm Intelligence and Reinforcement Learning

Karthik Soma, Yann Bouteiller, Heiko Hamann, and Giovanni Beltrame
Abstract.

Swarm intelligence (SI) explores how large groups of simple individuals (e.g., insects, fish, birds) collaborate to produce complex behaviors, exemplifying that the whole is greater than the sum of its parts. A fundamental task in SI is Collective Decision-Making (CDM), where a group selects the best option among several alternatives, such as choosing an optimal foraging site. In this work, we demonstrate a theoretical and empirical equivalence between CDM and single-agent reinforcement learning (RL) in multi-armed bandit problems, utilizing concepts from opinion dynamics, evolutionary game theory, and RL. This equivalence bridges the gap between SI and RL and leads us to introduce a novel abstract RL update rule called Maynard-Cross Learning. Additionally, it provides a new population-based perspective on common RL practices like learning rate adjustment and batching. Our findings enable cross-disciplinary fertilization between RL and SI, allowing techniques from one field to enhance the understanding and methodologies of the other.

Key words and phrases:
Swarm Intelligence, Reinforcement Learning, Evolutionary Game Theory, Opinion Dynamics

1. Introduction

Swarm Intelligence (SI) takes inspiration from how a collective of natural entities, following simple, local, and decentralized rules, can produce emergent and complex behaviors Bonabeau et al. (1999). Researchers have extracted core principles such as coordination, cooperation, and local communication from these natural systems and applied them to artificial systems (e.g., swarm robotics Dorigo et al. (2021); Hamann (2018) and optimization algorithms Dorigo et al. (2006)).

In this paper, we focus on the specific SI problem of Collective Decision-Making (CDM). In CDM, individuals work together to reach an agreement on the best option from a set of alternatives, a problem commonly called the best-of-$n$ decision problem. To solve this problem, researchers have turned to opinion dynamics Xia et al. (2011), a field that studies how opinions spread in a population. In particular, in the voter rule Castellano et al. (2009); Redner (2019), an individual copies the opinion of a randomly chosen neighbor. Similarly, researchers have taken inspiration from the house-hunting behavior of honey bees to create the weighted voter rule Valentini et al. (2014); Reina et al. (2024). In this rule, after scouting one of $n$ potential nesting areas, bees come back to perform a “dance” Visscher and Camazine (1999) that describes the coordinates of the option that they have explored. According to the weighted voter model, this dance is performed at a frequency that is proportional to the estimated quality of the explored area. Other bees go scout the area corresponding to the first dance they witness, and this process repeats until the entire colony converges to the same option.

Next, we turn toward reinforcement learning (RL), where an agent learns to solve a task by interacting with the environment to maximize a reward signal Sutton and Barto (2018) (to avoid confusion, we use “agent” in the context of RL and “individual” in the context of a population, wherever possible). RL has been successfully applied to solve complex problems in various fields such as robotics Kaufmann et al. (2023), nuclear fusion Seo et al. (2024), and games Mnih et al. (2013). In this paper, we are specifically interested in multi-armed bandits Sutton and Barto (2018), in which a single agent makes choices among different options (or “arms”) to maximize its reward. Among the many learning algorithms designed to solve this task (Upper Confidence Bound (UCB) Auer et al. (2002), $\epsilon$-greedy Sutton and Barto (2018), Gradient Bandit Williams (1992), ...), we consider the Cross Learning update rule Cross (1973), which is closely related to the Gradient Bandit algorithm.

Although SI and RL are seemingly disjoint, we show that these fields can in fact be bridged via the Replicator Dynamic (RD) Sandholm et al. (2008), a famous equation used in Evolutionary Game Theory (EGT) to model the outcome of evolutionary processes through the idea of survival of the fittest. In the rest of the paper, we demonstrate a mathematical equivalence between different concepts from SI and RL:

  • We first show that a large population whose members follow the voter rule can be seen as a single abstract RL agent following the Cross Learning update rule.

  • Next, via a similar argument, we show that the weighted voter rule yields another abstract RL update rule that we coin Maynard-Cross Learning.

  • We validate these equivalences with RL and population experiments and offer a new perspective on two common practices in RL: learning rate adjustment and batching.

2. Preliminaries

2.1. Multi-armed bandits and Cross Learning

Multi-armed bandits are one of the simplest types of environments encountered in the RL literature. They consist of a discrete set of available actions, called “arms”, amongst which an agent has to find the most rewarding. In an $n$-armed bandit, pulling arm $a\in\{1,\dots,n\}$ returns a real-valued reward $r_a\in[0,1]$ sampled from a hidden distribution $r(a)$. The objective for an RL agent playing a multi-armed bandit is to learn a policy, denoted by the probability vector $\pi=(\pi_1,\dots,\pi_n)$, that maximizes the rewards obtained upon pulling the arms. Different exploration strategies exist to find such policies, one of them being Cross Learning Cross (1973):
Cross Learning (CL). Let $k$ be an action and $r_k$ a corresponding reward sample ($r_k\sim r(k)$). CL updates the policy $\pi$ as:

\[\forall a,\quad \pi_a \leftarrow \pi_a + r_k\begin{cases}1-\pi_a & \text{if } a=k\\ -\pi_a & \text{otherwise}\end{cases} \tag{1}\]

For convenience, we denote the expected policy update on action $a$'s probability $\pi_a$ when sampling reward $r_k$ from action $k$ as:

\[d\pi_a(k)=\mathbb{E}_{r_k\sim r(k)}[r_k]\begin{cases}1-\pi_a & \text{if } a=k\\ -\pi_a & \text{otherwise}\end{cases} \tag{2}\]

In CL, every reward $r_k$ sampled when applying the associated action $k$ directly affects the probabilities accorded by policy $\pi$ to all available actions. As noted earlier, CL is closely related to the Gradient Bandit algorithm, which performs a similar update at the parameter level (called “preferences”) of a parametric policy rather than directly updating the probability vector.
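As an illustration, a minimal Python sketch of the CL update in Eq. 1 could look as follows (the 3-armed example values are arbitrary assumptions):

```python
import numpy as np

def cross_learning_update(pi, k, r_k):
    """One Cross Learning step (Eq. 1): pi is the policy vector,
    k the index of the pulled arm, r_k the sampled reward in [0, 1]."""
    update = -r_k * pi      # -r_k * pi_a for every action a
    update[k] += r_k        # +r_k * (1 - pi_k) for the chosen action
    return pi + update

# Arbitrary usage example on a 3-armed bandit with a uniform initial policy.
pi = np.ones(3) / 3
pi = cross_learning_update(pi, k=1, r_k=0.7)
assert np.isclose(pi.sum(), 1.0)  # CL keeps pi on the simplex when 0 <= r_k <= 1
```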

2.2. Evolutionary game theory

Evolutionary game theory (EGT) studies population games Sandholm (2010). In a single-population game, a population $\mathcal{P}$ is made of a large number of individuals, where any individual $i$ is associated with a type, denoted by $T_i\in\{1,\dots,n\}$. The population vector $\pi=(\pi_1,\dots,\pi_n)$ represents the fraction of individuals in each type ($\sum_i\pi_i=1$). Individuals are repeatedly paired at random to play a game, each receiving a separate payoff defined by the game bi-matrix $A$ (a bimatrix game is defined by a pair of identical-shape matrices, one per agent). Individuals adapt their type based on these payoffs according to an update rule (in population games, an update rule is sometimes called a revision protocol). One notable such rule is imitation of success Sandholm (2010):
Imitation Dynamics: Any individual $i\in\mathcal{P}$ of type $T_i=a$ follows the voter rule $R_{\text{voter}}$ (“voter rule” is not a terminology used in EGT; it comes from opinion dynamics):

  1. $i$ samples a random individual $j\sim\mathcal{U}(\mathcal{P})$ to imitate. Let $T_j$ be $b$.

  2. Both individuals $i$ and $j$ play the game defined by $A$ to receive payoffs $r_a$ and $r_b$ respectively ($0\leq r_{a,b}\leq 1$). In general, each payoff may depend on the types of both individuals.

  3. $i$ switches to type $b$ with probability $r_b$ (this definition of the voter rule differs from opinion dynamics in that individuals do not switch deterministically, but rather make a probabilistic switch); a minimal code sketch of one such interaction is given below.

One can easily see why this rule is called “imitation of success”: $i$ imitates $j$ based on $j$'s payoff. When aggregated to the entire population, imitation of success yields a famous equation in EGT, called the Taylor Replicator Dynamic Taylor and Jonker (1978); Sandholm et al. (2008) (TRD) (see Lemma 3.1):

\[\dot{\pi}_a=\pi_a(q^{\pi}_a-v^{\pi}), \tag{3}\]

where $\dot{\pi}_a$ is the time derivative of the $a$-th component $\pi_a$ of the population vector, $q^{\pi}_a:=\mathbb{E}[r_a]$ is the expected payoff of type $a$ against the current population, and $v^{\pi}:=\sum_b\pi_b\mathbb{E}[r_b]$ is the current average payoff of the entire population (in EGT, expected payoffs are often referred to as “fitness”, as they model the reproductive fitness of the different types). For later convenience, we also describe another useful variant of the TRD, the Maynard Smith Replicator Dynamic Smith (1982) (MRD):

\[\dot{\pi}_a=\frac{\pi_a}{v^{\pi}}(q^{\pi}_a-v^{\pi}) \tag{4}\]

2.3. Collective decision-making in swarms

Consider a population $\mathcal{P}$ of $N$ individuals trying to reach a consensus on which amongst $n$ available options is optimal. Similar to population games, each individual $i$ has an opinion, denoted by $O_i\in\{1,\dots,n\}$, about which option they prefer. We again call $\pi=(\pi_1,\dots,\pi_n)$ the population vector, which in this context represents the fraction of individuals sharing each opinion. The weighted voter rule models the dance of honey bees during nest-hunting Reina et al. (2024):

Weighted voter rule: Any individual $i\in\mathcal{P}$ of opinion $O_i=a$ follows the weighted voter rule $R_{\text{wvoter}}$:

  1. $i$ estimates the quality of its current opinion, $r_a\sim r(a)$, where $0\leq r_a\leq 1$.

  2. After obtaining $r_a$, $i$ locally broadcasts its opinion at a frequency proportional to $r_a$.

  3. $i$ switches its opinion to the first opinion $b$ that it perceives from its neighborhood. Assuming all individuals are well mixed in the population Nowak (2006), the corresponding expected probability of $i$ switching to opinion $b$ is the proportion of votes cast for $b$: $P(b\leftarrow a)=\frac{N_b\mathbb{E}[r_b]}{\sum_l N_l\mathbb{E}[r_l]}$ (where $N_k$ is the number of individuals of opinion $k$). This probability can further be written $\frac{\pi_b\mathbb{E}[r_b]}{\sum_l\pi_l\mathbb{E}[r_l]}$ by dividing both the numerator and the denominator by $N$.

Note that in this model, bees do not directly observe the quality estimate of other individuals, but only their opinion. This makes the weighted voter rule well-adapted to swarms of communication-limited organisms.
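Under the well-mixed assumption, the expected switching distribution of rule (3) above can be sketched as follows (a minimal illustration; the opinions and quality estimates below are placeholder values):

```python
import numpy as np

def wvoter_switch_probs(opinions, qualities, n_options):
    """Expected switching distribution under the weighted voter rule: opinion b
    is adopted with probability proportional to the total broadcast frequency
    (sum of quality estimates) of the individuals currently holding b."""
    votes = np.bincount(opinions, weights=qualities, minlength=n_options)
    return votes / votes.sum()

# Placeholder swarm of five bees holding two opinions, quality estimates in [0, 1].
opinions = np.array([0, 0, 1, 1, 1])
qualities = np.array([0.9, 0.8, 0.3, 0.2, 0.4])
print(wvoter_switch_probs(opinions, qualities, n_options=2))  # approx. [0.65, 0.35]
```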

3. Theory

Remark 3.

Population-policy equivalence. As noted by Bloembergen et al. (2015), a population vector $\pi=(\pi_1,\dots,\pi_n)$ can be abstracted as a multi-armed bandit RL policy (and vice-versa). In this view, uniformly sampling an individual of type $a$ from the population is equivalent to sampling action $a$ from the policy.

3.1. Voters and Cross Learning

Proposition 1.

An infinite population of individuals following $R_{\text{voter}}$ can be seen as an RL agent following Exact Cross Learning (we call “exact” the algorithm that applies the expected update), i.e.,

\[d^{\text{voter}}\pi_a=\mathbb{E}_{k\sim\pi,\,r_k\sim r(k)}[d^{\text{CL}}\pi_a(k,r_k)]\;, \tag{5}\]

where $\pi$ can be seen as both the population vector and the vector of action probabilities under the population-policy equivalence, $d^{\text{voter}}\pi$ is the single-step change of population $\pi$ under the voter rule (i.e., the change in type proportions after all individuals simultaneously perform $R_{\text{voter}}$ once), and $d^{\text{CL}}\pi(k,r_k)$ is the update performed by CL on the policy $\pi$ for a given action-reward sample $(k,r_k)$.

To prove Proposition 1, we use two intermediate results (Lemmas 3.1 and 3.2). These results are already known from the literature (although, to the best of our knowledge, we are the first to integrate them and apply them in this context). We provide proofs using our formalism for both lemmas, as we will later follow a similar reasoning to prove the CDM/RL equivalence. The first result describes a policy-population equivalence between CL and the TRD:

Lemma 3.1.

In expectation, an RL agent learning via the CL update rule follows Bloembergen et al. (2015):

\[\mathbb{E}_{k\sim\pi}[d\pi_a(k)]=\pi_a(q^{\pi}_a-v^{\pi}), \tag{6}\]

where $q^{\pi}_a$ is the value of action $a$, and $v^{\pi}$ is the value of policy $\pi$.

Proof.

Let us compute the expectation over actions sampled from $\pi$ in Eq. 2. For convenience, we write $\mathbb{E}[d\pi_a]:=\mathbb{E}_{k\sim\pi}[d\pi_a(k)]$ and $\mathbb{E}[r_k]:=\mathbb{E}_{r_k\sim r(k)}[r_k]$:

\[\begin{aligned}
\mathbb{E}[d\pi_a] &= \sum_{k=1}^{n}\pi_k\, d\pi_a(k) && (7)\\
&= \pi_a\, d\pi_a(a)+\sum_{k\neq a}\pi_k\, d\pi_a(k)\\
&= \pi_a\mathbb{E}[r_a](1-\pi_a)+\sum_{k\neq a}\pi_k\mathbb{E}[r_k](-\pi_a)\\
&= \pi_a\Big[\mathbb{E}[r_a]-\pi_a\mathbb{E}[r_a]-\sum_{k\neq a}\pi_k\mathbb{E}[r_k]\Big]\\
&= \pi_a\Big[\mathbb{E}[r_a]-\sum_k\pi_k\mathbb{E}[r_k]\Big]\\
&= \pi_a(q^{\pi}_a-v^{\pi}) && (8)
\end{aligned}\]

The term $q^{\pi}_a-v^{\pi}$ is commonly known as the “advantage” of action $a$ in RL. From that perspective, it describes how good action $a$ is in comparison to the current policy $\pi$. But Remark 3 enables looking at Lemma 3.1 under a different light: the right-hand sides of Eqs. 3 and 6 are equivalent. In other words, under the population-policy equivalence, the CL update rule tangentially follows the TRD (in expectation). Furthermore, it is known that a population adopting $R_{\text{voter}}$ also yields the TRD:

Lemma 3.2.

An infinite population of individuals adopting $R_{\text{voter}}$ follows the TRD Sandholm (2010); Sandholm et al. (2008), i.e.,

\[d\pi_a=\pi_a(q^{\pi}_a-v^{\pi}), \tag{9}\]
Proof.

Let $P(a\leftarrow b)$ denote the inflow of individuals of type $b$ into type $a$, i.e., the proportion of the population leaving type $b$ and adopting type $a$. The population has a proportion $\pi_b$ of individuals of type $b$, each having a probability $\pi_a$ of meeting an individual of type $a$, and a conditional probability $\mathbb{E}[r_a]$ of switching to its type. Thus, we get $P(a\leftarrow b)=\pi_b\pi_a\mathbb{E}[r_a]$:

\[\begin{aligned}
d\pi_a &= \sum_{b\neq a}\underbrace{P(a\leftarrow b)}_{\text{inflow}}-\underbrace{P(b\leftarrow a)}_{\text{outflow}} && (10)\\
&= \sum_{b\neq a}\pi_b\pi_a\mathbb{E}[r_a]-\pi_a\pi_b\mathbb{E}[r_b]\\
&= \pi_a\Big[\sum_{b\neq a}\pi_b\mathbb{E}[r_a]-\sum_{b\neq a}\pi_b\mathbb{E}[r_b]\Big] && \Big(\text{using }\textstyle\sum_{b\neq a}\pi_b+\pi_a=1\Big)\\
&= \pi_a\Big[(1-\pi_a)\mathbb{E}[r_a]-\sum_{b\neq a}\pi_b\mathbb{E}[r_b]\Big]\\
&= \pi_a\Big[\mathbb{E}[r_a]-\sum_b\pi_b\mathbb{E}[r_b]\Big]\\
&= \pi_a(q^{\pi}_a-v^{\pi}) && (11)
\end{aligned}\]

Combining these results yields Proposition 1, as Lemmas 3.1 and 3.2 yield:

\[d^{\text{voter}}\pi_a=\mathbb{E}_{k\sim\pi,\,r_k\sim r(k)}[d^{\text{CL}}\pi_a(k,r_k)]\]

Proposition 1 shows how TRD connects multi-agent imitation dynamics and single-agent Exact Cross Learning. In practice, RL updates do not follow their exact expectation due to finite sampling. They rely on action samples from the policy and reward samples from the environment. To circumvent high variance and improve convergence properties, RL practitioners typically perform these policy updates in a batched fashion, which, according to Proposition 1, is equivalent to making these updates closer to infinite-population dynamics. In fact, there is an apt population-based interpretation of this practice, shown in the next section.

3.2. Learning rate and batch-size

Instead of studying the mean-field effect of aggregated individual voters, we can look at the individual effect of each voter on the entire population. This effect yields interesting insights regarding two common practices in RL: adjusting the learning rate, i.e., scaling down RL updates by a small factor, and batching, i.e., averaging RL updates over several samples.

Instead of an infinite population, let us consider a near-infinite population $\mathcal{P}$ of $N\gg 1$ individuals (the assumption $N\gg 1$ enables approximating flows by their expectation). Again, we describe $\mathcal{P}$ by the population vector $\pi$. A single individual $i$ of type $k$ sampling payoff $r_k$ and following $R_{\text{voter}}$ has the following influence (outflow from $a$ to $k$) on the population vector for types $a\neq k$:

\[\forall a\neq k,\quad P^{(i)}(k\leftarrow a)=\underbrace{\pi_a}_{\text{type }a}\underbrace{\frac{1}{N}}_{\text{picking }i}\underbrace{r_k}_{\text{and switching to type }k} \tag{12}\]

while its influence on type $k$ is the sum of inflows (to $k$):

\[\sum_{a\neq k}P^{(i)}(k\leftarrow a)=(1-\pi_k)\frac{1}{N}r_k\;. \tag{13}\]

Denoting $\alpha:=\frac{1}{N}$ yields the following learning rule attributable to a single individual on the entire population vector:

\[\forall a,\quad \pi_a\leftarrow\pi_a+\alpha r_k\begin{cases}1-\pi_a & \text{if } a=k\\ -\pi_a & \text{otherwise}\end{cases}\;. \tag{14}\]

Note how the RL update described in Eq. 14 differs from the one described in Eq. 1 only by a scaling factor $\alpha=\frac{1}{N}$.

The population-policy equivalence gives an interesting interpretation to the learning rate $\alpha$ commonly used in RL. Under the population perspective, $\alpha=\frac{1}{N}$ is the inverse of the number of individuals in the population. In Sec. 5, we empirically show that the CL update rule described in Eq. 1 does not typically converge to the optimal action. However, using a small enough learning rate (i.e., a large enough population size) alleviates this issue (as implied by Eq. 14, assuming that the voter rule still yields the Replicator Dynamic when performed sequentially on randomly sampled individuals instead of in parallel in one step across the entire population, which we conjecture; we leave a proof of this conjecture for future work).

To describe the aggregated effect of $R_{\text{voter}}$ on the entire population, we can now sum the effect described in Eq. 14 across all individuals. Let us denote $r^{(i)}$ the payoff sampled by individual $i$, $N_k$ the number of individuals of type $k$, and $q^{\pi}_k$ the average payoff across individuals of type $k$. The aggregated update is:

\[\begin{aligned}
d\pi_a &= \sum_{i=1}^{N}\frac{r^{(i)}}{N}\begin{cases}1-\pi_a & \text{if } a=k\\ -\pi_a & \text{otherwise}\end{cases} && (15)\\
&= \frac{1}{N}\sum_k N_k q^{\pi}_k\begin{cases}1-\pi_a & \text{if } a=k\\ -\pi_a & \text{otherwise}\end{cases}\\
&= \frac{1}{N}\Big(N_a q^{\pi}_a(1-\pi_a)-\pi_a\sum_{k\neq a}N_k q^{\pi}_k\Big)\\
&= \frac{1}{N}\Big(N_a q^{\pi}_a-\pi_a\sum_k N_k q^{\pi}_k\Big)\\
&= \pi_a q^{\pi}_a-\pi_a\sum_k\pi_k q^{\pi}_k && \Big(\pi_k=\tfrac{N_k}{N}\Big)\\
&= \pi_a\Big(q^{\pi}_a-\sum_k\pi_k q^{\pi}_k\Big)\\
&= \pi_a(q^{\pi}_a-v^{\pi})\;, && (16)
\end{aligned}\]

which is the TRD.

Note how, as a corollary of Proposition 1, summing the update from Eq. 14 over the population exactly yields the expectation of the CL update rule described in Eq. 1. By summing Eq. 14, we have retrieved the same update as what averaging Eq. 1 over a large batch would have estimated: its expectation, which is the TRD (as expected from the population perspective, since this derivation is essentially another proof of Lemma 3.1). From the RL perspective, this result means that batching updates removes the need for a small learning rate (see Sec. 5), at least in gradient-free multi-armed bandits, where our analysis provides mathematical grounding for this commonly accepted rule of thumb.

3.3. Swarms and Maynard-Cross Learning

Arguably, the meaning of Eq. 14 is non-intuitive from the population perspective: it describes the effect of a single individual $i$ on the entire population $\mathcal{P}$, whereas there is no such explicit effect in $R_{\text{voter}}$. However, the weighted voter rule $R_{\text{wvoter}}$ does contain an explicit effect, making the analysis much more intuitive.

In this section, we show that, similar to how imitation of success yields the CL update rule, when the entire population is considered as an abstract RL agent, swarms of bees performing CDM for house-hunting follow an abstract RL algorithm that we coin Maynard-Cross Learning.

Let us now consider a near-infinite population of $N$ honey bees reaching a consensus on which nesting site to select via $R_{\text{wvoter}}$. Under $R_{\text{wvoter}}$, individuals have a tangible influence on the rest of the population: remember that in this model, individuals deterministically switch to the first opinion they witness. Hence, the influence of each individual is equal to the ratio of its broadcasting frequency $r_k$ to the total broadcasting frequency of the entire population, $\sum_j r^{(j)}$. In other words, an individual $i$ of type $T_i=k$ and payoff sample $r_k\sim r(k)$ has the same influence on all other members of the population:

\[P^{(i)}(k\leftarrow\cdot)=\frac{r_k}{\sum_j r^{(j)}}\;. \tag{17}\]

Thus, the inflow from type $b$ to type $k$ attributable to $i$ is

\[\begin{aligned}
P^{(i)}(k\leftarrow b) &= \pi_b\frac{r_k}{\sum_j r^{(j)}} && (18)\\
&= \frac{1}{N}\pi_b\frac{r_k}{\frac{1}{N}\sum_j r^{(j)}}\\
&= \alpha\frac{r_k}{v^{\pi}}\pi_b\;. && (19)
\end{aligned}\]

And the total inflow into type $k$ attributable to $i$ is

\[\begin{aligned}
\sum_{b\neq k}P^{(i)}(k\leftarrow b) &= \sum_{b\neq k}\alpha\frac{r_k}{v^{\pi}}\pi_b && (20)\\
&= \alpha\frac{r_k}{v^{\pi}}(1-\pi_k)\;. && (21)
\end{aligned}\]

Eqs. 19 and 21 yield an RL update rule describing the effect of a single individual and corresponding sampled payoff (i.e., reward sample) over the entire population (i.e., policy), called:
Maynard-Cross Learning (MCL). Let $k$ be an action and $r_k\sim r(k)$ a corresponding reward sample. MCL updates the policy $\pi$ as:

\[\forall a,\quad \pi_a\leftarrow\pi_a+\alpha\frac{r_k}{v^{\pi}}\begin{cases}1-\pi_a & \text{if } a=k\\ -\pi_a & \text{otherwise}\end{cases} \tag{22}\]

where $v^{\pi}$ is the current value of policy $\pi$.
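Analogously to the CL sketch in Sec. 2.1, a minimal Python sketch of the MCL update in Eq. 22 could look as follows, assuming an estimate of $v^\pi$ is available (in practice it has to be estimated, see Sec. 4.2); the numerical values are placeholders:

```python
import numpy as np

def maynard_cross_learning_update(pi, k, r_k, v_pi, alpha):
    """One MCL step (Eq. 22): like Cross Learning, but the sampled reward is
    normalized by the current policy value v_pi and scaled by alpha = 1/N."""
    step = alpha * r_k / v_pi
    update = -step * pi   # -alpha * (r_k / v_pi) * pi_a for every action a
    update[k] += step     # +alpha * (r_k / v_pi) * (1 - pi_k) for the chosen action
    return pi + update

# Arbitrary usage example; v_pi=0.5 and alpha=0.01 are placeholder values.
pi = np.ones(3) / 3
pi = maynard_cross_learning_update(pi, k=2, r_k=0.6, v_pi=0.5, alpha=0.01)
```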

Finding the aggregated population effect amounts to summing Eq. 22 across all individuals:

\[\begin{aligned}
d\pi_a &= \sum_{i=1}^{N}\frac{r^{(i)}}{Nv^{\pi}}\begin{cases}1-\pi_a & \text{if } a=k\\ -\pi_a & \text{otherwise}\end{cases} && (23)\\
&= \sum_k\frac{N_k q^{\pi}_k}{Nv^{\pi}}\begin{cases}1-\pi_a & \text{if } a=k\\ -\pi_a & \text{otherwise}\end{cases}\\
&= \frac{1}{Nv^{\pi}}\Big(N_a q^{\pi}_a(1-\pi_a)-\sum_{k\neq a}N_k q^{\pi}_k\pi_a\Big)\\
&= \frac{1}{Nv^{\pi}}\Big(N_a q^{\pi}_a-\sum_k N_k q^{\pi}_k\pi_a\Big)\\
&= \frac{1}{v^{\pi}}\Big(\pi_a q^{\pi}_a-\pi_a\sum_k\pi_k q^{\pi}_k\Big)\\
&= \frac{\pi_a}{v^{\pi}}(q^{\pi}_a-v^{\pi}), && (24)
\end{aligned}\]

which is the MRD.

We have shown that a population whose individuals follow $R_{\text{wvoter}}$ aggregates to the MRD. An argument similar to Sec. 3.2 yields the “batched” version of Eq. 22, that we call Exact Maynard-Cross Learning (EMCL) (MCL has a valid implementation only when the learning rate $\alpha$ is small enough, while EMCL has a valid implementation when the batch size $N$ used to estimate the expectation is large enough; in both cases, $v^{\pi}$ also needs to be estimated):

\[\forall a,\quad \pi_a\leftarrow\pi_a+\mathbb{E}_{k,r_k}\!\left[\frac{r_k}{v^{\pi}}\begin{cases}1-\pi_a & \text{if } a=k\\ -\pi_a & \text{otherwise}\end{cases}\right] \tag{25}\]

EMCL is the RL algorithm followed by swarms of bees that make a collective decision via $R_{\text{wvoter}}$:

Proposition 2.

An infinite population of individuals following $R_{\text{wvoter}}$ can equivalently be seen as an RL agent following EMCL, i.e.,

\[d^{\text{wvoter}}\pi_a=d^{\text{EMCL}}\pi_a, \tag{26}\]

where $\pi$ is both the population vector and the policy under the population-policy equivalence, $d^{\text{wvoter}}\pi$ is the single-step change of population $\pi$ under the weighted voter rule (i.e., the change in type proportions after all individuals simultaneously perform $R_{\text{wvoter}}$ once), and $d^{\text{EMCL}}\pi$ is the update performed by EMCL on the policy $\pi$.

Proof.

The proof of Proposition 2 is trivial at this point. We have already shown that $d^{\text{wvoter}}\pi$ is the MRD, and dividing everything by $v^{\pi}$ in the proof of Lemma 3.1 (starting from Eq. 7) yields that $d^{\text{EMCL}}\pi$ is also the MRD. ∎

4. Methods

We present two types of experiments to validate the findings from the previous section. First, we implement the two RL update rules, CL and MCL, in two variants: batched and non-batched. Second, we conduct population-based experiments using $R_{\text{voter}}$ and $R_{\text{wvoter}}$ (VR and WVR) for different population sizes. Moreover, we also numerically simulate the TRD and MRD to show how the above experiments compare with the analytical solutions.

4.1. Environment

We consider the standard stateless multi-armed bandit setting described in the preliminaries (see Sec. 2.1). As is clear from Remark 3, we can use the same environment for the RL and population experiments. The environment in consideration returns rewards sampled from the hidden distribution $r(a)$ when arm $a$ is pulled. A normal distribution $\mathcal{N}(Q^{\pi}_a,\sigma^2)$ is used to generate these reward samples, where $Q^{\pi}_a\in(-\infty,+\infty)$ is the mean of $\mathcal{N}$ and $\sigma^2$ its variance. These rewards need to be bounded in $[0,1]$, for which the sigmoid function $s(r)=\frac{1}{1+e^{-r}}$ is applied to $r\sim\mathcal{N}(Q^{\pi}_a,\sigma^2)$, making this the hidden distribution $r(a)$. Moreover, this transformation squeezes $\mathcal{N}$ and shifts the mean away from $s(Q^{\pi}_a)$ to a new mean denoted by $q^{\pi}_a\in[0,1]$. This $q^{\pi}_a$ can be estimated as $\mathbb{E}_{r\sim r(a)}[r]$ by drawing a large number of samples ($10^7$) from $r(a)$ and averaging them. Further, three different kinds of environments are used, where the $q^{\pi}_a$'s are all near 0, spread between 0 and 1, or near 1.
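As a concrete illustration, the reward distribution $r(a)$ and the Monte-Carlo estimate of $q^{\pi}_a$ described above can be sketched as follows (the arm means are placeholder values, and fewer samples than the $10^7$ used in the paper are drawn for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)

def pull(a, Q, sigma=1.0):
    """Sample a reward from the hidden distribution r(a): a normal sample with
    mean Q[a] and variance sigma^2, squashed into [0, 1] by a sigmoid."""
    r = rng.normal(Q[a], sigma)
    return 1.0 / (1.0 + np.exp(-r))

# Placeholder arm means; q_a is estimated by averaging many reward samples.
Q = np.array([-1.0, 0.0, 1.0])
q_hat = [np.mean([pull(a, Q) for _ in range(100_000)]) for a in range(len(Q))]
```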

4.2. RL experiments

Non-batched: In these experiments, an RL agent starts with an initial random policy $\pi$. The agent then iteratively samples a single action $k$ from $\pi$. Upon pulling action $k$ in the environment, the agent receives a noisy reward signal $r_k\sim r(k)$. For CL, the agent uses Eq. 14 to make an update. For MCL, however, Eq. 22 cannot be used directly, since we do not have access to $v^{\pi}$. We therefore approximate $v^{\pi}$ with a moving average over rewards, where $\gamma$ is a weighting factor for recent rewards:

\[\bar{r}\leftarrow\gamma r+(1-\gamma)\bar{r}\;. \tag{27}\]

Moreover, since this update rule can make $\pi$ invalid, i.e., components could become negative or exceed one (see the remark at the end of Sec. 3.3), we clamp $\pi$ between 0 and 1:

\[\forall a:\ \pi_a\leftarrow\text{clamp}\left(\pi_a+\alpha\frac{r_k}{\bar{r}}\begin{cases}1-\pi_a & \text{if } a=k\\ -\pi_a & \text{otherwise}\end{cases}\right) \tag{28}\]

We call this training step a run, and perform $R$ runs per seed.
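A minimal sketch of one non-batched MCL run, combining the moving-average estimate of $v^\pi$ (Eq. 27) with the clamped update (Eq. 28), could look as follows; here pull(k) stands for a single-argument reward sampler for arm k, and the initial value of $\bar{r}$ and the renormalization of $\pi$ before sampling are implementation assumptions not specified above:

```python
import numpy as np

rng = np.random.default_rng(0)

def mcl_run(pull, n_arms, steps, alpha=0.001, gamma=0.01):
    """Non-batched MCL: v^pi is approximated by a moving average of rewards
    (Eq. 27) and each update is clamped to keep pi valid (Eq. 28)."""
    pi = np.ones(n_arms) / n_arms
    r_bar = 0.5                                  # assumed initial estimate of v^pi
    for _ in range(steps):
        k = rng.choice(n_arms, p=pi / pi.sum())  # renormalize: clamping may leave the simplex
        r_k = pull(k)
        r_bar = gamma * r_k + (1 - gamma) * r_bar   # Eq. 27
        step = alpha * r_k / r_bar
        update = -step * pi
        update[k] += step
        pi = np.clip(pi + update, 0.0, 1.0)         # Eq. 28
    return pi
```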

Batched: In these experiments, we implement the batched variants of the update rules CL (Eq. 2) and MCL (Eq. 25), henceforth named B-CL and B-MCL. B-CL is a straightforward batching of the CL update rule, averaging over a batch of $B$ samples instead of a single sample to update $\pi$. With B-MCL, $v^{\pi}$ is no longer a moving average of the rewards but rather the mean of the batch rewards. We also need to explicitly clamp the policy between 0 and 1 to ensure it remains valid. B-MCL also uses $B$ samples simultaneously, similar to B-CL. As in the non-batched experiments, we perform $R$ runs per seed.
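A minimal sketch of one batched step covering both B-CL and B-MCL could look as follows (again, pull(k) stands for a reward sampler for arm k; the clamping and renormalization details are implementation assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def batched_update(pi, pull, batch_size, maynard=False):
    """One batched step: average the per-sample update over a batch of
    action/reward samples (B-CL); for B-MCL, divide by the batch mean reward,
    which serves as the estimate of v^pi."""
    ks = rng.choice(len(pi), size=batch_size, p=pi / pi.sum())
    rs = np.array([pull(k) for k in ks])
    weights = rs / rs.mean() if maynard else rs   # B-MCL normalizes by the batch mean
    update = np.zeros_like(pi)
    for k, w in zip(ks, weights):
        delta = -w * pi
        delta[k] += w
        update += delta / batch_size
    return np.clip(pi + update, 0.0, 1.0)
```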

4.3. Population Experiments

In this section, we focus on the population update rules VR and WVR (see the preliminaries, Secs. 2.2 and 2.3). Since we cannot simulate $\mathcal{P}$ for infinite sizes, we choose two finite population sizes of 10 and 1000. For both VR and WVR, we start with an equal proportion of individuals associated with each type/opinion. Further, each individual receives a stochastic payoff/quality estimate for its type/opinion. Thereafter, with VR, everyone is paired with another random individual. All individuals then generate a random number between 0 and 1, and if the random number is less than the payoff of the paired individual, they switch to their partner's type (rule 3 of $R_{\text{voter}}$). With WVR, each individual switches to an opinion sampled from the distribution defined by $v$, where $v_i$ is the ratio of the votes for opinion $i$ to the total number of votes in the population (we implement the third rule of $R_{\text{wvoter}}$ in a centralized fashion for these simple numerical simulations, but in reality it is a completely decentralized rule):

\[v_i=\frac{\sum_{p\in\mathcal{P}:\,O_p=i}r_p}{\sum_{q\in\mathcal{P}}r_q}\;. \tag{29}\]

Similar to the RL experiments, we perform $R$ runs per seed.
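For illustration, one synchronous round of each rule could be sketched as follows (a minimal sketch; pull(t) stands for the stochastic payoff/quality sampler of Sec. 4.1, and the random-pairing details are simplifying assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def vr_round(types, pull):
    """One voter-rule (VR) round: each individual (integer array `types`) observes
    a random partner and switches to the partner's type with probability equal
    to the partner's payoff."""
    payoffs = np.array([pull(t) for t in types])
    partners = rng.integers(0, len(types), size=len(types))  # random partner (self-pairing ignored for simplicity)
    switch = rng.random(len(types)) < payoffs[partners]
    return np.where(switch, types[partners], types)

def wvr_round(types, pull, n_options):
    """One weighted-voter (WVR) round, implemented centrally as described in
    Sec. 4.3: every individual resamples its opinion from the vote
    distribution of Eq. 29."""
    payoffs = np.array([pull(t) for t in types])
    votes = np.bincount(types, weights=payoffs, minlength=n_options)
    return rng.choice(n_options, size=len(types), p=votes / votes.sum())
```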

4.4. TRD and MRD

To empirically validate Propositions 1 and 2, we numerically simulate both differential equations, 3 (TRD) and 4 (MRD). As these equations are continuous, we discretize them with a step $\delta$ (discretizing step). Further, we start from an initial random population/policy $\pi$ and simulate its evolution according to the TRD and MRD over the time interval $[0,t_f]$, using the privileged information $q^{\pi}_a$, which is not available to the RL and population experiments.

\[\pi_a\leftarrow\pi_a+\delta\,\pi_a\Big[q^{\pi}_a-\sum_l\pi_l q^{\pi}_l\Big] \tag{30}\]
\[\pi_a\leftarrow\pi_a+\delta\,\frac{\pi_a}{v^{\pi}}\Big[q^{\pi}_a-\sum_l\pi_l q^{\pi}_l\Big] \tag{31}\]
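A minimal Euler-discretization sketch of Eqs. 30 and 31, using the privileged expected payoffs $q^{\pi}_a$, could be:

```python
import numpy as np

def simulate_rd(q, pi0, delta, t_f, maynard=False):
    """Euler-discretized replicator dynamics (Eqs. 30-31): q holds the privileged
    expected payoffs q_a, pi0 the initial population/policy; maynard=True gives MRD."""
    pi = pi0.copy()
    trajectory = [pi.copy()]
    for _ in range(int(t_f / delta)):
        v = pi @ q              # average payoff v^pi
        dpi = pi * (q - v)      # TRD direction (Eq. 30)
        if maynard:
            dpi = dpi / v       # MRD direction (Eq. 31)
        pi = pi + delta * dpi
        trajectory.append(pi.copy())
    return np.array(trajectory)
```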
Figure 1. Results for non-batched RL experiments with small $\alpha$. The dotted lines represent the mean reward according to the analytical solutions. The darker shades represent the mean reward (or percentage of optimal actions) and the lighter shade represents their variance for the CL and MCL update rules. As the reward scales are different across environments (rows), it is important to look at the ratio of optimal actions (last column).

Figure 2. Results for non-batched RL experiments with large $\alpha$. It is clear that with a larger $\alpha$, the RL solutions diverge from the analytical solutions.

Figure 3. Results for population experiments. The dotted lines represent the average payoff/quality according to the analytical solutions. The darker shades represent the mean payoff/quality and % of populations with the optimal type/opinion, while the lighter shade represents the variance in payoff with VR and opinion with WVR.

Figure 4. Results for batched RL experiments. The dotted lines represent the average reward according to the analytical solutions. The darker shades represent the mean reward and % optimal action, while the lighter shade represents the variance in rewards with B-CL and B-MCL.

5. Results

For all experiments, we use the hyperparameters described in supplementary Sec. A.3.

Non-batched RL update rules CL and MCL follow the TRD and MRD respectively when the learning rate ($\alpha$) is small. These results are presented in Figures 1 and 2. For all environments, CL and MCL follow the TRD and MRD respectively, which can be explicitly seen with the dotted line of the analytical solutions (TRD, MRD) lying exactly at the center of the average reward curves of the CL and MCL update rules. This empirically validates that, with a small $\alpha$, Eqs. 14 and 22 follow the TRD and MRD respectively, even when iteratively applied with single samples. However, as soon as $\alpha$ is increased, CL and MCL start deviating from their respective analytical solutions (see Figure 2). This is a well-known effect in the optimization literature, but from a population perspective (see Sec. 3.2) we see that a larger $\alpha$ corresponds to a smaller population, hence leading to a poor approximation of the expected update. Further, we also observe that MCL performs poorly compared to CL with larger $\alpha$.
Batched RL update rules B-CL and B-MCL follow the TRD and MRD respectively when the batch size ($B$) is large enough. As seen in Figure 4, B-CL and B-MCL follow the TRD and MRD respectively for large batch sizes (this can be seen from how the analytical solution lies exactly at the center of the average reward curves of B-CL and B-MCL). However, as soon as we make the batch sizes smaller, the batched updates deviate from their analytical solutions (see Sec. 3.2). See supplementary Sec. A.1 for the other environments. We also observe that B-MCL performs poorly compared to B-CL with smaller batch sizes, similar to the observations made in the non-batched RL experiments.
Convergence rate of MRD is $\geq$ that of TRD. As noted by Sandholm (2010), the TRD and MRD can be rearranged in the form:

\[\dot{\pi}_a=v^{\pi}\Big(\frac{\pi_a q^{\pi}_a}{v^{\pi}}-\pi_a\Big)\qquad\text{(TRD)} \tag{32}\]
\[\dot{\pi}_a=1\cdot\Big(\frac{\pi_a q^{\pi}_a}{v^{\pi}}-\pi_a\Big)\qquad\text{(MRD)} \tag{33}\]

Here, $\dot{\pi}$ is the update “speed” and $v^{\pi}$ is bounded between 0 and 1, so the MRD speed is greater than or equal to the TRD speed. Empirically, we observe that MRD converges faster than TRD, especially when the $q^{\pi}_a$'s are close to 0, as seen in the first column of Figure 1. When the $q^{\pi}_a$'s are near 1, there is no visible difference (as $v^{\pi}\approx 1$). By extension, this also implies that MCL (for small $\alpha$), B-MCL (for large batch size), and WVR (for large population) have convergence rates $\geq$ those of CL, B-CL, and VR respectively.
Population experiments with VR and WVR follow TRD and MRD respectively for large populations. It can be seen in Figure 3 that both VR and WVR follow TRD and MRD respectively when the population size is large. See supplementary Sec. A.2 for other environments. As soon as the population shrinks, VR and WVR start deviating from the analytical solution. Similar to the RL experiments, WVR performs poorly in comparison to VR for small population sizes.
Finally, from the discussions related to batched RL updates and population experiments, we empirically validate Proposition 1 and Proposition 2.

6. Conclusions

With Propositions 1 and 2, we have demonstrated how the RD is the underlying connection between Reinforcement Learning and Collective Decision-Making, and we have empirically validated this correspondence. This correspondence opens a bridge between these two fields, enabling the flow of ideas, new perspectives, and analogies for both. For example, Cross Learning, Maynard-Cross Learning, and, more generally, Reinforcement Learning take a single-agent perspective, where information from consecutive action samples/batches is accumulated into one centralized agent's policy. On the other hand, $R_{\text{voter}}$ and $R_{\text{wvoter}}$ take a multi-agent perspective, where every individual implements simple, local, and decentralized rules, making independent decisions, which leads to an emergent collective policy.

Significance for RL. Similar to how we discovered a new RL update rule (i.e., Maynard-Cross Learning) from Swarm Intelligence, other ideas such as the majority rule Valentini et al. (2016) and cross-inhibition Reina et al. (2017) can be used to create new update rules for Reinforcement Learning. Moreover, Swarm Intelligence algorithms are often inspired by nature, and thus require individuals to obey physics. This imposes practical constraints such as congestion Soma et al. (2023), communication, and finite-size effects, which are generally ignored in Reinforcement Learning and population games. Comparing the performance of Reinforcement Learning agents with their equivalent swarm counterparts under such constraints is a direction for future work.

Significance for SI. The population-policy equivalence highlights how certain Swarm Intelligence methods are equivalent to single-agent Reinforcement Learning, demonstrating the agency of the entire swarm as a single learning entity. Therefore, one could imagine that Multi-Agent Reinforcement Learning would similarly yield equivalent multi-swarm systems, where two or more coexisting swarms compete or collaborate for resources (e.g., prisoner's dilemma, hawk-dove, etc.). Further, ideas that empower Reinforcement Learning could be ported to Swarm Intelligence and swarm robotics.

References

  • Auer et al. (2002) Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. 2002. Finite-time Analysis of the Multiarmed Bandit Problem. Machine Learning 47 (05 2002), 235–256. https://doi.org/10.1023/A:1013689704352
  • Bloembergen et al. (2015) Daan Bloembergen, Karl Tuyls, Daniel Hennes, and Michael Kaisers. 2015. Evolutionary Dynamics of Multi-Agent Learning: A Survey. Journal of Artificial Intelligence Research 53 (08 2015), 659–697. https://doi.org/10.1613/jair.4818
  • Bonabeau et al. (1999) Eric Bonabeau, Marco Dorigo, and Guy Theraulaz. 1999. Swarm Intelligence: From Natural to Artificial Systems. Oxford University Press. https://doi.org/10.1093/oso/9780195131581.001.0001
  • Castellano et al. (2009) Claudio Castellano, Santo Fortunato, and Vittorio Loreto. 2009. Statistical physics of social dynamics. Reviews of Modern Physics 81, 2 (May 2009), 591–646. https://doi.org/10.1103/revmodphys.81.591
  • Cross (1973) John G. Cross. 1973. A Stochastic Learning Model of Economic Behavior. The Quarterly Journal of Economics 87, 2 (1973), 239–266. https://EconPapers.repec.org/RePEc:oup:qjecon:v:87:y:1973:i:2:p:239-266.
  • Dorigo et al. (2006) Marco Dorigo, Mauro Birattari, and Thomas Stutzle. 2006. Ant colony optimization. IEEE Computational Intelligence Magazine 1, 4 (2006), 28–39. https://doi.org/10.1109/MCI.2006.329691
  • Dorigo et al. (2021) Marco Dorigo, Guy Theraulaz, and Vito Trianni. 2021. Swarm Robotics: Past, Present, and Future [Point of View]. Proc. IEEE 109, 7 (2021), 1152–1165. https://doi.org/10.1109/JPROC.2021.3072740
  • Hamann (2018) Heiko Hamann. 2018. Swarm Robotics: A Formal Approach. https://doi.org/10.1007/978-3-319-74528-2
  • Kaufmann et al. (2023) Elia Kaufmann, Leonard Bauersfeld, Antonio Loquercio, Matthias Mueller, Vladlen Koltun, and Davide Scaramuzza. 2023. Champion-level drone racing using deep reinforcement learning. Nature 620 (08 2023), 982–987. https://doi.org/10.1038/s41586-023-06419-4
  • Mnih et al. (2013) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin A. Riedmiller. 2013. Playing Atari with Deep Reinforcement Learning. CoRR abs/1312.5602 (2013). arXiv:1312.5602 http://arxiv.org/abs/1312.5602
  • Nowak (2006) Martin A. Nowak. 2006. Five Rules for the Evolution of Cooperation. Science 314, 5805 (2006), 1560–1563. https://doi.org/10.1126/science.1133755 arXiv:https://www.science.org/doi/pdf/10.1126/science.1133755
  • Redner (2019) Sidney Redner. 2019. Reality-inspired voter models: A mini-review. Comptes Rendus. Physique 20, 4 (May 2019), 275–292. https://doi.org/10.1016/j.crhy.2019.05.004
  • Reina et al. (2017) Andreagiovanni Reina, James A. R. Marshall, Vito Trianni, and Thomas Bose. 2017. Model of the best-of-N nest-site selection process in honeybees. Physical Review E 95, 5 (May 2017). https://doi.org/10.1103/physreve.95.052411
  • Reina et al. (2024) Andreagiovanni Reina, Thierry Njougouo, Elio Tuci, and Timoteo Carletti. 2024. Speed-accuracy trade-offs in best-of-$n$ collective decision making through heterogeneous mean-field modeling. Phys. Rev. E 109 (May 2024), 054307. Issue 5. https://doi.org/10.1103/PhysRevE.109.054307
  • Sandholm (2010) William H Sandholm. 2010. Population games and evolutionary dynamics. MIT press.
  • Sandholm et al. (2008) William H. Sandholm, Emin Dokumacı, and Ratul Lahkar. 2008. The projection dynamic and the replicator dynamic. Games and Economic Behavior 64, 2 (2008), 666–683. https://doi.org/10.1016/j.geb.2008.02.003 Special Issue in Honor of Michael B. Maschler.
  • Seo et al. (2024) Jaemin Seo, SangKyeun Kim, Azarakhsh Jalalvand, Rory Conlin, Andrew Rothstein, Joseph Abbate, Keith Erickson, Josiah Wai, Ricardo Shousha, and Egemen Kolemen. 2024. Avoiding fusion plasma tearing instability with deep reinforcement learning. Nature 626 (02 2024), 746–751. https://doi.org/10.1038/s41586-024-07024-9
  • Smith (1982) John Maynard Smith. 1982. Evolution and the Theory of Games. Cambridge University Press, Cambridge, UK.
  • Soma et al. (2023) Karthik Soma, Vivek Shankar Vardharajan, Heiko Hamann, and Giovanni Beltrame. 2023. Congestion and Scalability in Robot Swarms: A Study on Collective Decision Making. In 2023 International Symposium on Multi-Robot and Multi-Agent Systems (MRS). 199–206. https://doi.org/10.1109/MRS60187.2023.10416793
  • Sutton and Barto (2018) Richard S Sutton and Andrew G Barto. 2018. Reinforcement learning: An introduction. MIT press.
  • Taylor and Jonker (1978) Peter D. Taylor and Leo B. Jonker. 1978. Evolutionary stable strategies and game dynamics. Mathematical Biosciences 40, 1 (1978), 145–156. https://doi.org/10.1016/0025-5564(78)90077-9
  • Valentini et al. (2016) Gabriele Valentini, Eliseo Ferrante, Heiko Hamann, and Marco Dorigo. 2016. Collective decision with 100 Kilobots: speed versus accuracy in binary discrimination problems. Autonomous Agents and Multi-Agent Systems 30, 3 (2016), 553–580. https://doi.org/10.1007/s10458-015-9323-3
  • Valentini et al. (2014) Gabriele Valentini, Heiko Hamann, and Marco Dorigo. 2014. Self-organized collective decision making: the weighted voter model. In Proceedings of the 2014 International Conference on Autonomous Agents and Multi-Agent Systems (Paris, France) (AAMAS ’14). International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC, 45–52.
  • Visscher and Camazine (1999) P Kirk Visscher and Scott Camazine. 1999. Collective decisions and cognition in bees. Nature 397, 6718 (February 1999), 400. https://doi.org/10.1038/17047
  • Williams (1992) Ronald J Williams. 1992. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning 8 (1992), 229–256.
  • Xia et al. (2011) Haoxiang Xia, Huili Wang, and Zhaoguo Xuan. 2011. Opinion dynamics: A multidisciplinary review and perspective on future research. International Journal of Knowledge and Systems Science (IJKSS) 2, 4 (2011), 72–91.

Appendix A Supplementary material

A.1. Batched RL experiments

Figure 5. Batched RL figures for large batch size and the two other environments, where the $q^{\pi}_a$'s are near zero and near one.

A.2. Population experiments

Figure 6. Population experiments for large population size and the two other environments, where the $q^{\pi}_a$'s are near zero and near one.

A.3. Hyperparameters

In this section, we list the various hyperparameters used in the RL and population experiments.

Arms ($n$): 10
Learning rate ($\alpha$): $\{0.001, 0.1\}$
Epochs ($E$): 100
Variance ($\sigma^2$): 1
Runs ($R$): $\{1000000, 1000\}$
Weight factor ($\gamma$): 0.01
Discretizing factor ($\delta$): $\alpha$
Final time ($t_f$): $\alpha\times R$ = 100
Table 1. Hyperparameters used for RL non-batched experiments
Arms ($n$): 10
Epochs ($E$): 100
Variance ($\sigma^2$): 1
Runs ($R$): 100
Batch size ($B$): $\{10, 1000\}$
Discretizing factor ($\delta$): 1
Final time ($t_f$): $R$ = 100
Table 2. Hyperparameters used for RL batched experiments
Types/opinions ($n$): 10
Epochs ($E$): 100
Variance ($\sigma^2$): 1
Runs ($R$): 100
Population size ($B$): $\{10, 1000\}$
Discretizing factor ($\delta$): $\alpha$
Final time ($t_f$): $R$ = 100
Table 3. Hyperparameters used for population experiments