Learning as Reinforcement: Applying Principles of Neuroscience for More General Reinforcement Learning Agents
Abstract
A significant challenge in developing AI that can generalize well is designing agents that learn about their world without being told what to learn, and apply that learning to challenges with sparse rewards. Moreover, most traditional reinforcement learning approaches explicitly separate learning and decision making in a way that does not correspond to biological learning. We implement an architecture founded in principles of experimental neuroscience, by combining computationally efficient abstractions of biological algorithms. Our approach is inspired by research on spike-timing dependent plasticity, the transition between short and long term memory, and the role of various neurotransmitters in rewarding curiosity. The Neurons-in-a-Box architecture can learn in a wholly generalizable manner, and demonstrates an efficient way to build and apply representations without explicitly optimizing over a set of criteria or actions. We find it performs well in many environments including OpenAI Gym’s Mountain Car, which has no reward besides touching a hard-to-reach flag on a hill, Inverted Pendulum, where it learns simple strategies to improve the time it holds a pendulum up, a video stream, where it spontaneously learns to distinguish an open and closed hand, as well as other environments like Google Chrome’s Dinosaur Game.
1 Introduction
Previous works introduce the concept of spiking neural networks and illustrate reinforcement learning as a potential use case (Donahoe, 1997; Stanley & Miikkulainen, 2002; Florian, 2005). We build upon these works by abstracting away the necessity for spike trains and introducing new ideas from neuroscience, including memory formation, intrinsic motivation, and computationally-efficient algorithms for spike-timing dependent plasticity.

2 Background
2.1 Spike-Timing Dependent Plasticity
Spike-timing dependent plasticity (STDP) is a biological phenomenon that constantly updates the strength of connections between neurons in the brain based on firing patterns. Connection strengths are adjusted based on the timing between a neuron firing and its input neurons firing, such that neurons that fire consecutively have their connections strengthened in the “direction” of firing. This formulation has been shown to explain characteristics of long-term depression and other neuroscientific phenomena (Debanne et al., 1994).
One natural, though generally not very revealing, approach is to use STDP to train a neural network to identify patterns, and then to train a separate neural network on the STDP network’s firings to make predictions. This approach was used to accurately predict road traffic dynamics by Kasabov (2014).
Another important adaptation of STDP to the reinforcement learning setting is reward-modulated STDP (Florian, 2007). The idea is that incorporating a global reward signal into the standard STDP formulation can be analytically shown to yield a reinforcement learning algorithm, thus “modulating” the STDP with the reward. In essence, this is simply STDP in which the learning update step is multiplied by the reward. In a later section, we continue to draw upon neuroscience research by deriving intrinsic rewards based on motivation rather than explicit rewards specified by the environment or programmer.
2.2 Long-Term and Short-Term Memory
Research indicates that there are distinct chemical and physiological properties of neural connections which correspond to long-term and short-term memory (Hiratani & Fukai, 2014). Moreover, there seem to be explicit chemical causes for this transition (Malik et al., 2013). In the context of spike-timing dependent plasticity, a transition from short-term to long-term memory for a connection that has been consistently and actively predictive allows more sophisticated new representations to be built on top of established ones.
2.3 Intrinsic Reward
Intrinsic reward refers to the idea that humans often learn and discover useful behaviors within their environments through experimentation, exploration, and often sheer spontaneity. In work modeling intrinsic reward signals in artificial agents, researchers frequently draw an analogy with young infants, who are often observed to learn about the world by engaging with it through “playful behavior” (Haber et al., 2018). The goal in this line of research is to discover a reward signal which allows an agent to learn effectively about its environment in a generalizable manner, regardless of the structure of the environment.
2.4 Neural Noise
Some amount of neural noise has consistently been shown to be useful in training spiking neural models and appears to be important in actual neuronal processing, though aside from possibly preventing overfitting, the function this noise plays in the actual brain is a topic of debate (Longtin, 2013; Engel et al., 2015). Regardless, some baseline level of firing encourages a model that does something rather than nothing.
3 Approach
3.1 Architecture
As our model, we constructed a Neurons-in-a-Box architecture. This architecture was composed of an input layer, which took in pixel (and, optionally, audio) input from a continuous stream and produced feedforward outputs into a number of feedforward neurons. These neurons were randomly distributed in space, and connections were formed such that spatially closer neurons were more likely to be connected. A random subset of neurons in one slice of the architecture was then selected as output neurons, which were moved to a constant distance from the input.
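As a rough illustration, the following sketch shows one way such a spatially localized random network could be constructed; the Gaussian falloff, its width, and the uniform weight initialization are illustrative assumptions rather than the exact procedure used.

```python
import numpy as np

def build_network(n_neurons, dim=3, sigma=0.15, seed=0):
    """Place neurons at random positions and connect nearby neurons more often."""
    rng = np.random.default_rng(seed)
    pos = rng.random((n_neurons, dim))                           # random spatial positions
    dist = np.linalg.norm(pos[:, None] - pos[None, :], axis=-1)
    p_connect = np.exp(-(dist / sigma) ** 2)                     # closer => more likely to connect
    mask = rng.random((n_neurons, n_neurons)) < p_connect
    np.fill_diagonal(mask, False)                                # no self-connections
    C = mask * rng.uniform(-1.0, 1.0, (n_neurons, n_neurons))    # excitatory and inhibitory weights
    return pos, C
```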
Long-term memory was implemented by keeping an accumulator which tracks the frequency with which a neuron fires and, once synaptic strength is high enough, sets the connection’s plasticity to false, fixing it.
The intrinsic reward which we attempted to use was simply the amount of learning that occurs at each timestep. In other words, we expected that training on the amount of novelty would eventually lead to good performance, much as a curious agent learns by exploring phenomena it finds surprising. Unfortunately, in the dinosaur game the game-over screen produces more novelty than simply continuing to explore the world, so the agent eventually converges on ending the game immediately. While in theory the game-over screen should eventually become uninteresting to the agent, this did not seem to happen in practice.
3.2 Reinforcement
One guiding principle was that, at least at this level, a small set of neurons should not explicitly learn state values and then choose the firings that maximize over those values. Instead, the presence or absence of reward should directly strengthen or weaken neural connections.
3.2.1 Direct Reward
Suppose we take a very blunt approach: if the neural network did something good, we want it to do more of that; if it did something bad, we want it to do less of that. This “global” update is broadly consistent with what we see in the brain: many reward pathways, when triggered, affect many neurons at the same time and propagate to large parts of the brain (Wise & Rompre, 1989).
There are several questions that need to be answered with this approach. First, what should be strengthened by a global reinforcement? For consistency, we will refer to the underlying sensory inputs as sensory neurons, the neurons that are not directly measured when determining an action as intermediate neurons, and the neurons that are directly measured as action neurons.
1. All neural connections (including inhibitory connections): This is far too coarse, and it actually weakens the strongest connections relative to everything else due to the maximum allowed connection strength. It introduces noise and encourages unstable solutions. However, combined with STDP, this can encourage the learning of meaningful relationships, because neurons that nearly fired before, and that capture information relevant to the reward, will now be more likely to fire and to be related to one another by an action neuron.
2. Outputs from neurons that fired: This is more stable, but has another significant problem: sometimes you want to reward passivity. For example, in the pendulum game, the action neurons are rewarded for not firing when the pendulum is at the top. Furthermore, as backpropagation (which we avoided using) makes explicit, the entire network is not equally responsible for the result.
3. Connections which were used: That is, connections where the input neuron fired and the output neuron’s firing corresponds to whether the connection is excitatory (positive) or inhibitory (negative). This is one of the better options, but it is almost always more consequential to fire than to not fire, so this ends up killing the network.
4. Input connections to action neurons: This is stable and can indirectly facilitate the development of “reward neurons,” but it can result in disproportionate reinforcement of inputs which are counterproductive (for example, if an action neuron fired due to one strong excitatory input and several weak inhibitory inputs, reinforcing all of its inputs would actually make it less likely to fire). If one instead considers only the connections that were used, as in the previous item, one can end up silencing all connections when there is a delay between reward and firing.
5. Input connections to action neurons where both fired: This is stable and results in behavior learning, but unfortunately still suffers from the same problem as Approach 2. However, STDP can accomplish much of the work for which backpropagation is usually used, allowing neurons that are strengthened to strengthen in turn the inputs responsible for their firing (see the sketch following this list).
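As an illustration of the last scheme, the following is a minimal sketch, assuming C[i, j] stores the strength of the connection from neuron j to neuron i; the learning rate and the treatment of inhibitory connections are illustrative choices rather than details taken from the architecture.

```python
import numpy as np

def direct_reward_update(C, f_prev, f_curr, action_idx, reward, lr=0.01):
    """Strengthen (or weaken, for negative reward) only the input connections to
    action neurons where both the input neuron and the action neuron fired."""
    co_fired = np.outer(f_curr, f_prev)              # co_fired[i, j] = 1 if j fired, then i fired
    mask = np.zeros_like(C)
    mask[action_idx, :] = co_fired[action_idx, :]    # restrict to action neurons' inputs
    return C + lr * reward * mask                    # inhibitory handling omitted in this sketch
```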
3.2.2 STDP Reinforcement
There are two ways that we considered using STDP to directly shape behavior. First, we can simply multiply the STDP update by the reward. For negative rewards, this actually leads to unlearning or “forgetting” whatever was just experienced, which is not necessarily ideal. Alternatively, we can measure the size of the learning update and use a direct reward to reinforce actions that lead to more learning. This is necessary for problems with sparse rewards that require exploration; in essence, it encourages the network to take a more scenic route when it can.
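A minimal sketch of the first variant, reward-modulated STDP, in terms of the update matrix U and plasticity mask P introduced in Section 4.1 (the learning rate is an illustrative choice):

```python
def reward_modulated_stdp(C, U, P, reward, lr=0.01):
    """Scale the STDP update by the (possibly negative) reward; a negative reward
    unlearns whatever was just experienced."""
    return C + lr * reward * P * U
```

The second variant instead treats the magnitude of the update itself as an intrinsic reward, which is made precise in Section 4.3.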
3.3 Memory
The idea of memory in our model is to keep fixed those connections that lead to favorable behavior, so that intelligent strategies are propagated forward in time. Our approach consists of a plasticity mask which modulates, in the short term, the degree to which any connection can be altered by future weight updates, and which can fix connections entirely in the long term. To implement this in practice, we considered two approaches.
1. All neural connections decay uniformly (“aging”): This method decays the plasticity of all connections uniformly over time. It seems far too coarse, as it fails to account for differences in the favorability of individual connections.
2. Plasticity decay accumulation: This approach accumulates modifications to individual connections: the more a particular connection is changed, the more “change” it accumulates. Once this value reaches a certain threshold for a connection and its strength is significantly above the mean connection strength, the connection is fixed and cannot be altered. This relies on the assumption that connections which have been altered many times are closer to converging on an optimal fixed strength, whereas those which have not changed much still have room to change (a sketch follows this list).
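A minimal sketch of this accumulation rule, assuming the plasticity mask P multiplies each weight update elementwise and delta_C is the most recent update; the thresholds are illustrative:

```python
import numpy as np

def update_plasticity(P, accum, delta_C, C, change_threshold=5.0, strength_margin=1.0):
    """Accumulate the per-connection change; freeze connections that have both
    changed a lot and grown well above the mean connection strength."""
    accum += np.abs(delta_C)
    strong = C > C.mean() + strength_margin * C.std()
    P[(accum > change_threshold) & strong] = 0.0    # plasticity of 0 fixes the connection
    return P, accum
```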
4 Theory
There are a number of important theoretical questions which needed to be tackled in order to implement this agent.
4.1 Discrete STDP Update Rule
STDP is often implemented as a weighted sum over rolling windows of spikes combined with the neuronal spiking frequency (Lee et al., 2018). This approach is biologically realistic, and researchers have even argued that norepinephrine plays the role of determining the size of this window (Salgado et al., 2012). However, it is computationally inefficient to implement across many neurons, and it is only parallelizable with a high STDP update frequency, which comes with its own computational cost.
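For reference, a minimal sketch of the conventional window-based pairwise rule that the discrete version derived below replaces; the amplitudes, time constant, and window length are illustrative. The nested loop over spike pairs is what makes this formulation expensive at scale:

```python
import numpy as np

def window_stdp_delta(pre_spike_times, post_spike_times,
                      a_plus=0.01, a_minus=0.012, tau=20.0, window=100.0):
    """Sum an exponentially decaying kernel over every pre/post spike pair that
    falls inside the rolling time window (times in ms)."""
    dw = 0.0
    for t_pre in pre_spike_times:
        for t_post in post_spike_times:
            dt = t_post - t_pre
            if abs(dt) > window or dt == 0:
                continue
            if dt > 0:                       # pre before post: potentiate
                dw += a_plus * np.exp(-dt / tau)
            else:                            # post before pre: depress
                dw -= a_minus * np.exp(dt / tau)
    return dw
```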
We derive a discrete, matrix-operation parallel to the standard STDP rule. Given $n$ neurons, a vector $f_1$ of firings in a prior time window, a vector $f_2$ of firings in the current time window, and an input connection matrix $C$, we want to output an updated connection matrix $C'$.
We want a function with two features.
• It should strengthen connections which are predictive: if neuron $i$ fires and then neuron $j$ fires, the connection from $i$ to $j$ should strengthen and the connection from $j$ to $i$ should weaken.
• It should only strengthen connections where both of the neurons fired at least once.
4.1.1 Predictivity
We can represent the first feature by the following rule: for the first firings $f_1$ and the second firings $f_2$ defined above, strengthen each connection whose source neuron fired in $f_1$ and whose target neuron fired in $f_2$, and weaken the connection in the opposite direction. Working through this rule on example firing patterns, however, the last case is clearly not the desired result (one does not want to weaken all connections that fire unprompted), so we need to incorporate the second requirement.
4.1.2 Co-occurrence
To do this, we calculate the pairwise co-occurrence matrix directly: an entry is nonzero only where the source neuron fired in $f_1$ and the target neuron fired in $f_2$. For example, if only the last neuron fires in $f_1$ and only the first neuron fires in $f_2$, then the connection from the last neuron to the first neuron is the only one that is strengthened.
4.1.3 Update
Supposing we have a connection matrix $C$, an update weight (learning rate) $\alpha$, and a plasticity matrix of all 1’s, the new connection matrix $C'$ is obtained by adding $\alpha$ times the plasticity-masked update to $C$.
4.1.4 Overall
For the firings $f_{t-1}$ and $f_t$ at timesteps $t-1$ and $t$, a learning rate $\alpha$, a connection matrix $C$, and a plasticity matrix $P$, the discrete STDP update rule is

$$C' = C + \alpha \, P \odot U(f_{t-1}, f_t),$$

where $U(f_{t-1}, f_t)$ is the STDP update matrix built from the predictivity and co-occurrence terms above, and $\odot$ denotes elementwise multiplication.
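The following is a minimal sketch of one possible realization of this rule, assuming C[i, j] stores the strength of the connection from neuron j to neuron i and that firings are binary vectors; the derivation above builds the predictivity and co-occurrence terms separately, so the exact matrices may differ.

```python
import numpy as np

def discrete_stdp_update(C, P, f_prev, f_curr, lr=0.01):
    """Predictivity: if neuron j fired at t-1 and neuron i fired at t, strengthen
    the j -> i connection and weaken i -> j. Every nonzero entry involves two
    neurons that each fired at least once, so the co-occurrence requirement is
    satisfied automatically in this formulation."""
    U = np.outer(f_curr, f_prev) - np.outer(f_prev, f_curr)
    return C + lr * P * U, U                 # the plasticity mask P gates each connection

# Example: only the last neuron fires at t-1 and only the first fires at t, so
# the connection from the last neuron to the first is the one that strengthens.
C, P = np.zeros((3, 3)), np.ones((3, 3))
f_prev, f_curr = np.array([0, 0, 1]), np.array([1, 0, 0])
C_new, U = discrete_stdp_update(C, P, f_prev, f_curr)
```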
4.2 Reward Neurons
Reward neurons are a surprisingly elegant way to learn rewards and modulate actions. Essentially, if one sets a reward neuron $r$ to fire with strength corresponding to the reward $R$, then an STDP update will reinforce input connections that predict the firing of that reward neuron, and inputs that predict the firings of those inputs will in turn be strengthened.
Ideally, we want an input neuron $B$ to also be able to trigger the reward neuron, so that reward can be predicted when it is not yet present; the reward neuron should therefore sum its inputs rather than simply be set directly to $R$. Since the connection from $B$ to $r$ is initially weak, this sum is initially approximately $R$.
If $r$ and $B$ have a firing covariance, the connection from $B$ to $r$ will eventually converge on that covariance (if there are no other outputs), since the connection is strengthened when the frequency of firing of $B$ is less than that of $r$ and weakened otherwise. If $B$ is an output neuron, then this implies a statistical equivalent of Rescorla-Wagner learning (Rescorla & Wagner, 1972), where $V$ is the current value of a state, $\alpha$ is the learning rate, and $R$ is the reward:

$$\Delta V = \alpha \, (R - V)$$
The more similar the firing frequencies, the larger the update from an error in the opposite direction (i.e., the closer the weight gets to the actual covariance) and the smaller the update from an error in the correct direction.
Note that this same relationship also holds between $B$ and an upstream neuron $A$, so the connection from $A$ to $B$ tends toward their firing covariance. As a result, as the connection between $B$ and $r$ is strengthened in line with Rescorla-Wagner, the same thing happens to the connection from $A$ to $B$. Propagating Rescorla-Wagner value backwards in time by considering the covariance of states in this way is known as Temporal Difference learning:

$$V(s_t) \leftarrow V(s_t) + \alpha \left[ R_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right]$$
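For reference, the standard textbook forms of these two updates (not the implementation used here), with alpha the learning rate and gamma a discount factor:

```python
def rescorla_wagner(V, reward, alpha=0.1):
    """Move the value estimate toward the observed reward."""
    return V + alpha * (reward - V)

def td_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """Tabular TD(0): bootstrap from the value of the next state."""
    V[s] += alpha * (r + gamma * V[s_next] - V[s])
    return V
```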
4.3 Novelty
Novelty has for decades been shown to be a valid reward stimulus in primates (Martinez-Rubio et al., 2018). Hence, we aimed to use novelty as the reward in environments in which a clear, continuous reward signal is not present.
As defined earlier, $U$ is the STDP update matrix for the firings $f_t$ and $f_{t-1}$ (in practice, the last and second-to-last firings), $C$ is our last connection matrix, $P$ is our plasticity matrix, $N$ is our novelty, and $|\Delta C|$ denotes the size (i.e., number of elements) of the update. Thus, we have

$$N = \frac{1}{|\Delta C|} \sum_{i,j} \left| \Delta C_{ij} \right|,$$

where

$$\Delta C = C' - C = \alpha \, P \odot U(f_{t-1}, f_t).$$

In other words, we are taking the mean over the magnitude of change between firing updates.
An alternative measure of novelty is to perform this same operation on video frames rather than firing patterns (i.e., taking the mean absolute difference between the last and second-to-last video frames). This measure of novelty was used in testing on the Dino game environment, described below.
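A minimal sketch of both novelty measures, with illustrative names:

```python
import numpy as np

def novelty_from_learning(delta_C):
    """Mean magnitude of the most recent connection-matrix update (Section 4.3)."""
    return np.abs(delta_C).mean()

def novelty_from_frames(frame_prev, frame_curr):
    """Alternative: mean magnitude of pixel change between consecutive frames."""
    return np.abs(frame_curr.astype(float) - frame_prev.astype(float)).mean()
```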
5 Experimental Results
5.1 Pattern Recognition and Differentiation
As an initial experiment, STDP was applied at each timestep to neurons that received a video stream as input. The neural network had no modulation of learning; the neurons were essentially driven to build correlations, both spatial and temporal, over the data presented. In this way, the network learned to differentiate between patterns observed in the video stream. To evaluate the model’s effectiveness, a hand was shown to the neurons in different stages (e.g. as a fist or an open palm, as shown in Figure 2) and in different locations on the screen.



The stages alternated at varying times, and the neurons were recorded to keep track of the maximal difference between any neuron’s tendency to fire in the first and second stages. As a control, this process was run again with nothing changed between stages. The neuron with the most variance between stages was chosen as the output neuron; the ratio of the differences in activation of this neuron in the control and trained cases is shown in Figure 3.
This network was also shown YouTube videos for several hours, after which we visualized the stimuli to which the most sensitive neurons responded. Specifically, the neurons were activated by a large number of samples of random noise; the sample that activated them most was selected and then averaged again with many samples of random noise, and this process was repeated several times. The whole procedure was itself repeated several times, and the images which were stimulating or inhibitory were averaged together. We present these results in Figure 4.






5.2 XOR
The XOR gate problem has historically been notoriously difficult to learn due to its nonlinear structure. (In what became known as the XOR affair, Marvin Minsky famously proved that a single-layer perceptron is incapable of solving the XOR gate, which was among the many reasons for the AI winter of the 1970s.) The problem involves returning true only when exactly one of two input bits has fired, and false otherwise.
We demonstrated that the architecture achieves high accuracy on the task, indicating that it was reliably solving it with only a reward signal given for correct answers. The connection strength matrix is shown in Figure 5.

5.3 Chrome Dino Game
For the behavioral learning portion, we extended our visual input stream to a screen capture of the dinosaur game that Chrome displays when it is offline (accessible at chrome://dino in the Google Chrome browser), and modulated the amount of learning that occurred as a function of whether or not the screen was changing. While the ultimate goal is to find an intrinsic reward function that allows the model to learn to play arbitrary games well while seeking novelty, this lets us test whether this approach to behavior optimization is capable of solving reinforcement learning tasks. We show what the neuron architecture sees in Figure 6.
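A minimal sketch of the screen-capture loop, assuming the mss library for frame grabs; the capture region, grid size, and crude downsampling are placeholder choices rather than the exact pipeline used:

```python
import numpy as np
import mss

def grab_frame(region, grid=(10, 10)):
    """Capture a region of the screen and downsample it to a coarse grayscale grid."""
    with mss.mss() as sct:
        shot = np.array(sct.grab(region))[:, :, :3]     # drop the alpha channel
    gray = shot.mean(axis=2) / 255.0
    h, w = gray.shape
    return gray[:: max(1, h // grid[0]), :: max(1, w // grid[1])][: grid[0], : grid[1]]

region = {"top": 200, "left": 100, "width": 600, "height": 150}   # placeholder coordinates
prev, curr = grab_frame(region), grab_frame(region)
learning_scale = np.abs(curr - prev).mean()   # modulate learning by how much the screen changes
```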

Because the discretized STDP update rule for this neural net is simpler (both computationally and mathematically) than the traditional window-based approach (Kempter et al., 1998), it was not immediately clear that this would have any success at all. On the dinosaur task, the best achieved performance was a score of 168. In comparison, the average score of an agent performing at chance (i.e. continuously holding down the jump key) was 57, with a maximum score of 100. A recording of a typical run is available at https://giant.gfycat.com/EmotionalSecondhandHagfish.webm.

The confusion matrix that results from ten thousand learning steps is shown in Figure 7. The neural configuration corresponds to a 10x10 pixel grayscale input, 50 randomly firing neurons, 600 “hidden” neurons, and 50 output neurons. Notably, the number of connections made by each input neuron suggests that the localization of the initial connections is failing.
5.4 Mountain Car
Here, we test our model on a classic reinforcement learning problem available in OpenAI Gym. This particular environment has a discrete action space: go left, stay still, or go right. We use the observations returned by the environment as input to our neuron architecture, and we use a measure of novelty as the reward signal that updates our model. The proportion of output neurons that are firing determines the action and is discretized to fit the discrete action space.
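To make the interface concrete, here is a minimal sketch of an evaluation loop on MountainCar-v0 using the classic Gym API; RandomFiringAgent is a placeholder for the Neurons-in-a-Box model, and the discretization thresholds are illustrative:

```python
import gym
import numpy as np

class RandomFiringAgent:
    """Stand-in for the Neurons-in-a-Box model, exposing the same loop interface."""
    def __init__(self, n_out=50):
        self.n_out = n_out
    def forward(self, obs):
        return (np.random.rand(self.n_out) < 0.5).astype(float)   # placeholder firings
    def reinforce(self, reward):
        pass                                                       # placeholder learning step

agent = RandomFiringAgent()
env = gym.make("MountainCar-v0")
obs = env.reset()                                      # classic (pre-0.26) Gym API
for step in range(200):
    firings = agent.forward(obs)                       # output-neuron firings
    p = firings.mean()                                 # proportion of output neurons firing
    action = int(np.clip(np.floor(p * 3), 0, 2))       # {0: left, 1: stay, 2: right}
    obs, env_reward, done, _ = env.step(action)
    agent.reinforce(reward=0.0)                        # would be the novelty signal in practice
    if done:
        break
```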
The results of the experiment indicate that the agent is able to learn to quickly reach the goal state using a measure of novelty as an instantaneous reward. We present the results of the experiment in Figure 8.

We can see that the agent is unable to complete the objective at the beginning but quickly learns the correct back-and-forth movement to reach the goal. The performance is stochastic, but the overall trend of improvement is clear. After around 75 episodes, the agent is able to reach the goal within 200 steps relatively consistently. This does not meet the threshold for officially solving the environment, but our agent is not using the environment’s reward to learn, so this may be the agent’s best attainable performance with the novelty metric.
5.5 Mountain Car Continuous
Here, we adapt our model to a similar reinforcement learning problem available in OpenAI Gym, the continuous-action-space version of Mountain Car. This environment accepts any float value as the action, with negative values corresponding to left, positive values corresponding to right, and magnitude corresponding to the impulse level. Again, we use the observations returned by the environment as input to our neuron architecture, and we use a measure of novelty as the reward signal that updates our model. The proportion of output neurons that are firing determines the action and is linearly scaled to fit the continuous action space. We present the results of the experiment in Figure 9.
We can see that the agent is unable to complete the objective at the beginning but gradually learns the correct back-and-forth movement to reach the goal. At around 300 episodes, the agent is able to reach the goal in 100 steps or fewer, which is considered a successful solve of this environment.

5.6 Pendulum
The OpenAI Gym Inverted Pendulum environment is a classic reinforcement learning problem, though a notoriously difficult one to solve. The input was the observation of angle and angular velocity of the pendulum, and the reward here was the continuous reward returned by the environment.
The performance is stochastic, but the general trend of improvement is clear. Over time, the agent learns to improve its average reward per step. The best reward at a step is 0, and the worst reward is around -16. Each episode runs for 1000 time steps. We present the results in Figure 10.

6 Conclusion
We experimented with neuroscientific algorithms to approach a variety of problems including pattern differentiation, Chrome Dino, XOR, and several classic reinforcement learning problems, including Mountain Car (continuous and discrete) and Pendulum. Our approach implemented spike-timing dependent plasticity, the formation of long and short-term memory, and an intrinsic curiosity-driven reward signal. Our project successfully created a generalizable agent that is capable of learning to solve problems with or without being given an explicit reward signal. Overall, we demonstrate a unique, neuro-inspired approach to reinforcement learning.
Future work would include developing a mode of learning with lower variance and less stochasticity so that performance is more consistent. Ideally, performance improvement would be relatively monotonic with slight fluctuations due to exploration. We also aim to extend our model to a wider range of reinforcement learning problems, especially an environment like MuJoCo where our novelty metric could be more useful to the agent. We also aim to implement more biologically-inspired mechanisms in our algorithm to more closely approximate the brain.
References
- Debanne et al. (1994) Debanne, D., Gähwiler, B., and Thompson, S. M. Asynchronous pre-and postsynaptic activity induces associative long-term depression in area ca1 of the rat hippocampus in vitro. Proceedings of the National Academy of Sciences, 91(3):1148–1152, 1994.
- Donahoe (1997) Donahoe, J. W. Chapter 18 - selection networks: Simulation of plasticity through reinforcement learning. In Donahoe, J. W. and Dorsel, V. P. (eds.), Neural-Network Models of Cognition, volume 121 of Advances in Psychology, pp. 336 – 357. North-Holland, 1997. doi: https://doi.org/10.1016/S0166-4115(97)80104-5. URL http://www.sciencedirect.com/science/article/pii/S0166411597801045.
- Engel et al. (2015) Engel, T. A., Chaisangmongkon, W., Freedman, D. J., and Wang, X.-J. Choice-correlated activity fluctuations underlie learning of neuronal category representation. Nature Communications, 6:6454 EP –, Mar 2015. URL https://doi.org/10.1038/ncomms7454. Article.
- Florian (2005) Florian, R. V. A reinforcement learning algorithm for spiking neural networks. In Seventh International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC’05), pp. 8 pp.–, Sep. 2005. doi: 10.1109/SYNASC.2005.13.
- Florian (2007) Florian, R. V. Reinforcement learning through modulation of spike-timing-dependent synaptic plasticity. Neural Computation, 19(6):1468–1502, 2007.
- Haber et al. (2018) Haber, N., Mrowca, D., Fei-Fei, L., and Yamins, D. L. K. Learning to Play with Intrinsically-Motivated Self-Aware Agents. arXiv e-prints, February 2018.
- Hiratani & Fukai (2014) Hiratani, N. and Fukai, T. Interplay between short- and long-term plasticity in cell-assembly formation. PLoS ONE, 9(7):e101535, 2014.
- Kasabov (2014) Kasabov, N. K. Neucube: A spiking neural network architecture for mapping, learning and understanding of spatio-temporal brain data. Neural Networks, 52:62–76, 2014.
- Kempter et al. (1998) Kempter, R., Gerstner, W., and van Hemmen, J. L. Spike-based compared to rate-based hebbian learning. In NIPS, 1998.
- Lee et al. (2018) Lee, C., Panda, P., Srinivasan, G., and Roy, K. Training deep spiking convolutional neural networks with stdp-based unsupervised pre-training followed by supervised fine-tuning. Frontiers in Neuroscience, 12:435, 2018. ISSN 1662-453X. doi: 10.3389/fnins.2018.00435. URL https://www.frontiersin.org/article/10.3389/fnins.2018.00435.
- Longtin (2013) Longtin, A. Neuronal noise. Scholarpedia, 8(9):1618, 2013. doi: 10.4249/scholarpedia.1618. revision #137114.
- Malik et al. (2013) Malik, B., Gillespie, J., and Hodge, J. Cask and camkii function in the mushroom body α′/β′ neurons during drosophila memory formation. Frontiers in Neural Circuits, 7:52, 2013. URL https://www.frontiersin.org/article/10.3389/fncir.2013.00052.
- Martinez-Rubio et al. (2018) Martinez-Rubio, C., Paulk, A. C., McDonald, E. J., Widge, A. S., and Eskandar, E. N. Multimodal encoding of novelty, reward, and learning in the primate nucleus basalis of meynert. Journal of Neuroscience, 38(8):1942–1958, 2018. ISSN 0270-6474. doi: 10.1523/JNEUROSCI.2021-17.2017. URL http://www.jneurosci.org/content/38/8/1942.
- Rescorla & Wagner (1972) Rescorla, R. A. and Wagner, A. R. A theory of pavlovian conditioning: Variations in the effectiveness of reinforcement and nonreinforcement. Classical conditioning II: Current research and theory, pp. 64–99, 1972.
- Salgado et al. (2012) Salgado, H., Köhr, G., and Treviño, M. Noradrenergic ‘tone’ determines dichotomous control of cortical spike-timing-dependent plasticity. Scientific Reports, 2(1), may 2012. doi: 10.1038/srep00417. URL https://doi.org/10.1038/srep00417.
- Stanley & Miikkulainen (2002) Stanley, K. O. and Miikkulainen, R. Efficient reinforcement learning through evolving neural network topologies. In Proceedings of the 4th Annual Conference on Genetic and Evolutionary Computation, GECCO’02, pp. 569–577, San Francisco, CA, USA, 2002. Morgan Kaufmann Publishers Inc. ISBN 1-55860-878-8. URL http://dl.acm.org/citation.cfm?id=2955491.2955578.
- Wise & Rompre (1989) Wise, R. A. and Rompre, P.-P. Brain dopamine and reward. Annual review of psychology, 40(1):191–225, 1989.