\jyear

2022

Mitigating Communication Costs in Neural Networks: The Role of Dendritic Nonlinearity

Xundong Wu^∗ Pengfei Zhao^∗ Zilin Yu^∗ Lei Ma^∗ Ka-Wa Yip Huajin Tang Gang Pan Tiejun Huang [ [ [ [ [ [

Abstract

Our comprehension of biological neuronal networks has profoundly influenced the evolution of artificial neural networks (ANNs). However, the neurons employed in ANNs exhibit remarkable deviations from their biological analogs, mainly due to the absence of complex dendritic trees encompassing local nonlinearity. Despite such disparities, previous investigations have demonstrated that point neurons can functionally substitute dendritic neurons in executing computational tasks. In this study, we scrutinized the importance of nonlinear dendrites within neural networks. By employing machine-learning methodologies, we assessed the impact of dendritic structure nonlinearity on neural network performance. Our findings reveal that integrating dendritic structures can substantially enhance model capacity and performance while keeping signal communication costs effectively restrained. This investigation offers pivotal insights that hold considerable implications for the development of future neural network accelerators.

1]Zhejiang Lab, China 2]Beijing Academy of Artificial Intelligence, China 3]Bytedance, China 4]National Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University, China 5]National Biomedical Imaging Center, Peking University, China 6]Zhejiang University, China

^*^*footnotetext: First authorship: These authors share first authorship Study originated at BAAI. Direct correspondence to X.W. at [email protected]

1 Introduction

In the past decade, we have observed a remarkable increase in artificial neural network (ANN) utilization across different domains. To some extent, this gives us an impression that AI is closing in on human-level intelligence silver_mastering_2017 ; openai2023gpt4 . Looking back to the beginning of neural networks, we will find out that those ANNs were structured to mimic the neuronal networks in our brains. However, it is crucial to acknowledge that neurons in contemporary ANNs exhibit considerable differences from their biological counterparts. The following equation can represent a typical ANN neuron:

h=\sigma(\sum_{i=1}^{n}{w_{i}x_{i}+b})\,.

(1)

Here, $\sigma$ denotes the nonlinear output function, $w_{i}$ and $x_{i}$ correspond to the weights and inputs, and $b$ is the bias term. These neurons, commonly called point neurons, are characterized by their simple weighted summation properties, which contrast with the intricate dendritic structures observed in biological neurons, as illustrated in Fig. 1.

Refer to caption — Figure 1: (A, B, C) Illustration of three representative neurons showcasing distinct dendritic structures from left to right: A chicken bipolar neuron wang2012vivo , a human hippocampal pyramidal neuron benavides2020differential , and a ferret neocortical pyramidal neuron adusei2021morphological . All neuron models were derived from the Neuromorpho.org database ascoli2007neuromorpho . (D) Portrays a layer of a neural network made up of point neurons, as characterized by Equation 1. (E) Illustrates a comparable network layer, but composed of dendritic neurons as detailed by Equation 3. An exemplary dendritic neuron is highlighted within the red dotted line for clearer understanding.

The dendritic structure is indispensable in neuronal network computation because it offers a better surface-area-to-volume ratio stuart2016dendrites ; chklovskii2000optimal . Unlike cells with a more compact shape, this attribute enables neurons to gather synaptic inputs through their branched and elongated form effectively. In contemporary ANN models processed on general-purpose computing hardware, such as CPUs and GPUs, a physical dendrite structure is no longer required to enhance information collection efficiency.

Suppose a dendritic structure is not needed for collecting synaptic inputs, and point neuron-based ANN models have been quite successful in the last decade. Is the dendritic structure just a fancy decoration no longer necessary for our modern ANNs? Evidence suggests that localized nonlinear signal processing happens inside the dendritic tree. Therefore we cannot dismiss whether dendrites should play a significant role for ANNs yet.

In recent decades, experimental and computational neuroscience research has curated strong evidence that dendrites actively funnel synaptic inputs toward cell bodies instead of passively. Dendrites are active because their membrane is embedded with many voltage-gated ion channels, for example, voltage-gated sodium, calcium, and NMDA channels magee2000dendritic ; schiller2000nmda ; polsky2004computational ; major2013active . Those channels lead to the nonlinearity of the dendritic input-output function. Earlier studies have assigned many different roles to the active dendrites, including counter-balance spatial attenuation at the distal end of dendrites, improving model expressivity, enabling efficient learning, and enabling dendrites to detect temporal sequences poirazi_impact_2001 ; jadi2012location ; wu2018improved ; jones2021might ; richards2019dendritic ; stuart2016dendrites ; kastellakis2019synaptic .

We can assign many more roles to the active dendrites with diverse signal-processing functions. However, it is well established that any nonlinear function performed by active dendrites can be replicated by a series of point neurons, as per the Universal Approximating Theory park1991universal . Given this, the question remains: Why are dendrites necessary in the nervous system, Are dendrites relevant for ANNs?

We endeavor to address this crucial question via a machine-learning perspective. Through our analysis, we identify the central role of active dendrites to efficiently enhance model capacity without incurring excessive communication overhead. Notably, recent research has underscored communication as the predominant factor in energy consumption for both ANNs and biological neural networks during computation, as highlighted in the works of Dally et al. dally_model_2022 and Levy et al. levy2021communication . Our study illuminates the pivotal role of dendrites within neuronal networks, offering insights with significant implications for practical applications in a real-world context.

2 Results

In this study, we aim to identify the primary role played by the active dendritic structure of neurons. To this end, we reduce dendritic structure to a dual-layered neural network poirazi_pyramidal_2003 ; polsky2004computational , as exemplified in Fig. 1E. The mathematical representation of the simplified dendritic neuron employed in our study is provided by the equations (Eqs. 2, 3) shown below. The first line of the expression defines the dendritic computation, where $W_{i}$ represents the weight vector for a specific dendrite and $X$ denotes the input activation vector. The second line of the expression illustrates that the outputs of every $K$ dendrites are aggregated to yield the output of a neuron.

	$\displaystyle d_{i}$	$\displaystyle=\sigma({W_{i}X+b_{i}})\,,$		(2)
	$\displaystyle h$	$\displaystyle=\theta(\sum_{i=1}^{K}{d_{i}})\,.$		(3)

In the proposed architectural framework, incoming data is initially integrated at each dendrite before undergoing a transformation via a nonlinear function, denoted as $\sigma$ . Subsequently, the outputs of these nonlinear functions are aggregated and, if necessary, further processed by an optional nonlinear function represented by $\theta$ (not used in this study). This refined output is then transmitted to the downstream recipients.

It is imperative to note that while each dendritic unit possesses a similar information-processing capacity to a point neuron, a distinct divergence exists in how their respective outputs are conveyed to downstream neurons. Contrary to a point neuron’s output, which is independently channeled to the downstream neurons, dendritic outputs necessitate sharing a common channel with fellow dendrites of the same neuron for information dispatch.

2.1 Dimension expansion with active dendrites

Before exploring the experimental aspects in-depth, we want to first develop an intuitive comprehension of dendrites and their significance in biological brains, particularly from a machine-learning standpoint.

Dimension expansion is an essential technique within the machine learning domain, which facilitates the mapping of original input data into an alternate basis within a higher-dimensional space, thereby enhancing pattern separation capabilities cayco2017sparse ; marr_theory_1969 . It is postulated that this methodology bears a striking resemblance to strategies employed within biological brains, such as in the vertebrate cerebellum marr_theory_1969 . In this context, a relatively limited number of mossy fibers project onto a substantially larger number of granule cells, as demonstrated in studies by luo_architectures_2021 ; sanger_expansion_2020 . Analogously, a similar expansion of inputs can be observed in various sensory pathways, such as in the case of cats, where the input signals from the lateral geniculate nucleus to the V1 cortex undergo a 25-fold expansion babadi2014sparseness ; olshausen2004sparse .

On the other hand, through empirical exploration, researchers and practitioners of deep neural networks have discovered that scaling up networks is the core receipt to achieve good model performance nakkiran_deep_2021 . The mechanism behind this scaling behavior remains elusive, with dimension expansion potentially playing a role. Both expanding feature dimensionality and model capacity are associated with the high costs in biological and artificial neural networks. In the case of the mossy fiber projection to granule cells, granule cells account for 99% of all neurons in the cerebellum consalez_origins_2021 ; sanger_expansion_2020 .

The scaling behavior observed in artificial neural networks has led to the adoption of large models such as GPT-3 brown2020language , ChatGPT openai_chatgpt_2022 and MT-NLG smith2022using . With the advent of these greatly expanded models, both biological and artificial neural networks face increased costs. In biological brains, more synapses and neurons are required to carry out tasks. In artificial neural networks, larger memory space is required to store weights and intermediate activation values. More computing hardware is needed to handle the expanded computing needs, resulting in a high energy cost.

2.2 Reducing communication cost

The energy required for computing in neural networks is considerable, yet it is not the primary cost involved. The dominant cost in the computing process of neural networks stems from communication rather than computing itself. In biological brains, only a small fraction of energy is spent on the computing part. As highlighted by Levy et al. levy2021communication , the communication process consumes 35 times more energy than the computing part of the brain.

To put things in perspective, communication costs in artificial neural networks can be orders of magnitude higher than computing costs. For instance, the energy cost of adding two 32-bit numbers may only be 20 femtojoules (fJ), but fetching those two numbers from memory can consume 1.3 nanojoules (nJ). This means that the communication process in this example consumes 64,000 times more energy than the computing process dally_model_2022 .

Artificial neural networks must grapple with controlling communication costs like their biological counterparts. By examining biological neuronal networks, we can glean strategies to minimize these costs in artificial systems. Our study demonstrates that incorporating active dendrites can play a significant role in addressing this issue.

2.3 Evaluating efficiency of dendritic structure

To bolster a neural network’s capacity for encoding information and enhancing expressivity, a prevalent technique involves widening the network, specifically by adding more neurons to hidden layers. This approach has been consistently demonstrated to increase a model’s capacity and improve its generalization performance. As previously noted, communication energy consumption constitutes a significant cost in neural network computing. Each hidden layer neuron possesses its own nonlinear output function in a commonly utilized artificial neural network. It creates a one-to-one relationship between each nonlinear function and the hidden layer activation output to the next layer. Conversely, a dendritic neuron generates its activation output by aggregating outputs from multiple nonlinear functions, enabling it to form more efficacious synapses than a point neuron. This prompts the question: Does incorporating nonlinear dendrites into a neuron serve as an effective means of augmenting model capacity? To address this inquiry, we undertook a series of experiments.

In the first part of our experiment, our objective is to scrutinize the impact of amalgamating active dendritic outputs on the behavior of neural networks. Although incorporating dendrites into a neuron can theoretically bolster its information storage capacity, given that more synapses are available for storing information poirazi_impact_2001 , whether this method is efficient persists. To address this, we draw comparisons between models composed of point neurons and those integrating dendritic structures of varying configurations.

Dense model on ImageNet dataset

We begin by employing the Resnet-18 network he2016deep as a baseline point neuron-based model, a widely utilized computer vision model. In the case of models featuring active dendrites, we substitute point neurons in the original comparative models with dendritic neurons, as illustrated in Fig. 1E. We ensure that each nonlinear summation unit—whether it be a point neuron or an active dendrite unit—receives no more than one copy of input from the preceding network layer.

We first compare the baseline models with models where point neurons are directly substituted with dendritic neurons while maintaining the overall architecture. For this part of the experiment, each neuron’s dendrite receives the same set of inputs from the last layer. The models were trained with widely used ImageNet dataset deng2009imagenet . Further details about model training, evaluation and architecture can be found in the Methods section.

The results of this part of the experiment are displayed in the left half of Fig. 2. (A) shows the model training loss values, (B) illustrates the model accuracy on the training set, and (C) shows the test accuracy of the models. Each curve in the three panels compares models with the same inter-layer communication cost. The data points on the left end of each curve are from models comprised of point neurons, while the remaining data points are from models with dendritic neurons of different numbers of dendrites per neuron, indicated on the $x$ -axis. We also scaled the baseline model by proportionally increasing or decreasing the number of channels per network layer, shown as different curves. In this way, the communication bandwidths between network layers were proportionally scaled.

The results shown in Fig. 2A and B indicate that adding dendrites can lead to an improved model expressivity, leading to better fittings to the training data. When comparing models of the same inter-layer communication budget, the models with more dendrites consistently exhibit lower loss values and higher training accuracy than those with fewer or no dendrites.

Is it possible to translate the enhanced fitting capabilities conferred by dendritic neurons into tangible benefits, as measured by model test accuracy? As illustrated in Fig. 2C, incorporating additional dendrites into each neuron consistently results in improvements in test accuracy across all four distinct levels of inter-layer communication cost. It is evident that, under the same hidden layer communication channel width level, the integration of active dendrites can substantially enhance model capacity and performance, and the effect remains stable across different inter-layer communication width scales.

Enhancing model capacity while keeping a reasonable level of communication cost is a desirable goal. The increased number of parameters associated with additional dendrites can pose a significant challenge for networks that require a transfer of weights between on-chip and off-chip locations. Those dendrites’ additional computing and space requirements can also be a big issue. To address this challenge, we investigated the effectiveness of replacing point neurons with dendritic neurons in computing cost, as measured by the total number of model parameters and FLOPs required for inference.

The outcomes of this experiment segment can be observed in the right half of Fig. 2. We compare three distinct levels of computational complexity to provide a clearer understanding of the results. Assume a dendritic neuron contains $K$ branches, for the dense model we study here, each dendrite receives the same number of inputs/weights as a point neuron; that is, a dendritic neuron receives $K$ times more inputs than a point neuron with the same input dimensionality. Assume that two sequential fully connected neural network layers both have $D$ channels. For the second network layer, the computational complexity and the number of parameters will both be $D^{2}$ . Then for two dendritic neuron layers with $\hat{D}$ channels, assume each neuron is equipped with $K$ dendrites, the computational and parametric complexity will be $K\hat{D}^{2}$ . Therefore, for a model of dendritic neuron with $K$ dendrites to have the same level of computational complexity as a point neuron-based model, we need to reduce the number of inter-layer communication channels in each network layer to $1/\sqrt{K}$ of the original numbers, that is $\hat{D}=D/\sqrt{K}$ .

In Fig. 2D, E, and F, the blue dashed curves represent experimental results obtained from various models with standard complexity. The leftmost data point corresponds to the standard ResNet-18 model, which serves as a baseline for this group. Subsequent data points to the right denote dendritic models with $K$ values of 4, 16, and 64, respectively. Concomitantly, these models’ channels have been adjusted to 1/2, 1/4, and 1/8 of the original model’s values, respectively, barring the input and output network layers. By maintaining this configuration, data points on the same curve exhibit equivalent parametric and computational complexities.

The orange curves demonstrate data from models in which the number of inter-layer communication channels has been uniformly scaled up by a factor of two, while the green dashed curves represent models where the number of channels has been scaled by a factor of four. To facilitate a comprehensive understanding of the data, channel scale factors for each model, as compared to the standard model, have been explicitly labeled on the curves in Fig. 2F.

Our analysis yields a particularly intriguing result concerning the performance of dendritic neuron models compared to point neuron-based models under the constraints of equivalent computing and parametric complexity. As depicted in Fig. 2D, E, and F, we observe that a dendritic neuron model is capable of achieving comparable or superior fitting power relative to an equivalent point-neuron-based model when the channel width is set to be greater than or equal to one-fourth of the width of the baseline model. Remarkably, when the channel width is increased to half of the width of the baseline model, the dendritic neuron-based model consistently outperforms its point neuron-based counterpart regarding test accuracy.

Further Findings

To further substantiate our research and enhance the robustness of our findings, we conducted supplementary experiments using a diverse array of model architectures and datasets. This analysis included an assortment of models encompassing those lacking residual connections, others that employed sparse network connections, as well as those leveraged transformer-based architectures. For the sake of clarity, we have included these additional results in the Appendix.

3 Communication cost analysis

The experimental analysis presented above has demonstrated that dendritic models can provide superior model capacity and performance while maintaining constrained inter-layer communication cost. To gain a better understanding on the benefits dendritic models can offer, we performed theoretical analysis on the full communication cost of point neuron and dendritic neuron-based models. Results reported in this part are based on the following considerations:

•

We study and quantify the data movement process for computation between two sequential neural network layers, which can be easily generalized to many-layer settings.
•

We evaluate the suitability of adopting dendritic networks for real-world applications, where wiring that follows a city-walk route is more relevant, by measuring the data movement path length using Manhattan distance metric.
•

For the sake of clarity, a standard feed-forward network structure is employed for the analysis. However, the results obtained can also be applied to other network architectures.
•

The computation of each network layer is executed within a discrete square area with dimensions of one-by-one.
•

We only consider movement of neuron output values. This setting is most relevant for in-memory or near-memory computing. This is also relevant for computing with large batch size. No tiling based acceleration is considered.

We model the signal communication cost $C_{T}$ as a sum of three parts: $C_{A}$ is the cost for all processing elements (PEs) to propagate their outputs toward a convergence point (top-right corner is used) at the edge of the chip. $C_{I}$ characterizes the process of transmitting data between two computational stages (network layers). $C_{E}$ describes the communication cost associated with distributing signals within a chip for layer inference. This is represented by the following equation:

C_{T}=C_{A}+C_{I}+C_{E}\,.

(4)

Cost with point neuron model

Our analysis initiates with a model predicated on point neurons. As previously mentioned, our investigation focuses on two network layers. We assume that the first layer sends an output of $D$ dimensions to the second layer. For convenience and without loss of generality, we assume that each of the $D$ dimensions originates from one PE on the chip.

In order to arrange $D$ PEs on die area of size $1\times 1$ , each PE must have a height and width of $l={1}/{\sqrt{D}}$ , resulting in an area size of ${1}/{D}$ . Similarly, the second layer is also composed of $D$ PEs of the same size. Consequently, we obtain a grid of $N$ by $N$ PEs with $N=\sqrt{D}$ , with a distance of $l$ between the center of each pair of neighboring PEs. See Fig. A4-A for a visual illustration.

For this arrangement we have

C_{A}=D(\sqrt{D}-1)l=D-\sqrt{D}\,,

(5)

as measured with Manhattan distance. Furthermore, an illustrative example of signal propagation within this context is provided in Fig. A4-B. Derivation of Eq. 5 can be found in Appendix. A4.

$C_{I}$ is very architecture dependent, thus we will abstain from attempting to estimate this component. We note $C_{I}$ is linearly proportional to the size of $D$ . In scenarios where two network layers reside on different physical devices or the cost of moving data from PEs and memory is high, this portion of the cost may become the dominant communication expenses.

We assess $C_{E}$ with minimal rectilinear spanning tree (MRST) algorithm murty2008graph . Given a grid of $N\times N$ PEs, the objective is to deliver every dimension of the data to each PE. The MRST algorithm enables us to determine the minimal path length required to connect all PEs, which is $(N^{2}-1)\cdot l$ . An example path is illustrated in Fig. A4C. Consequently, we obtain the cost of delivering data as

C_{E}=(N^{2}-1)\cdot l\cdot D=(D-1)\sqrt{D}\,.

(6)

Cost with dendritic neuron model

This section estimates the communication cost associated with a model based on dendritic neurons. To ensure a fair comparison, we have maintained the number of parameters and floating-point operations (FLOPs) consistent with those in the point neuron model scenario. Similar to the case of point neuron-based models, we also give the total cost $\hat{C}_{T}$ in the following equation:

\hat{C}_{T}=\hat{C}_{A}+\hat{C}_{I}+\hat{C}_{E}\,.

(7)

Given that each neuron has $K$ dendrites, one layer of the model under examination will have a total of $M=D\sqrt{K}$ dendrites. As illustrated in Fig. A4-D, every group of $K$ dendrites aggregates to form a single output dimension. Consequently, the first layer will produce an output with a dimensionality of $\hat{D}={D}/{\sqrt{K}}$ , which serves to maintain an equivalent computational complexity as the point neuron-based model previously described. We reiterate our assumption that those $\hat{D}$ neurons are arranged in a grid format, specifically of size $\hat{N}\times\hat{N}$ , with $\hat{N}=\sqrt{\hat{D}}$ .

We postulate that the computation of each dendrite is processed by one PE. In this scenario, the die area is divided into $M$ units, with each unit occupying a specific area. The height and width of this area, denoted by $\hat{l}$ , can be calculated as $\hat{l}=1/\sqrt{M}$ . Through this, we arrive at the size of a PE for processing each dendrite being $\frac{1}{D\sqrt{K}}$ , which is $1/{\sqrt{K}}$ of the point neuron-based model PE die size. This corresponds to the assumption that a dendrite in this analysis receives a proportion of $1/{\sqrt{K}}$ of the inputs that a point neuron receives.

In light of the aforementioned derivation, we note that the signal transfer cost, denoted as $\hat{C}_{A}$ , consists of two components. The first component, $\hat{C}_{AG}$ , refers to the cost of aggregating dendritic outputs for each neuron. The second component, $\hat{C}_{AA}$ , represents the cost of transmitting the aggregated data of all neurons off the die. Their expressions are as follows.

	$\displaystyle\hat{C}_{AG}=(K-1)\cdot\hat{D}\cdot\hat{l}$
	$\displaystyle\;\;\;\;\;\;\;=\sqrt{D}(K^{1/4}-K^{-3/4})<\sqrt{D}K^{1/4}\,,$		(8)
	$\displaystyle\hat{C}_{AA}=\hat{N}\hat{N}(\hat{N}-1)\hat{l}(\sqrt{K})<\frac{D}{\sqrt{K}}\,,$		(9)
	$\displaystyle\hat{C}_{A}=\hat{C}_{AG}+\hat{C}_{AA}<\sqrt{D}K^{1/4}+\frac{D}{\sqrt{K}}\,.$		(10)

Akin to the point neuron models, we will not attempt to derive $\hat{C}_{I}$ , although we have the relationship of $C_{I}=\sqrt{K}\cdot\hat{C}_{I}$ under the assumptions of the equivalent parameter/FLOPs count setting.

As for the $\hat{C}_{E}$ component, note that the second layer receives $\frac{D}{\sqrt{K}}$ inputs and consists of $M$ units. Utilizing the MRST method, the cost associated with one-dimensional input connecting to $M$ units can be computed as $(M-1)\cdot\hat{l}$ . We arrive at

\hat{C}_{E}=\frac{D}{\sqrt{K}}(D\sqrt{K}-1)\cdot\hat{l}\approx D^{\frac{3}{2}}/K^{\frac{1}{4}}\,.

(11)

Comparative analysis

From the above derivations, we are equipped to compare the communication costs associated with point neuron-based and dendritic neuron-based models under different configurations. As previously established, $C_{I}$ and $\hat{C}_{I}$ are highly dependent on the architecture, and $\hat{C}_{I}$ is $\sqrt{K}$ times smaller than $C_{I}$ . For the scope of this analysis, we will focus on analyzing $C_{A}$ , $\hat{C}_{A}$ , $C_{E}$ , and $\hat{C}_{E}$ .

Fig. 3A depicts the ratio of communication costs between the dendritic neuron-based model and the point neuron-based model. The results for different inter-layer dimensions $D$ of the point neuron models and various channel reduction ratios $\sqrt{K}$ are displayed within the figure. Notably, as $\sqrt{K}$ increases, dendritic neuron-based models consistently demonstrate lower communication costs compared to their point neuron-based counterparts at the same level of computational complexity.

Upon examining $\hat{C}_{A}$ and $\hat{C}_{E}$ , it becomes apparent that $\hat{C}_{E}$ is typically much larger than $\hat{C}_{A}$ when $D$ assumes large values, a common scenario in this context. Moreover, in models characterized by sparse connectivity, $\hat{C}_{A}$ remains unchanged regardless of connection sparsity levels, whereas $\hat{C}_{E}$ can vary. As a result, it is necessary to delve deeper into the $\hat{C}_{E}$ term and scrutinize its behavior within the context of sparse models.

In congruence with the approach adopted for dense models, we also employ the MRST algorithm to estimate the communication cost when dealing with sparse models. Considering the variability in the communication cost due to different sparse connection patterns, we sample a set of 100 random connection patterns for each setting to provide a robust estimate of the average cost.

Fig. 3B presents $\hat{C}_{E}$ under varying model sparsity levels and diverse numbers of dendrites per neuron. Our observations reveal a negative power relationship between $\hat{C}_{E}$ and $K$ , with a power of $0.51$ . This relationship is accurately mirrored in the $K^{1/4}$ factor presented in simplified version of $\hat{C}_{E}$ shown in Eq. 11.

4 Discussion

In this study, we were inspired by the observation that biological neurons aggregate the outputs of multiple dendrites to form a single output. Within each dendrite, the integration of synaptic inputs is nonlinear rather than a linear summation due to the presence of various voltage-gated ion channels. Accordingly, we have constructed neural network units that mimic these characteristics by integrating their synaptic inputs nonlinearly.

In the process of pooling outputs from multiple units, there is an inherent loss of information due to the many-to-one nature of the pooling function. This suggests a potential decline in model performance. Such behavior is indeed noticeable when dealing with models that possess limited inter-layer bandwidth, that is those models that are equipped with a smaller number of channels. This observation holds when the models under comparison maintain the same parametric complexity. It is likely that this phenomenon led Goodfellow et al. goodfellow_maxout_2013 to observe suboptimal model performance when they utilized architectures that pool ReLU units.

However, our findings suggest that once the bandwidth is increased beyond a certain threshold, it becomes unnecessary to augment the model size by adding more channels. Instead, the addition of extra dendrites to neurons appears to offer superior efficiency in enhancing model performance.

The implications of this discovery are substantial for both theoretical perspectives and practical applications. Theoretically, it highlights that when we widen the architecture to expand models, we are actually augmenting the number of features within the hidden layers rather than enhancing the features propagated toward subsequent layers. This insight refines our understanding of the internal dynamics of neural network development and behavior.

Practically, the adoption of an active dendritic structure enables models to achieve superior performance compared to point neuron-based models, given a fixed inter-layer communication budget. This can lead to a linear reduction in memory access during neural network inference and a smaller memory footprint, particularly when large batch sizes are employed during model inference.

Our comprehensive analysis of communication cost further reveals that adopting a dendritic structure can also yield a reduction in on-chip communication costs, following a square-root relationship to the inter-layer communication reduction ratio. Considering that communication costs dominate energy consumption in contemporary computing chips, our findings could significantly influence the design of future neural network accelerators.

Our findings shed light on critical insights; however, our comprehension of why channel sharing among a group of features can yield performance on par with, or even superior to, traditional models still needs to be more refined. In conventional models, each feature, or nonlinear neuron, establishes multiple connections directly to the succeeding network layer. One potential explanation posits that the pooling process in our dendritic layer can be viewed as a low-rank approximation of a significantly larger weight matrix $W\in\mathbb{R}^{D\sqrt{K}\times D\sqrt{K}}$ using a smaller weight matrix $W_{d}\in\mathbb{R}^{\frac{\hat{D}}{\sqrt{K}}\times D\sqrt{K}}$ . However, this interpretation provides only a limited perspective in comprehending this process. These gaps in our understanding necessitate further research to appreciate the dynamics and implications of such channel-sharing configurations fully.

Notably, the dendritic models used in our study are equipped with a single layer of nonlinearity, as opposed to two layers as suggested in Eq. 3. However, we observed improved performance in dendritic neuron models when adding an extra nonlinear function, particularly when neurons were equipped with many dendrites (results not shown). It would be intriguing to explore how different types of nonlinearity and more advanced nonlinear architectures would impact models.

Finally, our observation aligns with patterns observed in the evolution and development of the brain. In simpler, early-stage brains, neurons exhibit less structural complexity, consistent with the preference for point neuron-based models in smaller neural networks. However, as brains evolve to more advanced stages, neurons exhibit greater complexity and richer connectivity patterns, analogous to the preference for dendritic neuron-based models in larger neural network architectures stuart2016dendrites . This parallel suggests that incorporating dendritic neurons in artificial neural networks may reflect fundamental principles underlying the organization and functionality of biological neural systems. Our study contributes valuable insights into the comparative utility of dendritic and point neuron models in neural network design and offers guidance for their applications in various computational contexts.

5 Methods

5.1 Datasets

The present study leverages three commonly used datasets: ImageNet, CIFAR-100, and LibriSpeech, for model training and evaluation. These datasets are commonly served as benchmarks in deep learning research.

ImageNet Dataset: For this study, we use the ILSVRC 2012 subset of the ImageNet dataset, which consists of 1.2 million training images and 50,000 validation images from 1,000 categories deng2009imagenet . The images vary in size and are resized to a fixed resolution of 224x224 pixels for uniformity, per the standard ResNet procedure he2016deep . The typical data augmentation techniques, such as random cropping, random horizontal flipping, and color jittering, were applied during training to enhance the model’s generalization ability.

CIFAR-100 Dataset: The dataset consists of 60,000 32x32 color images in 100 classes, with 600 images in each class. There are 50,000 training images and 10,000 test images krizhevsky2009learning . Like the ImageNet data processing, we followed the typical data augmentation procedure he2016deep .

LibriSpeech dataset: The dataset is a publicly available English speech corpus for Automatic Speech Recognition (ASR) training and evaluation from the LibriVox project’s audiobooks. It consists of 1000 hours of transcribed speech, divided into training, development, and testing subsets panayotov2015librispeech . The experiment utilizing this dataset can be found in Appendix F.

5.2 Model architectures

In this study, we primarily used the ResNet-18 architecture as the baseline model. ResNet-18 is an 18-layer deep residual neural network, a seminal model proposed by He et al. he2016deep . The baseline configuration of ResNet-18 encapsulates an initial convolutional layer, followed by four residual blocks, each of which consists of two convolutional layers. This pattern constitutes the primary structure of our working model; in contrast to the original ResNet-18 model, our adapted architecture positions the shortcut connection after the ReLU (Rectified Linear Unit) activation function. This modification is imperative to ensure the compatibility of the dendritic structure with the model architecture.

For experiments on scaling up networks, we scaled up each network layer by the same designated factor except for the input and output of the model. For models with dendritic neurons, we replaced neurons in the standard model with dendritic neurons with $K$ dendrites as specified by the experiment setting, except for the input and output layers of the model. To maintain the uniform model complexity scaling throughout the model, we equip the input layer and the penultimate layer of the model with neurons of $\sqrt{K}$ instead of $K$ dendrites. The same setting is also employed in experiments designed to compare models that share identical inter-layer communication costs.

For models trained on CIFAR-100, we observed training instability. Therefore we clipped the gradient norm to $1.0$ during model training. We also added an extra batch norm to each dendrite to improve model stability. This additional batch norm can be fused with the previous layer and thus will not add extra computation burden at the inference stage.

In addition to models based on the ResNet-18 architecture, we have corroborated our findings using a model devoid of shortcut connections. This strategy ensures that the benefits observed are not strictly confined to a particular architecture. The configuration of this model is delineated in Appendix F, where the corresponding experimental outcomes can also be found.

Moreover, our experimentation extended to the transformer-based model. Within this model, the standard feedforward layers are substituted with network layers based on dendritic neurons. Comprehensive details pertaining to this modification can be found in Appendix F.

5.3 Model training

We trained all models with a cosine learning rate decay schedule and the SGD optimizer with a momentum of 0.9.

For ImageNet with dense ResNet models, the learning rate was initialized at $0.4$ (instead of $0.1$ to compensate for the batch size used for training), and models were trained for 120 epochs, including two warm-up epochs with a learning rate of $0.04$ . Weight decay was set to $1\times 10^{-4}$ . A batch size of 1024 was employed, and the training was distributed across 8 GPUs.

For ImageNet with sparse ResNet models, the models were trained for 200 epochs with an initial learning rate of 0.1 and 2 warm-up epochs at a learning rate of 0.01. The weight decay parameter was set to $1\times 10^{-4}$ . To achieve a sparse ratio of 85%, we applied L1-unstructured global pruning in 5 rounds, conducted between epochs 40 and 140. Subsequently, the models were trained for an additional 60 epochs.

Finally, for CIFAR-100 models, we trained them for 200 epochs with a learning rate of 0.05, including two warm-up epochs at a learning rate of 0.005. A batch size of 64 was utilized, and the weight decay parameter was set to $5\times 10^{-4}$ .

Our investigation emphasizes the comparative analysis of the performance of various models under identical training conditions, facilitating an equitable assessment of the distinct capabilities of each model. Consequently, all models within the comparison group undergo training with the same hyper-parameters, barring the requisite architecture adjustments. Further details concerning the experiments can be found in the accompanying source code.

5.4 Code availability

The entirety of the code used to produce the findings presented herein will be openly accessible to the public upon the publication of this paper.

References

\bibcommenthead
(1) Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., Chen, Y., Lillicrap, T., Hui, F., Sifre, L., van den Driessche, G., Graepel, T., Hassabis, D.: Mastering the game of Go without human knowledge. Nature 550, 354–359 (2017)
(2) OpenAI: GPT-4 Technical Report (2023)
(3) Wang, Y., Rubel, E.W.: In vivo reversible regulation of dendritic patterning by afferent input in bipolar auditory neurons. Journal of Neuroscience 32(33), 11495–11504 (2012)
(4) Benavides-Piccione, R., Regalado-Reyes, M., Fernaud-Espinosa, I., Kastanauskaite, A., Tapia-González, S., León-Espinosa, G., Rojo, C., Insausti, R., Segev, I., DeFelipe, J.: Differential structure of hippocampal ca1 pyramidal neurons in the human and mouse. Cerebral Cortex 30(2), 730–752 (2020)
(5) Adusei, M., Hasse, J.M., Briggs, F.: Morphological evidence for multiple distinct channels of corticogeniculate feedback originating in mid-level extrastriate visual areas of the ferret. Brain Structure and Function 226, 2777–2791 (2021)
(6) Ascoli, G.A., Donohue, D.E., Halavi, M.: Neuromorpho. org: a central resource for neuronal morphologies. Journal of Neuroscience 27(35), 9247–9251 (2007)
(7) Stuart, G., Spruston, N., Häusser, M.: Dendrites. Oxford University Press, Oxford (2016)
(8) Chklovskii, D.B.: Optimal sizes of dendritic and axonal arbors in a topographic projection. Journal of Neurophysiology 83(4), 2113–2119 (2000)
(9) Magee, J.C.: Dendritic integration of excitatory synaptic input. Nature Reviews Neuroscience 1(3), 181–190 (2000)
(10) Schiller, J., Major, G., Koester, H.J., Schiller, Y.: Nmda spikes in basal dendrites of cortical pyramidal neurons. Nature 404(6775), 285–289 (2000)
(11) Polsky, A., Mel, B.W., Schiller, J.: Computational subunits in thin dendrites of pyramidal cells. Nature neuroscience 7(6), 621–627 (2004)
(12) Major, G., Larkum, M.E., Schiller, J.: Active properties of neocortical pyramidal neuron dendrites. Annual review of neuroscience 36, 1–24 (2013)
(13) Poirazi, P., Mel, B.W.: Impact of Active Dendrites and Structural Plasticity on the Memory Capacity of Neural Tissue. Neuron 29(3), 779–796 (2001)
(14) Jadi, M., Polsky, A., Schiller, J., Mel, B.W.: Location-dependent effects of inhibition on local spiking in pyramidal neuron dendrites. PLoS computational biology 8(6), 1002550 (2012)
(15) Wu, X., Liu, X., Li, W., Wu, Q.: Improved expressivity through dendritic neural networks. Advances in neural information processing systems 31 (2018)
(16) Jones, I.S., Kording, K.P.: Might a single neuron solve interesting machine learning problems through successive computations on its dendritic tree? Neural Computation 33(6), 1554–1571 (2021)
(17) Richards, B.A., Lillicrap, T.P.: Dendritic solutions to the credit assignment problem. Current opinion in neurobiology 54, 28–36 (2019)
(18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019)
(19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991)
(20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022)
(21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021)
(22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003)
(23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017)
(24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969)
(25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021)
(26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020)
(27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014)
(28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004)
(29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021)
(30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021)
(31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020)
(32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25
(33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022)
(34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
(35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009)
(36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008)
(37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013)
(38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009)
(39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015)
(40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021)
(41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019)

Appendices

E Supplementary Material for the Derivation of Communication Costs

Without loss of generality, we set a junction point for inter-chip communication to be collected at the top-right corner ( $0$ -th row and $0$ -th column, starting with ID 0 counting from right-to-left, top-to-bottom). Therefore, the total cost of propagating outputs from every PE to the junction point is:

$\displaystyle C_{A}$	$\displaystyle=\left(\sum_{x,y=0}^{N-1}(x+y)\right)l$
	$\displaystyle=\left(\sum_{x,y=0}^{N-1}x+\sum_{x,y=0}^{N-1}y\right)l$
	$\displaystyle=\left(\sum_{x=0}^{N-1}x\sum_{y=0}^{N-1}1+\sum_{x=0}^{N-1}1\sum_{y=0}^{N-1}y\right)l$
	$\displaystyle=\left(\frac{(0+(N-1))(N)}{2}N+N\frac{(0+(N-1))(N)}{2}\right)l$
	$\displaystyle=N^{2}(N-1)l$
	$\displaystyle=D(\sqrt{D}-1)\frac{1}{\sqrt{D}}$
	$\displaystyle=D-\sqrt{D}\,.$	(12)

F Additional experimental analysis

CIFAR-100 dataset with ResNet-18-style models

In this part of the experiment, we utilized the CIFAR-100 dataset krizhevsky2009learning , which comprises 100 distinct object categories. This dataset is commonly employed in various studies. Experimental results are demonstrated in Fig. A5. As the foundation of this part of the investigation, we also adopt the ResNet-18 model as the base architecture.

Our finding reveals a pattern analogous to that observed in the ImageNet experiment, where incorporating dendrites into a model with a fixed inter-layer communication budget consistently yields improved performance. Furthermore, models based on dendritic neurons surpass those on point neurons within the same computational budget, provided the inter-layer communication budget is above a certain threshold.

Sparse model

So far, we have discovered that incorporating the local feature aggregation operation characteristic of dendritic structures can effectively reduce communication costs. A vital question we aim to address is the applicability of our findings to real-world scenarios, given that our experimental setting significantly differs from the biological context in which dendrites typically form sparse connections with their inputs. The sparsification of our model entails a reduction in the inter-layer communication pathways, which could potentially impact its behavior.

It is also worth noting that, at present, sparse neural networks have yet to gain widespread adoption in real-world applications due to the absence of effective hardware accelerators. However, given their potential for reducing computational costs, future advancements will likely promote their widespread use. As such, it is essential to investigate the influence of sparsity on model behavior.

Our results, presented in Fig. A6, demonstrate that sparsity can significantly affect models’ performance with low computational budgets. However, larger models remain mostly unaffected by the investigated sparsity level ( $85\%$ ). Moreover, the model behavior retains a consistent pattern akin to what we observed in the dense model experiments.

Non-Residual Convolutional Neural Network Performance on the ImageNet Dataset

This segment of our experiment was conducted utilizing a convolutional neural network (CNN) model devoid of residual connections. The base model for this experiment was a modified version of the original ResNet-18 network, from which we eliminated the residual connections. The original ResNet-18 model consists of four stages, each featuring two residual blocks. We removed one residual block from both the second and third stages to reduce computing costs. Fig. A7 illustrates our findings derived from this modified, non-residual network.

Transformer model

This section investigates the impact of replacing the feedforward block within transformer-based neural networks. The specific feedforward block in question comprises a classic bottleneck architecture, as illustrated in Fig. A8.

A bottleneck structure enhances the expressive capacity of a network module by expanding the number of channels in the middle layer. Conventionally, if the module input consists of $L$ channels, the middle layer is expanded to comprise $sL$ channels. Small integer values are commonly employed for $s$ in typical transformer-based models, with common choices including 2, 3, or 4. Subsequently, the module’s output is reduced back to the original $L$ channels.

This bottleneck module confers greater expressivity power to the model than a standard two-layer network of $L$ channels while maintaining a modest input/output channel number for the module. This is similar to what dendritic structures try to achieve.

However, the bottleneck structure has an expanded middle layer, necessitating high communication bandwidth. Thus, the question arises: can a dendritic structure supplant the bottleneck structure while conferring additional benefits?

The naive substitution of a bottleneck structure with two dendritic layers is ineffective because the second layer comprises linear neurons. The pooling of linear neuron outputs does not confer inherent advantages to a nonlinear dendritic structure. Consequently, our design only employs a dendritic structure exclusively for the first layer of the block while retaining a linear layer for the second.

More precisely, for a bottleneck structure accepting an input dimension of $L$ and an expansion ratio of $s$ , the corresponding first layer is assigned the dendritic branches equal to $2s-1$ . This configuration maintains the input channel number for both layers at $L$ , preserving the computational and parametric complexity at levels comparable to the original model.

An empirical examination involving a compact transformer model, as proposed by Hassani et al. hassani2021escaping , demonstrates that this modification incurs only a marginal performance decline. Specifically, test accuracy on the ImageNet dataset decreased from 80.9% to 80.6%, a negligible reduction considering the substantial decrease in peak activation output I/O within the block threefold less than before.

Considering the highly tuned nature of the transformer architecture, we posit that additional refinements to the model—particularly adjustments favoring the dendritic structure may unlock further potential for performance enhancement.

Speech recognition task

In addition, we substantiate our theory with a speech recognition task. We employ models trained on the LibriSpeech dataset, which consists of approximately 1,000 hours of spoken English panayotov2015librispeech . Owing to computing resource constraints, we utilize the train-clean-100 and train-clean-360 subsets for model training and the dev-clean subset for model evaluation. The models used in this portion of the experiment are derived from the Jasper model li2019jasper , a 1D convolutional neural network. To lessen the computational burden during model training, we modified the original model by eliminating the dense residual connections and significantly reducing the number of blocks in the model to arrive at a baseline point neuron based model. Further details regarding the modifications to the models can be found in the accompanying code.

For this part, we carry out two distinct sets of experiments. The first set focuses on models of equivalent computational complexity, and the second emphasizes models sharing the same inter-layer communication cost.

In the first set of experiments, we evaluated models of two distinct computational complexity levels, varying the neuron configurations. Specifically, the configurations encompassed point neurons and dendritic neurons with varying numbers of dendrites. The results for this segment of experiments are displayed in Table 1. Analogous to previous experiments, we observed that models utilizing dendritic neurons were able to achieve comparable performance relative to the point neuron-based models with equivalent computational complexity if they are equipped with efficient inter-layer communication bandwidth.

The second set of experiments is conducted employing models that retain the same inter-layer communication cost. Our experimental procedure begins with a point neuron-based model, which possesses one-fourth of the inter-layer communication complexity compared to the baseline model. This point neuron model is subsequently replaced with dendritic neuron models that contain 4 and 16 dendrites respectively. The corresponding results are systematically presented in Table 2. Upon analyzing these results, it becomes apparent that the performance of the model progressively enhances as we incorporate neurons with an increased number of dendrites.

# of Dendrites	Channel scaling factor	Test error
1 st complexity level(baseline)
1 (baseline)	1	7.72
4	1/2	7.92
16	1/4	8.28
2nd Complexity level
1	2	6.69
4	1	6.69
16	1/2	6.89

Table 1: Comparison of the performance of dendritic models with varying numbers of dendrites per neuron on the LibriSpeech dataset. The table presents models with two levels of computational complexity. To maintain equivalent computational complexity when increasing the number of dendrites in a neuron, the number of inter-layer channels is proportionally reduced, as indicated in the table.

# of Dendrites	Channel scaling factor	Test error
1	1/4	15.39
4	1/4	10.49
16	1/4	8.23

Table 2: Performance Evaluation of Dendritic Models with varying dendritic counts per neuron evaluated on the LibriSpeech Dataset. The models in this comparison have the same inter-layer communication cost.