
ES-Parkour: Advanced Robot Parkour with Bio-inspired Event Camera and Spiking Neural Network

Qiang Zhang1,2∗, Jiahang Cao1∗, Jingkai Sun1,2∗, Yecheng Shao3,4, Gang Han2, Wen Zhao2, Yijie Guo2, Renjing Xu1†
†Corresponding author; ∗Equal contribution.
1Microelectronics Thrust, The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China. [email protected], [email protected]
2Beijing Innovation Center of Humanoid Robotics Co., Ltd. [email protected]
3Center for X-Mechanics, Zhejiang University, China.
4Institute of Applied Mechanics, Zhejiang University, China. [email protected]
Abstract

In recent years, quadruped robotics has advanced significantly, particularly in perception and motion control via reinforcement learning, enabling complex motions in challenging environments. Visual sensors like depth cameras enhance stability and robustness but face limitations, such as low operating frequencies relative to joint control and sensitivity to lighting, which hinder outdoor deployment. Additionally, deep neural networks in sensor and control systems increase computational demands. To address these issues, we introduce spiking neural networks (SNNs) and event cameras to perform a challenging quadruped parkour task. Event cameras capture dynamic visual data, while SNNs efficiently process spike sequences, mimicking biological perception. Experimental results demonstrate that this approach significantly outperforms traditional models, achieving excellent parkour performance with just 11.7% of the energy consumption of an artificial neural network (ANN)-based model, yielding an 88.3% energy reduction. By integrating event cameras with SNNs, our work advances robotic reinforcement learning and opens new possibilities for applications in demanding environments.

Index Terms:
Bio-inspired Robot Learning, Legged Robots, Visual Learning, Spiking Neural Network.

Quadruped robotics has advanced significantly in both proprioceptive motion control  [1, 2, 3, 4, 5] and vision-based planning. These robots now perform a wide range of tasks in complex environments [6, 7], demonstrating potential applications in extreme conditions. However, challenges such as diverse lighting conditions, complex terrains, and energy efficiency remain underexplored.

Our study addresses these challenges by leveraging Spiking Neural Networks (SNNs) and event cameras to improve perception and motion control. Event cameras, which detect pixel-level changes at kilohertz frequencies and operate independently of lighting conditions, provide dynamic and high-frequency visual information. SNNs, inspired by biological neurons, process spike signals efficiently, reducing computational demands while maintaining performance. The integration of these technologies bridges the gap between fast joint control cycles and high-frequency perception, crucial for quadruped robots. Additionally, their low power consumption alleviates design trade-offs, reducing the need for bulky cooling systems and batteries.

Using the parkour task as a benchmark, we develop a simulation environment to rigorously evaluate the capabilities of quadruped robots under complex conditions. Unlike previous approaches that rely on 3D sensors such as LiDAR or depth cameras, our work demonstrates the potential of event cameras for effective perception in challenging environments. By combining SNNs and event cameras, we enhance efficiency, reduce computational costs, and advance the feasibility of real-world robotic applications. Figure 2 illustrates our bio-inspired system pipeline.

Refer to caption
Figure 1: Demonstration of a quadruped robot performing parkour using a spiking neural network under extreme lighting conditions. The robot processes event images in real time, where red and blue denote positive and negative events, respectively.

Our work makes three primary contributions:

  1. We pioneer the implementation of a system-level design for quadruped robot parkour using SNNs and event cameras (Figure 1), providing new insights into perception and motion control.

  2. We successfully transition the end-to-end training of the quadruped robot RL network from ANNs to SNNs through a distillation method, significantly reducing computational burden and training complexity.

  3. To our knowledge, this is the first demonstration of complex quadruped robot control tasks across a variety of environments using a brain-inspired sensor. Our work shows that robots can retain robust perception and control capabilities in diverse environments, significantly broadening the application scope of brain-inspired devices in robotics.

Although breakthrough research has already been conducted in drones [8] and autonomous driving [9], the application of event cameras in legged robots remains limited. This scarcity is primarily due to the complexities involved in designing and implementing such systems. However, the combination of event cameras and spiking neural networks (SNNs) holds significant potential for the robotics community. We hope our extensive simulation validation work will pave the way for further advancements in this field.

Refer to caption
Figure 2: Pipeline of our bio-inspired reinforcement learning system. Unlike a standard vision-based robot system, our bio-inspired system is equipped with an event camera that captures event data from diverse scenes. The events are then processed by the spiking neural network, which in turn dictates the robot’s actions in the environment. This brain-inspired approach yields three significant advantages: (1) enhanced stability in motion-intensive scenarios, achieved through the superior temporal resolution of the event data; (2) resilience under fluctuating lighting conditions, ensured by the event camera’s high dynamic range; and (3) inherently low energy consumption of the SNN, which contributes to the system’s overall efficiency.

I Related Work

Legged Robot Agile Locomotion. Legged robots have advanced in agile locomotion through optimization-based [10, 11] and learning-based methods [1, 2]. Trajectory optimization (TO) and Model Predictive Control (MPC) are commonly used for stable, dynamic motion but rely on detailed robot models and terrain estimation. Learning-based methods, in contrast, use neural networks to directly process visual data, avoiding complex modeling. Notable approaches include Rapid Motor Adaptation (RMA) [12] and Adversarial Motion Priors (AMP) [13], which generalize policies across terrains. Recent advances have integrated proprioception with vision, using depth sensors [14] and 3D feature encoding [15] for improved terrain perception.

Robotic Parkour. Robotic parkour focuses on algorithms for navigating complex, high-risk terrains. Approaches like [6] pre-train models under soft constraints and refine them with stricter ones. Hierarchical frameworks [16] combine motion and navigation policies for effective traversal. However, these methods mainly rely on traditional sensors like depth cameras and LiDAR, which struggle in challenging lighting and fast-moving scenes.

Event Cameras and SNNs on Robots. Event cameras capture high-speed, dynamic scenes asynchronously and are increasingly applied in robotics for tasks like Visual-Inertial Odometry (VIO) [17]. However, most methods rely on conventional ANN processing. Spiking Neural Networks (SNNs), which excel in temporal precision, are better suited for handling event-based data. SNNs have been used in control tasks [18] but remain underexplored in real-world robotic applications. While multimodal integration of SNNs with traditional sensors has been explored [19], a unified perception and control framework using SNNs is still lacking.

Refer to caption
Figure 3: Pipeline of our ES-Parkour ANN-to-SNN distillation process. Through distillation, the extreme parkour capabilities of the ANN are transferred to an SNN, which receives input from an event camera. In the warm-up phase, minimizing the Mean Squared Error (MSE) loss between the outputs of the teacher (ANN) and student (SNN) networks ensures the student can closely replicate the teacher’s outputs. After the warm-up phase, the student network demonstrates basic movement capabilities but struggles with complex terrains. Further interaction and optimization enhance its performance on complex terrains, closely aligning it with the teacher’s.

II Method

II-A Build Event Camera in Simulation

Event cameras are bio-inspired sensors that capture relative intensity changes asynchronously. In contrast to standard cameras that output 2D images, event cameras output sparse event streams. When the brightness change exceeds a threshold $C$, an event $e_{k}$ is generated containing the position $\textbf{u}=(x,y)$, time $t_{k}$, and polarity $p_{k}$:

\Delta L(\textbf{u},t_{k}) = L(\textbf{u},t_{k}) - L(\textbf{u},t_{k}-\Delta t_{k}) = p_{k}C. (1)

The polarity of an event reflects the direction of the changes. In this paper, we utilize IsaacGym as the simulation and training environment. IsaacGym is a high-performance robotic simulation platform that provides a rich physical simulation environment, enabling us to efficiently train and test quadruped robots in complex scenarios. However, the IsaacGym platform does not natively support the simulation of event cameras. Therefore, we develop an algorithm to simulate the working principle of event cameras within the IsaacGym environment:

Suppose that, within a small time interval, the brightness constancy assumption [20] holds, i.e., the intensity of a scene point remains unchanged as it moves within a small neighborhood. Using Taylor’s expansion, we can approximate the intensity change as:

\Delta L(\textbf{u},t) = L(\textbf{u},t) - L(\textbf{u},t-\Delta t) (2)
= \frac{\delta L}{\delta t}(\textbf{u},t)\Delta t + O(\Delta t^{2}) \approx \frac{\delta L}{\delta t}(\textbf{u},t)\Delta t, (3)

where $\textbf{u}=(x,y)$ denotes the position. Substituting the brightness constancy assumption $\frac{\delta L}{\delta t}(\textbf{u}(t),t)+\nabla L(\textbf{u}(t),t)\cdot\textbf{v}(\textbf{u})=0$ into the above equation, we obtain:

\Delta L(\textbf{u}) \approx -\nabla L(\textbf{u})\cdot\textbf{v}(\textbf{u})\Delta t, (4)

which indicates that the brightness changes are caused by intensity gradients $\nabla L=(\frac{\delta L}{\delta x},\frac{\delta L}{\delta y})$ moving with velocity $\textbf{v}(\textbf{u})$ over a displacement $\Delta\textbf{u}=\textbf{v}\Delta t$. With $\textbf{v}(\textbf{u})$ and $\nabla L(\textbf{u})$, we can compute $\Delta L(\textbf{u})$ and generate event data via Eq. 1. In this paper, we adopt the simulation method of [21] to obtain $\textbf{v}(\textbf{u})$ and $\nabla L(\textbf{u})$, where only a single depth image is required to simulate the corresponding event frames.

Our simulation algorithm calculates pixel changes in the environment in real-time and converts these changes into events that would be output by an event camera, making the simulated event data as close as possible to the output of a real event camera.
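To make this concrete, the following is a minimal sketch of the per-frame event generation step, assuming a log-intensity image and per-pixel optical flow are already available (in our pipeline both are derived from a single depth image following [21]); the function name and the contrast threshold value are illustrative, not the exact implementation:

```python
import numpy as np

def simulate_events(log_intensity, flow, dt, C=0.1):
    """Threshold the approximate brightness change (Eqs. 1 and 4)
    into signed events.

    log_intensity: (H, W) log-intensity image L
    flow:          (H, W, 2) per-pixel velocity v(u), (vx, vy) in px/s
    dt:            time since the previous frame, in seconds
    C:             contrast threshold
    """
    # Spatial gradient grad(L) = (dL/dx, dL/dy) by finite differences
    # (np.gradient returns the row-axis derivative first, i.e. d/dy).
    gy, gx = np.gradient(log_intensity)
    # Eq. 4: dL(u) ~= -grad(L)(u) . v(u) * dt
    dL = -(gx * flow[..., 0] + gy * flow[..., 1]) * dt
    # Eq. 1: emit an event wherever |dL| exceeds the threshold C
    polarity = np.zeros(log_intensity.shape, dtype=np.int8)
    polarity[dL >= C] = 1    # positive (brightening) events
    polarity[dL <= -C] = -1  # negative (darkening) events
    ys, xs = np.nonzero(polarity)
    return list(zip(xs, ys, polarity[ys, xs]))  # (x, y, p) tuples
```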

II-B Build SNNs in Simulation

Spiking neural networks are bio-inspired algorithms that mimic the actual signaling process of biological brains. Compared with ANNs, they transmit sparse spikes instead of continuous representations, offering benefits such as low energy consumption and robustness. In this paper, we adopt the widely used Leaky Integrate-and-Fire (LIF) model [22], which effectively characterizes the dynamic process of spike generation and can be defined as:

V[n] = \beta V[n-1] + \gamma I[n], (5)
S[n] = \Theta(V[n] - \vartheta_{\textrm{th}}), (6)

where $n$ is the time step and $\beta$ is the leaky factor that controls how much information is retained from the previous time step; $V[n]$ is the membrane potential; $S[n]$ denotes the output spike, which equals 1 when there is a spike and 0 otherwise; $\Theta(x)$ is the Heaviside step function. When the membrane potential exceeds the threshold $\vartheta_{\textrm{th}}$, the neuron fires a spike and resets its membrane potential to $V_{\textrm{reset}}<\vartheta_{\textrm{th}}$.
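As an illustration, a minimal PyTorch implementation of Eqs. 5-6 could look as follows; the parameter values for $\beta$, $\gamma$, $\vartheta_{\textrm{th}}$, and $V_{\textrm{reset}}$ are placeholders rather than the settings used in our experiments:

```python
import torch

class LIFNeuron(torch.nn.Module):
    """Leaky Integrate-and-Fire layer (Eqs. 5-6), illustrative sketch."""

    def __init__(self, beta=0.9, gamma=1.0, v_th=1.0, v_reset=0.0):
        super().__init__()
        self.beta, self.gamma = beta, gamma
        self.v_th, self.v_reset = v_th, v_reset
        self.v = None  # membrane potential V[n], kept across time steps

    def forward(self, current):
        if self.v is None:
            self.v = torch.zeros_like(current)
        # Eq. 5: leaky integration of the input current I[n]
        self.v = self.beta * self.v + self.gamma * current
        # Eq. 6: Heaviside threshold yields the binary spike S[n]
        spike = (self.v >= self.v_th).float()
        # Neurons that fired are hard-reset to V_reset < v_th
        self.v = torch.where(spike.bool(),
                             torch.full_like(self.v, self.v_reset),
                             self.v)
        return spike
```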

| Camera Type | RGB Camera | Depth Camera | Event Camera |
| --- | --- | --- | --- |
| Dynamic Range | Low (∼60 dB) | Low | High (≥120 dB) |
| Latency | High | High | Low |
| Advantages | Versatile for many conditions; high color fidelity | Captures spatial data for 3D modeling; useful in AR/VR | Great for capturing movement in high-dynamic-range scenes |
| Disadvantages | Limited in extreme lighting without HDR | Limited functionality in diverse light conditions | Less effective for static scenes |

TABLE I: Comparison of RGB, depth, and event cameras. Event cameras, with their high dynamic range (HDR) and low latency, are ideally suited for robotic applications in outdoor and extreme-exposure environments.

II-C Learning Process

Reinforcement Learning on ANN. Our policy training framework is structured as a Markov Decision Process (MDP), defined by the tuple $(\mathcal{S},\mathcal{A},\mathcal{R},p,\gamma)$, where $\mathcal{S}$ denotes the state space, $\mathcal{A}$ the action space, $\mathcal{R}$ the reward function, $p$ the state-transition probabilities for each state-action pair, and $\gamma$ the discount factor applied to rewards. At each time step $t$, the agent receives a state $s_{t}\in\mathcal{S}$ and, based on this observation, selects an action $a_{t}\in\mathcal{A}$ sampled from the policy $\pi(a_{t}|s_{t})$. This action leads to a transition from $s_{t}$ to a new state $s_{t+1}\sim p(s_{t+1}|s_{t},a_{t})$, and the agent obtains a reward $r_{t}=\mathcal{R}(s_{t},a_{t})$ at each time step. The primary goal is to optimize the policy parameters $\theta$ to maximize the expected discounted return:

\arg\max_{\theta}\ \mathbb{E}_{(s_{t},a_{t})\sim p_{\theta}(s_{t},a_{t})}\left[\sum_{t=0}^{T-1}\gamma^{t}r_{t}\right], (7)

where $T$ denotes the time horizon of the MDP.
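As a small worked example of the objective in Eq. 7, the discounted return of a single episode is computed as follows:

```python
def discounted_return(rewards, gamma=0.99):
    """Monte-Carlo value of one episode: sum_{t=0}^{T-1} gamma^t * r_t (Eq. 7)."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# e.g. rewards [1.0, 1.0, 1.0] with gamma = 0.9 give 1 + 0.9 + 0.81 = 2.71
```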

Our ANN teacher policy training follows [7]: the policy does not directly learn hand-crafted skills for traversing difficult terrain, but instead enables the robot to parkour by learning from rewards and following instructions. Thus, unlike the approach of [12], our method uses privileged information such as terrain scandots, which can be acquired in real-world scenarios, instead of relying on environmental factors such as friction. In this phase, the policy receives the scandots and the target yaw direction as privileged observations. We use various obstacles, including gaps, steps, hurdles, and parkour terrain, to train the policy.

Distilling to SNN. In the initial phase of our training, we employ an ANN to create a model capable of generating action and directional commands for executing parkour tasks with quadruped robots. This foundational step establishes the groundwork for the subsequent transition to a more energy-efficient model. We then perform a distillation process, whose goal is to train an SNN to emulate the decision-making behavior of the ANN. The process begins with the ANN, serving as the teacher network, interacting with the simulation environment. We then train the SNN, referred to as the student network, to minimize the Mean Squared Error (MSE) loss between the outputs of the student and teacher networks. To enhance performance, we further let the student network interact with the environment, continuing to measure and minimize the loss under identical environmental conditions. The training process is shown in Figure 3, where the distillation losses are defined as:

\mathcal{L}_{action} = \frac{1}{n}\frac{1}{m}\sum_{i=1}^{n}\sum_{j=1}^{m}\left(action^{\mathrm{ANN}}_{ij} - action^{\mathrm{SNN}}_{ij}\right)^{2}, (8)
\mathcal{L}_{yaw} = \frac{1}{m}\sum_{j=1}^{m}\left(yaw^{\mathrm{ANN}}_{j} - yaw^{\mathrm{SNN}}_{j}\right)^{2}, (9)

where $n$ represents the total number of joints of the quadruped robot and $m$ the number of training robots. This iterative process of fine-tuning and adjustment enables the SNN to closely match the ANN’s output patterns across a variety of scenarios. As a result, the original model’s performance is preserved while the computational load and energy consumption are significantly reduced. This efficiency gain makes the model more suitable for real-time applications on power-constrained devices, marking the successful completion of our training process.
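A minimal sketch of one distillation update, assuming a hypothetical output layout in which each policy returns an $(m, n)$ action tensor for $m$ robots and $n$ joints plus an $m$-dimensional yaw vector (the dict keys and shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def distillation_loss(teacher_out, student_out):
    """Eqs. 8-9: MSE between teacher (ANN) and student (SNN) outputs.
    F.mse_loss averages over all elements, matching the 1/(nm) and 1/m factors."""
    l_action = F.mse_loss(student_out["action"], teacher_out["action"])  # Eq. 8
    l_yaw = F.mse_loss(student_out["yaw"], teacher_out["yaw"])           # Eq. 9
    return l_action + l_yaw

# The teacher is frozen during distillation:
# with torch.no_grad():
#     teacher_out = ann_policy(obs)
# loss = distillation_loss(teacher_out, snn_policy(obs))
# loss.backward()
```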

Thanks to the wide dynamic range of event cameras, we can distill our SNN model from a teacher trained with depth cameras under normal lighting conditions and achieve the same effect. Although the SNN model is exposed to external lighting conditions different from those of the depth-based teacher, it maintains efficient and accurate inference under extreme lighting (e.g., direct sunlight or low-light scenes). Our approach opens up new possibilities for deploying quadruped robots in more challenging environments, enabling precise perception and rapid response under almost any lighting condition.

Refer to caption
Figure 4: Overview of the event simulation process. Each depth image can be converted into its corresponding event frame using the optical flow and image gradient.

II-D Theoretical Energy Consumption Calculation

To calculate the theoretical energy consumption of an SNN, we begin by determining the synaptic operations (SOPs). The SOPs of each block in the spiking model can be calculated as [23]: $\operatorname{SOPs}(l)=fr\times T\times\operatorname{FLOPs}(l)$, where $l$ denotes the block number in the spiking model, $fr$ is the firing rate of the block’s input spike train, and $T$ is the time step of the spiking neuron. $\operatorname{FLOPs}(l)$ refers to the floating-point operations of block $l$.

To estimate the theoretical energy consumption of our model, we assume that the MAC and AC operations are implemented as 32-bit floating-point operations in 45 nm hardware [24], with energy costs of $E_{MAC}=4.6\,\mathrm{pJ}$ and $E_{AC}=0.9\,\mathrm{pJ}$, respectively. Following [25, 26], the theoretical energy consumption of ES-Parkour is given by:

E_{\text{ES-Parkour}} = E_{MAC}\times\mathrm{FLOP}^{1}_{\mathrm{SNN}_{\mathrm{Conv}}} + E_{AC}\times\left(\sum_{n=2}^{N}\mathrm{SOP}^{n}_{\mathrm{SNN}_{\mathrm{Conv}}} + \sum_{m=1}^{M}\mathrm{SOP}^{m}_{\mathrm{SNN}_{\mathrm{FC}}}\right), (10)

where $N$ and $M$ represent the total numbers of Conv and FC layers, $E_{MAC}$ and $E_{AC}$ the energy costs of MAC and AC operations, $\mathrm{FLOP}^{1}_{\mathrm{SNN}_{\mathrm{Conv}}}$ the FLOPs of the first Conv layer, and $\mathrm{SOP}^{n}_{\mathrm{SNN}_{\mathrm{Conv}}}$ and $\mathrm{SOP}^{m}_{\mathrm{SNN}_{\mathrm{FC}}}$ the SOPs of the $n^{\text{th}}$ Conv and $m^{\text{th}}$ FC layers, respectively.
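As a sanity check, the short sketch below evaluates Eq. 10; plugging in the ResNet encoder counts from Table II (lumping all SOPs into one term, since the per-layer split is not listed) gives about 0.12 mJ, in line with the 0.11 mJ reported in Table III:

```python
E_MAC, E_AC = 4.6e-12, 0.9e-12  # J per operation, 45 nm process [24]

def sops(fr, T, flops):
    """SOPs(l) = fr * T * FLOPs(l) for one spiking block [23]."""
    return fr * T * flops

def snn_energy(flops_conv1, conv_sops, fc_sops):
    """Eq. 10: the first Conv layer sees real-valued input (MACs);
    all subsequent Conv/FC layers operate on spikes (ACs)."""
    return E_MAC * flops_conv1 + E_AC * (sum(conv_sops) + sum(fc_sops))

# ResNet encoder, counts from Table II (SOPs lumped into one term):
print(snn_energy(8.00e6, [8.76e7], []))  # ~1.16e-4 J, i.e. ~0.12 mJ
```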

III Experiments

III-A Training Setting

During the transition from ANN to SNN in our distillation process, we train the student SNN model extensively within IsaacGym, using a total of 32 parallel robot simulation environments. These environments are specifically chosen to provide a diverse range of challenges and scenarios, ensuring a comprehensive learning experience for the SNN model. To simulate real-world conditions as closely as possible, we sample event images at a frequency of 10 Hz, which allows the network to capture dynamic changes effectively during inference. Training runs on an NVIDIA 3090 GPU and takes about 30 hours. For the SNN, we opt for the Integrate-and-Fire (IF) neuron model, chosen for its simplicity and efficiency, and set the spiking time step to 4, balancing responsiveness and computational demand.

During training, we adopt a set of carefully chosen hyperparameters to optimize the performance of our SNN model. We set the learning rate to 0.001 and choose spiking ResNet-18 [27] as the vision backbone of our encoder. A GRU module fuses the latent features encoded from proprioceptive information with the event features. Additionally, a 3-layer spiking MLP with sizes [512, 256, 128] serves as the actor network.
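A simplified sketch of this policy head is shown below; the input and output dimensions (feat_dim, prop_dim, act_dim) are illustrative assumptions, the spiking ResNet-18 backbone is abstracted as a precomputed event-feature vector, and the IF dynamics follow the hard-reset form described in Sec. II-B:

```python
import torch
import torch.nn as nn

class SpikingActor(nn.Module):
    """GRU fusion of event + proprioceptive features followed by a
    3-layer spiking MLP ([512, 256, 128]) actor; illustrative sketch."""

    def __init__(self, feat_dim=64, prop_dim=53, act_dim=12, T=4):
        super().__init__()
        self.T = T  # spiking time steps (4 in our experiments)
        self.gru = nn.GRU(feat_dim + prop_dim, 512, batch_first=True)
        self.fc1 = nn.Linear(512, 256)
        self.fc2 = nn.Linear(256, 128)
        self.head = nn.Linear(128, act_dim)

    @staticmethod
    def if_step(x, v, v_th=1.0):
        v = v + x                        # integrate input current
        s = (v >= v_th).float()          # fire when threshold is crossed
        return s, v * (1.0 - s)          # hard reset of fired neurons

    def forward(self, event_feat, proprio, hidden=None):
        fused, hidden = self.gru(
            torch.cat([event_feat, proprio], dim=-1).unsqueeze(1), hidden)
        x = fused.squeeze(1)
        v1, v2 = x.new_zeros(x.shape[0], 256), x.new_zeros(x.shape[0], 128)
        out = x.new_zeros(x.shape[0], self.head.out_features)
        for _ in range(self.T):  # repeat input over T steps, average the readout
            s1, v1 = self.if_step(self.fc1(x), v1)
            s2, v2 = self.if_step(self.fc2(s1), v2)
            out = out + self.head(s2)
        return out / self.T, hidden
```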

Refer to caption
Figure 5: We evaluate our SNN policy across four different scenarios. The top row shows the terrain type for each scenario, and the bottom row the corresponding success rate.

III-B Simulation Results

During the training phase of our quadruped robot’s spiking neural network, we closely monitor the terrain-level curve to assess the robot’s ability to adapt to complex terrains. We evaluate our SNN policy, and the results are shown in Figure 5. The curriculum gradually guides the robot through tasks of increasing difficulty, significantly enhancing its adaptability and performance under various environmental conditions. Notably, these results are obtained under varying lighting conditions, showing that our curriculum learning robustly handles changes in illumination.

| Encoder Type | SNN FLOPs | SNN SOPs | ANN FLOPs | Efficiency ↓ OPs(SNN) : OPs(ANN) |
| --- | --- | --- | --- | --- |
| ResNet | 8.00×10^6 | 8.76×10^7 | 2.04×10^8 | 0.46 : 1 |
| MLP | 7.17×10^6 | 2.61×10^6 | 3.31×10^7 | 0.29 : 1 |

TABLE II: Comparison of operation counts (FLOPs/SOPs) between the vision encoders of Parkour (ANN) and ES-Parkour (SNN). The SNN requires fewer operations than its ANN counterpart.
| Module | Encoder: ResNet (11.19M) | Encoder: MLP (8.01M) | Actor: MLP (0.26M) |
| --- | --- | --- | --- |
| ANN Power (mJ) | 0.94 | 0.15 | 1.08×10^-3 |
| SNN Power (mJ) | 0.11 | 0.04 | 3.30×10^-4 |
| Energy Saving | 88.29% | 73.33% | 69.44% |

TABLE III: Comparison of energy consumption between the original Parkour (ANN model) and ES-Parkour (SNN model). ES-Parkour achieves substantial energy savings (up to 88.29%) in every module.

III-C Analysis of Computing Efficiency

III-C1 Comparisons of the number of operations.

Given that the majority of computational demand in neural networks stems from matrix operations, this section analyzes and compares the operation counts of the visual encoders of the ANN and the SNN, in order to validate the efficiency of our ES-Parkour system. According to Table II, the SNN consistently shows a lower total operation count (FLOPs plus SOPs) than the ANN, regardless of whether ResNet or MLP serves as the visual backbone. This difference arises because the non-spiking portion of the features (i.e., zero values) in SNNs consumes no computational resources during matrix operations. As a result, the overall number of operations of SNNs falls significantly below that of ANNs. We define an operational efficiency metric:

\text{Efficiency} = \frac{\mathrm{OPs}(\mathrm{SNN})}{\mathrm{OPs}(\mathrm{ANN})} = \frac{\mathrm{FLOP}_{\mathrm{SNN}}+\mathrm{SOP}_{\mathrm{SNN}}}{\mathrm{FLOP}_{\mathrm{ANN}}}, (11)

This metric measures the relative operation ratio: a value below 1 indicates higher energy efficiency of the SNN compared to the ANN. The low efficiency values in Table II underline the computational efficiency of our ES-Parkour system.
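As a concrete check, plugging the Table II counts into Eq. 11 reproduces the reported ratios (up to rounding):

```python
# Eq. 11 applied to the operation counts in Table II
resnet_eff = (8.00e6 + 8.76e7) / 2.04e8  # ~0.47, reported as 0.46 : 1
mlp_eff    = (7.17e6 + 2.61e6) / 3.31e7  # ~0.30, reported as 0.29 : 1
```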

| Method | Gap | Step | Hurdle | Parkour |
| --- | --- | --- | --- | --- |
| ANN | 808.16 | 876.23 | 853.32 | 1008.6 |
| SNN (ours) | 813.45 | 869.27 | 862.01 | 997.54 |

TABLE IV: Comparison of the robots’ average joint-motor energy (mJ) between ANN and SNN.

III-C2 Evaluation of the energy consumption.

To further emphasize the low-energy nature of ES-Parkour, we conduct a detailed comparative analysis of the energy consumption of the proposed ES-Parkour and its corresponding ANN model. As shown in Table III, taking the ResNet scenario as an example, ES-Parkour consumes merely 11.7% of the energy of the ANN model, an 88.3% energy saving achieved by the SNN. Moreover, the actor module of ES-Parkour further exemplifies energy conservation compared with the ANN Parkour ($3.30\times10^{-4}$ mJ vs. $1.08\times10^{-3}$ mJ), demonstrating the superior low-energy benefits of our system. As shown in Table IV, joint-level motor energy remains on par with the ANN while offering better environmental adaptability and lower computational burden. Our extensive testing validates the overall feasibility and performance advantages of the system.

| Scenario | normal-light | overexposed | underexposed | high-speed |
| --- | --- | --- | --- | --- |
| Anymal parkour [16] | ✓ | ✓ | ✓ | ✗ |
| Extreme parkour [7] | ✓ | ✗ | ✗ | ✗ |
| Robot parkour [6] | ✓ | ✗ | ✗ | ✗ |
| ES-Parkour (ours) | ✓ | ✓ | ✓ | ✓ |

TABLE V: Comparison of the abilities of different methods in extreme scenarios.

IV Conclusion

In this paper, by integrating spiking neural networks (SNNs) and event cameras, we not only address the power-consumption and computational-load challenges inherent in traditional deep learning models for quadruped robot parkour but also forge a new pathway for enhancing robot perception and control, enabling more efficient and adaptive responses in complex environments. We compare robot parkour methods in Table V; ours is the only one that can be tested under all of the listed environmental conditions.

Due to the difficulty of obtaining SNN chips, our work has not yet been tested on physical robots. However, as with previous robotic validation efforts, we have extensively tested our system in simulation to support its feasibility on physical robots. This phased approach keeps our study on a sustainable footing. In the future, we will continue to refine our system and advance the integration of SNN chips with physical robots.


References

  • [1] Alexander Reske, Jan Carius, Yuntao Ma, Farbod Farshidian, and Marco Hutter, “Imitation learning from mpc for quadrupedal multi-gait control,” in ICRA. IEEE, 2021, pp. 5014–5020.
  • [2] Jemin Hwangbo, Joonho Lee, Alexey Dosovitskiy, Dario Bellicoso, Vassilios Tsounis, Vladlen Koltun, and Marco Hutter, “Learning agile and dynamic motor skills for legged robots,” Science Robotics, vol. 4, no. 26, pp. eaau5872, 2019.
  • [3] Jinze Wu, Guiyang Xin, Chenkun Qi, and Yufei Xue, “Learning robust and agile legged locomotion using adversarial motion priors,” IEEE RAL, 2023.
  • [4] Xue Bin Peng, Erwin Coumans, Tingnan Zhang, Tsang-Wei Lee, Jie Tan, and Sergey Levine, “Learning agile robotic locomotion skills by imitating animals,” arXiv preprint arXiv:2004.00784, 2020.
  • [5] Atil Iscen, Ken Caluwaerts, Jie Tan, Tingnan Zhang, Erwin Coumans, Vikas Sindhwani, and Vincent Vanhoucke, “Policies modulating trajectory generators,” in CoRL. PMLR, 2018, pp. 916–926.
  • [6] Ziwen Zhuang, Zipeng Fu, Jianren Wang, Christopher Atkeson, Soeren Schwertfeger, Chelsea Finn, and Hang Zhao, “Robot parkour learning,” arXiv preprint arXiv:2309.05665, 2023.
  • [7] Xuxin Cheng, Kexin Shi, Ananye Agarwal, and Deepak Pathak, “Extreme parkour with legged robots,” arXiv preprint arXiv:2309.14341, 2023.
  • [8] Falanga Davide, Kleber Kevin, and Scaramuzza Davide, “Dynamic obstacle avoidance for quadrotors with event cameras,” Science Robotics, vol. 5, no. 40, pp. 13–27, 2020.
  • [9] Daniel Gehrig and Davide Scaramuzza, “Low-latency automotive vision with event cameras,” Nature, vol. 629, no. 8014, pp. 1034–1040, 2024.
  • [10] Gerardo Bledt and Sangbae Kim, “Extracting legged locomotion heuristics with regularized predictive control,” in ICRA. IEEE, 2020, pp. 406–412.
  • [11] Jared Di Carlo, Patrick M Wensing, Benjamin Katz, Gerardo Bledt, and Sangbae Kim, “Dynamic locomotion in the mit cheetah 3 through convex model-predictive control,” in IROS. IEEE, 2018, pp. 1–9.
  • [12] Ashish Kumar, Zipeng Fu, Deepak Pathak, and Jitendra Malik, “Rma: Rapid motor adaptation for legged robots,” in RSS, 2021.
  • [13] Xue Bin Peng, Ze Ma, Pieter Abbeel, Sergey Levine, and Angjoo Kanazawa, “Amp: Adversarial motion priors for stylized physics-based character control,” ToG, vol. 40, no. 4, pp. 1–20, 2021.
  • [14] Chieko Sarah Imai, Minghao Zhang, Yuchen Zhang, Marcin Kierebiński, Ruihan Yang, Yuzhe Qin, and Xiaolong Wang, “Vision-guided quadrupedal locomotion in the wild with multi-modal delay randomization,” in IROS. IEEE, 2022, pp. 5556–5563.
  • [15] Ruihan Yang, Ge Yang, and Xiaolong Wang, “Neural volumetric memory for visual locomotion control,” in CVPR, 2023, pp. 1430–1440.
  • [16] David Hoeller, Nikita Rudin, Dhionis Sako, and Marco Hutter, “Anymal parkour: Learning agile navigation for quadrupedal robots,” arXiv preprint arXiv:2306.14874, 2023.
  • [17] Alex Zihao Zhu, Nikolay Atanasov, and Kostas Daniilidis, “Event-based visual inertial odometry,” in CVPR, 2017, pp. 5391–5399.
  • [18] Guangzhi Tang, Neelesh Kumar, Raymond Yoo, and Konstantinos Michmizos, “Deep reinforcement learning with population-coded spiking neural network for continuous control,” in CoRL. PMLR, 2021, pp. 2016–2029.
  • [19] Fangwen Yu, Yujie Wu, Songchen Ma, Mingkun Xu, Hongyi Li, Huanyu Qu, Chenhang Song, Taoyi Wang, Rong Zhao, and Luping Shi, “Brain-inspired multimodal hybrid neural network for robot place recognition,” Science Robotics, vol. 8, no. 78, pp. eabm6996, 2023.
  • [20] Berthold KP Horn and Brian G Schunck, “Determining optical flow,” AI, vol. 17, no. 1-3, pp. 185–203, 1981.
  • [21] Jiahang Cao, Xu Zheng, Yuanhuiyi Lyu, Jiaxu Wang, Renjing Xu, and Lin Wang, “Chasing day and night: Towards robust and efficient all-day object detection guided by an event camera,” in ICRA. IEEE, 2024, pp. 9026–9032.
  • [22] Eric Hunsberger and Chris Eliasmith, “Spiking deep networks with lif neurons,” arXiv preprint arXiv:1510.08829, 2015.
  • [23] Zhaokun Zhou, Yuesheng Zhu, Chao He, Yaowei Wang, Shuicheng Yan, Yonghong Tian, and Li Yuan, “Spikformer: When spiking neural network meets transformer,” arXiv preprint arXiv:2209.15425, 2022.
  • [24] Mark Horowitz, “1.1 computing’s energy problem (and what we can do about it),” in ISSCC. IEEE, 2014, pp. 10–14.
  • [25] Priyadarshini Panda, Sai Aparna Aketi, and Kaushik Roy, “Toward scalable, efficient, and accurate deep spiking neural networks with backward residual connections, stochastic softmax, and hybridization,” Frontiers in Neuroscience, vol. 14, pp. 653, 2020.
  • [26] Man Yao, Guangshe Zhao, Hengyu Zhang, Yifan Hu, Lei Deng, Yonghong Tian, Bo Xu, and Guoqi Li, “Attention spiking neural networks,” IEEE TPAMI, 2023.
  • [27] Wei Fang, Zhaofei Yu, Yanqi Chen, Tiejun Huang, Timothée Masquelier, and Yonghong Tian, “Deep residual learning in spiking neural networks,” NeurIPS, vol. 34, pp. 21056–21069, 2021.