SpikeNVS: Enhancing Novel View Synthesis from Blurry Images via Spike Camera
School of Computer Science, Peking University
Abstract
One of the most critical factors in achieving sharp Novel View Synthesis (NVS) using neural field methods like Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) is the quality of the training images. However, conventional RGB cameras are susceptible to motion blur. In contrast, neuromorphic cameras like event and spike cameras inherently capture more comprehensive temporal information, which can provide a sharp representation of the scene as additional training data. Recent methods have explored the integration of event cameras to improve the quality of NVS. These event-RGB approaches have some limitations, such as high training costs and the inability to handle static background regions effectively. Instead, our study introduces a new method that uses the spike camera to overcome these limitations. By treating texture reconstruction from spike streams as ground truth, we design the Texture from Spike (TfS) loss. Since the spike camera relies on temporal integration instead of the temporal differentiation used by event cameras, our proposed TfS loss maintains manageable training costs, and it handles foreground objects and backgrounds simultaneously. We also provide a real-world dataset captured with our spike-RGB camera system to facilitate future research endeavors. We conduct extensive experiments using synthetic and real-world datasets to demonstrate that our design can enhance novel view synthesis across NeRF and 3DGS. The code and dataset will be made available for public access.
Keywords:
Neuromorphic sensors · Novel view synthesis · Deblur
1 Introduction
Novel view synthesis methods, such as Neural Radiance Fields (NeRF) [18] and 3D Gaussian Splatting (3DGS) [12], have garnered significant attention due to their remarkable ability to learn scene representations from multi-view images. The NeRF model predicts color and density from 3D scene coordinates and ray information, which are subsequently volume-rendered into an image. 3DGS, on the other hand, uses learnable 3D Gaussians to represent the scene and employs the splatting method for rendering. In both methods, achieving sharp NVS heavily relies on the quality of the training images. However, RGB cameras are prone to motion blur, compromising both the captured images and the learning process. Previous approaches, such as NeRF-W [16] and Deblur-NeRF [15], have attempted to model image degradation factors like blur, occlusion, and illumination changes. Nevertheless, these methods fail to address the inherent limitations of RGB cameras. In contrast, neuromorphic cameras like event and spike cameras inherently capture richer temporal information than RGB cameras. Consequently, a few recent studies have proposed enhancing NVS with neuromorphic cameras.
The primary focus of event-RGB methods is to utilize event cameras for capturing dynamic objects and guiding the learning process [9, 23, 24]. However, event cameras rely on temporal differentiation, restricting their ability to capture static scene content such as the background. Additionally, these approaches use an event-based loss function that requires multiple independent renderings to compute temporal differences, leading to significantly higher training costs. The limitations shown in Fig. 1 (b) arise from this trade-off between precision and computational cost in E2NeRF [23], an event-RGB NeRF method.

In contrast, the spike camera measures accumulated brightness to trigger a spike pulse, thereby avoiding the intrinsic limitations of event-based methods. This temporal integration property allows the simultaneous capture of static and dynamic scene content and avoids the need for independent rendering of asynchronous differences. With this insight, we propose a novel Texture from Spike (TfS) loss to introduce spike data into NVS methods. Specifically, Texture from Interval (TFI) and Texture from Playback (TFP) [4] are two commonly used spike texture reconstruction techniques. As demonstrated in Fig. 2 (a), TFI evaluates the relative trigger frequency within an interval and reconstructs the image based on the eligibility traces. Conversely, TFP uses a sliding time window (e.g., 32 or 64 timestamps) to aggregate the spike pulses within the window. Our TfS loss leverages the benefits of both TFP and TFI in a learnable manner, as shown in Fig. 2 (b).

We also build a new dataset with our synchronized spike-RGB camera system to evaluate our method on real-world scenes. In summary, our main contributions are as follows:
1. We present a pioneering study that assesses the advantages of spike streams over event streams for NVS in terms of both quality and cost, and we develop the first spike-RGB NVS technique based on this analysis.
2. We propose a novel Texture from Spike (TfS) loss that effectively integrates the advantages of both Texture from Interval (TFI) and Texture from Playback (TFP) reconstructions, which are commonly employed for spike reconstruction, into the learning process.
3. We have developed a synchronized spike-RGB camera system with aligned field-of-view and trigger timing, and we contribute the first real-world dataset for spike-RGB NVS captured with it.
2 Related Work
2.1 Novel View Synthesis
Neural Radiance Fields (NeRF) have been a popular method for synthesizing novel views since they were first introduced [18]. Subsequent advancements in NeRF technology have aimed at improving its effectiveness and performance. For instance, PlenOctree [31], FastNeRF [7], and EfficientNeRF [8] have refined data structures to expedite processing speeds. Other innovations, such as AutoInt [18] and Instant-NGP [20], have worked on extracting distinct features to improve scene representation. Research efforts such as pixelNeRF [32] and MVSNeRF [1] have adapted NeRF to sparse-view conditions, leveraging pre-trained networks for feature extraction. Methods incorporating depth or geometry cues [30, 3] have been proposed to speed up training. Recently, the emerging technique of 3D Gaussian Splatting (3DGS) [12] employs learnable 3D Gaussians to represent the scene and uses the splatting method for rendering from arbitrary camera views. This approach achieves real-time rendering without using neural networks to learn an implicit function, resulting in accelerated training. However, all these advancements rely on consistent, high-quality multi-view images, which poses a common challenge, especially when using standard RGB cameras.
2.2 Neuromorphic Cameras
Spike cameras [5] and event cameras [13, 2] are bio-inspired sensors that can overcome the limitations of traditional RGB cameras in challenging scenarios. One of the key advantages of neuromorphic cameras is their high temporal resolution and pixel bandwidth. Researchers have leveraged these advantages to recover high-quality scene information from low-quality RGB frames. EDI [21] and E-CIR [29] fuse event data and blurry frames to reconstruct high-quality video. [34, 36, 37] use spike cameras to capture high-speed movements without motion blur. In addition, spike and event cameras have a much higher dynamic range than RGB cameras. This characteristic has been effectively utilized in spike cameras to address challenges such as overexposure [35], low illumination [6], and noise [38] associated with conventional cameras. Similarly, event cameras have been used in different ways, such as multi-bracket HDR pipelines [17], E2SRI [11], and enhancing robustness in dynamic scenes [10]. However, event cameras struggle to capture the texture details of visual scenes because they record only relative changes in light intensity, leading to significantly degraded visibility. In contrast, spike cameras record absolute light intensity at a very high frame rate, providing a more explicit input format for detailed reconstruction.
3 Method
We first give a concise introduction to NeRF, 3DGS, and spike reconstruction. We then describe our synchronized spike-RGB camera system. Finally, all components are integrated to form our final pipeline.
3.1 Preliminary
NeRF utilizes a Multi-Layer Perceptron (MLP) to model the mapping from 5D input coordinates (3D spatial position and 2D viewing direction) to the color and density of a scene, as detailed in Eq. 1:

$$F_{\Theta} : \big(\gamma(\mathbf{x}), \gamma(\mathbf{d})\big) \rightarrow (\mathbf{c}, \sigma) \tag{1}$$
The function $\gamma(\cdot)$ encodes the inputs into a higher-dimensional space to facilitate the learning of complex spatial relationships. The rendering process relies on classical volume rendering techniques, which integrate the color and density along camera rays, represented by Eq. 2. Here, $\mathbf{r}(t) = \mathbf{o} + t\mathbf{d}$ defines the ray's path, $\mathbf{o}$ is the camera origin, $\mathbf{d}$ the viewing direction, and $t$ the distance along the ray.

$$\hat{C}(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma(\mathbf{r}(t))\,\mathbf{c}(\mathbf{r}(t), \mathbf{d})\,dt, \qquad T(t) = \exp\!\left(-\int_{t_n}^{t} \sigma(\mathbf{r}(s))\,ds\right) \tag{2}$$
NeRF leverages hierarchical volume sampling, optimizing both a coarse and a fine network to render accurate images. The loss function, described in Eq. 3, minimizes the difference between the rendered and the ground-truth colors across a batch of rays $\mathcal{R}$, enhancing the fidelity of both the coarse and fine models.

$$\mathcal{L} = \sum_{\mathbf{r} \in \mathcal{R}} \left[ \big\| \hat{C}_c(\mathbf{r}) - C(\mathbf{r}) \big\|_2^2 + \big\| \hat{C}_f(\mathbf{r}) - C(\mathbf{r}) \big\|_2^2 \right] \tag{3}$$
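To make the rendering step concrete, the sketch below shows the standard discrete quadrature approximation of Eq. 2 used by NeRF-style renderers; the sample spacing `deltas` and the array names are illustrative and not taken from the paper's code.

```python
import numpy as np

def volume_render(sigmas, colors, deltas):
    """Discrete approximation of Eq. 2: composite per-sample color/density
    along a ray into a single pixel color (illustrative sketch)."""
    # Opacity of each sample: alpha_i = 1 - exp(-sigma_i * delta_i)
    alphas = 1.0 - np.exp(-sigmas * deltas)                       # (N,)
    # Transmittance T_i = prod_{j<i} (1 - alpha_j)
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    weights = trans * alphas                                      # (N,)
    return (weights[:, None] * colors).sum(axis=0)                # (3,)

# Example: 64 samples along one ray
sigmas = np.random.rand(64)
colors = np.random.rand(64, 3)
deltas = np.full(64, 1.0 / 64)
pixel_rgb = volume_render(sigmas, colors, deltas)
```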
3D Gaussian Splatting involves projecting points in 3D space onto the visual plane and smoothly distributing their influence using Gaussian functions. Each point in the cloud is then represented as a 3D Gaussian. This involves defining parameters such as the position center and covariance for each Gaussian.
During rendering, given a viewing transformation $W$, the covariance matrix $\Sigma'$ in camera coordinates is given as

$$\Sigma' = J\, W\, \Sigma\, W^{T} J^{T},$$

where $J$ is the Jacobian of the affine approximation of the projective transformation.
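As a minimal illustration of this projection (with hypothetical matrices, not the paper's implementation), the snippet below builds a valid 3D covariance from per-Gaussian scale and rotation and maps it into screen space with $\Sigma' = J W \Sigma W^T J^T$.

```python
import numpy as np

def project_covariance(scale, R, W, J):
    """Sketch of the 3DGS covariance projection Sigma' = J W Sigma W^T J^T."""
    # Sigma = R S S^T R^T guarantees a valid (positive semi-definite) covariance
    S = np.diag(scale)
    sigma = R @ S @ S.T @ R.T
    return J @ W @ sigma @ W.T @ J.T

# Toy example with an identity world-to-camera rotation and a 2x3 Jacobian
scale = np.array([0.1, 0.2, 0.05])
R = np.eye(3)              # Gaussian orientation
W = np.eye(3)              # viewing transformation (rotation part)
J = np.array([[100.0, 0.0, -5.0],
              [0.0, 100.0, -2.0]])  # affine approx. of the perspective projection
sigma_2d = project_covariance(scale, R, W, J)  # 2x2 screen-space covariance
```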
3.2 Spike Reconstruction
The spike camera captures light intensity in a distinct manner, differing from both the exposure mechanism of RGB cameras and the differential mechanism of event cameras. Each pixel integrates the incoming light intensity until it reaches a threshold $\theta$, triggering a spike while retaining the surplus. Denoting the accumulator of the pixel at $(x, y)$ at time $t$ as $A(x, y, t)$, we have:

$$A(x, y, t) = \int_{0}^{t} I(x, y, \tau)\, d\tau \;\bmod\; \theta \tag{4}$$
The pixel's spike value at $(x, y)$ is determined by the accumulator's value and the incoming brightness during the sampling interval $\Delta t$:

$$S(x, y, t) = \begin{cases} 1, & \text{if } A(x, y, t - \Delta t) + \int_{t - \Delta t}^{t} I(x, y, \tau)\, d\tau \geq \theta \\ 0, & \text{otherwise} \end{cases} \tag{5}$$
A spike frame collects the spikes emitted between timestamps $t_i$ and $t_{i+1}$, denoted as $\mathcal{S}_i$:

$$\mathcal{S}_i = \{\, S(x, y, t) \mid t_i \leq t < t_{i+1} \,\} \tag{6}$$
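A minimal simulation of this integrate-and-fire process (Eqs. 4–6), assuming a discretely sampled intensity sequence; the threshold value and array names are illustrative.

```python
import numpy as np

def simulate_spikes(intensity, theta=2.0, dt=1.0 / 40000):
    """Integrate per-pixel brightness over time; emit a spike whenever the
    accumulator crosses the threshold theta, keeping the surplus (Eqs. 4-5).

    intensity: (T, H, W) brightness at each sampling step.
    returns:   (T, H, W) binary spike frames (Eq. 6).
    """
    T, H, W = intensity.shape
    acc = np.zeros((H, W))
    spikes = np.zeros((T, H, W), dtype=np.uint8)
    for t in range(T):
        acc += intensity[t] * dt          # temporal integration
        fired = acc >= theta              # threshold crossing triggers a spike
        spikes[t][fired] = 1
        acc[fired] -= theta               # keep the surplus after firing
    return spikes

spike_stream = simulate_spikes(np.random.rand(128, 250, 400) * 1e5)
```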
The rich data from spike streams are processed to assist NeRF’s learning of spatiotemporal textures. Texture reconstruction uses TFI (Texture from Interval) and TFP (Texture from Playback), described by:
$$T_{\mathrm{TFI}}(x, y, t) = \frac{\theta}{\Delta t_s(x, y, t)}, \qquad T_{\mathrm{TFP}}(x, y, t) = \frac{\theta}{w} \sum_{\tau = t - w + 1}^{t} S(x, y, \tau) \tag{7}$$

Here, $T$ denotes the reconstructed texture. For TFI, $\Delta t_s(x, y, t)$ represents the temporal interval (latency) between time $t$ and the last spike emission; the TFI approach is proficient in capturing the contours of textures. For TFP, $w$ denotes the size of the time window and the sum accumulates the spike values within this window. By dynamically adjusting the window size according to different contrast levels, the TFP method achieves texture reconstruction with diverse dynamic ranges.
Unlike event data, which only capture motion, TFP and TFI can reconstruct a relatively blur-free texture of both static (e.g., backgrounds) and dynamic (e.g., moving objects) scene content directly from spike streams.
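The following sketch reconstructs TFI and TFP textures from a binary spike stream in the spirit of Eq. 7; the window size, search strategy, and normalization are illustrative choices, not the paper's exact settings.

```python
import numpy as np

def tfp(spikes, t, window=32, theta=2.0):
    """Texture from Playback: scaled average spike count in a sliding window (Eq. 7)."""
    start = max(0, t - window + 1)
    return theta * spikes[start:t + 1].sum(axis=0) / float(window)

def tfi(spikes, t, theta=2.0):
    """Texture from Interval: brightness ~ theta / latency since the last spike."""
    T, H, W = spikes.shape
    latency = np.full((H, W), T, dtype=np.float32)   # default: no spike observed
    for dt in range(t + 1):                          # search backwards from t
        fired = (spikes[t - dt] == 1) & (latency == T)
        latency[fired] = dt + 1
    return theta / latency

spikes = (np.random.rand(128, 250, 400) > 0.7).astype(np.uint8)
texture_tfp = tfp(spikes, t=127)
texture_tfi = tfi(spikes, t=127)
```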
3.3 Synchronized Spike-RGB Camera System
Specifications | GoPro 9 (RGB) | Vidar (Spike)
---|---|---
Resolution | 1920×1080 | 400×250
Frame Rate (fps) | 120 | 40,000
Dynamic Range (dB) | 60 | 100
To introduce spike streams into the learning process of NeRF or 3DGS, the first step is to develop a platform for data collection. In this case, we designed and constructed a synchronized spike-RGB camera system. This system combines a spike camera (Vidar [5]) and a conventional RGB camera (GoPro 9) using a beam splitter (Thorlabs CCM1-BS013). The hardware prototype and specifications of our system are depicted in Fig. 3. The purpose of the beam splitter is to achieve the spatial synchronization of the same scene capture. The beam splitter splits the incoming light and directs it to separate sensors with the same field of view. We also ensure time synchronization by employing a clock with an accuracy of 0.0001s to determine the timestamps of the scenes captured by both cameras. Moreover, the mobility of our Spike-RGB camera system enables us to capture images in both indoor and outdoor environments. This versatility allows us to validate the effectiveness of our proposed method across various scenarios.
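As an example of how the two streams can be temporally aligned after capture (a sketch assuming both devices share the synchronized clock's timestamps; the function and variable names are hypothetical), each RGB frame's exposure window is mapped to the range of 40 kHz spike frames it overlaps.

```python
def spike_indices_for_exposure(rgb_start, exposure, spike_rate=40000, spike_t0=0.0):
    """Return the indices of spike frames falling inside one RGB exposure window.

    rgb_start: exposure start time of the RGB frame (seconds, shared clock)
    exposure:  exposure duration of the RGB frame (seconds)
    """
    first = int((rgb_start - spike_t0) * spike_rate)
    last = int((rgb_start + exposure - spike_t0) * spike_rate)
    return range(first, last + 1)

# A 1/120 s exposure starting at t = 2.5 s covers roughly 334 spike frames
idx = spike_indices_for_exposure(rgb_start=2.5, exposure=1 / 120)
print(len(idx))
```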
3.4 Integrating Colored Blur-free Representation from Spike Stream
Color Rendering Loss
We introduce the Color Rendering Loss as Eq. 8. We adhere to the joint optimization design of NeRF's coarse and fine models, a strategy that remains advantageous in our framework.

$$\mathcal{L}_{color} = \sum_{\mathbf{r} \in \mathcal{R}} \left[ \big\| \hat{C}_c(\mathbf{r}) - C(\mathbf{r}) \big\|_2^2 + \big\| \hat{C}_f(\mathbf{r}) - C(\mathbf{r}) \big\|_2^2 \right] \tag{8}$$

Here $\hat{C}_c(\mathbf{r})$ and $\hat{C}_f(\mathbf{r})$ denote the predicted colors of the coarse and fine models, respectively, while $C(\mathbf{r})$ denotes the true color values for any sampled ray $\mathbf{r}$.
Texture from Spike Loss
Our framework enhances deblurring by incorporating spike stream into a learning-based 3D reconstruction process, utilizing our specifically designed Texture from Spike (TfS) Loss. This approach exploits the high temporal resolution of spike cameras, which capture a series of spike streams for each blurry RGB image frame. The TfS Loss aids in reconstructing a clear texture from the spike streams, which corresponds to the non-blurred content within the RGB image’s exposure time. We use multiple losses combined to construct the final TfS loss, including the loss from both TFI and TFP.
The trainable converter layer, which converts the color output into grayscale, is another distinctive design. In contrast to the previous event-based method [23] that employs standard weighting (R:0.2989, G:0.5870, and B:0.1140) for RGB to grayscale conversion, we question the limited flexibility of such a conversion approach designed for regular images. Instead, we utilize a learnable layer for conversion that aligns the reconstructed grayscale texture with the blur-free texture obtained from spike streams. This design not only demonstrates greater potential but also enables direct loss computation. The formulation of TfS Loss is as follows:
$$\mathcal{L}_{TfS} = \sum_{\mathbf{r} \in \mathcal{R}} \left[ \big\| \hat{G}(\mathbf{r}) - T_{\mathrm{TFI}}(\mathbf{r}) \big\|_2^2 + \big\| \hat{G}(\mathbf{r}) - T_{\mathrm{TFP}}(\mathbf{r}) \big\|_2^2 \right] \tag{9}$$

In this equation, for any 4D output $\hat{C}(\mathbf{r})$ from the color rendering head, $\hat{G}(\mathbf{r})$ represents the grayscale texture predicted by the learned conversion, and $T_{\mathrm{TFI}}(\mathbf{r})$ and $T_{\mathrm{TFP}}(\mathbf{r})$ are the ground-truth textures from the spike reconstruction. The overall training loss for SpikeNeRF combines the TfS loss with the color rendering loss, ensuring that both texture and color are preserved in the deblurred output.
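Below is a minimal PyTorch sketch of the learnable RGB-to-grayscale converter and a TfS-style loss combining TFI and TFP targets, following the description above; the initialization, the two-term form, and the weighting value are our reading of the text, not the released code.

```python
import torch
import torch.nn as nn

class RGB2Gray(nn.Module):
    """Learnable conversion from rendered RGB to grayscale texture."""
    def __init__(self):
        super().__init__()
        # Initialized with the standard weighting mentioned in the text,
        # but trainable so it can adapt to the spike sensor's response.
        self.weights = nn.Parameter(torch.tensor([0.2989, 0.5870, 0.1140]))

    def forward(self, rgb):          # rgb: (N, 3)
        return rgb @ self.weights    # (N,)

def tfs_loss(pred_rgb, tfi_gt, tfp_gt, converter):
    """TfS-style loss: match converted grayscale to TFI and TFP reconstructions (Eq. 9)."""
    gray = converter(pred_rgb)
    return ((gray - tfi_gt) ** 2).mean() + ((gray - tfp_gt) ** 2).mean()

# Total loss = color rendering loss + lambda * TfS loss (lambda = 1e-4 in our settings)
converter = RGB2Gray()
pred_rgb = torch.rand(1024, 3, requires_grad=True)
loss = nn.functional.mse_loss(pred_rgb, torch.rand(1024, 3)) \
       + 1e-4 * tfs_loss(pred_rgb, torch.rand(1024), torch.rand(1024), converter)
loss.backward()
```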

4 Experiments
Due to space constraints, this section presents a concise overview of the generation and collection of spike data, the experimental setup, and the result analysis to showcase our contributions; the first section of the Appendix provides a more comprehensive account.
4.1 Experimental Setups

Synthetic Dataset
Our synthetic dataset comprises six classical scenes for NeRF reconstruction: hotdog, ficus, lego, chair, materials, and mic. We used Blender as our virtual environment to collect 200 sets of RGB images for each scene, all captured from different camera views. Each set consists of 18 sharp images obtained with the Camera Shakify plugin in Blender, which simulates motion blur by shaking the camera. The 200 views are divided evenly into 100 training views and 100 testing views. For the training views, the 18 RGB images from each view are processed through a spike-generating tool to produce corresponding binary spike streams. It is crucial to note that during the training phase, we use only a single blurry RGB image and 18 binary spike frames per view for supervision, excluding any sharp RGB images from the process.
Real-world Dataset
Based on the Spike-RGB camera system we introduced in Sec. 3.3, we constructed an RGB & Spike 3D (RS-3D) dataset. This dataset captures six real-world scenes using the Spike-RGB camera system, with camera poses derived via COLMAP[25]. The creation involved capturing RGB and spike videos from 40 viewpoints, with intentional camera movement for motion blur. Temporal synchronization utilized a precise clock for accurate timestamps, aligning the spike data with RGB frames. Spatial alignment was achieved using a beam splitter and camera calibration with checkerboards, allowing for accurate cropping of RGB frames to match the spike camera’s field of view. The RS-3D dataset thus includes six scenes with 40 synchronized and aligned pairs of RGB images and spike streams. Dataset details are provided in Fig. 4.
Baselines
As discussed in Sec. 1, diverse deblurring techniques exist, encompassing algorithm-based deblurring without reliance on hardware or sharp training data (Deblur-NeRF [15] and MPR-NeRF [33]) as well as hardware-dependent methods that employ neuromorphic cameras for capturing clear data (D2Net-NeRF [28], EDI-NeRF [22], and E2NeRF [23]). To be specific, Deblur-NeRF [15] incorporates a deformable kernel to learn the blurring process, which is then passed on to the NeRF MLP. MPR-NeRF [33] is a single-image deblurring technique that deblurs images before NeRF training. D2Net-NeRF [28] and EDI-NeRF [22] employ event-based deblurring, with NeRF being trained on the resulting deblurred images. Lastly, E2NeRF [23] integrates an event-based loss for deblurring during NeRF training and achieves state-of-the-art performance.
4.2 Result Analysis
Synthetic data
Method | PSNR (Blur View) | PSNR (Novel View) | SSIM (Blur View) | SSIM (Novel View) | LPIPS (Blur View) | LPIPS (Novel View)
---|---|---|---|---|---|---
NeRF | 22.91 | 22.27 | .9072 | .9018 | .1441 | .1483
Deblur-NeRF | 21.71 | 19.93 | .8795 | .8584 | .2364 | .2573
D2Net-NeRF | 27.46 | 26.65 | .9450 | .9427 | .1029 | .1087
EDI-NeRF | 27.94 | 27.71 | .9497 | .9522 | .0860 | .0896
MPR-NeRF | 27.93 | 27.91 | .9525 | .9571 | .0882 | .0861
E2NeRF5× | 29.16 | 29.09 | .9592 | .9571 | .0828 | .0826
E2NeRF2× | 23.41 | 23.17 | .9048 | .9034 | .1579 | .1588
SpikeNeRF5× | 29.36 | 29.15 | .9669 | .9654 | .0625 | .0624
SpikeNeRF | 28.46 | 28.27 | .9603 | .9595 | .0787 | .0790

As detailed in Sec. 4.1, we evaluated various NeRF methods using three metrics: PSNR, SSIM, and LPIPS. Our approach, integrating spike streams, outperformed baseline models across the dataset, showcasing state-of-the-art deblurring performance, and outstripping traditional NeRF deblurring techniques and E2NeRF under similar test conditions (Fig. 5 and Tab. 2).
We noted that E2NeRF necessitates independent rendering to calculate the event-based loss, incurring substantial extra computational costs. For comparison, we include results from the optimal setting reported in the original E2NeRF [23], which samples 5 individual timestamps (4 bins) for each pixel (referred to as E2NeRF5×), along with a comparable setting of ours, SpikeNeRF5×. Additionally, we present results from E2NeRF's minimum-cost setting (1 bin, 2 individual timestamps) as E2NeRF2×. Lastly, the default setting of our SpikeNeRF does not entail any significant additional cost and is simply referred to as SpikeNeRF.
In addition to quantitative analysis, we also conducted qualitative comparisons on synthetic data (Fig. 5). The experiments included the majority of other models from the quantitative analysis. From the final results, it is evident that SpikeNeRF still exhibits the best performance, reconstructing images with greater clarity and richer texture details.
Real-world Data

We conducted a range of qualitative and quantitative experiments on real-world datasets, employing both NeRF and 3DGS as fundamental methodologies. Subsequently, we applied diverse deblurring techniques to process images from our RS-3D dataset, with the comparative results depicted in Fig. 6. Our approach demonstrates promising outcomes when applied to real-world data. The spike stream exhibits an exceptionally high frame rate, effectively mitigating motion blur during recorded motion processes and compensating for the loss of information inherent in the input blurry RGB frame.
4.3 Runtime Test and Ablation Study

The rationale behind the superior efficiency of spike streams over event streams in learnable 3D reconstruction tasks lies primarily in their ability to avoid independent rendering along the time axis. The experimental results in Fig. 7 corroborate this hypothesis. In addition to the runtime experiments, we also investigated the impact of different spike texture reconstruction methods. Specifically, we compared direct utilization of raw spike streams, TFP texture, TFI texture, and a combination of both. The results show that employing both TFP and TFI yields benefits across all metrics, particularly perceptual loss (LPIPS). Subsequently, we conducted experiments with varying numbers of spike texture reconstructions used in training each view and with different weights $\lambda$ of the TfS loss. Once the scale of the spike training data reaches 1, metric growth noticeably plateaus, indicating that spike-enhanced deblurring exhibits higher data efficiency than the event-based method. The choice of the TfS loss weight $\lambda$ had a subtle impact on experimental outcomes; within our settings, its effect peaked at an optimal value (0.0001).
5 Potential Advantages

In Fig. 8 we demonstrate additional advantages that spike streams bring to our test, including facilitating camera pose estimation and mitigating overexposure. We utilize COLMAP [26, 27] for camera pose estimation. Direct estimation of camera poses from blurry RGB inputs is notoriously difficult. However, our experiments reveal that spike reconstruction significantly enhances COLMAP’s ability to accurately estimate camera poses. The robustness of COLMAP’s model estimation algorithms, such as RANSAC and bundle adjustment, largely depends on the availability of high-quality and consistent feature points across images. Spike reconstruction contributes to this process by providing denser and clearer feature points, which are essential for precise feature matching in COLMAP.
6 Conclusion and Future Works
In summary, our study demonstrates, from various perspectives, the superior performance of spike streams compared to event streams for 3D reconstruction from blurry images. Our novel Texture from Spike (TfS) loss function, coupled with an efficient end-to-end training process, achieves state-of-the-art performance without incurring substantial additional computational costs. In addition, the creation of the synthetic and real-world RS-3D datasets marks a pioneering step towards furthering research on the mutual enhancement of spike data and Implicit Neural Representation (INR) networks. Our approach opens new avenues for leveraging the high temporal resolution of spike cameras, surpassing traditional RGB cameras in capturing fast-moving scenes without motion blur and in challenging lighting conditions.
Future work could explore the extension of the SpikeNeRF framework to other domains where high-speed capture is crucial. Despite the limited availability of spike hardware in the market, we anticipate that future research will unveil further applications for spikes in diverse visual tasks, similar to event cameras but with their distinctive attributes.
Appendix - SpikeNeRF: Enhancing Learning-based 3D Reconstruction from Blurry Images using Spike Camera
1 Overview
This appendix of SpikeNeRF includes:
- Experimental Setups: The model aligns with standard NeRF configurations in terms of depth and width parameters and is trained on a single NVIDIA A100 GPU. Baseline descriptions and detailed hyperparameters are provided, ensuring a balance between computational efficiency and performance.
- Computational Complexity Analysis: A detailed comparison between the computational costs of E2NeRF and SpikeNeRF is provided, highlighting the efficiency of SpikeNeRF in resource management.
- Comparison between Cameras and Future Works: The cost of the equipment poses a practical concern. We compare the cost and configuration of spike, event, and high-speed RGB cameras, and we explore potential applications of next-generation spike cameras that incorporate an RGB channel.
2 Experimental Setups
Our study undertakes a detailed comparison with the E2NeRF model, building upon its framework and incorporating elements from the standard NeRF architecture as described in [23]. The training was conducted on a single NVIDIA A100 GPU, mirroring the typical resource allocation in standard NeRF implementations. Notably, our approach aligns with common NeRF configurations, specifically in terms of the 'depth' and 'width' parameters of the model. These parameters are crucial, as the 'depth' corresponds to the number of dense layers and the 'width' to the number of units in each layer, as is standard in NeRF models. We also include the baseline descriptions in Tab. 4.
Parameter | Value
---|---
Weight (λ) | 0.0001
Threshold (θ) | 2
Batch Size | 1024
Coordinate Normalization Range | [-1, 1]
TFP Window Size | 6
Iteration Count | 200,000
Method | Description |
---|---|
Deblur-NeRF[15] | Deblur with extra network, NeRF train on blurry images |
MPR[33]-NeRF | Single-image deblurring, NeRF train on deblurred images |
D2Net[28]-NeRF | Event-based deblurring, NeRF train on deblurred images |
EDI[22]-NeRF | Event-based deblurring, NeRF train on deblurred images |
E2NeRF[23] | Event-based loss deblur during training NeRF |
3 Analysis of Complexity: Computational Cost Comparison Between E2NeRF and SpikeNeRF
As we discussed in Sec. 1 of our main text, there exists a fundamental trade-off between event-loss accuracy and computational efficiency in event-based neural rendering frameworks. To illustrate, consider two sampled timestamps $t_1$ and $t_2$ whose brightness values $B(t_1)$ and $B(t_2)$ are similar. These values do not trigger an event because their difference falls within the non-activation range ($|B(t_2) - B(t_1)| < \Theta$), where $\Theta$ denotes the event threshold. However, introducing an intermediary timestamp $t_m$ may reveal that an event does occur within the interval $t_1$ to $t_2$, particularly if $|B(t_m) - B(t_1)| \geq \Theta$.
A straightforward solution to this detection dilemma is to decrease the time sampling interval, thereby capturing more granular changes in brightness. Yet, this approach introduces a significant computational overhead. In E2NeRF [23], for instance, each additional sampled timestamp proportionally increases the computational load for event-based loss computation, as illustrated in Fig. 1 in our main text.

To quantify this complexity, we analyze the inference operations in the original NeRF network architecture, as shown in Fig. 9 (a). The computational load of each fully connected (dense) layer is determined by the number of neurons in both the current and the preceding layer. Consequently, the total inference complexity, denoted as $C_{\mathrm{infer}}$, is calculated by summing the products of neuron counts across successive layers:

$$C_{\mathrm{infer}} = \sum_{l=1}^{L-1} n_l \cdot n_{l+1} \tag{10}$$

In the case of E2NeRF, which employs an event-based loss, each rendered event requires independent calculations, resulting in a computational cost of at least $K \cdot C_{\mathrm{infer}}$ if we have $K$ sampling timestamps. Conversely, SpikeNeRF primarily incurs additional computational expense in the RGB converter layer, represented by $C_{\mathrm{conv}}$. This distinction underscores the efficiency of SpikeNeRF in managing computational resources.
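A back-of-the-envelope check of this comparison, with layer widths assumed to follow the standard 8×256 NeRF MLP and the converter modeled as a single 3-to-1 layer (both are our assumptions, not figures from the paper):

```python
def inference_cost(layer_widths):
    """Eq. 10: sum of products of neuron counts across successive layers."""
    return sum(a * b for a, b in zip(layer_widths, layer_widths[1:]))

# Assumed widths of a standard NeRF MLP (encoded input -> 8x256 -> RGB + sigma)
nerf_widths = [63, 256, 256, 256, 256, 256, 256, 256, 256, 4]
c_infer = inference_cost(nerf_widths)
c_conv = 3 * 1                     # learnable RGB-to-grayscale converter

K = 5                              # sampled timestamps per pixel (E2NeRF 5x setting)
print("E2NeRF-style cost :", K * c_infer)       # ~= K * C_infer
print("SpikeNeRF cost    :", c_infer + c_conv)  # C_infer + tiny converter overhead
```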
4 Comparison between Cameras and Future works
Column 3 of Fig. 10 presents a comparison between spike cameras, event cameras, and regular high-speed cameras in terms of specifications and prices. Spike cameras prove to be more cost-effective than high-speed cameras: with similar hardware costs, they yield superior reconstruction results while requiring significantly less training expenditure compared to event cameras. Column 4 shows that next-generation spike cameras offer higher resolution and RGB content capabilities, enabling direct utilization for NeRF or 3DGS training purposes, a potential breakthrough for scenes involving relative movement between the camera and targets.

References
- [1] Chen, A., Xu, Z., Zhao, F., Zhang, X., Xiang, F., Yu, J., Su, H.: Mvsnerf: Fast generalizable radiance field reconstruction from multi-view stereo. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 14124–14133 (2021)
- [2] Daniel, F.Y., Fessler, J.A.: Mean and variance of single photon counting with deadtime. Physics in Medicine & Biology 45(7), 2043 (2000)
- [3] Deng, K., Liu, A., Zhu, J.Y., Ramanan, D.: Depth-supervised nerf: Fewer views and faster training for free. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12882–12891 (2022)
- [4] Dong, S., Huang, T., Tian, Y.: Spike camera and its coding methods. 2017 Data Compression Conference (DCC) pp. 437–437 (2017), https://api.semanticscholar.org/CorpusID:3839066
- [5] Dong, S., Huang, T., Tian, Y.: Spike camera and its coding methods. arXiv preprint arXiv:2104.04669 (2021)
- [6] Dong, Y., Zhao, J., Xiong, R., Huang, T.: High-speed scene reconstruction from low-light spike streams. 2022 IEEE International Conference on Visual Communications and Image Processing (VCIP) pp. 1–5 (2022). https://doi.org/10.1109/VCIP56404.2022.10008850
- [7] Garbin, S.J., Kowalski, M., Johnson, M., Shotton, J., Valentin, J.: Fastnerf: High-fidelity neural rendering at 200fps. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 14346–14355 (2021)
- [8] Hu, T., Liu, S., Chen, Y., Shen, T., Jia, J.: Efficientnerf efficient neural radiance fields. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12902–12911 (2022)
- [9] Hwang, I., Kim, J., Kim, Y.M.: Ev-nerf: Event based neural radiance field (2023)
- [10] Isfahani, S.M.M., Choi, J., Yoon, K.J.: Learning to super resolve intensity images from events. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp. 2765–2773 (2019). https://doi.org/10.1109/cvpr42600.2020.00284
- [11] Isfahani, S.M.M., Nam, Y., Choi, J., Yoon, K.J.: E2sri: Learning to super-resolve intensity images from events. IEEE Transactions on Pattern Analysis and Machine Intelligence 44, 6890–6909 (2022). https://doi.org/10.1109/TPAMI.2021.3096985
- [12] Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G.: 3d gaussian splatting for real-time radiance field rendering (2023)
- [13] Lichtsteiner, P., Delbruck, T., Posch, C.: A 100db dynamic range high-speed dual-line optical transient sensor with asynchronous readout. In: 2006 IEEE International Symposium on Circuits and Systems. pp. 4–pp. IEEE (2006)
- [14] Liu, X., van de Weijer, J., Bagdanov, A.D.: Rankiqa: Learning from rankings for no-reference image quality assessment. 2017 IEEE International Conference on Computer Vision (ICCV) pp. 1040–1049 (2017), https://api.semanticscholar.org/CorpusID:6736352
- [15] Ma, L., Li, X., Liao, J., Zhang, Q., Wang, X., Wang, J., Sander, P.V.: Deblur-nerf: Neural radiance fields from blurry images. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp. 12851–12860 (2021), https://api.semanticscholar.org/CorpusID:244714238
- [16] Martin-Brualla, R., Radwan, N., Sajjadi, M.S.M., Barron, J.T., Dosovitskiy, A., Duckworth, D.: Nerf in the wild: Neural radiance fields for unconstrained photo collections (2021)
- [17] Messikommer, N., Georgoulis, S., Gehrig, D., Tulyakov, S., Erbach, J., Bochicchio, A., Li, Y., Scaramuzza, D.: Multi-bracket high dynamic range imaging with event cameras. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) pp. 546–556 (2022). https://doi.org/10.1109/CVPRW56347.2022.00070
- [18] Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM 65, 99–106 (2020), https://api.semanticscholar.org/CorpusID:213175590
- [19] Mittal, A., Moorthy, A.K., Bovik, A.C.: Blind/referenceless image spatial quality evaluator. 2011 Conference Record of the Forty Fifth Asilomar Conference on Signals, Systems and Computers (ASILOMAR) pp. 723–727 (2011), https://api.semanticscholar.org/CorpusID:16388844
- [20] Müller, T., Evans, A., Schied, C., Keller, A.: Instant neural graphics primitives with a multiresolution hash encoding. ACM Transactions on Graphics (ToG) 41(4), 1–15 (2022)
- [21] Pan, L., Scheerlinck, C., Yu, X., Hartley, R., Liu, M., Dai, Y.: Bringing a blurry frame alive at high frame-rate with an event camera. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6820–6829 (2019)
- [22] Pan, L., Scheerlinck, C., Yu, X., Hartley, R.I., Liu, M., Dai, Y.: Bringing a blurry frame alive at high frame-rate with an event camera. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp. 6813–6822 (2018), https://api.semanticscholar.org/CorpusID:53749928
- [23] Qi, Y., Zhu, L., Zhang, Y., Li, J.: E2nerf: Event enhanced neural radiance fields from blurry images. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 13254–13264 (October 2023)
- [24] Rudnev, V., Elgharib, M., Theobalt, C., Golyanik, V.: Eventnerf: Neural radiance fields from a single colour event camera (2023)
- [25] Schonberger, J.L., Frahm, J.M.: Structure-from-motion revisited. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 4104–4113 (2016)
- [26] Schönberger, J.L., Frahm, J.M.: Structure-from-motion revisited. In: Conference on Computer Vision and Pattern Recognition (CVPR) (2016)
- [27] Schönberger, J.L., Zheng, E., Pollefeys, M., Frahm, J.M.: Pixelwise view selection for unstructured multi-view stereo. In: European Conference on Computer Vision (ECCV) (2016)
- [28] Shang, W., Ren, D., Zou, D., Ren, J.S.J., Luo, P., Zuo, W.: Bringing events into video deblurring with non-consecutively blurry frames. 2021 IEEE/CVF International Conference on Computer Vision (ICCV) pp. 4511–4520 (2021), https://api.semanticscholar.org/CorpusID:244102332
- [29] Song, C., Huang, Q.X., Bajaj, C.: E-cir: Event-enhanced continuous intensity recovery. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp. 7793–7802 (2022). https://doi.org/10.1109/CVPR52688.2022.00765
- [30] Xu, Q., Xu, Z., Philip, J., Bi, S., Shu, Z., Sunkavalli, K., Neumann, U.: Point-nerf: Point-based neural radiance fields. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5438–5448 (2022)
- [31] Yu, A., Li, R., Tancik, M., Li, H., Ng, R., Kanazawa, A.: Plenoctrees for real-time rendering of neural radiance fields. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 5752–5761 (2021)
- [32] Yu, A., Ye, V., Tancik, M., Kanazawa, A.: pixelnerf: Neural radiance fields from one or few images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4578–4587 (2021)
- [33] Zamir, S.W., Arora, A., Khan, S.H., Hayat, M., Khan, F.S., Yang, M.H., Shao, L.: Multi-stage progressive image restoration. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) pp. 14816–14826 (2021), https://api.semanticscholar.org/CorpusID:231802205
- [34] Zhao, J., Xiong, R., Zhao, R., Wang, J., Ma, S., Huang, T.: Motion estimation for spike camera data sequence via spike interval analysis. 2020 IEEE International Conference on Visual Communications and Image Processing (VCIP) pp. 371–374 (2020). https://doi.org/10.1109/VCIP49819.2020.9301840
- [35] Zhao, J., Xie, J., Xiong, R., Zhang, J., Yu, Z., Huang, T.: Super resolve dynamic scene from continuous spike streams. 2021 IEEE/CVF International Conference on Computer Vision (ICCV) pp. 2513–2522 (2021). https://doi.org/10.1109/ICCV48922.2021.00253
- [36] Zhao, J., Xiong, R., Huang, T.: High-speed motion scene reconstruction for spike camera via motion aligned filtering. 2020 IEEE International Symposium on Circuits and Systems (ISCAS) pp. 1–5 (2020). https://doi.org/10.1109/ISCAS45731.2020.9181055
- [37] Zhao, J., Xiong, R., Xie, J., Shi, B., Yu, Z., Gao, W., Huang, T.: Reconstructing clear image for high-speed motion scene with a retina-inspired spike camera. IEEE Transactions on Computational Imaging 8, 12–27 (2022). https://doi.org/10.1109/tci.2021.3136446
- [38] Zhu, L., Dong, S., Li, J., Huang, T., Tian, Y.: Ultra-high temporal resolution visual reconstruction from a fovea-like spike camera via spiking neuron model. IEEE Transactions on Pattern Analysis and Machine Intelligence 45, 1233–1249 (2022). https://doi.org/10.1109/TPAMI.2022.3146140