
Spike-NeRF: Neural Radiance Field Based On Spike Camera

Yijia Guo1, Yuanxi Bai2, Liwen Hu1, Mianzhi Liu2, Ziyi Guo2, Lei Ma1,2*, Tiejun Huang1
1National Engineering Research Center of Visual Technology (NERCVT), Peking University
2College of Future Technology, Peking University
* Corresponding author. This paper has been accepted by ICME 2024.
Abstract

As a neuromorphic sensor with high temporal resolution, the spike camera offers notable advantages over traditional cameras in high-speed vision applications such as high-speed optical flow estimation, depth estimation, and object tracking. Inspired by the success of the spike camera, we propose Spike-NeRF, the first Neural Radiance Field derived from spike data, to achieve 3D reconstruction and novel viewpoint synthesis of high-speed scenes. Instead of the multi-view images captured at the same time that NeRF relies on, the inputs of Spike-NeRF are continuous spike streams captured by a moving spike camera within a very short time. To reconstruct a correct and stable 3D scene from high-frequency but unstable spike data, we devise spike masks along with a distinctive loss function. We evaluate our method qualitatively and quantitatively on several challenging synthetic scenes generated by Blender with the spike camera simulator. Our results demonstrate that Spike-NeRF produces more visually appealing results than existing methods and the proposed baseline in high-speed scenes. Our code and data will be released soon.

Index Terms:
Neuromorphic Vision, Spike Camera, Neural Radiance Field.

I Introduction

Novel-view synthesis (NVS) is a long-standing problem that aims to render photo-realistic images of a scene from novel views given a sparse set of input images. This topic has recently seen impressive progress due to the use of neural networks to learn representations that are well suited for view synthesis, known as Neural Radiance Fields (NeRF) [1, 2, 3, 4, 5, 6, 7]. Despite this success, NeRF performs poorly in high-speed scenes, since the motion blur caused by high-speed motion violates NeRF's assumption that the input images are sharp. Deblurring methods such as Deblur-NeRF [8] and BAD-NeRF [9] can only handle mild motion blur. The introduction of high-speed neuromorphic cameras, such as event cameras [10] and spike cameras, is expected to fundamentally solve this problem.

The spike camera [11, 12] is a neuromorphic sensor in which each pixel captures photons independently, keeps recording the luminance intensity asynchronously, and outputs binary spike streams that record dynamic scenes at extremely high temporal resolution (40,000 Hz). Recently, many approaches use spike data to reconstruct images of high-speed scenes [13, 14, 15], or directly perform downstream tasks such as optical flow estimation [16] and depth estimation [17].

Motivated by the notable success of spike cameras and by NeRF's limitation, we propose Spike-NeRF, the first Neural Radiance Field built from spike data. Different from NeRF, we use a set of continuous spike streams as inputs instead of images captured from different viewpoints at the same time (see Figure 1). To reconstruct a volumetric 3D representation of a scene from spike streams and to generate new spike streams for novel views of this scene, we first propose a spiking volume renderer based on the coding method of the spike camera; it generates spike streams asynchronously from the radiance obtained by ray casting. Additionally, we use a spike loss to reduce local blur and spike masks to restrict NeRF to learning information in the triggered area, thereby mitigating artifacts caused by reconstruction errors and noise.

Figure 1: Existing works on NeRF (orange background) are reconstructed from image sequences generated by traditional cameras, which record the luminance intensity during the exposure time at a fixed frame rate and therefore produce strong blur in high-speed scenes. Our approach (blue background) produces significantly better and sharper results by using dense spike streams instead of image sequences.

Our experimental results show that Spike-NeRF is suitable for high-speed scenes that would not be conceivable with traditional cameras. Moreover, our method is largely superior to the baseline of directly supervising NeRF with images reconstructed from spike streams. Our main contributions can be summarized as follows:

  • Spike-NeRF, the first approach for inferring a NeRF from a spike stream, enabling novel view synthesis in both gray and RGB space for high-speed scenes.

  • A bespoke rendering strategy for spike streams, leading to data-efficient training and spike stream generation.

  • A dataset containing RGB spike data and high-frequency (40,000 fps) camera poses.

II Related Work

II-A NeRF on traditional cameras

Neural Radiance Fields (NeRF) [1] have arisen as a significant development in computer vision and computer graphics, synthesizing novel views of a scene from a sparse set of images by combining machine learning with geometric reasoning. Various approaches based on NeRF have been proposed recently. For example, [18, 19, 20, 21] extend NeRF to dynamic and non-rigid scenes, [2, 3] significantly improve the rendering quality of NeRF, and [9, 8, 22, 23] robustly handle severely blurred images that would otherwise degrade the rendering quality of NeRF.

Figure 2: Overview of our Spike-NeRF. As in NeRF, we use the color (C) and density ($\sigma$) generated by the MLPs as the input of the volume renderer (equation (7)) and the spiking volume renderer (equation (17)). A reconstruction loss is computed between the volume rendering result and the masked images reconstructed from the ground-truth spike streams, and a spike loss is computed between the spike rendering result generated by our spiking volume renderer and the ground-truth spike streams.

II-B NeRF on Neuromorphic Cameras

Neuromorphic sensors have shown their advantages in many computer vision problems, including novel view synthesis. EventNeRF [24] and Ev-NeRF [25] synthesize novel views under conditions such as high-speed movement that would not be conceivable with a traditional camera by using event supervision. Nonetheless, these works assume that event streams are temporally dense and low-noise, which is inaccessible in practice. Robust e-NeRF [26] incorporates a more realistic event generation model to directly and robustly reconstruct NeRFs under various real-world conditions. DE-NeRF [27] and E2NeRF [28] extend event-based NeRF to dynamic scenes and severely blurred images, mirroring the corresponding extensions of frame-based NeRF.

II-C Spike Camera Application

As a neuromorphic sensor with high temporal resolution, the spike camera [12] offers significant advantages in many high-speed vision tasks. [29] and [13] propose spike stream reconstruction methods for high-speed scenes. Subsequently, deep learning-based reconstruction frameworks [14, 15] were introduced to reconstruct images from spike streams robustly. Spike cameras have also shown their superiority in downstream tasks such as optical flow estimation [16, 30], monocular and stereo depth estimation [17, 31], super-resolution [32], and high-speed real-time object tracking [33].

III Preliminary

III-A Spike Camera And Its Coding Method

Unlike traditional cameras, which record the luminance intensity of each pixel during the exposure time at a fixed frame rate, each pixel of a spike camera captures photons independently and keeps recording the luminance intensity asynchronously without a dead zone.

Each pixel on the spike camera converts the light signal into a current signal. When the accumulated intensity reaches the dispatch threshold, a spike is fired and the accumulated intensity is reset. For a pixel $\boldsymbol{x}=(x,y)$, this process can be expressed as

$A(\boldsymbol{x},t) = A(\boldsymbol{x},t-1) + I(\boldsymbol{x},t)$  (1)

$s(\boldsymbol{x},t) = \begin{cases} 1 & \text{if } A(\boldsymbol{x},t-1)+I(\boldsymbol{x},t) > \phi \\ 0 & \text{otherwise} \end{cases}$  (2)

where:

$I(\boldsymbol{x},t) = \int_{t-1}^{t} I_{in}(\boldsymbol{x},\tau)\,d\tau \ \mathrm{mod}\ \phi$  (3)

Here $A(\boldsymbol{x},t)$ is the accumulated intensity at time $t$, $s(\boldsymbol{x},t)$ is the spike output at time $t$, and $I_{in}(\boldsymbol{x},\tau)$ is the input current at time $\tau$ (proportional to the light intensity). We directly use $I(\boldsymbol{x},t)$ to represent the luminance intensity to simplify the presentation. Further, due to circuit limitations, each spike is read out at discrete times $nT$, $n\in\mathbb{N}$, where $T$ is on the microsecond level. Thus, the output of the spike camera is a spatial-temporal binary stream $S$ of size $H\times W\times N$. Here, $H$ and $W$ are the height and width of the sensor, respectively, and $N$ is the temporal window size of the spike stream.
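To make the coding model concrete, the following is a minimal NumPy sketch of the integrate-and-fire process in equations (1)-(3). The array shapes, the function name, and the subtract-by-$\phi$ reset (chosen to be consistent with the mod $\phi$ in equation (3)) are illustrative assumptions, not the camera's actual readout implementation.

```python
import numpy as np

def simulate_spike_stream(luminance, phi=1.0, A0=None):
    """Integrate-and-fire sketch of the coding model in equations (1)-(3).

    luminance: (N, H, W) array, per-step integrated intensity I(x, t).
    phi:       dispatch threshold.
    A0:        optional (H, W) initial accumulator A(x, 0); defaults to zero.
    Returns a binary spike stream of shape (N, H, W).
    """
    N, H, W = luminance.shape
    A = np.zeros((H, W)) if A0 is None else A0.astype(float).copy()
    spikes = np.zeros((N, H, W), dtype=np.uint8)
    for t in range(N):
        A += luminance[t]      # accumulate intensity, equation (1)
        fired = A > phi        # threshold test, equation (2)
        spikes[t][fired] = 1
        A[fired] -= phi        # keep the residual charge, matching the mod-phi in (3)
    return spikes
```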

III-B Neural Radiance Field (NeRF) Theory

Neural Radiance Field (NeRF) uses a 5D vector-valued function to represent a continuous scene. The input to this function consists of a 3D location $\textbf{x} = (x, y, z)$ and a 2D viewing direction $\textbf{d} = (\theta, \phi)$, while the output is an emitted color $\textbf{c} = (r, g, b)$ and a volume density $\sigma$. Both $\sigma$ and $\textbf{c}$ are represented implicitly by multi-layer perceptrons (MLPs), written as:

$F_{\Theta}:(\textbf{x},\textbf{d})\to(\textbf{c},\sigma)$  (4)

Given the volume density $\sigma$ and color function $\textbf{c}$, the rendering result $I$ of any ray $\textbf{r}=\textbf{o}+t\textbf{d}$ passing through the scene can be computed using principles from volume rendering.

$I(\textbf{r})=\int_{t_{n}}^{t_{f}}T(t)\,\sigma(\textbf{r}(t))\,\textbf{c}(\textbf{r}(t),\textbf{d})\,dt$  (5)

where

$T(t)=\exp\!\left(-\int_{t_{n}}^{t}\sigma(\textbf{r}(s))\,ds\right)$  (6)

The function $T(t)$ denotes the accumulated transmittance along the ray from $t_{n}$ to $t$. For computational reasons, the ray is divided into $N$ equally spaced bins and one sample is drawn uniformly from each bin. Then, equation (5) can be approximated as

$I(\textbf{r})=\sum_{i=1}^{N}T_{i}\,(1-\exp(-\sigma_{i}\delta_{i}))\,\textbf{c}_{i}$  (7)

where:

$T_{i}=\exp\!\left(-\sum_{j=1}^{i-1}\sigma_{j}\delta_{j}\right)$  (8)

and:

$\delta_{i}=t_{i+1}-t_{i}$  (9)

After calculating the color $I(\textbf{r})$ for each pixel, a squared-error photometric loss is used to optimize the MLP parameters.

$L=\sum_{\textbf{r}\in R}\|I(\textbf{r})-I_{gt}(\textbf{r})\|^{2}$  (10)
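As a sanity check on the discretization in equations (7)-(9), here is a minimal PyTorch-style sketch of the quadrature and of the photometric loss in equation (10). Tensor shapes and function names are assumptions; the exclusive cumulative product realizes the transmittance $T_{i}$ of equation (8).

```python
import torch

def volume_render(sigmas, colors, deltas):
    """Discrete volume rendering of equations (7)-(9).

    sigmas: (R, N)    densities sigma_i along each of R rays
    colors: (R, N, 3) colors c_i at the sampled points
    deltas: (R, N)    bin lengths delta_i = t_{i+1} - t_i
    Returns the per-ray colors I(r), shape (R, 3).
    """
    alphas = 1.0 - torch.exp(-sigmas * deltas)          # per-bin opacity
    # Exclusive cumulative product gives T_i = exp(-sum_{j<i} sigma_j * delta_j)
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alphas[:, :1]), 1.0 - alphas + 1e-10], dim=-1),
        dim=-1)[:, :-1]
    weights = trans * alphas                            # T_i * (1 - exp(-sigma_i * delta_i))
    return (weights.unsqueeze(-1) * colors).sum(dim=1)

def photometric_loss(rendered, target):
    """Squared-error photometric loss of equation (10)."""
    return ((rendered - target) ** 2).sum()
```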

IV Method

IV-A Overview

Taking inspiration from NeRF, Spike-NeRF implicitly represents the static scene as an MLP network $F_{\Theta}$ with 5D inputs:

$F_{\Theta}:(\textbf{x}(t_{i}),\textbf{d}(t_{i}))\to(\textbf{c},\sigma)$  (11)

Here, each $t_{i}$ corresponds to a spike frame $s_{i}\in\{0,1\}^{W\times H}$ in the continuous spike stream $\mathbb{S}=\{s_{i}\,|\,i=0,1,2,\dots\}$ captured by a spike camera within a very short time. Considering the difficulty of directly using spike streams for supervision, we first reconstruct the spike stream $\mathbb{S}$ into an image sequence $\mathbb{I}=\{im_{i}\in\mathbb{R}^{W\times H}\,|\,i=0,1,2,\dots\}$, where $im_{i}$ is the image reconstructed at $t_{i}$. We use the results obtained with $\mathbb{I}$ as inputs as our baseline. Since all reconstruction methods build each image from multi-frame spikes, using the reconstructed images directly as the supervision signal leads to artifacts and blurring. We therefore introduce spike masks $M_{s}$ to make NeRF focus on the triggered area. We also propose a spiking volume renderer, based on the coding method of the spike camera, to generate spike streams for novel views; the ground-truth spike streams then constrain the network directly.

The total loss used to train Spike-NeRF is given by:

$L_{total}=L_{recon}+\lambda L_{spike}$  (12)

$L_{recon}$ is the loss between the image rendering result and the masked images reconstructed from the ground-truth spike streams. $L_{spike}$ is the loss between the spike rendering result generated by our spiking volume renderer and the ground-truth spike streams.
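The sketch below shows how the two terms of equation (12) could be combined; it is a minimal illustration, not the paper's exact implementation. The L1 form of both norms, the mean reduction, and the value of $\lambda$ are assumptions; how the non-differentiable spike generation interacts with optimization is discussed in Section IV-B.

```python
import torch

def spike_nerf_loss(rendered_img, masked_recon_img, rendered_spikes, gt_spikes, lam=0.1):
    """Total objective of equation (12): L_total = L_recon + lambda * L_spike.

    rendered_img / masked_recon_img: image rendered by NeRF vs. masked reconstruction
    rendered_spikes / gt_spikes:     spikes from the spiking volume renderer vs. ground truth
    """
    l_recon = torch.abs(rendered_img - masked_recon_img).mean()
    l_spike = torch.abs(rendered_spikes.float() - gt_spikes.float()).mean()
    return l_recon + lam * l_spike
```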

IV-B Spiking Volume Renderer

If we introduce time $t$ into the volume rendering equation (5), the rendering result $I(\textbf{r},t)$ of any ray $\textbf{r}(t)=\textbf{o}(t)+k\textbf{d}(t)$ at time $t$ is:

$I(\textbf{r},t)=\int_{k_{n}}^{k_{f}}T(k,t)\,\sigma(\textbf{r}(k,t))\,\textbf{c}(\textbf{r}(k,t),\textbf{d}(t))\,dk$  (13)

where

$T(k,t)=\exp\!\left(-\int_{k_{n}}^{k}\sigma(\textbf{r}(s,t))\,ds\right)$  (14)

Then, if we assume that $A(\boldsymbol{x},t_{0})=0$ for every pixel $\boldsymbol{x}$, equation (1) can be written as:

$A(\boldsymbol{x},t)=\int_{t_{0}}^{t}I(\boldsymbol{x},\tau)\,d\tau-N\phi$  (15)

Here, $\phi$ is the threshold of the spike camera, $N$ is the number of "1"s in the spike stream $\mathbb{S}(\boldsymbol{x})=\{s_{t_{i}}\,|\,t_{i}\in(t_{0},t)\}$, and $\boldsymbol{x}=(x,y)$ denotes the pixel coordinates. For computational reasons, the ray is divided into $N_{0}$ equally spaced bins, $(t_{0},t)$ is divided into $N_{1}$ equally spaced bins, and one sample is drawn uniformly from each bin. Then, equation (2) can be written as:

$s(\boldsymbol{x},t)=\begin{cases}1 & \text{if } \sum_{i=1}^{N_{1}}I(\boldsymbol{x},t_{i})-N\phi>\phi\\ 0 & \text{otherwise}\end{cases}$  (16)

where:

$I(\textbf{r},t)=\sum_{i=1}^{N_{0}}T_{i}(t)\,(1-\exp(-\sigma_{i}\delta_{i}))\,\textbf{c}_{i}(t)$  (17)

where:

$T_{i}(t)=\exp\!\left(-\sum_{j=1}^{i-1}\sigma_{j}\delta_{j}\right)$  (18)

and:

$\delta_{i}=k_{i+1}-k_{i}$  (19)

However, $A(\boldsymbol{x},t_{0})$ is not equal to 0 in real situations. To address this, we introduce a random startup matrix and use the stable results obtained after several frames. The above processes do not participate in backpropagation as they are not differentiable. After generating the spike streams $\mathbb{S}$, we can compute:

$L_{spike}=\sum_{\textbf{r}\in R}\|\mathbb{S}(\boldsymbol{x})-\mathbb{S}_{gt}(\boldsymbol{x})\|$  (20)
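A minimal sketch of the spiking volume renderer described by equations (15)-(16) follows. It assumes the per-timestamp intensities $I(\boldsymbol{x},t_{i})$ have already been produced by the volume renderer of equation (17); the tensor shapes, the function name, and the subtract-by-$\phi$ reset are assumptions used for illustration.

```python
import torch

def spiking_volume_render(intensities, phi=1.0, seed=0):
    """Spiking volume renderer sketch, equations (15)-(16).

    intensities: (N1, H, W) tensor of intensities I(x, t_i) obtained by volume-rendering
                 the scene (equation (17)) at N1 consecutive spike timestamps.
    Returns a binary spike stream of shape (N1, H, W).

    Since A(x, t0) is unknown in practice, the accumulator starts from a random
    startup matrix in [0, phi) and only the frames after a short warm-up should be
    kept. Spike generation is not differentiable, so it is excluded from backprop.
    """
    gen = torch.Generator().manual_seed(seed)
    N1, H, W = intensities.shape
    A = torch.rand((H, W), generator=gen) * phi          # random startup matrix
    spikes = torch.zeros((N1, H, W), dtype=torch.uint8)
    with torch.no_grad():
        for i in range(N1):
            A = A + intensities[i]
            fired = A > phi                              # accumulated intensity exceeds the threshold
            spikes[i][fired] = 1
            A = torch.where(fired, A - phi, A)           # subtracting phi realises the -N*phi term
    return spikes
```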
Figure 3: Comparisons on novel view synthesis. We compare our results with three baselines: NeRF, BAD-NeRF, and NeRF+Spk2ImgNet. More details are shown in the green box. The results of NeRF and BAD-NeRF show significant blur, while the results of NeRF+Spk2ImgNet show artifacts; our results are sharp. The supplement shows more details.

IV-C Spike Masks

Due to the serious lack of information in a single spike frame, all reconstruction methods use multi-frame spikes as input. These methods can reconstruct images with detailed textures from spike streams, but they also introduce erroneous information by using spikes from preceding and following frames (see "original" in Figure 4), which results in foggy edges in the scene learned by NeRF. We introduce spike masks $M_{s}$ to solve this problem.

Figure 4: Effective areas (white when r+g+b > 0 and black when r+g+b = 0) for different solutions. Compared with GT, there are obvious error areas when masks are not used, and cavities when single-frame masks are used. Our solution solves both problems.
Figure 5: Comparison between a spike frame and an image. Compared with images, spikes lack texture details and contain a lot of noise, because a single spike frame carries less information than an image.
TABLE I: Comparison of our method against NeRF, BAD-NeRF, and NeRF+Spk2ImgNet on multiple scenes. Each cell reports SSIM ↑ / PSNR ↑. Our method consistently produces better results. Best results are marked in red and the second best results are marked in yellow.

RGB scenes:
Method | chair | drums | ficus | hotdog | lego | materials | average
NeRF | 0.754 / 21.14 | 0.593 / 21.40 | 0.707 / 22.31 | 0.752 / 19.31 | 0.455 / 17.24 | 0.533 / 18.36 | 0.632 / 19.96
BAD-NeRF [CVPR 23] | 0.604 / 19.44 | 0.563 / 20.70 | 0.597 / 20.97 | 0.658 / 19.42 | 0.385 / 16.02 | 0.397 / 17.56 | 0.534 / 19.02
NeRF+Spk2ImgNet [CVPR 21] | 0.961 / 32.21 | 0.899 / 29.90 | 0.908 / 27.80 | 0.920 / 28.92 | 0.837 / 26.00 | 0.877 / 28.27 | 0.901 / 28.85
Ours | 0.973 / 32.90 | 0.922 / 30.16 | 0.936 / 29.10 | 0.923 / 29.69 | 0.861 / 26.32 | 0.912 / 28.67 | 0.921 / 29.48

Gray scenes:
Method | chair | drums | ficus | hotdog | lego | materials | average
NeRF | 0.662 / 17.09 | 0.437 / 15.85 | 0.628 / 16.59 | 0.243 / 18.06 | 0.292 / 14.06 | 0.365 / 14.34 | 0.438 / 16.00
BAD-NeRF [CVPR 23] | 0.646 / 15.12 | 0.510 / 14.38 | 0.624 / 16.18 | 0.431 / 15.87 | 0.372 / 11.91 | 0.341 / 12.60 | 0.487 / 14.34
NeRF+Spk2ImgNet [CVPR 21] | 0.803 / 27.36 | 0.671 / 23.58 | 0.827 / 24.79 | 0.528 / 25.47 | 0.636 / 22.97 | 0.615 / 23.12 | 0.680 / 24.55
Ours | 0.881 / 31.70 | 0.764 / 25.81 | 0.874 / 26.18 | 0.581 / 26.79 | 0.710 / 25.07 | 0.769 / 25.43 | 0.763 / 26.83

Because of the spatial sparsity of spike streams, using a single-frame spike mask leads to a large number of cavities. To address this, we use a relatively small number of neighboring spike frames to fill the cavities. Considering the spike frame $s_{i}$ at time $t_{i}$ and the reconstruction result $im_{i}$, we first choose $\mathbb{S}_{t_{i}}=\{s_{j}\in\mathbb{R}^{W\times H}\,|\,j=i-n,i-n+1,\dots,i,\dots,i+n-1,i+n\}$ as the original mask sequence. Finally, we have:

$M_{s}=s_{i-n}\,|\,s_{i-n+1}\,|\,\dots\,|\,s_{i+n-1}\,|\,s_{i+n}$  (21)

where $|$ denotes the element-wise logical OR. After masking the image sequence $\mathbb{I}$, we can compute:

$L_{recon}=\sum_{\textbf{r}\in R}\|M_{s}(\mathbb{I}(\boldsymbol{x}))-\mathbb{I}_{gt}(\boldsymbol{x})\|$  (22)
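The mask construction of equation (21) and a masked loss in the spirit of equation (22) can be sketched as follows. The window indexing, the application of the mask to both images, and the L1 norm are assumptions made for illustration.

```python
import torch

def spike_mask(spikes, i, n):
    """Spike mask of equation (21): logical OR over the 2n+1 spike frames centred
    at index i (assumes n <= i < N - n), marking pixels that fired in the window."""
    window = spikes[i - n : i + n + 1]                   # frames s_{i-n}, ..., s_{i+n}
    return window.any(dim=0)                             # boolean (H, W) mask M_s

def masked_recon_loss(rendered_img, recon_img, mask):
    """Masked reconstruction loss in the spirit of equation (22): pixels outside the
    mask are ignored so NeRF only learns from areas that actually fired."""
    m = mask.unsqueeze(-1).float()                       # broadcast over colour channels
    return torch.abs(m * (rendered_img - recon_img)).sum()
```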

V Experiment

We adopt novel view synthesis (NVS) as the standard task to verify our method. We first compare our method with NeRF approaches for traditional cameras and with the proposed baseline. We then conduct comprehensive quantitative ablation studies to demonstrate the contribution of the designed modules.

V-A Implementation Details

Our code is based on NeRF [1], and we train the models for $2\times10^{5}$ iterations on one NVIDIA A100 GPU with the same optimizer and hyper-parameters as NeRF. Since the spiking volume renderer requires continuous multi-frame spikes, we select the camera poses and sampling points for spiking volume rendering deterministically rather than randomly as NeRF does. We evaluate our method on synthetic sequences from NeRF [1], using six scenes (lego, ficus, chair, materials, hotdog, and drums) that cover different conditions. For each scene, we render a 0.025-second-long 360-degree rotation of the camera around the object, producing 1000 views to simulate the 40,000 fps spike camera, and render blurred images over the same trajectory to simulate a 400 fps high-speed traditional camera with Blender. Like NeRF, we directly use the corresponding camera intrinsics and extrinsics generated by Blender.
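The capture setup can be made explicit with a small sketch of the camera trajectory: 1000 poses over 0.025 s correspond to one pose per spike frame at 40,000 fps. The orbit radius, height, and look-at convention below are illustrative assumptions; the actual dataset uses poses exported directly from Blender.

```python
import numpy as np

def orbit_poses(n_views=1000, duration=0.025, radius=4.0, height=0.5):
    """Generate n_views camera-to-world matrices on a 360-degree orbit around the
    origin completed in `duration` seconds (one pose per spike frame at 40,000 fps)."""
    timestamps = np.linspace(0.0, duration, n_views, endpoint=False)
    angles = 2.0 * np.pi * timestamps / duration          # uniform angular velocity
    poses = []
    for a in angles:
        cam_pos = np.array([radius * np.cos(a), radius * np.sin(a), height])
        forward = -cam_pos / np.linalg.norm(cam_pos)       # look at the origin
        right = np.cross(forward, np.array([0.0, 0.0, 1.0]))
        right /= np.linalg.norm(right)
        up = np.cross(right, forward)
        c2w = np.eye(4)                                    # OpenGL-style camera-to-world
        c2w[:3, 0], c2w[:3, 1], c2w[:3, 2], c2w[:3, 3] = right, up, -forward, cam_pos
        poses.append(c2w)
    return timestamps, np.stack(poses)
```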

Figure 6: Importance of the spike loss. Cavities and blur appear in both RGB and gray space when the spike loss is disabled.
Figure 7: Importance of the spike masks. Artifacts appear in both RGB and gray space when the spike masks are disabled.

V-B Comparisons against other Methods

We compare the Spike-NeRF results with three baselines: Spk2ImgNet+NeRF [14], and NeRF and BAD-NeRF [9] trained on the 400 fps traditional camera frames (see Figure 3). To better demonstrate the adaptability of our method to spike cameras, we show both gray and RGB results. Figure 3 shows that our method has clear advantages over NeRF and BAD-NeRF in high-speed scenes. Compared with directly using spike reconstruction results for training, our method also shows obvious superiority. Corresponding numerical results are reported in Table I, from which we can conclude that our method improves more in gray space. We also compare the information content of the two modalities, spikes and images. From Figure 5 we can see that spikes carry less information, resulting in noise and loss of detail. However, our method leverages temporal consistency (see Section IV) to derive stable 3D representations from information-lacking and unstable spike streams.

TABLE II: Ablation on the spike masks and spike loss. Best results are marked in red and the second best results are marked in yellow.
Method | SSIM ↑ | PSNR ↑ | LPIPS ↓
W/O spike masks | 0.899 | 28.97 | 0.064
W/O spike loss | 0.918 | 29.18 | 0.067
Full | 0.921 | 29.48 | 0.061
 

V-C Ablation

Compared with the baselines, our Spike-NeRF introduces two main components: the spike masks and the spiking volume renderer with the spike loss. Next, we discuss their impact on the results.

Spike loss: In Section IV, we proposed the spike loss to address the cavities caused by the partial information loss due to the spike masks and the blur caused by incorrect reconstruction. Figure 6 shows the results with and without the spike loss: after disabling the spike loss, some scenes show obvious degradation in details and a large number of erroneous holes. Table II shows the improvement brought by the spike loss.

Spike masks: Incorrect reconstruction also leads to a large number of artifacts in NeRF results. We use spike masks to eliminate these artifacts to the maximum extent (see Section IV). Figure 7 shows the results with and without the spike masks: after disabling the spike masks, all scenes show obvious artifacts. Table II shows the improvement brought by the spike masks.

VI Conclusion

We introduced the first approach to reconstruct a 3D scene from spike streams, enabling photorealistic novel view synthesis in both gray and RGB space. Thanks to the high temporal resolution and unique coding method of the spike camera, Spike-NeRF shows clear advantages in high-speed scenes. Further, we proposed the spiking volume renderer and spike masks, so that Spike-NeRF outperforms the baselines in terms of scene stability and texture details. Our method can also directly generate spike streams. To the best of our knowledge, this is also the first time that spike cameras have been used in the field of 3D representation.

Limitation: Due to the difficulty of collecting real spike data with camera poses, Spike-NeRF is only tested on synthetic datasets. In addition, Spike-NeRF assumes that the only moving object in the scene is the spike camera itself. We believe that NeRF based on spike cameras has greater potential for handling high-speed rigid and non-rigid motions of other objects, which future work can investigate.

References

  • [1] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, “Nerf: Representing scenes as neural radiance fields for view synthesis,” Communications of the ACM, vol. 65, no. 1, pp. 99–106, 2021.
  • [2] J. T. Barron, B. Mildenhall, M. Tancik, P. Hedman, R. Martin-Brualla, and P. P. Srinivasan, “Mip-nerf: A multiscale representation for anti-aliasing neural radiance fields,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 5855–5864.
  • [3] J. T. Barron, B. Mildenhall, D. Verbin, P. P. Srinivasan, and P. Hedman, “Mip-nerf 360: Unbounded anti-aliased neural radiance fields,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 5470–5479.
  • [4] R. Martin-Brualla, N. Radwan, M. S. Sajjadi, J. T. Barron, A. Dosovitskiy, and D. Duckworth, “Nerf in the wild: Neural radiance fields for unconstrained photo collections,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 7210–7219.
  • [5] A. Chen, Z. Xu, A. Geiger, J. Yu, and H. Su, “Tensorf: Tensorial radiance fields,” in European Conference on Computer Vision.   Springer, 2022, pp. 333–350.
  • [6] Z. Wang, S. Wu, W. Xie, M. Chen, and V. A. Prisacariu, “Nerf–: Neural radiance fields without known camera parameters,” arXiv preprint arXiv:2102.07064, 2021.
  • [7] K. Zhang, G. Riegler, N. Snavely, and V. Koltun, “Nerf++: Analyzing and improving neural radiance fields,” arXiv preprint arXiv:2010.07492, 2020.
  • [8] L. Ma, X. Li, J. Liao, Q. Zhang, X. Wang, J. Wang, and P. V. Sander, “Deblur-nerf: Neural radiance fields from blurry images,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 12 861–12 870.
  • [9] P. Wang, L. Zhao, R. Ma, and P. Liu, “Bad-nerf: Bundle adjusted deblur neural radiance fields,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 4170–4179.
  • [10] J.-W. Liu, Y.-P. Cao, W. Mao, W. Zhang, D. J. Zhang, J. Keppo, Y. Shan, X. Qie, and M. Z. Shou, “Devrf: Fast deformable voxel radiance fields for dynamic scenes,” Advances in Neural Information Processing Systems, vol. 35, pp. 36 762–36 775, 2022.
  • [11] P. Joshi and S. Prakash, “Retina inspired no-reference image quality assessment for blur and noise,” Multimedia Tools and Applications, vol. 76, pp. 18 871–18 890, 2017.
  • [12] S. Dong, T. Huang, and Y. Tian, “Spike camera and its coding methods,” arXiv preprint arXiv:2104.04669, 2021.
  • [13] L. Zhu, J. Li, X. Wang, T. Huang, and Y. Tian, “Neuspike-net: High speed video reconstruction via bio-inspired neuromorphic cameras,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 2400–2409.
  • [14] J. Zhao, R. Xiong, H. Liu, J. Zhang, and T. Huang, “Spk2imgnet: Learning to reconstruct dynamic scene from continuous spike stream,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 11 996–12 005.
  • [15] J. Zhang, S. Jia, Z. Yu, and T. Huang, “Learning temporal-ordered representation for spike streams based on discrete wavelet transforms,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 1, 2023, pp. 137–147.
  • [16] L. Hu, R. Zhao, Z. Ding, L. Ma, B. Shi, R. Xiong, and T. Huang, “Optical flow estimation for spiking camera,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 17 844–17 853.
  • [17] Y. Wang, J. Li, L. Zhu, X. Xiang, T. Huang, and Y. Tian, “Learning stereo depth estimation with bio-inspired spike cameras,” in 2022 IEEE International Conference on Multimedia and Expo (ICME).   IEEE, 2022, pp. 1–6.
  • [18] K. Park, U. Sinha, J. T. Barron, S. Bouaziz, D. B. Goldman, S. M. Seitz, and R. Martin-Brualla, “Nerfies: Deformable neural radiance fields,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 5865–5874.
  • [19] A. Pumarola, E. Corona, G. Pons-Moll, and F. Moreno-Noguer, “D-nerf: Neural radiance fields for dynamic scenes,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 10 318–10 327.
  • [20] Z. Li, S. Niklaus, N. Snavely, and O. Wang, “Neural scene flow fields for space-time view synthesis of dynamic scenes,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 6498–6508.
  • [21] Z. Yan, C. Li, and G. H. Lee, “Nerf-ds: Neural radiance fields for dynamic specular objects,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 8285–8295.
  • [22] D. Lee, M. Lee, C. Shin, and S. Lee, “Dp-nerf: Deblurred neural radiance field with physical scene priors,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 12 386–12 396.
  • [23] D. Lee, J. Oh, J. Rim, S. Cho, and K. M. Lee, “Exblurf: Efficient radiance fields for extreme motion blurred images,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 17 639–17 648.
  • [24] V. Rudnev, M. Elgharib, C. Theobalt, and V. Golyanik, “Eventnerf: Neural radiance fields from a single colour event camera,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 4992–5002.
  • [25] I. Hwang, J. Kim, and Y. M. Kim, “Ev-nerf: Event based neural radiance field,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023, pp. 837–847.
  • [26] W. F. Low and G. H. Lee, “Robust e-nerf: Nerf from sparse & noisy events under non-uniform motion,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 18 335–18 346.
  • [27] Q. Ma, D. P. Paudel, A. Chhatkuli, and L. Van Gool, “Deformable neural radiance fields using rgb and event cameras,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 3590–3600.
  • [28] Y. Qi, L. Zhu, Y. Zhang, and J. Li, “E2nerf: Event enhanced neural radiance fields from blurry images,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 13 254–13 264.
  • [29] Y. Zheng, L. Zheng, Z. Yu, B. Shi, Y. Tian, and T. Huang, “High-speed image reconstruction through short-term plasticity for spiking cameras,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 6358–6367.
  • [30] S. Chen, Z. Yu, and T. Huang, “Self-supervised joint dynamic scene reconstruction and optical flow estimation for spiking camera,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 1, 2023, pp. 350–358.
  • [31] J. Zhang, L. Tang, Z. Yu, J. Lu, and T. Huang, “Spike transformer: Monocular depth estimation for spiking camera,” in European Conference on Computer Vision.   Springer, 2022, pp. 34–52.
  • [32] J. Zhao, R. Xiong, J. Zhang, R. Zhao, H. Liu, and T. Huang, “Learning to super-resolve dynamic scenes for neuromorphic spike camera,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 3, 2023, pp. 3579–3587.
  • [33] Y. Zheng, Z. Yu, S. Wang, and T. Huang, “Spike-based motion estimation for object tracking through bio-inspired unsupervised learning,” IEEE Transactions on Image Processing, vol. 32, pp. 335–349, 2022.