
Vector-Symbolic Architecture for Event-Based Optical Flow

Hongzhi You [1], Yijun Cao, Wei Yuan, Fanjun Wang, Ning Qiao, Yongjie Li

[1] MOE Key Laboratory for NeuroInformation, School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, Sichuan, China
[2] School of Electronic Engineering and Automation, Guilin University of Electronic Technology, Guilin, Guangxi, China
[3] SynSense Tech. Co. Ltd., Ningbo, Zhejiang, China
Abstract

From the perspective of feature matching, optical flow estimation for event cameras involves identifying event correspondences by comparing feature similarity across accompanying event frames. In this work, we introduce an effective and robust high-dimensional (HD) feature descriptor for event frames, utilizing Vector Symbolic Architectures (VSA). The topological similarity among neighboring variables within VSA enhances the representation similarity of feature descriptors for flow-matching points, while its structured symbolic representation capacity facilitates feature fusion from both event polarities and multiple spatial scales. Based on this HD feature descriptor, we propose a novel feature matching framework for event-based optical flow, encompassing both model-based (VSA-Flow) and self-supervised learning (VSA-SM) methods. In VSA-Flow, accurate optical flow estimation validates the effectiveness of the HD feature descriptors. In VSA-SM, a novel similarity maximization method based on the HD feature descriptor is proposed to learn optical flow in a self-supervised way from events alone, eliminating the need for auxiliary grayscale images. Evaluation results demonstrate that our VSA-based method achieves superior accuracy compared to both model-based and self-supervised learning methods on the DSEC benchmark, while remaining competitive among both on the MVSEC benchmark. This contribution marks a significant advancement in event-based optical flow within the feature matching methodology.

keywords:
Vector-symbolic architecture, Optical flow, Event camera, Feature matching

1 Introduction

Event-based cameras are bio-inspired vision sensors that asynchronously provide per-pixel brightness changes as an event stream [18]. Leveraging their high temporal resolution, high dynamic range, and low latency, these cameras have the potential to enable accurate motion estimation, particularly optical flow [6, 3]. However, compared to traditional cameras, event-based optical flow estimation poses challenges due to the asynchronous and sparse nature of event-based visual information and the difficulty of obtaining ground-truth optical flow [18, 52]. Therefore, it is crucial to develop unsupervised optical flow methods that capitalize on the unique characteristics of event data, eliminating the dependency on expensive-to-collect and error-prone ground truth [52].

Optical flow estimation involves finding pixel correspondences between images captured at different moments. The feature matching method, a fundamental approach for event-based optical flow, relies on maximizing feature similarity between accompanying frames [18]. In this method, the feature for each event is typically represented by the image pattern around the corresponding pixel in the event frame [40, 41]. However, the inherent randomness in events [18] results in inconsistent image patterns of the same object across frames, posing challenges in acquiring accurate and robust feature descriptors. Due to the absence of an effective event-only local feature descriptor, feature matching methods for event-based optical flow are generally limited to estimating sparse optical flow for key points and show suboptimal performance [40, 41]. During self-supervised learning, accurate dense optical flow estimation becomes challenging without restoring luminance or additional sensor information, such as grayscale images [64, 22, 11, 13].

In this study, we introduce a high-dimensional (HD) feature descriptor for event frames, leveraging the Vector Symbolic Architecture (VSA). VSAs, regarded for their effectiveness in utilizing high-dimensional distributed vectors [33, 34], have traditionally been employed in symbolic representations of artificial shapes [29, 50, 51, 24] or few-shot classification tasks [23, 30]. In this work, VSAs form the basis of our novel descriptor for natural scenes captured by event cameras. This descriptor utilizes the local similarity characteristics of neighboring variables within VSA [14, 50] to reduce the impact of event randomness on representation accuracy. Employing structured symbolic representations [35], it achieves multi-spatial-scale and two-polarity feature fusion for the feature descriptor. Our evaluation of descriptor similarity for flow-matching points on the DSEC and MVSEC datasets demonstrates the effectiveness of the proposed approach.

Further, we focus on a unifying framework for event-based optical flow within the feature matching strategy, centered around the proposed HD feature descriptors. The model-based VSA-Flow method, derived from this framework, utilizes the similarity of HD feature descriptors to achieve more accurate dense optical flow. Integrating similarity in the cost volume from three event frame pairs with progressively doubling time intervals at gradually downsampled scales enables VSA-Flow to estimate large optical flow within a limited neighboring region. Meanwhile, the proposed VSA-SM method relies on a similarity maximization (SM) proxy loss for predicted flow-matching points. This novel self-supervised learning approach effectively estimates optical flow from event-only HD feature descriptors, eliminating the need for additional sensor information. Evaluation results show that we obtain the best accuracy among both model-based and self-supervised learning methods on the DSEC-Flow benchmark, and competitive performance on the MVSEC benchmark.

2 Related Works

2.1 Event-based Optical Flow Estimation

From a methodological perspective, event-based optical flow estimation encompasses three primary approaches [18]. The first is the gradient-based method, which leverages the spatial and temporal derivative information provided by event data, either directly or after appropriate processing, to compute optical flow [5, 6]. Previous studies have explored event-based adaptations of Horn-Schunck and Lucas-Kanade [26, 42, 5, 2], distance surfaces [3, 8] and spatio-temporal plane fitting [6, 1].

The second approach is the feature matching method, which calculates optical flow by evaluating the similarity or correlation of feature representations for individual pixels between consecutive event frames in the temporal domain. For instance, the model-based EDFLOW estimates optical flow by applying adaptive block matching [40, 41]. Meanwhile, this approach is frequently employed in the design of learning-based optical flow neural networks that incorporate cost volume modules capable of computing feature similarity or correlation, such as E-RAFT [21] and TMA [61]. In addition, treating auxiliary grayscale images as low-dimensional features, EV-FlowNet engages in self-supervised learning by minimizing the intensity difference between warped images based on the estimated optical flow [64].

The third approach, exclusive to event cameras, is the contrast maximization method. This method maximizes an objective function, often related to contrast, to quantify the alignment of events generated by the same scene edge [53, 16, 17]. The underlying idea is to estimate motion by reconstructing a clear motion-compensated image of the edge patterns that triggered the events. This approach can be applied not only to model-based optical flow estimation [52] but also frequently serves as a loss function for unsupervised and self-supervised optical flow learning [52, 60, 46, 22, 47].

In contrast to prior work, our proposed VSA-based framework for event-based optical flow adopts a classical feature matching approach to offer deeper insights into the problem. This framework is adaptable to both model-based and self-supervised learning methods, akin to the contrast maximization method [16, 52]. Particularly, the self-supervised learning method in the framework can achieve accurate optical flow solely from event-only VSA-based HD feature descriptors, eliminating the need for auxiliary grayscale images.

2.2 High-dimensional Representations of Images Using Vector Symbolic Architecture

Vector Symbolic Architectures (VSAs) are regarded as a powerful algorithmic framework that leverages high-dimensional distributed vectors and employs specific algebraic operations and structured symbolic representations [33, 34]. VSAs have demonstrated remarkable capabilities in various domains, including spatial cognition and visual scene understanding. The hypervector encoding of color images and event frames, including artificial shapes, is achieved through a superposition of spatial index vectors weighted by the corresponding image pixel values [51, 50]. These HD representations find application in neuromorphic visual scene understanding [51] and visual odometry [50]. Leveraging the structured symbolic representation capacity of VSAs, a biologically inspired spatial representation has been employed to generate hierarchical cognitive maps, each containing objects at various locations [35]. Moreover, several VSA-based approaches have been introduced as frameworks for systematic aggregation of image descriptors suitable for visual place recognition [45, 31]. Overall, VSA endows HD representations of images with intrinsic attributes of hierarchical structure and semantics.

Accurate representations of feature descriptors that encompass individual pixels and their contextual features are crucial for optical flow estimation based on feature matching. In contrast to prior work, we adopt a specific type of VSA, the Vector Function Architecture (VFA), which embodies continuous similarity characteristics that reduce the impact of randomness in events. This specific VSA is employed as an HD kernel to extract localized feature information from event frames. Meanwhile, optical flow estimation models commonly incorporate a multi-scale pyramid design to enhance their performance. Utilizing the binding capacity of structured features in VSA, we amalgamate HD feature representations from multiple scales and two event polarities into a unified feature descriptor.

3 Methodology

3.1 Preliminary

VSAs constitute a family of computational models with vector representations that have two distinct properties [33, 14]. Firstly, symbols are represented by mutually orthogonal randomized $d$-dimensional vectors ($\in\mathbb{R}^{d}$), which facilitates a clear distinction between different symbols. Secondly, all computations within VSAs can be composed from a limited set of elementary vector algebraic operations, where the primary operations are binding ($\circ$) and superposition ($+$). The binding operation commonly signifies associations between symbols, such as a role-filler pair [27], while the superposition operation is frequently used to represent sets of symbols. Neither operation changes the hypervector dimensionality. Through the combination of these operations and symbols, VSAs can effectively achieve structured symbolic representations. For instance, consider a scenario in which the character $\boldsymbol{1}$ is located at position $\boldsymbol{P_{A}}$ and $\boldsymbol{2}$ at position $\boldsymbol{P_{B}}$ in a given image. The hypervector symbolic representation of this image can be denoted as $\boldsymbol{I}=\boldsymbol{P_{A}}\circ\boldsymbol{One}+\boldsymbol{P_{B}}\circ\boldsymbol{Two}$, where $\boldsymbol{P_{A}}$, $\boldsymbol{P_{B}}$, $\boldsymbol{One}$ and $\boldsymbol{Two}\in\mathbb{R}^{d}$ represent mutually orthogonal randomized hypervectors of the corresponding concepts.

VSAs have various models that use different types of random vectors [33]. In this study, an improved Holographic Reduced Representation (HRR) is employed as the VSA model to ensure high concept retrieval efficacy [19]. For HRR, the binding operation is the circular convolution of two hypervectors, and the superposition operation is the component-wise sum. Additionally, the similarity between two HRRs can be measured through the cosine similarity.
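As an illustration, the following minimal NumPy sketch implements these HRR operations (circular convolution via the FFT, component-wise superposition, cosine similarity); the helper names, the Gaussian initialization, and the unbinding demo are our assumptions rather than the authors' code.

```python
import numpy as np

d = 1024                                   # hypervector dimension
rng = np.random.default_rng(0)

def random_hv(dim):
    """Random hypervector with i.i.d. Gaussian components of variance 1/dim."""
    return rng.normal(0.0, 1.0 / np.sqrt(dim), dim)

def bind(a, b):
    """HRR binding: circular convolution, computed in the Fourier domain."""
    return np.fft.irfft(np.fft.rfft(a) * np.fft.rfft(b), n=len(a))

def superpose(*hvs):
    """HRR superposition: component-wise sum."""
    return np.sum(hvs, axis=0)

def cosine(a, b):
    """Cosine similarity between two hypervectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# Structured representation of the example image: I = P_A o One + P_B o Two
P_A, P_B, One, Two = (random_hv(d) for _ in range(4))
I = superpose(bind(P_A, One), bind(P_B, Two))

# Unbinding with the approximate HRR inverse (involution) retrieves One from I
P_A_inv = np.roll(P_A[::-1], 1)
print(cosine(bind(I, P_A_inv), One))       # noticeably higher than for unrelated vectors
```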

In this work, the feature extraction from event frames requires the VSA-based 2-D spatial representation. Here, we first introduce the fractional power encoding (FPE) method [48, 49] for representing integers along each coordinate axis in an image plane, and then the VSA-based spatial representation.

3.1.1 The Fractional Power Encoding Method

In the fractional power encoding method [49], let $x\in\mathbb{Z}$ be an integer and $X\in\mathbb{R}^{d}$ be a random hypervector; the hypervector representation $\mathbf{z}(x)\in\mathbb{R}^{d}$ of any integer $x$ can be obtained by repeatedly binding the base vector $X$ with itself $x$ times as follows:

\mathbf{z}(x) := X^{x} = (X)^{(\circ x)} = \mathcal{F}^{-1}\left\{\mathcal{F}\{X\}^{x}\right\} \qquad (1)

where the rightmost expression denotes the fractional binding operation expressed in the complex domain [35, 14]. $\mathcal{F}\{\cdot\}$ is the Fourier transform, and $\mathcal{F}\{\cdot\}^{x}$ is a component-wise exponentiation of the corresponding complex vector.

3.1.2 The VSA-based Spatial Representation

Recent studies have demonstrated that the hypervector spatial representation $D(x,y)\in\mathbb{R}^{d}$ of a point $(x,y)$ in 2-D space can be obtained using VSA with FPE [35, 15], as expressed in the following:

D(x,y) = X^{x} \circ Y^{y} \qquad (2)

Here, the random vectors $X$ and $Y\in\mathbb{R}^{d}$ are the base vectors for the horizontal and vertical axes, respectively. $X^{x}$ and $Y^{y}$ are pseudo-orthogonal representation vectors for distinct integer positions $x$ and $y$ along each axis.
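The sketch below shows one way to realize FPE (Eq. 1) and the spatial representation of Eq. 2 in NumPy, assuming unitary base vectors (unit-magnitude Fourier spectra) so that fractional binding stays well conditioned; the function names are illustrative.

```python
import numpy as np

d = 1024
rng = np.random.default_rng(0)

def unitary_base(dim):
    """Random real base vector whose Fourier spectrum has unit magnitude,
    built from random phases via the inverse real FFT."""
    phases = rng.uniform(-np.pi, np.pi, dim // 2 + 1)
    phases[0] = 0.0           # DC bin must be real
    phases[-1] = 0.0          # Nyquist bin must be real for even dim
    return np.fft.irfft(np.exp(1j * phases), n=dim)

def fpe(base, x):
    """Fractional power encoding (Eq. 1): z(x) = F^{-1}{ F{base}^x }."""
    return np.fft.irfft(np.fft.rfft(base) ** x, n=len(base))

def bind(a, b):
    """Binding via circular convolution."""
    return np.fft.irfft(np.fft.rfft(a) * np.fft.rfft(b), n=len(a))

X, Y = unitary_base(d), unitary_base(d)

def spatial(x, y):
    """VSA-based spatial representation D(x, y) = X^x o Y^y (Eq. 2)."""
    return bind(fpe(X, x), fpe(Y, y))

# Distinct integer positions yield pseudo-orthogonal hypervectors (basic VSA behaviour)
a, b = spatial(3, 5), spatial(4, 5)
print(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))   # close to 0
```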

3.2 The VSA-based Feature Matching Framework

This work aims to establish a novel framework for event-based optical flow utilizing VSA, adaptable to both model-based and self-supervised learning methods within the feature matching approach. Optical flow estimation involves finding pixel correspondences between images captured at distinct time intervals. Effective event representation and precise feature descriptors are essential in the framework.

3.2.1 Accumulative Time Surface

Event cameras are innovative bio-inspired sensors that respond to changes in brightness through continuous streams of events $\mathcal{E}=\{e_{1},e_{2},\cdots\}$ in a sparse and asynchronous manner. Each event $e_{k}=(x_{k},y_{k},t_{k},p_{k})$ comprises the space-time coordinates and the polarity $p_{k}\in\{+,-\}$. In this work, we use an event representation called the accumulative Time Surface (TS) [36, 63]. An accumulative TS at pixel $(x,y)$ and time $t$ is defined as follows:

\mathcal{T}(x,y,t) \doteq \sum_{t_{j}\leq t} \exp\left(-\frac{t-t_{j}(x,y)}{\tau_{TS}}\right) \qquad (3)

Here, $\tau_{TS}$ represents the exponential-decay rate, and $t_{j}$ denotes the timestamp of any event that occurred at pixel $(x,y)$ prior to time $t$. Thus, the accumulative TS emulates the synaptic activity that takes place after receiving the stream of events.
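For concreteness, a minimal sketch of Eq. 3 for a single polarity is given below, assuming events arrive as coordinate and timestamp arrays in seconds; the function name and the 35 ms decay default mirror Table 4 but are otherwise our choice.

```python
import numpy as np

def accumulative_time_surface(xs, ys, ts, t_ref, height, width, tau_ts=0.035):
    """Accumulative TS (Eq. 3): sum of exponentially decayed contributions of all
    events of one polarity that occurred before the reference time t_ref."""
    T = np.zeros((height, width), dtype=np.float64)
    keep = ts <= t_ref
    decay = np.exp(-(t_ref - ts[keep]) / tau_ts)
    # accumulate the decayed contribution of every event at its pixel location
    np.add.at(T, (ys[keep], xs[keep]), decay)
    return T

# e.g. one surface per polarity:
# T_pos = accumulative_time_surface(x[p > 0], y[p > 0], t[p > 0], t_ref, 480, 640)
```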

3.2.2 VSA-based HD Kernel for Feature Extraction

Figure 1: Topological similarity in 2-D space for the basic VSA and VFA HD kernels. Similarity between hypervectors in the basic VSA (a) and VFA (b) HD kernels, respectively, originating from the center and surrounding points. The comparison of hypervectors between the origin ($D(0,0)$) and points in its surrounding $N\times N$ neighborhood indicates that the VFA HD kernel, rather than the basic VSA HD kernel, captures spatial topological similarity. $N=21$ ($n=\lfloor N/2\rfloor=10$).

Utilizing the spatial representation described in Equation 2, the HD feature representation $F(x,y)\in\mathbb{R}^{d}$ of the $N\times N$ neighborhood centered around the pixel $(x,y)$ in the image $\mathcal{T}\in\mathbb{R}^{H\times W}$ can be encoded as a hypervector using the following formula [51]:

F(x,y) = \sum_{\varDelta x,\varDelta y} \mathcal{T}(x+\varDelta x, y+\varDelta y)\, D(\varDelta x,\varDelta y) \qquad (4)

where $(\varDelta x,\varDelta y)$ denotes the offset from the pixel $(x,y)$ to any pixel within its $N\times N$ neighborhood, in the range $[-n,n]$ with $n=\lfloor N/2\rfloor$. From the perspective of 2-D image convolution, we can use $D\in\mathbb{R}^{d\times N\times N}$ in Equation 4 as an HD kernel to extract local features within an $N\times N$ neighborhood for each pixel of the image. Consequently, the HD feature descriptor $F\in\mathbb{R}^{d\times H\times W}$ of the image $\mathcal{T}$ can be efficiently obtained by convolving $\mathcal{T}$ with the HD kernel $D$ [62] as follows:

F = \mathcal{T} \ast D \qquad (5)

In principle, feature descriptors are required to capture differences between distinct image patterns of event frames while exhibiting similarity among comparable patterns, displaying a certain degree of continuous similarity as image patterns vary. However, the basic VSA spatial representation defined in Equations 2 and 4 ignores important topological similarity relationships in 2-D space due to its pseudo-orthogonal property (Figure 1a) [15]. Given the inherent randomness in the event representations of the same object at different times, the spatial representation $D$ (Equation 2) is unsuitable as an HD kernel for feature extraction from event frames in tasks involving feature matching.

Recent studies have revealed that the Vector Function Architecture (VFA) [14] and the hyperdimensional transform [12] exhibit continuous translation-invariant similarity kernels. Inspired by these findings, and for the sake of simplicity, we employ a Gaussian-smoothed HD kernel $K\in\mathbb{R}^{d\times N\times N}$ with topological similarity to obtain the HD feature descriptors of the accumulative TS as follows:

K = D \ast G \qquad (6)

Here, $G$ represents a two-dimensional Gaussian kernel with a standard deviation of $\sigma_{K}$, which endows the HD kernel $K$ with a translation-invariant similarity characteristic similar to VFA (Figure 3, Equation 12 and Theorem 1 in [14]). Hence, we consider $K$ a specific instance of VFA. The corresponding hypervector spatial representation exhibits topological similarity relationships within a 2-D space (Figure 1b). Compared to the basic VSA (Equation 2), the local similarity characteristic of the spatial representation in VFA (Equation 6) effectively helps the feature descriptor reduce the impact of event randomness on representation accuracy. Unless explicitly noted, the VSA used in the following sections is VFA.
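A hedged PyTorch sketch of Eqs. 4–6 follows: the kernel $D$ is Gaussian-smoothed channel-wise to obtain $K$, and descriptors are extracted by 2-D cross-correlation, which matches the $(+\varDelta x, +\varDelta y)$ offset convention of Eq. 4. All helper names are ours, and `spatial_fn` is assumed to be the spatial-representation function from the FPE sketch above.

```python
import numpy as np
import torch
import torch.nn.functional as Fnn

def build_hd_kernel(spatial_fn, d, N=21, sigma_k=1.5):
    """Stack the spatial hypervectors D(dx, dy) over an N x N neighborhood, then
    smooth each of the d channels with a 2-D Gaussian to obtain K (Eq. 6)."""
    n = N // 2
    D = np.stack([[spatial_fn(dx, dy) for dy in range(-n, n + 1)]
                  for dx in range(-n, n + 1)])                    # (N, N, d)
    D = torch.from_numpy(D).permute(2, 0, 1).float()              # (d, N, N)
    ax = torch.arange(-n, n + 1, dtype=torch.float32)
    g = torch.exp(-ax ** 2 / (2 * sigma_k ** 2))
    G = torch.outer(g, g)
    G = (G / G.sum()).expand(d, 1, N, N).contiguous()             # same Gaussian per channel
    return Fnn.conv2d(D[None], G, padding=n, groups=d)[0]         # K: (d, N, N)

def hd_descriptors(T, K):
    """Per-pixel HD descriptors F = T * K (Eqs. 4-5). T: (H, W) time surface tensor."""
    n = K.shape[-1] // 2
    return Fnn.conv2d(T[None, None], K[:, None], padding=n)[0]    # (d, H, W)
```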

Figure 2: Schematic of the proposed VSA-Flow method for event-based optical flow. (a) Illustration of acquiring HD feature descriptors from accumulative TSs in a multi-scale strategy. (b) The VSA-Flow method consists of HD feature extractors, a cost volume module, and an optical flow estimator. The HD feature extractors capture HD feature descriptors from TSs. The cost volume module computes local visual similarity by forming a volume representing the similarity between 3 TS pairs with different time intervals at different scales. The optical flow estimator generates flow from the local visual similarity. (c) The mechanism allows the three cost volumes at different scales to be fused directly through summation to form the final local visual similarity within the cost volume module.

3.2.3 VSA-based HD Feature Descriptor

Inspired by classical estimation methods, feature descriptors at time $t$ are obtained through a multi-scale strategy [7, 43]. The VSA-based HD feature descriptor involves three steps (Figure 2a): transforming event streams into multiple scales of polarity-dependent accumulative TSs; generating HD feature descriptors for each scale by merging the TSs from both polarities; and amalgamating the HD feature descriptors from the various scales into the final HD descriptor at the original scale of the TSs. Here, we leverage role-filler binding [33] to fuse HD features, thereby realizing the structured representation of multi-scale and two-polarity HD feature descriptors.

First, the accumulative TSs $\mathcal{T}_{p}(t)\in\mathbb{R}^{H\times W}$ for each polarity $p\in\{+,-\}$ at time $t$ are obtained from the event streams according to Equation 3. $\mathcal{T}_{p}(t)$ then undergoes successive down-interpolation $S-1$ times at a ratio of $1/2$, resulting in a set of TSs $\mathcal{T}_{p}^{s,t}\in\mathbb{R}^{\frac{H}{2^{s}}\times\frac{W}{2^{s}}}$ ($s=0,1,\cdots,S-1$).

Second, utilizing the polarity-dependent HD kernel $K_{p}$ (Equation 6), the HD feature descriptor $F_{p}^{s,t}\in\mathbb{R}^{d\times\frac{H}{2^{s}}\times\frac{W}{2^{s}}}$ for the corresponding TS of each polarity $p$ at Scale $s$ can be efficiently computed as follows:

F_{p}^{s,t} = \mathcal{T}_{p}^{s,t} \ast K_{p} \qquad (7)

By role-filler binding, the HD feature descriptor $F^{s,t}\in\mathbb{R}^{d\times\frac{H}{2^{s}}\times\frac{W}{2^{s}}}$ for each scale is obtained from the corresponding polarity-specific HD feature descriptors as follows:

F^{s,t} = F_{+}^{s,t} \circ R_{+} + F_{-}^{s,t} \circ R_{-} \qquad (8)

where $R_{+}$ and $R_{-}$ denote the random role (key) vectors for the two polarities, respectively.

Finally, the HD feature descriptor $F^{t}\in\mathbb{R}^{d\times H\times W}$ at time $t$, incorporating multiple spatial scales, can be represented as follows:

F^{t} = \sum_{s=0}^{S-1} F^{s,t} \circ R^{s} \qquad (9)

Here, $R^{s}$ denotes the random role (key) vector for the corresponding spatial scale. The $F^{s,t}\in\mathbb{R}^{d\times H\times W}$ used in Equation 9 is obtained through up-interpolation of the $F^{s,t}$ in Equation 8 to the original resolution.
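The following sketch illustrates one way Eqs. 8–9 could be realized in PyTorch, binding each pixel's descriptor with a role vector via circular convolution along the channel axis and up-interpolating every scale before superposition; the helper names and the bilinear up-interpolation are our assumptions.

```python
import torch
import torch.nn.functional as Fnn

def bind_channels(F, role):
    """Bind every pixel's d-dimensional descriptor (dim 0) with a role hypervector
    via circular convolution, computed with the FFT along the channel axis."""
    Ff = torch.fft.fft(F, dim=0)
    Rf = torch.fft.fft(role).reshape(-1, 1, 1)
    return torch.fft.ifft(Ff * Rf, dim=0).real

def fuse_descriptors(F_pos, F_neg, R_pos, R_neg, R_scales):
    """Eqs. 8-9: F_pos / F_neg are lists over scales of (d, H/2^s, W/2^s) descriptors;
    R_pos, R_neg and R_scales[s] are random (d,) role vectors. Returns F^t of shape (d, H, W)."""
    H, W = F_pos[0].shape[1:]
    fused = torch.zeros_like(F_pos[0])
    for Fp, Fn, Rs in zip(F_pos, F_neg, R_scales):
        Fs = bind_channels(Fp, R_pos) + bind_channels(Fn, R_neg)            # Eq. 8
        Fs = Fnn.interpolate(Fs[None], size=(H, W), mode='bilinear',
                             align_corners=False)[0]                        # back to scale 0
        fused = fused + bind_channels(Fs, Rs)                               # Eq. 9
    return fused
```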

3.2.4 Description of the Framework

Optical flow estimation involves identifying pixel correspondences between images captured at two different moments in time. The foundation of the feature matching method lies in the assumption that accurately estimated optical flow corresponds to a higher similarity between matching pixels in accompanying event frames than between any other pixels. The VSA-based feature matching framework consists of two primary steps: 1) utilizing the VSA-based HD kernel to derive HD feature descriptors of consecutive event frames, and 2) employing algorithms such as search and optimization (for model-based methods) or neural networks with a proxy loss (for self-supervised learning methods). Both approaches aim to estimate optical flow by maximizing the similarity of the feature descriptors of flow-matching points. In the following, we apply this framework to a model-based method (VSA-Flow) and a self-supervised learning method (VSA-SM) for event-based optical flow.

3.3 VSA-Flow: A Model-based Method Using VSA

The details of VSA-Flow are illustrated in Figure 2b. It comprises three main components: the HD feature extractors, the cost volume module, and the optical flow estimator. The HD feature extractors obtain the VSA-based HD feature descriptors from the accumulative TSs essential for optical flow estimation. The cost volume module calculates local visual similarity by constructing a volume representing the similarity between all pairs of TSs. Finally, the optical flow estimator generates the optical flow based on the local visual similarity.

3.3.1 HD feature extractors

The accuracy of event-based optical flow estimation is hindered by the stochastic nature of events, especially when relying solely on two accumulative TSs with a time difference of $\varDelta t$. To address this limitation and incorporate more comprehensive intermediate motion information, we include accumulative TSs captured at times $0$, $\varDelta t/4$, $\varDelta t/2$, and $\varDelta t$, each with two polarities, successively denoted as $\mathcal{T}_{p}^{s,t}$ ($s=0$, $p\in\{+,-\}$ and $t=0,1,2,4$) in Figure 2b. By utilizing this extended set of event frames, we can achieve more precise optical flow estimation from time $0$ to $\varDelta t$. Notably, the latter three time instances follow a progressive doubling pattern ($\times 2$), which is explained further in the subsequent subsection. Following that, the HD feature descriptors $F^{t}$ ($t=0,1,2,4$) corresponding to the above event frames are acquired using the HD feature extractors described in Equation 9 and Figure 2a.

3.3.2 The cost volume module

Inspired by the basic cost volume in [56, 57], we adopt a strategy that integrates multiple pairs of HD feature descriptors with different time intervals: specifically, $F^{0}$ and $F^{1}$ at Scale 0, $F^{0}$ and $F^{2}$ at Scale 1, and $F^{0}$ and $F^{4}$ at Scale 2 (Figure 2b). The time interval $\varDelta t^{s}$ between $F^{0}$ and $F^{2^{s}}$ at Scale $s$ is $\varDelta t/4\cdot 2^{s}$ (Figure 2c). The HD feature descriptors at the latter two scales are obtained through average pooling of those at Scale 0 with kernel sizes 2 and 4, respectively, and equivalent stride. In this module, we first compute local visual similarity for each pair of HD descriptors $F^{0}$ and $F^{2^{s}}\in\mathbb{R}^{d\times H/2^{s}\times W/2^{s}}$ at Scale $s=0,1,2$. Specifically, the HD descriptor of any event in $F^{0}$ is compared for similarity only with the descriptors of pixels within a surrounding $M\times M$ neighborhood in $F^{2^{s}}$ (Figure 2c). Thus, the cost volume $C^{02^{s}}\in\mathbb{R}^{H/2^{s}\times W/2^{s}\times M\times M}$ can be efficiently computed using the cosine similarity as follows:

C_{ijkl}^{02^{s}} = \cos\left(F_{ij}^{0}, F_{kl}^{2^{s}}\right) \qquad (10)

The displacement $\mathrm{d}_{s}$, which maps each event in $F^{0}$ to its corresponding coordinates in $F^{2^{s}}$, is obtained through the maximal similarity (Figure 2c). The estimated optical flow $\mathrm{v}$ at the original scale of the event camera ($s=0$) can be calculated as follows:

\mathrm{v} = \frac{\mathrm{d}_{0}}{\varDelta t^{s}} = \frac{\mathrm{d}_{s}\cdot 2^{s}}{\varDelta t/4\cdot 2^{s}} = \frac{\mathrm{d}_{s}}{\varDelta t/4} \Longrightarrow \mathrm{d}_{s} = \frac{\varDelta t}{4}\mathrm{v} \qquad (11)

Here, $\mathrm{d}_{0}$ denotes the displacement corresponding to $\mathrm{d}_{s}$ transformed from Scale $s$ to Scale 0. Assuming the optical flow $\mathrm{v}$ is constant during the interval, Equation 11 reveals that $\mathrm{d}_{s}$ is independent of the scale and remains the same across scales. This implies that the cost volumes $C^{02^{s}}$ ($s=0,1,2$) should theoretically be identical across scales. Hence, by up-interpolating all cost volumes to the same size at Scale 0, we obtain the final cost volume $C\in\mathbb{R}^{H\times W\times M\times M}$ as the sum of all cost volumes (Figure 2b). This integration of similarity from three event frame pairs with progressively doubling time intervals at gradually downsampled scales enables VSA-Flow to estimate large optical flow within a limited $M\times M$ neighboring region.
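As an illustration, the sketch below computes the per-scale cost volume of Eq. 10 with `torch.nn.functional.unfold`; in practice the volume is large and would be tiled or computed on GPU, and all names are ours.

```python
import torch
import torch.nn.functional as Fnn

def cost_volume(F0, F1, M=31):
    """Cost volume of Eq. 10 for one scale: cosine similarity between every pixel of
    F0 and all pixels in its M x M neighborhood of F1. F0, F1: (d, H, W) descriptors."""
    d, H, W = F0.shape
    F0n = Fnn.normalize(F0, dim=0)                              # unit-norm descriptors
    F1n = Fnn.normalize(F1, dim=0)
    m = M // 2
    # gather, for every pixel, the M*M neighboring descriptors of F1
    patches = Fnn.unfold(F1n[None], kernel_size=M, padding=m)   # (1, d*M*M, H*W)
    patches = patches.view(d, M * M, H * W)
    # cosine similarity reduces to a dot product of the normalized descriptors
    C = (F0n.view(d, 1, H * W) * patches).sum(dim=0)            # (M*M, H*W)
    return C.view(M, M, H, W).permute(2, 3, 0, 1)               # (H, W, M, M)

# Final volume (Fig. 2b): up-interpolate the coarser-scale volumes to Scale 0 and sum them.
```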

3.3.3 The optical flow estimator

In this module, we adopt a scheme that combines optical flow probability volumes with a priori information on the position of the optical flow to estimate the flow (Figure 2b) [9]. The optical flow probability volumes $P\in\mathbb{R}^{H\times W\times M\times M}$ predict the probability of the optical flow within an $M\times M$ local area for each pixel, based on the final cost volume $C$. The a priori information on the flow position is provided by a predefined 2D grid template $T_{flow}\in\mathbb{R}^{2\times M\times M}$ of optical flow containing all possible flow directions that align with the probability volumes.

The optical flow probability volumes $P$ are calculated as follows:

P_{ij} = \frac{\bar{C}_{ij}}{\sum_{k=1}^{M}\sum_{l=1}^{M}\bar{C}_{ijkl}}, \qquad \bar{C}_{ij} = \mathrm{Max}\left(C_{ij} - \alpha C_{ij}^{Max} - (1-\alpha)C_{ij}^{Mean},\, 0\right) \qquad (12)

Here, $C_{ij}^{Max}$ and $C_{ij}^{Mean}$ represent the maximal and mean values of $C_{ij}\in\mathbb{R}^{M\times M}$, and the coefficient $\alpha$ determines the probability area contributing to the estimated optical flow. The cost volume $C$ in Equation 12 is obtained by average pooling the original $C$ from the cost volume module with a kernel size of $s_{c}$ and a stride of 1 to remove fluctuations due to the stochastic nature of events.

The optical flow template $T_{flow}$ is formed by concatenating two 2D grids along the $x$ and $y$ axes with a range of $[-m,m]\cdot 4/\varDelta t$, $m=\lfloor M/2\rfloor$, respectively [9]:

T_{x} = \frac{4}{\varDelta t}\begin{bmatrix} -m & \cdots & m \\ \vdots & \ddots & \vdots \\ -m & \cdots & m \end{bmatrix}, \qquad T_{y} = \frac{4}{\varDelta t}\begin{bmatrix} -m & \cdots & -m \\ \vdots & \ddots & \vdots \\ m & \cdots & m \end{bmatrix} \qquad (13)

Next, the optical flow $U\in\mathbb{R}^{H\times W\times 2}$ is estimated as a weighted average of the probability volumes $P$ over the predefined template $T_{flow}$ (Figure 2b) [9], formulated as:

U_{x}(i,j) = \sum_{k=1}^{M}\sum_{l=1}^{M} P_{ijkl}\, T_{x}(k,l), \qquad U_{y}(i,j) = \sum_{k=1}^{M}\sum_{l=1}^{M} P_{ijkl}\, T_{y}(k,l) \qquad (14)

where $U_{x}$ and $U_{y}$ represent the components of the predicted optical flow along the $x$ and $y$ axes, respectively.
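A hedged sketch of Eqs. 12–14 follows: the fused cost volume is thresholded into a probability volume and averaged against the flow template. The defaults for alpha and M follow Table 4; the average-pooling smoothing of C (kernel size s_c) is omitted for brevity, and all names and shapes are our assumptions.

```python
import torch

def estimate_flow(C, dt, alpha=0.85):
    """Eqs. 12-14: turn the fused cost volume C (H, W, M, M) into a dense flow field.
    dt is the full time interval; the template spans displacements in [-m, m]."""
    H, W, M, _ = C.shape
    m = M // 2
    Cf = C.reshape(H, W, -1)
    C_max = Cf.max(dim=-1, keepdim=True).values
    C_mean = Cf.mean(dim=-1, keepdim=True)
    C_bar = torch.clamp(Cf - alpha * C_max - (1 - alpha) * C_mean, min=0)   # Eq. 12
    P = C_bar / (C_bar.sum(dim=-1, keepdim=True) + 1e-12)                   # probability volume
    # flow template of Eq. 13: every candidate displacement scaled by 4 / dt
    ys, xs = torch.meshgrid(torch.arange(-m, m + 1, dtype=torch.float32),
                            torch.arange(-m, m + 1, dtype=torch.float32),
                            indexing='ij')
    Tx, Ty = xs.flatten() * 4.0 / dt, ys.flatten() * 4.0 / dt
    Ux = (P * Tx).sum(dim=-1)                                               # Eq. 14
    Uy = (P * Ty).sum(dim=-1)
    return torch.stack((Ux, Uy), dim=-1)                                    # (H, W, 2)
```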

Figure 3: Self-supervised optical flow learning via similarity maximization based on HD feature descriptors. (a) Multi-frame approach for flow refinement. Within the time interval $\Delta t$, we utilize $K=5$ pairs of HD feature descriptors ($F^{0}\rightarrow F^{k}$, $k=1,...,K$) with progressively incremented intervals to compute the similarity between events and their corresponding matching points, ultimately enhancing the accuracy of optical flow estimation. (b) Illustration of the similarity calculation for HD descriptors between events and their predicted flow-matching points.

3.4 VSA-SM: A Self-supervised Learning Method Through Similarity Maximization

Here, we adopt a self-supervised approach to learn optical flow estimation from accumulative TSs by maximizing the similarity of HD feature descriptors (Figure 3). We use a classical multi-frame approach for flow refinement, as illustrated in Figure 3a. During a time interval of $\Delta t$, we extract HD feature descriptors from the corresponding accumulative TSs at intervals of $\Delta t/K$ ($K=5$), yielding a set of $K+1$ descriptors denoted as $F^{k}$ ($k=0,...,K$). Assuming the optical flow within the interval $\Delta t$ is represented by $U$, the inferred optical flow between descriptor $F^{0}$ and descriptor $F^{k}$ equals $kU/K$. As a result, we utilize $K$ pairs of descriptors ($F^{0}\rightarrow F^{k}$, where $k=1,...,K$) to facilitate flow refinement within the context of self-supervised learning.

Knowing the per-pixel optical flow $\boldsymbol{u}(\boldsymbol{x})\in U$, the matching point at time $\frac{k}{K}\varDelta t$ can be obtained through:

\boldsymbol{x}_{i}^{\prime} = \boldsymbol{x}_{i} + \frac{k}{K}\boldsymbol{u}(\boldsymbol{x}_{i}) \qquad (15)

However, the matching point $\boldsymbol{x}_{i}^{\prime}$ may not correspond to an actual pixel. Thus, the similarity between the HD feature descriptor of $\boldsymbol{x}_{i}$ in $F^{0}$ and that of the matching point $\boldsymbol{x}_{i}^{\prime}$ in $F^{k}$ is calculated by evaluating its similarity to the descriptors of the 4 neighboring pixels ${\boldsymbol{x}_{i}^{j}}^{\prime}$ ($j=0,...,3$) around the matching point $\boldsymbol{x}_{i}^{\prime}$ in $F^{k}$, with normalized weights ${w_{i}^{j}}^{\prime}$ ($j=0,...,3$) obtained via bilinear interpolation (Figure 3b):

\mathrm{sim}_{k}(\boldsymbol{x}_{i},\boldsymbol{x}_{i}^{\prime}) = \sum_{j} \cos\left(F^{0}(\boldsymbol{x}_{i}),\, F^{k}({\boldsymbol{x}_{i}^{j}}^{\prime})\right)\cdot {w_{i}^{j}}^{\prime}, \qquad
{w_{i}^{j}}^{\prime} = \frac{\kappa(x_{i}^{\prime}-{x_{i}^{j}}^{\prime})\,\kappa(y_{i}^{\prime}-{y_{i}^{j}}^{\prime})}{\sum_{j}\kappa(x_{i}^{\prime}-{x_{i}^{j}}^{\prime})\,\kappa(y_{i}^{\prime}-{y_{i}^{j}}^{\prime})+\epsilon}, \qquad
\kappa(a) = \max(0, 1-|a|),\ \epsilon\approx 0 \qquad (16)
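A sketch of Eqs. 15–16 in PyTorch is given below, assuming the selected events and their flow are provided as coordinate tensors; the explicit loop over the four neighbors mirrors the bilinear weighting of Eq. 16, and all names are ours.

```python
import torch
import torch.nn.functional as Fnn

def matching_similarity(F0, Fk, xy, flow, k, K=5):
    """Eqs. 15-16. F0, Fk: (d, H, W) descriptors; xy: (N, 2) integer (x, y) pixel
    coordinates of selected events; flow: (N, 2) flow over the full interval dt.
    Returns the (N,) similarities between each event and its predicted match."""
    d, H, W = F0.shape
    xy_p = xy.float() + flow * (k / K)                    # Eq. 15: predicted matching points
    x0, y0 = xy_p[:, 0].floor().long(), xy_p[:, 1].floor().long()
    f0 = Fnn.normalize(F0[:, xy[:, 1].long(), xy[:, 0].long()], dim=0)      # (d, N)
    sims = torch.zeros(xy.shape[0])
    wsum = torch.zeros(xy.shape[0])
    for dx in (0, 1):                                     # the 4 surrounding pixels
        for dy in (0, 1):
            xj = (x0 + dx).clamp(0, W - 1)
            yj = (y0 + dy).clamp(0, H - 1)
            # bilinear weights kappa(x' - x_j) * kappa(y' - y_j)
            w = (1 - (xy_p[:, 0] - xj.float()).abs()).clamp(min=0) * \
                (1 - (xy_p[:, 1] - yj.float()).abs()).clamp(min=0)
            fj = Fnn.normalize(Fk[:, yj, xj], dim=0)
            sims = sims + w * (f0 * fj).sum(dim=0)
            wsum = wsum + w
    return sims / (wsum + 1e-9)                           # normalized weights of Eq. 16
```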

In this study, we use a similarity maximization proxy loss for feature matching, based on the similarity measure of Equation 16, to learn event-based optical flow estimation. Building upon the principles of previous unsupervised learning methods that emphasize contrast maximization [52, 47], we define the loss function $\mathcal{L}$ as follows:

\mathcal{L} = \mathcal{L}_{similarity} + \lambda\mathcal{L}_{smooth} \qquad (17)

which is a weighted combination of two terms: the similarity loss $\mathcal{L}_{similarity}$ and the smoothness term $\mathcal{L}_{smooth}$. The computation of the similarity loss involves $N$ pixels, encompassing the most recent events occurring before time 0 as well as pixels sampled at every 5th position both horizontally and vertically across the image plane (Figure 3a and 3b). The similarity loss is formulated as follows:

\mathcal{L}_{similarity} = 1 - \left<\mathrm{sim}\right>^{\alpha}, \qquad \left<\mathrm{sim}\right> = \frac{1}{KN}\sum_{k=1}^{K}\sum_{i=1}^{N}\mathrm{sim}_{k}(\boldsymbol{x}_{i},\boldsymbol{x}_{i}^{\prime}) \qquad (18)

Here, $\left<\mathrm{sim}\right>$ represents the average similarity over all relevant pixels within the $K$ pairs of descriptors, while $\alpha$ serves as a coefficient. A higher value of $\left<\mathrm{sim}\right>$ corresponds to more accurate optical flow estimation and a smaller similarity loss. Additionally, the smoothness term $\mathcal{L}_{smooth}$ adopts the Charbonnier smoothness prior [22, 65] or the first-order edge-aware smoothness [55].
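The proxy loss of Eqs. 17–18 with a Charbonnier smoothness prior might look as follows; lambda and alpha follow Table 4, the clamp on negative similarities is our own safeguard, and the function names are illustrative.

```python
import torch

def similarity_loss(sims, alpha=5.0):
    """Eq. 18: sims is a list of (N,) similarity tensors, one per descriptor pair k."""
    mean_sim = torch.cat(sims).mean()
    return 1.0 - mean_sim.clamp(min=0.0) ** alpha   # clamp is our safeguard for sim < 0

def charbonnier_smoothness(flow, eps=1e-3):
    """Charbonnier penalty on the spatial gradients of the predicted flow (2, H, W)."""
    dx = flow[:, :, 1:] - flow[:, :, :-1]
    dy = flow[:, 1:, :] - flow[:, :-1, :]
    return torch.sqrt(dx ** 2 + eps ** 2).mean() + torch.sqrt(dy ** 2 + eps ** 2).mean()

def vsa_sm_loss(sims, flow, lam=1.0):
    """Eq. 17: weighted combination of the similarity and smoothness terms."""
    return similarity_loss(sims) + lam * charbonnier_smoothness(flow)
```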

In this study, we train E-RAFT [21] in a self-supervised manner, utilizing the loss function described in Equation 17, to demonstrate the effectiveness of our self-supervised learning method based on similarity maximization of HD feature descriptors. In principle, this methodology holds applicability across various event-based optical flow networks. Meanwhile, we adopt the full-image warping technique [55] to improve flow quality near image boundaries.

4 Experiments

4.1 Datasets, Metrics and Implementation Details

Following previous works [21, 52], both VSA-Flow and VSA-SM are evaluated using the well-established event-based datasets DSEC-Flow ($640\times 480$ pixel resolution) [21] and MVSEC ($346\times 260$ pixel resolution) [64].

For the model-based method (VSA-Flow), experiments are conducted on the official testing set of the public DSEC-Flow benchmark and on the outdoor_day1 and three indoor_flying sequences with time intervals of $dt=1,4$ grayscale images on the MVSEC benchmark. For the self-supervised learning method (VSA-SM), E-RAFT is trained on the official training set of DSEC and on the outdoor_day2 sequence of MVSEC, respectively. To increase the variation in the optical flow magnitude during training, the training sequences on MVSEC are extended with time intervals of $dt=0.5,1,2,4,8$ grayscale images. Following separate training, evaluations are performed on the same testing sets as VSA-Flow, respectively on DSEC and MVSEC. Both methods are implemented using the PyTorch library. For VSA-SM training, the batch size is set to 1, the optimizer is Adam [32], and the learning rate is 1e-2.

We evaluate the accuracy of our predictions using the following metrics: (i) EPE, the endpoint error; (ii) $\%_{1\mathrm{PE}}$ and $\%_{3\mathrm{PE}}$, the percentage of points with EPE greater than 1 and 3 pixels, respectively; (iii) AE, the angular error. For both the DSEC-Flow [20, 21] and MVSEC [64] datasets, metrics are measured over pixels with valid ground truth and at least one event in the evaluation intervals.
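For reference, a minimal sketch of these metrics is given below, assuming flows are stored as (2, H, W) tensors and that AE is the angle between the space-time vectors (u, v, 1), which is the common convention on these benchmarks; names are ours.

```python
import torch

def flow_metrics(pred, gt, mask, n_pix=3):
    """pred, gt: (2, H, W) flow fields; mask: (H, W) bool of valid ground-truth pixels
    containing at least one event. Returns (EPE, %NPE, AE in degrees)."""
    err = torch.linalg.norm(pred - gt, dim=0)[mask]          # per-pixel endpoint error
    epe = err.mean()
    npe = (err > n_pix).float().mean() * 100.0               # percentage of outliers
    # angular error between the space-time vectors (u, v, 1)
    p = torch.cat((pred, torch.ones_like(pred[:1])), dim=0)
    g = torch.cat((gt, torch.ones_like(gt[:1])), dim=0)
    cos = (p * g).sum(dim=0) / (p.norm(dim=0) * g.norm(dim=0) + 1e-9)
    ae = torch.rad2deg(torch.acos(cos.clamp(-1.0, 1.0)))[mask].mean()
    return epe.item(), npe.item(), ae.item()
```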

Figure 4: Probability density of the similarity between matching points based on ground-truth (GT) optical flow on the two datasets. For each dataset, we compute the similarity of HD feature descriptors for $N_{match}$ pairs of flow-matching points according to the GT. The probability density of similarity refers to the likelihood that the feature similarity of flow-matching points equals a certain value. Compared to the basic VSA, VFA demonstrates an enhanced capability to encode the similarity of matching points in event frames.

4.2 Descriptor Similarity of Flow-Matching Points

In this study, HD feature descriptors are derived from feature extractors utilizing VSA-based HD kernels. We explore the impact of different VSA types (basic VSA and VFA) on the descriptor similarity among flow-matching points within the DSEC and MVSEC datasets (Figure 4).

In the basic VSA HD kernel, all hypervectors are pseudo-orthogonal, implying that each pixel within the neighborhoods contributes independently to the feature descriptor. Feature descriptors obtained from the basic VSA HD kernel reflect the most fundamental image patterns. Hence, Figure 4 (blue curves) reveals that the similarity of flow-matching points in the MVSEC dataset is inferior to that in the DSEC dataset. This observation suggests that, in comparison to the DSEC dataset, the MVSEC dataset experiences greater randomness in event frames, leading to lower event frame quality.

Figure 4 (red curves) illustrates that VFA yields higher descriptor similarity for flow-matching points compared to basic VSA. In contrast to basic VSA, VFA exhibits an improved ability to encode the similarity of flow-matching points in event frames.

Table 1: Results on the DSEC-Flow dataset [21]. Model-based (MB) methods need no training data; supervised learning (SL) methods need ground truth; the self-supervised learning (SSL) methods shown in this table only require events. Best accuracy is presented in bold, and the best accuracy within each type of method is underlined. EV-FlowNet [65] is retrained by the corresponding literature. $d=1024$, $N=21$ (the kernel size of the HD feature descriptors), $S=2$; $K=5$.
Methods EPE 1PE 3PE AE
MB MultiCM [52] 3.47 76.57 30.86 13.98
RTEF [8] 4.88 82.81 41.96 10.82
VSA-Flow (VFA) 3.46±0.06 68.94±0.57 28.97±0.40 9.45±0.17
VSA-Flow (Basic VSA) 4.19±0.15 77.50±1.08 32.34±0.72 13.41±0.56
SL EV-FlowNet [21] 2.32 55.4 18.6 -
E-RAFT [21] 0.79 12.74 2.68 2.85
IDNet [59] 0.72 10.07 2.04 2.72
TMA [39] 0.74 10.86 2.30 2.68
E-Flowformer [38] 0.76 11.23 2.45 2.68
SSL EV-FlowNet [47] 3.86 - 31.45 -
TamingCM [47] 2.33 68.29 17.77 10.56
VSA-SM (VFA) 2.22 55.46 16.83 8.86
Table 2: Results on the MVSEC dataset [64]. SSLF: semi-supervised learning methods that use grayscale images for supervision. Best accuracy is presented in bold, and the best accuracy among MB and SSL methods is underlined. Only Scale 0 ($F^{0}\rightarrow F^{4}$) is used in the cost volume module and $K=1$ due to the small $\Delta t$; $d=1024$, $N=25$ and $S=2$.
Methods indoor_flying1 indoor_flying2 indoor_flying3 outdoor_day1
EPE 3PE EPE 3PE EPE 3PE EPE 3PE
dt=1
MB Nagata et al. [44] 0.62 0.93 0.84 0.77
Akolkar et al. [1] 1.52 1.59 1.89 2.75
Brebion et al. [8] 0.52 0.10 0.98 5.50 0.71 2.10 0.53 0.20
MultiCM [52] 0.42 0.10 0.60 0.59 0.50 0.28 0.30 0.10
VSA-Flow (VFA) 0.46 0.05 0.65 1.08 0.53 0.29 0.65 3.60
SL EV-FlowNet+ [54] 0.56 1.00 0.66 1.00 0.59 1.00 0.68 0.99
E-RAFT [21] 1.10 5.72 1.94 30.79 1.66 25.20 0.24 0.00
TMA [39] 1.06 3.63 1.81 27.29 1.58 23.26 0.25 0.07
SSLF EV-FlowNet [64] 1.03 2.20 1.72 15.10 1.53 11.90 0.49 0.20
Spike-FlowNet [37] 0.84 —– 1.28 —– 1.11 —– 0.49 —–
STE-FlowNet [13] 0.57 0.10 0.79 1.60 0.72 1.30 0.42 0.00
SSL EV-FlowNet [54] 0.58 0.00 1.02 4.00 0.87 3.00 0.32 0.00
Hagenaars et al.[22] 0.60 0.51 1.17 8.06 0.93 5.64 0.47 0.25
VSA-SM (VFA) 0.57 0.07 0.91 3.91 0.69 1.63 0.46 3.42
dt=4
MB MultiCM [52] 1.69 12.95 2.49 26.35 2.06 19.03 1.25 9.21
VSA-Flow (VFA) 1.44 6.71 2.49 18.01 1.79 11.90 1.66 13.96
SL E-RAFT [21] 2.81 40.25 5.09 64.19 4.46 57.11 0.72 1.12
TMA [39] 2.43 29.91 4.32 52.74 3.60 42.02 0.70 1.08
SSLF EV-FlowNet [64] 2.25 24.70 4.05 45.30 3.45 39.70 1.23 7.30
Spike-FlowNet [37] 2.24 —– 3.83 —– 3.18 —– 1.09 —–
STE-FlowNet [13] 1.77 14.70 2.52 26.10 2.23 22.10 0.99 3.90
SSL EV-FlowNet [54] 2.18 24.20 3.85 46.80 3.18 47.80 1.30 9.70
Hagenaars et al.[22] 2.16 21.51 3.90 40.72 3.00 29.60 1.69 12.50
VSA-SM (VFA) 1.63 10.05 2.92 22.57 1.98 13.12 1.24 8.31

4.3 Results on DSEC

Table 1 presents the evaluation results on the DSEC-Flow benchmark [21]. The methods listed in the rows are classified into three types: model-based methods (MB), supervised learning methods (SL), and self-supervised learning methods (SSL). The notations ‘VFA’ and ‘Basic VSA’ in parentheses for our methods indicate the use of the VFA (Eq. 6) or the basic VSA (Eq. 2) HD kernel for the feature descriptors. It is important to note that the stochastic nature of generating spatial base vectors for the HD kernel affects the evaluation of the VSA-Flow method; therefore, all evaluation metrics for VSA-Flow are statistical outcomes obtained by randomly producing 10 sets of HD kernels, reported as the mean and standard deviation of each metric. Regarding the VSA-SM method, due to its prolonged training duration, Table 1 reports evaluation results based on a single set of randomly generated HD kernels used during training.

The VSA-Flow (VFA) method delivers the best performance among all model-based methods on the DSEC-Flow dataset. In particular, the EPE and 3PE metrics slightly outperform other methods, whereas the 1PE and AE metrics show substantial improvements. Moreover, employing VFA as the HD kernel in VSA-Flow leads to a significant performance improvement over the basic VSA, which is consistent with the observations in Figure 4. In the self-supervised learning group, the proposed VSA-SM (VFA) method demonstrates the best results among all self-supervised learning methods, and the extent of its improvement across the metrics aligns with the evaluation outcomes of VSA-Flow (VFA).

4.4 Results on MVSEC

Table 2 reports the evaluation results on the MVSEC benchmark [64]. Because the deviation of all metrics is small when $d=1024$ (Table 3), for the sake of simplicity, the evaluation results on MVSEC for our methods come from a single set of randomly generated HD kernels. Consistent with [64] and [52], Table 2 compares the primary methods using the same training and testing sequences; learning-based methods trained on other outdoor sequences or datasets are not included in the comparison.

The VSA-Flow method achieves the best results among all methods on the indoor_flying sequences when $dt=4$ and competitive results when $dt=1$. These results indicate that the model-based VSA-Flow method, built on HD feature descriptors, is well suited for large optical flow estimation ($dt=4$) and remains competitive for small optical flow ($dt=1$). In addition, compared to the indoor_flying sequences, the performance of VSA-Flow is less competitive on the outdoor_day sequence. This discrepancy may primarily stem from the fact that, compared to the indoor_flying scenes, the smaller motion in the outdoor_day scene leads to sparser events [65], thereby impacting the representation of the HD feature descriptors.

As mentioned earlier, the training sequences for VSA-SM on MVSEC are extended with time intervals of $dt=0.5,1,2,4,8$ grayscale images. Because VSA-Flow exhibits relatively weaker performance for small optical flow ($dt=1$, Table 2) than for large optical flow ($dt=4$, Table 2), in the training strategy for VSA-SM the optical flow predictions at time intervals $dt=0.5,1,2$ are scaled by factors of 8, 4, and 2, respectively. Self-supervised learning for $dt=0.5,1,2$ is then conducted using the high-dimensional feature descriptors of the event frames for $dt=4$. The evaluation results indicate that the VSA-SM method achieves competitive performance compared to other self-supervised learning methods. Furthermore, it outperforms some semi-supervised learning methods that employ grayscale images for supervision, particularly on certain sequences.

It is noteworthy that many learning-based methods, including VSA-SM, exhibit lower performance on the indoor scenes compared to model-based methods. This discrepancy arises because training for MVSEC is conducted exclusively on the outdoor_day2 sequence, while the indoor and outdoor sequences contain distinct scene information.

Figure 5: Qualitative comparison of our methods with the state-of-the-art E-RAFT architecture on several test sequence partitions of the DSEC dataset [21].

4.5 Qualitative Results on DSEC

Qualitative results of both the VSA-Flow and VSA-SM methods on multiple sequences from the test partition of the DSEC-Flow dataset are shown in Figure 5. Given the unavailability of ground truth for the official testing set, a comparison with the state-of-the-art E-RAFT architecture [21] is performed. Our model-based and self-supervised learning methods achieve high-quality event-based optical flow estimation from events without the need for additional sensory information. Several conclusions can be drawn from these results: (1) Both VSA-Flow and VSA-SM accurately estimate optical flow, particularly in regions containing events; event-masked sparse optical flow estimation appears more accurate than dense flow estimation. (2) The optical flow estimated by VSA-SM appears smoother than that of VSA-Flow. (3) VSA-Flow exhibits inaccuracies in optical flow estimation near image boundaries, whereas the adoption of the full-image warping technique [55] for VSA-SM during self-supervised learning improves its accuracy near image boundaries. (4) Because both methods rely solely on event frames for flow estimation, accuracy diminishes in large areas devoid of events, sometimes resulting in zero flow estimation, a trend consistent with other self-supervised learning methods [22, 47]. (5) As model-based and self-supervised learning approaches relying on event-only local features, our methods predict optical flow less smoothly than supervised learning methods; they also exhibit less sharp optical flow estimation at object edges, displaying a smoother transition.

Table 3: Impact of $d$ and $S$ on the VSA-Flow (VFA) method on the DSEC-Flow dataset [21]. $d$: the hypervector dimension; $S$: the number of scales in the VSA-based HD feature descriptor.
d S EPE 1PE 3PE AE
1024 1 3.40±0.05 70.26±1.17 28.36±0.49 9.93±0.31
1024 2 3.46±0.06 68.94±0.57 28.97±0.40 9.45±0.17
1024 3 3.85±0.06 69.81±0.46 31.35±0.28 9.64±0.10
512 2 3.56±0.07 69.44±1.20 29.55±0.56 9.73±0.32
256 2 3.63±0.12 70.10±1.25 29.97±0.84 10.00±0.43
128 2 3.87±0.28 72.71±2.21 31.75±1.64 11.08±1.02

4.6 Effects of Hypervector Dimension and Multi-scale

Table 3 reports evaluation results for the VSA-Flow method with varying hypervector dimensions ($d$) and numbers of scales ($S$) in the HD feature descriptor. When $d=1024$, VSA-Flow exhibits better EPE and 3PE metrics with $S=1$, while $S=2$ yields better 1PE and AE metrics. Moreover, with $S$ fixed, all metrics improve as $d$ increases, indicating that a larger hypervector dimension leads to better performance. This result is consistent with the understanding that, within VSA, larger hypervector dimensions provide greater information encoding capacity [33, 34].

Figure 6: Effects of the exponential-decay rate of the time surface for VSA-Flow on DSEC and MVSEC (the indoor_flying1 sequence). Because the deviation of all metrics is small when $d=1024$ (Table 3), for the sake of simplicity, each evaluation result for each $\tau_{TS}$ comes from a single set of randomly generated HD kernels.

4.7 Effects of Exponential-decay Rate of Time Surface

The temporal information of the HD feature descriptor is primarily affected by the exponential-decay rate $\tau_{TS}$ of the accumulative Time Surface (TS). Figure 6 illustrates the EPE and 3PE metrics for a single trial of the VSA-Flow method on the DSEC and MVSEC datasets with varying $\tau_{TS}$. Both metrics first decrease and then increase with $\tau_{TS}$. These results indicate that the performance of VSA-Flow's optical flow estimation degrades when $\tau_{TS}$ is either too small or too large, and is optimal within a suitable range of $\tau_{TS}$. This is because a short $\tau_{TS}$ emphasizes only recent events, resulting in a sparse and inadequate TS, while an excessively long $\tau_{TS}$ causes the TS to encompass events over an extended period, leading to a blurred representation. Hence, an appropriate $\tau_{TS}$ is essential. It is worth noting that the optimal range of $\tau_{TS}$ for VSA-Flow differs between DSEC and MVSEC due to variations in the characteristics of the event cameras used. In contrast to DSEC, events in MVSEC are sparser, requiring a larger $\tau_{TS}$; that is, events must be accumulated over a longer period on MVSEC to achieve accurate information encoding in the TS.

5 Conclusions and Discussions

In summary, our work introduces a novel VSA-based feature matching framework for event-based optical flow, applicable to both model-based (VSA-Flow) and self-supervised learning (VSA-SM) methods. The key to our work lies in the effective use of a VSA-based HD feature descriptor for event frames. The proposed methods achieve accurate event-based optical flow estimation within the feature matching methodology without restoring luminance or requiring additional sensor information [64, 22, 11, 13, 58]. This work signifies an important advancement in event-based optical flow within the feature matching methodology, underscored by our compelling and robust results. The proposed framework can have broad applicability, extending to other event-based tasks such as depth estimation and tracking.

Currently, most primary methods for event-based optical flow estimation applicable to both model-based and self-supervised learning are contrast maximization methods [52, 60, 46, 22, 47]. Contrast maximization (CM) methods excel at utilizing temporal information from events but are less adept at exploiting local spatial features. Hence, these methods perform well in estimating optical flow within short time intervals or for small flow magnitudes, but require more complex strategies to achieve satisfactory performance over larger time intervals, such as producing sharp images of warped events (IWE) at multiple reference times through iterative warping [22, 47]. In contrast, our methods, based on feature similarity maximization, excel at utilizing the local spatial features of events but are comparatively weaker at exploiting temporal information. Consequently, our methods demonstrate better performance in optical flow estimation over larger time intervals (Table 2). Our methods achieve competitive performance without complex strategies and circumvent issues such as occlusions and overfitting observed when warping events in CM methods [52]. Future research will focus on enhancing the temporal encoding capability of HD feature descriptors.

Traditionally, feature matching is primarily determined by the differences between two local image patches within the neighborhoods of two feature points, often quantified using metrics such as the sum of absolute differences or the Euclidean distance [36, 40, 63]. This approach is frequently applied in event camera hardware platforms [41]. However, due to the inherent randomness in events, it may not be the most effective way to gauge feature similarity directly from local event frames. Inspired by [14, 50], we utilize the VSA-based HD kernel to extract local features and structured symbolic representations to fuse features from both event polarities and multiple spatial scales. These choices enhance the similarity of flow-matching feature descriptors, as shown in our evaluation results. VSA, also known as Hyperdimensional Computing, is considered an emerging neuromorphic computing model for ultra-efficient edge AI [28, 4, 66]. Presently, our method focuses on dense optical flow estimation. With appropriate adjustments and configurations, it is promising for efficiently and rapidly achieving sparse optical flow estimation in hardware, facilitating the design of event-driven hardware optical flow sensors [10, 25, 41].


Acknowledgements

Data Availability Statement

The DSEC-Flow dataset is available for download from the website at https://dsec.ifi.uzh.ch/dsec-datasets/download/.

The MVSEC dataset is available for download from the website at https://daniilidis-group.github.io/mvsec/download/.


Appendix A Parameter Configurations

Table 4: Parameter Configurations
Parameters VSA-Flow VSA-SM
DSEC MVSEC DSEC MVSEC
Accumulative time surface
$\tau_{TS}$ The exponential-decay rate 35 ms 35 ms 35 ms 35 ms
VSA-based kernel and HD feature descriptor
$d$ The hypervector dimension 1024 1024 1024 1024
$N$ The size of the HD kernel 21 25 21 25
$\sigma_{K}$ Standard deviation of the Gaussian kernel $G$ 1.5 1.5 1.5 1.5
$S$ The number of scales in the descriptor 2 2 2 2
The cost volume module and the optical flow estimator in VSA-Flow
$M$ The neighborhood size of the cost volume 31 31
$\alpha$ A coefficient in the optical flow probability volume 0.85 0.60
$s_{c}$ The kernel size of average pooling 71 71
Loss functions in VSA-SM
$\lambda$ The weight of the smoothness term in the loss function 1.0 1.0
$\alpha$ A coefficient in $\mathcal{L}_{similarity}$ 5.0 5.0
  • Unless explicitly noted.

Table 4 shows the parameter values used in the proposed VSA-Flow and VSA-SM methods.


References

  • Akolkar et al [2020] Akolkar H, Ieng SH, Benosman R (2020) Real-time high speed motion prediction using fast aperture-robust event-driven visual flow. IEEE Transactions on Pattern Analysis and Machine Intelligence 44(1):361–372
  • Almatrafi and Hirakawa [2019] Almatrafi M, Hirakawa K (2019) Davis camera optical flow. IEEE Transactions on Computational Imaging 6:396–407
  • Almatrafi et al [2020] Almatrafi M, Baldwin R, Aizawa K, et al (2020) Distance surface for event-based optical flow. IEEE transactions on pattern analysis and machine intelligence 42(7):1547–1556
  • Amrouch et al [2022] Amrouch H, Imani M, Jiao X, et al (2022) Brain-inspired hyperdimensional computing for ultra-efficient edge ai. In: 2022 International Conference on Hardware/Software Codesign and System Synthesis (CODES+ ISSS), IEEE, pp 25–34
  • Benosman et al [2012] Benosman R, Ieng SH, Clercq C, et al (2012) Asynchronous frameless event-based optical flow. Neural Networks 27:32–37
  • Benosman et al [2013] Benosman R, Clercq C, Lagorce X, et al (2013) Event-based visual flow. IEEE transactions on neural networks and learning systems 25(2):407–417
  • Black and Anandan [1996] Black MJ, Anandan P (1996) The robust estimation of multiple motions: Parametric and piecewise-smooth flow fields. Computer vision and image understanding 63(1):75–104
  • Brebion et al [2022] Brebion V, Moreau J, Davoine F (2022) Real-time optical flow for vehicular perception with low-and high-resolution event cameras. IEEE Transactions on Intelligent Transportation Systems 23(9)
  • Cao et al [2023] Cao YJ, Zhang XS, Luo FY, et al (2023) Learning generalized visual odometry using position-aware optical flow and geometric bundle adjustment. Pattern Recognition 136:109262
  • Chao et al [2013] Chao H, Gu Y, Gross J, et al (2013) A comparative study of optical flow and traditional sensors in uav navigation. In: 2013 American Control Conference, IEEE, pp 3858–3863
  • Deng et al [2021] Deng Y, Chen H, Chen H, et al (2021) Learning from images: A distillation learning framework for event cameras. IEEE Transactions on Image Processing 30:4919–4931
  • Dewulf et al [2023] Dewulf P, De Baets B, Stock M (2023) The hyperdimensional transform for distributional modelling, regression and classification. arXiv preprint arXiv:231108150
  • Ding et al [2022] Ding Z, Zhao R, Zhang J, et al (2022) Spatio-temporal recurrent networks for event-based optical flow estimation. In: Proceedings of the AAAI conference on artificial intelligence, pp 525–533
  • Frady et al [2021] Frady EP, Kleyko D, Kymn CJ, et al (2021) Computing on functions using randomized vector representations. arXiv preprint arXiv:210903429
  • Frady et al [2022] Frady EP, Kleyko D, Kymn CJ, et al (2022) Computing on functions using randomized vector representations (in brief). In: Proceedings of the 2022 Annual Neuro-Inspired Computational Elements Conference, pp 115–122
  • Gallego et al [2018] Gallego G, Rebecq H, Scaramuzza D (2018) A unifying contrast maximization framework for event cameras, with applications to motion, depth, and optical flow estimation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3867–3876
  • Gallego et al [2019] Gallego G, Gehrig M, Scaramuzza D (2019) Focus is all you need: Loss functions for event-based vision. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 12280–12289
  • Gallego et al [2020] Gallego G, Delbrück T, Orchard G, et al (2020) Event-based vision: A survey. IEEE transactions on pattern analysis and machine intelligence 44(1):154–180
  • Ganesan et al [2021] Ganesan A, Gao H, Gandhi S, et al (2021) Learning with holographic reduced representations. Advances in Neural Information Processing Systems 34:25606–25620
  • Gehrig et al [2021a] Gehrig M, Aarents W, Gehrig D, et al (2021a) Dsec: A stereo event camera dataset for driving scenarios. IEEE Robotics and Automation Letters 6(3):4947–4954
  • Gehrig et al [2021b] Gehrig M, Millhäusler M, Gehrig D, et al (2021b) E-raft: Dense optical flow from event cameras. In: 2021 International Conference on 3D Vision (3DV), IEEE, pp 197–206
  • Hagenaars et al [2021] Hagenaars J, Paredes-Vallés F, De Croon G (2021) Self-supervised learning of event-based optical flow with spiking neural networks. Advances in Neural Information Processing Systems 34:7167–7179
  • Hersche et al [2022] Hersche M, Karunaratne G, Cherubini G, et al (2022) Constrained few-shot class-incremental learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 9057–9067
  • Hersche et al [2023] Hersche M, Zeqiri M, Benini L, et al (2023) A neuro-vector-symbolic architecture for solving Raven’s progressive matrices. Nature Machine Intelligence 5(4):363–375
  • Honegger et al [2013] Honegger D, Meier L, Tanskanen P, et al (2013) An open source and open hardware embedded metric optical flow cmos camera for indoor and outdoor applications. In: 2013 IEEE International Conference on Robotics and Automation, IEEE, pp 1736–1741
  • Horn and Schunck [1981] Horn BK, Schunck BG (1981) Determining optical flow. Artificial intelligence 17(1-3):185–203
  • Kanerva [2009] Kanerva P (2009) Hyperdimensional computing: An introduction to computing in distributed representation with high-dimensional random vectors. Cognitive computation 1:139–159
  • Karunaratne et al [2020] Karunaratne G, Le Gallo M, Cherubini G, et al (2020) In-memory hyperdimensional computing. Nature Electronics 3(6):327–337
  • Karunaratne et al [2021] Karunaratne G, Schmuck M, Le Gallo M, et al (2021) Robust high-dimensional memory-augmented neural networks. Nature communications 12(1):2468
  • Karunaratne et al [2022] Karunaratne G, Hersche M, Langeneager J, et al (2022) In-memory realization of in-situ few-shot continual learning with a dynamically evolving explicit memory. In: ESSCIRC 2022-IEEE 48th European Solid State Circuits Conference (ESSCIRC), IEEE, pp 105–108
  • Kempitiya et al [2022] Kempitiya T, De Silva D, Kahawala S, et al (2022) Parameterization of vector symbolic approach for sequence encoding based visual place recognition. In: 2022 International Joint Conference on Neural Networks (IJCNN), IEEE, pp 1–7
  • Kingma and Ba [2014] Kingma DP, Ba J (2014) Adam: A method for stochastic optimization. arXiv preprint arXiv:14126980
  • Kleyko et al [2021] Kleyko D, Rachkovskij DA, Osipov E, et al (2021) A survey on hyperdimensional computing aka vector symbolic architectures, part i: Models and data transformations. ACM Computing Surveys (CSUR)
  • Kleyko et al [2023] Kleyko D, Rachkovskij D, Osipov E, et al (2023) A survey on hyperdimensional computing aka vector symbolic architectures, part ii: Applications, cognitive models, and challenges. ACM Computing Surveys 55(9):1–52
  • Komer [2020] Komer B (2020) Biologically inspired spatial representation
  • Lagorce et al [2016] Lagorce X, Orchard G, Galluppi F, et al (2016) Hots: a hierarchy of event-based time-surfaces for pattern recognition. IEEE transactions on pattern analysis and machine intelligence 39(7):1346–1359
  • Lee et al [2020] Lee C, Kosta AK, Zhu AZ, et al (2020) Spike-flownet: event-based optical flow estimation with energy-efficient hybrid neural networks. In: European Conference on Computer Vision, Springer, pp 366–382
  • Li et al [2023] Li Y, Huang Z, Chen S, et al (2023) Blinkflow: A dataset to push the limits of event-based optical flow estimation. arXiv preprint arXiv:230307716
  • Liu et al [2023] Liu H, Chen G, Qu S, et al (2023) Tma: Temporal motion aggregation for event-based optical flow. arXiv preprint arXiv:230311629
  • Liu and Delbruck [2018] Liu M, Delbruck T (2018) Adaptive time-slice block-matching optical flow algorithm for dynamic vision sensors. In: British Machine Vision Conference (BMVC)
  • Liu and Delbruck [2022] Liu M, Delbruck T (2022) Edflow: Event driven optical flow camera with keypoint detection and adaptive block matching. IEEE Transactions on Circuits and Systems for Video Technology 32(9):5776–5789
  • Lucas and Kanade [1981] Lucas BD, Kanade T (1981) An iterative image registration technique with an application to stereo vision. In: IJCAI’81: 7th international joint conference on Artificial intelligence, pp 674–679
  • Mémin and Pérez [2002] Mémin E, Pérez P (2002) Hierarchical estimation and segmentation of dense motion fields. International Journal of Computer Vision 46:129–155
  • Nagata et al [2021] Nagata J, Sekikawa Y, Aoki Y (2021) Optical flow estimation by matching time surface with event-based cameras. Sensors 21(4):1150
  • Neubert and Schubert [2021] Neubert P, Schubert S (2021) Hyperdimensional computing as a framework for systematic aggregation of image descriptors. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 16938–16947
  • Paredes-Vallés and de Croon [2021] Paredes-Vallés F, de Croon GC (2021) Back to event basics: Self-supervised learning of image reconstruction for event cameras via photometric constancy. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 3446–3455
  • Paredes-Vallés et al [2023] Paredes-Vallés F, Scheper KY, De Wagter C, et al (2023) Taming contrast maximization for learning sequential, low-latency, event-based optical flow. arXiv preprint arXiv:230305214
  • Plate [1992] Plate TA (1992) Holographic recurrent networks. Advances in neural information processing systems 5
  • Plate [1994] Plate TA (1994) Distributed representations and nested compositional structure. Citeseer
  • Renner et al [2022a] Renner A, Supic L, Danielescu A, et al (2022a) Neuromorphic visual odometry with resonator networks. arXiv preprint arXiv:220902000
  • Renner et al [2022b] Renner A, Supic L, Danielescu A, et al (2022b) Neuromorphic visual scene understanding with resonator networks. arXiv preprint arXiv:220812880
  • Shiba et al [2022] Shiba S, Aoki Y, Gallego G (2022) Secrets of event-based optical flow. In: European Conference on Computer Vision, Springer, pp 628–645
  • Stoffregen and Kleeman [2018] Stoffregen T, Kleeman L (2018) Simultaneous optical flow and segmentation (sofas) using dynamic vision sensor. arXiv preprint arXiv:180512326
  • Stoffregen et al [2020] Stoffregen T, Scheerlinck C, Scaramuzza D, et al (2020) Reducing the sim-to-real gap for event cameras. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXVII 16, Springer, pp 534–549
  • Stone et al [2021] Stone A, Maurer D, Ayvaci A, et al (2021) Smurf: Self-teaching multi-frame unsupervised raft with full-image warping. In: Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, pp 3887–3896
  • Sun et al [2018] Sun D, Yang X, Liu MY, et al (2018) Pwc-net: Cnns for optical flow using pyramid, warping, and cost volume. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 8934–8943
  • Teed and Deng [2020] Teed Z, Deng J (2020) Raft: Recurrent all-pairs field transforms for optical flow. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16, Springer, pp 402–419
  • Wan et al [2022] Wan Z, Dai Y, Mao Y (2022) Learning dense and continuous optical flow from an event camera. IEEE Transactions on Image Processing 31:7237–7251
  • Wu et al [2022] Wu Y, Paredes-Vallés F, de Croon GC (2022) Lightweight event-based optical flow estimation via iterative deblurring. arXiv preprint arXiv:221113726
  • Ye et al [2020] Ye C, Mitrokhin A, Fermüller C, et al (2020) Unsupervised learning of dense optical flow, depth and egomotion with event-based sensors. In: 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, pp 5831–5838
  • Ye et al [2023] Ye Y, Shi H, Yang K, et al (2023) Towards anytime optical flow estimation with event cameras. arXiv preprint arXiv:230705033
  • Zhang et al [1988] Zhang W, Tanida J, Itoh K, et al (1988) Shift-invariant pattern recognition neural network and its optical architecture. In: Proceedings of annual conference of the Japan Society of Applied Physics, Montreal, CA
  • Zhou et al [2021] Zhou Y, Gallego G, Shen S (2021) Event-based stereo visual odometry. IEEE Transactions on Robotics 37(5):1433–1450
  • Zhu and Yuan [2018] Zhu AZ, Yuan L (2018) Ev-flownet: Self-supervised optical flow estimation for event-based cameras. In: Robotics: Science and Systems
  • Zhu et al [2019] Zhu AZ, Yuan L, Chaney K, et al (2019) Unsupervised event-based learning of optical flow, depth, and egomotion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 989–997
  • Zou et al [2022] Zou Z, Alimohamadi H, Kim Y, et al (2022) Eventhd: Robust and efficient hyperdimensional learning with neuromorphic sensor. Frontiers in Neuroscience 16:858329