FlightBench: Benchmarking Learning-based Methods for Ego-vision-based Quadrotors Navigation

Shu-Ang Yu¹, Chao Yu^1∗, Feng Gao^1∗, Yi Wu^1,2, Yu Wang^1†
¹ Tsinghua University, ² Shanghai Qi Zhi Institute
{yuchao,yu-wang}@mail.tsinghua.edu.cn Equal contributionCorresponding Authors

Abstract

Ego-vision-based navigation in cluttered environments is crucial for mobile systems, particularly agile quadrotors. While learning-based methods have shown promise recently, head-to-head comparisons with cutting-edge optimization-based approaches are scarce, leaving open the question of where and to what extent they truly excel. In this paper, we introduce FlightBench, the first comprehensive benchmark that implements various learning-based methods for ego-vision-based navigation and evaluates them against mainstream optimization-based baselines using a broad set of performance metrics. Additionally, we develop a suite of criteria to assess scenario difficulty and design test cases that span different levels of difficulty based on these criteria. Our results show that while learning-based methods excel in high-speed flight and faster inference, they struggle with challenging scenarios like sharp corners or view occlusion. Analytical experiments validate the correlation between our difficulty criteria and flight performance. We hope this benchmark and these criteria will drive future advancements in learning-based navigation for ego-vision quadrotors. The source code and documentation is available at https://github.com/thu-uav/FlightBench.

1 Introduction

Ego-vision-based navigation in cluttered environments is a fundamental capability for mobile systems and has been widely investigated (Anwar & Raychowdhury, 2018; Chi et al., 2018; Wang et al., 2020; Xiao et al., 2021; Stachowicz et al., 2023). It involves navigating a robot to a goal position without colliding with any obstacles in its environment, using equipped ego-vision cameras (Agarwal et al., 2023). Quadrotors, known for their agility and dynamism (Verbeke & Schutter, 2018; Loquercio et al., 2021), present unique challenges in achieving fast and safe flight. Traditionally, hierarchical methods address this problem by decoupling it into subtasks such as mapping, planning, and control (Xiao et al., 2024), optimizing the trajectory to avoid collisions. In contrast, recent works (Loquercio et al., 2021; Song et al., 2023; Kaufmann et al., 2023) have demonstrated that learning-based methods can unleash the full dynamic potential of agile platforms. These methods employ neural networks to generate a sequence of waypoints (Loquercio et al., 2021) or motion commands (Song et al., 2023; Kaufmann et al., 2023), utilizing state estimation and ego-vision input. Unlike the high computational costs associated with sequentially executed subtasks, this end-to-end manner significantly reduces processing latency and enhances agility (Loquercio et al., 2021).

Despite the promising results of learning-based navigation methods, the lack of head-to-head comparisons with state-of-the-art optimization-based methods makes it unclear in which areas they truly outperform and to what degree. Traditional methods are often evaluated using customized scenarios and sensor configurations (Ren et al., 2022; Zhou et al., 2019; Song et al., 2023), which complicates reproducibility and hinders fair comparisons. Moreover, the absence of a quantifiable approach for scenario difficulty further obscures the analysis of the strengths and weaknesses of current methods.

In this paper, we introduce FlightBench, a comprehensive benchmark that evaluates methods for ego-vision-based quadrotor navigation. We initially developed a suite of learning-based methods, encompassing both ego-vision and privileged ones for an in-depth comparative analysis. Additionally, we incorporated several representative optimization-based benchmarks for further head-to-head evaluation. Moreover, we established three criteria to measure the difficulty of scenarios, thereby creating a diverse array of tests that span a spectrum of difficulties. Finally, we compared these methods across a wide range of performance metrics to gain a deeper understanding of their specific attributes. Our experiments indicate that learning-based methods demonstrate superior performance in high-speed flight scenarios and generally offer quicker inference times. However, they encounter difficulties in handling complex situations such as sharp turns or occluded views. In contrast, our findings show that traditional optimization-based methods maintain a competitive edge. They perform well in challenging conditions, not just in success rate and flight quality, but also in computation time, particularly when they are meticulously designed. Our analytical experiments validate the effectiveness of the proposed criteria and emphasize the importance of latency randomization for learning-based methods. In summary, our key contributions include:

1.

The development of FlightBench, the first unified open-source benchmark that facilitates the head-to-head comparison of learning-based and optimization-based methods on ego-vision-based quadrotor navigation under various 3D scenarios.
2.

The proposition of tailored task difficulty and performance metrics, aiming to enable a thorough evaluation and in-depth analysis of specific attributes for different methods.
3.

Detailed experimental analyses that demonstrate the comparative strengths and weaknesses of learning-based versus optimization-based methods, particularly in difficult scenarios.

2 Related Work

Planning methods for navigation.

Classical navigation algorithms typically use search or sampling to explore the configuration or state space and generate a free path (Kamon & Rivlin, 1997; Rajko & LaValle, 2001; Penicka & Scaramuzza, 2022). With optimization, a multi-objective optimization problem is often formulated to determine the optimal trajectory (Paull et al., 2012; Ye et al., 2022). This is commonly done using gradients from local maps, such as the Artificial Potential Field (APF) (Zhu et al., 2006; Sfeir et al., 2011) and the Euclidean Signed Distance Field (ESDF) (Zhou et al., 2019). On the other side, the development of deep learning enables algorithms to perform navigating directly from sensory inputs such as images or lidar (Xiao et al., 2024). The policies are trained by imitating expert demonstrations (Schilling et al., 2019) or through exploration under specific rewards (Liu et al., 2023). Learning-based algorithms have been applied to various mobile systems, such as quadrupedal robots (Agarwal et al., 2023), wheeled vehicles (Chaplot et al., 2020; Stachowicz et al., 2023), and quadrotors (Kaufmann et al., 2023; Xing et al., 2024). In this work, we examine representative methods for ego-vision-based navigation on quadrotors, including two learning-based approaches and three optimization-based methods, providing a comprehensive comparison between these categories.

Table 1: A comparison of FlightBench to other open-source benchmarks for navigation.

Benchmark	3D Scenarios	Classical Methods	Learning Methods	Sensory Input
MRBP 1.0 (Wen et al., 2021)	✗	✓	✗	LiDAR
Bench-MR (Heiden et al., 2021)	✗	✓	✗	-
PathBench (Toma et al., 2021)	✓	✓	✓	-
Gibson Bench (Xia et al., 2020)	✗	✗	✓	Vision
OMPLBench (Moll et al., 2015)	✓	✓	✗	-
RLNav (Xu et al., 2023)	✗	✓	✓	LiDAR
Plannie (Rocha & Vivaldini, 2022)	✓	✓	✓	-
FlightBench (Ours)	✓	✓	✓	Vision

Benchmarks for navigation.

Several benchmarks exist for non-sensory-input navigation algorithms, such as OMPL (Moll et al., 2015), Bench-MR (Heiden et al., 2021), PathBench (Toma et al., 2021), and Plannie (Rocha & Vivaldini, 2022). OMPL and Bench-MR primarily focus on sampling-based methods, while PathBench evaluates graph-based and learning-based methods. Plannie offers sampling-based, heuristic, and learning-based methods for quadrotors. However, these algorithms do not utilize input from onboard sensors as well. For methods with sensory inputs, most benchmarks are primarily designed for 2D scenarios. MRBP1.0 (Wen et al., 2021) and RLNav (Xu et al., 2023) evaluate planning methods using laser-scanning data for navigation around columns and cubes. GibsonBench (Xia et al., 2020) features a mobile agent equipped with a camera, navigating in interactive environments. As outlined in Tab. 1, there’s a notable lack of a benchmark with 3D scenarios and ego-vision inputs to assess and compare both classical and learning-based navigation algorithms, a gap that FlightBench aims to fill.

3 FlightBench

Refer to caption — Figure 1: An overview of the FlightBench. FlightBench consists of three main components: (1) Tasks, featuring three scenarios categorized into eight difficulty levels. (2) Baselines, the core benchmarking platform supporting five ego-vision-based methods and two privileged methods. (3) Evaluation Metrics, offering a thorough suite of performance assessment metrics.

In this section, we detail the components of FlightBench. An overview of our benchmark is depicted in Fig. 1. In FlightBench, to design a set of Tasks with distinguishable characteristics for assessment, we propose three criteria, named task difficulty metrics, and develop eight tests across three scenarios based on these metrics. We integrate various representative Baselines to examine the strengths and features of both learning-based and optimization-based navigation methods. Furthermore, we establish a comprehensive set of performance Evaluation Metrics to facilitate quantitative comparisons. The next subsections will provide an in-depth look at the Tasks, Baselines, and Evaluation Metrics.

3.1 Tasks

3.1.1 Task Difficulty Metrics

Each task difficulty metric quantifies the challenge of a test configuration from a specific perspective. In quadrotor navigation, a test is configured by the obstacles-laden scenario, start point, and end point. We utilize the topological guide path $\mathcal{T}$ , which comprises interconnected individual waypoints (Penicka & Scaramuzza, 2022), to establish the quantitative assessment. In FlightBench, we propose three main task difficulty metrics: Traversability Obstruction (TO), View Occlusion (VO), and Angle Over Length (AOL).

Traversability Obstruction.

Traversability Obstruction (TO) measures the flight difficulty due to limited traversable space caused by obstacles. We use a sampling-based approach (Ren et al., 2022) to construct sphere-shaped flight corridors $\{B_{0},\dots,B_{N_{T}}\}$ , where $N_{T}$ is the number of spheres representing traversable space along path $\mathcal{T}$ . Fig. 2(a) illustrates the primary notations for computing these corridors. The next sampling center $\mathbf{p}_{\text{hi}}$ is chosen from existing spheres $\{B_{0},\dots,B_{i-1}\}$ along $\mathcal{T}$ . We sample $K$ candidate centers from a 3D Gaussian distribution $\mathcal{D}$ around $\mathbf{p}_{\text{hi}}$ . Each candidate sphere $B_{\text{cand}}$ is defined by its center $\mathbf{p}_{\text{cand}}$ and radius $r_{\text{cand}}=||\mathbf{p_{\text{cand}}}-\mathbf{n_{\text{cand}}}||_{2}-r_{d}$ , where $\mathbf{n_{\text{cand}}}$ is the nearest obstacle and $r_{d}$ is the drone radius. For each $B_{\text{cand}}$ , we compute $S_{\text{cand}}$ :

S_{\text{cand}}=k_{1}V_{\text{cand}}+k_{2}V_{\text{inter}}-k_{3}(\mathbf{d}\cdot\mathbf{z})-k_{4}||\mathbf{d}-(\mathbf{d}\cdot\mathbf{z})\mathbf{z}||_{2}

(1)

where $k_{1},k_{2},k_{3},k_{4}\in\mathbb{R}^{+}$ , $V_{\text{cand}}$ is the volume of $B_{\text{cand}}$ , $V_{\text{inter}}$ is the overlap with $B_{i-1}$ , $\mathbf{d}=\mathbf{p}_{\text{cand}}-\mathbf{p}_{\text{hi}}$ , and $\mathbf{z}$ is the unit vector along $\mathbf{p}_{\text{hi}}-\mathbf{p}_{i}$ . The sphere with the highest $S_{\text{cand}}$ is selected as the next sphere. This process repeats until path $\mathcal{T}$ is fully covered. Occlusion challenges mainly occur in narrow spaces, so sphere radii $\{r_{1},\dots,r_{N_{T}}\}$ are sorted in ascending order. The traversability obstruction metric $\mathbb{T}$ is defined in Eq. 2, where $R$ represents the sensing range.

\mathbb{T}=\frac{1}{N_{T}}\sum_{i=1}^{\lfloor N_{T}/2\rfloor}\frac{R}{r_{i}}.

(2)

View Occlusion.

In ego-vision-based navigation tasks, a narrow field of view (FOV) can limit the drone’s perception (Tordesillas & How, 2022; Chen et al., 2024), posing a challenge to the perception capabilities of various methods (Gao et al., 2023). We use the term view occlusion (VO) to describe the extent to which obstacles block the FOV in a given scenario. The more obstructed the view, the higher the view occlusion. As shown in Fig. 2(b), we sample drone position $\left\{\mathbf{q_{i}}\right\}$ and FOV unit vector $\left\{\mathbf{v_{i}}\right\}$ along $\mathcal{T}$ with $i\in\left\{1,\ \cdots,\ N_{V}\right\}$ . For each sampled pair $\left\{\mathbf{q_{i}},\mathbf{v_{i}}\right\}$ , we divide FOV into $M$ parts and calculate the distance $s_{ij}$ between the nearest obstacle point and drone position $q_{i}$ in each part $j$ . The view occlusion $\mathbb{V}$ can be represented as Eq. 3, where $m_{j}$ is a series of weights, which gives higher weight to obstacles closer to the center of the view.

\mathbb{V}=\frac{1}{N_{V}}\sum_{i=1}^{N_{V}}\sum_{j=1}^{M}m_{j}\frac{R}{s_{ij}}.

(3)

Angle Over Length.

For a given scenario, frequent and violent turns in traversable paths pose challenges for the agility of planning. Inspired by Heiden et al. (2021), we employ the concept of Angle Over Length (AOL) denoted as $\mathbb{A}$ to quantify the sharpness of a path. The AOL $\mathbb{A}$ is defined by Eq. 4, where $N_{AOL}$ signifies the number of angles depicted in Fig. 2(c), $\theta_{i}$ represents the $i$ -th angle within the topological path $\mathcal{T}$ , and $L$ stands for the length of $\mathcal{T}$ .

\mathbb{A}=\frac{1}{L}\sum_{i=1}^{N_{AOL}}\left(\exp{\left(\frac{\theta_{i}}{\pi/6}\right)}-1\right).

(4)

3.1.2 Scenarios and Tests

Table 2: Task difficulty score of each test case.

Scenarios	Test Cases	TO	VO	AOL
Forest	1	0.76	0.30	7.64 $\times$ 10^-4
	2	0.92	0.44	1.62 $\times$ 10^-3
	3	0.90	0.60	5.68 $\times$ 10^-3
Maze	1	1.42	0.51	1.36 $\times$ 10^-3
	2	1.51	1.01	0.010
	3	1.54	1.39	0.61
MW	1	1.81	0.55	0.08
MW	2	1.58	1.13	0.94

As illustrated in Fig. 1, our benchmark incorporates specific tests based on three scenarios: Forest, Maze, and Multi-Waypoint. These scenarios were chosen for their representativeness and frequent use in evaluating the performance of quadrotor navigation methods. (Ren et al., 2022; Kaufmann et al., 2023). Within these scenarios, we developed eight tests, each characterized by varying levels of difficulty. The task difficulty scores for each test are detailed in Tab. 2.

The Forest scenario serves as the most common benchmark for quadrotor navigation. We differentiate task difficulty based on obstacle density and establish three tests, following the settings used by Loquercio et al. (2021). In the Forest scenario, TO and VO metrics increase with higher tree density. AOL is particularly low due to sparsely spanned obstacles, making this scenario suitable for high-speed flights (Loquercio et al., 2021; Ren et al., 2022).

The Maze scenario consists of walls and boxes, creating consecutive sharp turns and narrow gaps. Quadrotors must navigate these confined spaces while maintaining flight stability and perception awareness (Park et al., 2023). We devise three tests with varying lengths and turn complexities for Maze, resulting in discriminating difficulty levels for VO and AOL.

The Multi-Waypoint (MW) scenario involves flying through multiple waypoints at different heights sequentially (Song et al., 2021b). This scenario also includes boxes and walls as obstacles. We have created two tests with different waypoint configurations. The MW scenario is relatively challenging, featuring the highest TO in test 1 and the highest AOL in test 2.

3.2 Baselines

Here, we introduce the representative methods for ego-vision-based navigation evaluated in FlightBench, covering two learning-based methods and three optimization-based methods, as well as two privileged methods that leverage access to environmental information. The characteristics of each method are detailed in Tab. 3. For links to the open-source code, key parameters, and implementation details, please refer to Appendix A.

Table 3: Characteristics of the navigation methods for quadrotors. "RL" denotes reinforcement learning. "IL" represents imitation learning. "GM" and "EM" refer to Grid Mapping and ESDF Mapping, respectively. The control level indicates the part of the control stack used by the baseline.

	Method Type	Priv.	Decision	Mapping	Planning	Control Stack
	Method Type	Info.	Horizon	Mapping	Planning	Traj.	Waypoint	Motion Cmd
SBMT	Samp.-based	✓	Global	Planning Module				MPC
LMT	RL	✓	Local	Policy Network
Fast-Planner	Opti.-based	✗	Global	GM+EM	Planning Module		MPC
EGO-Planner	Opti.-based	✗	Local	GM	Planning Module		MPC
TGK-Planner	Opti.-based	✗	Global	GM	Planning Module		MPC
Agile	IL	✗	Local	Policy Network				MPC
LPA	RL+IL	✗	Local	Policy Network

Learning-based Methods.

Utilizing techniques such as imitation learning (IL) and reinforcement learning (RL), learning-based methods train neural networks for end-to-end planning, bypassing the time-consuming mapping process. Agile (Loquercio et al., 2021) employs DAgger (Ross et al., 2011) to imitate an expert generating collision-free trajectories using Metropolis-Hastings sampling and outputs mid-level waypoints. LPA (Song et al., 2023) combines IL and RL, starting with training a teacher policy using LMT (Penicka et al., 2022), then distilling this expertise into a ego-vision-based student. Both teacher and student policies generate executable motion commands, specifically collective thrust and body rates (CTBR).

Optimization-based Methods.

Optimization-based methods typically include an online mapping module followed by planning and control modules, often referred to as planners. The three optimization-based methods described below generate B-spline trajectories (see Tab. 3). The control stack samples mid-level waypoints from the trajectory function and uses a low-level Model Predictive Control (MPC) controller to convert these waypoints into motion commands (Fig. 3). Among these baselines, Fast-Planner (Zhou et al., 2019) constructs both occupancy grid and Euclidean Signed Distance Field (ESDF) maps, whereas TGK-Planner (Ye et al., 2020) and EGO-Planner (Zhou et al., 2020) only require an occupancy grid map. In the planning stage, EGO-Planner (Zhou et al., 2020) focuses on trajectory sections with new obstacles, acting as a local planner, while the other two use a global search front-end and an optimization back-end for long-horizon planning.

3. We follow Song et al. (2023) to reproduce the work with out open source code and the key hyperparameters are listed in Table 8

Privileged Methods.

SBMT (Penicka & Scaramuzza, 2022) is a sampling-based method that uses global ESDF maps to generate a collision-free trajectory of dense points, which is then tracked by an MPC controller (Falanga et al., 2018). Its follow-up, LMT (Penicka et al., 2022), employs RL to train an end-to-end policy for minimum-time flight. This policy uses the quadrotor’s full states and the next collision-free point (CFP) to produce motion commands, with the CFP determined by finding the farthest collision-free point on a reference trajectory (Penicka & Scaramuzza, 2022).

3.3 Evaluation Metrics

We represent quadrotor states as a tuple $\left(\mathbf{x}(t),\mathbf{v}(t),\mathbf{a}(t),\mathbf{j}(t)\right)$ , where $t$ denotes time, $\mathbf{x}(t)$ denotes the position, and $\mathbf{v}(t)=\dot{\mathbf{x}}(t)$ , $\mathbf{a}(t)=\dot{\mathbf{v}}(t)$ , $\mathbf{j}(t)=\dot{\mathbf{a}}(t)$ are the velocity, acceleration, and jerk in the world frame, respectively. $T$ denotes the time taken to fly from the starting point to the end point.

First, we integrate three widely used metrics (Ren et al., 2022; Loquercio et al., 2021) into FlightBench. Success rate measures if the quadrotor reaches the goal within a 1.5 m radius without crashing. Average speed, defined as $\frac{1}{T}\int_{0}^{T}||\mathbf{v}(t)||_{2}\,\mathrm{d}t$ , reflects the achieved agility. Computation time evaluates real-time performance as the sum of processing times for mapping, planning, and control. Additionally, we introduce average acceleration and average jerk (Zhou et al., 2019; 2020), defined as $\bar{\mathbf{a}}=\frac{1}{L}\int_{0}^{T}||\mathbf{a}(t)||_{2}^{2}\,\mathrm{d}t$ and $\bar{\mathbf{j}}=\frac{1}{L}\int_{0}^{T}||\mathbf{j}(t)||_{2}^{2}\,\mathrm{d}t$ , respectively. Average acceleration indicates energy consumption, while average jerk measures flight smoothness (Wang et al., 2022). These metrics, though not commonly used in evaluations, are crucial for assessing practicability and safety in real-world applications. However, average acceleration and jerk only capture the dynamic characteristics of a flight. For instance, higher flight speeds along the same trajectory result in greater average acceleration and jerk. To assess the static quality of trajectories, we propose average curvature, inspired by Heiden et al. (2021). Curvature is calculated as $\kappa(t)=\frac{|\mathbf{v}(t)\times\mathbf{a}(t)|}{|\mathbf{v}(t)|^{3}}$ , and average curvature is defined as $\bar{\kappa}=\frac{1}{L}\int_{0}^{T}\kappa(t)\mathbf{v}(t)\,\mathrm{d}t$ .

Together, these six metrics provide a comprehensive comparison of learning-based algorithms against optimization-based methods for ego-vision-based quadrotor navigation. Our extensive experiments will demonstrate that while learning-based methods excel in certain metrics, they also have shortcomings in others.

4 Experiments

Upon FlightBench, which provides a variety of metrics and scenarios to evaluate the performance of ego-vision-based navigation methods for quadrotors, we carried out extensive experiments to study the following research questions:

•

What are the main advantages and limitations of learning-based methods compared to optimization-based methods?
•

How does the navigation performance vary across different scenario settings?
•

How much does the system latency introduced by the practical evaluation environment affect performance?

4.1 Setup

For simulating quadrotors, we use Flightmare (Song et al., 2021a) with Gazebo (Koenig & Howard, 2004) as its dynamic engine. To mimic real-world conditions, we develop a simulated quadrotor model equipped with an IMU sensor and a depth camera, calibrated with real flight data. The physical characteristics of the quadrotor and sensors are detailed in Appendix B. To simulate real-world communication delays, all data transmission in the simulation uses ROS (Quigley et al., 2009). All simulations are conducted on a desktop PC with an Intel Core i9-11900K processor and an Nvidia 3090 GPU. To evaluate real-time reaction performance on embedded platforms with limited computational resources, we also measure computation time on the Nvidia Jetson Orin NX module. Each evaluation metric is averaged over ten independent runs.

4.2 Benchmarking Flight Performance

Table 4: Performance evaluation of different methods under tests with the highest AOL within each scenario. The highest performing values for each metric are highlighted in bold, with the second highest underlined.

Scen.	Metric	Privileged		Optimization-based			Learning-based
Scen.	Metric	SBMT	LMT	TGK	Fast	EGO	Agile	LPA
Forest	Success Rate $\uparrow$	0.80	1.00	0.90	0.90	1.00	0.90	1.00
	Avg. Spd. (ms^-1) $\uparrow$	15.25	11.84	2.30	2.47	2.49	3.058	8.96
	Avg. Curv. (m^-1) $\downarrow$	0.06	0.07	0.08	0.06	0.08	0.37	0.08
	Avg. Acc. (ms^-3) $\downarrow$	28.39	10.29	0.25	0.19	0.83	4.93	9.96
	Avg. Jerk (ms^-5) $\downarrow$	4.27 $\times$ 10³	8.14 $\times$ 10³	1.03	3.97	58.39	937.02	1.14 $\times$ 10⁴
Maze	Success Rate $\uparrow$	0.60	0.9	0.50	0.60	0.20	0.50	0.30
	Avg. Spd. (ms^-1) $\uparrow$	8.73	9.62	1.85	1.99	2.19	3.00	8.35
	Avg. Curv. (m^-1) $\downarrow$	0.31	0.13	0.17	0.23	0.33	0.68	0.21
	Avg. Acc. (ms^-3) $\downarrow$	60.73	26.26	0.50	0.79	1.91	15.45	37.30
	Avg. Jerk (ms^-5) $\downarrow$	6.60 $\times$ 10³	4.64 $\times$ 10³	6.74	9.62	80.54	2.15 $\times$ 10³	4.64 $\times$ 10³
MW	Success Rate $\uparrow$	0.70	0.90	0.40	0.80	0.50	0.60	0.50
	Avg. Spd. (ms^-1) $\uparrow$	5.59	6.88	1.48	1.73	2.13	3.05	6.72
	Avg. Curv. (m^-1) $\downarrow$	0.47	0.30	0.46	0.32	0.62	0.67	0.26
	Avg. Acc. (ms^-3) $\downarrow$	80.95	31.23	1.07	0.97	5.06	16.86	36.77
	Avg. Jerk (ms^-5) $\downarrow$	9.76 $\times$ 10³	1.66 $\times$ 10⁴	25.52	22.72	155.83	2.07 $\times$ 10³	6.19 $\times$ 10³

4.2.1 Flight Quality

To systematically assess the strengths and weaknesses of various methods on ego-vision navigation, we conduct evaluations across all tests within three distinct scenarios. Tab. 4 displays the results for tests with the highest AOL in each scenario. The comprehensive results for all tests are included in Sec. C.1 due to space limitations. We evaluate these methods using our proposed evaluation metrics, and computation time will be addressed separately in a subsequent discussion. In these experiments, we standardize the expected maximum speed at $3m/s$ for a fair comparison. Exceptions are SBMT, LMT, and LPA, whose flight speeds cannot be manually controlled.

As shown in Tab. 4, the privileged methods, with a global awareness of obstacles, set the upper bound for motion performance in terms of average speed and success rate. In contrast, the success rate of ego-vision methods in the Maze and MW scenarios is generally below $0.6$ , indicating that our benchmark remains challenging for ego-vision methods, especially at the perception level.

Learning-based methods, known for their aggressive maneuvering, tend to fly less smoothly and consume more energy. They also experience more crashes in areas with large corners, as seen in the Maze and MW scenarios. When performing a large-angle turn, an aggressive policy is more likely to cause the quadrotor to lose balance and crash. Optimization-based methods are still competitive or even superior to current learning-based approaches, particularly in terms of minimizing energy costs. By contrasting the more effective Fast-Planner with the more severely impaired TGK-Planner and EGO-Planner, we find that global trajectory smoothing and enhancing the speed of replanning are crucial for improving success rates in complex scenarios. Detailed analyses on the failure cases can be found in the Sec. C.2.