

An Explicit Method for Fast Monocular Depth Recovery in Corridor Environments

Yehao Liu1, Ruoyan Xia1, Xiaosu Xu1, Zijian Wang1, Yiqing Yao1, and Mingze Fan1. 1The authors are with the Key Laboratory of Micro-Inertial Instruments and Advanced Navigation Technology, Ministry of Education, School of Instrument Science and Engineering, Southeast University, Nanjing 210096, China.
Abstract

Monocular cameras are extensively employed in indoor robotics, but the absence of scale information limits their performance in visual odometry, depth estimation, and related applications. Depth estimation is the process of recovering a dense depth map from the corresponding input image. Existing research mostly addresses this problem with deep learning-based approaches, yet their inference speed is slow, leading to poor real-time capability. To tackle this challenge, we propose an explicit method for rapid monocular depth recovery designed for corridor environments and built on the principles of nonlinear optimization. We adopt the virtual camera assumption to make full use of the prior geometric features of the scene, and transform the depth estimation problem into an optimization problem by minimizing geometric residuals. Furthermore, a novel depth-plane construction technique is introduced to categorize spatial points based on their possible depths, facilitating swift depth estimation in enclosed structural scenarios such as corridors. We also present a new corridor dataset, named Corr_EH_z, which contains images of a variety of corridors captured by a UGV-mounted camera. An exhaustive set of experiments in different corridors demonstrates the efficacy of the proposed algorithm.

I INTRODUCTION

Monocular cameras play a pivotal role in indoor robotics [1, 2, 3]; nevertheless, their performance is constrained in certain application fields, such as visual odometry and 3D object detection, due to the absence of scale information [1, 4, 5, 6]. To address this limitation and obtain scale information from images, researchers often employ supplementary auxiliary methods. While RGB-D cameras can obtain relatively high-precision scene depth images, their resolution remains comparatively low, and they are susceptible to deep black objects, translucent materials, specular reflections, and parallax effects, leading to reduced accuracy. Conversely, stereo cameras can calculate pixel depth through triangulation; however, this advantage comes at the expense of increased computational overhead, and their distance estimation is limited by the baseline length.

Depth estimation refers to the process of estimating a dense depth map from the corresponding input image [2]. Monocular depth estimation holds significant research value [5, 3]. When combined with object detection, it can approach the 3D detection capability commonly obtained with LiDAR [7]. Furthermore, through integration with semantic segmentation, the approach can be extended from 2D to 3D, allowing the acquisition of both semantic and depth information for each pixel.

Monocular depth estimation is an ill-posed problem that requires the introduction of sufficient prior information for its resolution. Monocular depth estimation methods can be categorized into structure-from-motion (SfM) based methods [8, 9, 10, 11], hand-crafted feature based methods [12], and deep learning-based methods [13, 14, 15, 16]. Each approach explores different strategies to address the challenge of recovering depth from a single camera input. With the rapid advancement of deep neural networks, monocular depth estimation based on deep learning has attracted considerable interest and demonstrated remarkable accuracy [17, 5]. However, the impressive performance of deep learning methods relies on thorough training on extensive datasets, and the accuracy heavily hinges on the quality of precisely annotated data; acquiring high-quality data for depth/parallax reconstruction involves substantial time and labor costs [5]. Furthermore, deep learning-based methods exhibit limited generalization capacity in depth estimation due to the influence of image size and scene characteristics present in the training data [4]. Additionally, the majority of deep learning approaches demonstrate slow inference speeds and insufficient real-time performance.

SfM-based methods perform 3D scene reconstruction using multiple image sequences from different perspectives. They extract feature points from the images for feature matching, estimate camera motion and 3D positions of pixels, and construct sparse depth maps by assembling point cloud information of 3D space points[8, 9, 10, 11]. However, SfM-based methods require matching alignment between multiple frames with continuous motion, and their accuracy is highly dependent on the results of inter-frame registration, thereby limiting their application in certain scenarios.

Long corridors/hallways are characteristic challenging scenarios with limited texture features, and the high degree of similarity between frames hinders reliable inter-frame alignment. Thus, SfM-based methods are susceptible to failure and reduced accuracy in such degraded scenes. Nonetheless, long corridors/hallways manifest strongly structured characteristics and encompass abundant geometric information, such as parallel walls on both sides and parallel lines at the junction of the floor and walls. By fully leveraging these structured features, inverse perspective mapping (IPM) [18, 19, 20, 21] can be employed to derive scale information from these characteristics, enabling depth plane construction and subsequent scene depth recovery.

In this context, we propose a novel explicit method for rapid monocular depth estimation. Our approach differs from existing methods in that it eliminates the need for inter-frame matching and avoids extensive training on large datasets, thereby saving model training and transfer costs. By leveraging the virtual camera assumption and minimizing geometric residuals, we transform the depth estimation problem into an optimization task. Furthermore, we introduce a depth plane construction method, which categorizes spatial points based on their possible depths, enabling fast depth estimation in enclosed structural scenes such as long corridors/hallways. Our proposed method achieves state-of-the-art depth estimation accuracy in long corridor/hallway scenarios while significantly accelerating the depth recovery process. Moreover, it achieves real-time monocular depth recovery on mid-to-low-performance processors.

II Related Work

II-A Explicit Method for Depth Estimation

The explicit method for depth estimation refers to an approach in which the entire process of depth estimation, from feature extraction and feature transformation to the output of prediction results, can be explained using mathematical formulas. This method is commonly employed in depth estimation techniques based on SFM. On the other hand, implicit methods achieve the same task through techniques such as convolutional neural networks (CNNs), where the processes of feature extraction, feature space transformation, and depth prediction are encapsulated within an end-to-end deep network model.

The SfM algorithm receives input image sequences taken from different viewing angles and first extracts features such as Harris, SIFT, or SURF from all images. Feature matching is then performed to estimate the 3D coordinates of the features and generate a point cloud that can be converted into a depth map. In 2014, Prakash et al. proposed a sparse depth estimation method based on SfM. Using monocular image sequences from 5 to 8 different perspectives, the method employed a multi-scale FAST detector for feature detection and solved the 3D positions from multi-view geometric relations to obtain a sparse depth map.

In 2016, Ha et al. [8] proposed a structure-from-small-motion recovery method that uses plane sweeping to estimate depth maps. By using Harris corner detection and optical-flow tracking to solve the 3D positions of the feature points, a relatively dense depth map can be obtained, but the algorithm cannot run in real time.

In recent years, researchers have attempted to combine the strengths of explicit and implicit methods. In 2022, Zhong et al. [22] introduced a method that simultaneously performs implicit reconstruction and extracts 3D feature points, whereas most other approaches obtain 3D points explicitly; it replaces manual feature extraction with an implicit description for 3D keypoint detection.

In 2023, Wu et al. [23] demonstrated the equivalence of depth and height in the 2D-to-3D mapping transformation and proposed an explicit height description method applied to deep network models for transforming Bird’s Eye View (BEV) space.

II-B Real-Time Monocular Depth Estimation

The overall development trend of monocular depth estimation has been to push accuracy higher using extremely deep CNNs or complex network architectures, which are computationally expensive for current mobile computing devices with limited memory and computational capability. It is therefore difficult to deploy these networks on small robots that depend on mobile computing devices. In this context, researchers have begun to develop real-time monocular depth estimation methods [2].

In 2018, Poggi et al. [24] stacked a simple encoder and multiple small decoders working in a pyramidal structure, which can quickly infer an accurate depth map on a CPU, even of an embedded system, using a pyramid of features extracted from a single input image. The network was trained in an unsupervised manner, casting depth estimation as an image reconstruction problem. The designed network has only 1.9M parameters and requires 0.12 s to produce a depth map on an i7-6700K CPU, which is close to real-time speed.

In 2019, Wofk et al. [25] developed a lightweight encoder-decoder network for monocular depth estimation, designing a low-latency, high-throughput, high-accuracy depth estimation algorithm running on embedded systems. In addition, a network pruning algorithm is applied to further reduce the number of parameters, which enables real-time depth estimation on embedded platforms with an Nvidia TX2 GPU.

In 2020, Wang et al. [14] designed a highly compact network named DepthNet Nano. DepthNet Nano applies densely connected projection-batchnorm-expansion-projection (PBEP) modules to reduce network architecture and computational complexity while maintaining representational ability.

In 2020, Liu et al. [26] introduced a lightweight model (named MiniNet) trained on monocular video sequences for unsupervised depth estimation. The core part of MiniNet is DepthNet, which iteratively utilizes a recurrent-module-based encoder to extract multi-scale feature maps. The obtained feature maps are passed to the decoder to generate multi-scale disparity maps. MiniNet achieves a real-time speed of about 54 fps with 640x192 images on a single Nvidia 1080Ti GPU.

However, the accuracy of the above methods is inferior to that of state-of-the-art methods. Developing a real-time monocular depth estimation network therefore requires a trade-off between accuracy and efficiency.

II-C Corridor Environments Perception and Localization

The long corridor is a typical degraded scene with a lack of texture features, which brings new challenges to visual perception and localization tasks [27]. It is necessary to understand the characteristics of this scene in order to make full use of prior features and achieve high-precision perception and localization. Since long corridor scenes are unavoidable for mobile robots, some researchers have begun to address this problem in recent years.

In 2021, Padhy et al. [28] introduced a localization method for Unmanned Aerial Vehicles (UAVs) in corridor environments. A Deep Neural Network (DNN) was trained to understand corridor environmental information and predict the position of the UAV as being on the left, center, or right side of the corridor. Depending on the divergence of the UAV with respect to an imaginary central line, known as the central bisector line (CBL) of the corridor, a suitable command is generated to bring the UAV back to the center, allowing it to fly safely in corridors.

In 2023, Ge et al. [29] proposed a visual-feature-assisted localization method for long corridor environments. A novel coarse-to-fine paradigm was presented that uses visual features to assist mobile robot localization in long corridors. Sufficient keyframes are obtained through the camera and a visual image map is created, while the grid map is built by a laser-based SLAM method whose accuracy is low in corridors; the mobile robot then captures images from a proper perspective according to the moving strategy and matches them with the image map to achieve coarse localization.

III Materials and Methods

III-A System Overview

Figure 1: System framework. There are two threads: Thread 1 extracts the set of ground edge lines; Thread 2 completes the depth estimation of the scene.

The proposed method consists of two main threads: the edge extraction thread, whose main function is to extract the set of ground edge lines, and the depth recovery thread, which completes the depth estimation of the scene.

As shown in Figure 1, images acquired from the visual sensor are first input to the feature extraction thread. In this thread, line feature information is extracted from the scene using Hough-transform-based line feature detection. The line features of the scene are then filtered based on the distribution of line-segment angles, leading to the construction of a set of ground edge lines. The edge line set is then sent to the depth recovery thread.

In the depth recovery thread, the edge line set is first projected into the virtual camera imaging space. The current camera-to-virtual-camera pose transformation is estimated by minimizing symmetry geometric residuals. Subsequently, the edge line set is transformed to the virtual camera imaging plane using the computed transformation matrix. Distance geometric residuals are then constructed, and a non-linear optimization process is iteratively performed to estimate the camera's pitch angle and the depths of the edge points. Based on the estimated depths of the edge points, depth planes are constructed and transformed back to the original image plane. Pixel points in the original image plane are classified based on the depth information, and finally depth estimates for the pixel points are obtained through approximate interpolation.

III-B Edge Extraction Thread

Thread 1 primarily engages in the extraction of structured scene features through ground edge detection, thereby furnishing the depth recovery thread with essential prior information about the scene. In this study, it is assumed that the width of the long corridor/hallway remains constant, leading to the representation of ground and side wall edges as two straight lines within the image.

Figure 2: (a) The region of interest. (b) The result of Canny edge detection. (c) The result of the Hough line transform (after NMS). (d) The line definition in the scene.

III-B1 Construction of ROI

In the context of long corridor/hallway scenarios, aside from ground edges, there often exist additional linear features such as door frames and objects, which can interfere with the detection of ground edge lines. Consequently, it becomes necessary to establish a Region of Interest (ROI) within the scene. This selection is based on empirical observations. Due to geometric perspective, ground edge lines typically appear in the lower-to-middle section of the image, converging from the bottom sides towards the center. Relying on this prior knowledge, an ROI with a height of H is designated within the image, encompassing rows from the middle to the bottom of the image, as shown in Figure 2(a).

III-B2 Hough Transform Based Line Feature Extraction

For the extraction process, this study employs a line feature detection algorithm based on the Hough transform. The Hough transform is a methodology designed to extract linear features from images. It leverages the duality between points and lines, mapping the discrete pixel points along a straight line in image space to curves in Hough space through parameter equations. Subsequently, the intersection points of multiple curves in Hough space are mapped back to straight line equations in image space, thereby forming the detected straight lines.

Before performing the Hough transform, the image is first binarized and subjected to Canny edge detection, which expedites the Hough transform, as shown in Figure 2(b). Subsequently, line features are extracted through the Hough transform, as shown in Figure 2(c), and the extracted features undergo linear fitting to yield the set of line features, denoted as $S_{l}$:

S_{l}=\{l_{k}\left({\theta}_{k},{pt}_{k}^{0}\right),\;k=1,2,\ldots,n\} (1)

Where $\theta_{k}$ and ${pt}_{k}^{0}$ are the inclination angle of $l_{k}$ and the pixel coordinates of the starting point of $l_{k}$, respectively, as illustrated in Figure 2(d).
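For illustration, this extraction step of Thread 1 can be sketched in a few lines of Python with OpenCV. The snippet below is only a minimal sketch: it uses the probabilistic Hough transform (cv2.HoughLinesP) as a practical stand-in for the standard Hough transform described above, and the function name extract_line_set and all thresholds are illustrative assumptions rather than the configuration used in this paper.

```python
# Minimal sketch of ROI construction + Canny + Hough line extraction (Thread 1).
# Thresholds and the helper name are illustrative assumptions.
import cv2
import numpy as np

def extract_line_set(img_bgr, roi_height):
    """Return S_l as a list of (theta_k, (u0, v0)) for lines found in the lower ROI."""
    h, w = img_bgr.shape[:2]
    roi = img_bgr[h - roi_height:h, :]                      # lower-to-middle rows of the image
    gray = cv2.cvtColor(roi, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 50, 150)                        # binarized edge map
    segs = cv2.HoughLinesP(edges, rho=1, theta=np.pi / 180, threshold=60,
                           minLineLength=roi_height // 4, maxLineGap=10)
    lines = []
    if segs is None:
        return lines
    for x1, y1, x2, y2 in segs[:, 0]:
        theta = np.arctan2(y2 - y1, x2 - x1)                # inclination angle theta_k
        lines.append((theta, (x1, y1 + h - roi_height)))    # pt_k^0 in full-image coordinates
    return lines
```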

III-B3 Scene-Prior Based Line Feature Extraction

Based on scene priors, a reasonable range for the length and inclination angle of the lines is established and used to filter the lines within the set $S_{l}$. If the image is strictly axis-symmetric, the length $L\left(l_{k}\right)$ of an edge line and its angle $\theta_{k}$ should satisfy the following conditions:

\begin{split}\frac{H}{4}\leq L\left(l_{k}\right)&\leq\frac{1}{2}\sqrt{H^{2}+W^{2}}\\ \arctan{\frac{H}{W}}\leq{\theta}_{k}<\frac{\pi}{2}\;&\vee\;\frac{\pi}{2}<{\theta}_{k}\leq\arctan{\left(-\frac{H}{W}\right)}\end{split} (2)

According to (2), the left and right lines are picked out, and the edge line set is $S_{le}=\{l_{l}\left({\theta}_{l},{pt}_{l}^{0}\right),l_{r}\left({\theta}_{r},{pt}_{r}^{0}\right)\}$, where $l_{l}$ and $l_{r}$ are the left and right lines, respectively.
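As a concrete illustration of this filtering rule, the following Python sketch applies the length and angle bounds of Eq. (2) to a list of detected segments. Which sign of the angle corresponds to the left or right edge depends on the image coordinate convention, and the helper name filter_edge_lines is an assumption of this sketch.

```python
# Sketch of the scene-prior filter of Eq. (2); which angle sign maps to the
# left or right edge depends on the image coordinate convention.
import numpy as np

def filter_edge_lines(segments, H, W):
    """segments: iterable of (x1, y1, x2, y2); returns candidate left/right edge segments."""
    left, right = [], []
    min_len, max_len = H / 4.0, 0.5 * np.hypot(H, W)
    ang_lim = np.arctan2(H, W)                      # arctan(H/W)
    for x1, y1, x2, y2 in segments:
        if not (min_len <= np.hypot(x2 - x1, y2 - y1) <= max_len):
            continue                                # length prior of Eq. (2)
        theta = np.arctan2(y2 - y1, x2 - x1)
        if theta > np.pi / 2:                       # wrap theta into (-pi/2, pi/2]
            theta -= np.pi
        elif theta <= -np.pi / 2:
            theta += np.pi
        if ang_lim <= theta < np.pi / 2:            # one angle sign -> left candidate
            left.append((x1, y1, x2, y2))
        elif -np.pi / 2 < theta <= -ang_lim:        # the other sign -> right candidate
            right.append((x1, y1, x2, y2))
    return left, right
```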

III-C Virtual Camera

III-C1 Virtual Camera Model

The primary function of the virtual camera is to ensure consistency in the imaging process across different scenes; it is currently employed predominantly within deep learning-based computer vision methods. In 2023, BEV-LaneDet [30] introduced the concept of the virtual camera in the context of 3D lane detection on the Bird's Eye View (BEV) plane. Due to variations in camera intrinsic parameters, installation positions, and camera poses, images captured of the same scene may have different dimensions and scaling ratios. The BEV-LaneDet method employs a deep neural network that requires image parameters to be as consistent as possible with those in the training dataset. This necessitates aligning the camera position, pose, and height above the ground during imaging to avoid substantial disparities between the predictive accuracy of the deployed model and that observed during offline training.

The virtual camera is manually defined, with its intrinsic parameters, installation position, and orientation preconfigured. Prior to training and inference with deep neural networks, images are initially projected onto the imaging space of the virtual camera through perspective transformation. This process ensures that the input images to the model exhibit consistency.

In this study, the concept of the virtual camera is introduced to achieve algorithmic consistency across various scenes. The intrinsic coordinate system for both the real and virtual cameras is established as right-down-forward. The coordinate definitions for the two types of cameras and their respective images are depicted in Figure 3. Within this hypothesis, the width of the long corridor remains constant, and the virtual camera is positioned at the center of the scene, equidistant from the left and right walls. Its optical axis is directed straight ahead along the length of the corridor, while maintaining consistent pitch angles, camera intrinsic parameters, and mounting height as the current real camera.

Figure 3: The virtual camera definition and coordinate definition.

To achieve the pose transformation from the real camera to the virtual camera, this study begins by projecting ground edge feature points onto the imaging plane of the virtual camera. Subsequently, a geometric error model is constructed, and the maximum likelihood estimation of the pose transformation is computed through an iterative optimization process.

III-C2 Virtual Camera Pose Estimation

Under the assumption of this study, the virtual camera is positioned at the center of the scene, ensuring that the edge lines maintain their symmetry in the virtual camera image. To fully utilize this structural feature and ensure algorithm consistency across different input images, the original image is first transformed into the imaging space of the virtual camera. After the reference depth planes of the virtual camera are obtained, the depth plane information is reprojected onto the current image. This process ultimately yields the depth distribution of the current image. Therefore, prior to conducting depth estimation, it is necessary to acquire the pose transformation from the current camera to the virtual camera.

Let $P^{W}$ represent a spatial point within the current scene; the pixel coordinate $P^{C}\left(u,v\right)$ of the space point $P^{W}\left(x,y,z\right)$ is calculated as

\left[\begin{array}{c}P^{C}\\ 1\end{array}\right]=\left[\begin{array}{c}u\\ v\\ 1\end{array}\right]=\frac{1}{z}\left[\begin{array}{ccc}f_{x}&0&c_{x}\\ 0&f_{y}&c_{y}\\ 0&0&1\end{array}\right]\left[\begin{array}{c}x\\ y\\ z\end{array}\right]=\frac{1}{z}KP^{W} (3)

Where $K$ is the camera intrinsic parameter matrix, $f_{x}$ and $f_{y}$ are the pixel focal lengths in the x and y directions, respectively, corresponding to the physical focal length $f$ of the camera, and $c_{x}$ and $c_{y}$ represent the pixel offsets of the optical center in the image. In general, the horizontal and vertical pixel sizes of the camera sensor are equal, in which case $f_{x}=f_{y}$. The rotation matrix $R$ and displacement $t$ are used to represent the pose transformation from the current camera to the virtual camera, and the current frame is projected onto the plane of the virtual camera. The projection result is calculated according to (3):

s^{V}P^{C}=K\left(RP^{W}+t\right) (4)

Where $s^{V}$ is the scale in the transformed virtual camera. Under the virtual camera assumption, the disparity in pose between the virtual camera and the real camera arises from the yaw angle $\phi$ and the displacement $\tau$ in the x-direction, in which case $R$ and $t$ are calculated as

R=\left[\begin{array}{ccc}\cos{\phi}&0&\sin{\phi}\\ 0&1&0\\ -\sin{\phi}&0&\cos{\phi}\end{array}\right],\quad t=\left[\begin{array}{c}\tau\\ 0\\ 0\end{array}\right] (5)

As depth has not yet been recovered, the images lack scale information. In order to unify the scales of the two cameras, it is assumed that for every pixel in the original image, the corresponding spatial point lies in front of the camera at a distance of 1 m; that is, for any spatial point $P^{W}\left(x,y,z\right)$, $z\equiv 1$. Therefore, the scale factor $s^{V}$ of the virtual camera is calculated as:

s^{V}=\cos{\phi}-\frac{u-c_{x}}{f}\sin{\phi} (6)

Based on equations (3) to (6), the transformation from a current-frame pixel $pt^{C}(u,v)$ to the virtual-camera pixel $pt^{V}(u^{V},v^{V})$ can be obtained as follows:

\begin{split}\left[\begin{array}{c}u^{V}\\ v^{V}\end{array}\right]&=\xi\left[\begin{array}{cc}\cos\phi&0\\ 0&1\end{array}\right]\left[\begin{array}{c}u\\ v\end{array}\right]\\ &+\left[\begin{array}{c}f_{t}\left(\tau+\sin{\phi}\right)+\left(\xi-\cos{\phi}\right)\\ \left(\xi+1\right)c_{y}\end{array}\right]\end{split} (7)

Where $\xi={1}/\left({f_{t}\cos{\phi}-u\sin{\phi}-c_{x}\cos{\phi}}\right)$ and $f_{t}=f_{x}=f_{y}$. Each pixel of $l_{l}\left({\theta}_{l},pt_{l}^{0}\right)$ and $l_{r}\left({\theta}_{r},pt_{r}^{0}\right)$ along the edges can be transformed into the virtual camera according to (7).

Figure 4: The symmetry geometric error.

Within the assumption, the optical center of the virtual camera coincides with the symmetry axis of the corridor. As a result, the symmetry of the road edge lines is preserved in the virtual camera imaging space. The axis of symmetry is located at $u=\frac{W}{2}$ on the image plane of the virtual camera, where $W$ represents the image width. Based on this structured feature, the geometric residual is calculated. As shown in Figure 4, let $pt_{k}^{lt}=\left(u_{k}^{l},v_{k}^{l}\right)$ be a point on $l_{l}$ and mirror it about $u=\frac{W}{2}$. Substituting $v=v_{k}^{l}$ into the equation of the right line yields $pt_{k}^{lt1}$, and the distance $D$ between the mirrored point and $pt_{k}^{lt1}$ can be calculated. According to (7), $D$ is a function of both the yaw angle $\phi$ and the displacement $\tau$ in the x-direction. By uniformly selecting $N$ feature points along $l_{l}$, the sum of $D$ over these $N$ points yields the symmetry error of the left line, $E_{L}$. Similarly, the symmetry error of the right line, $E_{R}$, can be obtained. Ultimately, this process yields the symmetry geometric error $E_{G}$.

E_{G}=E_{L}+E_{R}=\sum_{k}^{N}{D\left(pt_{k}^{lt},pt_{k}^{lt1}\right)}+\sum_{k}^{N}{D\left(pt_{k}^{rt},pt_{k}^{rt1}\right)} (8)

Once the mathematical model for the symmetry-based geometric error $E_{G}$ is established, a range of variability for the yaw angle $\phi$ and the displacement $\tau$ is defined by leveraging the prior features of the scene. Nonlinear optimization is employed to compute the maximum likelihood estimates $\widehat{\phi}$ and $\widehat{\tau}$ of the yaw angle $\phi$ and the displacement $\tau$. Iterative calculations are performed within the range of variability of both parameters to minimize $E_{G}$, and the yaw angle and displacement yielding the smallest $E_{G}$ are taken as the maximum likelihood estimates.

\widehat{\phi},\widehat{\tau}=\arg\min\left\{\sum_{k}^{N}{D\left(pt_{k}^{lt},pt_{k}^{lt1}\right)}+\sum_{k}^{N}{D\left(pt_{k}^{rt},pt_{k}^{rt1}\right)}\right\} (9)

After the estimates of the yaw angle and the displacement are obtained, the edge points are projected into the virtual camera space through equations (4) and (5), giving $S^{V}=\left(pt_{k}^{Vl},pt_{k}^{Vr}\right),\;k=0,1,2,\ldots,n$.
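The whole pose-estimation step can be illustrated with the self-contained Python sketch below. Instead of the closed form (7), it warps each edge pixel by unprojecting at the unit depth assumed above ($z=1$), applying the $R$ and $t$ of Eq. (5), and reprojecting with $K$, and then minimizes the symmetry residual of Eqs. (8)-(9) by a coarse grid search. The search ranges, step sizes, and the assumption that left and right edge points are sampled at matching rows are simplifications of this sketch, not details taken from the paper.

```python
# Sketch of virtual-camera pose estimation: warp edge pixels (Eqs. (3)-(5) with
# z = 1) and grid-search (phi, tau) to minimize the symmetry error of Eq. (8).
import numpy as np

def warp_to_virtual(pts_uv, K, phi, tau):
    """Warp an Nx2 array of pixels into the virtual camera image."""
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    u, v = pts_uv[:, 0], pts_uv[:, 1]
    # unproject assuming every source point sits at depth z = 1
    P = np.stack([(u - cx) / fx, (v - cy) / fy, np.ones_like(u)], axis=1)
    c, s = np.cos(phi), np.sin(phi)
    R = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])   # yaw-only rotation, Eq. (5)
    t = np.array([tau, 0.0, 0.0])                                # lateral offset, Eq. (5)
    Pv = P @ R.T + t
    return np.stack([fx * Pv[:, 0] / Pv[:, 2] + cx,
                     fy * Pv[:, 1] / Pv[:, 2] + cy], axis=1)

def estimate_pose(left_pts, right_pts, K, img_w):
    """Coarse grid search for (phi_hat, tau_hat) minimizing the symmetry error E_G."""
    best_err, best_phi, best_tau = np.inf, 0.0, 0.0
    axis_u = img_w / 2.0                                         # symmetry axis u = W/2
    for phi in np.arange(-0.314, 0.314, 0.02):                   # yaw range (illustrative)
        for tau in np.arange(-1.0, 1.0, 0.05):                   # lateral range (illustrative)
            lv = warp_to_virtual(left_pts, K, phi, tau)
            rv = warp_to_virtual(right_pts, K, phi, tau)
            # mirror the left edge about u = W/2 and compare with the right edge;
            # left/right points are assumed sampled at matching rows.
            err = np.sum(np.abs((2.0 * axis_u - lv[:, 0]) - rv[:, 0]))
            if err < best_err:
                best_err, best_phi, best_tau = err, phi, tau
    return best_phi, best_tau
```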

III-D Fast Monocular Depth Recovery

III-D1 3D Coordinates Estimation of Ground Edge Points

In the virtual camera space, the pixel coordinates along the two ground edges are now available. Under the ground plane hypothesis, the points along each edge are assumed to be coplanar, and the 3D coordinates of each point satisfy $\forall P^{W}_{k}\left(x_{k},y_{k},z_{k}\right)\in S^{W},\;y_{k}=0$. Given the installation position and the aperture angle of the camera, the 3D spatial coordinates of each edge point in the virtual camera can be solved geometrically, and the depth planes can then be constructed.

Figure 5: The geometry of the $\Delta z$ computation.

The spatial coordinate system is chosen as left-down-forward, the camera height is known to be $h$, and the vertical aperture angle is $\theta_{v}$. Assume that the pitch angle of the virtual camera in the current scene is $\theta_{p}$. For a pixel point $pt_{0}$ at the bottom of the image, its pixel coordinates $\left(u_{t},v_{t}\right)$ satisfy $0\leq u_{t}\leq W$, $v_{t}=H$, where $W$ and $H$ are the image width and height respectively, and its depth is calculated as

z_{0}=\frac{h}{\tan{\left(\theta_{v}+\theta_{p}\right)}} (10)

The inverse perspective mapping (IPM) relation is used to calculate the depth of each edge point, as shown in Figure 5. For any point in the edge point set $S^{V}=\left(pt_{k}^{Vl},pt_{k}^{Vr}\right),\;k=0,1,2,\ldots,n$, with pixel coordinates $\left(u_{k}^{V},v_{k}^{V}\right)$, its depth is calculated according to the inverse perspective transform.

z=z_{0}+\Delta z (11)

From the perspective geometry, the following equation is constructed and solved, as given in (12):

\frac{\eta_{k}\cos{\left(\theta_{k}-\theta_{p}\right)}}{\Delta z\sin{\theta_{p}}}=\frac{z_{1}}{z_{0}} (12)

Where $\eta_{k}=f_{y}\Delta v_{k}$ with $\Delta v_{k}=H-v_{k}^{V}$, and $z_{1}$ is the depth in space at the bottom of the imaging plane 1 m away from the optical center of the camera,

z_{1}=\cos\theta_{v}-\sin\theta_{v}\sin\theta_{p} (13)

Solving (12), $\Delta z$ is calculated as

\Delta z=z_{0}\frac{z_{0}\cos{\theta_{p}}+h\sin{\theta_{p}}}{z_{1}h-\eta_{k}z_{0}\cos{\theta_{p}}}\eta_{k} (14)
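For intuition, the following Python sketch recovers the ground depth of an edge pixel from its image row using the standard flat-floor inverse-perspective relation with camera height $h$ and pitch angle $\theta_{p}$. It is one consistent reading of the geometry in Figure 5 rather than a literal transcription of Eq. (14); with the bottom row $v=H$ and $\theta_{v}$ read as the angle between the optical axis and the ray through the bottom row, it reduces to $z_{0}=h/\tan(\theta_{v}+\theta_{p})$ of Eq. (10).

```python
# Sketch of flat-ground depth recovery for edge pixels (inverse perspective).
# This is a standard flat-floor relation, not a literal copy of Eq. (14).
import numpy as np

def ground_depth(v_rows, fy, cy, h, theta_p):
    """Ground depth for image rows v_rows (camera height h, pitch theta_p, focal fy)."""
    v = np.atleast_1d(np.asarray(v_rows, dtype=float))
    ray = theta_p + np.arctan2(v - cy, fy)    # depression angle of each viewing ray
    z = np.full_like(ray, np.inf)             # rays at/above the horizon never hit the floor
    hit = ray > 1e-6
    z[hit] = h / np.tan(ray[hit])
    return z
```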

III-D2 3D Coordinates Optimization & Pitch Angle Estimation

As shown in (14), the calculation of the edge point depth $\Delta z$ depends on the camera pitch angle $\theta_{p}$. In this paper, $\Delta z$ and $\theta_{p}$ are estimated simultaneously by nonlinear optimization. In the corridor, the road surface is flat and the variation range of the pitch angle is small. An iterative optimization algorithm is therefore designed to compute the geometric residual of the 3D space points over the interval $\left[\lambda_{1},\lambda_{2}\right]$ with step length $\alpha$, and the optimal $\theta_{p}$ is taken as the estimation result.

Given an initial value of $\theta_{p}$, the depth of every pixel in the edge point set $S^{V}$ is recovered according to (14); then, according to the camera imaging model (3), the 3D spatial coordinates of the edge point $pt_{k}^{Vl}$ are calculated as

\left\{\begin{aligned} &z^{W}_{k}=z_{0}+\Delta z_{k}\\ &x_{k}^{W}=z_{k}\left(u_{k}^{V}-c_{x}\right)/f_{t}\\ &y_{k}^{W}=z_{k}\left(v_{k}^{V}-c_{y}\right)/f_{t}\end{aligned}\right. (15)

And the geometric residual is calculated as

\begin{split}E_{G}^{W}&=\sum_{k}^{L}{\left\|P_{k}^{Wl},P_{k}^{Wr}\right\|_{2}}\\ &\Longleftrightarrow\sum_{k}^{L}{\left|x_{k}^{Wl},x_{k}^{Wr}\right|}_{y_{k}^{Wl}=y_{k}^{Wr}}\\ &=\sum_{k}^{L}{\frac{z_{k}}{f_{t}}\left(1+\frac{z_{0}\cos{\theta_{p}}+h\sin{\theta_{p}}}{z_{1}h-\eta_{k}z_{0}\cos{\theta_{p}}}\eta_{k}\right)}\left(u_{k}^{Vl}-u_{k}^{Vr}\right)\end{split} (16)

The approximate optimal estimate of $\theta_{p}$ can be obtained by iterative optimization:

\hat{\theta}_{p}=\arg\min\left\{\sum_{k}^{L}{\frac{z_{k}}{f_{t}}\left(1+\frac{z_{0}\cos{\theta_{p}}+h\sin{\theta_{p}}}{z_{1}h-\eta_{k}z_{0}\cos{\theta_{p}}}\eta_{k}\right)}\left(u_{k}^{Vl}-u_{k}^{Vr}\right)\right\} (17)
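A compact Python sketch of this sweep is given below. It lifts the left and right edge points to 3D with the flat-ground model and scores each candidate $\theta_{p}$ by how constant the recovered corridor width is across rows; using the standard deviation of the per-row width as the residual is one plausible rendering of the geometric residual $E_{G}^{W}$, not necessarily the exact form of Eq. (16). The interval bounds, the step size, and the assumption of row-aligned left/right samples are illustrative.

```python
# Sketch of the pitch-angle sweep of Eq. (17) using a constant-corridor-width
# residual (std of per-row width); bounds/step and the residual form are assumptions.
import numpy as np

def estimate_pitch(left_uv, right_uv, K, h, lam1=-0.10, lam2=0.10, alpha=0.005):
    """left_uv, right_uv: Nx2 edge pixels sampled at matching rows; returns theta_p_hat."""
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]

    def lift(uv, theta_p):
        ray = theta_p + np.arctan2(uv[:, 1] - cy, fy)
        z = h / np.tan(np.clip(ray, 1e-3, None))    # flat-ground depth per row
        x = z * (uv[:, 0] - cx) / fx                # lateral coordinate, as in Eq. (15)
        return x, z

    best_theta, best_res = lam1, np.inf
    for theta_p in np.arange(lam1, lam2, alpha):
        xl, _ = lift(left_uv, theta_p)
        xr, _ = lift(right_uv, theta_p)
        res = np.std(xr - xl)                       # deviation from a constant width
        if res < best_res:
            best_theta, best_res = theta_p, res
    return best_theta
```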

III-D3 Depth Plane Construction & Spatial Point Depth Recovery

In the image space of the virtual camera, spatial points in the same depth plane retain their parallel characteristics, and contour points of the same depth are connected to build a depth plane. The depth plane $\Gamma_{k}^{V}$ is guided by the contour lines and is uniquely determined by the left and right edge points $\left\{pt_{k}^{Vl}\left(u_{k}^{Vl},v_{k}^{Vl}\right),pt_{k}^{Vr}\left(u_{k}^{Vr},v_{k}^{Vr}\right)\right\}$; $Dp\left(\Gamma_{k}^{V}\right)$ denotes the depth of $\Gamma_{k}^{V}$, with $Dp\left(\Gamma_{k}^{V}\right)=z_{k}$.

The depth plane $\Gamma_{k}^{V}$, obtained in the virtual camera imaging space, is inversely transformed according to (4)-(5), and the set of depth planes in the real camera image is obtained: $S_{\Gamma}^{C}=\left\{\Gamma_{k}^{C}\right\},\;k=1,2,3,\ldots,L$. In the real camera image, if a pixel $pt_{k}^{C}\left(u_{k}^{C},v_{k}^{C}\right)$ belongs to the depth plane $\Gamma_{k}^{C}$, then equation (18) is a necessary condition.

\forall pt_{k}^{C}\in\Gamma_{k}^{C}\Rightarrow\left\{\begin{aligned} &f\left(u_{k}^{C}\right)\leq u_{i}^{C}\leq u_{k}^{C}\\ &v_{k}^{Cl}\leq v_{i}^{C}\leq v_{k}^{Cr}\\ &u_{k-1}^{C}\leq u_{i}^{C}\;\vee\;v_{i}^{C}\leq v_{k-1}^{Cl}\;\vee\;v_{i}^{C}\geq v_{k-1}^{Cr}\end{aligned}\right. (18)

Where $f\left(u_{k}^{Cl}\right)=\alpha u_{k}^{Cl}-\beta$ is a linear transform of $u_{k}^{Cl}$, and $\alpha$, $\beta$ are empirical coefficients.

In the corridor scene, the distribution of spatial points is relatively ideal. Ignoring the influence of dynamic objects, occlusion, etc., (18) is taken as a sufficient and necessary condition for $pt_{k}^{C}\in\Gamma_{k}^{C}$, and each pixel in the image is classified accordingly.

After pixel classification is completed, each pixel is checked, based on its coordinates, to determine whether it is a ground point. The condition for $pt_{i}^{V}\left(u_{i}^{V},v_{i}^{V}\right)$ to be a ground point is

pt_{i}^{C}\in\Gamma_{k}^{C}\;\wedge\;u_{k-1}^{C}\leq u_{i}^{C}\leq u_{k}^{C} (19)

For a pixel point $pt_{i}^{C}\left(u_{i}^{C},v_{i}^{C}\right)\in\Gamma_{k}^{C}$, its depth is calculated by the interpolation method (20).

Dp\left(pt_{i}^{C}\right)=Dp\left(\Gamma_{k}^{C}\right)+\sigma\left(Dp\left(\Gamma_{k-1}^{C}\right)-Dp\left(\Gamma_{k}^{C}\right)\right) (20)

In the formula, $\sigma$ is a nonlinear coefficient. When the depth planes are sufficiently dense, linear interpolation is adopted. If the point is a ground point, then

\sigma=\frac{u_{i}^{C}-u_{k-1}^{C}}{u_{k}^{C}-u_{k-1}^{C}} (21)

Otherwise,

\sigma=\left\{\begin{aligned} &\frac{v_{i}^{C}-v_{k-1}^{Cl}}{v_{k}^{Cl}-v_{k-1}^{Cl}}\,,\;v_{i}^{C}\leq v_{k-1}^{Cl}\\ &\frac{v_{i}^{C}-v_{k-1}^{Cl}}{v_{k}^{Cr}-v_{k-1}^{Cr}}\,,\;v_{i}^{C}\geq v_{k-1}^{Cr}\end{aligned}\right. (22)
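The interpolation step can be illustrated with the small Python sketch below, which indexes the depth planes by the image row of their edge points and blends linearly between neighbouring planes; row-indexed planes and purely linear blending correspond to the dense-plane case mentioned above and are simplifying assumptions of this sketch.

```python
# Sketch of per-pixel depth interpolation between depth planes (Eqs. (20)-(21)),
# assuming each plane Gamma_k is indexed by the image row of its edge points.
import numpy as np

def interpolate_depth(v_query, plane_rows, plane_depths):
    """Linearly interpolate Dp at image row(s) v_query from the plane rows/depths."""
    rows = np.asarray(plane_rows, dtype=float)
    depths = np.asarray(plane_depths, dtype=float)
    order = np.argsort(rows)                 # np.interp requires ascending sample points
    return np.interp(v_query, rows[order], depths[order])
```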

IV Results

IV-A Experiment Overview

Figure 6: The Unmanned Ground Vehicle (UGV) and the ZED2 camera.

The algorithm proposed in this paper is validated on data collected in real scenarios. A ZED2 camera mounted on an Unmanned Ground Vehicle (UGV) was used to collect images in different scenes to build the datasets, and the RGB images and 16-bit depth images output by the camera were recorded as ROS bags. The size of the output RGB and depth images is 420x360 for the ZED2 camera, and the camera mounting heights are 0.66 m and 0.62 m, respectively. The experimental equipment is shown in Figure 6. The ZED2 camera was used to collect 135 images of 9 kinds of corridors with lengths in the range [0, 100] m, widths in the range [2, 4] m, and different lighting conditions, from which a corridor dataset named Corr_EH_z was constructed. Corr_EH_z has two parts, Cord_Exx_z and Cord_Hxx_z. The Cord_Exx_z subset contains simple conditions, while Cord_Hxx_z contains complex conditions. Figure 7 shows scene images and the corresponding depth images from the two subsets.

Figure 7: (a) Source image of the Cord_E05_z scene. (b) Depth ground truth of the Cord_E05_z scene. (c) Source image of the Cord_H02_z scene. (d) Depth ground truth of the Cord_H02_z scene.

IV-B Experiment 1: Verification of Algorithm Accuracy

In the two sub-datasets, the yaw (heading) angle search interval is set to $[-0.314, 0.314]$ rad (about $\pm 18^{\circ}$), and the accuracy of the pitch angle estimation and corridor width estimation of the proposed method is tested with a step length of 0.05 rad. The estimation results are shown in Table I. The relative error of the corridor width estimates output by the proposed method is less than 0.0427, and the mean relative error is 0.0221.

TABLE I: Accuracy of corridor width estimation.
Scene Estimated (m) Ground truth (m) Relative error (%)
Cord_E01_z 2.16 2.11 2.3697
Cord_E02_z 2.08 2.13 2.3474
Cord_E03_z 2.20 2.11 4.2654
Cord_E04_z 2.17 2.09 3.8278
Cord_E05_z 1.98 2.02 1.9802
Cord_E06_z 1.88 1.86 1.0753
Cord_H01_z 2.16 2.13 1.4085
Cord_H02_z 3.04 3.09 1.6181
Cord_H03_z 2.95 2.98 1.0018


The depth estimation accuracy of the proposed method is tested on the Cord_Exx_z and Cord_Hxx_z datasets, as shown in Tables II and III. Four general accuracy indexes of the proposed method are calculated under two evaluation conditions (depth ground truth less than 5 m and less than 40 m, respectively), and the four accuracy indexes are defined as [13]:

\begin{split}AbsRel&:\;\frac{1}{N}\sum_{p}^{N}\left|\frac{y_{p}-\hat{y}_{p}}{y_{p}}\right|\\ Log10&:\;\frac{1}{N}\sum_{p}^{N}\left|\log_{10}{y_{p}}-\log_{10}{\hat{y}_{p}}\right|\\ RMSE&:\;\sqrt{\frac{1}{N}\sum_{p}^{N}\left(y_{p}-\hat{y}_{p}\right)^{2}}\\ RMSElog&:\;\sqrt{\frac{1}{N}\sum_{p}^{N}\left(\log_{10}{y_{p}}-\log_{10}{\hat{y}_{p}}\right)^{2}}\end{split}
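For reference, these four indexes can be computed over the valid pixels of a predicted depth map with the short NumPy sketch below; the depth cap and the helper name depth_metrics are assumptions of the sketch.

```python
# Sketch of the four accuracy indexes (AbsRel, Log10, RMSE, RMSElog).
import numpy as np

def depth_metrics(gt, pred, max_depth=5.0):
    """gt, pred: depth maps in meters; metrics are computed over valid pixels only."""
    mask = (gt > 0) & (pred > 0) & (gt <= max_depth)
    y, y_hat = gt[mask], pred[mask]
    abs_rel = np.mean(np.abs(y - y_hat) / y)
    log10 = np.mean(np.abs(np.log10(y) - np.log10(y_hat)))
    rmse = np.sqrt(np.mean((y - y_hat) ** 2))
    rmse_log = np.sqrt(np.mean((np.log10(y) - np.log10(y_hat)) ** 2))
    return abs_rel, log10, rmse, rmse_log
```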
TABLE II: Accuracy of depth estimation while depth is less than 5m.
Scene AbsRel Log10 RMSE RMSElog
Cord_E01_z 0.06338 0.02908 0.30892 0.05119
Cord_E02_z 0.06094 0.02630 0.26311 0.04062
Cord_E03_z 0.05813 0.02522 0.27610 0.04183
Cord_E04_z 0.07106 0.03279 0.32257 0.05431
Cord_E05_z 0.06011 0.02518 0.35607 0.04817
Cord_E06_z 0.07028 0.03156 0.32098 0.05422
Cord_H01_z 0.08588 0.04104 0.45468 0.07382
Cord_H02_z 0.11233 0.05968 0.65776 0.11828
Cord_H03_z 0.10867 0.04709 0.46969 0.06848


TABLE III: Accuracy of depth estimation while depth is less than 40m.
Scene AbsRel Log10 RMSE RMSElog
Cord_E01_z 0.09740 0.05185 1.26336 0.11411
Cord_E02_z 0.08256 0.04079 1.04763 0.09274
Cord_E03_z 0.09026 0.01348 1.29909 0.11610
Cord_E04_z 0.10603 0.05551 1.89761 0.12333
Cord_E05_z 0.08386 0.04342 1.28335 0.10847
Cord_E06_z 0.09879 0.05225 1.25565 0.11414
Cord_H01_z 0.12416 0.07247 1.61772 0.15947
Cord_H02_z 0.16391 0.10121 1.57186 0.19957
Cord_H03_z 0.12647 0.06797 1.43041 0.14808


As can be seen from Table II, under the condition that the depth ground truth is less than 5 m, the depth estimates output by this method on the Cord_Exx_z dataset reach an advanced level of accuracy, with AbsRel $<7.106\%$ and RMSE $<0.35607$. At the same time, the proposed method remains predictive over a depth range of 40 m, as shown in Table III. In the Cord_E02_z scenario, the AbsRel of the proposed method over the 40 m depth range reaches $8.256\%$. Figure 8 shows the depth recovery results of the proposed method in several scenarios. It should be noted that the ceiling region is excluded to save computing resources.

IV-C Experiment 2: Comparison with the State-of-the-art Method

On the Cord_Exx_z dataset, an accuracy comparison was conducted between the proposed method and the deep learning-based ADABINS [15] method. ADABINS performs depth estimation by adaptive-bin classification using an encoder-decoder network combined with a ViT-based module. The method was published at CVPR 2021 and currently ranks 14th on the KITTI [31] benchmark and 25th on the NYU [32] benchmark, which is at the advanced level among existing methods.

First, the ADABINS method was trained for 25 epochs on the Cord_Exx_z sub-dataset, and then the depth prediction accuracy of the proposed method and of ADABINS was compared, as shown in Table IV.

TABLE IV: Precision comparison experiment with Adabins.
Method AbsRel Log10 RMSE RMSElog
ADABINS1 0.079 0.033 0.299 0.101
Ours1 0.083 0.036 0.363 0.053
ADABINS2 - - - -
Ours2 0.098 0.054 1.425 0.121

1 Depth range 0-5m.
2 Depth range 0-40m.

The AbsRel, Log10, and RMSE indexes of the proposed method are similar to those of ADABINS under the condition that the depth ground truth is less than 5 m, and its RMSElog of 0.053 is better than that of ADABINS (0.101). Under the condition that the depth ground truth is less than 40 m, the AbsRel and RMSE of the proposed method are $9.8\%$ and 1.425 respectively, while the effective depth prediction range of ADABINS for indoor depth estimation is only 10 m.

The speed of inferring/recovering a single image is compared between ADABINS and the proposed method on different computing platforms. Computing platform 1 is an AMD Threadripper PRO 5975W 32-core high-performance processor with a single-core frequency of 3.6 GHz, and its GPU is an Nvidia RTX 4090 graphics processing unit, which is used to assist deep neural network inference. Computing platform 2 is a medium-to-low performance Intel Core i5-7300HQ CPU with a base frequency of 2.5 GHz. Repeated experiments were used to calculate the average execution time of the two methods, and the results are shown in Table V.

TABLE V: Running time comparison experiment with Adabins.
Method Average running time (s)
ADABINS1 0.0507
Ours1 0.0097
ADABINS2 -
Ours2 0.0482

1 On platform 1.
2 On platform 2.

The average execution time of the proposed method for depth recovery of a single image on computing platform 1 is 0.0097 s, which is less than 1/5 of the average execution time of the ADABINS method under GPU acceleration, and the average execution time of the proposed method on platform 2 is 0.048 s. The results show that the proposed method can achieve a real-time processing speed of about 20 FPS on medium-to-low performance processors.

V Discussion

In this paper, an explicit method for fast monocular depth recovery in long corridor scenes is proposed.

By extracting key information from long corridor scenes, the method optimizes the pitch angle estimate and the depths of the scene edge points by minimizing geometric residuals, classifies spatial points by constructing depth planes, and thereby transforms the monocular depth estimation problem into a solvable optimization problem, achieving fast monocular depth estimation for long corridor scenarios. In this paper, we collected corridor images to construct datasets covering various scenarios, tested the performance of the method experimentally, and conducted an accuracy comparison with a deep learning-based method. The test results show that:

  1. The accuracy of the explicit method for fast monocular depth recovery in long corridor scenarios reaches the advanced level of existing monocular depth estimation algorithms;

  2. The fast monocular depth recovery method greatly accelerates the depth recovery of a single image. On medium-to-low performance processors, the method can perform real-time depth estimation of corridor scenes at about 20 FPS.

Although the method works well in the experimental scenarios, there are still some limitations:

  1. Because the method relies on substantial scene prior information, its application scenarios are limited; it is most suitable for corridors that are closed and straight;

  2. Due to the ground plane assumption, the performance of the method decreases when a small slope exists in the corridor, although this scenario is rare in reality;

  3. Because of the perspective geometry, the accuracy is lower at farther distances, where there is a cumulative drift in height and lateral position, as shown in Figure 8, and this drift needs to be compensated empirically.

We are working on replacing straight-line detection with curve detection and on simply modeling the height of the ground to tackle the above limitations, so that the proposed method can also perform well in scenes with curvature.

The proposed algorithm can be used as an auxiliary module in the autonomous navigation and positioning systems of Unmanned Aerial Vehicles (UAVs) and Unmanned Ground Vehicles (UGVs), such as Simultaneous Localization and Mapping (SLAM) systems, and in other applications of monocular camera systems. Several possible applications are:

  1. Safe UAV flight inside corridors;

  2. 3D positioning of monitoring cameras in corridors;

  3. SLAM for delivery robots in corridors.

In the future, we will try to apply this method effectively to the above applications. In addition, a small deep neural network for segmentation and classification could be trained and combined with this method to address the problem of limited applicable scenarios. Finally, the theory of our method is suitable for unsupervised training of depth estimation models.

Figure 8: (a)-(k) Estimation results in several scenes. In each row, from left to right: the edge extraction result, the 3D edge estimation result, and the depth estimation result.

References

  • [1] F. Khan, S. Salahuddin, and H. Javidnia, “Deep learning-based monocular depth estimation methods—a state-of-the-art review,” Sensors, vol. 20, no. 8, p. 2272, 2020.
  • [2] X. Dong, M. A. Garratt, S. G. Anavatti, and H. A. Abbass, “Towards real-time monocular depth estimation for robotics: A survey,” IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 10, pp. 16 940–16 961, 2022.
  • [3] Y. Ming, X. Meng, C. Fan, and H. Yu, “Deep learning for monocular depth estimation: A review,” Neurocomputing, vol. 438, pp. 14–33, 2021.
  • [4] H. Laga, L. V. Jospin, F. Boussaid, and M. Bennamoun, “A survey on deep learning techniques for stereo-based depth estimation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 4, pp. 1738–1764, 2020.
  • [5] P. Vyas, C. Saxena, A. Badapanda, and A. Goswami, “Outdoor monocular depth estimation: A research review,” arXiv preprint arXiv:2205.01399, 2022.
  • [6] R. Mur-Artal and J. D. Tardós, “Orb-slam2: An open-source slam system for monocular, stereo, and rgb-d cameras,” IEEE transactions on robotics, vol. 33, no. 5, pp. 1255–1262, 2017.
  • [7] D. Park, R. Ambrus, V. Guizilini, J. Li, and A. Gaidon, “Is pseudo-lidar needed for monocular 3d object detection?” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 3142–3152.
  • [8] H. Ha, S. Im, J. Park, H.-G. Jeon, and I. S. Kweon, “High-quality depth from uncalibrated small motion clip,” in Proceedings of the IEEE conference on computer vision and pattern Recognition, 2016, pp. 5413–5421.
  • [9] H. Javidnia and P. Corcoran, “Accurate depth map estimation from small motions,” in Proceedings of the IEEE International Conference on Computer Vision Workshops, 2017, pp. 2453–2461.
  • [10] C. Wang, J. M. Buenaposada, R. Zhu, and S. Lucey, “Learning depth from monocular videos using direct methods,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 2022–2030.
  • [11] N. Yang, R. Wang, J. Stuckler, and D. Cremers, “Deep virtual stereo odometry: Leveraging deep depth prediction for monocular direct sparse odometry,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 817–833.
  • [12] A. Saxena, S. Chung, and A. Ng, “Learning depth from single monocular images,” Advances in neural information processing systems, vol. 18, 2005.
  • [13] D. Eigen, C. Puhrsch, and R. Fergus, “Depth map prediction from a single image using a multi-scale deep network,” Advances in neural information processing systems, vol. 27, 2014.
  • [14] L. Wang, M. Famouri, and A. Wong, “Depthnet nano: A highly compact self-normalizing neural network for monocular depth estimation,” arXiv preprint arXiv:2004.08008, 2020.
  • [15] S. F. Bhat, I. Alhashim, and P. Wonka, “Adabins: Depth estimation using adaptive bins,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 4009–4018.
  • [16] H. Fu, M. Gong, C. Wang, K. Batmanghelich, and D. Tao, “Deep ordinal regression network for monocular depth estimation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 2002–2011.
  • [17] C. Zhao, Q. Sun, C. Zhang, Y. Tang, and F. Qian, “Monocular depth estimation based on deep learning: An overview,” Science China Technological Sciences, vol. 63, no. 9, pp. 1612–1627, 2020.
  • [18] J. Jeong and A. Kim, “Adaptive inverse perspective mapping for lane map generation with slam,” in 2016 13th International Conference on Ubiquitous Robots and Ambient Intelligence (URAI).   IEEE, 2016, pp. 38–41.
  • [19] J. Lin and J. Peng, “Adaptive inverse perspective mapping transformation method for ballasted railway based on differential edge detection and improved perspective mapping model,” Digital Signal Processing, vol. 135, p. 103944, 2023.
  • [20] Z. Wang, X. Wu, Y. Yan, C. Jia, B. Cai, Z. Huang, G. Wang, and T. Zhang, “An inverse projective mapping-based approach for robust rail track extraction,” in 2015 8th International Congress on Image and Signal Processing (CISP).   IEEE, 2015, pp. 888–893.
  • [21] W. Yang, B. Fang, and Y. Y. Tang, “Fast and accurate vanishing point detection and its application in inverse perspective mapping of structured road,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 48, no. 5, pp. 755–766, 2016.
  • [22] C. Zhong, P. You, X. Chen, H. Zhao, F. Sun, G. Zhou, X. Mu, C. Gan, and W. Huang, “Snake: Shape-aware neural 3d keypoint field,” Advances in Neural Information Processing Systems, vol. 35, pp. 7052–7064, 2022.
  • [23] Y. Wu, R. Li, Z. Qin, X. Zhao, and X. Li, “Heightformer: Explicit height modeling without extra data for camera-only 3d object detection in bird’s eye view,” arXiv preprint arXiv:2307.13510, 2023.
  • [24] M. Poggi, F. Aleotti, F. Tosi, and S. Mattoccia, “Towards real-time unsupervised monocular depth estimation on cpu,” in 2018 IEEE/RSJ international conference on intelligent robots and systems (IROS).   IEEE, 2018, pp. 5848–5854.
  • [25] D. Wofk, F. Ma, T.-J. Yang, S. Karaman, and V. Sze, “Fastdepth: Fast monocular depth estimation on embedded systems,” in 2019 International Conference on Robotics and Automation (ICRA).   IEEE, 2019, pp. 6101–6108.
  • [26] J. Liu, Q. Li, R. Cao, W. Tang, and G. Qiu, “Mininet: An extremely lightweight convolutional neural network for real-time unsupervised monocular depth estimation,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 166, pp. 255–267, 2020.
  • [27] R. P. Padhy, S. Verma, S. Ahmad, S. K. Choudhury, and P. K. Sa, “Deep neural network for autonomous uav navigation in indoor corridor environments,” Procedia computer science, vol. 133, pp. 643–650, 2018.
  • [28] R. P. Padhy, S. Ahmad, S. Verma, S. Bakshi, and P. K. Sa, “Localization of unmanned aerial vehicles in corridor environments using deep learning,” in 2020 25th International Conference on Pattern Recognition (ICPR).   IEEE, 2021, pp. 9423–9428.
  • [29] G. Ge, Y. Zhang, W. Wang, L. Hu, Y. Wang, and Q. Jiang, “Visual-feature-assisted mobile robot localization in a long corridor environment,” Frontiers of Information Technology & Electronic Engineering, vol. 24, no. 6, pp. 876–889, 2023.
  • [30] R. Wang, J. Qin, K. Li, and D. Cao, “Bev lane det: Fast lane detection on bev ground,” arXiv preprint arXiv:2210.06006, 2022.
  • [31] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The kitti dataset,” The International Journal of Robotics Research, vol. 32, no. 11, pp. 1231–1237, 2013.
  • [32] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, “Indoor segmentation and support inference from rgbd images,” in Computer Vision–ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, October 7-13, 2012, Proceedings, Part V 12.   Springer, 2012, pp. 746–760.