
Rui Fan, UC San Diego, [email protected]
Li Wang, ATG Robotics, [email protected]
Mohammud Junaid Bocus, University of Bristol, [email protected]
Ioannis Pitas, Aristotle University of Thessaloniki, [email protected]

Computer Stereo Vision for Autonomous Driving

Rui Fan    Li Wang    Mohammud Junaid Bocus    Ioannis Pitas
Abstract

As an important component of autonomous systems, autonomous car perception has made great strides with recent advances in parallel computing architectures. With the use of compact but full-featured embedded supercomputers, computer stereo vision has been widely applied in autonomous cars for depth perception. The two key aspects of computer stereo vision are speed and accuracy. Both are desirable but conflicting properties, as algorithms with better disparity accuracy usually have higher computational complexity. Therefore, the main aim of developing a computer stereo vision algorithm for resource-limited hardware is to improve the trade-off between speed and accuracy. In this chapter, we introduce both the hardware and software aspects of an autonomous car system. Then, we discuss four autonomous car perception tasks, including 1) visual feature detection, description and matching, 2) 3D information acquisition, 3) object detection/recognition and 4) semantic image segmentation. The principles of computer stereo vision and parallel computing on multi-threading CPU and GPU architectures are then detailed.

1 Introduction

Autonomous car systems enable self-driving cars to navigate in complicated environments without any human driver intervention. An example of an autonomous car system architecture is illustrated in Fig. 1. Its hardware (HW) mainly includes: 1) car sensors, such as cameras, LIDARs and Radars; and 2) the car chassis, including the throttle, brake and steering wheel. On the other hand, the software (SW) is comprised of four main functional modules fan2019key : 1) perception, 2) localization and mapping, 3) prediction and planning and 4) control. Computer stereo vision is an important component of the perception module. It enables self-driving cars to perceive the environment in 3D.

Figure 1: Autonomous car system architecture.

This chapter first introduces the HW/SW aspects of an autonomous car system. Then, four autonomous car perception tasks are discussed, including: 1) visual feature detection, description and matching, 2) 3D information acquisition, 3) object detection/recognition and 4) semantic image segmentation. Finally, the principles of computer stereo vision and parallel computing on multi-threading CPU and GPU are detailed.

2 Autonomous Car System

2.1 Hardware

Car Sensors

The most commonly used car sensors include: a) passive sensors, such as cameras; b) active sensors, such as LIDARs, Radars and ultrasonic transceivers; and c) other types of sensors, such as global positioning systems (GPS), inertial measurement units (IMU), among others. When choosing them, many different factors need to be considered, such as sampling rate, field of view (FoV), accuracy, range, cost and overall system complexity (autonomous-driving.org/2019/01/25/positioning-sensors-for-autonomous-vehicles).

Cameras capture 2D images by collecting light reflected from 3D objects. Images captured from different views can be utilized to reconstruct the 3D driving scene geometry. Most autonomous car perception tasks, such as visual semantic driving scene segmentation and object detection/recognition, are developed for images. In Sec. 3, we provide readers with a comprehensive overview of these tasks. The perspective (or pinhole) camera model and the mathematical principles of multi-view geometry are discussed in Sec. 4. Acquired image quality is always subject to environmental conditions, such as weather and illumination pitas2000digital . Therefore, fusing visual information with data from other sensors is typically required for robust autonomous car perception.

LIDARs illuminate a target with pulsed laser light and measure the source-target distance by analyzing the reflected laser pulses detection2013ranging . Due to their ability to generate highly accurate 3D driving scene geometry models, LIDARs are generally mounted on autonomous cars for depth perception. Current industrial autonomous car localization and mapping systems are generally based on LIDARs. Furthermore, Radars can measure both the range and radial velocity of an object by transmitting an electromagnetic wave and analyzing its reflections bureau2013radar . Radars have already been established in the automotive industry, and they have been prevalently employed to enable intelligent vehicle advanced driver assistance system (ADAS) features, such as adaptive cruise control and autonomous emergency braking. Similar to Radars, ultrasonic transceivers calculate the source-object distance by measuring the time between transmitting an ultrasonic signal and receiving its echo westerveld2014silicon . Ultrasonic transceivers are commonly used for autonomous car localization and navigation.

In addition to the aforementioned passive and active sensors, GPS and IMU systems are commonly used to enhance autonomous car localization and mapping performance zheng2019low . GPS can provide both time and geolocation information for autonomous cars. However, its signals can become very weak when GPS reception is blocked by obstacles in GPS-denied areas, such as dense urban environments samama2008global . Hence, GPS and IMU information fusion is widely adopted to provide continuous autonomous car position and velocity information zheng2019low .

Car Chassis

Car chassis technologies, especially Drive-by-Wire (DbW), are required for building autonomous vehicles. DbW technology refers to the electronic systems that can replace traditional mechanical controls liu2019chassis . DbW systems can perform vehicle functions, which are traditionally achieved by mechanical linkages, through electrical or electro-mechanical systems. There are three main vehicle control systems that are commonly replaced with electronic controls: 1) throttle, 2) braking and 3) steering.

A Throttle-by-Wire (TbW) system accomplishes vehicle propulsion via an electronic throttle, without any cables from the accelerator pedal to the engine throttle valve. In electric vehicles, the TbW system controls the electric motors by sensing the accelerator pedal pressure (input) and sending signals to the power inverter modules. Compared to traditional hydraulic brakes, which provide braking effort by building hydraulic pressure in the brake lines, a Brake-by-Wire (BbW) system completely eliminates the need for hydraulics by using electric motors to actuate the calipers. Furthermore, in vehicles equipped with Steer-by-Wire (SbW) technology, there is no physical connection between the steering wheel and the tires. The wheels' direction is controlled through electric motor(s), which are actuated by electronic control units monitoring the steering wheel inputs.

In comparison to traditional throttle systems, electronic throttle systems are much lighter, hence greatly reducing the weight of modern cars. In addition, they are easier to service and tune, as a technician can simply connect a laptop to perform tuning. Moreover, an electronic control system allows more accurate control of the throttle opening, compared to a cable control that stretches over time. Furthermore, since the steering wheel can be bypassed as an input device, safety can be improved by providing computer-controlled intervention of vehicle controls with systems such as Adaptive Cruise Control and Electronic Stability Control.

2.2 Software

The autonomous car perception module analyzes the raw data collected by the car sensors (see Sec. 2.1) and outputs its understanding of the environment. This process is similar to human visual cognition. We discuss different autonomous car perception tasks in Sec. 3.

Perception module outputs are then used by other modules. The localization and mapping module not only estimates the autonomous car's location, but also constructs and updates the 3D environment map bhuttaloop . This topic has become very popular since the concept of Simultaneous Localization and Mapping (SLAM) was introduced in 1986 smith1986representation .

The prediction and planning module first analyzes the motion patterns of other traffic agents and predicts their future trajectories. Such predictions are then used to determine possible safe autonomous car navigation routes katrakazas2015real using different path planning techniques, such as Dijkstra's algorithm cormen2001section , A-star (or simply A*) delling2009engineering , etc.

Finally, the autonomous car control module sends appropriate commands to the car controllers (see Sec. 2.1), based on the predicted trajectory and the estimated car state. This enables the autonomous car to follow the planned trajectory as closely as possible. Traditional controllers, such as proportional-integral-derivative (PID) willis1999proportional , linear-quadratic regulator (LQR) goodwin2001control and model predictive control (MPC) garcia1989model , are still the most commonly used ones in the autonomous car control module.

3 Autonomous Car Perception

The autonomous car perception module has four main functionalities:

  1. visual feature detection, description and matching;

  2. 3D information acquisition;

  3. object detection/recognition;

  4. semantic image segmentation.

Visual feature detectors and descriptors have become very popular research topics in the computer vision and robotics communities. They have been applied in many application domains hassaballah2016image , such as image classification liu2012discriminative , 3D scene reconstruction moreels2007evaluation , object recognition dollar2011pedestrian and visual tracking danelljan2016discriminative . The matched visual feature correspondences between two (or more) images can be utilized to establish image relationships hassaballah2016image . The most well-known visual features are scale-invariant feature transform (SIFT) lowe2004distinctive , speeded up robust feature (SURF) bay2006surf , oriented FAST and rotated BRIEF (ORB) rublee2011orb , binary robust invariant scalable keypoints (BRISK) leutenegger2011brisk , and so forth.

Figure 2: 3D scene reconstruction, where W represents the world coordinate system (WCS).

The digital images captured by cameras are essentially 2D fan2018real . In order to recover the 3D information of a given driving scene, images from multiple views are required hartley2003multiple . These images can be captured using either a single moving camera wang2020cot or an array of synchronized cameras, as shown in Fig. 2. The former approach is typically known as structure from motion (SfM) ullman1979interpretation or optical flow wang2020cot , while the latter is typically referred to as stereo vision or binocular vision (in the case of two cameras) fan2018real . SfM methods estimate both the camera poses and the 3D points of interest from images captured from multiple views, which are linked by a collection of visual features. They also leverage the bundle adjustment (BA) technique triggs1999bundle to refine the estimated camera poses and 3D point locations, by minimizing a cost function known as the total re-projection error wang2020atg . Optical flow describes the motion of pixels between consecutive frames of a video sequence wang2020atg . It is also regarded as an effective tool for dynamic object detection wang2020cot . Stereo vision acquires depth information by finding the horizontal positional differences (disparities) between the visual feature correspondence pairs in two synchronously captured images. More details on computer stereo vision are given in Sec. 4.3.

Figure 3: Object detection/recognition.

Object detection/recognition refers to the recognition and localization of particular objects in images/videos zhao2019object . It can be used in various autonomous car perception subtasks, such as pedestrian detection wang2019deep , vehicle detection wang2020atg , traffic sign detection mogelmose2012vision , cyclist detection wu2017squeezedet , etc., as shown in Fig. 3. Object detection approaches can be classified as either computer vision-based or machine/deep learning-based. The former typically consists of three steps zhao2019object : 1) informative region selection (scanning the whole image by sliding windows with particular templates to produce candidate regions); 2) visual feature extraction, as discussed above; and 3) object classification (distinguishing a target object from all the other categories using a classifier). With recent advances in machine/deep learning, a large number of convolutional neural networks (CNNs) have been proposed to recognize objects from images/videos. Such CNN-based approaches have achieved very impressive results. The most popular ones include: regions with CNN features (R-CNN) girshick2014rich , fast R-CNN girshick2015fast , faster R-CNN ren2015faster , you only look once (YOLO) redmon2016you , YOLOv3 redmon2018yolov3 , YOLOv4 bochkovskiy2020yolov4 , etc.

Figure 4: Semantic image segmentation.

Semantic image segmentation labels every pixel in the image with a given object class fan2020sne , such as lane marking, vehicle, collision-free space, or pedestrian, as illustrated in Fig. 4. The state-of-the-art semantic image segmentation approaches fall into two main groups wang2020applying : 1) single-modal and 2) data-fusion. The former typically segments RGB images with an encoder-decoder CNN architecture fan2020learning . In recent years, many popular single-modal semantic image segmentation algorithms, such as Fully Convolutional Network (FCN) long2015fully , U-Net ronneberger2015u , SegNet badrinarayanan2017segnet , DeepLabv3+ chen2018encoder , DenseASPP yang2018denseaspp , DUpsampling tian2019decoders , etc., have been proposed. Data-fusion semantic image segmentation approaches generally learn features from two different types of vision data fan2020we , such as RGB and depth images in FuseNet hazirbas2016fusenet , RGB and surface normal images fan2020three in SNE-RoadSeg fan2020sne , RGB and transformed disparity images fan2018ipl ; fan2019pothole ; fan2019road in AA-RTFNet fan2020we , or RGB and thermal images in MFNet ha2017mfnet . The learned feature maps are then fused to provide a better semantic prediction.

Please note: a given autonomous car perception application can always be solved by different types of techniques. For instance, lane marking detection can be formulated as a linear/quadratic/quadruplicate pattern recognition problem and solved using straight line detection algorithms fan2016ist or dynamic programming algorithms ozgunalp2016multiple ; fan2018lane ; jiao2019using . On the other hand, it can also be formulated as a semantic image segmentation problem and solved with CNNs.

4 Computer Stereo Vision

4.1 Preliminaries

1. Skew-symmetric matrix

In linear algebra, a skew-symmetric matrix \mathbf{A} satisfies the following property: its transpose is identical to its negative, i.e., \mathbf{A}^{\top}=-\mathbf{A}. In 3D computer vision, the skew-symmetric matrix [\mathbf{a}]_{\times} of a vector \mathbf{a}=[a_{1},a_{2},a_{3}]^{\top} can be written as hartley2003multiple :

[\mathbf{a}]_{\times}=\begin{bmatrix}0&-a_{3}&a_{2}\\ a_{3}&0&-a_{1}\\ -a_{2}&a_{1}&0\end{bmatrix}. (1)

A skew-symmetric matrix has two important properties:

\mathbf{a}^{\top}[\mathbf{a}]_{\times}=\textbf{0}^{\top},\ \ \ [\mathbf{a}]_{\times}\mathbf{a}=\textbf{0}, (2)

where \textbf{0}=[0,0,0]^{\top} is a zero vector. Furthermore, the cross product of two vectors \mathbf{a} and \mathbf{b} can be formulated as a matrix multiplication hartley2003multiple :

\mathbf{a}\times\mathbf{b}=[\mathbf{a}]_{\times}\mathbf{b}=-[\mathbf{b}]_{\times}\mathbf{a}. (3)

These properties are generally used to simplify the equations related to vector cross-product.
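The following minimal NumPy sketch (the helper name skew is ours, not from the chapter) builds [\mathbf{a}]_{\times} as defined in (1) and numerically checks properties (2) and (3):

```python
import numpy as np

def skew(a):
    """Return the skew-symmetric matrix [a]_x of a 3D vector a, as in (1)."""
    return np.array([[0.0, -a[2], a[1]],
                     [a[2], 0.0, -a[0]],
                     [-a[1], a[0], 0.0]])

a = np.array([1.0, 2.0, 3.0])
b = np.array([-2.0, 0.5, 4.0])

# Property (2): a^T [a]_x = 0^T and [a]_x a = 0
print(a @ skew(a))          # ~[0, 0, 0]
print(skew(a) @ a)          # ~[0, 0, 0]

# Property (3): a x b = [a]_x b = -[b]_x a
print(np.cross(a, b))
print(skew(a) @ b)
print(-skew(b) @ a)
```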

2. Lie group SO(3) and SE(3)

A 3D point \mathbf{x}_{1}=[x_{1},y_{1},z_{1}]^{\top}\in\mathbb{R}^{3\times 1} can be transformed into another 3D point \mathbf{x}_{2}=[x_{2},y_{2},z_{2}]^{\top}\in\mathbb{R}^{3\times 1} using a rotation matrix \mathbf{R}\in\mathbb{R}^{3\times 3} and a translation vector \mathbf{t}\in\mathbb{R}^{3\times 1}:

\mathbf{x}_{2}=\mathbf{R}\mathbf{x}_{1}+\mathbf{t}. (4)

\mathbf{R} satisfies matrix orthogonality:

\mathbf{R}\mathbf{R}^{\top}=\mathbf{R}^{\top}\mathbf{R}=\mathbf{I}\ \ \text{and}\ \ |\text{det}(\mathbf{R})|=1, (5)

where \mathbf{I} is an identity matrix and \text{det}(\mathbf{R}) represents the determinant of \mathbf{R}. The group containing all rotation matrices is referred to as the special orthogonal group and is denoted as SO(3). \mathbf{\tilde{x}}_{1}=[\mathbf{x}_{1}^{\top},1]^{\top} and \mathbf{\tilde{x}}_{2}=[\mathbf{x}_{2}^{\top},1]^{\top}, the homogeneous coordinates of \mathbf{x}_{1} and \mathbf{x}_{2}, can be used to describe rotation and translation jointly, as follows:

\mathbf{\tilde{x}}_{2}=\mathbf{P}\mathbf{\tilde{x}}_{1}, (6)

where

\mathbf{P}=\left[\begin{array}{c|c}\mathbf{R}&\mathbf{t}\\ \hline \textbf{0}^{\top}&1\end{array}\right], (7)

and \mathbf{P} is a homogeneous transformation matrix (seas.upenn.edu/~meam620/slides/kinematicsI.pdf). The group containing all homogeneous transformation matrices is referred to as the special Euclidean group and is denoted as SE(3).
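As a small illustration of (4)-(7), the sketch below (with an arbitrary rotation about the z-axis and an arbitrary translation chosen only for this example) assembles a homogeneous transformation matrix \mathbf{P} and verifies that it transforms homogeneous coordinates exactly as (4) transforms the 3D point:

```python
import numpy as np

theta = np.deg2rad(30.0)
# Rotation about the z-axis (a valid member of SO(3)) and a translation vector.
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0,            0.0,           1.0]])
t = np.array([0.5, -1.0, 2.0])

# Homogeneous transformation matrix P in SE(3), as in (7).
P = np.eye(4)
P[:3, :3] = R
P[:3, 3] = t

x1 = np.array([1.0, 2.0, 3.0])
x1_h = np.append(x1, 1.0)          # homogeneous coordinates [x1^T, 1]^T

x2_h = P @ x1_h                    # (6)
x2 = R @ x1 + t                    # (4)
print(np.allclose(x2_h[:3], x2))   # True: both give the same point
```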

4.2 Multi-View Geometry

Perspective Camera Model

The perspective (or pinhole) camera model, as illustrated in Fig. 5, is the most common geometric camera model describing the relationship between a 3D point \mathbf{p}^{\text{C}}=[x^{\text{C}},y^{\text{C}},z^{\text{C}}]^{\top} in the camera coordinate system (CCS) and its projection \mathbf{\bar{p}}=[x,y,f]^{\top} on the image plane \Pi. \mathbf{o}^{\text{C}} is the camera center. The distance between \Pi and \mathbf{o}^{\text{C}} is the camera focal length f. \hat{\mathbf{p}}^{\text{C}}=[\frac{x^{\text{C}}}{z^{\text{C}}},\frac{y^{\text{C}}}{z^{\text{C}}},1]^{\top} denotes the normalized coordinates of \mathbf{p}^{\text{C}}=[x^{\text{C}},y^{\text{C}},z^{\text{C}}]^{\top}. The optical axis is the ray originating from \mathbf{o}^{\text{C}} and passing perpendicularly through \Pi. The relationship between \mathbf{p}^{\text{C}} and \bar{\mathbf{p}} is as follows trucco1998introductory :

\mathbf{\bar{p}}=f\hat{\mathbf{p}}^{\text{C}}=\frac{f}{z^{\text{C}}}\mathbf{p}^{\text{C}}. (8)
Figure 5: Perspective camera model.

Intrinsic Matrix

Since lens distortion does not exist in the perspective camera model, \mathbf{\bar{p}}=[x,y,f]^{\top} on the image plane \Pi can be transformed into a pixel \mathbf{p}=[u,v]^{\top} in the image using fan2018real :

u=u_{o}+s_{x}x,\ \ \ v=v_{o}+s_{y}y, (9)

where \mathbf{p}_{o}=[u_{o},v_{o}]^{\top} is the principal point; and s_{x} and s_{y} are the effective sizes (measured in pixels per millimeter) in the horizontal and vertical directions, respectively trucco1998introductory . To simplify the expression of the camera intrinsic matrix \mathbf{K}, two notations f_{x}=fs_{x} and f_{y}=fs_{y} are introduced. u_{o}, v_{o}, f, s_{x} and s_{y} are the five camera intrinsic parameters hartley2003multiple . Combining (8) and (9), a 3D point \mathbf{p}^{\text{C}} in the CCS can be transformed into a pixel \mathbf{p} in the image using fan2019cvprw :

\mathbf{\tilde{p}}=\frac{1}{z^{\text{C}}}\mathbf{K}\mathbf{p}^{\text{C}}=\frac{1}{z^{\text{C}}}\begin{bmatrix}f_{x}&0&u_{o}\\ 0&f_{y}&v_{o}\\ 0&0&1\end{bmatrix}\begin{bmatrix}x^{\text{C}}\\ y^{\text{C}}\\ z^{\text{C}}\end{bmatrix}, (10)

where \mathbf{\tilde{p}}=[\mathbf{p}^{\top},1]^{\top}=[u,v,1]^{\top} denotes the homogeneous coordinates of \mathbf{p}=[u,v]^{\top}. Plugging (10) into (8) results in:

\hat{\mathbf{p}}^{\text{C}}=\mathbf{K}^{-1}\mathbf{\tilde{p}}=\frac{\mathbf{\bar{p}}}{f}=\frac{\mathbf{p}^{\text{C}}}{z^{\text{C}}}. (11)

Therefore, an arbitrary 3D point lying on the ray that goes from \mathbf{o}^{\text{C}} through \mathbf{p}^{\text{C}} is always projected at \bar{\mathbf{p}} on the image plane.
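The projection in (10) and the back-projection in (11) can be sketched in a few lines of NumPy; the intrinsic parameter values below are placeholders chosen only for illustration:

```python
import numpy as np

# Assumed (hypothetical) intrinsic parameters of a pinhole camera.
fx, fy, uo, vo = 700.0, 700.0, 320.0, 240.0
K = np.array([[fx, 0.0, uo],
              [0.0, fy, vo],
              [0.0, 0.0, 1.0]])

p_C = np.array([0.5, -0.2, 4.0])   # 3D point in the camera coordinate system

# (10): project the 3D point into the image; dividing by z^C gives the
# homogeneous pixel coordinates p_tilde = [u, v, 1]^T.
p_tilde = (K @ p_C) / p_C[2]
u, v = p_tilde[0], p_tilde[1]
print(u, v)

# (11): back-project the pixel to normalized coordinates p_hat = K^{-1} p_tilde.
p_hat = np.linalg.inv(K) @ p_tilde
print(np.allclose(p_hat, p_C / p_C[2]))   # True
```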

Lens Distortion

In order to obtain better imaging results, a lens is usually installed in front of the camera trucco1998introductory . However, this introduces image distortions. The optical aberration caused by the installed lens typically causes straight lines in the scene to appear as curves in the images zhang2000flexible , as shown in Fig. 6(a). We can observe in Fig. 6(b) that the bent checkerboard grids become straight when the lens distortion is corrected.

Figure 6: Distorted image correction: (a) original image; (b) corrected image.

Lens distortion can be categorized into two main types: 1) radial distortion and 2) tangential distortion fan2018real . Radial distortion arises from the geometric shape of the lens, which bends straight lines in the scene; tangential distortion occurs because the lens is not perfectly parallel to the image plane trucco1998introductory . In practical experiments, the image geometry is affected by radial distortion to a much greater extent than by tangential distortion. Therefore, the latter is sometimes neglected in the process of distorted image correction.

Radial distortion

Radial distortion mainly includes 1) barrel distortion, 2) pincushion distortion and 3) mustache distortion, as illustrated in Fig. 7.

Figure 7: Radial distortion types.

It can be observed that a) radial distortions are symmetric about the image center and b) straight lines are no longer preserved. In barrel distortion, the image magnification decreases with the distance from the optical axis (lines curve outwards). In contrast, pincushion distortion pinches the image (lines curve inwards). Mustache distortion is a mixture of the above two distortion types: it starts out as barrel distortion close to the optical axis and gradually turns into pincushion distortion close to the image periphery. Barrel distortion is commonly seen in fish-eye lenses, which are used to produce wide-angle/panoramic images, while pincushion distortion is often associated with telephoto/zoom lenses. Radial distortion can be corrected using (docs.opencv.org/2.4/doc/tutorials/calib3d/camera_calibration/camera_calibration.html):

\begin{split}x_{\text{undist}}&=x_{\text{dist}}(1+k_{1}r^{2}+k_{2}r^{4}+k_{3}r^{6}),\\ y_{\text{undist}}&=y_{\text{dist}}(1+k_{1}r^{2}+k_{2}r^{4}+k_{3}r^{6}),\end{split} (12)

where the corrected point is \mathbf{p}_{\text{undist}}=[x_{\text{undist}},y_{\text{undist}}]^{\top}; r^{2}=x_{\text{dist}}^{2}+y_{\text{dist}}^{2}; x_{\text{dist}}=\frac{x^{\text{C}}}{z^{\text{C}}}=\frac{u-u_{o}}{f_{x}} and y_{\text{dist}}=\frac{y^{\text{C}}}{z^{\text{C}}}=\frac{v-v_{o}}{f_{y}} can be obtained from the distorted image (docs.opencv.org/2.4/modules/imgproc/doc/geometric_transformations.html). k_{1}, k_{2} and k_{3} are three intrinsic parameters used for radial distortion correction. They can be estimated using a collection of images containing a planar checkerboard pattern.

Tangential distortion

Similar to radial distortion, tangential distortion can also be corrected using:

\begin{split}x_{\text{undist}}&=x_{\text{dist}}+\big[2p_{1}x_{\text{dist}}y_{\text{dist}}+p_{2}(r^{2}+2x_{\text{dist}}^{2})\big],\\ y_{\text{undist}}&=y_{\text{dist}}+\big[p_{1}(r^{2}+2y_{\text{dist}}^{2})+2p_{2}x_{\text{dist}}y_{\text{dist}}\big],\end{split} (13)

where p_{1} and p_{2} are two intrinsic parameters, which can also be estimated using a collection of images containing a planar checkerboard pattern.
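In practice, the distortion coefficients k_{1}, k_{2}, k_{3}, p_{1} and p_{2} are estimated together with \mathbf{K} from checkerboard images, and the correction itself is rarely coded by hand. A minimal OpenCV sketch is given below; the intrinsic matrix, distortion coefficients and input image are placeholders:

```python
import cv2
import numpy as np

# Hypothetical intrinsic matrix and distortion coefficients in OpenCV's
# (k1, k2, p1, p2, k3) order, e.g. as returned by cv2.calibrateCamera().
K = np.array([[700.0, 0.0, 320.0],
              [0.0, 700.0, 240.0],
              [0.0, 0.0, 1.0]])
dist = np.array([-0.30, 0.12, 1e-3, -5e-4, 0.02])

# A random image stands in for a real distorted capture here.
img = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)

# cv2.undistort() corrects the radial and tangential distortion modelled
# by (12) and (13) and returns the undistorted image.
undistorted = cv2.undistort(img, K, dist)
```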

Epipolar Geometry

The generic geometry of stereo vision is known as epipolar geometry. An example of the epipolar geometry is shown in Fig. 8.

Figure 8: Epipolar geometry.

\Pi_{\text{L}} and \Pi_{\text{R}} represent the left and right image planes, respectively. \mathbf{o}_{\text{L}}^{\text{C}} and \mathbf{o}_{\text{R}}^{\text{C}} denote the origins of the left camera coordinate system (LCCS) and the right camera coordinate system (RCCS), respectively. The 3D point \mathbf{p}^{\text{W}}=[x^{\text{W}},y^{\text{W}},z^{\text{W}}]^{\top} in the WCS is projected at \mathbf{\bar{p}}_{\text{L}}=[x_{\text{L}},y_{\text{L}},f_{\text{L}}]^{\top} on \Pi_{\text{L}} and at \mathbf{\bar{p}}_{\text{R}}=[x_{\text{R}},y_{\text{R}},f_{\text{R}}]^{\top} on \Pi_{\text{R}}, respectively. f_{\text{L}} and f_{\text{R}} are the focal lengths of the left and right cameras, respectively. The representations of \mathbf{p}^{\text{W}} in the LCCS and RCCS are \mathbf{p}^{\text{C}}_{\text{L}}=[x^{\text{C}}_{\text{L}},y^{\text{C}}_{\text{L}},z^{\text{C}}_{\text{L}}]^{\top}=\frac{z_{\text{L}}^{\text{C}}}{f_{\text{L}}}\mathbf{\bar{p}}_{\text{L}} and \mathbf{p}^{\text{C}}_{\text{R}}=[x^{\text{C}}_{\text{R}},y^{\text{C}}_{\text{R}},z^{\text{C}}_{\text{R}}]^{\top}=\frac{z_{\text{R}}^{\text{C}}}{f_{\text{R}}}\mathbf{\bar{p}}_{\text{R}}, respectively. According to (4), \mathbf{p}^{\text{C}}_{\text{L}} can be transformed into \mathbf{p}^{\text{C}}_{\text{R}} using:

\mathbf{p}^{\text{C}}_{\text{R}}=\mathbf{R}\mathbf{p}^{\text{C}}_{\text{L}}+\mathbf{t}, (14)

where \mathbf{R}\in\mathbb{R}^{3\times 3} is a rotation matrix and \mathbf{t}\in\mathbb{R}^{3\times 1} is a translation vector. \mathbf{e}^{\text{C}}_{\text{L}} and \mathbf{e}^{\text{C}}_{\text{R}} denote the left and right epipoles, respectively. The epipolar plane is uniquely defined by \mathbf{o}^{\text{C}}_{\text{L}}, \mathbf{o}^{\text{C}}_{\text{R}} and \mathbf{p}^{\text{W}}. It intersects \Pi_{\text{L}} and \Pi_{\text{R}}, giving rise to two epipolar lines, as shown in Fig. 8. Using (11), \mathbf{p}^{\text{C}}_{\text{L}} and \mathbf{p}^{\text{C}}_{\text{R}} can be normalized using:

\hat{\mathbf{p}}^{\text{C}}_{\text{L}}=\frac{\mathbf{p}^{\text{C}}_{\text{L}}}{z^{\text{C}}_{\text{L}}}=\mathbf{K}_{\text{L}}^{-1}\mathbf{\tilde{p}}_{\text{L}},\ \ \ \hat{\mathbf{p}}^{\text{C}}_{\text{R}}=\frac{\mathbf{p}^{\text{C}}_{\text{R}}}{z^{\text{C}}_{\text{R}}}=\mathbf{K}_{\text{R}}^{-1}\mathbf{\tilde{p}}_{\text{R}}, (15)

where \mathbf{K}_{\text{L}} and \mathbf{K}_{\text{R}} denote the intrinsic matrices of the left and right cameras, respectively. \mathbf{\tilde{p}}_{\text{L}}=[\mathbf{p}_{\text{L}}^{\top},1]^{\top} and \mathbf{\tilde{p}}_{\text{R}}=[\mathbf{p}_{\text{R}}^{\top},1]^{\top} are the homogeneous coordinates of the image pixels \mathbf{p}_{\text{L}}=[u_{\text{L}},v_{\text{L}}]^{\top} and \mathbf{p}_{\text{R}}=[u_{\text{R}},v_{\text{R}}]^{\top}, respectively.

Essential Matrix

The essential matrix \mathbf{E}\in\mathbb{R}^{3\times 3} was first introduced by Longuet-Higgins in 1981 longuet1981computer . A simple way of introducing the defining equation of \mathbf{E} is to multiply both sides of (14) by \mathbf{p}^{\text{C}\,\top}_{\text{R}}[\mathbf{t}]_{\times}:

\mathbf{p}^{\text{C}\,\top}_{\text{R}}[\mathbf{t}]_{\times}\mathbf{p}^{\text{C}}_{\text{R}}=\mathbf{p}^{\text{C}\,\top}_{\text{R}}[\mathbf{t}]_{\times}(\mathbf{R}\mathbf{p}^{\text{C}}_{\text{L}}+\mathbf{t}). (16)

According to (3), (16) can be rewritten as follows:

-\mathbf{p}^{\text{C}\,\top}_{\text{R}}[\mathbf{p}^{\text{C}}_{\text{R}}]_{\times}\mathbf{t}=\mathbf{p}^{\text{C}\,\top}_{\text{R}}[\mathbf{t}]_{\times}\mathbf{R}\mathbf{p}^{\text{C}}_{\text{L}}+\mathbf{p}^{\text{C}\,\top}_{\text{R}}[\mathbf{t}]_{\times}\mathbf{t}. (17)

Applying (2) to (17) yields:

\mathbf{p}^{\text{C}\,\top}_{\text{R}}[\mathbf{t}]_{\times}\mathbf{R}\mathbf{p}^{\text{C}}_{\text{L}}=\mathbf{p}^{\text{C}\,\top}_{\text{R}}\mathbf{E}\mathbf{p}^{\text{C}}_{\text{L}}=0, (18)

where the essential matrix \mathbf{E} is defined by:

\mathbf{E}=[\mathbf{t}]_{\times}\mathbf{R}. (19)

Plugging (15) into (18) results in:

\hat{\mathbf{p}}^{\text{C}\,\top}_{\text{R}}\mathbf{E}\hat{\mathbf{p}}^{\text{C}}_{\text{L}}=0, (20)

which depicts the relationship between each pair of normalized points \hat{\mathbf{p}}^{\text{C}}_{\text{R}} and \hat{\mathbf{p}}^{\text{C}}_{\text{L}} lying on the same epipolar plane. It is important to note here that \mathbf{E} has five degrees of freedom: both \mathbf{R} and \mathbf{t} have three degrees of freedom each, but the overall scale ambiguity reduces the number of degrees of freedom by one hartley2003multiple . Hence, in theory, \mathbf{E} can be estimated with at least five pairs of \mathbf{p}^{\text{C}}_{\text{L}} and \mathbf{p}^{\text{C}}_{\text{R}}. However, due to the non-linearity of \mathbf{E}, its estimation using five pairs of correspondences is often intractable. Therefore, \mathbf{E} is commonly estimated with at least eight pairs of \mathbf{p}^{\text{C}}_{\text{L}} and \mathbf{p}^{\text{C}}_{\text{R}} trucco1998introductory , as discussed in Sec. 4.2.

Fundamental Matrix

As introduced in Sec. 4.2, the essential matrix creates a link between a given pair of corresponding 3D points in the LCCS and RCCS. When the intrinsic matrices of the two cameras in a stereo rig are unknown, the relationship between each pair of corresponding 2D image pixels \mathbf{p}_{\text{L}}=[u_{\text{L}},v_{\text{L}}]^{\top} and \mathbf{p}_{\text{R}}=[u_{\text{R}},v_{\text{R}}]^{\top} can be established using the fundamental matrix \mathbf{F}\in\mathbb{R}^{3\times 3}. It can be considered as a generalization of \mathbf{E}, where the assumption of calibrated cameras is removed hartley2003multiple . Applying (15) to (20) yields:

\tilde{\mathbf{p}}_{\text{R}}^{\top}\mathbf{K}_{\text{R}}^{-\top}\mathbf{E}\mathbf{K}_{\text{L}}^{-1}\tilde{\mathbf{p}}_{\text{L}}=\tilde{\mathbf{p}}_{\text{R}}^{\top}\mathbf{F}\tilde{\mathbf{p}}_{\text{L}}=0, (21)

where the fundamental matrix \mathbf{F} is defined as:

\mathbf{F}=\mathbf{K}_{\text{R}}^{-\top}\mathbf{E}\mathbf{K}_{\text{L}}^{-1}. (22)

\mathbf{F} has seven degrees of freedom: a 3×3 homogeneous matrix has eight independent ratios, as there are nine entries but the common scaling is not significant; in addition, \mathbf{F} satisfies the constraint \text{det}(\mathbf{F})=0, which removes one further degree of freedom hartley2003multiple . The most commonly used algorithm to estimate \mathbf{E} and \mathbf{F} is the "eight-point algorithm" (EPA), originally introduced by Longuet-Higgins longuet1981computer and later made numerically stable through normalization by Hartley hartley1997defense . This algorithm is based on the scale invariance of \mathbf{E} and \mathbf{F}: \lambda_{\mathbf{E}}\mathbf{p}^{\text{C}\,\top}_{\text{R}}\mathbf{E}\mathbf{p}^{\text{C}}_{\text{L}}=0 and \lambda_{\mathbf{F}}\tilde{\mathbf{p}}_{\text{R}}^{\top}\mathbf{F}\tilde{\mathbf{p}}_{\text{L}}=0, where \lambda_{\mathbf{E}},\lambda_{\mathbf{F}}\neq 0. By setting one element in \mathbf{E} or \mathbf{F} to 1, eight unknown elements still need to be estimated. This can be done using at least eight correspondence pairs. If the intrinsic matrices \mathbf{K}_{\text{L}} and \mathbf{K}_{\text{R}} of the two cameras are known, the EPA only needs to be carried out once to estimate either \mathbf{E} or \mathbf{F}, because the other one can easily be worked out using (22).
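A possible way to estimate \mathbf{F} with the EPA and then recover \mathbf{E} via (22) is sketched below using OpenCV; the correspondences are synthesized by projecting random 3D points into two views, whereas in practice they would come from matched visual features (e.g., SIFT or ORB), and the intrinsic matrix is a placeholder:

```python
import cv2
import numpy as np

K = np.array([[700.0, 0.0, 320.0],
              [0.0, 700.0, 240.0],
              [0.0, 0.0, 1.0]])

# Synthesize correspondences by projecting random 3D points into two views.
rng = np.random.default_rng(0)
X = rng.uniform([-2.0, -2.0, 4.0], [2.0, 2.0, 10.0], size=(50, 3))
R = cv2.Rodrigues(np.array([0.0, 0.05, 0.0]))[0]     # small rotation between views
t = np.array([[-0.5], [0.0], [0.0]])                 # translation (baseline)

proj_L = (K @ X.T).T
pts_L = (proj_L[:, :2] / proj_L[:, 2:]).astype(np.float32)
proj_R = (K @ (R @ X.T + t)).T
pts_R = (proj_R[:, :2] / proj_R[:, 2:]).astype(np.float32)

# Eight-point estimation of the fundamental matrix, cf. (21).
F, _ = cv2.findFundamentalMat(pts_L, pts_R, cv2.FM_8POINT)

# With known intrinsics, (22) gives the essential matrix: E = K^T F K.
E = K.T @ F @ K
print(F)
print(E)
```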

Homography Matrix

An arbitrary 3D point \mathbf{p}^{\text{W}}=[x^{\text{W}},y^{\text{W}},z^{\text{W}}]^{\top} lying on a planar surface satisfies:

\mathbf{n}^{\top}\mathbf{p}^{\text{W}}+b=0, (23)

where \mathbf{n}=[n_{x},n_{y},n_{z}]^{\top} is the normal vector of the planar surface. Its corresponding pixels \mathbf{p}_{\text{L}}=[u_{\text{L}},v_{\text{L}}]^{\top} and \mathbf{p}_{\text{R}}=[u_{\text{R}},v_{\text{R}}]^{\top} in the left and right images, respectively, can be linked by a homography matrix \mathbf{H}\in\mathbb{R}^{3\times 3}. The expression of the planar surface can be rearranged as follows:

-\mathbf{n}^{\top}\mathbf{p}^{\text{W}}/b=1. (24)

Assuming that \mathbf{p}^{\text{C}}_{\text{L}}=\mathbf{p}^{\text{W}} and plugging (24) and (15) into (14) results in:

\mathbf{p}^{\text{C}}_{\text{R}}=\mathbf{R}\mathbf{p}^{\text{C}}_{\text{L}}-\frac{1}{b}\mathbf{t}\mathbf{n}^{\top}\mathbf{p}^{\text{C}}_{\text{L}}=\Big(\mathbf{R}-\frac{1}{b}\mathbf{t}\mathbf{n}^{\top}\Big)z^{\text{C}}_{\text{L}}\mathbf{K}_{\text{L}}^{-1}\tilde{\mathbf{p}}_{\text{L}}=z^{\text{C}}_{\text{R}}\mathbf{K}_{\text{R}}^{-1}\tilde{\mathbf{p}}_{\text{R}}. (25)

Therefore, \tilde{\mathbf{p}}_{\text{L}} and \tilde{\mathbf{p}}_{\text{R}} can be linked using:

\tilde{\mathbf{p}}_{\text{R}}=\frac{z^{\text{C}}_{\text{L}}}{z^{\text{C}}_{\text{R}}}\mathbf{K}_{\text{R}}\Big(\mathbf{R}-\frac{1}{b}\mathbf{t}\mathbf{n}^{\top}\Big)\mathbf{K}_{\text{L}}^{-1}\tilde{\mathbf{p}}_{\text{L}}=\mathbf{H}\tilde{\mathbf{p}}_{\text{L}}. (26)

The homography matrix \mathbf{H} is generally used to distinguish obstacles from a planar surface fan2018road . For a well-calibrated stereo vision system, \mathbf{R}, \mathbf{t}, \mathbf{K}_{\text{L}} as well as \mathbf{K}_{\text{R}} are already known, and z^{\text{C}}_{\text{L}} is typically equal to z^{\text{C}}_{\text{R}}. Thus, \mathbf{H} only relates to \mathbf{n} and b, and it can be estimated with at least four pairs of correspondences \mathbf{p}_{\text{L}} and \mathbf{p}_{\text{R}} fan2018road .
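The sketch below estimates \mathbf{H} from (at least four) coplanar correspondence pairs using OpenCV; the point pairs here are generated from an assumed ground-truth homography purely so that the snippet runs, whereas in practice they would be matched features lying on the planar surface (e.g., the road):

```python
import cv2
import numpy as np

# An assumed ground-truth homography, used only to synthesize correspondences.
H_true = np.array([[1.02, 0.01, 5.0],
                   [0.00, 0.98, -3.0],
                   [1e-5, 0.0, 1.0]])
pts_L = (np.random.rand(30, 2) * [640, 480]).astype(np.float32)
pts_h = np.hstack([pts_L, np.ones((30, 1), np.float32)]) @ H_true.T
pts_R = (pts_h[:, :2] / pts_h[:, 2:]).astype(np.float32)

# At least four correspondence pairs are needed to estimate H in (26);
# RANSAC makes the estimate robust to mismatched pairs.
H, inliers = cv2.findHomography(pts_L, pts_R, cv2.RANSAC, 3.0)
print(H / H[2, 2])
```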

4.3 Stereopsis

Stereo Rectification

3D scene geometry reconstruction with a pair of synchronized cameras is based on determining pairs of corresponding pixels between the left and right images. For an uncalibrated stereo rig, finding the correspondence pairs is a 2D search process (optical flow estimation), which is extremely computationally intensive. If the stereo rig is calibrated, the search can instead be performed in 1D along the epipolar lines. An image transformation process, referred to as stereo rectification, is therefore always performed beforehand to reduce the dimension of the correspondence pair search. Stereo rectification consists of four main steps trucco1998introductory :

Figure 9: Stereo rectification.

  1. Rotate the left camera by \mathbf{R}_{\text{rect}} so that the left image plane is parallel to the vector \mathbf{t};

  2. Apply the same rotation to the right camera to recover the original epipolar geometry;

  3. Rotate the right camera by \mathbf{R}^{-1};

  4. Adjust the left and right image scales by allocating an identical intrinsic matrix to both cameras.

After stereo rectification, the left and right images appear as if they were taken by a pair of parallel cameras with the same intrinsic parameters, as shown in Fig. 9, where \Pi_{\text{L}} and \Pi_{\text{R}} are the original image planes, and \Pi^{\prime}_{\text{L}} and \Pi^{\prime}_{\text{R}} are the rectified image planes. Also, each pair of conjugate epipolar lines becomes collinear and parallel to the horizontal image axis trucco1998introductory . Hence, determining the correspondence pairs is simplified to a 1D search problem.
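For a calibrated stereo rig, this four-step procedure is typically carried out with library routines rather than implemented by hand. A minimal OpenCV sketch is shown below; the intrinsic/extrinsic parameters are placeholders, and zero images stand in for real left/right captures:

```python
import cv2
import numpy as np

# Hypothetical calibration of a stereo rig: identical intrinsics, zero lens
# distortion and a small rotation/translation between the two cameras.
K = np.array([[700.0, 0.0, 320.0],
              [0.0, 700.0, 240.0],
              [0.0, 0.0, 1.0]])
dist = np.zeros(5)
R = cv2.Rodrigues(np.array([0.0, 0.01, 0.0]))[0]
T = np.array([[-0.12], [0.0], [0.0]])      # ~12 cm baseline
size = (640, 480)                          # (width, height)

# cv2.stereoRectify returns the rectifying rotations R1, R2 and the new
# projection matrices P1, P2 (plus the disparity-to-depth mapping Q).
R1, R2, P1, P2, Q, _, _ = cv2.stereoRectify(K, dist, K, dist, size, R, T)

map_Lx, map_Ly = cv2.initUndistortRectifyMap(K, dist, R1, P1, size, cv2.CV_32FC1)
map_Rx, map_Ry = cv2.initUndistortRectifyMap(K, dist, R2, P2, size, cv2.CV_32FC1)

# Zero images stand in for real left/right captures here.
img_L = np.zeros((480, 640, 3), np.uint8)
img_R = np.zeros((480, 640, 3), np.uint8)
rect_L = cv2.remap(img_L, map_Lx, map_Ly, cv2.INTER_LINEAR)
rect_R = cv2.remap(img_R, map_Rx, map_Ry, cv2.INTER_LINEAR)
```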

Stereo Vision System

A well-rectified stereo vision system is illustrated in Fig. 10,

Figure 10: Basic stereo vision system. \mathbf{p}^{\text{W}} can be transformed to \bar{\mathbf{p}}^{\text{C}}_{\text{L}} and \bar{\mathbf{p}}^{\text{C}}_{\text{R}} using (29).

which can be regarded as a special case of the epipolar geometry introduced in Sec. 4.2, where the left and right cameras are perfectly parallel to each other. The x_{\text{L}}^{\text{C}} and x_{\text{R}}^{\text{C}} axes are collinear. \mathbf{o}_{\text{L}}^{\text{C}} and \mathbf{o}_{\text{R}}^{\text{C}} are the left and right camera optical centers, respectively. The baseline of the stereo rig, T_{c}, is defined as the distance between \mathbf{o}_{\text{L}}^{\text{C}} and \mathbf{o}_{\text{R}}^{\text{C}}. The intrinsic matrices of the left and right cameras are given by:

\mathbf{K}_{\text{L}}=\mathbf{K}_{\text{R}}=\mathbf{K}=\begin{bmatrix}f&0&u_{o}\\ 0&f&v_{o}\\ 0&0&1\end{bmatrix}. (27)

Let \mathbf{p}^{\text{W}}=[x^{\text{W}},y^{\text{W}},z^{\text{W}}]^{\top} be a 3D point of interest in the WCS. Its representations in the LCCS and RCCS are \mathbf{p}_{\text{L}}^{\text{C}}=[x_{\text{L}}^{\text{C}},y_{\text{L}}^{\text{C}},z_{\text{L}}^{\text{C}}]^{\top} and \mathbf{p}_{\text{R}}^{\text{C}}=[x_{\text{R}}^{\text{C}},y_{\text{R}}^{\text{C}},z_{\text{R}}^{\text{C}}]^{\top}, respectively. Since the left and right cameras are considered to be exactly the same in a well-rectified stereo vision system, s_{x} and s_{y} in (9) are simply set to 1 and f_{x}=f_{y}=f. \mathbf{p}^{\text{W}} is projected on \Pi_{\text{L}} and \Pi_{\text{R}} at \bar{\mathbf{p}}_{\text{L}}=[x_{\text{L}},y_{\text{L}},f]^{\top} and \bar{\mathbf{p}}_{\text{R}}=[x_{\text{R}},y_{\text{R}},f]^{\top}, respectively. \mathbf{o}^{\text{W}}, the origin of the WCS, is at the center of the line segment L=\{t\mathbf{o}_{\text{L}}^{\text{C}}+(1-t)\mathbf{o}_{\text{R}}^{\text{C}}\ |\ t\in[0,1]\}. The z^{\text{W}} axis is parallel to the camera optical axes and perpendicular to \Pi_{\text{L}} and \Pi_{\text{R}}. Therefore, an arbitrary point \mathbf{p}^{\text{W}} in the WCS can be transformed to \mathbf{p}_{\text{L}}^{\text{C}} and \mathbf{p}_{\text{R}}^{\text{C}} using:

\mathbf{p}_{\text{L}}^{\text{C}}=\mathbf{I}\mathbf{p}^{\text{W}}+\mathbf{t}_{\text{L}},\ \ \ \ \mathbf{p}_{\text{R}}^{\text{C}}=\mathbf{I}\mathbf{p}^{\text{W}}+\mathbf{t}_{\text{R}}, (28)

where \mathbf{t}_{\text{L}}=[\frac{T_{c}}{2},0,0]^{\top} and \mathbf{t}_{\text{R}}=[-\frac{T_{c}}{2},0,0]^{\top}. Applying (11) and (15) to (28) results in the following expressions:

\begin{split}x_{\text{L}}=f\frac{x^{\text{W}}+T_{c}/2}{z^{\text{W}}},\ \ \ y_{\text{L}}=f\frac{y^{\text{W}}}{z^{\text{W}}},\\ x_{\text{R}}=f\frac{x^{\text{W}}-T_{c}/2}{z^{\text{W}}},\ \ \ y_{\text{R}}=f\frac{y^{\text{W}}}{z^{\text{W}}}.\end{split} (29)

Applying (29) to (9) yields the following expressions:

\mathbf{p}_{\text{L}}=\begin{bmatrix}u_{\text{L}}\\ v_{\text{L}}\end{bmatrix}=\begin{bmatrix}f\frac{x^{\text{W}}}{z^{\text{W}}}+u_{o}+f\frac{T_{c}}{2z^{\text{W}}}\\ f\frac{y^{\text{W}}}{z^{\text{W}}}+v_{o}\end{bmatrix},\ \ \ \mathbf{p}_{\text{R}}=\begin{bmatrix}u_{\text{R}}\\ v_{\text{R}}\end{bmatrix}=\begin{bmatrix}f\frac{x^{\text{W}}}{z^{\text{W}}}+u_{o}-f\frac{T_{c}}{2z^{\text{W}}}\\ f\frac{y^{\text{W}}}{z^{\text{W}}}+v_{o}\end{bmatrix}. (30)

The relationship between the so-called disparity d and the depth z^{\text{W}} is as follows fan2018real :

d=u_{\text{L}}-u_{\text{R}}=f\frac{T_{c}}{z^{\text{W}}}. (31)

It can be observed that d is inversely proportional to z^{\text{W}}. Therefore, for a distant 3D point \mathbf{p}^{\text{W}}, \mathbf{p}_{\text{L}} and \mathbf{p}_{\text{R}} are close to each other, while for a \mathbf{p}^{\text{W}} lying near the stereo camera rig, the position difference between \mathbf{p}_{\text{L}} and \mathbf{p}_{\text{R}} is large. Therefore, disparity estimation can be regarded as the task of 1) finding the correspondence pairs (\mathbf{p}_{\text{L}} and \mathbf{p}_{\text{R}}), which lie on the same image row in the left and right images, and 2) producing two disparity images \mathbf{D}_{\text{L}} and \mathbf{D}_{\text{R}}, as shown in Fig. 11.

Figure 11: (a) left image, (b) right image, (c) left disparity image \mathbf{D}_{\text{L}} and (d) right disparity image \mathbf{D}_{\text{R}}.
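Given a rectified rig, (30) and (31) can be inverted to turn a disparity image into a depth map and a 3D point cloud. The NumPy sketch below uses placeholder rig parameters and a random disparity image purely for illustration:

```python
import numpy as np

# Hypothetical rectified stereo rig: focal length f (pixels), baseline Tc (m),
# principal point (uo, vo), and a random left disparity image D_L.
f, Tc, uo, vo = 700.0, 0.12, 320.0, 240.0
D_L = np.random.uniform(5.0, 64.0, size=(480, 640))

# (31): depth z^W = f * Tc / d (inversely proportional to disparity).
z = f * Tc / D_L

# Back-projection of every left-image pixel to the WCS, rearranging (30):
# x^W = (u_L - uo) z^W / f - Tc / 2,   y^W = (v_L - vo) z^W / f.
v, u = np.mgrid[0:480, 0:640]
x = (u - uo) * z / f - Tc / 2.0
y = (v - vo) * z / f
point_cloud = np.stack([x, y, z], axis=-1)   # H x W x 3 array of 3D points
print(point_cloud.shape)
```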

Disparity Estimation

The two key aspects of computer stereo vision are speed and accuracy tippetts2016review . Over the past two decades, a lot of research has been carried out to improve disparity estimation accuracy while reducing computational complexity. However, the stereo vision algorithms designed to achieve better disparity accuracy typically have higher computational complexity fan2018real . Hence, speed and accuracy are two desirable but conflicting properties. It is very challenging to achieve both of them simultaneously tippetts2016review .

In general, the main motivation for designing a stereo vision algorithm is to improve the trade-off between speed and accuracy. In most circumstances, the desirable trade-off depends entirely on the target application tippetts2016review . For instance, real-time performance is required for stereo vision systems employed in autonomous driving, because subsequent systems, such as data-fusion semantic driving scene segmentation, usually take up only a small portion of the processing time and can easily be implemented in real time once the 3D information is available fan2018real . Although stereo vision execution time can definitely be reduced with future HW advances, algorithm and SW improvements are also very important tippetts2016review .

State-of-the-art stereo vision algorithms can be classified as either computer vision-based or machine/deep learning-based. The former typically formulates disparity estimation as a local block matching problem or a global energy minimization problem fan2018road , while the latter basically considers disparity estimation as a regression problem luo2016efficient .

Computer vision-based stereo vision algorithms

Computer vision-based disparity estimation algorithms are categorized as: 1) local, 2) global and 3) semi-global scharstein2003high . Local algorithms simply select an image block from the left image and match it with a series of image blocks selected from the right image. Optimal disparity estimation corresponds to either the lowest difference costs or the highest correlation costs. Global algorithms translate disparity estimation into a probability maximization problem or an energy minimization problem mozerov2015accurate , which can be solved using Markov random field (MRF)-based optimization algorithms tappen2003comparison . Semi-global matching (SGM) hirschmuller2007stereo approximates MRF inferences by performing cost aggregation along all image directions, which greatly improves both disparity estimation accuracy and efficiency. Generally, a computer vision-based disparity estimation algorithm consists of four main steps: 1) cost computation, 2) cost aggregation, 3) disparity optimization and 4) disparity refinement vzbontar2016stereo .

1. Cost computation

Disparity d is a random variable with N possible discrete states, each of them being associated with a matching cost c. The two most commonly used pixel-wise matching costs are the absolute difference (AD) cost c_{\text{AD}} and the squared difference (SD) cost c_{\text{SD}} vzbontar2016stereo . Since the left and right images are typically in gray-scale format, c_{\text{AD}} and c_{\text{SD}} can be computed using hirschmuller2008evaluation :

\begin{split}c_{\text{AD}}(\mathbf{p},d)&=\big|i_{\text{L}}(\mathbf{p})-i_{\text{R}}(\mathbf{p}-\mathbf{d})\big|,\\ c_{\text{SD}}(\mathbf{p},d)&=\big(i_{\text{L}}(\mathbf{p})-i_{\text{R}}(\mathbf{p}-\mathbf{d})\big)^{2},\end{split} (32)

where \mathbf{d}=[d,0]^{\top}, i_{\text{L}}(\mathbf{p}) denotes the intensity of pixel \mathbf{p}=[u,v]^{\top} in the left image and i_{\text{R}}(\mathbf{p}-\mathbf{d}) represents the intensity of pixel \mathbf{p}-\mathbf{d}=[u-d,v]^{\top} in the right image.
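A minimal sketch of computing a full AD cost volume according to (32) is given below; the random arrays stand in for a rectified gray-scale stereo pair, and pixels whose candidate match would fall outside the right image are assigned an infinite cost:

```python
import numpy as np

# Random arrays stand in for rectified gray-scale left/right images.
i_L = np.random.randint(0, 256, (48, 64)).astype(np.float32)
i_R = np.random.randint(0, 256, (48, 64)).astype(np.float32)
d_max = 16                                     # assumed disparity search range

H, W = i_L.shape
cost_volume = np.full((d_max, H, W), np.inf, dtype=np.float32)
for d in range(d_max):
    # c_AD(p, d) = |i_L(u, v) - i_R(u - d, v)| for every pixel p = [u, v]^T
    cost_volume[d, :, d:] = np.abs(i_L[:, d:] - i_R[:, :W - d])
```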

2. Cost aggregation

In order to minimize incorrect matches, pixel-wise difference costs are often aggregated over all pixels within a support region scharstein2003high :

c_{\text{agg}}(\mathbf{p},d)=w(\mathbf{p},d)*C(\mathbf{p},d), (33)

where the center of the support region is at \mathbf{p}=[u,v]^{\top} and the corresponding disparity is d. c_{\text{agg}} denotes the aggregated cost, w is a kernel that represents the support region, and C represents a neighborhood system containing the pixel-wise matching costs of all pixels within the support region. c_{\text{agg}} can be obtained by performing a convolution between w and C. A large support region can help reduce disparity optimization uncertainties, but it also increases the algorithm execution time significantly.

Since the support regions are always rectangular blocks, these algorithms are also known as stereo block matching fan2018road . When the convolution is a uniform box filtering (all the elements in w are 1), the aggregations of c_{\text{AD}} and c_{\text{SD}} are referred to as the sum of absolute differences (SAD) and the sum of squared differences (SSD), respectively, which can be written as fan2018real :

\begin{split}c_{\text{SAD}}(\mathbf{p},d)&=\sum\limits_{\mathbf{q}\in\mathscr{N}_{\mathbf{p}}}\big|i_{\text{L}}(\mathbf{q})-i_{\text{R}}(\mathbf{q}-\mathbf{d})\big|,\\ c_{\text{SSD}}(\mathbf{p},d)&=\sum\limits_{\mathbf{q}\in\mathscr{N}_{\mathbf{p}}}\big(i_{\text{L}}(\mathbf{q})-i_{\text{R}}(\mathbf{q}-\mathbf{d})\big)^{2},\end{split} (34)

where \mathscr{N}_{\mathbf{p}} is the support region (or neighborhood system) of \mathbf{p}. Although the SAD and the SSD are computationally efficient, they are very sensitive to image intensity noise. In this regard, some other cost or similarity functions, such as the normalized cross-correlation (NCC), are more prevalently used for cost computation and aggregation. The cost function of the NCC is as follows fan2018road :

c_{\text{NCC}}(\mathbf{p},d)=\frac{1}{n\sigma_{\text{L}}\sigma_{\text{R}}}\sum\limits_{\mathbf{q}\in\mathscr{N}_{\mathbf{p}}}\Big(i_{\text{L}}(\mathbf{q})-\mu_{\text{L}}\Big)\Big(i_{\text{R}}(\mathbf{q}-\mathbf{d})-\mu_{\text{R}}\Big), (35)

where

\sigma_{\text{L}}=\sqrt{\sum\limits_{\mathbf{q}\in\mathscr{N}_{\mathbf{p}}}\Big(i_{\text{L}}(\mathbf{q})-\mu_{\text{L}}\Big)^{2}/n},\ \ \ \ \sigma_{\text{R}}=\sqrt{\sum\limits_{\mathbf{q}\in\mathscr{N}_{\mathbf{p}}}\Big(i_{\text{R}}(\mathbf{q}-\mathbf{d})-\mu_{\text{R}}\Big)^{2}/n}, (36)

\mu_{\text{L}} and \mu_{\text{R}} represent the means of the pixel intensities within the left and right image blocks, respectively, \sigma_{\text{L}} and \sigma_{\text{R}} denote the corresponding standard deviations, and n represents the number of pixels within each image block. The NCC cost c_{\text{NCC}}\in[-1,1] reflects the similarity between the given pair of left and right image blocks; a higher c_{\text{NCC}} corresponds to a better block match.

In addition to the cost aggregation via uniform box filtering, many adaptive cost aggregation strategies have been proposed to improve disparity accuracy. One of the most famous algorithms is fast bilateral stereo (FBS) yang2008stereo ; fan2018fbs , which uses a bilateral filter to aggregate the matching costs adaptively. A general expression of cost aggregation in FBS is as follows fan2018ist :

c_{\text{agg}}(\mathbf{p},d)=\frac{\sum_{\mathbf{q}\in\mathscr{N}_{\mathbf{p}}}\omega_{d}(\mathbf{q})\omega_{r}(\mathbf{q})c(\mathbf{q},d)}{\sum_{\mathbf{q}\in\mathscr{N}_{\mathbf{p}}}\omega_{d}(\mathbf{q})\omega_{r}(\mathbf{q})}, (37)

where the weighting functions \omega_{d} and \omega_{r} are based on spatial distance and color similarity, respectively fan2018fbs . The costs c within a rectangular block are aggregated adaptively to produce c_{\text{agg}}.

3. Disparity Optimization

Local algorithms simply select the disparities that correspond to the lowest difference costs or the highest correlation costs as the best disparities, in a Winner-Take-All (WTA) manner.
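The sketch below combines box-filter (SAD-style) cost aggregation, cf. (34), with WTA disparity selection; the random cost volume stands in for the AD costs computed earlier, and SciPy's uniform_filter is used only as a convenient box filter:

```python
import numpy as np
from scipy.ndimage import uniform_filter

# A random (d_max, H, W) cost volume stands in for the AD costs of (32).
cost_volume = np.random.rand(16, 48, 64).astype(np.float32)

# Box-filter aggregation over a 5x5 support region for every disparity slice,
# i.e. the SAD aggregation of (34) up to a constant factor.
aggregated = uniform_filter(cost_volume, size=(1, 5, 5))

# WTA: for every pixel, pick the disparity with the lowest aggregated cost.
D_L = np.argmin(aggregated, axis=0)
print(D_L.shape)
```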

Unlike the WTA strategy applied in local algorithms, matching costs from neighboring pixels are also taken into account in global algorithms, e.g., graph cuts (GC) boykov2001fast and belief propagation (BP) ihler2005loopy . The MRF is a commonly used graphical model in such algorithms. An example of the MRF model is depicted in Fig. 12. The graph \mathscr{G}=(\mathscr{P},\mathscr{E}) is a set of vertices \mathscr{P} connected by edges \mathscr{E}, where \mathscr{P}=\{\mathbf{p}_{11},\mathbf{p}_{12},\cdots,\mathbf{p}_{mn}\} and \mathscr{E}=\{(\mathbf{p}_{ij},\mathbf{p}_{st})\ |\ \mathbf{p}_{ij},\mathbf{p}_{st}\in\mathscr{P}\}. Two edges sharing one common vertex are called a pair of adjacent edges blake2011markov . Since the MRF is considered to be undirected, (\mathbf{p}_{ij},\mathbf{p}_{st}) and (\mathbf{p}_{st},\mathbf{p}_{ij}) refer to the same edge. \mathscr{N}_{ij}=\{\mathbf{q}_{1_{\mathbf{p}_{ij}}},\mathbf{q}_{2_{\mathbf{p}_{ij}}},\cdots,\mathbf{q}_{k_{\mathbf{p}_{ij}}}\ |\ \mathbf{q}_{\mathbf{p}_{ij}}\in\mathscr{P}\} is the neighborhood system of \mathbf{p}_{ij}.

Figure 12: MRF model.

For stereo vision problems, \mathscr{P} is an m\times n pixel disparity image and \mathbf{p}_{ij} is a graph vertex (or node) at the site (i,j) with a disparity node value d_{ij}. Because considering more candidates usually makes true disparity inference intractable, only the neighbors adjacent to \mathbf{p}_{ij} are considered for stereo matching tappen2003comparison , in a pairwise MRF fashion: the disparity of node \mathbf{p}_{ij} tends to have a strong correlation with its vicinity, while it is only implicitly linked to the other random nodes in the disparity map. The joint MRF probability can be written as tappen2003comparison :

P(\mathbf{p},q)=\prod_{\mathbf{p}_{ij}\in\mathscr{P}}\Phi(\mathbf{p}_{ij},q_{\mathbf{p}_{ij}})\prod_{\mathbf{q}_{\mathbf{p}_{ij}}\in\mathscr{N}_{ij}}\Psi(\mathbf{p}_{ij},\mathbf{q}_{\mathbf{p}_{ij}}), (38)

where q_{\mathbf{p}_{ij}} represents image intensity differences, \Phi(\cdot) expresses the compatibility between possible disparities and the corresponding image intensity differences, while \Psi(\cdot) expresses the compatibility between \mathbf{p}_{ij} and its neighborhood system. Now, the aim of finding the best disparity is equivalent to maximizing P(\mathbf{p},q) in (38), by formulating it as an energy function fan2019cvprw :

E(\mathbf{p})=\sum_{\mathbf{p}_{ij}\in\mathscr{P}}D(\mathbf{p}_{ij},q_{\mathbf{p}_{ij}})+\sum_{\mathbf{q}_{\mathbf{p}_{ij}}\in\mathscr{N}_{ij}}V(\mathbf{p}_{ij},\mathbf{q}_{\mathbf{p}_{ij}}), (39)

where D(\cdot) and V(\cdot) are two energy functions: D(\cdot) corresponds to the matching cost and V(\cdot) determines the aggregation from the neighbors. In the MRF model, the method used to formulate an adaptive V(\cdot) is important, because the image intensity in discontinuous areas usually varies greatly from that of the neighbors fan2018real . However, minimizing (39) has a high computational complexity, rendering real-time performance challenging. Therefore, SGM hirschmuller2007stereo breaks (39) down into:

E(\mathbf{D})=\sum_{\mathbf{p}}\bigg(c(\mathbf{p},d_{\mathbf{p}})+\sum_{\mathbf{q}\in\mathscr{N}_{\mathbf{p}}}\lambda_{1}\delta(|d_{\mathbf{p}}-d_{\mathbf{q}}|=1)+\sum_{\mathbf{q}\in\mathscr{N}_{\mathbf{p}}}\lambda_{2}\delta(|d_{\mathbf{p}}-d_{\mathbf{q}}|>1)\bigg), (40)

where \mathbf{D} is the disparity image, c is the matching cost, and \mathbf{q} is a pixel in the neighborhood system \mathscr{N}_{\mathbf{p}} of \mathbf{p}. \lambda_{1} penalizes neighboring pixels with small disparity differences, i.e., one pixel; \lambda_{2} penalizes neighboring pixels with large disparity differences, i.e., larger than one pixel. \delta(\cdot) returns 1 if its argument is true and 0 otherwise.
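OpenCV ships a practical semi-global block matching (SGBM) implementation in which the parameters P1 and P2 play the role of the penalties \lambda_{1} and \lambda_{2} in (40). A minimal usage sketch (with random arrays standing in for a rectified gray-scale stereo pair, and parameter values chosen only for illustration) is given below:

```python
import cv2
import numpy as np

# Random arrays stand in for a rectified gray-scale stereo pair here.
img_L = np.random.randint(0, 256, (240, 320), dtype=np.uint8)
img_R = np.random.randint(0, 256, (240, 320), dtype=np.uint8)

block = 5
sgbm = cv2.StereoSGBM_create(
    minDisparity=0,
    numDisparities=64,            # disparity search range, must be divisible by 16
    blockSize=block,
    P1=8 * block * block,         # penalty for small disparity changes (role of lambda_1)
    P2=32 * block * block,        # penalty for large disparity changes (role of lambda_2)
    uniquenessRatio=10)

# compute() returns fixed-point disparities scaled by 16.
D_L = sgbm.compute(img_L, img_R).astype(np.float32) / 16.0
```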

4. Disparity Refinement

Disparity refinement usually involves several post-processing steps, such as the left-and-right disparity consistency check (LRDCC), subpixel enhancement and weighted median filtering scharstein2002taxonomy . The LRDCC can remove most of the occluded areas, which are visible in only one of the left/right images fan2018road . In addition, a disparity error larger than one pixel may result in a non-negligible 3D geometry reconstruction error fan2018road . Subpixel enhancement therefore provides an easy way to increase disparity image resolution by simply interpolating the matching costs around the initial disparity scharstein2002taxonomy . Moreover, a median filter can be applied to the disparity image to fill holes and remove incorrect matches scharstein2002taxonomy . However, these disparity refinement steps are not always necessary, and their sequential use depends entirely on the chosen algorithm and application needs.
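A simple LRDCC can be sketched as follows: each left-image disparity is checked against the right-image disparity at the matching location, and inconsistent (typically occluded) pixels are invalidated. The threshold value and the random input disparity images are placeholders:

```python
import numpy as np

def lr_consistency_check(D_L, D_R, threshold=1.0):
    """Invalidate left-image disparities that fail the LRDCC.

    D_L(u, v) should (approximately) equal D_R(u - D_L(u, v), v); pixels that
    violate this, typically occluded ones, are marked as invalid (-1).
    """
    H, W = D_L.shape
    v, u = np.mgrid[0:H, 0:W]
    u_R = np.clip(np.round(u - D_L).astype(int), 0, W - 1)
    inconsistent = np.abs(D_L - D_R[v, u_R]) > threshold
    refined = D_L.astype(np.float32).copy()
    refined[inconsistent] = -1.0
    return refined

# Toy example with random disparity images; real ones would come from stereo matching.
D_L = np.random.uniform(0.0, 64.0, (240, 320))
D_R = np.random.uniform(0.0, 64.0, (240, 320))
print(lr_consistency_check(D_L, D_R).shape)
```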

Machine/deep learning-based stereo vision algorithms

With recent advances in machine/deep learning, CNNs have been prevalently used for disparity estimation. For instance, Žbontar and LeCun zbontar2015computing utilized a CNN to compute patch-wise similarity scores, as shown in Fig. 13. It consists of a convolutional layer L_{1} and seven fully-connected layers L_{2}-L_{8}. The inputs to this CNN are two 9×9-pixel gray-scale image patches.

Figure 13: The architecture of the CNN proposed in zbontar2015computing for stereo matching.

L_{1} consists of 32 convolution kernels of size 5×5×1. L_{2} and L_{3} have 200 neurons each. After L_{3}, the two 200-dimensional feature vectors are concatenated into a 400-dimensional vector and passed through layers L_{4}-L_{7}. Layer L_{8} maps the L_{7} output into two real numbers, which are then fed through a softmax function to produce a distribution over two classes: a) good match and b) bad match. Finally, computer vision-based cost aggregation and disparity optimization/refinement techniques are utilized to produce the final disparity images. Although this method has achieved state-of-the-art accuracy, it is limited by the employed matching cost aggregation technique and can produce wrong predictions in occluded or texture-less/reflective regions zhang2019GANet .

In this regard, some researchers have leveraged CNNs to improve the computer vision-based cost aggregation step. SGM-Nets seki2017sgm is one of the most well-known methods of this type. Its main contribution is a CNN-based technique for predicting the SGM penalty parameters \lambda_{1} and \lambda_{2} in (40) hirschmuller2007stereo , as illustrated in Fig. 14. A 5×5-pixel gray-scale image patch and its normalized position are used as the CNN inputs. The network has a) two convolution layers, each followed by a rectified linear unit (ReLU) layer; b) a concatenation layer for merging the two types of inputs; c) two fully connected (FC) layers of size 128 each, followed by a ReLU layer and an exponential linear unit (ELU); and d) a constant layer to keep the SGM penalty values positive. The costs can then be accumulated along four directions. The CNN output values correspond to standard parameterization.

Refer to caption
Figure 14: SGM-Nets seki2017sgm architecture.

Recently, end-to-end deep CNNs have become very popular. For example, Mayer et al. mayer2016large created three large synthetic datasets (FlyingThings3D, Driving and Monkaa; available at lmb.informatik.uni-freiburg.de/resources/datasets/SceneFlowDatasets.en.html) and proposed a CNN named DispNet for dense disparity estimation. Later on, Pang et al. pang2017cascade proposed a two-stage cascade CNN for disparity estimation. Its first stage enhances DispNet mayer2016large by equipping it with extra up-convolution modules, and its second stage rectifies the disparity initialized by the first stage and generates residual signals across multiple scales. Furthermore, GCNet kendall2017end incorporated feature extraction (cost computation), cost aggregation and disparity optimization/refinement into a single end-to-end CNN model, and it achieved state-of-the-art accuracy on the FlyingThings3D benchmark mayer2016large as well as the KITTI stereo 2012 and 2015 benchmarks geiger2012we ; menze2015joint ; menze2018object . In 2018, Chang et al. chang2018pyramid proposed the Pyramid Stereo Matching Network (PSMNet), consisting of two modules: a) spatial pyramid pooling and b) a 3D CNN. The former aggregates context at different scales and locations, while the latter regularizes the cost volume. Unlike PSMNet chang2018pyramid, the guided aggregation network (GANet) zhang2019GANet replaces the widely used 3D CNN with two novel layers: a semi-global aggregation layer and a local guided aggregation layer, which significantly reduce memory and computational cost.

Although the aforementioned CNN-based disparity estimation methods have achieved compelling results, they usually have a huge number of learnable parameters, resulting in long processing times. Therefore, current state-of-the-art CNN-based disparity estimation algorithms have rarely been deployed in practical autonomous driving systems. We believe these methods will find their way into more real-world applications as embedded computing HW continues to advance.

Performance Evaluation

As discussed above, disparity estimation speed and accuracy are two key but conflicting properties. Therefore, the performance evaluation of a given stereo vision algorithm usually considers both of them tippetts2016review.

The following two metrics are commonly used to evaluate the accuracy of an estimated disparity image barron1994performance :

  1. Root mean squared (RMS) error $e_{\text{RMS}}$:

     e_{\text{RMS}}=\sqrt{\frac{1}{N}\sum_{\mathbf{p}\in\mathscr{P}}|\mathbf{D}_{\text{E}}(\mathbf{p})-\mathbf{D}_{\text{G}}(\mathbf{p})|^{2}},   (41)

  2. Percentage of error pixels (PEP) $e_{\text{PEP}}$ (tolerance: $\delta_{d}$ pixels):

     e_{\text{PEP}}=\frac{1}{N}\sum_{\mathbf{p}\in\mathscr{P}}\delta\big(|\mathbf{D}_{\text{E}}(\mathbf{p})-\mathbf{D}_{\text{G}}(\mathbf{p})|>\delta_{d}\big)\times 100\%,   (42)

where $\mathbf{D}_{\text{E}}$ and $\mathbf{D}_{\text{G}}$ represent the estimated and ground-truth disparity images, respectively; $\mathscr{P}$ denotes the set of pixels used for evaluation; $N$ denotes the total number of disparities used for evaluation; and $\delta_{d}$ represents the disparity evaluation tolerance.
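A minimal C++ sketch of these two metrics is given below (our illustration; it assumes row-major disparity images stored as flat vectors, with pixels lacking ground truth encoded as negative values and excluded from the $N$ evaluated disparities):

#include <cmath>
#include <cstddef>
#include <vector>

// e_RMS of (41) and e_PEP of (42), computed over the pixels that have valid
// ground truth (assumed here to be encoded as non-negative disparities).
struct DisparityErrors { double rms; double pep; };

DisparityErrors evaluateDisparity(const std::vector<float>& estimated,
                                  const std::vector<float>& groundTruth,
                                  float tolerance /* delta_d in pixels */) {
  double sumSq = 0.0;
  std::size_t bad = 0, n = 0;
  for (std::size_t i = 0; i < estimated.size(); ++i) {
    if (groundTruth[i] < 0.0f) continue;           // no ground truth at this pixel
    double e = std::fabs(estimated[i] - groundTruth[i]);
    sumSq += e * e;
    if (e > tolerance) ++bad;
    ++n;
  }
  if (n == 0) return {0.0, 0.0};
  return { std::sqrt(sumSq / n),                   // e_RMS
           100.0 * static_cast<double>(bad) / n }; // e_PEP (percentage)
}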

Additionally, a common way to characterize the efficiency of a disparity estimation algorithm is the number of millions of disparity evaluations per second (Mde/s) tippetts2016review, computed as follows:

\text{Mde/s}=\frac{u_{\text{max}}v_{\text{max}}d_{\text{max}}}{t}\times 10^{-6},   (43)

where $u_{\text{max}}$ and $v_{\text{max}}$ denote the image width and height (in pixels), $d_{\text{max}}$ is the disparity search range, and $t$ is the processing time in seconds.

However, the speed of a disparity estimation algorithm typically varies across platforms, and it can be greatly boosted by exploiting parallel computing architectures.
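For example, the throughput metric in (43) can be computed from a measured runtime as in the short C++ sketch below (an illustration only; the image resolution and disparity range are example values typical of the KITTI benchmarks):

// Millions of disparity evaluations per second, as defined in (43).
double mdePerSecond(int uMax, int vMax, int dMax, double seconds) {
  return static_cast<double>(uMax) * vMax * dMax / seconds * 1e-6;
}

// Example: a 1242 x 375 image with a 128-pixel disparity range processed in
// 0.1 s corresponds to mdePerSecond(1242, 375, 128, 0.1) ~= 596 Mde/s.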

5 Heterogeneous Computing

Heterogeneous computing systems use multiple types of processors or cores. In the past, heterogeneous computing meant that processors with different instruction-set architectures (ISAs) had to be programmed separately, whereas modern heterogeneous system architecture (HSA) systems allow users to utilize multiple processor types within a unified programming framework. As illustrated in Fig. 15,

Refer to caption
Figure 15: Heterogeneous system architecture.

a typical HSA system consists of two different types of processors: 1) a multi-threading central processing unit (CPU) and 2) a graphics processing unit (GPU) mittal2015survey, which are connected by a peripheral component interconnect express (PCIe) bus. The CPU’s memory management unit (MMU) and the GPU’s input/output memory management unit (IOMMU) comply with the HSA HW specifications. The CPU runs the operating system and performs traditional serial computing tasks, while the GPU performs highly parallel workloads, such as 3D graphics rendering and CNN training.

5.1 Multi-Threading CPU

Open Multi-Processing (OpenMP) is an application programming interface (API) typically used to break serial code into independent chunks for parallel processing. It supports multi-platform shared-memory multiprocessing programming in C/C++ and Fortran jin2011high. An explicit parallelism programming model, typically known as the fork-join model, is illustrated in Fig. 16,

Refer to caption
Figure 16: Serial processing vs. parallel processing.

where compiler directives mark a section of the serial code to be run in parallel fan2019icvs. The master thread (serial execution on one core) forks a number of slave threads, and the tasks are divided amongst the slave threads, which run in parallel on multiple cores. Synchronization waits until all slave threads have finished their allocated tasks fan2016ist. Finally, the slave threads join together at a subsequent point and sequential execution resumes.
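The fork-join model can be illustrated with the following minimal OpenMP/C++ sketch (our example, compiled with the -fopenmp flag; the cost-volume layout and the toy row-wise aggregation are assumptions): the parallel-for directive forks a team of threads, each thread processes a disjoint chunk of image rows, and the threads join at the end of the loop.

#include <cstddef>
#include <omp.h>
#include <vector>

// Toy row-wise cost aggregation: each image row is independent, so the rows
// are divided amongst the threads forked by the master thread.
void aggregateCostsRowwise(std::vector<float>& costVolume,
                           int height, int width, int dMax) {
  #pragma omp parallel for schedule(static)   // fork: one chunk of rows per thread
  for (int v = 0; v < height; ++v) {
    for (int u = 1; u < width; ++u) {
      for (int d = 0; d < dMax; ++d) {
        std::size_t cur  = (static_cast<std::size_t>(v) * width + u) * dMax + d;
        std::size_t prev = (static_cast<std::size_t>(v) * width + u - 1) * dMax + d;
        costVolume[cur] += costVolume[prev];  // accumulate along the row
      }
    }
  }                                           // implicit join: all threads synchronize here
}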

5.2 GPU

GPUs have been extensively used in computer vision and deep learning to accelerate computationally intensive but highly parallelizable workloads, such as image processing and CNN training. Compared with a CPU, which consists of a small number of cores optimized for sequential serial processing, a GPU has a highly parallel architecture composed of hundreds or thousands of lightweight cores designed to handle multiple tasks concurrently.

A typical GPU architecture is shown in Fig. 17. It consists of N streaming multiprocessors (SMs), each containing M streaming processors (SPs). The single instruction multiple data (SIMD) architecture allows the SPs on the same SM to execute the same instruction on different data at each clock cycle. The device has its own dynamic random access memory (DRAM), which consists of global memory, constant memory and texture memory. The DRAM can communicate with the host memory via the graphics/memory controller hub (GMCH) and the I/O controller hub (ICH), which are also known as the Intel northbridge and the Intel southbridge, respectively. Each SM has four types of on-chip memory: registers, shared memory, constant cache and texture cache. Being on-chip, the constant cache and texture cache are utilized to speed up data fetching from the constant memory and texture memory, respectively. Since the shared memory is small, it is only used for the duration of processing a thread block, while registers are visible only to individual threads. The characteristics of the different types of GPU memory are summarized in Table 1.

Refer to caption
Figure 17: GPU architecture fan2018fbs .
Table 1: GPU memory comparison fan2017ist .
Memory Location Cached Access Scope
register on-chip n/a r/w one thread
shared on-chip n/a r/w all threads in a block
global off-chip no r/w all threads + host
constant off-chip yes r all threads + host
texture off-chip yes r all threads + host

In CUDA C programming, threads are grouped into a set of 3D thread blocks, which are then organized into a 3D grid. Kernels are defined on the host using the CUDA C programming language, and the host then issues the commands that submit these kernels to the devices for execution. Typically, only one kernel is executed at a given time. Once a thread block is distributed to an SM, its threads are divided into groups of 32 parallel threads, known as warps, which are executed by the SPs. Therefore, the size of a thread block is usually chosen as a multiple of 32 to ensure efficient data processing.
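The following minimal CUDA C++ sketch (our illustration, not from the cited works) shows how this execution model is typically expressed: a kernel in which each thread processes one pixel, launched with a block size that is a multiple of 32 so that blocks map onto whole warps, and a grid large enough to cover the whole image.

#include <cuda_runtime.h>

// Toy per-pixel kernel: each thread negates one disparity value.
__global__ void negateDisparity(float* disp, int width, int height) {
  int u = blockIdx.x * blockDim.x + threadIdx.x;   // pixel column
  int v = blockIdx.y * blockDim.y + threadIdx.y;   // pixel row
  if (u < width && v < height)
    disp[v * width + u] = -disp[v * width + u];
}

// Host-side launch: 32 x 8 = 256 threads per block (8 warps of 32 threads).
void launchNegateDisparity(float* d_disp, int width, int height) {
  dim3 block(32, 8);
  dim3 grid((width  + block.x - 1) / block.x,      // ceil(width / 32)
            (height + block.y - 1) / block.y);     // ceil(height / 8)
  negateDisparity<<<grid, block>>>(d_disp, width, height);
  cudaDeviceSynchronize();                         // host waits for the kernel to complete
}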

6 Summary

In this chapter, we first introduced the autonomous car system, from both the HW aspect (car sensors and car chassis) and the SW aspect (perception, localization and mapping, prediction and planning, and control). In particular, we introduced the autonomous car perception module, which has four main functionalities: 1) visual feature detection, description and matching, 2) 3D information acquisition, 3) object detection/recognition and 4) semantic image segmentation. We then provided readers with the preliminaries of epipolar geometry and introduced computer stereo vision from theory to algorithms. Finally, heterogeneous computing architectures, consisting of a multi-threading CPU and a GPU, were introduced.

Acknowledgments

This chapter has received partial funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 871479 (AERIALCORE).

References

  • (1) R. Fan, J. Jiao, H. Ye, Y. Yu, I. Pitas, and M. Liu, “Key ingredients of self-driving cars,” 2019.
  • (2) I. Pitas, Digital image processing algorithms and applications.   John Wiley & Sons, 2000.
  • (3) NOAA, “Lidar–light detection and ranging–is a remote sensing method used to examine the surface of the earth,” 2013.
  • (4) T. Bureau, “Radar definition,” Public Works and Government Services Canada, 2013.
  • (5) W. J. Westerveld, “Silicon photonic micro-ring resonators to sense strain and ultrasound,” 2014.
  • (6) L. Zheng, Y. Zhu, B. Xue, M. Liu, and R. Fan, “Low-cost gps-aided lidar state estimation and map building,” in 2019 IEEE International Conference on Imaging Systems and Techniques (IST).   IEEE, 2019, pp. 1–6.
  • (7) N. Samama, Global positioning: Technologies and performance.   John Wiley & Sons, 2008, vol. 7.
  • (8) S. Liu, “Chassis technologies for autonomous robots and vehicles,” 2019.
  • (9) M. U. M. Bhutta, M. Kuse, R. Fan, Y. Liu, and M. Liu, “Loop-box: Multiagent direct slam triggered by single loop closure for large-scale mapping,” IEEE transactions on cybernetics, 2020.
  • (10) R. C. Smith and P. Cheeseman, “On the representation and estimation of spatial uncertainty,” The international journal of Robotics Research, vol. 5, no. 4, pp. 56–68, 1986.
  • (11) C. Katrakazas, M. Quddus, W.-H. Chen, and L. Deka, “Real-time motion planning methods for autonomous on-road driving: State-of-the-art and future research directions,” Transportation Research Part C: Emerging Technologies, vol. 60, pp. 416–442, 2015.
  • (12) T. H. Cormen, “Section 24.3: Dijkstra’s algorithm,” Introduction to algorithms, pp. 595–601, 2001.
  • (13) D. Delling, P. Sanders, D. Schultes, and D. Wagner, “Engineering route planning algorithms,” in Algorithmics of large and complex networks.   Springer, 2009, pp. 117–139.
  • (14) M. Willis, “Proportional-integral-derivative control,” Dept. of Chemical and Process Engineering University of Newcastle, 1999.
  • (15) G. C. Goodwin, S. F. Graebe, M. E. Salgado et al., Control system design.   Upper Saddle River, NJ: Prentice Hall,, 2001.
  • (16) C. E. Garcia, D. M. Prett, and M. Morari, “Model predictive control: theory and practice—a survey,” Automatica, vol. 25, no. 3, pp. 335–348, 1989.
  • (17) M. Hassaballah, A. A. Abdelmgeid, and H. A. Alshazly, “Image features detection, description and matching,” in Image Feature Detectors and Descriptors.   Springer, 2016, pp. 11–45.
  • (18) S. Liu and X. Bai, “Discriminative features for image classification and retrieval,” Pattern Recognition Letters, vol. 33, no. 6, pp. 744–751, 2012.
  • (19) P. Moreels and P. Perona, “Evaluation of features detectors and descriptors based on 3d objects,” International journal of computer vision, vol. 73, no. 3, pp. 263–284, 2007.
  • (20) P. Dollar, C. Wojek, B. Schiele, and P. Perona, “Pedestrian detection: An evaluation of the state of the art,” IEEE Trans. Pattern Anal. Mach.Intell., vol. 34, no. 4, pp. 743–761, 2011.
  • (21) M. Danelljan, G. Häger, F. S. Khan, and M. Felsberg, “Discriminative scale space tracking,” IEEE Trans. Pattern Anal. Mach.Intell., vol. 39, no. 8, pp. 1561–1575, 2016.
  • (22) D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International journal of computer vision, vol. 60, no. 2, pp. 91–110, 2004.
  • (23) H. Bay, T. Tuytelaars, and L. Van Gool, “Surf: Speeded up robust features,” in ECCV.   Springer, 2006, pp. 404–417.
  • (24) E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, “Orb: An efficient alternative to sift or surf,” in 2011 International Conference on Computer Vision.   IEEE, 2011, pp. 2564–2571.
  • (25) S. Leutenegger, M. Chli, and R. Y. Siegwart, “Brisk: Binary robust invariant scalable keypoints,” in 2011 International Conference on Computer Vision.   IEEE, 2011, pp. 2548–2555.
  • (26) R. Fan, “Real-time computer stereo vision for automotive applications,” Ph.D. dissertation, University of Bristol, 2018.
  • (27) R. Hartley and A. Zisserman, Multiple view geometry in computer vision.   Cambridge university press, 2003.
  • (28) H. Wang, R. Fan, and M. Liu, “Cot-amflow: Adaptive modulation network with co-teaching strategy for unsupervised optical flow estimation,” 2020.
  • (29) S. Ullman, “The interpretation of structure from motion,” Proceedings of the Royal Society of London. Series B. Biological Sciences, vol. 203, no. 1153, pp. 405–426, 1979.
  • (30) B. Triggs, P. F. McLauchlan, R. I. Hartley, and A. W. Fitzgibbon, “Bundle adjustment—a modern synthesis,” in International workshop on vision algorithms.   Springer, 1999, pp. 298–372.
  • (31) H. Wang, Y. Liu, H. Huang, Y. Pan, W. Yu, J. Jiang, D. Lyu, M. J. Bocus, M. Liu, I. Pitas et al., “Atg-pvd: Ticketing parking violations on a drone,” European Conference on Computer Vision (ECCV) Workshops, 2020.
  • (32) Z.-Q. Zhao, P. Zheng, S.-t. Xu, and X. Wu, “Object detection with deep learning: A review,” IEEE transactions on neural networks and learning systems, vol. 30, no. 11, pp. 3212–3232, 2019.
  • (33) D. Wang, C. Devin, Q.-Z. Cai, F. Yu, and T. Darrell, “Deep object-centric policies for autonomous driving,” in 2019 International Conference on Robotics and Automation (ICRA).   IEEE, 2019, pp. 8853–8859.
  • (34) A. Mogelmose, M. M. Trivedi, and T. B. Moeslund, “Vision-based traffic sign detection and analysis for intelligent driver assistance systems: Perspectives and survey,” IEEE Transactions on Intelligent Transportation Systems, vol. 13, no. 4, pp. 1484–1497, 2012.
  • (35) B. Wu, F. Iandola, P. H. Jin, and K. Keutzer, “Squeezedet: Unified, small, low power fully convolutional neural networks for real-time object detection for autonomous driving,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017, pp. 129–137.
  • (36) R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 580–587.
  • (37) R. Girshick, “Fast r-cnn,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1440–1448.
  • (38) S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in Advances in neural information processing systems, 2015, pp. 91–99.
  • (39) J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 779–788.
  • (40) J. Redmon and A. Farhadi, “Yolov3: An incremental improvement,” 2018.
  • (41) A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, “Yolov4: Optimal speed and accuracy of object detection,” 2020.
  • (42) R. Fan, H. Wang, P. Cai, and M. Liu, “Sne-roadseg: Incorporating surface normal information into semantic segmentation for accurate freespace detection,” in European Conference on Computer Vision.   Springer, 2020, pp. 340–356.
  • (43) H. Wang, R. Fan, Y. Sun, and M. Liu, “Applying surface normal information in drivable area and road anomaly detection for ground mobile robots,” 2020.
  • (44) R. Fan, H. Wang, P. Cai, J. Wu, M. J. Bocus, L. Qiao, and M. Liu, “Learning collision-free space detection from stereo images: Homography matrix brings better data augmentation,” 2020.
  • (45) J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3431–3440.
  • (46) O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention.   Springer, 2015, pp. 234–241.
  • (47) V. Badrinarayanan, A. Kendall, and R. Cipolla, “Segnet: A deep convolutional encoder-decoder architecture for image segmentation,” IEEE Trans. Pattern Anal. Mach.Intell., vol. 39, no. 12, pp. 2481–2495, 2017.
  • (48) L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, “Encoder-decoder with atrous separable convolution for semantic image segmentation,” in ECCV, 2018, pp. 801–818.
  • (49) M. Yang, K. Yu, C. Zhang, Z. Li, and K. Yang, “Denseaspp for semantic segmentation in street scenes,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3684–3692.
  • (50) Z. Tian, T. He, C. Shen, and Y. Yan, “Decoders matter for semantic segmentation: Data-dependent decoding enables flexible feature aggregation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 3126–3135.
  • (51) R. Fan, H. Wang, M. J. Bocus, and M. Liu, “We learn better road pothole detection: from attention aggregation to adversarial domain adaptation,” 2020.
  • (52) C. Hazirbas, L. Ma, C. Domokos, and D. Cremers, “Fusenet: Incorporating depth into semantic segmentation via fusion-based cnn architecture,” in Asian conference on computer vision.   Springer, 2016, pp. 213–228.
  • (53) R. Fan, H. Wang, B. Xue, H. Huang, Y. Wang, M. Liu, and I. Pitas, “Three-filters-to-normal: An accurate and ultrafast surface normal estimator,” 2020.
  • (54) R. Fan, M. J. Bocus, and N. Dahnoun, “A novel disparity transformation algorithm for road segmentation,” Information Processing Letters, vol. 140, pp. 18–24, 2018.
  • (55) R. Fan, U. Ozgunalp, B. Hosking, M. Liu, and I. Pitas, “Pothole detection based on disparity transformation and road surface modeling,” IEEE Transactions on Image Processing, vol. 29, pp. 897–908, 2019.
  • (56) R. Fan and M. Liu, “Road damage detection based on unsupervised disparity map segmentation,” IEEE Transactions on Intelligent Transportation Systems, 2019.
  • (57) Q. Ha, K. Watanabe, T. Karasawa, Y. Ushiku, and T. Harada, “Mfnet: Towards real-time semantic segmentation for autonomous vehicles with multi-spectral scenes,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2017, pp. 5108–5115.
  • (58) R. Fan, V. Prokhorov, and N. Dahnoun, “Faster-than-real-time linear lane detection implementation using soc dsp tms320c6678,” in 2016 IEEE International Conference on Imaging Systems and Techniques (IST).   IEEE, 2016, pp. 306–311.
  • (59) U. Ozgunalp, R. Fan, X. Ai, and N. Dahnoun, “Multiple lane detection algorithm based on novel dense vanishing point estimation,” IEEE Transactions on Intelligent Transportation Systems, vol. 18, no. 3, pp. 621–632, 2016.
  • (60) R. Fan and N. Dahnoun, “Real-time stereo vision-based lane detection system,” Measurement Science and Technology, vol. 29, no. 7, p. 074005, 2018.
  • (61) J. Jiao, R. Fan, H. Ma, and M. Liu, “Using dp towards a shortest path problem-related application,” in 2019 International Conference on Robotics and Automation (ICRA).   IEEE, 2019, pp. 8669–8675.
  • (62) E. Trucco and A. Verri, Introductory techniques for 3-D computer vision.   Prentice Hall Englewood Cliffs, 1998, vol. 201.
  • (63) R. Fan, J. Jiao, J. Pan, H. Huang, S. Shen, and M. Liu, “Real-time dense stereo embedded in a uav for road inspection,” in Computer Vision and Pattern Recognition Workshops (CVPRW), 2019, pp. 535–543.
  • (64) Z. Zhang, “A flexible new technique for camera calibration,” IEEE Trans. Pattern Anal. Mach.Intell., vol. 22, no. 11, pp. 1330–1334, 2000.
  • (65) H. C. Longuet-Higgins, “A computer algorithm for reconstructing a scene from two projections,” Nature, vol. 293, no. 5828, pp. 133–135, 1981.
  • (66) R. I. Hartley, “In defense of the eight-point algorithm,” IEEE Trans. Pattern Anal. Mach.Intell., vol. 19, no. 6, pp. 580–593, 1997.
  • (67) R. Fan, X. Ai, and N. Dahnoun, “Road surface 3d reconstruction based on dense subpixel disparity map estimation,” IEEE Transactions on Image Processing, vol. 27, no. 6, pp. 3025–3035, 2018.
  • (68) B. Tippetts, D. J. Lee, K. Lillywhite, and J. Archibald, “Review of stereo vision algorithms and their suitability for resource-limited systems,” Journal of Real-Time Image Processing, vol. 11, no. 1, pp. 5–25, 2016.
  • (69) W. Luo, A. G. Schwing, and R. Urtasun, “Efficient deep learning for stereo matching,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 5695–5703.
  • (70) D. Scharstein and R. Szeliski, “High-accuracy stereo depth maps using structured light,” in 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003. Proceedings., vol. 1.   IEEE, 2003, pp. I–I.
  • (71) M. G. Mozerov and J. van de Weijer, “Accurate stereo matching by two-step energy minimization,” IEEE Transactions on Image Processing, vol. 24, no. 3, pp. 1153–1163, 2015.
  • (72) M. F. Tappen and W. T. Freeman, “Comparison of graph cuts with belief propagation for stereo, using identical mrf parameters,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV).   IEEE, 2003, p. 900.
  • (73) H. Hirschmuller, “Stereo processing by semiglobal matching and mutual information,” IEEE Trans. Pattern Anal. Mach.Intell., vol. 30, no. 2, pp. 328–341, 2007.
  • (74) J. Žbontar and Y. LeCun, “Stereo matching by training a convolutional neural network to compare image patches,” The journal of machine learning research, vol. 17, no. 1, pp. 2287–2318, 2016.
  • (75) H. Hirschmuller and D. Scharstein, “Evaluation of stereo matching costs on images with radiometric differences,” IEEE Trans. Pattern Anal. Mach.Intell., vol. 31, no. 9, pp. 1582–1599, 2008.
  • (76) Q. Yang, L. Wang, R. Yang, H. Stewénius, and D. Nistér, “Stereo matching with color-weighted correlation, hierarchical belief propagation, and occlusion handling,” IEEE Trans. Pattern Anal. Mach.Intell., vol. 31, no. 3, pp. 492–504, 2008.
  • (77) R. Fan, Y. Liu, M. J. Bocus, L. Wang, and M. Liu, “Real-time subpixel fast bilateral stereo,” in 2018 IEEE International Conference on Information and Automation (ICIA).   IEEE, 2018, pp. 1058–1065.
  • (78) R. Fan, Y. Liu, X. Yang, M. J. Bocus, N. Dahnoun, and S. Tancock, “Real-time stereo vision for road surface 3-d reconstruction,” in 2018 IEEE International Conference on Imaging Systems and Techniques (IST).   IEEE, 2018, pp. 1–6.
  • (79) Y. Boykov, O. Veksler, and R. Zabih, “Fast approximate energy minimization via graph cuts,” IEEE Trans. Pattern Anal. Mach.Intell., vol. 23, no. 11, pp. 1222–1239, 2001.
  • (80) A. T. Ihler, A. S. Willsky et al., “Loopy belief propagation: Convergence and effects of message errors,” Journal of Machine Learning Research, vol. 6, no. May, pp. 905–936, 2005.
  • (81) A. Blake, P. Kohli, and C. Rother, Markov random fields for vision and image processing.   Mit Press, 2011.
  • (82) D. Scharstein and R. Szeliski, “A taxonomy and evaluation of dense two-frame stereo correspondence algorithms,” International journal of computer vision, vol. 47, no. 1-3, pp. 7–42, 2002.
  • (83) J. Žbontar and Y. LeCun, “Computing the stereo matching cost with a convolutional neural network,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1592–1599.
  • (84) F. Zhang, V. Prisacariu, R. Yang, and P. H. Torr, “Ga-net: Guided aggregation net for end-to-end stereo matching,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 185–194.
  • (85) A. Seki and M. Pollefeys, “Sgm-nets: Semi-global matching with neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 231–240.
  • (86) N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox, “A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 4040–4048.
  • (87) J. Pang, W. Sun, J. S. Ren, C. Yang, and Q. Yan, “Cascade residual learning: A two-stage convolutional neural network for stereo matching,” in Proceedings of the IEEE International Conference on Computer Vision Workshops, 2017, pp. 887–895.
  • (88) A. Kendall, H. Martirosyan, S. Dasgupta, P. Henry, R. Kennedy, A. Bachrach, and A. Bry, “End-to-end learning of geometry and context for deep stereo regression,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 66–75.
  • (89) A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? the kitti vision benchmark suite,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition.   IEEE, 2012, pp. 3354–3361.
  • (90) M. Menze, C. Heipke, and A. Geiger, “Joint 3d estimation of vehicles and scene flow.” ISPRS Annals of Photogrammetry, Remote Sensing & Spatial Information Sciences, vol. 2, 2015.
  • (91) M. Menze, C. Heipke, and A. Geiger, “Object scene flow,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 140, pp. 60–76, 2018.
  • (92) J.-R. Chang and Y.-S. Chen, “Pyramid stereo matching network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5410–5418.
  • (93) J. L. Barron, D. J. Fleet, and S. S. Beauchemin, “Performance of optical flow techniques,” International journal of computer vision, vol. 12, no. 1, pp. 43–77, 1994.
  • (94) S. Mittal and J. S. Vetter, “A survey of cpu-gpu heterogeneous computing techniques,” ACM Computing Surveys (CSUR), vol. 47, no. 4, pp. 1–35, 2015.
  • (95) H. Jin, D. Jespersen, P. Mehrotra, R. Biswas, L. Huang, and B. Chapman, “High performance computing using mpi and openmp on multi-core parallel systems,” Parallel Computing, vol. 37, no. 9, pp. 562–575, 2011.
  • (96) R. Fan, S. Duanmu, Y. Liu, Y. Zhu, J. Jiao, M. J. Bocus, Y. Yu, L. Wang, and M. Liu, “Real-time binocular vision implementation on an soc tms320c6678 dsp,” in International Conference on Computer Vision Systems.   Springer, 2019, pp. 13–23.
  • (97) R. Fan and N. Dahnoun, “Real-time implementation of stereo vision based on optimised normalised cross-correlation and propagated search range on a gpu,” in 2017 IEEE International Conference on Imaging Systems and Techniques (IST).   IEEE, 2017, pp. 1–6.