
HPPS: A Hierarchical Progressive Perception System for Luggage Trolley Detection and Localization at Airports

Zhirui Sun, Zhe Zhang, Jieting Zhao, Hanjing Ye, and Jiankun Wang, Senior Member, IEEE This work is supported by National Natural Science Foundation of China under Grant 62103181 and Shenzhen Science and Technology Program under Grant 20231115141459001. (Corresponding author: Jiankun Wang).Zhirui Sun, Zhe Zhang and Jiankun Wang are with Shenzhen Key Laboratory of Robotics Perception and Intelligence, Department of Electronic and Electrical Engineering, Southern University of Science and Technology, Shenzhen, China (e-mail: [email protected]; [email protected]; [email protected]).Zhirui Sun and Jiankun Wang are also with Jiaxing Research Institute, Southern University of Science and Technology, Jiaxing, China.Jieting Zhao and Hanjing Ye are with Shenzhen Key Laboratory of Robotics and Computer Vision, Department of Electronic and Electrical Engineering, Southern University of Science and Technology, Shenzhen, China (e-mail: [email protected]; [email protected]).
Abstract

The robotic autonomous luggage trolley collection system employs robots to gather and transport scattered luggage trolleys at airports. However, existing methods for detecting and locating these luggage trolleys often fail when they are not fully visible. To address this, we introduce the Hierarchical Progressive Perception System (HPPS), which enhances the detection and localization of luggage trolleys under partial occlusion. The HPPS processes the luggage trolley’s position ($x, y$) and orientation ($\theta$) separately, which requires only RGB images for labeling and training, eliminating the need for 3D coordinates and alignment. The HPPS can accurately determine the position of the luggage trolley with just one well-detected keypoint and estimate the luggage trolley’s orientation when it is partially occluded. Once the luggage trolley’s initial pose is detected, HPPS updates this information continuously to refine its accuracy until the robot begins grasping. The experiments on detection and localization demonstrate that HPPS is more reliable under partial occlusion compared to existing methods. Its effectiveness and robustness have also been confirmed through practical tests in actual luggage trolley collection tasks. A website about this work is available at HPPS.

Index Terms:
Hierarchical structure, progressive perception, partial occlusion, robotic autonomous luggage trolley collection.

I Introduction

Intelligent robotics and autonomous driving technologies [1] [2] are increasingly utilized in areas like traffic lighting [3], logistics [4], and mining trucks [5], playing a significant role in their advancement. Similarly, researchers are exploring robotic systems for autonomous luggage trolley collection at airports, in which robots gather luggage trolleys scattered around the airport and transport them to designated areas for reuse. This reduces the need for human resources and improves the efficiency of luggage trolley collection. Designing such a robotic system is complex and involves multiple components, such as object detection, localization, motion planning, control, and manipulation.

Figure 1: Schematic diagram of HPPS for luggage trolley detection and localization under partial occlusion at airports.

Some progress has been made in areas like robotic autonomous trolley collection [6], autonomous multiple-trolley collection [7], and collaborative trolley transportation [8]. However, these studies usually ignore situations where the luggage trolley is occluded or do not consider the perception of the luggage trolley at all. Challenges remain in detecting and localizing luggage trolleys, especially under occlusion. The luggage trolley often becomes partially occluded by objects or people at airports, as illustrated in Fig. 1, where two people stop to talk, occluding the view of the idle luggage trolley. Therefore, effective detection and localization under partial occlusion are essential for the system: they ensure that the robot identifies the correct luggage trolley pose, which is fundamental for subsequent motion planning and accurate manipulation of end-effectors.

For detecting luggage trolleys, current methods overlook scenarios where the luggage trolley is occluded and fail to distinguish the usage states of the luggage trolley. This leads to frequent detection failures and an inability to identify idle trolleys among various usage states. To address this, a new dataset of luggage trolleys is collected and labeled. This dataset includes different luggage trolley states, multiple occlusion levels, and various environmental settings. Retraining with this dataset effectively resolves the issue of failing to detect idle luggage trolleys under partial occlusion. For luggage trolley localization, mainstream methods rely on wireless, visual, and laser sensors, and choosing the most suitable one for a given application scenario is a challenge. Fortunately, previous studies have already demonstrated solutions to this. In related work [9], multiple localization methods are systematically evaluated, and the study concludes that vision-based keypoint methods are most suitable for the robotic autonomous luggage trolley collection system. This approach primarily relies on gathering keypoint information from the luggage trolley and determining the luggage trolley’s pose by solving EPnP [10]. However, when the luggage trolley is occluded or keypoint detection is incomplete, this method often fails to determine the luggage trolley’s pose.

In this work, we propose HPPS, which detects and locates luggage trolleys even under partial occlusion. Traditional methods for acquiring the pose of a luggage trolley face significant limitations, particularly in handling occlusions and ensuring accuracy and real-time performance. For example, the method by Xiao et al. [6], which relies on solving EPnP, fails to operate effectively under occluded conditions. The approach by Pan et al. [11], which uses point cloud registration, struggles with poor real-time performance and insufficient pose accuracy. HPPS processes and obtains the position and orientation of the luggage trolley independently, which allows it to handle complex situations efficiently and robustly. HPPS requires only simple RGB information to determine the luggage trolley’s position and orientation, without complex sensor equipment. For input RGB images, HPPS identifies the luggage trolley’s keypoints via a detection network and then calculates the luggage trolley’s position using the camera’s projection geometry. At the same time, HPPS’s orientation detection network predicts the luggage trolley’s orientation probability distribution and then regresses the most likely orientation. The luggage trolley’s pose is determined by integrating these two parts of information. In this process, we divide the pose of the luggage trolley into position and orientation, processing them separately and simplifying the requirements for input information. Compared to methods requiring depth cameras or LiDAR, this system only needs a monocular camera, making it more cost-effective. RGB image data is more accessible than depth data, allowing the use of many existing image datasets and network resources, thus reducing the workload of data collection and preprocessing. Moreover, once the initial pose of the luggage trolley is obtained, the robot continues to detect the luggage trolley during navigation to further refine the localization accuracy until the grasping process begins.

I-A Contributions

The main contributions of this article are summarized as follows:

  • This article presents a dataset of 13740 images capturing various luggage trolley states and occlusion levels across diverse environments to improve luggage trolley detection accuracy in real-world scenarios.

  • This article introduces a novel Hierarchical Progressive Perception System (HPPS) designed for detecting and locating luggage trolleys under partial occlusion.

  • Real-world experiments demonstrate the robustness and accuracy of HPPS in complex and dynamic environments, where it detects and locates the target luggage trolley and successfully collects it. This progress enhances the deployment of robotic autonomous luggage trolley collection systems at airports.

I-B Outline

The remainder of this article is organized as follows. The related work is reviewed in Sec. II. Sec. III provides the details of the HPPS, including the Detection Module, Keypoints Process Module, Orientation Process Module, Filter Module, and Motion Planner Module. The Dataset, Implementation Details, Experiment Platform, and Results are explained in Sec. IV. Finally, Sec. V summarizes this article and discusses future work.

II Related Work

II-A 2D Bounding-box Detection and Classification

Object detection and classification are essential for identifying and localizing objects within images. Popular methods include region-based approaches like the R-CNN series [12][13][14] and single-shot detectors such as YOLO (You Only Look Once) [15] and SSD (Single Shot MultiBox Detector) [16]. While R-CNN models are highly accurate, they are computationally intensive. In contrast, YOLO offers a compelling balance between speed and accuracy, making it suitable for real-time applications. YOLO divides the image into a grid, with each cell predicting multiple bounding boxes and their confidence scores, which indicate the likelihood of an object’s presence and the accuracy of the box. Each grid cell also classifies the detected object, combining localization with precise categorization. Considering both performance and efficiency, YOLO is therefore an ideal choice for our task, providing efficient and accurate visual recognition in real time.
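As a concrete illustration of this single-shot workflow, the following Python sketch queries an off-the-shelf YOLOv5 model through torch.hub and reads back boxes, confidences, and class indices. The model weights ('yolov5s'), the image path, and the confidence threshold are illustrative assumptions, not the trolley detector trained in this work (Sec. IV).

```python
import torch

# Load a pretrained YOLOv5 model via torch.hub and run it on a single image.
# 'yolov5s' and its COCO classes are placeholders; the paper trains its own
# idle-vs-occupied trolley detector on a custom dataset (Sec. IV-A).
model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)

results = model("airport_scene.jpg")   # hypothetical image path
detections = results.xyxy[0]           # rows: [x_min, y_min, x_max, y_max, conf, class]

for *box, conf, cls in detections.tolist():
    if conf > 0.5:                     # illustrative confidence threshold
        print(f"class={int(cls)} conf={conf:.2f} box={[round(v, 1) for v in box]}")
```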

II-B Keypoint Detection

Keypoint detection is instrumental for accurately identifying and localizing specific parts of objects in images. Several neural network models have significantly advanced this field. OpenPose [17] excels in multi-object scenarios by generating part candidates and analyzing their connections. Stacked Hourglass Networks [18] leverage a repetitive structure that captures and integrates features at multiple scales, ideal for precise single-object keypoint localization. DeepPose [19], developed by Google, applies deep learning to keypoint detection by directly regressing coordinates from images but may struggle with complex backgrounds. Convolutional Pose Machines (CPM) [20] refine keypoint predictions through multi-stage processing, with each stage enhancing the accuracy of the heatmaps. HRNet [21] maintains high-resolution representations throughout the network, capturing fine and coarse features simultaneously. Unlike models that lose spatial information through pooling operations, HRNet retains high-resolution features during processing; this design lets it excel in precision and detail, outperforming other models on challenging datasets such as COCO keypoints, and makes it particularly suitable for complex scenarios involving occlusions and various poses.
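Heatmap-based detectors such as HRNet emit one heatmap per keypoint, and the 2D location is typically read out as the argmax of each map. The sketch below shows only that generic decoding step under assumed shapes and an assumed output stride; it is not the HRNet implementation used later in this work.

```python
import numpy as np

def decode_heatmaps(heatmaps: np.ndarray, stride: int = 4):
    """Convert a (K, H, W) stack of keypoint heatmaps into K (x, y, score) tuples.

    `stride` maps heatmap coordinates back to input-image pixels; 4 is a common
    HRNet output stride, used here only as an illustrative default.
    """
    keypoints = []
    for hm in heatmaps:
        idx = np.argmax(hm)
        y, x = np.unravel_index(idx, hm.shape)
        keypoints.append((x * stride, y * stride, float(hm[y, x])))
    return keypoints

# Example with random data standing in for network output (6 keypoints, 64x48 maps).
fake_maps = np.random.rand(6, 64, 48)
print(decode_heatmaps(fake_maps)[:2])
```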

II-C Orientation Estimation

Accurate orientation estimation is fundamental for autonomous systems. Traditional approaches relied on handcrafted features and machine learning classifiers like Support Vector Machines (SVMs) [22], which often struggled with occlusions and complex backgrounds. Current deep learning-based methods frequently approach orientation estimation as a classification task. The method in [23] employs a four-layer neural network to regress the orientation directly from the image, providing an end-to-end solution. In the study by [24], the orientation is classified into eight bins and then regressed within those bins for a finer estimate. These methods employ straightforward network structures and therefore perform best in environments similar to their training data. Yu et al. [25] present models that detect keypoints and infer orientation based on the spatial arrangement of these keypoints, effectively handling occlusions. Estimating orientation directly from images is valid, as it simplifies the annotation process for training datasets and enhances performance by concentrating on the orientation estimation task [26]. Considering the strengths and constraints of these methods, combining HRNet and ResNet provides a promising solution: HRNet excels in keypoint detection, while ResNet is suited for global feature extraction and orientation estimation. This combined method offers reliable orientation estimation, maintaining precision even when objects are partially occluded.

II-D 3D Pose Estimation

3D pose estimation is crucial for determining objects’ spatial orientation and position. Current leading methods in this domain primarily utilize deep learning, employing neural networks to determine the pose of objects from input images. PoseCNN [27] predicts the 3D rotation and translation directly from images; it is suitable for single-object scenarios but struggles with complex backgrounds and occlusions, and it requires extensive, precisely annotated 3D pose data, which can be costly and time-consuming to prepare. DeepIM [28] improves initial pose estimates through an iterative process, offering significant precision advantages. However, its high computational cost and slow processing speed make it unsuitable for real-time applications, and it heavily relies on accurate initial pose estimations. PVNet [29] uses keypoints and a voting mechanism to estimate poses, handling partial occlusions and various viewing angles effectively. However, it demands precisely annotated 2D image positions and corresponding 3D spatial locations, which makes data collection challenging. NOCS [30] provides an end-to-end solution by mapping objects into a normalized coordinate space, facilitating viewpoint- and size-independent representations. While NOCS excels at generalizing to unseen object categories, its reliance on detailed 3D models and precise data alignment during training limits its practical use. MonoLoc [31] utilizes a neural network to identify keypoints within an image, then employs these 2D keypoint positions to obtain the object’s 3D location through a multi-task neural network. EPro-PnP [32] infers the object’s pose using a 3D bounding box incorporating learnable 2D-3D correspondences. The training datasets for these methods often require complex data collection and annotation, which increases implementation costs. Furthermore, these datasets usually need complete observation of the objects. However, in environments like airports, luggage trolleys are frequently only partially visible, making such scenarios challenging for these methods to handle.

Compared to these methods, our approach separately detects the position and orientation of the luggage trolley. It uses a keypoints detector to provide 2D information for model-based 3D location processing and an orientation detector to estimate the orientation probability for Gaussian regression analysis. The main advantages of our approach include: 1) Utilizing 2D detectors simplifies the 3D pose estimation process by relying solely on RGB image data and offers improved robustness against partial occlusion compared to deep learning-based methods. 2) Our model-based process module efficiently uses a predefined model of the luggage trolley to compute position from well-detected 2D keypoints. 3) Our method decreases the requirement for dataset preparation, streamlines implementation, and enhances processing speed, making it well-suited for real-time applications, especially in robotic systems that are resource-constrained and time-sensitive.

III System Description

Figure 2: A diagram of our proposed luggage trolley detection and localization system. The process starts with capturing RGB images, followed by a detection phase. Next, the system processes this detection information to determine the luggage trolley’s pose, which is then filtered and progressively updated. Finally, the planning module computes and sends velocity commands to the robot in real time.

This section introduces the architecture of the luggage trolley detection and localization system, as shown in Fig. 2. This system utilizes a camera to capture RGB images. The luggage trolley detection network identifies the idle ones within these images, marking them with bounding boxes. Based on these bounding boxes, the keypoints detection network gets the keypoints’ coordinates on the image plane, and the orientation detection network predicts the luggage trolley orientation’s probability distribution (Sec. III-A). The 3D coordinates of keypoints are determined through the camera’s projection geometry and a prior model (Sec. III-B). The luggage trolley’s orientation is estimated via Gaussian regression (Sec. III-C). Following a filtering process (Sec. III-D), the pose of the luggage trolley is continuously updated through effective observed information, thereby enhancing accuracy. After obtaining the luggage trolley’s pose, this information guides the motion planning module (Sec. III-E) to generate control instructions for robot navigation. In the forthcoming definitions, boldfaced variables signify vectors, whereas non-bold variables represent scalars.

III-A Detection Module

Given an input image $I$ with dimensions $W\times H\times 3$, the idle luggage trolley is identified, generating bounding boxes $[x_{min},y_{min},x_{max},y_{max}]$. From these bounding boxes, we can derive 2D coordinates of the keypoints on the image plane and a probability distribution of the luggage trolley’s orientation. Specifically, YOLOv5 [33] is employed for real-time luggage trolley detection due to its effectiveness in object detection. The next step involves cropping the image within the bounding box to focus solely on the luggage trolley. This produces a cropped image $I_{c}\in\mathbb{R}^{W_{c}\times H_{c}\times 3}$.
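A minimal sketch of this cropping step, producing the patch $I_{c}$ from a detected bounding box, is given below; the array conventions and variable names are assumptions for illustration.

```python
import numpy as np

def crop_trolley(image: np.ndarray, bbox) -> np.ndarray:
    """Crop the detected trolley region I_c from the full RGB image I.

    `bbox` is [x_min, y_min, x_max, y_max] in pixel coordinates (e.g. one row of a
    detector result); the box is clamped to the image bounds before slicing.
    """
    h, w = image.shape[:2]
    x_min, y_min, x_max, y_max = [int(round(v)) for v in bbox]
    x_min, y_min = max(0, x_min), max(0, y_min)
    x_max, y_max = min(w, x_max), min(h, y_max)
    return image[y_min:y_max, x_min:x_max]

# Example with a synthetic 480x640 frame standing in for a camera image.
frame = np.zeros((480, 640, 3), dtype=np.uint8)
patch = crop_trolley(frame, [100.3, 150.7, 300.2, 420.9])
print(patch.shape)  # (270, 200, 3)
```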

Inspired by human pose estimation, we use HRNet to predict heatmaps of six 2D keypoints (as shown in Fig. 3) with coordinates $\bm{p}_{i}=[x_{i},y_{i}]^{T}\ (i=0,1,\ldots,5)$, from which homogeneous 2D keypoint coordinates $\bm{\hat{p}}_{i}=[u_{i},v_{i},1]^{T}\ (i=0,1,\ldots,5)$ are formed. HRNet and ResNet are combined [34] for orientation detection. The cropped images pass through a backbone network acting as a feature extractor. These features are then combined and processed via additional residual layers, culminating in a fully connected layer and a softmax layer. The outcome is a distribution over $n$ unit orientations, $\bm{\vartheta}=[\vartheta(0),\vartheta(1),\ldots,\vartheta(n-1)]$ with $\sum_{j=0}^{n-1}\vartheta(j)=1.0$, where each entry indicates the probability that the corresponding orientation unit best represents the luggage trolley’s orientation in the image.
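The following sketch illustrates only the head structure described above (global features, a fully connected layer, and a softmax over $n$ unit orientations). A plain torchvision ResNet-18 stands in for the combined HRNet+ResNet feature extractor of [34], so this is an assumed simplification rather than the network trained in this work.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class OrientationHead(nn.Module):
    """Sketch of an orientation classifier over n discrete unit orientations.

    A torchvision ResNet-18 stands in for the combined HRNet+ResNet feature
    extractor; only the head (fully connected layer + softmax over n bins)
    follows the description in Sec. III-A.
    """

    def __init__(self, n_bins: int = 360):
        super().__init__()
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Identity()           # keep the 512-d global feature vector
        self.backbone = backbone
        self.fc = nn.Linear(512, n_bins)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(x)              # (B, 512)
        logits = self.fc(feats)               # (B, n_bins)
        return torch.softmax(logits, dim=1)   # probabilities summing to 1 per sample

# Example: probabilities over 360 one-degree bins for a batch of two crops.
probs = OrientationHead()(torch.randn(2, 3, 224, 224))
print(probs.shape, probs.sum(dim=1))          # torch.Size([2, 360]), ~[1.0, 1.0]
```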

III-B Keypoints Process Module

Figure 3: An example of estimating the luggage trolley’s position through keypoint observation. The process utilizes any visible keypoint for the luggage trolley’s position estimation. The parameters involved and the estimation method are detailed in Eq. 3.

The keypoints detection network identifies the 2D coordinates of the keypoints on the image plane. Assuming the center point of the luggage trolley is on the ground, a coordinate system is established with this center point as the origin. From this reference, the 3D coordinates of the keypoints are determined. With the known 3D heights of the keypoints corresponding to the center point, their precise 3D positions can be calculated [35]. The luggage trolley’s prior model is defined as follows:

\bm{\mathcal{M}}=\{\bm{\mathcal{Y}}_{i}=(x_{i},y_{i},\zeta_{i})\in\mathbb{R}^{3}\mid i=0,1,\ldots,5\}, \qquad (1)

where $\bm{\mathcal{Y}}_{i}$ represents the $i$-th keypoint’s 3D position relative to the center point, consisting of coordinates $x_{i},y_{i}$ and height $\zeta_{i}$. We define the position of the luggage trolley as the center point, $\bm{\mathcal{Y}}\in\mathbb{R}^{3}$, within the camera coordinate frame, satisfying the ground plane constraint [36]:

-\bm{\mathcal{N}}^{T}\bm{\hat{\mathcal{Y}}}+\lambda=0, \qquad (2)

where $\bm{\mathcal{N}}\in\mathbb{R}^{3}$ indicates the normal vector to the ground, and $\lambda>0$ is the distance from the camera’s optical center perpendicular to the ground, as shown in Fig. 3.

Based on the prior model, the 3D position of visible keypoints in any input image is determined as follows:

\bm{\rho}=\bm{\mathcal{K}}^{-1}\bm{\hat{p}}_{i}, \qquad (3)
\bm{\mathcal{X}}_{i}=\frac{|\lambda-\zeta_{i}|}{|\bm{\mathcal{N}}^{T}\bm{\rho}|}\cdot\bm{\rho}.

The image coordinates of the $i$-th keypoint are obtained through the keypoints detection network, forming homogeneous coordinates $\bm{\hat{p}}_{i}$. Multiplying these homogeneous coordinates by the inverse of the camera’s intrinsic matrix $\bm{\mathcal{K}}^{-1}$ yields the ray $\bm{\rho}$ that passes through the $i$-th keypoint. Using the geometric relation between the camera’s height $\lambda$ and the height of the $i$-th keypoint $\zeta_{i}$, the 3D position of this keypoint $\bm{\mathcal{X}}_{i}$ is determined. Once the coordinates of the $i$-th keypoint $\bm{\mathcal{X}}_{i}$ are calculated, and with the prior model $\bm{\mathcal{M}}$, the center point coordinates of the luggage trolley are derived using the following formula:

\bm{\mathcal{C}}=\frac{1}{N_{vis}}\sum_{i=1}^{N_{vis}}(\bm{\mathcal{X}}_{i,vis}-\bm{\mathcal{Y}}_{i,vis}), \qquad (4)

where $\bm{\mathcal{C}}$ represents the center point coordinates of the luggage trolley, $N_{vis}$ is the count of visible keypoints, $\bm{\mathcal{X}}_{i,vis}$ refers to the 3D position of the $i$-th visible keypoint, and $\bm{\mathcal{Y}}_{i,vis}$ corresponds to the $i$-th visible keypoint in the prior model.
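A minimal numerical sketch of Eqs. (3) and (4) follows: one visible keypoint is back-projected through the intrinsics, scaled by the ground-plane geometry, and offset by the prior model to estimate the center $\bm{\mathcal{C}}$. The intrinsic matrix, ground normal, camera height, keypoint pixel, and prior offset below are illustrative values, not calibration data from the paper.

```python
import numpy as np

K = np.array([[615.0, 0.0, 320.0],   # illustrative camera intrinsics
              [0.0, 615.0, 240.0],
              [0.0, 0.0, 1.0]])
N = np.array([0.0, -1.0, 0.0])       # assumed ground normal in the camera frame
lam = 0.9                            # assumed camera height above the ground (m)

def keypoint_3d(p_uv, zeta):
    """Eq. (3): back-project pixel (u, v) of a keypoint whose height above ground is zeta."""
    p_hom = np.array([p_uv[0], p_uv[1], 1.0])
    ray = np.linalg.inv(K) @ p_hom                   # rho = K^-1 * p_hat
    return (abs(lam - zeta) / abs(N @ ray)) * ray    # X_i

def trolley_center(pixels, heights, prior_offsets):
    """Eq. (4): average X_i - Y_i over the visible keypoints."""
    diffs = [keypoint_3d(p, z) - y for p, z, y in zip(pixels, heights, prior_offsets)]
    return np.mean(diffs, axis=0)

# A single visible keypoint at pixel (350, 300), 0.25 m above the ground, whose
# prior offset from the trolley center is (0.20, -0.25, 0.10) in the camera frame.
center = trolley_center([(350, 300)], [0.25], [np.array([0.20, -0.25, 0.10])])
print(center)   # approximate trolley center in camera coordinates
```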

III-C Orientation Process Module

Figure 4: An example of estimating the luggage trolley’s orientation from a top-down perspective. The yellow curve represents the Gaussian regression probability distribution of the orientation, while the red dotted line with an arrow indicates the orientation with the highest probability.

The orientation detection network infers possible orientations for the luggage trolley; the number of possible orientations corresponds to the degree of discretization. Based on comparative experiments, this article adopts 360 divisions, which yield the best orientation accuracy; the supporting results are provided in Tab. II. $\vartheta(j)$ indicates the probability that the luggage trolley’s orientation falls within the $j$-th unit orientation, i.e., in the range $\{\theta\mid j\cdot 1^{\circ}-0.5^{\circ}\leq\theta\leq j\cdot 1^{\circ}+0.5^{\circ}\}$. The loss function for $\vartheta(j)$ is defined as follows:

\mathcal{J}(\theta)=\sum_{j=0}^{360}\left({\vartheta}(j)-\psi(\mu,\sigma)\right)^{2}, \qquad (5)

where $\psi(\mu,\sigma)$ represents the “circular” Gaussian probability, a representation of the ground truth, shown in Fig. 4 (orange curve):

\psi(\mu,\sigma)=\frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-\frac{1}{2\sigma^{2}}\left(\min\left(\left|\mu-\tau\right|,\,360-\left|\mu-\tau\right|\right)\right)^{2}}, \qquad (6)

where $\tau$ is the ground truth orientation and $\mu$ corresponds to the $j$-th unit orientation. This process predicts a Gaussian function centered on the correct unit orientation, so a unit orientation closer to the true orientation receives a higher probability score from the model. Therefore, the final orientation $\theta$ is determined by the highest probability score, represented as:

\theta=\arg\max_{j}({\vartheta}(j))\times 1^{\circ}. \qquad (7)
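A minimal sketch of the circular Gaussian target of Eq. (6), the squared-error loss of Eq. (5), and the argmax decoding of Eq. (7) is given below. The standard deviation $\sigma$ is an assumed value, as its setting is not specified here.

```python
import numpy as np

N_BINS = 360  # one-degree bins, as adopted in Sec. III-C

def circular_gaussian_target(tau: float, sigma: float = 4.0):
    """Eq. (6): Gaussian over bin centres with wrap-around angular distance.

    `tau` is the ground-truth orientation in degrees; `sigma` is an assumed width.
    """
    mu = np.arange(N_BINS, dtype=float)                     # bin centres in degrees
    d = np.minimum(np.abs(mu - tau), 360.0 - np.abs(mu - tau))
    return np.exp(-0.5 * (d / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

def orientation_loss(pred: np.ndarray, tau: float) -> float:
    """Eq. (5): squared error between predicted bin probabilities and the target."""
    return float(np.sum((pred - circular_gaussian_target(tau)) ** 2))

def decode_orientation(pred: np.ndarray) -> float:
    """Eq. (7): the most probable bin, converted back to degrees (1 degree per bin)."""
    return float(np.argmax(pred) * 1.0)

# Example: a prediction peaked near 359 degrees is only 2 degrees from a ground truth of 1 degree,
# so the circular target keeps the loss small despite the numerical wrap-around.
pred = circular_gaussian_target(359.0)
print(orientation_loss(pred, 1.0), decode_orientation(pred))
```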

III-D Filter Module

To ensure the accuracy and stability of the luggage trolley’s central coordinates, it is essential to apply filtering to the coordinates derived from the visible keypoints. In this article, the Modified Moving Average Filter (MMAF) is employed, which dynamically adapts to incoming data points while mitigating the impact of outliers. The core process is described as follows (a code sketch is given after these steps):

  1. Initialization of the Filter Parameters:

    \bm{\mathcal{F}}=(\Delta,\Theta_{z}), \qquad (8)

    where $\Delta$ represents the window size for the moving average, and $\Theta_{z}$ denotes the threshold for z-score outlier detection.

  2. Dynamic Adaptation and Outlier Filtering: Given a sequence of data points ${\mathcal{O}}_{i}=[x_{i},y_{i},\theta_{i}]\ (i=1,2,\ldots,N)$, the MMAF updates its state by:

    {\mathcal{Z}}({\mathcal{O}}_{i})=\frac{|{\mathcal{O}}_{i}-\mu_{\Delta}|}{\sigma_{\Delta}}, \qquad (9)

    where $\mu_{\Delta}$ and $\sigma_{\Delta}$ denote the mean and standard deviation of the points within the window $\Delta$, respectively. The updated moving average, excluding outliers, is computed as:

    \bar{{\mathcal{O}}}=\frac{1}{N_{{\mathcal{Q}}}}\sum_{{\mathcal{O}}_{i}\in{\mathcal{Q}}}{\mathcal{O}}_{i},\quad\forall{\mathcal{O}}_{i}\in{\mathcal{Q}},\ {\mathcal{Z}}({\mathcal{O}}_{i})\leq\Theta_{z}, \qquad (10)

    where ${\mathcal{Q}}$ is the set of data points not considered outliers, and $N_{{\mathcal{Q}}}$ represents the count of ${\mathcal{Q}}$.

  3. Dynamic Adjustment:

    \bar{{\mathcal{O}}}_{f}=\begin{cases}\bar{{\mathcal{O}}},&\text{if }|{\mathcal{Q}}|>0\\ {\mathcal{O}}_{\text{latest}},&\text{otherwise}.\end{cases} \qquad (11)

    This equation signifies that the filter’s output is the average of the non-outlier points when any are available; otherwise, it defaults to the latest point.
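The sketch below implements the three steps above for a single scalar component of $[x, y, \theta]$; the window size, the z-score threshold, and the omission of angle wrap-around handling for $\theta$ are simplifying assumptions.

```python
from collections import deque
import numpy as np

class ModifiedMovingAverageFilter:
    """Sketch of the MMAF in Sec. III-D for one scalar pose component (x, y, or theta).

    Keeps a sliding window of recent observations, rejects points whose z-score
    exceeds a threshold (Eqs. 9-10), and falls back to the latest observation when
    every point in the window is flagged as an outlier (Eq. 11).
    """

    def __init__(self, window: int = 10, z_threshold: float = 2.0):
        self.window = deque(maxlen=window)   # Delta: moving-average window
        self.z_threshold = z_threshold       # Theta_z: z-score outlier threshold

    def update(self, value: float) -> float:
        self.window.append(value)
        data = np.asarray(self.window, dtype=float)
        mu, sigma = data.mean(), data.std()
        if sigma < 1e-9:                     # all points identical: no outliers possible
            return float(mu)
        z = np.abs(data - mu) / sigma        # Eq. (9)
        inliers = data[z <= self.z_threshold]
        if inliers.size > 0:                 # Eq. (10)
            return float(inliers.mean())
        return float(data[-1])               # Eq. (11): fall back to the latest point

# Usage: one filter per pose component, e.g. x_filter = ModifiedMovingAverageFilter(),
# followed by filtered_x = x_filter.update(raw_x) for each new observation.
```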

III-E Motion Planner Module

Using this filter, the luggage trolley’s pose is progressively updated by effectively observing visible keypoints. Once the luggage trolley’s pose is determined, the Motion Planner takes over the planning task. The Multi-Risk-RRT [37], our previous work, integrates Multi-directional Searching with Heuristic Sampling. This approach efficiently incorporates heuristic information from dynamic sub-trees into the rooted tree, overcoming the constraints of TBVP solvers. The Multi-Risk-RRT algorithm improves motion planning in static and dynamic environments, enabling the robot to receive control instructions based on the planned trajectory.

\left[\upsilon,\omega\right]=\text{Multi-Risk-RRT}\left(\left[x_{f},y_{f},\theta_{f}\right],Map\right), \qquad (12)

where $[\upsilon,\omega]\in\mathbb{R}^{2}$ contains the linear and angular velocities of the robot, $[x_{f},y_{f},\theta_{f}]$ with $x_{f},y_{f}\in\mathbb{R}$ and $\theta_{f}\in[0,2\pi)$ is the pose of the luggage trolley, and $Map$ is the environment representation.

IV Experiments

Figure 5: A workflow comparison of our method and Xiao’s method for estimating the pose of the luggage trolley from image inputs.

This section introduces the datasets, implementation details, experiment platform, and results. The experiments focus on detection, localization, and robot trials, illustrating the efficacy and resilience of our proposed HPPS. Fig. 5 displays the experimental workflow, contrasting our method with Xiao’s for estimating luggage trolley pose from image inputs. We first assess each method’s ability to detect luggage trolleys under occlusion: our method successfully identifies luggage trolleys in such conditions, whereas Xiao’s does not. Subsequently, our method employs separate processes for orientation and keypoint detection, contributing to the overall pose estimation through model-based methods and Gaussian regression. Conversely, Xiao’s method applies an EPnP solver for pose estimation following keypoint detection. We select three critical stages during the experiment—luggage trolley detection, keypoint detection, and pose estimation—to compare the two methods.

IV-A Dataset

Figure 6: The diagram of dataset classification. The orange segment of the inner circle represents the images of the outdoor environment, and the blue segment represents the indoor environment. The outer circle’s gradient gray illustrates the occupancy of luggage trolleys, ranging from 0 to 3. Each bar’s height represents different types of occupied luggage trolleys, distinguished by red, blue, and green boxes. Transparency levels indicate visibility: high for under 40%, medium for 40%-80%, and low for over 80%. Crosses on the boxes mark occupied luggage trolleys.

A detailed dataset of luggage trolleys is created to make the HPPS more accurate. The dataset comprises 13740 images featuring diverse backgrounds, lighting conditions, viewpoints, occlusion levels, and different luggage trolley states. Each image is labeled with 2D bounding boxes, six keypoints, an orientation angle, and an indicator of the luggage trolley’s usage state. This dataset is gathered in indoor and outdoor environments to enhance detection robustness. Recognizing that airports often have many luggage trolleys, this dataset includes images categorized by the number of luggage trolleys. To reflect real-world airport challenges, images capture luggage trolleys from various angles and include scenarios where luggage trolleys are either idle or occupied. As shown in Fig. 6, the inner circle highlights the contrast between indoor and outdoor environments, while the outer circle groups images by the number of luggage trolleys. The label “0” means no luggage trolley is in the image, and a bar chart shows the type distribution of images with occupied luggage trolleys under different occlusion conditions. The red, blue, and green boxes in each bar represent three different luggage trolleys. The transparency of each color indicates the luggage trolley’s visibility level: high transparency shows visibility under 40%, medium transparency denotes visibility between 40% and 80%, and low transparency indicates visibility over 80%. Crosses in various colors on the boxes mark the occupied luggage trolleys. The dataset classifies the visibility and occupancy conditions of the luggage trolleys into 84 different categories.

IV-B Implementation Details

The dataset is divided into three parts: 80% for training, 10% for validation, and 10% for testing. Training is performed offline using PyTorch on an AMD EPYC MILAN 7413 CPU and an NVIDIA RTX A6000 GPU. The 2D detection network builds on the official YOLOv5 [33] code. For training, the SGD [38] optimizer is used for 300 epochs with a batch size of 16. The training results are shown in Tab. I, which evaluates 1374 images across several metrics. The results show that the trained model effectively identifies whether luggage trolleys are idle or occupied and accurately determines their bounding boxes.

TABLE I: Training Results of the Luggage Trolley Detection and Classification
Metric Overall Occupied Idle
Images 1374 1374 1374
Objects 3478 1706 1772
Precision 0.973 0.97 0.977
Recall 0.974 0.981 0.967
mAP50 0.982 0.981 0.984
mAP50-95 0.967 0.966 0.967

HRNet is utilized to detect 2D keypoints, and orientation detection is achieved through a combination of HRNet and ResNet. Training results ranging from 72 to 360 equal divisions are analyzed to examine the relationship between the angle discretization and accuracy. Each model is trained for 100 epochs with a batch size of 64. The training results are presented in Tab. II. Bins refer to equally spaced segments of the circle’s 360 degrees; for example, dividing the circle into 72 equal parts means each bin corresponds to an angle of 5 degrees (360° / 72). Average degree error (ADE) indicates the average error in degrees. Acc.-5°, Acc.-15°, and Acc.-30° specify the rate at which predictions fall within a 5°, 15°, and 30° error margin, respectively. Keypoint detection accuracy (KDA) reflects the model’s performance in identifying and accurately locating keypoints within images. It is evident that as the division becomes finer, the average angle error gradually decreases. From the comparison results, a division into 360 equal bins is selected, achieving an average degree error of less than 3° and a keypoint detection accuracy of 99.4%.

TABLE II: Comparison of Orientation and KeyPoint Detection Accuracy Across Various Angle Discretizations
Bins ADE Acc.-5° Acc.-15° Acc.-30° KDA
72 9.005 62.60% 89.20% 97.20% 99.40%
90 8.284 61.00% 88.40% 96.90% 99.40%
120 6.909 61.20% 95.40% 97.50% 99.40%
180 5.159 78.30% 97.00% 97.70% 99.40%
360 2.572 95.20% 97.80% 98.60% 99.40%
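The orientation metrics above can be computed with a wrap-around angular difference; the sketch below is a generic evaluator written for illustration, not the evaluation code used for Tab. II.

```python
import numpy as np

def angular_error(pred_deg: np.ndarray, gt_deg: np.ndarray) -> np.ndarray:
    """Absolute angular error in degrees with wrap-around (0 and 360 degrees coincide)."""
    diff = np.abs(pred_deg - gt_deg) % 360.0
    return np.minimum(diff, 360.0 - diff)

def orientation_metrics(pred_deg, gt_deg):
    err = angular_error(np.asarray(pred_deg, float), np.asarray(gt_deg, float))
    return {
        "ADE": float(err.mean()),
        "Acc-5": float((err <= 5).mean()),
        "Acc-15": float((err <= 15).mean()),
        "Acc-30": float((err <= 30).mean()),
    }

# Example: a 358-degree prediction against a 2-degree ground truth counts as a 4-degree error.
print(orientation_metrics([358.0, 90.0], [2.0, 100.0]))
```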
Figure 7: Visualization of detection results. The images are classified into two types of environments: outdoor and indoor. Consistent letters across the images denote various detection aspects of the same luggage trolley: the first row represents detection and classification, the second row shows keypoint detection, and the third row displays orientation prediction.

To enhance clarity, Fig. 7 displays two types of images: the left column for outdoor environments and the right column for indoor environments. The first row illustrates the detection and classification of luggage trolleys, with “yes” indicating occupied and “no” signifying idle. The second row highlights the detection of visible keypoints on the luggage trolleys, each marked with a white circle and a corresponding number. The third row displays the orientation of luggage trolleys, illustrated by a red line within a 360-degree circle.

IV-C Experiment Platform

Figure 8: Experimental platform for real-world experiment.

As shown in Fig. 8, this article employs a robot designed for luggage trolley collection. The robot measures 0.45 m $\times$ 0.416 m $\times$ 1.2 m, while the luggage trolley dimensions are 0.79 m $\times$ 0.525 m $\times$ 1.01 m. The robot is equipped with advanced sensors, including a LiDAR (Velodyne VLP-16) and a camera (Realsense D435i), driven by an onboard computer, and uses a specialized manipulator for efficient collection of the luggage trolley. The algorithms integrate with the Robot Operating System (ROS) and run in real time on the robot’s onboard computer, which is powered by an i7-1165G7 CPU and an NVIDIA RTX 2060 GPU.

IV-D Experiment Results

IV-D1 Detection Results

To verify the effectiveness of HPPS detection, comparative experiments are conducted on various detection components of the previous perception system. We select categories from our dataset that match the detection capabilities of Xiao’s method; for instance, for the occlusion experiments we choose a single idle luggage trolley, since Xiao’s method struggles with multiple or occupied luggage trolleys. Despite this, our method consistently outperforms Xiao’s in detection performance. Firstly, we compare our method and Xiao’s method [6] across different metrics for luggage trolley detection and classification, as shown in Tab. III. It evaluates the detection of a single, idle luggage trolley at various visibility thresholds (>80%, 40% to 80%, and <40%), detection accuracy in scenarios with one to three luggage trolleys without occlusion, and luggage trolley classification accuracy when luggage trolleys are idle or occupied. It also assesses performance in complex situations where visibility is between 40% and 80% and only one of three luggage trolleys is idle. Xiao’s method has varying success rates, which drop significantly in poor visibility conditions. In contrast, our method consistently achieves high accuracy, maintaining performance levels above 90% even in low visibility, indicating robust detection and classification in complex situations. An important note is that Xiao’s method cannot identify the states of the luggage trolley; this limitation has been effectively addressed with the help of our newly annotated dataset.

To evaluate the accuracy of keypoint detection, we compare HPPS with Xiao’s method under the same luggage trolley detection and classification conditions, as detailed in Tab. IV. It presents the accuracy rates for keypoint detection across three visibility levels of luggage trolleys: above 80%, between 40% and 80%, and below 40%. A keypoint is considered successfully detected if the sum of the average errors of its x and y coordinates does not exceed six pixels. Xiao’s method shows a detection rate of 75% for highly visible luggage trolleys, dropping to 33% for moderate visibility and 9% for low visibility. In contrast, our method significantly improves the detection rates to 89% for high visibility, 70% for moderate visibility, and 50% for low visibility, clearly outperforming Xiao’s approach across all categories.

TABLE III: Comparison of Two Methods Regarding the Accuracy of Luggage Trolley Detection and Classification
Metric                                                   Xiao's Method [6]   Our Method
Trolley Visibility (Single Trolley, Idle)
  >80%                                                   83% (25/30)         100% (30/30)
  40% - 80%                                              77% (23/30)         97% (29/30)
  <40%                                                   20% (6/30)          93% (28/30)
Trolley Count (No Occlusion, Idle)
  One                                                    93% (28/30)         100% (30/30)
  Two                                                    92% (55/60)         100% (60/60)
  Three                                                  81% (73/90)         100% (90/90)
Trolley Classification (Single Trolley, No Occlusion)
  Idle                                                   N/A                 100% (30/30)
  Occupied                                               N/A                 100% (30/30)
Complex Situations (Poor Visibility, Multiple Trolley Counts and Classifications)
  Recognition of the Idle Trolley                        N/A                 93% (28/30)
TABLE IV: Comparison of Two Methods Regarding the Keypoint Detection Accuracy
Visibility Xiao’s Method [6] Our Method
>80% 75% (113/150) 89% (133/150)
40% - 80% 33% (31/93) 70% (65/93)
<40% 9% (5/58) 50% (29/58)

IV-D2 Localization Results

Figure 9: Running time comparison on the robot. The horizontal axis is the image frame, and the vertical axis is the running time.

The two methods are implemented on the robot, and their real-time performance is assessed, as illustrated in Fig. 9. Over 150 image frames, the average execution time for our method is 0.180 ± 0.017 seconds, compared to 0.138 ± 0.012 seconds for Xiao’s method. Although our approach has slightly longer running times due to an additional network model for orientation estimation and a greater number of model parameters, it still meets the real-time requirements of robotic perception systems by delivering timely pose estimation.

Figure 10: Comparison of ground truth, our method, and Xiao’s method for the poses of a moving luggage trolley with no occlusion. The four subfigures show the $x$ coordinate, $y$ coordinate, orientation ($\theta$), and moving trajectory, respectively.
Figure 11: Comparison of ground truth, our method, and Xiao’s method for the poses of a moving luggage trolley with static obstacle occlusion. The four subfigures show the $x$ coordinate, $y$ coordinate, orientation ($\theta$), and moving trajectory, respectively.
Figure 12: Comparison of ground truth, our method, and Xiao’s method for the poses of a moving luggage trolley with occupied trolley occlusion. The four subfigures show the $x$ coordinate, $y$ coordinate, orientation ($\theta$), and moving trajectory, respectively.
TABLE V: The Mean and Standard Deviation of $x$, $y$, and $\theta$ for Two Methods Under Three Different Occlusion Conditions
Occlusion Conditions Xiao’s Method [6] Our Method
$x$ $y$ $\theta$ $x$ $y$ $\theta$
No Occlusion 0.1199 ± 0.1223 0.0766 ± 0.0882 0.1285 ± 0.4217 0.1136 ± 0.0766 0.0933 ± 0.0703 0.0710 ± 0.0653
Occluded by Static Obstacles 0.1428 ± 0.1024 0.0961 ± 0.0485 0.1978 ± 0.5082 0.1320 ± 0.1132 0.1312 ± 0.1091 0.1228 ± 0.1066
Occluded by an Occupied Luggage Trolley 1.2550 ± 0.9181 1.3454 ± 1.1319 2.6815 ± 1.7123 0.1496 ± 0.1067 0.1209 ± 0.0841 0.0848 ± 0.0719

HPPS and Xiao’s method are tested on a robot to evaluate their ability to detect the pose of a moving luggage trolley. The 3D poses detected by these two methods are compared with ground truth measured by a motion capture system. Three conditions are tested: no occlusion, static obstacle occlusion, and luggage trolley occlusion, as illustrated in Fig. 10, 11, and 12, respectively. It is important to note that the version of Xiao’s method used in this experiment has been modified. Initially, Xiao’s method requires identifying six keypoints to determine the pose of the luggage trolley. However, the EPnP calculation needs only four or more keypoints. Thus, Xiao’s method has been optimized here to ensure pose estimation with four or more keypoints.

In scenarios without occlusion (refer to Fig. 10), both methods effectively capture the pose of the luggage trolley. However, self-occlusion during the luggage trolley’s movement results in incomplete trajectory tracking by Xiao’s method, while our method consistently achieves complete recognition. With static obstacles present (refer to Fig. 11), the limitations of Xiao’s method become more apparent, particularly in sections occluded by obstacles where the luggage trolley’s pose is frequently lost. Conversely, our method maintains complete acquisition of the luggage trolley. In situations where an occupied luggage trolley causes the occlusion (refer to Fig. 12), Xiao’s method struggles to distinguish the trolley’s states, often failing to capture the pose of the target luggage trolley. In contrast, our method effectively and accurately detects and locates the idle luggage trolley.

Furthermore, these three experimental results illustrate that as the detection distance surpasses 5 meters, oscillations in the pose obtained by HPPS increase, leading to a decrease in localization accuracy. This decrease originates from the precision of the keypoints’ image coordinates: as the proportion of the object within the image becomes small, even minor errors in image coordinates lead to significant localization inaccuracies. In contrast, at closer detection distances, localization is more accurate and smoother. Thus, a progressive perception strategy has been integrated into this system. Following the acquisition of the luggage trolley’s initial pose, the system continuously updates the luggage trolley’s pose with effective observation information (refer to Sec. III-D), thereby reducing errors associated with long-distance localization.

Tab. V provides a more detailed comparison of the mean and standard deviations of the proposed HPPS and Xiao’s method for measurements of horizontal distance ($x$), vertical distance ($y$), and angle ($\theta$). The table shows that the proposed method consistently maintains the mean and standard deviation of the localization error, regardless of occlusion. In contrast, Xiao’s method shows increased localization errors under occlusion conditions, particularly in scenarios with multiple luggage trolleys. Although the mean and standard deviation values of Xiao’s method under static obstacle occlusion seem similar to ours, this is because Xiao’s method yields few valid measurements under occlusion, leading to incomplete statistics. Nonetheless, our method equals or exceeds Xiao’s method in performance, demonstrating greater resilience to occlusion and improved localization robustness.

IV-D3 Robot Trial Results

Figure 13: Snapshots from the real-world experimental process. The orange, blue, and green boxes signify three different views. The big red box represents the detection of the luggage trolley’s classification, keypoints, and orientation from the robot’s view. The small red box represents the robot’s and luggage trolley’s position at different moments in the visualization.

As shown in Fig. 13, a real-world experiment is conducted to verify the proposed perception system, HPPS. To simulate the airport environment, two pedestrians with luggage are arranged to stop and chat, and another pedestrian pushes a luggage trolley. Through HPPS, the robot initially detects an idle luggage trolley under partial occlusion. It then locates the luggage trolley and navigates towards it while avoiding pedestrians and obstacles. HPPS filters out interference from occupied luggage trolleys during navigation and refines the target’s pose as the robot moves closer. Finally, the robot grasps the luggage trolley and stops detection during transportation.

V Conclusions and Future Work

This article introduces the HPPS, which detects the position and orientation of luggage trolleys separately and gradually updates the luggage trolley’s pose during navigation. This innovative hierarchical processing structure simplifies the labeling needs for datasets and enhances practicality and scalability. Additionally, the system’s progressive perception improves robustness and accuracy for locating luggage trolleys. The HPPS robustly completes the localization task, even in cases of partial occlusion. We have tested the system’s accuracy and robustness in experiments and demonstrated its capabilities on actual trolley collection tasks in complex environments. Thus, the HPPS enhances the robot’s ability to detect and localize luggage trolleys, providing a foundation for the subsequent operation and execution of the robotic autonomous luggage trolley collection system.

In future work, we intend to explore the potential of multi-robot cooperation to enhance environment perception and further advance our research of multi-robot motion planning.

Conflict of Interest

Jiankun Wang is an Associate Editor of IEEE Transactions on Intelligent Vehicles.

References

  • [1] Y. Huang, J. Du, Z. Yang, Z. Zhou, L. Zhang, and H. Chen, “A survey on trajectory-prediction methods for autonomous driving,” IEEE Transactions on Intelligent Vehicles, vol. 7, no. 3, pp. 652–674, 2022.
  • [2] L. Chen, Y. Li, C. Huang, B. Li, Y. Xing, D. Tian, L. Li, Z. Hu, X. Na, Z. Li, S. Teng, C. Lv, J. Wang, D. Cao, N. Zheng, and F.-Y. Wang, “Milestones in autonomous driving and intelligent vehicles: Survey of surveys,” IEEE Transactions on Intelligent Vehicles, vol. 8, no. 2, pp. 1046–1056, 2023.
  • [3] S. Zhou, J. Chen, S. Teng, H. Zhang, and F.-Y. Wang, “Integrating sustainability in future traffic lighting: Designing efficient light systems for vehicle, road, and traffic,” IEEE Transactions on Intelligent Vehicles, vol. 9, no. 1, pp. 65–68, 2024.
  • [4] Y. Lin, X. Na, D. Wang, X. Dai, and F.-Y. Wang, “Mobility 5.0: Smart logistics and transportation services in cyber-physical-social systems,” IEEE Transactions on Intelligent Vehicles, vol. 8, no. 6, pp. 3527–3532, 2023.
  • [5] Q. Yang, Y. Ai, S. Teng, Y. Gao, C. Cui, B. Tian, and L. Chen, “Decoupled real-time trajectory planning for multiple autonomous mining trucks in unloading areas,” IEEE Transactions on Intelligent Vehicles, vol. 8, no. 10, pp. 4319–4330, 2023.
  • [6] A. Xiao, H. Luan, Z. Zhao, Y. Hong, J. Zhao, W. Chen, J. Wang, and M. Q.-H. Meng, “Robotic autonomous trolley collection with progressive perception and nonlinear model predictive control,” in 2022 International Conference on Robotics and Automation (ICRA).   IEEE, 2022, pp. 4480–4486.
  • [7] P. Xie, B. Xia, A. Hu, Z. Zhao, L. Meng, Z. Sun, X. Gao, J. Wang, and M. Q.-H. Meng, “Autonomous multiple-trolley collection system with nonholonomic robots: Design, control, and implementation,” arXiv preprint arXiv:2401.08433, 2024.
  • [8] B. Xia, H. Luan, Z. Zhao, X. Gao, P. Xie, A. Xiao, J. Wang, and M. Q.-H. Meng, “Collaborative trolley transportation system with autonomous nonholonomic robots,” in 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2023, pp. 8046–8053.
  • [9] Z. Sun, W. Chen, J. Wang, and M. Q.-H. Meng, “A systematic evaluation of different indoor localization methods in robotic autonomous luggage trolley collection at airports,” arXiv preprint arXiv:2303.06551, 2023.
  • [10] V. Lepetit, F. Moreno-Noguer, and P. Fua, “EPnP: An accurate O(n) solution to the PnP problem,” International Journal of Computer Vision, vol. 81, pp. 155–166, 2009.
  • [11] J. Pan, X. Mai, C. Wang, Z. Min, J. Wang, H. Cheng, T. Li, E. Lyu, L. Liu, and M. Q.-H. Meng, “A searching space constrained partial to full registration approach with applications in airport trolley deployment robot,” IEEE Sensors Journal, vol. 21, no. 10, pp. 11 946–11 960, 2020.
  • [12] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 580–587.
  • [13] R. Girshick, “Fast r-cnn,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1440–1448.
  • [14] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” Advances in neural information processing systems, vol. 28, 2015.
  • [15] P. Jiang, D. Ergu, F. Liu, Y. Cai, and B. Ma, “A review of yolo algorithm developments,” Procedia computer science, vol. 199, pp. 1066–1073, 2022.
  • [16] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “Ssd: Single shot multibox detector,” in Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I 14.   Springer, 2016, pp. 21–37.
  • [17] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, “Realtime multi-person 2d pose estimation using part affinity fields,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 7291–7299.
  • [18] A. Newell, K. Yang, and J. Deng, “Stacked hourglass networks for human pose estimation,” in Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part VIII 14.   Springer, 2016, pp. 483–499.
  • [19] A. Toshev and C. Szegedy, “Deeppose: Human pose estimation via deep neural networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 1653–1660.
  • [20] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh, “Convolutional pose machines,” in Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 2016, pp. 4724–4732.
  • [21] K. Sun, B. Xiao, D. Liu, and J. Wang, “Deep high-resolution representation learning for human pose estimation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 5693–5703.
  • [22] I. Steinwart and A. Christmann, Support vector machines.   Springer Science & Business Media, 2008.
  • [23] J. Choi, B.-J. Lee, and B.-T. Zhang, “Human body orientation estimation using convolutional neural network,” arXiv preprint arXiv:1609.01984, 2016.
  • [24] M. Raza, Z. Chen, S.-U. Rehman, P. Wang, and P. Bao, “Appearance based pedestrians’ head pose and body orientation estimation using deep learning,” Neurocomputing, vol. 272, pp. 647–659, 2018.
  • [25] D. Yu, H. Xiong, Q. Xu, J. Wang, and K. Li, “Continuous pedestrian orientation estimation using human keypoints,” in 2019 IEEE International Symposium on Circuits and Systems (ISCAS).   IEEE, 2019, pp. 1–5.
  • [26] A. Ghodrati, M. Pedersoli, and T. Tuytelaars, “Is 2d information enough for viewpoint estimation?” in British Machine Vision Conference, 2014. [Online]. Available: https://api.semanticscholar.org/CorpusID:8230302
  • [27] Y. Xiang, T. Schmidt, V. Narayanan, and D. Fox, “Posecnn: A convolutional neural network for 6d object pose estimation in cluttered scenes,” arXiv preprint arXiv:1711.00199, 2017.
  • [28] Y. Li, G. Wang, X. Ji, Y. Xiang, and D. Fox, “Deepim: Deep iterative matching for 6d pose estimation,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 683–698.
  • [29] S. Peng, Y. Liu, Q. Huang, X. Zhou, and H. Bao, “Pvnet: Pixel-wise voting network for 6dof pose estimation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 4561–4570.
  • [30] H. Wang, S. Sridhar, J. Huang, J. Valentin, S. Song, and L. J. Guibas, “Normalized object coordinate space for category-level 6d object pose and size estimation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 2642–2651.
  • [31] L. Bertoni, S. Kreiss, and A. Alahi, “Perceiving humans: from monocular 3d localization to social distancing,” IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 7, pp. 7401–7418, 2021.
  • [32] H. Chen, P. Wang, F. Wang, W. Tian, L. Xiong, and H. Li, “Epro-pnp: Generalized end-to-end probabilistic perspective-n-points for monocular object pose estimation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 2781–2790.
  • [33] G. Jocher, A. Stoken, J. Borovec, A. Chaurasia, L. Changyu, A. Hogan, J. Hajek, L. Diaconu, Y. Kwon, Y. Defretin et al., “ultralytics/yolov5: v5.0 - YOLOv5-P6 1280 models, AWS, Supervise.ly and YouTube integrations,” Zenodo, 2021.
  • [34] C. Wu, Y. Chen, J. Luo, C.-C. Su, A. Dawane, B. Hanzra, Z. Deng, B. Liu, J. Z. Wang, and C.-h. Kuo, “Mebow: Monocular estimation of body orientation in the wild,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 3451–3461.
  • [35] H. Ye, J. Zhao, Y. Pan, W. Chen, L. He, and H. Zhang, “Robot person following under partial occlusion,” in 2023 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2023, pp. 7591–7597.
  • [36] Y. Ma, S. Soatto, J. Košecká, and S. Sastry, An invitation to 3-d vision: from images to geometric models.   Springer, 2004, vol. 26.
  • [37] Z. Sun, B. Lei, P. Xie, F. Liu, J. Gao, Y. Zhang, and J. Wang, “Multi-risk-rrt: An efficient motion planning algorithm for robotic autonomous luggage trolley collection at airports,” IEEE Transactions on Intelligent Vehicles, 2024.
  • [38] L. Bottou, “Large-scale machine learning with stochastic gradient descent,” in Proceedings of COMPSTAT’2010: 19th International Conference on Computational Statistics, Paris, France, August 22–27, 2010, Keynote, Invited and Contributed Papers.   Springer, 2010, pp. 177–186.
Zhirui Sun received the B.E. degree in information engineering from the Department of Electronic and Electrical Engineering, Southern University of Science and Technology, Shenzhen, China, in 2019. He is currently pursuing the Ph.D. degree with the Department of Electronic and Electrical Engineering, Southern University of Science and Technology, Shenzhen, China. His research interests include robot perception and motion planning.
Zhe Zhang received his B.E. degree in Electrical Engineering from Beijing Jiaotong University, Beijing, China, in 2022. He is currently pursuing the M.S. degree with the Department of Electronic and Electrical Engineering, Southern University of Science and Technology, Shenzhen, China. His research interests include human-robot interaction and motion planning.
Jieting Zhao received a B.E. degree in information engineering from the China University of Mining & Technology, Beijing, China, in 2021. He is pursuing the M.S. degree with the Department of Electrical and Electronic Engineering, Southern University of Science and Technology, Shenzhen, China. His research interests involve human-robot interaction, person-following robots, and robot vision.
Hanjing Ye received the B.E. degree and M.S. degree from Guangdong University of Technology, Guangzhou, China, in 2019 and 2021, respectively. He is currently a Ph.D. candidate at the Department of Electronic and Electrical Engineering, Southern University of Science and Technology, Shenzhen, China. His research interests include human-robot interaction and mobile robot navigation.
Jiankun Wang (Senior Member, IEEE) received the B.E. degree in automation from Shandong University, Jinan, China, in 2015, and the Ph.D. degree from the Department of Electronic Engineering, The Chinese University of Hong Kong, Hong Kong, in 2019. During his Ph.D. degree, he spent six months with Stanford University, Stanford, CA, USA, as a Visiting Student Scholar, supervised by Prof. Oussama Khatib. He is currently an Assistant Professor with the Department of Electronic and Electrical Engineering, Southern University of Science and Technology, Shenzhen, China. His current research interests include motion planning and control, human–robot interaction, and machine learning in robotics. Currently, he serves as the associate editor of IEEE Transactions on Automation Science and Engineering, IEEE Transactions on Intelligent Vehicles, IEEE Robotics and Automation Letters, International Journal of Robotics and Automation, and Biomimetic Intelligence and Robotics.