A Survey on 3D Skeleton Based Person Re-Identification:
Approaches, Designs, Challenges, and Future Directions
Abstract
Person re-identification via 3D skeletons is an important emerging research area that triggers great interest in the pattern recognition community. With distinctive advantages for many application scenarios, a great diversity of 3D skeleton based person re-identification (SRID) methods have been proposed in recent years, effectively addressing prominent problems in skeleton modeling and feature learning. Despite recent advances, to the best of our knowledge, little effort has been made to comprehensively summarize these studies and their challenges. In this paper, we attempt to fill this gap by providing a systematic survey on current SRID approaches, model designs, challenges, and future directions. Specifically, we first formulate the SRID problem, and propose a taxonomy of SRID research with a summary of benchmark datasets, commonly-used model architectures, and an analytical review of different methods’ characteristics. Then, we elaborate on the design principles of SRID models from multiple aspects to offer key insights for model improvement. Finally, we identify critical challenges confronting current studies and discuss several promising directions for future research of SRID. A corresponding up-to-date resource is provided in the GitHub repository (github.com/Kali-Hac/3D-skeleton-based-person-re-ID-survey).
1 Introduction
Person re-identification (re-ID) is an essential pattern recognition task of matching and retrieving a person of interest across different views or scenes, which has been extensively applied to security authentication, smart surveillance, human tracking, and robotics Vezzani et al. (2013); Nambiar et al. (2019); Ye et al. (2021). Recent advancements in economical and precise skeleton-tracking devices (e.g., Kinect Shotton et al. (2011)) have simplified the acquisition of 3D skeletons, establishing them as a prevalent and versatile data modality for gait analysis and person re-ID Liao et al. (2020); Rao et al. (2023). Unlike conventional person re-ID methods that rely on appearance or facial characteristics Ye et al. (2021), 3D skeleton based person re-ID (SRID) models typically exploit body-structure features and motion patterns (e.g., gait Murray et al. (1964)) from 3D coordinates of key body joints to identify different persons. With unique merits such as small input size, enhanced privacy (e.g., without using appearances or faces), and good robustness against variations in view, scale, and background Han et al. (2017); Rao et al. (2021b), SRID has attracted a surge of attention from both academia and industry Andersson and Araujo (2015); Pala et al. (2019); Rao et al. (2021b); Rao and Miao (2023).
In recent years, research on SRID has gained momentum, leading to diversity in skeleton representations and model designs. Several early attempts Barbosa et al. (2012); Munaro et al. (2014a, b); Andersson and Araujo (2015); Pala et al. (2019) extract hand-crafted features such as skeleton descriptors in terms of pre-established anthropometric and gait attributes of the body. As these methods often necessitate domain expertise such as anatomy and kinematics Yoo et al. (2002) for skeleton modeling, they lack the ability to fully mine latent high-level features beyond human cognition. To resolve this challenge, recent mainstream methods Liao et al. (2020); Huynh-The et al. (2020); Rao et al. (2021b); Rashmi and Guddeti (2022); Rao and Miao (2023) leverage deep neural networks to automatically perform skeleton representation learning. One exemplar paradigm (termed “sequence learning methods”) is to model sequential dynamics of raw or normalized skeletons to capture body and pose features, mainly using long short-term memory (LSTM) and its variants Wei et al. (2020); Rao et al. (2021b). However, these methods rarely investigate intrinsic body relationships such as inter-joint motion correlations, thereby largely overlooking some valuable skeleton patterns. Another paradigm (termed “graph learning methods”) mitigates this problem by constructing skeleton graphs to model discriminative structural and actional features based on the interrelations of body parts Rao and Miao (2023), while it often requires efficient intra-skeleton modeling (e.g., structural-collaborative modeling Rao et al. (2021a)) and inter-skeleton modeling based on graph representations. Despite the aforementioned methods and their challenges, especially in skeleton representations, model designs, and performance, there has not yet been a survey that provides a systematic and comprehensive investigation of this area to advance model improvement and facilitate related research.
To fill this gap, we present the first survey on 3D skeleton based person re-identification (SRID), elucidating recent advancements of SRID methods, principles of model designs, existing challenges, and future directions. To the best of our knowledge, this is also the first work that provides a systematic taxonomy of different methods and design principles used in SRID. The structure of this survey is organized as follows. In Sec. 2, we first formulate the problem of SRID and provide the rationale of the proposed taxonomy. Then, we introduce and summarize benchmark datasets and representative approaches of SRID with an analysis of their characteristics in Sec. 3. Sec. 4 illustrates the principles of SRID model designs from three key aspects for model enhancements. In Sec. 5, we identify the key challenges of SRID, and provide a discussion of potential solutions and promising future directions. Finally, we conclude this paper in Sec. 6.

2 Preliminary
2.1 Problem Formulation
In the SRID task, given a skeleton sequence of the person of interest from the probe database, the goal of the model is to retrieve the identity of this person from the gallery database. Formally, a 3D skeleton sequence can be represented by $\boldsymbol{S}=(\boldsymbol{s}_{1},\cdots,\boldsymbol{s}_{f})\in\mathbb{R}^{f\times J\times 3}$, where $\boldsymbol{s}_{t}\in\mathbb{R}^{J\times 3}$ denotes the $t^{th}$ skeleton with 3D coordinates of $J$ body joints. Each skeleton sequence $\boldsymbol{S}$ corresponds to a person identity $y$, where $y\in\{1,\cdots,C\}$ and $C$ is the number of different classes (i.e., identities). We use $\Phi_{T}$, $\Phi_{P}$, and $\Phi_{G}$ to denote the training set, probe set, and gallery set that contain $N_{1}$, $N_{2}$, and $N_{3}$ skeleton sequences of different persons collected from different scenes or views. The SRID target is to learn a model that maps 3D skeleton sequences into effective representations, so that we can query the correct identity of an encoded skeleton sequence representation in the probe set by matching it with the sequence representations in the gallery set.
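The probe-to-gallery matching described above can be sketched as a nearest-neighbor query. The snippet below is only an illustration under stated assumptions: `encode_sequence` is a hypothetical stand-in for a learned encoder (it simply averages joint coordinates over time), and Euclidean distance is assumed as the matching metric; neither is prescribed by the formulation itself.

```python
import math
from typing import List, Sequence, Tuple

Joint = Tuple[float, float, float]   # (x, y, z) coordinates of one body joint
Skeleton = List[Joint]               # one frame of body joints
SkeletonSequence = List[Skeleton]    # consecutive frames of one person


def encode_sequence(sequence: SkeletonSequence) -> List[float]:
    """Hypothetical stand-in for a learned encoder: the temporal mean of every
    joint coordinate, flattened into a vector of length 3 * num_joints."""
    num_frames = len(sequence)
    num_joints = len(sequence[0])
    return [
        sum(frame[j][axis] for frame in sequence) / num_frames
        for j in range(num_joints)
        for axis in range(3)
    ]


def query_identity(probe: SkeletonSequence,
                   gallery: Sequence[Tuple[SkeletonSequence, int]]) -> int:
    """Return the identity of the gallery sequence whose representation is
    closest (Euclidean distance, an assumed metric) to the probe's."""
    probe_rep = encode_sequence(probe)
    best_identity, best_distance = -1, math.inf
    for gallery_sequence, identity in gallery:
        distance = math.dist(probe_rep, encode_sequence(gallery_sequence))
        if distance < best_distance:
            best_identity, best_distance = identity, distance
    return best_identity
```

In practice the encoder would be one of the models surveyed in Sec. 3, and the ranked list of gallery distances (rather than only the top match) is what retrieval metrics are typically computed from.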
2.2 Taxonomy Rationale
We show the taxonomy of SRID in Fig. 1, which classifies the SRID research from three perspectives: (1) Approaches (Sec. 3); (2) Model designs (Sec. 4); (3) Challenges and future directions (Sec. 5). For SRID approaches, we divide them into three categories, including hand-crafted methods using manually-extracted features (e.g., skeleton descriptors) (Sec. 3.2), sequence learning methods that perform sequential representation learning of 3D skeletons (Sec. 3.3), and graph learning methods that model 3D skeletons as graphs (Sec. 3.4), and further subcategorize them according to different learning focuses such as pose or graph dynamics. Corresponding to different levels of skeleton modeling, three key parts of model designs are elaborated, including intra-skeleton modeling (Sec. 4.1) that focuses on body structure and relations, inter-skeleton modeling (Sec. 4.2) that captures skeleton correlations and importance, and skeletal sequential modeling (Sec. 4.3) that learns sequential pose dynamics and motion semantics. In challenges and future directions, we elucidate three main challenges and their potential solutions from skeleton representations (Sec. 5.1), SRID data (Sec. 5.2), and model performance (Sec. 5.3). Different open directions of SRID such as its security and future real-world applications are also discussed (Sec. 5.4).
Study | Architecture/Algorithm | Dataset | Venue | Code Link | ||
Hand-Crafted Methods | ||||||
Munaro et al. (2014b) | SVM | BIWI, IAS | Person Re-Identification’2014 | — | ||
SPS Munaro et al. (2014c) | Euclidean/Hamming Metric Algorithm | BIWI, IAS | ICRA’2014 | — | ||
Munaro et al. (2014a) | KNN | BIWI, IAS | ICRA’2014 | — | ||
Gharghabi et al. (2015) | KNN | BIWI | ICIEV’2015 | — | ||
Pala et al. (2015) | Euclidean Metric Algorithm | BIWI, KinectREID | TCSVT’2015 | — | ||
Andersson and Araujo (2015) | KNN, SVM, MLP | KGBD | AAAI’2015 | |||
Bondi et al. (2016) | Euclidean Metric Algorithm | Florence 3D Re-Id | ICPR’2016 | — | ||
CoB Khamsemanan et al. (2017) | KNN, MLP | KGBD, Freestyle Walks | TIFS’2017 | |||
Nambiar et al. (2017) | KNN | KS20 | FG’2017 | — | ||
Nambiar et al. (2018) | KNN | KS20 | VISIGRAPP’2018 | — | ||
Pala et al. (2019) | Adaboost | BIWI, PAVIS, Florence 3D Re-ID | Computers&Graphics’2019 | — | ||
PM Elaoud et al. (2021) | Random Forest | BIWI, IAS | | Github-PM | ||
Rao et al. (2022) | Jaccard Metric Algorithm | BIWI, IAS, KS20 | | — | ||
Sequence Learning Methods | ||||||
Wei et al. (2020) | Bi-LSTM | BIWI, IAS | ACIVS’2020 | — | ||
Huynh-The et al. (2020) | CNN | UPCV1, UPCV2, KS20, SDUgait | Neurocomputing’2020 | — | ||
PoseGait Liao et al. (2020) | CNN | CASIA-B, CASIA-E | Pattern Recognition’2020 | Github-PoseGait | ||
AGE Rao et al. (2020) | LSTM, MLP | BIWI, IAS, KGBD | IJCAI’2020 | Github-AGE | ||
SGELA Rao et al. (2021b) | LSTM, MLP | BIWI, IAS, KGBD, KS20, CASIA-B | TPAMI’2021 | Github-SGELA | ||
Rashmi and Guddeti (2022) | LSTM | KGBD, KS20, UPCV1, UPCV2 | JVCIR’2022 | — | ||
SimMC Rao and Miao (2022a) | MLP | BIWI, IAS, KGBD, KS20, CASIA-B | IJCAI’2022 | Github-SimMC | ||
Hi-MPC Rao et al. (2023) | MLP | BIWI, IAS, KGBD, KS20, CASIA-B | IJCV’2023 | Github-Hi-MPC | ||
Graph Learning Methods | ||||||
MG-SCR Rao et al. (2021c) | GAT, MLP | BIWI, IAS, KGBD, KS20, CASIA-B | IJCAI’2021 | Github-MG-SCR | ||
SM-SGE Rao et al. (2021a) | GAT/GRN, MLP | BIWI, IAS, KGBD, KS20, CASIA-B | ACM MM’2021 | Github-SM-SGE | ||
SPC-MGR Rao and Miao (2022b) | GAT, MLP | BIWI, IAS, KGBD, KS20, CASIA-B | Arxiv’2022 | Github-SPC-MGR | ||
TranSG Rao and Miao (2023) | Transformer, MLP | BIWI, IAS, KGBD, KS20, CASIA-B | CVPR’2023 | Github-TranSG |
Dataset | Year | Source | # ID | # Skeletons | # View | ||
---|---|---|---|---|---|---|---|
BIWI RGBD-ID[1] | 2013 | Kinect V1 | 50 | 205.8K | Ego | ||
IAS-Lab RGBD-ID[2] | 2013 | Kinect V1 | 11 | 89.0K | Ego | ||
KGBD[3] | 2014 | Kinect V1 | 164 | 188.7K | Ego | ||
KinectREID[4] | 2015 | Kinect V1 | 71 | 4.8K | 7 | ||
UPCV1[5] | 2015 | Kinect V1 | 30 | 13.1K | Ego | ||
UPCV2[6] | 2016 | Kinect V2 | 30 | 26.3K | Ego | ||
Florence 3D Re-Id[7] | 2016 | Kinect V2 | 16 | 18.0K | Ego | ||
KS20[8] | 2017 | Kinect V2 | 20 | 36.0K | 5 | ||
CASIA-B-3D[9] | 2020 | RGB-estimated | 124 | 706.5K | 11 | ||
PoseTrackReID-2D[10] | 2020 | RGB-estimated | 5350 | 53.6K | — |
3 Approaches and Datasets
In this section, we first provide an overview of SRID benchmark datasets, models, and algorithms. Then, we detail representative SRID approaches of three categories (corresponding to the taxonomy in Sec. 2.2) with an analysis of their pros and cons. All representative methods are summarized in Table 1.
3.1 Overview
Benchmark Datasets. Table 2 provides the statistics of commonly-used datasets for evaluating SRID methods, which can be categorized into two types: (1) Sensor-based datasets, in which 3D skeleton data are captured from depth sensors such as Kinect, and (2) RGB-estimated datasets, including CASIA-B-3D and PoseTrackReID-2D, in which skeleton data are estimated from RGB videos using pose estimation models Cao et al. (2019); Chen and Ramanan (2017). Existing SRID datasets typically contain skeleton data collected from varying scenarios such as multiple views (KS20 Nambiar et al. (2017), KinectREID Pala et al. (2015)), appearance and clothing changes (BIWI RGBD-ID Munaro et al. (2014b), IAS-Lab RGBD-ID Munaro et al. (2014c)), and/or different illumination conditions (KGBD Andersson and Araujo (2015)), which enables a comprehensive evaluation of both short-term and long-term SRID performance. As SRID is an emerging task, its datasets are currently limited in both number and scale, which is further discussed in Sec. 5.2.
Models and Algorithms. As summarized in Table 1, Support Vector Machine (SVM), K-Nearest Neighbor (KNN), Multi-Layer Perceptron (MLP), and different metric algorithms (including Euclidean, Hamming, Jaccard distance metrics) are mainstream algorithms used in hand-crafted methods, while LSTM, CNN, MLP and their combination are extensively utilized in sequence learning models to capture sequential dynamics of skeletons. The representative sequence methods contain Bi-LSTM Wei et al. (2020), attention-based LSTM encoder-decoder Rao et al. (2020, 2021b), temporal-spatial CNN Liao et al. (2020), etc. The recent advancements of graph learning methods are primarily focused on the development of graph attention network (GAT), Transformer, and their variants, including graph relation networks (GRN) Rao et al. (2021a) and skeleton graph transformer Rao and Miao (2023).
3.2 Hand-Crafted Methods
Early research endeavors Barbosa et al. (2012); Munaro et al. (2014b); Andersson and Araujo (2015); Khamsemanan et al. (2017); Nambiar et al. (2017) design skeleton or body-joint descriptors in terms of anthropometric (e.g., bone lengths) and gait attributes (e.g., speed, joint angles) for person re-ID. For example, Euclidean distances between different joint pairs are computed as descriptors in Barbosa et al. (2012), which are followed and extended by different studies Munaro et al. (2014b); Pala et al. (2019) using domain knowledge such as human anatomy. These hand-crafted features are fed to different classifiers (e.g., KNN, SVM) to perform person re-ID. They are also combined with different metric algorithms Pala et al. (2015); Rao et al. (2022) or other modalities such as 3D point clouds and face descriptors Gharghabi et al. (2015); Bondi et al. (2016); Pala et al. (2019); Munaro et al. (2014a) to further boost the person re-ID accuracy.
Hand-crafted methods possess relatively high explainability and efficiency: (1) They manually extract features based on well-established skeleton knowledge, which can be explained by domain experts Yoo et al. (2002). (2) Most of these methods utilize classic machine learning models (e.g., SVM) that are supported by solid theoretical foundations. Their model size and computational complexity are usually lower than those of deep neural networks, leading to higher efficiency under the same performance. However, they often lack the ability to fully represent advanced motion-related concepts (e.g., motion consistency) or latent high-level features beyond human cognition, which might limit their performance on capturing complex class-related motion patterns. Another limitation is their inferior performance compared to data-driven deep learning techniques, especially on large-scale datasets Zhou et al. (2017).
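As a rough illustration of a distance-based skeleton descriptor of the kind described above, the following sketch computes the Euclidean distance for every joint pair in a skeleton and averages these distances over a sequence. The function names and the temporal averaging step are assumptions made for illustration and do not reproduce the exact descriptor of any cited work.

```python
import math
from itertools import combinations
from typing import List, Tuple

Joint = Tuple[float, float, float]


def pairwise_joint_distances(skeleton: List[Joint]) -> List[float]:
    """Euclidean distance for every unordered joint pair (i < j)."""
    return [math.dist(a, b) for a, b in combinations(skeleton, 2)]


def sequence_descriptor(sequence: List[List[Joint]]) -> List[float]:
    """Average each pairwise distance over all frames of a sequence
    (an assumed aggregation; other statistics are equally possible)."""
    per_frame = [pairwise_joint_distances(skeleton) for skeleton in sequence]
    return [sum(values) / len(values) for values in zip(*per_frame)]
```

The resulting fixed-length vector could then be passed to a classic classifier such as KNN or SVM, or compared directly with a distance metric.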
3.3 Sequence Learning Methods
Sequence learning methods typically leverage deep neural networks to learn pose dynamics or latent motion semantics from consecutive 3D skeletons: (1) Pose Dynamics Models: Wei et al. (2020); Rashmi and Guddeti (2022) exploit LSTM Hochreiter and Schmidhuber (1997) and its variants to capture temporal dynamics of pose features such as body part lengths, joint distances, relative joint positions, joint angles within a gait cycle for SRID, while Huynh-The et al. (2020); Liao et al. (2020) further model poses as spatial geometric features and temporal statistic features, which are encoded by coarse-to-fine level convolutional architectures; (2) Latent Motion Semantics Models: AGE Rao et al. (2020) and SGELA Rao et al. (2021b) devise self-supervised high-level semantics tasks including skeleton sequence reconstruction, sorting, and prediction to encourage the LSTM-based encoder-decoder model to learn sequential dynamics and useful motion/gait semantics such as motion continuity and local skeleton/joint correlations (defined as locality in Rao et al. (2021b)) for person re-ID. SimMC Rao and Miao (2022a) proposes an MLP-based skeleton prototype learning framework with intra-sequence similarity learning and masked contrastive learning, which respectively facilitate learning local motion continuity and high-level class semantics for SRID. Hi-MPC Rao et al. (2023) further designs hierarchical prototype learning with a hard skeleton mining approach to learn multi-level motion semantics of key informative skeletons in an unsupervised manner without using skeleton labels.
Sequence learning methods can automatically model sequential dynamics of skeletons to learn latent effective representations without necessitating external domain knowledge. They typically do not require complicated pre-modeling such as graph modeling and can directly use raw or standardized skeleton sequences for learning, so they could possess lower model complexity and fewer model parameters than graph learning models Rao and Miao (2022a). However, as these methods do not explicitly model intrinsic structural relations and motion correlations of the human body, they might largely ignore some valuable skeleton patterns for SRID. On the other hand, they usually require devising effective learning tasks such as skeleton sequence prediction to enhance skeleton or motion semantics learning.
3.4 Graph Learning Methods
Recently, 3D skeleton graph models have been widely employed to capture body structural and actional features, which can be roughly divided into two categories: (1) Graph Dynamics Models: MG-SCR Rao et al. (2021c) constructs multi-level skeleton graphs based on physical connections of body joints or parts, and utilizes GAT Velickovic et al. (2018) and LSTM to model joint relations and graph dynamics for person re-ID. In SM-SGE Rao et al. (2021a), a self-supervised multi-scale skeleton reconstruction and prediction mechanism is further integrated to enhance the discriminative graph dynamics learning for SRID; (2) Graph Prototype Models: Combining structural-collaborative body relation modeling based on GAT, SPC-MGR Rao and Miao (2022b) proposes an unsupervised clustering-contrasting paradigm to contrast and learn the most representative features, i.e., feature cluster centroids (termed prototypes), from different-level skeleton graph representations for person re-ID. In TranSG Rao and Miao (2023), a skeleton graph transformer is devised to simultaneously learn both structural and actional relations of body joints, while the model utilizes ground-truth skeleton labels to generate more reliable graph prototypes to capture richer class-specific semantics for SRID.
Graph learning methods explicitly model human body structure and relations based on the constructed skeleton graphs, which can mine richer structural features and motion patterns than non-graph methods. In contrast to sequence learning methods that learn latent motion semantics, graph-based models can offer more intuitive explanations such as visualization of joint nodes and relations Rao and Miao (2023), thereby aiding in selecting key features and further improving the model. Despite several advantages, these methods typically require pre-defining skeletal topology (e.g., mapping body joints to specific parts) to construct graphs of different levels, and necessitate combining multi-level and/or multi-stage relation modeling Rao et al. (2021c) for effective intra-skeleton and inter-skeleton feature learning. Recent efforts such as TranSG Rao and Miao (2023) unify different relation modeling into a transformer-based framework, while they demand multiple learning tasks (e.g., graph structure prediction) to enhance the capture of valuable graph semantics.
4 Model Designs
To design an effective SRID model, three essential aspects are generally considered and combined: (1) Intra-skeleton modeling, which focuses on feature learning of body joints, parts, and relations in skeletons; (2) Inter-skeleton modeling, which captures inherent importance or key correlations of different skeletons; and (3) Skeletal sequential modeling, which aims to learn skeletal pose dynamics, gait attributes, and latent motion semantics from consecutive skeletons. In this section, we elaborate on the design principles with corresponding key motivation and representative solutions for each aspect.
4.1 Intra-Skeleton Modeling
4.1.1 Structural Features
As a 3D skeleton retains both absolute and relative positions of different body components based on the physically-connected skeletal topology Han et al. (2017), we can extract various structural measurements such as bone lengths, joint angles, and relative positions as discriminative skeleton features for SRID Munaro et al. (2014c, a); Andersson and Araujo (2015); Khamsemanan et al. (2017). From the perspective of graphs, the body joints can be naturally modeled as nodes with their structural connections as edges Rao et al. (2021a); Rao and Miao (2022b), which explicitly embeds the whole body structure into graphs to capture low-level spatial attributes such as neighbor joints’ coordinates and velocity, or learn high-level structural features and graph semantics using graph-based deep neural networks.
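The structural measurements and the graph view mentioned above can be sketched as follows. This is a minimal example assuming a hypothetical four-joint chain (`EXAMPLE_BONES`); real datasets define their own joint indices and bone lists, and the helper names are illustrative only.

```python
import math
from typing import List, Tuple

Joint = Tuple[float, float, float]
Bone = Tuple[int, int]  # indices of two physically connected joints

# Hypothetical 4-joint chain (e.g., hip -> knee -> ankle -> foot), for illustration only.
EXAMPLE_BONES: List[Bone] = [(0, 1), (1, 2), (2, 3)]


def adjacency_matrix(num_joints: int, bones: List[Bone]) -> List[List[int]]:
    """Symmetric adjacency matrix: joints are nodes, bones are undirected edges."""
    adj = [[0] * num_joints for _ in range(num_joints)]
    for i, j in bones:
        adj[i][j] = adj[j][i] = 1
    return adj


def bone_lengths(skeleton: List[Joint], bones: List[Bone]) -> List[float]:
    """Length of every bone, i.e., the distance between its two end joints."""
    return [math.dist(skeleton[i], skeleton[j]) for i, j in bones]


def joint_angle(a: Joint, b: Joint, c: Joint) -> float:
    """Angle in radians at joint b between segments b->a and b->c.
    Assumes neither segment has zero length."""
    ba = [p - q for p, q in zip(a, b)]
    bc = [p - q for p, q in zip(c, b)]
    dot = sum(p * q for p, q in zip(ba, bc))
    norm = math.hypot(*ba) * math.hypot(*bc)
    # Clamp to [-1, 1] to guard against floating-point round-off.
    return math.acos(max(-1.0, min(1.0, dot / norm)))
```

Bone lengths and joint angles can serve directly as hand-crafted features, while the adjacency matrix is the usual starting point for graph-based networks such as GAT.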
4.1.2 Body Relations
Body relations are defined as internal relations of joints or body parts at the physical (e.g., adjacent relations) or kinematic level (e.g., virtual motion relations) Rao et al. (2021a), which could be exploited to capture unique body and motion features Aggarwal et al. (1998); Murray et al. (1964). According to Rao et al. (2021c, a), structural relations and collaborative relations are the two most important body relation types, which respectively capture local motion correlations between structurally-connected body components and global collaborative relations among different body components. GAT and its variants (e.g., MGRN Rao et al. (2021a)) are widely used to capture these relations, while some recent endeavors devise transformer-based models (e.g., TranSG Rao and Miao (2023)) to unify both structural and actional relations into full-relation learning of skeleton graphs based on self-attention mechanisms. Different types of joint relations can also be generalized to different-level body representations (e.g., coarse-to-fine graphs Rao and Miao (2022b)) to characterize motion correlations of higher-level components such as limbs, which facilitates capturing more global SRID features.
4.1.3 Skeletal Properties
Different anthropometric (e.g., body height, bone lengths), geometric (e.g., pairwise joint distances, joint angles), morphological (e.g., trajectory of movement, body symmetry), and gait attributes of body joints are commonly-used skeleton features for SRID Munaro et al. (2014c); Pala et al. (2019). From the view of gait, step length, stride length, gait cycle time, velocity, and related statistical measures can be extracted from 3D skeletons, most of which are unique skeletal attributes for human identification Andersson and Araujo (2015); Murray et al. (1964). Other useful skeletal properties such as higher-order motion descriptors (e.g., skeleton/joint acceleration) and specific relative positions based on different centers or coordinate systems Khamsemanan et al. (2017) could also help better depict skeleton motion patterns and provide distinctive features.
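Higher-order motion descriptors such as joint velocity and acceleration can be approximated from a joint trajectory with finite differences. The sketch below assumes a constant frame interval `dt` and uses simple forward differences; both are illustrative choices rather than a method taken from the cited studies.

```python
from typing import List, Tuple

Vector = Tuple[float, ...]


def finite_difference(trajectory: List[Vector], dt: float) -> List[Vector]:
    """First-order forward difference between consecutive frames.
    The output has one frame fewer than the input."""
    return [
        tuple((nxt[k] - cur[k]) / dt for k in range(len(cur)))
        for cur, nxt in zip(trajectory, trajectory[1:])
    ]


def velocity_and_acceleration(trajectory: List[Vector], dt: float):
    """Velocity is the difference of positions; acceleration is the
    difference of velocities (two frames fewer than the input)."""
    velocity = finite_difference(trajectory, dt)
    acceleration = finite_difference(velocity, dt)
    return velocity, acceleration
```

Because differencing amplifies sensor noise, such descriptors are usually more reliable when computed on smoothed or denoised trajectories.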
4.2 Inter-Skeleton Modeling
4.2.1 Skeleton Correlations
As one key to SRID learning is to capture motion patterns from 3D skeleton sequences, it is natural to consider their inherent property termed motion continuity Rao et al. (2020). This property ensures that skeletons in a small temporal interval will not undergo drastic changes, thus resulting in higher correlations among adjacent skeletons in the local context of a sequence. In this sense, the motion continuity can be exploited to learn the inherent skeleton correlations by employing different self-supervised learning tasks such as skeleton sequence reconstruction, sorting, and prediction Rao et al. (2021b, 2020). These tasks can help the model better focus on key motion context and aggregate correlated features for SRID. We can also devise intra-sequence similarity contrastive learning to adaptively learn skeleton correlations and motion consistency Rao and Miao (2022a). Recent works such as TranSG Rao and Miao (2023) further explore context-based sequence reconstruction based on partial skeleton trajectory or structure to infer key temporal motion or spatial relations for better skeleton representation learning.
4.2.2 Importance Ranking
Different skeletons and their feature representations typically possess different importance in characterizing discriminative patterns of a person, which can be exploited to mine key skeletons or hard samples Hermans et al. (2017) for SRID learning. Voting or weighting strategies based on accuracy, pose quality, or other reliability metrics can be devised to rank the importance of skeletons and combine them to achieve higher SRID performance Barbosa et al. (2012); Gharghabi et al. (2015); Bondi et al. (2016). To exploit more important skeletons for training, it is crucial to focus on the hardest negative or positive samples based on their informative value in classification Rao et al. (2023). Post-processing techniques such as $k$-reciprocal re-ranking Zhong et al. (2017); Rao et al. (2022) can also be potentially used to rank and combine key skeletons for SRID.
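One common way to pick hard samples, sketched below, follows the distance-based convention used with triplet losses: the hardest positive is the same-identity sample farthest from the anchor, and the hardest negative is the different-identity sample closest to it. This criterion and the function name are assumptions for illustration and are not necessarily the mining rule of any specific SRID method.

```python
import math
from typing import List, Optional, Sequence, Tuple

Sample = Tuple[List[float], int]  # (feature representation, identity label)


def hardest_positive_and_negative(
    anchor: Sample, candidates: Sequence[Sample]
) -> Tuple[Optional[int], Optional[int]]:
    """Return (index of hardest positive, index of hardest negative).
    An entry is None when no candidate of that kind exists."""
    anchor_rep, anchor_id = anchor
    hardest_pos, hardest_neg = None, None
    max_pos_dist, min_neg_dist = -math.inf, math.inf
    for index, (rep, identity) in enumerate(candidates):
        distance = math.dist(anchor_rep, rep)
        if identity == anchor_id and distance > max_pos_dist:
            # Same identity but far away: hard to pull together.
            hardest_pos, max_pos_dist = index, distance
        elif identity != anchor_id and distance < min_neg_dist:
            # Different identity but close: hard to push apart.
            hardest_neg, min_neg_dist = index, distance
    return hardest_pos, hardest_neg
```

The selected indices can then be used to weight or sample skeleton representations during training.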
4.3 Skeletal Sequential Modeling
4.3.1 Pose Dynamics
A sequence of skeletons typically conveys dynamics of unique body poses, which can be utilized to differentiate different persons. Existing endeavors usually extract hand-crafted features such as angles of key joints Liao et al. (2020), or utilize sequence learning models such as LSTM to automatically model long-term pose dynamics Rao et al. (2020); Rashmi and Guddeti (2022). Based on sequence learning models, employing some auxiliary tasks such as skeleton sequence reconstruction, sorting, prediction, or contrastive learning Rao et al. (2021b) could facilitate capturing richer spatial-temporal pose features for SRID.
4.3.2 Gait Attributes
Considering that different individuals often possess different walking (i.e., gait) patterns such as speed, stride length, gait cycle time, swing and stance times Murray et al. (1964), it is a common practice to compute these gait attributes from 3D skeleton sequences as important discriminators Andersson and Araujo (2015). Another feasible direction is to devise high-level gait learning tasks (e.g., gait reconstruction) or model various relations of body joints or parts to perform auto-encoding of gait attributes and semantics based on deep learning models such as sequence learning models Rao et al. (2021b, c, a).
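Two of the gait attributes above can be estimated from joint trajectories with simple geometry. The sketch below is a rough approximation under stated assumptions: walking speed is taken as the path length of a hip joint divided by elapsed time, and step length is approximated by the largest separation between the two ankles; the choice of joints, the frame rate `fps`, and both proxies are illustrative rather than taken from the cited works.

```python
import math
from typing import List, Tuple

Joint = Tuple[float, float, float]


def walking_speed(hip_trajectory: List[Joint], fps: float) -> float:
    """Average speed: distance travelled by the hip joint per second.
    Requires at least two frames."""
    path_length = sum(
        math.dist(a, b) for a, b in zip(hip_trajectory, hip_trajectory[1:])
    )
    elapsed_seconds = (len(hip_trajectory) - 1) / fps
    return path_length / elapsed_seconds


def step_length(left_ankle: List[Joint], right_ankle: List[Joint]) -> float:
    """Rough proxy for step length: the largest left/right ankle separation
    observed over the sequence."""
    return max(math.dist(l, r) for l, r in zip(left_ankle, right_ankle))
```

More careful estimates would first segment the sequence into gait cycles, which also yields attributes such as gait cycle time and swing or stance times.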
4.3.3 Latent Motion Semantics
Apart from hand-crafted motion or pose features that are usually extracted based on domain knowledge (e.g., anatomy and kinematics Yoo et al. (2002)), recent deep learning models can capture latent motion semantics from high-dimensional feature spaces, including high-level motion concepts (e.g., motion continuity Rao and Miao (2023), consistency Rao and Miao (2022a), locality Rao et al. (2021b)) and class-related motion semantics. Motivated by different properties of skeletons and motion (illustrated in Sec. 4.1 and 4.2), different semantics learning tasks at the skeleton level (e.g., masked skeleton reconstruction Rao and Miao (2023)), sequence level (e.g., sequence reconstruction, sorting, prediction, similarity contrasting Rao et al. (2021b)), and class level (e.g., prototype clustering and contrasting Rao and Miao (2022a)) are devised to learn effective class-related motion semantics for SRID.
5 Challenges and Future Directions
In this section, we first identify main challenges of SRID from skeleton representations (see Sec. 5.1), SRID data (see Sec. 5.2), and model performance (see Sec. 5.3). Then, we propose corresponding potential solutions, followed by a discussion of open future directions (see Sec. 5.4).
5.1 Multi-Level Skeleton Sequence Modeling
Most existing studies learn skeleton features from a single level (e.g., body joint level Rao et al. (2020, 2021b)), while they rarely explore comprehensive multi-level skeleton learning from structural features, body relations, and motion dynamics. This could cause valuable hierarchical skeleton information (e.g., global-local patterns Rao et al. (2023)) to be overlooked and limit the SRID performance.
Solutions and Directions: First, as the human body can be naturally modeled with several key functional regions at different levels (e.g., joints, limbs) Winter (2009), we can exploit different groups of joints to construct higher-level body representations (e.g., limbs) to characterize anthropometric or kinetic features within body structure from coarse to fine. Second, it is feasible to generalize the joint-level relation modeling to multi-level body representations. We can exploit the multi-grained relations between different-level body parts in terms of physical or kinematic attributes to mine richer body and motion features. Last, by devising and employing motion semantics learning tasks on skeleton sequences of different levels, the model can not only capture dynamics of local parts such as joints but also learn more high-level semantics (e.g., identity-related patterns) from global body components such as limbs.
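The first step above, building a higher-level body representation from groups of joints, can be sketched by averaging the positions of the joints in each part. The part names and joint indices in `EXAMPLE_PARTS` are hypothetical, and mean pooling is only one possible aggregation.

```python
from typing import Dict, List, Tuple

Joint = Tuple[float, float, float]

# Hypothetical grouping of a 6-joint skeleton into three parts;
# real datasets define their own joint indices and part layouts.
EXAMPLE_PARTS: Dict[str, List[int]] = {
    "torso": [0, 1],
    "left_leg": [2, 3],
    "right_leg": [4, 5],
}


def pool_to_parts(skeleton: List[Joint],
                  parts: Dict[str, List[int]]) -> Dict[str, Tuple[float, ...]]:
    """Represent each body part by the mean position of its joints."""
    pooled = {}
    for name, joint_ids in parts.items():
        pooled[name] = tuple(
            sum(skeleton[j][axis] for j in joint_ids) / len(joint_ids)
            for axis in range(3)
        )
    return pooled
```

Applying the same pooling again to groups of parts gives an even coarser level, so that relation modeling and semantics learning tasks can be run on each level separately.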
5.2 Scarcity, Imbalance, and Noise of SRID Data
Existing SRID data, i.e., 3D skeletons, are mainly collected from prevailing depth sensors such as Kinect Shotton et al. (2011), while diverse skeleton collection settings (e.g., different devices in uncontrollable environments) have not been thoroughly explored. In contrast to existing large-scale RGB-based person re-ID data (e.g., MSMT17 Wei et al. (2018)), the available data for SRID is relatively scarce and imbalanced. As shown in Table 2, existing Kinect-based SRID datasets contain 4.8 thousand to 205.8 thousand skeletons with fewer than 200 identities, while the numbers of skeletons belonging to each identity often differ greatly (i.e., imbalanced class distribution). This might negatively influence the learning and generalization ability of current data-driven SRID models using deep neural networks, with a risk of model over-fitting. On the other hand, both Kinect-based and RGB-estimated skeleton data unavoidably contain noise, which may be caused by factors such as the device’s tracking distance, illumination changes (which may affect the structured light used in Kinect V1), and the precision of pose estimation algorithms. Such inherent noise places high demands on the robustness of models against random perturbations.
Solutions and Directions: To address these challenges, more high-quality 3D skeleton data should be collected or generated. First, the number of different identities and the skeleton data size of each identity should be controlled within a reasonable range, enlarging the number of identities to hundreds and keeping the total data size of each identity as similar as possible. Such balanced data could help the model learn higher inter-identity difference with better robustness against intra-identity diversity Johnson and Khoshgoftaar (2019). To make up for the data scarcity and imbalance in this area and facilitate related research, we will collect and release a large-scale SRID dataset in the future. Second, to reduce skeleton noise and sample more balanced data, it is feasible to devise skeleton denoising models and augmentation strategies for skeleton generation. For example, the GAN-based pose generator Yan et al. (2017) and newly-emerging diffusion models Ho et al. (2020) could be transferred to generate and/or denoise 3D skeleton data, while skeleton-level, sequence-level, and class-level data augmentations can be further explored.
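As a simple example of sequence-level augmentation, the sketch below applies one random rotation about the vertical axis to a whole sequence (simulating a view change) and then adds small Gaussian jitter to every coordinate (simulating sensor noise). Treating the y-axis as vertical, the parameter names, and the two transformations themselves are assumptions for illustration; they are not an augmentation scheme from the cited works.

```python
import math
import random
from typing import List, Tuple

Skeleton = List[Tuple[float, ...]]


def rotate_about_vertical(skeleton: Skeleton, angle: float) -> Skeleton:
    """Rotate all joints by `angle` radians about the y-axis
    (assumed to be the vertical axis here)."""
    c, s = math.cos(angle), math.sin(angle)
    return [(x * c + z * s, y, -x * s + z * c) for x, y, z in skeleton]


def jitter(skeleton: Skeleton, sigma: float, rng: random.Random) -> Skeleton:
    """Add independent zero-mean Gaussian noise to every coordinate."""
    return [tuple(v + rng.gauss(0.0, sigma) for v in joint) for joint in skeleton]


def augment_sequence(sequence: List[Skeleton], max_angle: float,
                     sigma: float, seed: int = 0) -> List[Skeleton]:
    """Use one rotation angle for the whole sequence so that the motion
    stays consistent, then jitter each frame independently."""
    rng = random.Random(seed)
    angle = rng.uniform(-max_angle, max_angle)
    return [jitter(rotate_about_vertical(frame, angle), sigma, rng)
            for frame in sequence]
```

Since the rotation is rigid, bone lengths and joint angles are preserved, so only the jitter term perturbs the structural features of a skeleton.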
5.3 Model Robustness, Generality, and Interpretability
Existing SRID models Rao and Miao (2022a, 2023); Rao et al. (2023) report unstable performance variations that are sensitive to different model parameter initialization, hyper-parameter settings ( clustering parameters) and the diversity and quality of training datasets. Owing to such non-robust performance, many studies opt for the best-trained model with carefully-selected initialization and parameters, while they often fail to reflect the true average performance of corresponding architectures and typically exhibit limited generalization ability. On the other hand, due to training on a single dataset with limited data sizes, views, scenes or conditions, many SRID methods may only perform well on the scenarios that are similar to that of the training data, while they cannot generalize to more challenging data or real-world scenarios ( RGB-estimated skeleton data). Another important flaw of existing models is their weak interpretability, as they rarely provide human-friendly explanation for the effectiveness of model architectures, skeleton features, and prediction results. Such “black-box” models might increase the risk of wrong prediction, cause the opacity of performance improvement, and further hinder their large-scale application.
Solutions and Directions: To obtain a more robust SRID model, it is necessary to investigate the effects of model initialization, the most essential hyper-parameters, and data quality (e.g., extreme data noise or imbalance) on model performance. A comprehensive empirical evaluation of these factors can identify the key ones and guide the improvement or control of model robustness. Theoretical analyses of the performance variations in terms of model-approximated functions and convergence conditions Rao and Miao (2022a, 2023) can also be provided for a more robust model design. Moreover, fairer multi-faceted evaluation metrics, such as the average and standard deviation of performance over multiple runs, should be reported to better evaluate the overall robustness of models.
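The reporting practice suggested above can be sketched in a few lines. The helper name `summarize_runs` and the listed scores are hypothetical placeholders for this illustration, not results from any surveyed model; the sample (n-1) standard deviation is one reasonable choice when only a handful of runs is available.

```python
import statistics

def summarize_runs(scores):
    """Return (mean, sample standard deviation) of a metric over independent runs."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate a standard deviation")
    return statistics.mean(scores), statistics.stdev(scores)

# Hypothetical Rank-1 accuracies (%) from five runs with different random seeds.
scores = [58.0, 60.0, 62.0, 59.0, 61.0]
mean, std = summarize_runs(scores)
print(f"Rank-1: {mean:.1f} +/- {std:.2f} (n={len(scores)})")
```

Reporting the number of runs alongside the mean and deviation lets readers judge how reliable the estimate is.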
To improve model generality, a promising future direction is to exploit larger-scale SRID datasets containing diverse scenario settings to learn both domain-specific (e.g., identity-specific) and domain-general (e.g., identity-shared) gait features, so that the model can be transferred or generalized to different scenarios. It is also feasible to explore domain adaptation or generalization techniques to combine different datasets for model training and transfer Rao and Miao (2022b). More diverse benchmarks, such as RGB-estimated skeleton datasets like CASIA-B-3D Liao et al. (2020), should be explored for evaluating model generality.
To provide interpretability, different human-friendly explanations, including pose/feature visualization and text description, can be considered. There also exist various architecture-specific explanation mechanisms, such as class activation maps (CAM) Zhou et al. (2016) for CNNs, knowledge graphs Ji et al. (2021) for graph neural networks (GNNs), and attention visualization for transformers Rao and Miao (2023), which could be applied to explainable skeleton learning. Given the recent success of large language models (LLMs) such as GPT-4V, such models could serve as agents that join SRID model training to interpret the learning process: skeleton representations can be transformed into text (e.g., time series) or image inputs (e.g., pose images), and prompts can be devised to query the pose/feature importance. We provide a further discussion about prompt-based skeletal foundation models in Sec. 5.4.
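To make the idea of turning skeleton representations into text more concrete, the sketch below serializes a skeleton sequence as a textual time series and wraps it in a question. Everything here is an assumption for illustration: the function names, the joint names, the line format, and the prompt wording are hypothetical, and no particular LLM or API is implied.

```python
def skeleton_sequence_to_text(sequence, joint_names, precision=2):
    """Serialize a skeleton sequence as one text line per frame of named 3D joints."""
    lines = []
    for t, skeleton in enumerate(sequence):
        if len(skeleton) != len(joint_names):
            raise ValueError("each frame needs one coordinate triple per joint name")
        joints = "; ".join(
            f"{name}=({x:.{precision}f}, {y:.{precision}f}, {z:.{precision}f})"
            for name, (x, y, z) in zip(joint_names, skeleton)
        )
        lines.append(f"frame {t}: {joints}")
    return "\n".join(lines)

def build_importance_prompt(sequence, joint_names):
    """Wrap the serialized sequence in a hypothetical query about joint importance."""
    series = skeleton_sequence_to_text(sequence, joint_names)
    return (
        "The following lines give 3D joint coordinates of a walking person over time.\n"
        f"{series}\n"
        "Which joints change most across frames, and how might they characterize the gait?"
    )
```

A plain coordinate dump like this grows quickly with sequence length, so a practical variant would probably subsample frames or summarize per-joint statistics before prompting.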
5.4 Other Directions and Discussions
Multi/Cross-Modal Learning. Combining 3D skeleton data with other modalities such as RGB images, depth images, and radio frequency waves (e.g., radar waves) is a promising direction, as they can provide pose or gait information from different dimensions (e.g., appearances, silhouettes) to better identify different persons. Another direction is to transfer and fuse gait representations between the skeleton modality and other modalities, which can be the key to achieving more general and scalable skeleton learning for more multi-modal tasks.
Unified Evaluation Protocol. As existing SRID studies adopt either an identity classification protocol Rao et al. (2021b) or a re-ID matching protocol Rao et al. (2023), a more comprehensive unified evaluation protocol should be devised that provides not only accuracy-related metrics (e.g., Rank-1 accuracy, mAP) but also measures of model generality, robustness, and reliability. It is also imperative to formulate a fair cross-modality evaluation and comparison protocol that standardizes re-ID settings (e.g., probe/gallery settings, single/multi-shot recognition) for comparing skeleton-based methods with RGB/depth-based or multi-modal methods.
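For reference, the two accuracy-related metrics named above can be computed from a probe-by-gallery distance matrix roughly as follows. This is a simplified sketch under stated assumptions: the function name is hypothetical, probes whose identity is absent from the gallery are skipped, and no camera- or view-based filtering of gallery samples is applied, which some re-ID protocols require.

```python
def evaluate_retrieval(dist, probe_ids, gallery_ids):
    """Compute (Rank-1 accuracy, mAP) from a probe-by-gallery distance matrix."""
    rank1_hits, average_precisions = 0, []
    for p, row in enumerate(dist):
        # Rank gallery samples from the smallest to the largest distance.
        order = sorted(range(len(row)), key=lambda g: row[g])
        matches = [gallery_ids[g] == probe_ids[p] for g in order]
        if not any(matches):
            continue  # assumption: probes without a gallery match are skipped
        rank1_hits += matches[0]
        hits, precisions = 0, []
        for rank, is_match in enumerate(matches, start=1):
            if is_match:
                hits += 1
                precisions.append(hits / rank)  # precision at each correct match
        average_precisions.append(sum(precisions) / len(precisions))
    n = len(average_precisions)
    if n == 0:
        raise ValueError("no probe has a matching gallery identity")
    return rank1_hits / n, sum(average_precisions) / n
```

Fixing such details (which probes count, how ties and same-view samples are handled) is exactly the kind of standardization a unified protocol would need to specify.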
Prompt-Based Skeletal Foundation Model. Motivated by the success of LLMs and pose generative models Lucas et al. (2022), we can train or fine-tune such models with large-scale skeleton data to build a skeletal foundation model that supports using prompts (e.g., textual user interaction) for skeleton attribute generation (e.g., analyzing and summarizing gait attributes), skeleton augmentation, prediction, classification, and customizable applications (e.g., gait visualization, SRID). Constructing this foundation model is advantageous for investigating the scope of adaptability, universality, and interpretability in 3D skeleton data and SRID.
Privacy Protection of SRID. Although existing SRID models do not utilize or disclose human appearance information, and all publicly-available training skeleton data are completely anonymized, the privacy issue should be kept in mind when developing this emerging technology further (e.g., combining it with RGB images) Ye et al. (2024). As illegally or irresponsibly deploying person re-ID technologies might invade personal privacy, it is still important to establish SRID-relevant laws to protect personal privacy.
Future Real-World Applications. (1) Mobile Pattern Recognition: With smaller input data size and lower resource requirements than RGB-based models, SRID models are promising candidates for integration into mobile devices, where they can be combined with other sensor data (e.g., camera images) to jointly perform identity-aware action recognition, motion prediction, and gesture recognition tasks. (2) Gait-based Medical Diagnosis: As gait is one of the most important features used for SRID, the pre-trained model can potentially be transferred to different gait classification tasks. A promising direction is to apply SRID models to medical areas to assist in identifying patients and their abnormal gait (e.g., Parkinsonian gait). (3) Criminal Tracking: With good robustness to appearance changes and environmental variations Han et al. (2017), SRID models can be combined with other biometric features (e.g., faces) or RGB-based methods Li et al. (2019); Lan et al. (2020) to track criminals and monitor their activities via skeletal poses under more challenging scenarios.
6 Conclusion
In this paper, we present the first survey of 3D skeleton based person re-ID (SRID). We systematically review the recent advances of SRID approaches, encompassing their algorithms, model architectures, and benchmark datasets. The principles of SRID model designs are summarized to provide crucial insights for model improvement. We further identify critical challenges in SRID research, and propose several promising directions for future exploration.
References
- Aggarwal et al. [1998] Jake K Aggarwal, Quin Cai, W Liao, and Bikash Sabata. Nonrigid motion analysis: Articulated and elastic motion. Computer Vision and Image Understanding, 70(2):142–156, 1998.
- Andersson and Araujo [2015] Virginia O Andersson and Ricardo M Araujo. Person identification using anthropometric and gait data from Kinect sensor. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), pages 425–431, 2015.
- Barbosa et al. [2012] Igor Barros Barbosa, Marco Cristani, Alessio Del Bue, Loris Bazzani, and Vittorio Murino. Re-identification with RGB-D sensors. In European Conference on Computer Vision (ECCV) Workshop, pages 433–442. Springer, 2012.
- Bondi et al. [2016] Enrico Bondi, Pietro Pala, Lorenzo Seidenari, Stefano Berretti, and Alberto Del Bimbo. Long term person re-identification from depth cameras using facial and skeleton data. In International Conference on Pattern Recognition (ICPR) Workshop, pages 29–41, 2016.
- Cao et al. [2019] Zhe Cao, Gines Hidalgo, Tomas Simon, Shih-En Wei, and Yaser Sheikh. OpenPose: Realtime multi-person 2D pose estimation using part affinity fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(1):172–186, 2019.
- Chen and Ramanan [2017] Ching-Hang Chen and Deva Ramanan. 3D human pose estimation = 2D pose estimation + matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7035–7043, 2017.
- Chen et al. [2022] Di Chen, Andreas Döring, Shanshan Zhang, Jian Yang, Juergen Gall, and Bernt Schiele. Keypoint message passing for video-based person re-identification. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), volume 36, pages 239–247, 2022.
- Elaoud et al. [2021] Amani Elaoud, Walid Barhoumi, Hassen Drira, and Ezzeddine Zagrouba. Person re-identification from different views based on dynamic linear combination of distances. Multimedia Tools and Applications, 80:17685–17704, 2021.
- Gharghabi et al. [2015] Shaghayegh Gharghabi, Faraz Shamshirdar, and Taher Abbas Shangari, et al. People re-identification using 3D descriptor with skeleton information. In 2015 International Conference on Informatics, Electronics & Vision (ICIEV), pages 1–5. IEEE, 2015.
- Han et al. [2017] Fei Han, Brian Reily, William Hoff, and Hao Zhang. Space-time representation of people based on 3D skeletal data: A review. Computer Vision and Image Understanding, 158:85–105, 2017.
- Hermans et al. [2017] Alexander Hermans, Lucas Beyer, and Bastian Leibe. In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737, 2017.
- Ho et al. [2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems (NeurIPS), 33:6840–6851, 2020.
- Hochreiter and Schmidhuber [1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
- Huynh-The et al. [2020] Thien Huynh-The, Cam-Hao Hua, Nguyen Anh Tu, and Dong-Seong Kim. Learning 3D spatiotemporal gait feature by convolutional network for person identification. Neurocomputing, 397:192–202, 2020.
- Ji et al. [2021] Shaoxiong Ji, Shirui Pan, Erik Cambria, Pekka Marttinen, and Philip S Yu. A survey on knowledge graphs: Representation, acquisition, and applications. IEEE Transactions on Neural Networks and Learning Systems, 33(2):494–514, 2021.
- Johnson and Khoshgoftaar [2019] Justin M Johnson and Taghi M Khoshgoftaar. Survey on deep learning with class imbalance. Journal of Big Data, 6(1):1–54, 2019.
- Kastaniotis et al. [2015] Dimitris Kastaniotis, Ilias Theodorakopoulos, Christos Theoharatos, George Economou, and Spiros Fotopoulos. A framework for gait-based recognition using Kinect. Pattern Recognition Letters, 68:327–335, 2015.
- Kastaniotis et al. [2016] Dimitris Kastaniotis, Ilias Theodorakopoulos, George Economou, and Spiros Fotopoulos. Gait based recognition via fusing information from Euclidean and Riemannian manifolds. Pattern Recognition Letters, 84:245–251, 2016.
- Khamsemanan et al. [2017] Nirattaya Khamsemanan, Cholwich Nattee, and Nitchan Jianwattanapaisarn. Human identification from freestyle walks using posture-based gait feature. IEEE Transactions on Information Forensics and Security, 13(1):119–128, 2017.
- Lan et al. [2020] Long Lan, Xinchao Wang, Gang Hua, Thomas S Huang, and Dacheng Tao. Semi-online multi-people tracking by re-identification. International Journal of Computer Vision, 128(7):1937–1955, 2020.
- Li et al. [2019] Minxian Li, Xiatian Zhu, and Shaogang Gong. Unsupervised tracklet person re-identification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(7):1770–1782, 2019.
- Liao et al. [2020] Rijun Liao, Shiqi Yu, Weizhi An, and Yongzhen Huang. A model-based gait recognition method with body pose and human prior knowledge. Pattern Recognition, 98:107069, 2020.
- Lucas et al. [2022] Thomas Lucas, Fabien Baradel, Philippe Weinzaepfel, and Grégory Rogez. PoseGPT: Quantization-based 3D human motion generation and forecasting. In Proceedings of the European Conference on Computer Vision (ECCV), pages 417–435. Springer, 2022.
- Munaro et al. [2014a] Matteo Munaro, Alberto Basso, Andrea Fossati, Luc Van Gool, and Emanuele Menegatti. 3D reconstruction of freely moving persons for re-identification with a depth sensor. In International Conference on Robotics and Automation (ICRA), pages 4512–4519. IEEE, 2014.
- Munaro et al. [2014b] Matteo Munaro, Andrea Fossati, Alberto Basso, Emanuele Menegatti, and Luc Van Gool. One-shot person re-identification with a consumer depth camera. In Person Re-Identification, pages 161–181. Springer, 2014.
- Munaro et al. [2014c] Matteo Munaro, Stefano Ghidoni, Deniz Tartaro Dizmen, and Emanuele Menegatti. A feature-based approach to people re-identification using skeleton keypoints. In International Conference on Robotics and Automation (ICRA), pages 5644–5651. IEEE, 2014.
- Murray et al. [1964] M Pat Murray, A Bernard Drought, and Ross C Kory. Walking patterns of normal men. Journal of Bone and Joint Surgery, 46(2):335–360, 1964.
- Nambiar et al. [2017] Athira Nambiar, Alexandre Bernardino, Jacinto C Nascimento, and Ana Fred. Context-aware person re-identification in the wild via fusion of gait and anthropometric features. In International Conference on Automatic Face & Gesture Recognition, pages 973–980. IEEE, 2017.
- Nambiar et al. [2018] Athira M Nambiar, Alexandre Bernardino, and Jacinto C Nascimento. Cross-context analysis for long-term view-point invariant person re-identification via soft-biometrics using depth sensor. In VISIGRAPP, pages 105–113, 2018.
- Nambiar et al. [2019] Athira Nambiar, Alexandre Bernardino, and Jacinto C Nascimento. Gait-based person re-identification: A survey. ACM Computing Surveys, 52(2):33, 2019.
- Pala et al. [2015] Federico Pala, Riccardo Satta, Giorgio Fumera, and Fabio Roli. Multimodal person reidentification using RGB-D cameras. IEEE Transactions on Circuits and Systems for Video Technology, 26(4):788–799, 2015.
- Pala et al. [2019] Pietro Pala, Lorenzo Seidenari, Stefano Berretti, and Alberto Del Bimbo. Enhanced skeleton and face 3D data for person re-identification from depth cameras. Computers & Graphics, 79:69–80, 2019.
- Rao and Miao [2022a] Haocong Rao and Chunyan Miao. SimMC: Simple masked contrastive learning of skeleton representations for unsupervised person re-identification. In International Joint Conference on Artificial Intelligence (IJCAI), pages 1290–1297, 2022.
- Rao and Miao [2022b] Haocong Rao and Chunyan Miao. Skeleton prototype contrastive learning with multi-level graph relation modeling for unsupervised person re-identification. arXiv preprint arXiv:2208.11814, 2022.
- Rao and Miao [2023] Haocong Rao and Chunyan Miao. TranSG: Transformer-based skeleton graph prototype contrastive learning with structure-trajectory prompted reconstruction for person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
- Rao et al. [2020] Haocong Rao, Siqi Wang, Xiping Hu, Mingkui Tan, Huang Da, Jun Cheng, and Bin Hu. Self-supervised gait encoding with locality-aware attention for person re-identification. In International Joint Conference on Artificial Intelligence (IJCAI), volume 1, pages 898–905, 2020.
- Rao et al. [2021a] Haocong Rao, Xiping Hu, Jun Cheng, and Bin Hu. SM-SGE: A self-supervised multi-scale skeleton graph encoding framework for person re-identification. In Proceedings of the 29th ACM International Conference on Multimedia, pages 1812–1820, 2021.
- Rao et al. [2021b] Haocong Rao, Siqi Wang, Xiping Hu, Mingkui Tan, Yi Guo, Jun Cheng, Xinwang Liu, and Bin Hu. A self-supervised gait encoding approach with locality-awareness for 3D skeleton based person re-identification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(10):6649–6666, 2021.
- Rao et al. [2021c] Haocong Rao, Shihao Xu, Xiping Hu, Jun Cheng, and Bin Hu. Multi-level graph encoding with structural-collaborative relation learning for skeleton-based person re-identification. In International Joint Conference on Artificial Intelligence (IJCAI), pages 973–980, 2021.
- Rao et al. [2022] Haocong Rao, Yuan Li, and Chunyan Miao. Revisiting k-reciprocal distance re-ranking for skeleton-based person re-identification. IEEE Signal Processing Letters, 29:2103–2107, 2022.
- Rao et al. [2023] Haocong Rao, Cyril Leung, and Chunyan Miao. Hierarchical skeleton meta-prototype contrastive learning with hard skeleton mining for unsupervised person re-identification. International Journal of Computer Vision, pages 1–23, 2023.
- Rashmi and Guddeti [2022] M Rashmi and Ram Mohana Reddy Guddeti. Human identification system using 3D skeleton-based gait features and LSTM model. Journal of Visual Communication and Image Representation (JVCIR), 82:103416, 2022.
- Shotton et al. [2011] Jamie Shotton, Andrew Fitzgibbon, Mat Cook, Toby Sharp, Mark J Finocchio, Richard Moore, Alex Abenathar Kipman, and Andrew Blake. Real-time human pose recognition in parts from single depth images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1297–1304, 2011.
- Velickovic et al. [2018] Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. In International Conference on Learning Representations (ICLR), 2018.
- Vezzani et al. [2013] Roberto Vezzani, Davide Baltieri, and Rita Cucchiara. People reidentification in surveillance and forensics: A survey. ACM Computing Surveys, 46(2):29, 2013.
- Wei et al. [2018] Longhui Wei, Shiliang Zhang, Wen Gao, and Qi Tian. Person transfer GAN to bridge domain gap for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 79–88, 2018.
- Wei et al. [2020] Chu-Chien Wei, Li-Huang Tsai, Hsin-Ping Chou, and Shih-Chieh Chang. Person identification by walking gesture using skeleton sequences. In Advanced Concepts for Intelligent Vision Systems, pages 205–214. Springer, 2020.
- Winter [2009] David A Winter. Biomechanics and motor control of human movement. John Wiley & Sons, 2009.
- Yan et al. [2017] Yichao Yan, Jingwei Xu, Bingbing Ni, Wendong Zhang, and Xiaokang Yang. Skeleton-aided articulated motion generation. In Proceedings of the 25th ACM International Conference on Multimedia, pages 199–207, 2017.
- Ye et al. [2021] Mang Ye, Jianbing Shen, Gaojie Lin, Tao Xiang, Ling Shao, and Steven CH Hoi. Deep learning for person re-identification: A survey and outlook. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(6):2872–2893, 2021.
- Ye et al. [2024] Mang Ye, Wei Shen, Junwu Zhang, Yao Yang, and Bo Du. SecureReID: Privacy-preserving anonymization for person re-identification. IEEE Transactions on Information Forensics and Security, 2024.
- Yoo et al. [2002] Jang-Hee Yoo, Mark S Nixon, and Chris J Harris. Extracting gait signatures based on anatomical knowledge. In Proceedings of BMVA Symposium on Advancing Biometric Technologies, pages 596–606. Citeseer, 2002.
- Zhong et al. [2017] Zhun Zhong, Liang Zheng, Donglin Cao, and Shaozi Li. Re-ranking person re-identification with k-reciprocal encoding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1318–1327, 2017.
- Zhou et al. [2016] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2921–2929, 2016.
- Zhou et al. [2017] Lina Zhou, Shimei Pan, Jianwu Wang, and Athanasios V Vasilakos. Machine learning on big data: Opportunities and challenges. Neurocomputing, 237:350–361, 2017.