HiH: A Multi-modal Hierarchy in Hierarchy Network for
Unconstrained Gait Recognition
Abstract
Gait recognition has achieved promising advances in controlled settings, yet it significantly struggles in unconstrained environments due to challenges such as view changes, occlusions, and varying walking speeds. Additionally, efforts to fuse multiple modalities often face limited improvements because of cross-modality incompatibility, particularly in outdoor scenarios. To address these issues, we present a multi-modal Hierarchy in Hierarchy network (HiH) that integrates silhouette and pose sequences for robust gait recognition. HiH features a main branch that utilizes Hierarchical Gait Decomposer (HGD) modules for depth-wise and intra-module hierarchical examination of general gait patterns from silhouette data. This approach captures motion hierarchies from overall body dynamics to detailed limb movements, facilitating the representation of gait attributes across multiple spatial resolutions. Complementing this, an auxiliary branch, based on 2D joint sequences, enriches the spatial and temporal aspects of gait analysis. It employs a Deformable Spatial Enhancement (DSE) module for pose-guided spatial attention and a Deformable Temporal Alignment (DTA) module for aligning motion dynamics through learned temporal offsets. Extensive evaluations across diverse indoor and outdoor datasets demonstrate HiH’s state-of-the-art performance, affirming a well-balanced trade-off between accuracy and efficiency.
1 Introduction
Gait recognition aims to identify individuals by analyzing their walking patterns and styles captured uncooperatively from a distance [36]. Compared to fingerprint recognition, gait offers the advantage of being contactless [44]. In contrast with facial recognition, gait patterns are more robust against spoofing and better preserve privacy, as gait analysis relies on human silhouette and movement rather than detailed visual features [12]. Owing to these merits, gait has emerged as a promising biometric approach for applications like video surveillance [45, 3], healthcare [34, 38], and forensics [35, 31].

In constrained or laboratory settings, existing gait recognition methods have achieved promising results. Particularly, appearance-based methods using binary silhouette images excel at capturing discriminative shape and contour information [29, 8, 10, 46, 48, 47]. In parallel, model-based approaches explicitly estimate and exploit skeletal dynamics, uncovering view-invariant patterns robust to occlusions and cluttered backgrounds [42, 13]. By complementing silhouette information with pose data, multi-modal methods further enhance performance under challenging conditions like clothing and carrying variation [32, 6, 20, 56]. However, as the focus shifts towards in-the-wild scenarios to cater to real-world applications [10, 55], two main issues have emerged, as illustrated in the left of Fig. 1. Firstly, algorithms highly effective in constrained settings often exhibit a significant decrease in performance when applied to outdoor benchmarks [10, 13, 56]. This is attributed to covariates like camera view, occlusions, and step speed in real-world scenarios. Secondly, the incorporation of additional modalities like skeleton poses has not led to expected performance gains [20, 56]. The inherent data incompatibility between different modalities can introduce additional ambiguity.
In light of the discussed challenges, we propose modeling two complementary aspects of gait: general gait motion patterns and dynamic pose changes, as depicted in the right of Fig. 1. The general patterns, which can be extracted from silhouette sequences, refer to the kinematic gait hierarchy that manifests through biomechanics consistent across scenarios [33, 11], thereby enhancing model generalization. Meanwhile, we employ 2D joints to represent evolving pose changes during walking. They circumvent potential inaccuracies of 3D skeleton estimation, especially in unconstrained settings. Moreover, mapping the 2D joints onto the image plane facilitates fusion with silhouette features and learning spatio-temporal offsets for deformation-based processing.
Specifically, we introduce a multi-modal Hierarchy in Hierarchy network (HiH) for unconstrained gait recognition. HiH consists of two branches. The main branch takes in silhouette sequences to model stable gait patterns. It centers on the Hierarchical Gait Decomposer (HGD) module and adopts a layered architecture via depth-wise and intra-module principles to unpack the kinematic hierarchy. In the depth-wise hierarchy, cascaded HGD modules progressively decompose motions into more localized actions across layers, enabling increasingly fine-grained feature learning. Meanwhile, the intra-module hierarchy in each HGD amalgamates multi-scale features to enrich global and local representations. Through joint modeling of the hierarchical structure across and within modules, the main branch effectively captures discriminative gait signatures. Complementing the main branch, the auxiliary branch leverages 2D pose sequences to enhance the spatial and temporal processing of the HGD modules in two ways: Spatially, the Deformable Spatial Enhancement (DSE) module highlights key local regions guided by the pose input. Temporally, the Deformable Temporal Alignment (DTA) module reduces redundant frames and extracts compact motion dynamics based on learned offsets. By providing pose cues, the auxiliary branch enhances the alignment of the main branch’s learned representations with actual gait movements.
The main contributions are summarized as follows:
- We propose the HiH network, a novel multi-modal framework for gait recognition in unconstrained environments. It integrates silhouette and 2D pose data through the HGD, which performs a hierarchical decomposition in both depth and width, tailored to the complexities of gait analysis.
- We propose two pose-driven guidance mechanisms for the HGD. DSE provides spatial attention to each frame using joint cues, while DTA employs learned offsets to adaptively align silhouette sequences over time, reducing redundancy while adapting to gait movement variations.
- Comprehensive evaluation shows that our HiH framework achieves state-of-the-art results on the in-the-wild Gait3D and GREW datasets and competitive performance on controlled datasets like OUMVLP and CASIA-B. This underscores its strong generalizability and a well-maintained balance between accuracy and efficiency.
2 Related Work
2.1 Single-modal Gait Recognition
Single-modal gait recognition methods primarily leverage two categories of input data: appearance-based representations like silhouettes [5, 9, 27, 7, 53, 8, 46] and model-based representations like skeletons [1, 41, 42, 13] and 3D meshes [14]. Appearance-based approaches directly extract gait features from raw input data. Earlier template-based methods such as the Gait Energy Image (GEI) [15] and Gait Entropy Image (GEnI) [2] create distinct gait templates by aggregating silhouette information over gait cycles. These techniques compactly represent gait signatures but lose temporal details and are sensitive to viewpoint changes.
Recent silhouette-based methods have excelled by focusing on structural feature learning and temporal modeling. For structure, set-based methods like GaitSet [5] and Set Residual Network [19] treat sequences as unordered sets, enhancing robustness to frame permutation. GaitPart [9] emphasizes unique expressions of different body parts, and 3D Local CNN [22] extracts part features at adaptive scales and locations. GaitGL [27] and HSTL [46] integrate local and global cues, though HSTL's pre-defined hierarchical body partitioning may limit its adaptability. Temporally, methods exploit contextual relationships [21], second-order motion patterns [4], and meta attention and pooling [7] to discern subtle patterns, with advanced techniques exploring dynamic mechanisms [29, 48] and counterfactual intervention learning [8] for robust spatio-temporal signatures. To harness the color and texture information in the original images, recent RGB-based gait recognition techniques aim to directly extract gait features from video frames, mitigating reliance on preprocessing like segmentation [37, 52, 24, 25].

Model-based approaches build gait representations from body joints or 3D structure before extracting features for classification. Recent approaches utilize pose estimation advances to obtain cleaner skeleton input representing joint configurations [41, 13]. Graph convolutional networks help model inherent spatial-temporal patterns among joints [14, 42]. Some techniques incorporate biomechanical or physics priors to learn gait features aligned with human locomotion [26, 14]. Additionally, 3D mesh recovery from video has been explored for pose and shape modeling [50].
2.2 Multi-modal Gait Recognition
Many recent approaches fuse complementary modalities like silhouette, 2D/3D pose, and skeleton to obtain more comprehensive gait representations. TransGait [23] combines silhouette appearance and pose dynamics via a set transformer model. SMPLGait [54] introduces a dual-branch network leveraging estimated 3D body models to recover detailed shape and motion patterns lost in 2D projections. Other works focus on effective fusion techniques, including part-based alignment [6, 32, 20] and refining skeleton with silhouette cues [56]. While fusing modalities like silhouette and pose has demonstrated performance gains in controlled settings, their effectiveness decreases on outdoor benchmarks. This is partly due to inaccurate skeleton pose estimation under unconstrained conditions, which causes difficulty in modality alignment. Unlike existing works, our approach utilizes more reliable 2D joint sequences to apply per-frame spatial-temporal attention correction to the silhouettes, achieving greater consistency across different modalities.
3 Method
In this section, we first overview the proposed Hierarchy in Hierarchy (HiH) framework (Sec. 3.1). We introduce the Hierarchical Gait Decomposer (HGD) module for hierarchical gait feature learning (Sec. 3.2), followed by the Spatially Enhanced HGD (Sec. 3.3) and Temporally Enhanced HGD (Sec. 3.4) modules, which strengthen HGD under the guidance of pose cues. Finally, we describe the loss function (Sec. 3.5).
3.1 Framework Overview
The core of our proposed HiH framework integrates gait silhouettes with pose data to enhance gait recognition. As illustrated in Fig. 2, the framework operates on two input sequences: the silhouette sequence $S \in \mathbb{R}^{T \times H \times W}$ and the pose sequence $P \in \mathbb{R}^{T \times H \times W}$, where the binary silhouette images and the 2D keypoint-based pose representations share the same resolution. $T$ is the number of frames, and $H \times W$ denotes the spatial dimensions.
Building on the input sequences, the framework utilizes a dual-branch backbone. The main branch processes $S$ for general motion extraction. The initial step involves a 3D convolutional operation to extract foundational spatio-temporal features. This is followed by a series of stage-specific Hierarchical Gait Decomposer (HGD) modules, denoted as $\mathrm{HGD}_i$ for the $i$-th stage. The auxiliary branch leverages $P$ to provide spatial and temporal guidance to the HGD, via either Deformable Spatial Enhancement (DSE) or Deformable Temporal Alignment (DTA) modules. Thus, the output feature $F_i$ from each stage is expressed as:

$$F_i = \mathrm{HGD}_i\big(\mathcal{G}_i(F_{i-1}, P)\big) \qquad (1)$$

where $\mathcal{G}_i$ denotes the pose-guided enhancement (DSE, DTA, or identity) applied at the $i$-th stage, and $F_0$ is the output of the initial 3D convolution.
Following the backbone, our framework applies Temporal Pooling (TP) [27] and Horizontal Pooling (HP) [9] to downsample the spatio-temporal dimensions. The reduced features are then processed through the head layer, which includes separate fully-connected layers and BNNeck [28], effectively mapping them into a metric space. The model is optimized using separate triplet and cross-entropy losses.
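To make the data flow concrete, the following PyTorch-style sketch traces the pipeline described above, from the initial 3D convolution through the pose-guided HGD stages to TP, HP, and the head. The module granularity, channel handling, strip pooling, and embedding dimension are simplifying assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class PassThrough(nn.Module):
    """Placeholder guidance for stages without DSE/DTA: features pass through unchanged."""
    def forward(self, feat, pose):
        return feat


class HiHPipelineSketch(nn.Module):
    """Minimal sketch of the HiH pipeline: stem -> pose-guided HGD stages -> TP -> HP -> head."""

    def __init__(self, hgds, guidances, stem_ch, out_ch, embed_dim, num_classes, num_strips=16):
        super().__init__()
        self.stem = nn.Conv3d(1, stem_ch, kernel_size=3, padding=1)        # initial 3D convolution
        self.hgds = nn.ModuleList(hgds)                                    # one HGD per stage (Sec. 3.2)
        self.guidances = nn.ModuleList(guidances)                          # DSE / DTA / PassThrough per stage
        self.num_strips = num_strips
        self.fcs = nn.ModuleList([nn.Linear(out_ch, embed_dim) for _ in range(num_strips)])
        self.bnneck = nn.BatchNorm1d(embed_dim * num_strips)               # BNNeck-style normalization [28]
        self.classifier = nn.Linear(embed_dim * num_strips, num_classes, bias=False)

    def forward(self, sils, pose):
        # sils: [B, 1, T, H, W] silhouettes; pose: [B, C_p, T, H, W] keypoint heatmaps
        feat = self.stem(sils)
        for guide, hgd in zip(self.guidances, self.hgds):
            # guidance modules are assumed to handle any resolution mismatch with `pose` internally
            feat = hgd(guide(feat, pose))                                  # Eq. (1): pose-guided HGD stage
        feat = feat.max(dim=2).values                                      # Temporal Pooling: max over frames
        strips = feat.chunk(self.num_strips, dim=2)                        # Horizontal Pooling: split the height
        parts = [fc(s.mean(dim=(2, 3))) for fc, s in zip(self.fcs, strips)]
        embedding = torch.cat(parts, dim=1)                                # metric features for the triplet loss
        logits = self.classifier(self.bnneck(embedding))                   # identity logits for cross-entropy
        return embedding, logits
```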

3.2 Hierarchical Gait Decomposer (HGD)
Unlike part-based techniques that typically divide the body into uniform horizontal segments [9, 27], the HGD employs a dual hierarchical approach for gait recognition, as shown in Fig. 2. This approach achieves a depth-wise hierarchy by stacking multiple HGD stages, which captures gait dynamics from global body movements down to subtle limb articulations. In parallel, the width-wise hierarchy within each HGD stage conducts multi-scale processing to capture a comprehensive set of spatial features. The implementation of the $i$-th stage of the HGD, as illustrated in Fig. 3, can be formalized as follows:
$$\hat{F}_i = \sum_{s \in \mathcal{S}_i} \mathrm{TConv}_{i,s}\Big(\mathrm{SConv}_{i,s}\big(F_{i-1}\big)\Big) \qquad (2)$$

where $\mathrm{SConv}_{i,s}$ denotes the convolution operation applied to the horizontal strips of $F_{i-1}$ at scale $s$ with a kernel size of $k_s$, designed to capture spatial features, while $\mathrm{TConv}_{i,s}$ refines these features over the temporal dimension; $\mathcal{S}_i$ is the set of strip scales used at the $i$-th stage. Building on this multi-scale feature extraction, the aggregated features $\hat{F}_i$ are further processed through a combination of additional convolutions and a residual connection [16] by

$$F_i = \mathrm{Conv}\big(\hat{F}_i\big) + F_{i-1}. \qquad (3)$$
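A minimal sketch of one HGD stage along these lines is given below. It assumes the strips are obtained by chunking the height dimension, that the spatial and temporal convolutions use generic 3×3 and 3×1×1 kernels, and that the multi-scale outputs are aggregated by summation; these choices are illustrative rather than the exact configuration of the paper.

```python
import torch
import torch.nn as nn


class HGDSketch(nn.Module):
    """Sketch of one Hierarchical Gait Decomposer stage (Eqs. 2-3): per-strip spatial
    convolutions at several scales, temporal refinement, and a residual connection."""

    def __init__(self, in_ch, out_ch, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        self.spatial = nn.ModuleList(
            [nn.Conv3d(in_ch, out_ch, kernel_size=(1, 3, 3), padding=(0, 1, 1)) for _ in scales])
        self.temporal = nn.ModuleList(
            [nn.Conv3d(out_ch, out_ch, kernel_size=(3, 1, 1), padding=(1, 0, 0)) for _ in scales])
        self.refine = nn.Conv3d(out_ch, out_ch, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        self.shortcut = nn.Conv3d(in_ch, out_ch, kernel_size=1) if in_ch != out_ch else nn.Identity()

    def forward(self, x):
        # x: [B, C, T, H, W]
        agg = 0
        for n, sconv, tconv in zip(self.scales, self.spatial, self.temporal):
            strips = x.chunk(n, dim=3)                        # split the height into n horizontal strips
            y = torch.cat([sconv(s) for s in strips], dim=3)  # per-strip spatial convolution (Eq. 2)
            agg = agg + tconv(y)                              # temporal refinement, summed over scales
        return self.refine(agg) + self.shortcut(x)            # extra convolution + residual (Eq. 3)
```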

3.3 Spatially Enhanced HGD (SE-HGD)
Silhouettes describe the overall body shape during walking but lack structural details. Fusing poses can provide complementary information on joint and limb movements. To achieve this, inspired by [43, 49], we introduce the Deformable Spatial Enhancement (DSE) module to adapt silhouettes using derived pose cues. As shown in Fig. 4, DSE utilizes learned deformable offsets to dynamically warp input silhouettes, emphasizing key spatial gait features and aligning them to corresponding poses. This forms a Spatially Enhanced HGD (SE-HGD) for more discriminative gait analysis.
Within the DSE, offsets are learned from the pose input using a convolutional layer. The offsets are organized into a tensor $\Delta$, prescribing spatial transformations for the silhouette input $S$. $\Delta$ contains two components: offsets $\Delta_{xy}$ representing pixel displacements in the $x$ and $y$ directions, constrained by a bounded activation, and offsets $\Delta_{m}$ acting as scaling factors, processed via a ReLU activation. The pixel-wise update to the silhouette input, utilizing the learned offsets, is conducted as follows:

$$\hat{S} = \Delta_{m} \odot \mathcal{B}\big(S, \Delta_{xy}\big) \qquad (4)$$

where $\mathcal{B}$ denotes the bilinear interpolation function that applies the spatial adjustments $\Delta_{xy}$ to $S$, and $\odot$ represents element-wise multiplication, which applies the scaling.
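The following sketch implements this pose-guided warping frame by frame with `torch.nn.functional.grid_sample`. The offset range, the use of tanh as the bounded activation, and the single-convolution offset predictor are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DSESketch(nn.Module):
    """Sketch of Deformable Spatial Enhancement (Eq. 4): pose-driven per-pixel displacements
    warp the silhouette features, and a ReLU-gated map rescales them."""

    def __init__(self, pose_ch, max_offset=4.0):
        super().__init__()
        self.offset_conv = nn.Conv2d(pose_ch, 3, kernel_size=3, padding=1)  # (dx, dy, scale) per pixel
        self.max_offset = max_offset

    def forward(self, feat, pose):
        # feat: [B, C, H, W] silhouette features for one frame; pose: [B, C_p, H, W] keypoint heatmaps
        b, _, h, w = feat.shape
        off = self.offset_conv(pose)
        dxy = torch.tanh(off[:, :2]) * self.max_offset        # bounded pixel displacements
        scale = F.relu(off[:, 2:3])                           # non-negative scaling factors
        # build a sampling grid in normalized [-1, 1] coordinates
        ys, xs = torch.meshgrid(torch.arange(h, device=feat.device, dtype=feat.dtype),
                                torch.arange(w, device=feat.device, dtype=feat.dtype),
                                indexing="ij")
        grid_x = (xs + dxy[:, 0]) / (w - 1) * 2 - 1
        grid_y = (ys + dxy[:, 1]) / (h - 1) * 2 - 1
        grid = torch.stack([grid_x, grid_y], dim=-1)          # [B, H, W, 2]
        warped = F.grid_sample(feat, grid, mode="bilinear", align_corners=True)
        return warped * scale                                 # Eq. 4: warp, then element-wise rescale
```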
3.4 Temporally Enhanced HGD (TE-HGD)
While silhouette sequences provide an outline of basic body movement, they typically fail to capture intricate joint dynamics. Building on DSE, the proposed Deformable Temporal Alignment (DTA) module extends pose guidance to the temporal domain, adaptively aligning silhouettes to match gait variations. Combined with HGD, DTA constitutes the Temporally Enhanced HGD (TE-HGD). Moreover, DTA enables per-pixel sampling between frames, making temporal downsampling via fixed-stride max pooling more adaptive to motion variations.

As illustrated in Fig. 5, the DTA module employs a 3D convolution on $P$ to extract 5-channel spatio-temporal offsets. The first two channels, $\Delta_x$ and $\Delta_y$, capture spatial displacements along the $x$ and $y$ axes, while the third, $\Delta_t$, quantifies temporal displacement. The fourth channel, $\Delta_{ms}$, scales the spatial offsets, and the fifth, $\Delta_{mt}$, adjusts the temporal offsets. Similar to the DSE module, the final offset tensor $\Delta$ is formed by concatenating the processed offsets. This can be expressed as:

$$\Delta_{xy}' = \Delta_{ms} \odot \big(\Delta_x, \Delta_y\big), \qquad \Delta_t' = \Delta_{mt} \odot \Delta_t, \qquad \Delta = \mathrm{Concat}\big(\Delta_{xy}', \Delta_t'\big) \qquad (5)$$
These modified offsets guide the update of the silhouette features, followed by a MaxPooling operation to reduce redundancy. This process can be formulated as:
$$\hat{S} = \mathrm{MaxPool}\Big(\mathcal{T}\big(S, \Delta\big)\Big) \qquad (6)$$

where $\mathcal{T}$ represents trilinear interpolation for spatio-temporal adjustments of the silhouette sequence, and MaxPool is applied along the temporal dimension.
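A corresponding sketch of DTA is given below, using 5D `grid_sample` (which performs trilinear interpolation) followed by temporal max pooling. The activations, offset ranges, and pooling stride are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DTASketch(nn.Module):
    """Sketch of Deformable Temporal Alignment (Eqs. 5-6): a 3D conv on the pose stream predicts
    five offset channels (dx, dy, dt, spatial scale, temporal scale); the silhouette features are
    resampled with trilinear interpolation and then max-pooled over time."""

    def __init__(self, pose_ch, max_xy=4.0, max_t=2.0, pool_stride=3):
        super().__init__()
        self.offset_conv = nn.Conv3d(pose_ch, 5, kernel_size=3, padding=1)
        self.max_xy, self.max_t = max_xy, max_t
        self.pool = nn.MaxPool3d(kernel_size=(pool_stride, 1, 1), stride=(pool_stride, 1, 1))

    def forward(self, feat, pose):
        # feat: [B, C, T, H, W] silhouette features; pose: [B, C_p, T, H, W] keypoint heatmaps
        b, _, t, h, w = feat.shape
        off = self.offset_conv(pose)
        dxy = torch.tanh(off[:, 0:2]) * self.max_xy * F.relu(off[:, 3:4])  # scaled spatial offsets (Eq. 5)
        dt = torch.tanh(off[:, 2:3]) * self.max_t * F.relu(off[:, 4:5])    # scaled temporal offset (Eq. 5)
        zs, ys, xs = torch.meshgrid(
            torch.arange(t, device=feat.device, dtype=feat.dtype),
            torch.arange(h, device=feat.device, dtype=feat.dtype),
            torch.arange(w, device=feat.device, dtype=feat.dtype), indexing="ij")
        grid_x = (xs + dxy[:, 0]) / (w - 1) * 2 - 1
        grid_y = (ys + dxy[:, 1]) / (h - 1) * 2 - 1
        grid_t = (zs + dt[:, 0]) / (t - 1) * 2 - 1
        grid = torch.stack([grid_x, grid_y, grid_t], dim=-1)               # [B, T, H, W, 3]
        warped = F.grid_sample(feat, grid, mode="bilinear", align_corners=True)  # trilinear for 5D input
        return self.pool(warped)                                           # Eq. 6: temporal max pooling
```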
3.5 Loss Function
To effectively train our model, we use a joint loss that combines the triplet loss $\mathcal{L}_{tri}$ [17, 29] and the cross-entropy loss $\mathcal{L}_{ce}$. The triplet loss $\mathcal{L}_{tri}$ can be formulated as:
$$\mathcal{L}_{tri} = \frac{1}{N_{+}} \sum_{s=1}^{S} \sum_{a,p,n} \Big[\, m + d\big(f_s(x_a), f_s(x_p)\big) - d\big(f_s(x_a), f_s(x_n)\big) \Big]_{+} \qquad (7)$$

where $N_{+}$ represents the number of triplets with a positive loss, $S$ is the number of horizontal stripes, $N_{id}$ and $N_{seq}$ are the number of subjects and sequences per subject in a batch, respectively, so the batch size equals $N_{id} \times N_{seq}$, $m$ is the margin, $d(\cdot,\cdot)$ denotes the Euclidean distance, and $f_s(\cdot)$ is the feature extraction function for the $s$-th stripe. The variables $x_a$, $x_p$, and $x_n$ represent the input sequences from anchors, positives, and negatives within the batch, respectively.
The cross-entropy loss $\mathcal{L}_{ce}$ can be expressed as:

$$\mathcal{L}_{ce} = -\frac{1}{N_{id}\, N_{seq}} \sum_{i=1}^{N_{id}} \sum_{j=1}^{N_{seq}} \sum_{c=1}^{C} y_{ijc} \log p_{ijc} \qquad (8)$$

where $C$ is the number of subject categories. In this formulation, $p_{ijc}$ denotes the predicted probability that the $j$-th sequence of the $i$-th subject in the batch belongs to the $c$-th category, and $y_{ijc}$ is the ground truth label. Finally, combining Eq. 7 and Eq. 8, the joint loss function can be formulated as:
$$\mathcal{L} = \mathcal{L}_{tri} + \mathcal{L}_{ce}. \qquad (9)$$
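As a concrete reference, the following PyTorch-style sketch computes a batch-all triplet loss over per-strip embeddings together with the cross-entropy term. The equal weighting of the two terms and the tensor layout ([strips, batch, dim]) are assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as F


def hih_joint_loss(embeddings, logits, labels, margin=0.2):
    """Joint objective (Eq. 9): batch-all triplet loss per horizontal strip (Eq. 7)
    plus cross-entropy on the BNNeck logits (Eq. 8).
    embeddings: [num_strips, B, D], logits: [B, num_classes], labels: [B]."""
    num_strips, batch, _ = embeddings.shape
    dist = torch.cdist(embeddings, embeddings)                 # [num_strips, B, B] Euclidean distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)          # [B, B] same-identity mask
    pos_mask = (same & ~torch.eye(batch, dtype=torch.bool, device=labels.device)).float()
    neg_mask = (~same).float()
    ap = dist.unsqueeze(3)                                     # anchor-positive distances
    an = dist.unsqueeze(2)                                     # anchor-negative distances
    triplet = F.relu(margin + ap - an)                         # all (a, p, n) combinations
    valid = pos_mask.unsqueeze(2) * neg_mask.unsqueeze(1)      # [B, B, B] valid triplet mask
    triplet = triplet * valid
    num_active = (triplet > 0).sum().clamp(min=1)              # N_+ : triplets with positive loss
    loss_tri = triplet.sum() / num_active
    loss_ce = F.cross_entropy(logits, labels)
    return loss_tri + loss_ce
```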
4 Experiments
4.1 Datasets
We evaluated our method across four gait recognition datasets: Gait3D [54] and GREW [57] from real-world environments, and OUMVLP [40] and CASIA-B [51] from controlled laboratory settings.
Gait3D [54] is a large-scale wild gait dataset, featuring 4,000 subjects and 25,309 sequences collected from 39 cameras in a large supermarket. This dataset provides four modalities: silhouette, 2D and 3D poses, and 3D mesh. It is divided into a training set with 3,000 subjects and a test set comprising 1,000 subjects. During the evaluation phase, one sequence from each subject is randomly selected as the probe, while the remaining sequences form the gallery.
GREW [57] is one of the largest wild gait datasets, containing 26,345 subjects and 128,671 sequences collected by 882 cameras. This dataset offers data in four modalities: silhouette, optical flow, 2D poses, and 3D poses. It is divided into training, validation, and test sets, containing 2,000, 345, and 6,000 subjects, respectively. For evaluation, two sequences from each subject in the test set are selected as probes, with the remaining sequences serving as the gallery. Additionally, GREW includes a distractor set with 233,857 unlabelled sequences.
OUMVLP [40] is a large-scale indoor dataset, featuring 10,307 subjects captured from 14 different camera viewpoints. Each subject is recorded in two sequences under normal walking (NM) conditions. The dataset is evenly divided into a training set and a test set, containing 5,153 and 5,154 subjects, respectively. During the evaluation phase, sequences labeled NM#01 are used as the gallery, while those labeled NM#00 serve as the probe.
CASIA-B [51] is a widely-used indoor gait dataset that includes 124 subjects captured from 11 different viewpoints. Subjects were recorded under three walking conditions: normal walking (NM), walking with a bag (BG), and walking with a coat (CL). For evaluation purposes, we adopt the prevailing protocol, dividing the dataset into training and test sets with 74 and 50 subjects, respectively. During the evaluation, sequences NM#01-04 are designated as the gallery, while the remaining sequences serve as probes. In addition, the HiH method requires precise frame alignment between RGB and silhouette sequences. However, we observed misalignment in the original CASIA-B dataset, hindering direct application. To address this, we utilize CASIA-B* [25], a variant with aligned RGB and silhouette data tailored to HiH’s needs.
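All four datasets share the probe-gallery protocol described above. A minimal sketch of Rank-1 evaluation under that protocol is shown below; nearest-neighbour matching with Euclidean distance is an assumed but standard choice.

```python
import torch


def rank1_accuracy(probe_feats, probe_ids, gallery_feats, gallery_ids):
    """Probe-gallery Rank-1 evaluation: each probe embedding is matched to its nearest
    gallery embedding, and accuracy is the fraction whose nearest neighbour shares the identity."""
    dist = torch.cdist(probe_feats, gallery_feats)       # [num_probe, num_gallery] distances
    nearest = dist.argmin(dim=1)                         # index of the closest gallery sequence
    return (gallery_ids[nearest] == probe_ids).float().mean().item()
```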
4.2 Implementation Details
Training details. 1) The margin $m$ in Eq. 7 is set to 0.2, and the number of HP bins is 16; 2) Batch sizes are configured separately for CASIA-B, OUMVLP, and for Gait3D and GREW; 3) Our input modalities include gait silhouettes and pose heatmaps generated by HRNet [39], both resized to the same resolution, with a fixed standard deviation $\sigma$ of 2 for the 2D Gaussian distribution on keypoints. For training on CASIA-B and OUMVLP, 30 frames are randomly sampled per sequence. For Gait3D and GREW, the sampled frame ranges follow [10, 29]; 4) The optimizer is SGD with a learning rate of 0.1, training the model for 60K, 140K, 70K, and 200K iterations for CASIA-B, OUMVLP, Gait3D, and GREW, respectively; 5) When training on the two real-world datasets, data augmentation strategies (e.g., horizontal flipping, rotation, perspective) are applied as outlined in [10, 30].
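To illustrate the pose-input preparation, here is a small Python sketch that renders estimated 2D keypoints into per-joint Gaussian heatmaps with $\sigma = 2$. The joint count and the target resolution in the example are placeholders, not values taken from the paper.

```python
import numpy as np


def keypoints_to_heatmaps(keypoints, height, width, sigma=2.0):
    """Render 2D keypoints as Gaussian heatmaps (one channel per joint); keypoint
    coordinates are assumed to be in pixels on the target resolution."""
    num_joints = keypoints.shape[0]
    ys, xs = np.mgrid[0:height, 0:width]
    heatmaps = np.zeros((num_joints, height, width), dtype=np.float32)
    for j, (x, y) in enumerate(keypoints):
        heatmaps[j] = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    return heatmaps


# Example: 17 COCO-style joints rendered on a 64x44 map (joint count and resolution are assumptions)
joints = np.random.rand(17, 2) * [44, 64]
maps = keypoints_to_heatmaps(joints, height=64, width=44, sigma=2.0)
```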
Architecture details. 1) The output channels of the backbone in the four stages are set separately for CASIA-B and for OUMVLP, Gait3D, and GREW, with wider channels used for the larger datasets; 2) Spatial downsampling is applied at the second and third stages for OUMVLP, Gait3D, and GREW, but not for CASIA-B; 3) For all four datasets, temporal downsampling with stride 3 is applied at the third stage.
4.3 Comparison with State-of-the-Art Methods
We conduct comprehensive comparisons between our proposed HiH method variants, HiH-S (using only silhouette modality) and HiH-M (integrating silhouettes with 2D keypoints for multimodal analysis), and three types of existing gait recognition methods: 1) Pose-based methods including GaitGraph2 [42], PAA [14], and GPGait [13]; 2) Silhouette-based methods such as GaitSet [5], GaitPart [9], GLN [18], GaitGL [27], 3D Local [22], CSTL [21], LagrangeGait [4], MetaGait [7], GaitBase [10], DANet [29], GaitGCI [8], STANet [30], DyGait [48], and HSTL [46]; 3) Multimodal approaches like SMPLGait [54], TransGait [23], BiFusion [32], GaitTAKE [20], GaitRef [56], and MMGaitFormer [6].
Method | Venue | Rank-1 | Rank-5 | mAP | mINP |
---|---|---|---|---|---|
GaitGraph2 [42] | CVPRW22 | 11.2 | - | - | - |
PAA [14] | ICCV23 | 38.9 | 59.1 | - | - |
GPGait [13] | ICCV23 | 22.4 | - | - | - |
GaitSet [5] | AAAI19 | 36.7 | 58.3 | 30.0 | 17.3 |
GaitPart [9] | CVPR20 | 28.2 | 47.6 | 21.6 | 12.4 |
GLN [18] | ECCV20 | 31.4 | 52.9 | 24.7 | 13.6 |
GaitGL [27] | ICCV21 | 29.7 | 48.5 | 22.3 | 13.3 |
CSTL [21] | ICCV21 | 11.7 | 19.2 | 5.6 | 2.6 |
GaitBase [10] | CVPR23 | 64.6 | - | - | - |
DANet [29] | CVPR23 | 48.0 | 69.7 | - | - |
GaitGCI [8] | CVPR23 | 50.3 | 68.5 | 39.5 | 24.3 |
DyGait [48] | ICCV23 | 66.3 | 80.8 | 56.4 | 37.3 |
HSTL [46] | ICCV23 | 61.3 | 76.3 | 55.5 | 34.8 |
HiH-S | - | 72.4 | 86.9 | 64.4 | 38.1
SMPLGait [54] | CVPR22 | 46.3 | 64.5 | 37.2 | 22.2 |
GaitRef [56] | IJCB23 | 49.0 | 69.3 | 40.7 | 25.3 |
HiH-M | - | 75.8 | 88.3 | 67.3 | 40.4
Method | Venue | Rank-1 | Rank-5 | Rank-10 | Rank-20 |
---|---|---|---|---|---|
GaitGraph2 [42] | CVPRW22 | 34.8 | - | - | - |
PAA [14] | ICCV23 | 38.7 | 62.1 | - | - |
GPGait [13] | ICCV23 | 57.0 | - | - | - |
GaitSet [5] | AAAI19 | 46.3 | 63.6 | 70.3 | 76.8 |
GaitPart [9] | CVPR20 | 44.0 | 60.7 | 67.3 | 73.5 |
GaitGL [27] | ICCV21 | 47.3 | 63.6 | 69.3 | 74.2 |
CSTL [21] | ICCV21 | 50.6 | 65.9 | 71.9 | 76.9 |
GaitBase [10] | CVPR23 | 60.1 | - | - | - |
GaitGCI [8] | CVPR23 | 68.5 | 80.8 | 84.9 | 87.7 |
STANet [30] | ICCV23 | 41.3 | - | - | -
DyGait [48] | ICCV23 | 71.4 | 83.2 | 86.8 | 89.5 |
HSTL [46] | ICCV23 | 62.7 | 76.6 | 81.3 | 85.2 |
HiH-S | - | 72.5 | 83.6 | 87.1 | 90.0
TransGait [23] | APIN23 | 56.3 | 72.7 | 78.1 | 82.5 |
GaitTAKE [20] | JSTSP23 | 51.3 | 69.4 | 75.5 | 80.4 |
GaitRef [56] | IJCB23 | 53.0 | 67.9 | 73.0 | 77.5 |
HiH-M | - | 73.4 | 84.3 | 87.8 | 90.4
Evaluation on Gait3D. On the real-world Gait3D dataset, our method outperforms both existing single-modality (pose-based, silhouette-based) and multi-modality methods across all metrics, as detailed in Tab. 1. Specifically, HiH-S achieves 6.1%, 6.1%, and 8.0% higher Rank-1, Rank-5, and mAP than the state-of-the-art silhouette-based method DyGait, demonstrating the efficacy of dual-hierarchy modeling. HiH-M records 26.8%, 19.0%, and 26.6% higher Rank-1, Rank-5, and mAP, respectively, than GaitRef. Moreover, HiH-M achieves 3.4% higher Rank-1 accuracy than HiH-S. These results indicate that pose-guided learning supplements the fine details missing in silhouettes. It is also observed that pose-based methods lag behind silhouette-based approaches, indicating that pose estimation in the wild remains challenging.
Method | Venue | 0° | 15° | 30° | 45° | 60° | 75° | 90° | 180° | 195° | 210° | 225° | 240° | 255° | 270° | Mean | Std
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
GaitGraph2 [42] | CVPRW22 | 54.3 | 68.4 | 76.1 | 76.8 | 71.5 | 75.0 | 70.1 | 52.2 | 60.6 | 57.8 | 73.2 | 67.8 | 70.8 | 65.3 | 67.1 | 7.7 |
GPGait [13] | ICCV23 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | 59.1 | -
GaitSet [5] | AAAI19 | 79.3 | 87.9 | 90.0 | 90.1 | 88.0 | 88.7 | 87.7 | 81.8 | 86.5 | 89.0 | 89.2 | 87.2 | 87.6 | 86.2 | 87.1 | 4.0 |
GaitPart [9] | CVPR20 | 82.6 | 88.9 | 90.8 | 91.0 | 89.7 | 89.9 | 89.5 | 85.2 | 88.1 | 90.0 | 90.1 | 89.0 | 89.1 | 88.2 | 88.7 | 2.3 |
GLN [18] | ECCV20 | 83.8 | 90.0 | 91.0 | 91.2 | 90.3 | 90.0 | 89.4 | 85.3 | 89.1 | 90.5 | 90.6 | 89.6 | 89.3 | 88.5 | 89.2 | 2.1 |
CSTL [21] | ICCV21 | 87.1 | 91.0 | 91.5 | 91.8 | 90.6 | 90.8 | 90.6 | 89.4 | 90.2 | 90.5 | 90.7 | 89.8 | 90.0 | 89.4 | 90.2 | 1.1 |
GaitGL [27] | ICCV21 | 84.9 | 90.2 | 91.1 | 91.5 | 91.1 | 90.8 | 90.3 | 88.5 | 88.6 | 90.3 | 90.4 | 89.6 | 89.5 | 88.8 | 89.7 | 1.7 |
3D Local [22] | ICCV21 | 86.1 | 91.2 | 92.6 | 92.9 | 92.2 | 91.3 | 91.1 | 86.9 | 90.8 | 92.2 | 92.3 | 91.3 | 91.1 | 90.2 | 90.9 | 2.0 |
LagrangeGait [4] | CVPR22 | 85.9 | 90.6 | 91.3 | 91.5 | 91.2 | 91.0 | 90.6 | 88.9 | 89.2 | 90.5 | 90.6 | 89.9 | 89.8 | 89.2 | 90.0 | 1.4 |
MetaGait [7] | ECCV22 | 88.2 | 92.3 | 93.0 | 93.5 | 93.1 | 92.7 | 92.6 | 89.3 | 91.2 | 92.0 | 92.6 | 92.3 | 91.9 | 91.1 | 91.9 | 1.4 |
GaitBase [10] | CVPR23 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | 90.8 | -
DANet [29] | CVPR23 | 87.7 | 91.3 | 91.6 | 91.8 | 91.7 | 91.4 | 91.1 | 90.4 | 90.3 | 90.7 | 90.9 | 90.5 | 90.3 | 89.9 | 90.7 | 1.0 |
GaitGCI [8] | CVPR23 | 91.2 | 92.3 | 92.6 | 92.7 | 93.0 | 92.3 | 92.1 | 92.0 | 91.8 | 91.9 | 92.6 | 92.3 | 91.4 | 91.6 | 92.1 | 0.5 |
STANet [30] | ICCV23 | 87.7 | 91.4 | 91.6 | 91.9 | 91.6 | 91.4 | 91.2 | 90.4 | 90.3 | 90.8 | 91.0 | 90.5 | 90.3 | 90.1 | 90.7 | 1.0 |
HSTL [46] | ICCV23 | 91.4 | 92.9 | 92.7 | 93.0 | 92.9 | 92.5 | 92.5 | 92.7 | 92.3 | 92.1 | 92.3 | 92.2 | 91.8 | 91.8 | 92.4 | 0.5 |
BiFusion [32] | MTA23 | 86.2 | 90.6 | 91.3 | 91.6 | 90.9 | 90.8 | 90.5 | 87.8 | 89.5 | 90.4 | 90.7 | 90.0 | 89.8 | 89.3 | 89.9 | 1.4 |
GaitTAKE [20] | JSTSP23 | 87.5 | 91.0 | 91.5 | 91.8 | 91.4 | 91.1 | 90.8 | 90.2 | 89.7 | 90.5 | 90.7 | 90.3 | 90.0 | 89.5 | 90.4 | 1.0 |
GaitRef [56] | IJCB23 | 85.7 | 90.5 | 91.6 | 91.9 | 91.3 | 91.3 | 90.9 | 89.3 | 89.0 | 90.8 | 90.8 | 90.1 | 90.1 | 89.5 | 90.2 | 1.5 |
MMGaitFormer [6] | CVPR23 | - | - | - | - | - | - | - | - | - | - | - | - | - | - | 90.1 | -
HiH-S | - | 92.1 | 93.0 | 92.4 | 92.7 | 93.2 | 92.5 | 92.4 | 93.0 | 92.4 | 91.9 | 92.1 | 92.5 | 91.9 | 91.9 | 92.4 | 0.4
Evaluation on GREW. As shown in Tab. 2, the results on GREW follow a similar trend to Gait3D. Even using only a single modality, our HiH-S achieves the best results among the compared methods, and HiH-M further improves Rank-1 accuracy by 0.9% through multi-modality fusion. Existing multi-modal methods again trail the strongest silhouette-based approaches, which we attribute to unreliable skeleton estimation in the wild. Our HiH-M addresses this issue by using 2D pose as an auxiliary modality, which retains essential gait information and avoids 3D joint motion errors.
Evaluation on OUMVLP. Since OUMVLP only provides silhouettes, we compare single-modality results. As Tab. 3 shows, HiH-S leads in mean Rank-1 accuracy. Notably, HiH-S excels in 8 of 14 camera views, particularly in views such as the front and back where the gait posture is less visible. Although its average accuracy is on par with HSTL, its lower standard deviation suggests better cross-view stability while using less than half of HSTL's parameters (refer to Sec. 4.6 for details).
Method | Venue | NM | BG | CL | Mean |
---|---|---|---|---|---|
GaitGraph2 [42] | CVPRW22 | 80.3 | 71.4 | 63.8 | 71.8 |
GPGait [13] | ICCV23 | 93.6 | 80.2 | 69.3 | 81.0 |
GaitSet [5] | AAAI19 | 95.0 | 87.2 | 70.4 | 84.2 |
GaitPart [9] | CVPR20 | 96.2 | 91.5 | 78.7 | 88.8 |
GLN [18] | ECCV20 | 96.9 | 94.0 | 77.5 | 89.5 |
CSTL [21] | ICCV21 | 97.8 | 93.6 | 84.2 | 91.9 |
3D Local [22] | ICCV21 | 97.5 | 94.3 | 83.7 | 91.8 |
GaitGL [27] | ICCV21 | 97.4 | 94.5 | 83.6 | 91.8 |
LagrangeGait [4] | CVPR22 | 96.9 | 93.5 | 86.5 | 92.3
MetaGait [7] | ECCV22 | 98.1 | 95.2 | 86.9 | 93.4 |
DANet [29] | CVPR23 | 98.0 | 95.9 | 89.9 | 94.6 |
GaitBase [10] | CVPR23 | 97.6 | 94.0 | 77.4 | 89.7 |
GaitGCI [8] | CVPR23 | 98.4 | 96.6 | 88.5 | 94.5 |
STANet [30] | ICCV23 | 98.1 | 96.0 | 89.7 | 94.6 |
DyGait [48] | ICCV23 | 98.4 | 96.2 | 87.8 | 94.1 |
HSTL [46] | ICCV23 | 98.1 | 95.9 | 88.9 | 94.3 |
HiH-S | - | 98.2 | 96.3 | 89.2 | 94.6
TransGait [23] | APIN23 | 98.1 | 94.9 | 85.8 | 92.9
BiFusion [32] | MTA23 | 98.7 | 96.0 | 92.1 | 95.6 |
GaitTAKE [20] | JSTSP23 | 98.0 | 97.5 | 92.2 | 95.9 |
GaitRef [56] | IJCB23 | 98.1 | 95.9 | 88.0 | 94.0 |
MMGaitFormer [6] | CVPR23 | 98.4 | 96.0 | 94.8 | 96.4 |
Method | Venue | NM | BG | CL | Mean |
---|---|---|---|---|---|
GaitSet [5] | AAAI19 | 92.3 | 86.1 | 73.4 | 83.9 |
GaitPart [9] | CVPR20 | 93.1 | 86.0 | 75.1 | 84.7 |
GaitGL [27] | ICCV21 | 94.2 | 90.0 | 81.4 | 88.5 |
GaitBase [10] | CVPR23 | 96.5 | 91.5 | 78.0 | 88.7
HiH-S | - | 94.6 | 91.1 | 84.2 | 90.0
HiH-M | - | 96.8 | 93.9 | 87.0 | 92.6
Evaluation on CASIA-B. Since the RGB videos and silhouette sequences in CASIA-B are not frame-aligned, we report HiH-S results for fair comparison. Other multimodal methods like MMGaitFormer mainly adopt late fusion and thus do not require frame alignment. As shown in Tab. 4, HiH-S matches the highest average accuracy among single-modality methods, on par with top models like DANet and STANet. Moreover, HiH-S demonstrates strong generalizability, surpassing DANet by 24.4% in Rank-1 on Gait3D (see Tab. 1) and STANet by 31.2% in Rank-1 on GREW (see Tab. 2). Benefiting from accurate pose estimation, multimodal methods show advantages on CASIA-B, especially for the CL condition. However, their performance degrades significantly on the larger OUMVLP dataset (see Tab. 3) and in more complex scenarios like Gait3D and GREW (see Tabs. 1 and 2), falling behind even single-modality methods. This highlights the strength of HiH in unconstrained settings. Moreover, to validate the effectiveness of HiH-M in indoor settings against other silhouette-based methods, we report results on CASIA-B* with aligned RGB and silhouettes in Tab. 5. HiH-M demonstrates the best performance, improving over HiH-S by 2.2%, 2.8%, and 2.8% under the three walking conditions, respectively. For more comparison results, such as cross-dataset evaluations, please refer to the supplementary materials.
4.4 Ablation Study
To validate the efficacy of each component in HiH, including the HGD, which provides hierarchical feature learning in depth and width, and DSE and DTA for pose-guided spatial-temporal modeling, we conduct ablation studies on the Gait3D dataset, with results in Tab. 6. The hierarchical depth and width provided by the HGD module lay a foundational baseline for our approach. Adding either DSE or DTA further improves over this baseline. The best results are achieved when all modules are combined.
HGD (Depth) | HGD (Width) | DSE | DTA | Rank-1 | Rank-5 | mAP | mINP
---|---|---|---|---|---|---|---
✓ |  |  |  | 69.2 | 83.7 | 59.6 | 34.8
✓ | ✓ |  |  | 72.4 | 86.9 | 64.4 | 38.1
✓ | ✓ | ✓ |  | 74.3 | 87.4 | 65.1 | 38.7
✓ | ✓ |  | ✓ | 74.6 | 87.2 | 65.3 | 39.2
✓ | ✓ | ✓ | ✓ | 75.8 | 88.3 | 67.3 | 40.4
4.5 Visualization Analysis

We visualize the heatmaps from the last layer of HiH and other methods on Gait3D in Fig. 6. It can be observed that HSTL focuses on sparse key joints like knees, shoulders and arms, with attention on limited body parts. In comparison, GaitBase attends to more body parts within each silhouette, explaining its superior performance over HSTL. Our HiH-S further outputs denser discriminative regions across multiple views, hence achieving better results. By incorporating pose modality, HiH-M obtains more comprehensive coverage of full-body motion areas.
4.6 Trade-off between Accuracy and Efficiency

In our trade-off analysis, shown in Fig. 7, we explore the relationship between model complexity and accuracy. Pose-based methods like GPGait [13] demonstrate parameter efficiency but fall short in performance. In general, larger models with higher FLOPs (floating point operations) achieve better results. DyGait [48], while achieving high accuracy, demands significant computation due to its 3D convolutions. In contrast, GaitBase [10] offers better parameter efficiency by utilizing 2D spatial convolutions, but incurs higher FLOPs, likely due to the lack of effective temporal aggregation mechanisms. Our HiH-S strikes an optimal balance, and HiH-M further enhances performance without substantially increasing computational cost. This is accomplished by using only one 2D and one 3D convolution to extract spatial and temporal dependencies from poses.
5 Conclusion, Limitations, and Future Work
We proposed the HiH framework, combining hierarchical decomposition with multi-modal data for multi-scale motion modeling and spatio-temporal analysis in gait recognition. While achieving state-of-the-art performance, HiH faces limitations in handling heavy occlusions and lacks automated design optimization. Future improvements for HiH include integrating 3D pose estimation from multiple views to mitigate errors, using neural architecture search for automated model design, and applying domain adaptation techniques to address challenges posed by covariates such as different clothing types.
References
- An et al. [2020] Weizhi An, Shiqi Yu, Yasushi Makihara, Xinhui Wu, Chi Xu, Yang Yu, Rijun Liao, and Yasushi Yagi. Performance evaluation of model-based gait on multi-view very large population database with pose sequences. IEEE TBIOM, 2(4):421–430, 2020.
- Bashir et al. [2009] Khalid Bashir, Tao Xiang, and Shaogang Gong. Gait recognition using gait entropy image. In ICDP, pages 1–6, 2009.
- Bouchrika et al. [2011] Imed Bouchrika, Michaela Goffredo, John Carter, and Mark Nixon. On using gait in forensic biometrics. Journal of forensic sciences, 56(4):882–889, 2011.
- Chai et al. [2022] Tianrui Chai, Annan Li, Shaoxiong Zhang, Zilong Li, and Yunhong Wang. Lagrange motion analysis and view embeddings for improved gait recognition. In CVPR, pages 20249–20258, 2022.
- Chao et al. [2019] Hanqing Chao, Yiwei He, Junping Zhang, and Jianfeng Feng. Gaitset: Regarding gait as a set for cross-view gait recognition. In AAAI, pages 8126–8133, 2019.
- Cui and Kang [2023] Yufeng Cui and Yimei Kang. Multi-modal gait recognition via effective spatial-temporal feature fusion. In CVPR, pages 17949–17957, 2023.
- Dou et al. [2022] Huanzhang Dou, Pengyi Zhang, Wei Su, Yunlong Yu, and Xi Li. Metagait: Learning to learn an omni sample adaptive representation for gait recognition. In ECCV, pages 357–374. Springer, 2022.
- Dou et al. [2023] Huanzhang Dou, Pengyi Zhang, Wei Su, Yunlong Yu, Yining Lin, and Xi Li. Gaitgci: Generative counterfactual intervention for gait recognition. In CVPR, pages 5578–5588, 2023.
- Fan et al. [2020] Chao Fan, Yunjie Peng, Chunshui Cao, Xu Liu, Saihui Hou, Jiannan Chi, Yongzhen Huang, Qing Li, and Zhiqiang He. Gaitpart: Temporal part-based model for gait recognition. In CVPR, pages 14225–14233, 2020.
- Fan et al. [2023] Chao Fan, Junhao Liang, Chuanfu Shen, Saihui Hou, Yongzhen Huang, and Shiqi Yu. Opengait: Revisiting gait recognition towards better practicality. In CVPR, pages 9707–9716, 2023.
- Ferber et al. [2016] Reed Ferber, Sean T Osis, Jennifer L Hicks, and Scott L Delp. Gait biomechanics in the era of data science. Journal of biomechanics, 49(16):3759–3761, 2016.
- Filipi Gonçalves dos Santos et al. [2022] Claudio Filipi Gonçalves dos Santos, Diego de Souza Oliveira, Leandro A. Passos, Rafael Gonçalves Pires, Daniel Felipe Silva Santos, Lucas Pascotti Valem, Thierry P. Moreira, Marcos Cleison S. Santana, Mateus Roder, João Paulo Papa, et al. Gait recognition based on deep learning: A survey. CSUR, 55(2):1–34, 2022.
- Fu et al. [2023] Yang Fu, Shibei Meng, Saihui Hou, Xuecai Hu, and Yongzhen Huang. Gpgait: Generalized pose-based gait recognition. In ICCV, pages 19595–19604, 2023.
- Guo and Ji [2023] Hongji Guo and Qiang Ji. Physics-augmented autoencoder for 3d skeleton-based gait recognition. In ICCV, pages 19627–19638, 2023.
- Han and Bhanu [2005] Jinguang Han and Bir Bhanu. Individual recognition using gait energy image. IEEE TPAMI, 28(2):316–322, 2005.
- He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
- Hermans et al. [2017] Alexander Hermans, Lucas Beyer, and Bastian Leibe. In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737, 2017.
- Hou et al. [2020] Saihui Hou, Chunshui Cao, Xu Liu, and Yongzhen Huang. Gait lateral network: Learning discriminative and compact representations for gait recognition. In ECCV, pages 382–398. Springer, 2020.
- Hou et al. [2021] Saihui Hou, Xu Liu, Chunshui Cao, and Yongzhen Huang. Set residual network for silhouette-based gait recognition. IEEE TBIOM, 3(3):384–393, 2021.
- Hsu et al. [2023] Hung-Min Hsu, Yizhou Wang, Cheng-Yen Yang, Jenq-Neng Hwang, Hoang Le Uyen Thuc, and Kwang-Ju Kim. Learning temporal attention based keypoint-guided embedding for gait recognition. IEEE JSTSP, 2023.
- Huang et al. [2021a] Xiaohu Huang, Duowang Zhu, Hao Wang, Xinggang Wang, Bo Yang, Botao He, Wenyu Liu, and Bin Feng. Context-sensitive temporal feature learning for gait recognition. In ICCV, pages 12909–12918, 2021a.
- Huang et al. [2021b] Zhen Huang, Dixiu Xue, Xu Shen, Xinmei Tian, Houqiang Li, Jianqiang Huang, and Xian-Sheng Hua. 3d local convolutional neural networks for gait recognition. In ICCV, pages 14920–14929, 2021b.
- Li et al. [2023] Guodong Li, Lijun Guo, Rong Zhang, Jiangbo Qian, and Shangce Gao. Transgait: Multimodal-based gait recognition with set transformer. Applied Intelligence, 53(2):1535–1547, 2023.
- Li et al. [2021] Xiang Li, Yasushi Makihara, Chi Xu, and Yasushi Yagi. End-to-end model-based gait recognition using synchronized multi-view pose constraint. In ICCV, pages 4106–4115, 2021.
- Liang et al. [2022] Junhao Liang, Chao Fan, Saihui Hou, Chuanfu Shen, Yongzhen Huang, and Shiqi Yu. Gaitedge: Beyond plain end-to-end gait recognition for better practicality. In ECCV, pages 375–390. Springer, 2022.
- Liao et al. [2020] Rijun Liao, Shiqi Yu, Weizhi An, and Yongzhen Huang. A model-based gait recognition method with body pose and human prior knowledge. PR, 98:107069, 2020.
- Lin et al. [2021] Beibei Lin, Shunli Zhang, and Xin Yu. Gait recognition via effective global-local feature representation and local temporal aggregation. In ICCV, pages 14648–14656, 2021.
- Luo et al. [2019] Hao Luo, Youzhi Gu, Xingyu Liao, Shenqi Lai, and Wei Jiang. Bag of tricks and a strong baseline for deep person re-identification. In CVPRW, pages 0–0, 2019.
- Ma et al. [2023a] Kang Ma, Ying Fu, Dezhi Zheng, Chunshui Cao, Xuecai Hu, and Yongzhen Huang. Dynamic aggregated network for gait recognition. In CVPR, pages 22076–22085, 2023a.
- Ma et al. [2023b] Kang Ma, Ying Fu, Dezhi Zheng, Yunjie Peng, Chunshui Cao, and Yongzhen Huang. Fine-grained unsupervised domain adaptation for gait recognition. In ICCV, pages 11313–11322, 2023b.
- Macoveciuc et al. [2019] Ioana Macoveciuc, Carolyn J Rando, and Hervé Borrion. Forensic gait analysis and recognition: standards of evidence admissibility. Journal of forensic sciences, 64(5):1294–1303, 2019.
- Peng et al. [2023] Yunjie Peng, Kang Ma, Yang Zhang, and Zhiqiang He. Learning rich features for gait recognition by integrating skeletons and silhouettes. Multimedia Tools and Applications, pages 1–22, 2023.
- Phinyomark et al. [2015] Angkoon Phinyomark, Sean Osis, Blayne A Hettinga, and Reed Ferber. Kinematic gait patterns in healthy runners: A hierarchical cluster analysis. Journal of biomechanics, 48(14):3897–3904, 2015.
- Ren et al. [2014] Yanzhi Ren, Yingying Chen, Mooi Choo Chuah, and Jie Yang. User verification leveraging gait recognition for smartphone enabled mobile healthcare systems. IEEE TMC, 14(9):1961–1974, 2014.
- Seckiner et al. [2019] Dilan Seckiner, Xanthé Mallett, Philip Maynard, Didier Meuwly, and Claude Roux. Forensic gait analysis—morphometric assessment from surveillance footage. Forensic science international, 296:57–66, 2019.
- Sepas-Moghaddam and Etemad [2022] Alireza Sepas-Moghaddam and Ali Etemad. Deep gait recognition: A survey. IEEE TPAMI, 45(1):264–284, 2022.
- Song et al. [2019] Chunfeng Song, Yongzhen Huang, Yan Huang, Ning Jia, and Liang Wang. Gaitnet: An end-to-end network for gait based human identification. PR, 96:106988, 2019.
- Sun et al. [2020] Fangmin Sun, Weilin Zang, Raffaele Gravina, Giancarlo Fortino, and Ye Li. Gait-based identification for elderly users in wearable healthcare systems. Information fusion, 53:134–144, 2020.
- Sun et al. [2019] Ke Sun, Bin Xiao, Dong Liu, and Jingdong Wang. Deep high-resolution representation learning for human pose estimation. In CVPR, pages 5693–5703, 2019.
- Takemura et al. [2018] Noriko Takemura, Yasushi Makihara, Daigo Muramatsu, Tomio Echigo, and Yasushi Yagi. Multi-view large population gait dataset and its performance evaluation for cross-view gait recognition. IPSJ TCVA, 10:1–14, 2018.
- Teepe et al. [2021] Torben Teepe, Ali Khan, Johannes Gilg, Fabian Herzog, Stefan Hörmann, and Gerhard Rigoll. Gaitgraph: Graph convolutional network for skeleton-based gait recognition. In ICIP, pages 2314–2318, 2021.
- Teepe et al. [2022] Torben Teepe, Johannes Gilg, Fabian Herzog, Stefan Hörmann, and Gerhard Rigoll. Towards a deeper understanding of skeleton-based gait recognition. In CVPR, pages 1569–1577, 2022.
- Tu et al. [2022] Danyang Tu, Xiongkuo Min, Huiyu Duan, Guodong Guo, Guangtao Zhai, and Wei Shen. Iwin: Human-object interaction detection via transformer with irregular windows. In ECCV, pages 87–103, 2022.
- Wan et al. [2018] Changsheng Wan, Li Wang, and Vir V Phoha. A survey on gait recognition. CSUR, 51(5):1–35, 2018.
- Wang et al. [2003] Liang Wang, Tieniu Tan, Huazhong Ning, and Weiming Hu. Silhouette analysis-based gait recognition for human identification. IEEE TPAMI, 25(12):1505–1518, 2003.
- Wang et al. [2023a] Lei Wang, Bo Liu, Fangfang Liang, and Bincheng Wang. Hierarchical spatio-temporal representation learning for gait recognition. In ICCV, pages 19639–19649, 2023a.
- Wang et al. [2023b] Lei Wang, Bo Liu, Bincheng Wang, and Fuqiang Yu. Gaitmm: Multi-granularity motion sequence learning for gait recognition. In ICIP, pages 845–849. IEEE, 2023b.
- Wang et al. [2023c] Ming Wang, Xianda Guo, Beibei Lin, Tian Yang, Zheng Zhu, Lincheng Li, Shunli Zhang, and Xin Yu. Dygait: Exploiting dynamic representations for high-performance gait recognition. In ICCV, pages 13424–13433, 2023c.
- Xia et al. [2022] Zhuofan Xia, Xuran Pan, Shiji Song, Li Erran Li, and Gao Huang. Vision transformer with deformable attention. In CVPR, pages 4794–4803, 2022.
- Xu et al. [2023] Chi Xu, Yasushi Makihara, Xiang Li, and Yasushi Yagi. Occlusion-aware human mesh model-based gait recognition. IEEE TIFS, 18:1309–1321, 2023.
- Yu et al. [2006] Shiqi Yu, Daoliang Tan, and Tieniu Tan. A framework for evaluating the effect of view angle, clothing and carrying condition on gait recognition. In ICPR, pages 441–444, 2006.
- Zhang et al. [2019] Ziyuan Zhang, Luan Tran, Xi Yin, Yousef Atoum, Xiaoming Liu, Jian Wan, and Nanxin Wang. Gait recognition via disentangled representation learning. In CVPR, pages 4710–4719, 2019.
- Zheng et al. [2022a] Jinkai Zheng, Xinchen Liu, Xiaoyan Gu, Yaoqi Sun, Chuang Gan, Jiyong Zhang, Wu Liu, and Chenggang Yan. Gait recognition in the wild with multi-hop temporal switch. In ACM MM, pages 6136–6145, 2022a.
- Zheng et al. [2022b] Jinkai Zheng, Xinchen Liu, Wu Liu, Lingxiao He, Chenggang Yan, and Tao Mei. Gait recognition in the wild with dense 3d representations and a benchmark. In CVPR, pages 20228–20237, 2022b.
- Zheng et al. [2023] Jinkai Zheng, Xinchen Liu, Shuai Wang, Lihao Wang, Chenggang Yan, and Wu Liu. Parsing is all you need for accurate gait recognition in the wild. In ACM MM, pages 116–124, 2023.
- Zhu et al. [2023] Haidong Zhu, Wanrong Zheng, Zhaoheng Zheng, and Ram Nevatia. Gaitref: Gait recognition with refined sequential skeletons. arXiv preprint arXiv:2304.07916, 2023.
- Zhu et al. [2021] Zheng Zhu, Xianda Guo, Tian Yang, Junjie Huang, Jiankang Deng, Guan Huang, Dalong Du, Jiwen Lu, and Jie Zhou. Gait recognition in the wild: A benchmark. In ICCV, pages 14789–14799, 2021.