
DyGait: Exploiting Dynamic Representations for
High-performance Gait Recognition

Ming Wang (Beijing Jiaotong University), Xianda Guo† (PhiGent Robotics), Beibei Lin (National University of Singapore), Tian Yang (PhiGent Robotics), Zheng Zhu (PhiGent Robotics), Lincheng Li (NetEase Fuxi AI Lab), Shunli Zhang (Beijing Jiaotong University), Xin Yu (University of Technology Sydney)
Shunli Zhang is the corresponding author. † Joint first authors.
Abstract

Gait recognition is a biometric technology that recognizes the identity of humans through their walking patterns. Compared with other biometric technologies, gait is more difficult to disguise and can be captured at a long distance without the cooperation of subjects. Thus, gait recognition has unique potential and wide applicability for crime prevention and social security. At present, most gait recognition methods extract features directly from video frames to establish representations. However, these architectures treat all features equally and do not pay enough attention to dynamic features, i.e., representations of the dynamic parts of silhouettes over time (e.g., the legs). Since the dynamic parts of the human body are more informative than other parts (e.g., bags) during walking, in this paper we propose a novel and high-performance framework named DyGait. This is the first gait recognition framework designed to focus on the extraction of dynamic features. Specifically, to take full advantage of dynamic information, we propose a Dynamic Augmentation Module (DAM), which automatically establishes spatial-temporal feature representations of the dynamic parts of the human body. The experimental results show that our DyGait network outperforms other state-of-the-art gait recognition methods, achieving an average Rank-1 accuracy of 71.4% on the GREW dataset, 66.3% on the Gait3D dataset, 98.4% on the CASIA-B dataset and 98.3% on the OU-MVLP dataset.

1 Introduction

Gait recognition is a biometric technology that can identify humans based on their postures at a long distance without the cooperation of subjects. Meanwhile, gait is hard to disguise, and these advantages give gait recognition unique potential for many applications such as crime prevention, person identification, and social security. Even though gait recognition has achieved impressive advances in past years [33, 31, 20, 19, 13, 5, 21], this challenging technique has not yet been widely used in real-world applications due to complex factors such as clothing changes and carried objects [8, 27, 61, 1, 49, 36, 60, 40].

Figure 1: The first row shows the original gait sequence. Bounding boxes mark human silhouette parts: main body (yellow), bag (red), left and right legs (green). The second row shows heatmaps of our DyGait.

As an identification task in computer vision, the essential goal of gait recognition is to learn unique and invariant representations from the temporally changing characteristics of body motion. Currently, there are two main types of feature representation methods to portray human gait. One is the spatial-based method, which usually extracts spatial gait features from whole gait sequences [5, 17]. Despite its low computational cost, this approach may lose temporal information. The other is the temporal-based method, which models temporal features for representations [13, 33]. These CNN-based works can automatically extract spatial and temporal gait features, but they do not focus on the dynamic differences between frames, which may be the most useful cues for gait recognition.

As the first row in Figure 1 shows, the human torso and the bag occupy a large region in each frame and remain almost static and unchanged during walking. In contrast, the legs take up small regions and are continuously changing in motion. It can be observed that the main differences in gait among frames lie in the dynamic features, such as the moving legs. This suggests that some of the static regions, e.g., the bag or coat, are not critical for distinguishing one person from others. From this perspective, we stress that the dynamic parts of the human body are more informative than the others. Therefore, the approach should pay more attention to dynamic features.

Motivated by the above observation, we propose a novel gait recognition method called DyGait, which can automatically extract dynamic information of the gait. As shown in Figure 1, DyGait pays more attention to the motion parts, such as legs and arms.

First, a novel Dynamic Augmentation Module (DAM) is developed to extract more comprehensive representations. DAM is built on a Dynamic Feature Extractor (DFE), which aggregates the global temporal information of the feature maps to generate a gait template. Then, the dynamic feature maps are obtained by computing the difference between the feature maps of each frame and the gait template.

In addition, both Temporal Aggregation (TA) and Horizontal Mapping (HM) operations are applied to generate feature representations [33]. The proposed DyGait achieves strong performance and outperforms other state-of-the-art models by a large margin on GREW, Gait3D, CASIA-B and OU-MVLP. The main contributions are as follows:

1) We propose a novel framework for gait recognition, called DyGait. To the best of our knowledge, this is the first network that is designed to explicitly focus on the extraction of dynamic features of gait.

2) DyGait is built on the Dynamic Augmentation Module (DAM), which allows the network to focus on key information and learn more discriminative representations for gait recognition. Meanwhile, this module can effectively filter out irrelevant noise by paying attention to dynamic information.

3) We achieve state-of-the-art performance on the most popular datasets, including GREW, Gait3D, CASIA-B and OU-MVLP, obtaining an average Rank-1 accuracy of 71.4% on GREW, 66.3% on Gait3D, 98.4% on CASIA-B and 98.3% on OU-MVLP. The experiments demonstrate that our method outperforms previous methods by a large margin.

2 Related Work

Gait Recognition. Most prior works [17, 50, 56] are based on extracting handcrafted features from gait sequences using traditional machine learning approaches. The gait energy image (GEI) [17] used in such investigations is the most popular gait descriptor. Although noise can be effectively suppressed by averaging over a gait cycle within a long temporal range in a GEI, this template loses most details, such as temporal information. Inspired by the successful application of Convolutional Neural Networks (CNNs) in face recognition [44, 37, 52, 9, 23, 59, 16, 24] and person Re-IDentification (Re-ID) [65, 34, 57, 18, 15, 73, 71, 72, 74, 26, 54], recent works propose many CNN-based gait recognition frameworks. Current works in gait recognition are divided into two types of feature representation: spatial feature representation and temporal modeling.

Spatial Feature Representation. The first type regards the gait sequence as a template, which relies on binary human silhouette images. The goal of template generation is to encode a gait cycle into a single image, i.e., a Gait Energy Image (GEI) [17] or a Chrono-Gait Image (CGI) [50]. In the template matching procedure, the gait representation is first extracted from a template image using machine learning approaches [2, 58] or deep learning [56, 41, 64, 5, 67, 66, 31, 6, 3, 4, 38]. Then, similarities between pairs of representations are measured using the Euclidean distance or other metric learning approaches. For example, Shiraga et al. [41] propose the GEINet framework to extract gait features from the Gait Energy Image (GEI), which is generated by using the mean function. Zhang et al. [64] also take the GEI as input to extract gait features. However, the generation process of the GEI causes severe information loss. Hence, Chao et al. [5] propose the GaitSet framework, which first extracts static gait features and then uses a max function to generate gait templates. Zhang et al. [67] propose an attention module to learn the weights of different frames and then adopt a weighted average operation to create a gait template. Although these methods achieve excellent performance and are easy to compute, they do not consider temporal information at the feature extraction stage.

Figure 2: Overview of the whole gait recognition framework. HM means Horizontal Mapping. TA denotes Temporal Aggregation. DAM means Dynamic Augmentation Module.

Temporal Modeling. In the second category, 3D CNNs [27, 28, 31, 32, 30, 39, 53, 29, 11, 62, 63] or LSTMs [55, 43, 48] are used to model the temporal information. These approaches can capture more spatial information and gather more temporal information, but require higher computational costs. Wolf et al. [55] partition a gait sequence into multiple non-overlapping clips and use 3D CNNs to extract each clip's gait features. Thapar et al. [48] adopt a similar strategy to extract gait features and further introduce an LSTM module to aggregate the features of multiple clips. However, this is inflexible because it only extracts and aggregates information from fixed-length clips. Recently, Lin et al. [31] propose a novel framework that combines the advantages of both template-based and sequence-based methods. They first use a 3D CNN to extract spatial-temporal gait features and then generate a gait representation by using statistical functions. However, despite the success of spatial feature representation and temporal modeling, extracting dynamic and changing information with these approaches remains difficult. In other words, they do not focus on the most informative cues in the gait.

Thus, we turn our attention to the dynamic parts of gait and propose the Dynamic Augmentation Module (DAM), which can be used to augment the representation ability. GaitNet, proposed by Zhang et al. [68], is the most related work. Unlike our method, GaitNet learns a representation of gait directly from RGB frames in videos. Compared with [68], our approach automatically disentangles dynamic features from binary silhouettes, which is beneficial to privacy protection and gives it strong robustness to different clothing/carrying conditions. In addition, some recent works extract gait features through optical flow [43], 2D pose [46] and 3D pose [28]. These approaches are robust to clothing variations but depend on the accuracy of optical flow and pose estimation.

3 Proposed Method

In this section, we first overview the framework of the proposed method. Then, we introduce the Dynamic Augmentation Module (DAM), Temporal Aggregation (TA), Horizontal Mapping (HM) and the loss functions we use. Finally, we explain the training and test details.

3.1 Overview

The framework of the proposed method is illustrated in Figure 2, which includes the Dynamic Augmentation Module (DAM), Temporal Aggregation (TA) and Horizontal Mapping (HM). We first use a convolution layer to extract shallow features and then aggregate local temporal information by using Local Temporal Aggregation (LTA) [33]. Assume that $X_{in}\in\mathbb{R}^{C_{in}\times T_{in}\times H_{in}\times W_{in}}$ is the input gait sequence, where $C_{in}$ is the number of input channels, $T_{in}$ is the length of the gait sequence and ($H_{in}$, $W_{in}$) is the image size of each frame. These operations can be represented as

$Y_{L}=\sigma(C^{3\times 1\times 1}(\sigma(C^{3\times 3\times 3}(X_{in})))),$ (1)

where $Y_{L}\in\mathbb{R}^{C_{L}\times T_{L}\times H_{in}\times W_{in}}$ is the output of the Local Temporal Aggregation (LTA), $C_{L}$ is the number of output channels, and $T_{L}$ is the length of the feature map $Y_{L}$. $C^{3\times 3\times 3}$ denotes the 3D convolution with kernel size $3\times 3\times 3$. $C^{3\times 1\times 1}$ denotes the 3D convolution with kernel size $3\times 1\times 1$ and stride $3$. $\sigma$ denotes the activation function.
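For concreteness, a minimal PyTorch sketch of Eq. (1) is given below; the channel width and the choice of LeakyReLU as $\sigma$ are assumptions rather than values specified in the paper.

```python
# A minimal sketch of the shallow feature extraction and LTA step in Eq. (1).
# The channel width c_l and the LeakyReLU activation are assumptions.
import torch
import torch.nn as nn

class ShallowExtractorWithLTA(nn.Module):
    def __init__(self, c_in=1, c_l=32):
        super().__init__()
        # 3x3x3 convolution extracts shallow spatial-temporal features.
        self.conv0 = nn.Conv3d(c_in, c_l, kernel_size=(3, 3, 3), padding=(1, 1, 1), bias=False)
        # LTA: 3x1x1 convolution with temporal stride 3 aggregates local temporal
        # information and reduces the sequence length T_in to T_L.
        self.lta = nn.Conv3d(c_l, c_l, kernel_size=(3, 1, 1), stride=(3, 1, 1), bias=False)
        self.act = nn.LeakyReLU(inplace=True)

    def forward(self, x):  # x: (N, C_in, T_in, H_in, W_in)
        return self.act(self.lta(self.act(self.conv0(x))))  # (N, C_L, T_L, H_in, W_in)


y_l = ShallowExtractorWithLTA()(torch.randn(2, 1, 30, 64, 44))
print(y_l.shape)  # torch.Size([2, 32, 10, 64, 44])
```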

Then, we propose the feature extraction module based on DAM to extract augmented dynamic features. After that, we introduce TA and HM operations to generate feature representations. Finally, the triplet loss and cross-entropy loss are taken as loss functions to train the proposed network [19, 33].
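The TA and HM operations are defined here only by reference to [33]; the sketch below therefore reflects our assumption of their common form, in which TA takes a temporal maximum and HM splits the feature map into horizontal strips, pools each strip, and maps it with its own fully connected layer. The strip count and the max-plus-mean pooling are illustrative choices, not details from this paper.

```python
# A hedged sketch of TA and HM (following common practice in GaitGL-style models).
import torch
import torch.nn as nn

class TemporalAggregation(nn.Module):
    def forward(self, x):             # x: (N, C, T, H, W)
        return x.max(dim=2)[0]        # max over the temporal dimension -> (N, C, H, W)

class HorizontalMapping(nn.Module):
    def __init__(self, c_in, c_out, num_parts=16):
        super().__init__()
        self.num_parts = num_parts
        # One learnable mapping per horizontal strip (assumed design).
        self.fc = nn.Parameter(torch.randn(num_parts, c_in, c_out) * 0.01)

    def forward(self, x):             # x: (N, C, H, W), H divisible by num_parts
        n, c, h, w = x.shape
        parts = x.view(n, c, self.num_parts, h // self.num_parts, w)
        # Max + mean pooling inside each horizontal strip.
        feat = parts.max(dim=-1)[0].max(dim=-1)[0] + parts.mean(dim=(-1, -2))  # (N, C, P)
        feat = feat.permute(2, 0, 1)                                           # (P, N, C)
        return torch.matmul(feat, self.fc).permute(1, 0, 2)                    # (N, P, C_out)


feat = HorizontalMapping(64, 128)(TemporalAggregation()(torch.randn(2, 64, 10, 64, 44)))
print(feat.shape)  # torch.Size([2, 16, 128])
```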

Figure 3: The structure of Dynamic Augmentation Module (DAM).

3.2 Feature Extraction based on DAM

Recently, many researchers use spatial-based [56, 41, 64, 5, 67, 66, 31, 6, 3, 4] or temporal-based models [27, 28, 31, 7, 55, 43, 48] to extract features for gait representations. However, these methods do not pay enough attention to the dynamic information of the human body. As mentioned above, the human torso and some disturbances, such as bags and coats, can be considered static information, while the moving limbs can be regarded as dynamic information. Bags and coats, which do not belong to the human identity information, may harm recognition. On the other hand, the dynamic limbs often exhibit larger changes than the relatively stable torso during walking, which indicates that the dynamic parts of the human body may provide more discriminative information. Traditional gait templates based on the mean function, such as the Gait Energy Image (GEI) [17], preserve the information of the torso completely but weaken the role of the dynamic limbs. To take advantage of dynamic information, we utilize the difference between the gait feature of each frame and the mean-based gait template to generate the dynamic feature map. The DAM block is shown in Figure 3.

Assume that $X_{o}=\left\{f_{i}\,|\,i=1,2,...,T_{o}\right\}$, where $X_{o}\in\mathbb{R}^{C_{o}\times T_{o}\times H_{o}\times W_{o}}$, $C_{o}$ is the number of input channels, $T_{o}$ is the length of the feature map and ($H_{o}$, $W_{o}$) is the image size of each frame. $f_{i}\in\mathbb{R}^{C_{o}\times 1\times H_{o}\times W_{o}}$ is the $i$-th frame of the feature map $X_{o}$. The dynamic feature map can be obtained by

$X_{d}=\left\{f_{i}-X_{m}\,|\,i=1,2,...,T_{o}\right\},$ (2)

where

$X_{m}=\frac{1}{T_{o}}\sum_{i=1}^{T_{o}}f_{i},$ (3)

$X_{d}\in\mathbb{R}^{C_{o}\times T_{o}\times H_{o}\times W_{o}}$ is the dynamic feature map, and $X_{m}\in\mathbb{R}^{C_{o}\times 1\times H_{o}\times W_{o}}$ is the gait template based on the mean function. Based on the dynamic feature map, we propose a Dynamic Feature Extractor (DFE) to establish spatial-temporal feature representations of the dynamic parts of the gait. The DFE can be designed as

$Y_{DFE}=C^{3\times 3\times 3}(X_{d}),$ (4)

where $Y_{DFE}\in\mathbb{R}^{C_{od}\times T_{o}\times H_{o}\times W_{o}}$ is the output of the DFE, $C_{od}$ is the number of output channels and $C^{3\times 3\times 3}$ denotes the 3D convolution operation with kernel size $(3,3,3)$.

Considering that DFE focuses on extracting the dynamic information of the human body, we add a Global Feature Extractor (GFE) to extract the global features of a gait sequence. The Global Feature Extractor (GFE) can be represented as

$Y_{GFE}=C^{1\times 3\times 3}(X_{o}),$ (5)

where $Y_{GFE}\in\mathbb{R}^{C_{od}\times T_{o}\times H_{o}\times W_{o}}$ is the output of the GFE and $C^{1\times 3\times 3}$ denotes the 3D convolution operation with kernel size $(1,3,3)$.

Based on the DFE, in this paper, we propose a novel module, DAM, to produce the dynamic augmentation features, greatly improving the representation ability. The DAM can be denoted as

$Y_{DAM}=\sigma(Y_{GFE}+Y_{DFE}),$ (6)

where $\sigma$ denotes the LeakyReLU function. The Augmented Feature Map (AFM) after DAM can be represented as

$Y_{AFM}=\sigma(C^{3\times 3\times 3}(X_{o}))+Y_{DAM},$ (7)

where $Y_{AFM}\in\mathbb{R}^{C_{od}\times T_{o}\times H_{o}\times W_{o}}$.
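A minimal PyTorch sketch of one DAM block, assembling Eqs. (2)-(7), is given below; the channel widths, padding and the absence of normalization layers are assumptions, not details specified in the paper.

```python
# A minimal sketch of the DAM block described by Eqs. (2)-(7).
import torch
import torch.nn as nn

class DAM(nn.Module):
    def __init__(self, c_in, c_out):
        super().__init__()
        # Global Feature Extractor (GFE): frame-wise 1x3x3 convolution, Eq. (5).
        self.gfe = nn.Conv3d(c_in, c_out, (1, 3, 3), padding=(0, 1, 1), bias=False)
        # Dynamic Feature Extractor (DFE): 3x3x3 convolution on the dynamic map, Eq. (4).
        self.dfe = nn.Conv3d(c_in, c_out, (3, 3, 3), padding=(1, 1, 1), bias=False)
        # Plain spatial-temporal branch used in Eq. (7).
        self.conv = nn.Conv3d(c_in, c_out, (3, 3, 3), padding=(1, 1, 1), bias=False)
        self.act = nn.LeakyReLU(inplace=True)

    def forward(self, x):                               # x = X_o: (N, C_o, T_o, H_o, W_o)
        x_m = x.mean(dim=2, keepdim=True)               # gait template X_m, Eq. (3)
        x_d = x - x_m                                   # dynamic feature map X_d, Eq. (2)
        y_dam = self.act(self.gfe(x) + self.dfe(x_d))   # Eq. (6)
        return self.act(self.conv(x)) + y_dam           # augmented feature map, Eq. (7)


y = DAM(32, 64)(torch.randn(2, 32, 10, 64, 44))
print(y.shape)  # torch.Size([2, 64, 10, 64, 44])
```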

As shown in Figure 2, the feature extraction stage is implemented by multiple convolutions with DAM and the max-pooling operation.

3.3 Loss Function

To achieve the best performance, the triplet loss and cross-entropy loss are used to train our network [6, 14]. Assume that $F_{i}$, $F_{j}$ and $F_{k}$ are the feature representations corresponding to samples $i$, $j$ and $k$, respectively. Note that samples $i$ and $j$ are from class $A$, and sample $k$ belongs to class $B$. The combined loss function can be represented as

$Loss_{all}=Loss_{tri}+Loss_{cse},$ (8)

where $Loss_{tri}$ and $Loss_{cse}$ denote the triplet loss and cross-entropy loss, respectively.

On the one hand, the triplet loss $Loss_{tri}$ is adopted to optimize the inter-class and intra-class distances, which can be defined as

$Loss_{tri}=[D(F_{i},F_{j})-D(F_{i},F_{k})+m]_{+},$ (9)

where $D(\cdot,\cdot)$ denotes the Euclidean distance between two feature representations, $m$ is the margin of the triplet loss, and $[\alpha]_{+}$ is equal to $\max(\alpha,0)$.

On the other hand, the cross-entropy loss $Loss_{cse}$ is introduced to optimize the classification space, which can be formulated as

$Loss_{cse}=-\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{W_{y_{i}}^{T}x_{i}+b_{y_{i}}}}{\sum_{j=1}^{n}e^{W_{j}^{T}x_{i}+b_{j}}},$ (10)

where $x_{i}$ is the feature of the $i$-th sample and $y_{i}$ is its label.

In our method, we obtain multiple column vectors at the Horizontal Mapping stage and then calculate the loss of each column vector following Equation 8 [31, 19].
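The following sketch illustrates how the combined loss of Equation 8 can be applied to each column vector; the Batch-All triplet mining and the per-part classifiers are assumptions about implementation details, not specifics given in the paper.

```python
# A hedged sketch of the combined loss in Eqs. (8)-(10), applied per column vector.
import torch
import torch.nn as nn

def batch_all_triplet(feat, labels, margin=0.2):
    """feat: (N, C) features of one part; labels: (N,) subject IDs."""
    dist = torch.cdist(feat, feat)                          # (N, N) Euclidean distances
    pos = labels.unsqueeze(0) == labels.unsqueeze(1)        # same-identity mask
    neg = ~pos
    pos = pos & ~torch.eye(len(labels), dtype=torch.bool, device=feat.device)
    ap = dist.unsqueeze(2)                                  # d(anchor, positive), (N, N, 1)
    an = dist.unsqueeze(1)                                  # d(anchor, negative), (N, 1, N)
    triplet = torch.relu(ap - an + margin)                  # (N, N, N) hinge terms, Eq. (9)
    valid = pos.unsqueeze(2) & neg.unsqueeze(1)             # valid (anchor, pos, neg) triplets
    return (triplet * valid).sum() / valid.sum().clamp(min=1)

class CombinedLoss(nn.Module):
    def __init__(self, num_parts, feat_dim, num_classes, margin=0.2):
        super().__init__()
        self.margin = margin
        # One assumed classifier per horizontal part for the cross-entropy term, Eq. (10).
        self.classifiers = nn.ModuleList(
            [nn.Linear(feat_dim, num_classes) for _ in range(num_parts)])
        self.ce = nn.CrossEntropyLoss()

    def forward(self, feats, labels):                       # feats: (N, P, C)
        loss = 0.0
        for p in range(feats.size(1)):
            part = feats[:, p]                              # (N, C)
            loss = loss + batch_all_triplet(part, labels, self.margin) \
                        + self.ce(self.classifiers[p](part), labels)   # Eq. (8) per part
        return loss / feats.size(1)


criterion = CombinedLoss(num_parts=16, feat_dim=128, num_classes=250)
loss = criterion(torch.randn(8, 16, 128), torch.randint(0, 250, (8,)))
```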

3.4 Training Details and Test

Training. During the training phase, we first extract feature maps based on the Dynamic Augmentation Module from the input sequence. Then, Temporal Aggregation (TA) and Horizontal Mapping (HM) are used to generate the fixed-size feature representation. After that, the triplet loss and cross-entropy loss are used to optimize the network. The sampling strategy is Batch All (BA) [18, 5], and $P\times K$ instances are sampled in each step, where $P$ is the number of subject IDs and $K$ is the number of samples for each subject ID.
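A minimal sketch of the Batch All ($P\times K$) sampling strategy is shown below; the dataset indexing scheme is an assumption for illustration.

```python
# Batch-All (P x K) sampling: each training batch draws P subject IDs and
# K sequences per ID. label_to_indices is an assumed dataset index.
import random

def sample_pk_batch(label_to_indices, p=32, k=4):
    """label_to_indices: dict mapping subject ID -> list of sequence indices."""
    ids = random.sample(list(label_to_indices.keys()), p)
    batch = []
    for subject in ids:
        seqs = label_to_indices[subject]
        # Sample with replacement if a subject has fewer than K sequences.
        chosen = random.choices(seqs, k=k) if len(seqs) < k else random.sample(seqs, k)
        batch.extend(chosen)
    return batch  # P*K sequence indices forming one training batch
```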

Test. During the test phase, the whole input sequence is fed into the proposed network to generate a feature representation $Y_{HM}$ that represents the human gait. To evaluate the proposed method, we adopt the "Gallery-Probe" pattern to calculate Rank-1 accuracy. The test set is therefore split into two sets, i.e., the gallery set and the probe set. First, we feed all gait sequences of the gallery set into the proposed network to generate feature representations, which serve as the standard view sets. Second, each gait sequence of the probe set is fed into the proposed network to obtain its feature representation. This representation is then used to calculate the Euclidean distance to all representations of the standard view sets, and the label of the gallery sample with the smallest distance is assigned to the probe sample. Finally, we calculate the average accuracy to evaluate the performance of the proposed method.
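The gallery-probe protocol described above can be summarized by the following sketch; tensor shapes and names are our own assumptions.

```python
# Rank-1 evaluation under the Gallery-Probe protocol described above.
import torch

def rank1_accuracy(probe_feat, probe_labels, gallery_feat, gallery_labels):
    """probe_feat: (Np, D), gallery_feat: (Ng, D) flattened gait representations."""
    dist = torch.cdist(probe_feat, gallery_feat)         # (Np, Ng) Euclidean distances
    nearest = dist.argmin(dim=1)                         # index of the closest gallery sample
    pred = gallery_labels[nearest]                       # assign its label to the probe
    return (pred == probe_labels).float().mean().item()  # Rank-1 accuracy
```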

Table 1: Rank-1 accuracy (%), Rank-5 accuracy (%), Rank-10 accuracy (%), and Rank-20 accuracy (%) on the GREW dataset.
Methods Rank-1 Rank-5 Rank-10 Rank-20
PoseGait [28] 0.2 1.0 2.2 4.3
GaitGraph [46] 1.3 3.5 5.1 7.5
GEINet [41] 6.8 13.4 17.0 21.0
TS-CNN [56] 13.6 24.6 30.2 37.0
GaitSet [5] 46.3 63.6 70.3 76.8
GaitPart [13] 44.0 60.7 67.3 73.5
GaitGL [33] 47.3 63.6 69.3 74.2
MGN [51] 44.5 61.3 67.7 72.7
CSTL [22] 50.6 65.9 71.9 76.9
MTSGait [69] 55.3 71.3 76.9 81.6
OpenGait [12] 60.1 - - -
Ours 71.4 83.2 86.8 89.5

4 Experiments

4.1 Datasets

To evaluate the performance of the proposed method, we conduct experiments on four popular gait datasets, including two real-world datasets, i.e., GREW [75] and Gait3D [70], and two datasets captured in laboratory environments, i.e., CASIA-B [61] and OU-MVLP [45].

GREW. The GREW dataset [75] is a large-scale outdoor gait dataset. It includes 26,345 subjects and 128,671 sequences captured by 882 cameras in open environments, providing data in the form of silhouettes, optical flow, GEIs, and 2D/3D human poses. GREW provides age grouping and gender annotation for all subjects. The subjects are divided into five age groups: three adult groups (16-30, 31-45 and 46-60 years), a children group (under 16 years) and an elderly group (over 60 years), each with a similar number of males and females. GREW also provides five carrying conditions and six dressing styles. The dataset is divided into three parts, i.e., the training set with 20,000 identities and 102,887 sequences, the validation set with 345 identities and 1,784 sequences, and the test set with 6,000 identities and 24,000 sequences. These three sets of identities are captured by different cameras. Each test subject has four sequences, two of which are taken as probes and the other two as the gallery. In addition, there is a distractor set with 233,857 sequences.

Gait3D. Gait3D [70] is a large-scale dataset, which contains 4,000 subjects and over 25,000 sequences captured in an unconstrained indoor scene by 39 cameras. It provides 3D SMPL models recovered from video. To facilitate comparison with other algorithms, the Gait3D dataset is divided into a training set with 3,000 subjects and a test set with 1,000 subjects. For the test set, the probe set of 1,000 sequences is constructed by randomly selecting one sequence from each subject, and the remaining sequences are put into the gallery set.

CASIA-B. The CASIA-B dataset [61] is one of the most popular gait databases, which consists of 124 subjects. For each subject, the CASIA-B dataset collects 10 groups of gait sequences (NM#01-06, BG#01-02 and CL#01-02) under three conditions: normal walking (NM), walking with a bag (BG), and walking in coats (CL). Each group contains 11 videos from different view angles ($0^{\circ}$, $18^{\circ}$, $\ldots$, $162^{\circ}$, $180^{\circ}$). In experiments, methods usually adopt three protocols (Small-scale Training (ST), Medium-scale Training (MT), and Large-scale Training (LT)) to evaluate performance [5]. For these three settings, 24, 62, and 74 subjects are taken as the training set and the remaining 100, 62, and 50 subjects are used for testing, respectively. During the training stage, all gait sequences of the training set are used to train the network. In the test phase, gait sequences NM#01-04 are taken as the gallery set, and gait sequences NM#05-06, BG#01-02 and CL#01-02 are used as the probe set to calculate Rank-1 accuracy.

OU-MVLP. OU-MVLP [45] is one of the largest gait recognition datasets. It includes 10,307 subjects, each of which has two groups of gait sequences (Seq#00 and Seq#01). Each group is collected from 14 view angles ($0^{\circ}$, $15^{\circ}$, $\ldots$, $75^{\circ}$, $90^{\circ}$, $180^{\circ}$, $195^{\circ}$, $\ldots$, $255^{\circ}$, $270^{\circ}$). In our experiments, we use the gait sequences of 5,153 subjects to train the proposed network and take the remaining sequences as the test set to evaluate the performance [5]. During the test phase, sequences in Seq#01 are defined as the gallery set, and sequences in Seq#00 are considered as the probe set to calculate Rank-1 accuracy.

Table 2: Rank-1 accuracy (%), Rank-5 accuracy (%), mAP (%) and mINP on the Gait3D dataset.
Methods Rank-1 Rank-5 mAP mINP
PoseGait [28] 0.24 1.08 0.47 0.34
GaitGraph [47] 6.25 16.23 5.18 2.42
GEINet [41] 5.40 14.20 5.06 3.14
GaitSet [5] 36.70 58.30 30.01 17.30
GaitPart [13] 28.20 47.60 21.58 12.36
GLN [19] 31.40 52.90 24.74 13.58
GaitGL [33] 29.70 48.50 22.29 13.26
CSTL [22] 11.70 19.20 5.59 2.59
SMPLGait [70] 46.30 64.50 37.16 22.23
MTSGait [69] 48.70 67.10 37.63 21.92
OpenGait [12] 65.60 - - -
Ours 66.30 80.80 56.40 37.30

4.2 Implementation Details

We pre-process the original gait sequences. For GREW, CASIA-B and OU-MVLP, we normalize the size of each frame to $64\times 44$. For Gait3D, the size of each frame is $128\times 88$. On GREW and Gait3D, DyGait has five blocks built with the proposed DAM. For CASIA-B and OU-MVLP, three DAM blocks are used to build the network. In all experiments, $m$ in Equation 9 is set to 0.2. In Section 3.4, we introduce our sampling strategy Batch All (BA) for the training phase, which contains parameters $P$ and $K$. Parameters ($P$, $K$) are set to (32, 4) for the GREW and Gait3D datasets and to (8, 16) for the CASIA-B dataset. For the OU-MVLP dataset, $P$ and $K$ are set to 32 and 8, respectively. The SGD optimizer is adopted with a learning rate of 0.1 for GREW and Gait3D. For CASIA-B and OU-MVLP, the Adam optimizer [25] is adopted with a learning rate of 1e-4. In the training stage, the frame length of each batch is set to 30. The code of all experiments is written in PyTorch 1.1.0 [35]. In the test phase, all frames of a sequence are taken as input to generate the feature representation. The numbers of training iterations for GREW, Gait3D, CASIA-B and OU-MVLP are set to 200K, 150K, 80K and 210K, respectively.
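For convenience, the hyper-parameters listed above can be summarized in a configuration dictionary as below (the key names are our own, not the authors').

```python
# Summary of the per-dataset hyper-parameters stated in Section 4.2.
CONFIGS = {
    "GREW":    dict(frame_size=(64, 44),  dam_blocks=5, p=32, k=4,  optimizer="SGD",  lr=1e-1, iters=200_000),
    "Gait3D":  dict(frame_size=(128, 88), dam_blocks=5, p=32, k=4,  optimizer="SGD",  lr=1e-1, iters=150_000),
    "CASIA-B": dict(frame_size=(64, 44),  dam_blocks=3, p=8,  k=16, optimizer="Adam", lr=1e-4, iters=80_000),
    "OU-MVLP": dict(frame_size=(64, 44),  dam_blocks=3, p=32, k=8,  optimizer="Adam", lr=1e-4, iters=210_000),
}
# All settings use margin m = 0.2 and 30 frames per sequence during training.
```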

4.3 Comparison with State-of-the-art Methods

Evaluation on GREW. We compare the performance of the proposed method with several gait recognition methods on the GREW dataset and show the complete experimental results in Table 1. The comparison methods include PoseGait [28], GaitGraph [46], GEINet [41], TS-CNN [56], GaitSet [5], GaitPart [13], GaitGL [33], MGN [51], CSTL [22], MTSGait [69] and OpenGait [12]. From Table 1, we find that gait recognition methods that perform well in laboratory scenarios degrade significantly on real-scenario datasets. Although the real-scenario dataset is affected by many external factors, our method performs 11.3% higher than the state-of-the-art method OpenGait [12] on Rank-1 accuracy. Besides, our method achieves 16.1%, 11.9%, 9.9% and 7.9% higher accuracy than MTSGait on Rank-1, Rank-5, Rank-10 and Rank-20, respectively. The experimental results indicate that the proposed method obtains the highest average accuracy, which may be because it can learn more discriminative dynamic information. In addition, the skeleton-based methods [28, 46] are less effective on real-world datasets, which may be because skeletons have less representation capability, whereas silhouette-based methods can obtain more discriminative feature representations.

Evaluation on Gait3D. The gait recognition methods compared on the GREW dataset are also evaluated on Gait3D. As shown in Table 2, since our method pays more attention to the dynamic features of the human body and thus obtains more specific information about the legs and arms, it achieves more appealing performance than the other gait recognition methods. Its Rank-1 accuracy is 0.7% higher than that of OpenGait [12], and its accuracy is 17.6% and 13.7% higher than MTSGait on Rank-1 and Rank-5, respectively. This shows that the dynamic features captured by our method may be more discriminative, which contributes to better performance.

Table 3: Rank-1 accuracy (%) on CASIA-B under all view angles and different conditions in LT setting, excluding identical-view case.
Gallery NM#1-4 $0^{\circ}$-$180^{\circ}$
Probe $0^{\circ}$ $18^{\circ}$ $36^{\circ}$ $54^{\circ}$ $72^{\circ}$ $90^{\circ}$ $108^{\circ}$ $126^{\circ}$ $144^{\circ}$ $162^{\circ}$ $180^{\circ}$ Mean
NM#5-6 GaitSet 90.8 97.9 99.4 96.9 93.6 91.7 95.0 97.8 98.9 96.8 85.8 95.0
GaitPart 94.1 98.6 99.3 98.5 94.0 92.3 95.9 98.4 99.2 97.8 90.4 96.2
MT3D 95.7 98.2 99.0 97.5 95.1 93.9 96.1 98.6 99.2 98.2 92.0 96.7
GaitGL 96.0 98.3 99.0 97.9 96.9 95.4 97.0 98.9 99.3 98.8 94.0 97.4
OpenGait - - - - - - - - - - - 97.6
MetaGait 97.3 99.2 99.5 99.1 97.2 95.5 97.6 99.1 99.3 99.1 96.7 98.1
DyGait (ours) 97.4 98.9 99.2 98.3 97.7 96.8 98.2 99.3 99.3 99.2 97.6 98.4
BG#1-2 GaitSet 83.8 91.2 91.8 88.8 83.3 81.0 84.1 90.0 92.2 94.4 79.0 87.2
GaitPart 89.1 94.8 96.7 95.1 88.3 84.9 89.0 93.5 96.1 93.8 85.8 91.5
MT3D 91.0 95.4 97.5 94.2 92.3 86.9 91.2 95.6 97.3 96.4 86.6 93.0
OpenGait - - - - - - - - - - - 94.0
GaitGL 92.6 96.6 96.8 95.5 93.5 89.3 92.2 96.5 98.2 96.9 91.5 94.5
MetaGait 92.9 96.7 97.1 96.4 94.7 90.4 92.9 97.2 98.5 98.1 92.3 95.2
DyGait (ours) 94.5 96.9 97.4 96.1 95.4 94.0 94.8 97.6 98.5 97.7 94.9 96.2
CL#1-2 GaitSet 61.4 75.4 80.7 77.3 72.1 70.1 71.5 73.5 73.5 68.4 50.0 70.4
GaitPart 70.7 85.5 86.9 83.3 77.1 72.5 76.9 82.2 83.8 80.2 66.5 78.7
OpenGait - - - - - - - - - - - 77.4
MT3D 76.0 87.6 89.8 85.0 81.2 75.7 81.0 84.5 85.4 82.2 68.1 81.5
GaitGL 76.6 90.0 90.3 87.1 84.5 79.0 84.1 87.0 87.3 84.4 69.5 83.6
MetaGait 80.0 91.8 93.0 87.8 86.5 82.9 85.2 90.0 90.8 89.3 78.4 86.9
DyGait (ours) 82.2 93.0 95.2 91.6 87.1 83.4 87.2 90.1 92.4 88.2 75.8 87.8
Table 4: Rank-1 accuracy (%) on OU-MVLP dataset under different view angles, excluding invalid probe sequences.
Method Probe View Mean
$0^{\circ}$ $15^{\circ}$ $30^{\circ}$ $45^{\circ}$ $60^{\circ}$ $75^{\circ}$ $90^{\circ}$ $180^{\circ}$ $195^{\circ}$ $210^{\circ}$ $225^{\circ}$ $240^{\circ}$ $255^{\circ}$ $270^{\circ}$
GEINet 24.9 40.7 51.6 55.1 49.8 51.1 46.4 29.2 40.7 50.5 53.3 48.4 48.6 43.5 45.3
GaitSet 84.5 93.3 96.7 96.6 93.5 95.3 94.2 87.0 92.5 96.0 96.0 93.0 94.3 92.7 93.3
GaitPart 88.0 94.7 97.7 97.6 95.5 96.6 96.2 90.6 94.2 97.2 97.1 95.1 96.0 95.0 95.1
GLN 89.3 95.8 97.9 97.8 96.0 96.7 96.1 90.7 95.3 97.7 97.5 95.7 96.2 95.3 95.6
SRN+CB 91.2 96.5 98.3 98.4 96.3 97.3 96.8 92.3 96.3 98.1 98.1 96.0 97.0 96.2 96.4
GaitGL 90.5 96.1 98.0 98.1 97.0 97.6 97.1 94.2 94.9 97.4 97.4 95.7 96.5 95.7 96.2
DyGait (ours) 96.2 98.2 99.1 99.0 98.6 99.0 98.8 97.9 97.6 98.8 98.6 98.1 98.3 98.2 98.3

Evaluation on CASIA-B. Our approach shows better performance not only on real-scenario datasets but also in lab scenarios. We compare the performance of the proposed method with several gait recognition methods on the CASIA-B dataset and show the complete experimental results in Table 3. The comparison methods include GaitSet [5], GaitPart [13], MT3D [31], GaitGL [33], OpenGait [12] and MetaGait [10]. The experimental results indicate that the proposed method has the highest average accuracy under all conditions (NM, BG and CL). We further explore the performance of the methods under different conditions. It can be observed that the recognition accuracy drops significantly when the external environment changes. For example, the recognition accuracies of GaitGL in NM, BG and CL are 97.4%, 94.5% and 83.6%, respectively. For the MetaGait framework, the corresponding accuracies are 98.1%, 95.2% and 86.9%. Since both methods mentioned above extract the information of different regions of the human gait equally, without emphasizing dynamic features, they are more vulnerable to complex external environments. For the proposed method, the accuracy in NM, BG and CL is 98.4%, 96.2% and 87.8%, respectively; we pay more attention to the dynamic information so that more discriminative features may be captured. Specifically, the recognition accuracy of our method outperforms that of the other methods in BG and CL. Furthermore, our method performs well at some specific angles ($0^{\circ}$ and $90^{\circ}$) in complex environments. For example, the accuracy of MetaGait is 92.9% and 90.4% under the BG condition, while that of our method is 94.5% and 94.0%.

Figure 4: (a) Original sequence. (b) Heatmaps for OpenGait [12]. (c) Heatmaps for GaitGL [33]. (d) Heatmaps for our method.

Evaluation on OU-MVLP. We compare the experimental results of our method with several state-of-the-art methods on the OU-MVLP dataset. The comparison methods include GEINet [41], GaitSet [5], GaitPart [13], GLN [19], SRN+CB [20] and GaitGL [33]. The experimental results are shown in Table 4, which indicates that the proposed method achieves better recognition accuracy than the state-of-the-art methods. The accuracy of the proposed method is 98.3%, which outperforms GaitGL by 2.1%. Furthermore, it can be observed that the accuracy at some specific view angles ($0^{\circ}$ and $180^{\circ}$) is far below the average accuracy. The main reason may be that the gait at these view angles contains less information than at the others. Table 4 shows that the accuracy of our method at $0^{\circ}$ and $180^{\circ}$ is 96.2% and 97.9%, which outperforms GaitGL by 5.7% and 3.7%, respectively. This indicates that the dynamic features extracted by our method capture more effective motion information, which is unique to each individual.

Figure 5: Visualization of the feature distributions by t-SNE on CASIA-B. (a) Our method. (b) Our method without DFE.

4.4 Ablation Study

In Section 3, we propose the Dynamic Augmentation Module (DAM) to improve the feature representation ability. Here, we design additional experiments to explore the role of this module, its components and some critical parameters.

Table 5: The accuracy (%) on GREW under different combinations of the proposed modules. GFE and DFE mean Global Feature Extractor and Dynamic Feature Extractor, respectively.
DAM Rank-1 Rank-5 Rank-10 Rank-20
GFE DFE
✓ ✓ 71.4 83.2 86.8 89.5
✓ × 67.4 80.3 84.6 87.5
× ✓ 70.0 82.2 85.9 88.7

Analysis of DAM. To explore the contribution of the two branches used for feature extraction, we design another two models with only a single branch and conduct the experiments on the real-scenario dataset GREW. The results are shown in Table 5. It can be observed that the models with DFE achieve better performance than the one without DFE. This is because DFE enables the model to focus on the dynamic information of gait, which is helpful for recognition. It can also be observed that combining global features and dynamic features brings extra performance gain. From the results, we can see that the model combining both DFE and GFE performs 4.0% better than the model without DFE. This indicates that using both branches to extract features leads to a more powerful and discriminative representation.

Analysis of the number of DAMs. In this paper, we propose a novel dynamic feature extractor to generate discriminative feature representations, and the proposed DAM can replace any convolutional layer of the network. To explore the optimal number of DAMs, we design ablation studies with different numbers of DAMs on GREW. The experimental results are shown in Table 6. It can be observed that higher recognition accuracy is obtained by using a larger number of DAMs. Therefore, the number of DAMs on the GREW dataset is finally set to five.

Table 6: Accuracy (%) on GREW with different numbers of DAMs.
Number of DAMs Rank-1 Rank-5 Rank-10 Rank-20
1 13.9 24.5 30.1 35.7
2 41.4 56.8 62.8 68.0
3 57.2 70.9 75.6 79.6
4 69.2 80.8 84.7 87.9
5 71.4 83.2 86.8 89.5

5 Visualization

In this section, we visualize the feature distributions of the models with and without DFE on the CASIA-B dataset. As shown in Figure 5, we can observe that the intra-class distance of the features becomes smaller and the inter-class distance becomes larger when the DFE module is added. The visualization demonstrates that by introducing the DFE, which extracts the dynamic information, more discriminative representations are obtained, contributing to a better inter-class and intra-class distribution.

Furthermore, we visualize the heatmaps of existing methods and our DyGait. As shown in Figure 4, the methods in [12, 33] fail to distinguish the differences between the dynamic parts (e.g., legs and arms) and the static parts (e.g., torso) during walking. Figure 4 (d) shows that our DyGait pays more attention to the dynamic parts. This is because our DyGait explicitly models the motion information by DAM, which enhances the discriminativeness of the moving parts.

6 Conclusion

In this paper, we propose a novel network to generate more discriminative representations for gait recognition. The proposed DyGait includes both the Global Feature Extractor (GFE) and the Dynamic Feature Extractor (DFE) modules, so the model can extract not only spatial-temporal features but also dynamic features from gait sequences. We develop the feature extraction module based on dynamic augmentation to generate augmented features. Taking the dynamic feature extractor as an extra branch effectively enhances the discriminability of the global feature representation. Experiments on four popular datasets indicate that the proposed method achieves superior performance.

References

  • [1] Feng Bao, Yifei Cao, Shunli Zhang, Beibei Lin, and Sicong Zhao. Using segmentation with multi-scale selective kernel for visual object tracking. IEEE SPL, 2022.
  • [2] Khalid Bashir, Tao Xiang, and Shaogang Gong. Gait recognition without subject cooperation. PRL, 2010.
  • [3] Tianrui Chai, Xinyu Mei, Annan Li, and Yunhong Wang. Semantically-guided disentangled representation for robust gait recognition. In ICME, 2021.
  • [4] Tianrui Chai, Xinyu Mei, Annan Li, and Yunhong Wang. Silhouette-based view-embeddings for gait recognition under multiple views. In ICIP, 2021.
  • [5] Hanqing Chao, Yiwei He, Junping Zhang, and Jianfeng Feng. Gaitset: Regarding gait as a set for cross-view gait recognition. In AAAI, 2019.
  • [6] Hanqing Chao, Kun Wang, Yiwei He, Junping Zhang, and Jianfeng Feng. Gaitset: Cross-view gait recognition through utilizing gait as a deep set. TPAMI, 2021.
  • [7] Ching-Hang Chen and Deva Ramanan. 3D human pose estimation = 2D pose estimation + matching. In CVPR, 2017.
  • [8] Patrick Connor and Arun Ross. Biometric recognition by gait: A survey of modalities and features. CVIU, 2018.
  • [9] Jiankang Deng, Jia Guo, and Stefanos Zafeiriou. ArcFace: Additive angular margin loss for deep face recognition. In CVPR, 2019.
  • [10] Huanzhang Dou, Pengyi Zhang, Wei Su, Yunlong Yu, and Xi Li. Metagait: Learning to learn an omni sample adaptive representation for gait recognition. In ECCV, 2022.
  • [11] Huanzhang Dou, Pengyi Zhang, Yuhan Zhao, Lin Dong, Zequn Qin, and Xi Li. Gaitmpl: Gait recognition with memory-augmented progressive learning. IEEE TIP, 2022.
  • [12] Chao Fan, Junhao Liang, Chuanfu Shen, Saihui Hou, Yongzhen Huang, and Shiqi Yu. Opengait: Revisiting gait recognition toward better practicality. In CVPR, 2023.
  • [13] Chao Fan, Yunjie Peng, Chunshui Cao, Xu Liu, Saihui Hou, Jiannan Chi, Yongzhen Huang, Qing Li, and Zhiqiang He. Gaitpart: Temporal part-based model for gait recognition. In CVPR, 2020.
  • [14] Hongyu Fu, Chen Liu, Xingqun Qi, Beibei Lin, Lincheng Li, Li Zhang, and Xin Yu. Sign spotting via multi-modal fusion and testing time transferring. In ECCVW, 2023.
  • [15] Yang Fu, Yunchao Wei, Yuqian Zhou, Honghui Shi, Gao Huang, Xinchao Wang, Zhiqiang Yao, and Thomas Huang. Horizontal pyramid matching for person re-identification. In AAAI, 2019.
  • [16] Yandong Guo, Lei Zhang, Yuxiao Hu, Xiaodong He, and Jianfeng Gao. MS-Celeb-1M: A dataset and benchmark for large-scale face recognition. In ECCV, 2016.
  • [17] Ju Han and Bir Bhanu. Individual recognition using gait energy image. TPAMI, 2006.
  • [18] Alexander Hermans, Lucas Beyer, and Bastian Leibe. In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737, 2017.
  • [19] Saihui Hou, Chunshui Cao, Xu Liu, and Yongzhen Huang. Gait lateral network: Learning discriminative and compact representations for gait recognition. In ECCV, 2020.
  • [20] Saihui Hou, Xu Liu, Chunshui Cao, and Yongzhen Huang. Set residual network for silhouette-based gait recognition. TBIOM, 2021.
  • [21] Xiaohu Huang, Xinggang Wang, Botao He, Shan He, Wenyu Liu, and Bin Feng. Star: Spatio-temporal augmented relation network for gait recognition. IEEE TBIOM, 2022.
  • [22] Xiaohu Huang, Duowang Zhu, Hao Wang, Xinggang Wang, Bo Yang, Botao He, Wenyu Liu, and Bin Feng. Context-sensitive temporal feature learning for gait recognition. In ICCV, 2021.
  • [23] Yuge Huang, Yuhan Wang, Ying Tai, Xiaoming Liu, Pengcheng Shen, Shaoxin Li, Jilin Li, and Feiyue Huang. CurricularFace: adaptive curriculum learning loss for deep face recognition. In CVPR, 2020.
  • [24] Ira Kemelmacher-Shlizerman, Steven M Seitz, Daniel Miller, and Evan Brossard. The MegaFace benchmark: 1 million faces for recognition at scale. In CVPR, 2016.
  • [25] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
  • [26] Wei Li, Rui Zhao, Tong Xiao, and Xiaogang Wang. DeepReID: Deep filter pairing neural network for person re-identification. In CVPR, 2014.
  • [27] Rijun Liao, Chunshui Cao, Edel B Garcia, Shiqi Yu, and Yongzhen Huang. Pose-based temporal-spatial network (ptsn) for gait recognition with carrying and clothing variations. In CCBR, 2017.
  • [28] Rijun Liao, Shiqi Yu, Weizhi An, and Yongzhen Huang. A model-based gait recognition method with body pose and human prior knowledge. PR, 2020.
  • [29] Beibei Lin, Chen Liu, Lincheng Li, Robby T Tan, and Xin Yu. Uncertainty-aware gait recognition via learning from dirichlet distribution-based evidence. ArXiv, 2022.
  • [30] Beibei Lin, Yu Liu, and Shunli Zhang. Gaitmask: Mask-based model for gait recognition. In BMVC, 2021.
  • [31] Beibei Lin, Shunli Zhang, and Feng Bao. Gait recognition with multiple-temporal-scale 3d convolutional neural network. In ACM MM, 2020.
  • [32] Beibei Lin, Shunli Zhang, Yu Liu, and Shengdi Qin. Multi-scale temporal information extractor for gait recognition. In ICIP, 2021.
  • [33] Beibei Lin, Shunli Zhang, and Xin Yu. Gait recognition via effective global-local feature representation and local temporal aggregation. In ICCV, 2021.
  • [34] Hao Luo, Youzhi Gu, Xingyu Liao, Shenqi Lai, and Wei Jiang. Bag of tricks and a strong baseline for deep person re-identification. In CVPR Workshops, 2019.
  • [35] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. NeurIPS, 2019.
  • [36] Sudeep Sarkar, P Jonathon Phillips, Zongyi Liu, Isidro Robledo Vega, Patrick Grother, and Kevin W Bowyer. The humanid gait challenge problem: Data sets, performance, and analysis. TPAMI, 2005.
  • [37] Florian Schroff, Dmitry Kalenichenko, and James Philbin. FaceNet: A unified embedding for face recognition and clustering. In CVPR, 2015.
  • [38] Chuanfu Shen, Chao Fan, Wei Wu, Rui Wang, George Q Huang, and Shiqi Yu. Lidar gait: Benchmarking 3d gait recognition with point clouds. ArXiv, 2022.
  • [39] Chuanfu Shen, Beibei Lin, Shunli Zhang, George Q Huang, Shiqi Yu, and Xin Yu. Gait recognition with mask-based regularization. ArXiv, 2022.
  • [40] Chuanfu Shen, Shiqi Yu, Jilong Wang, George Q Huang, and Liang Wang. A comprehensive survey on deep gait recognition: algorithms, datasets and challenges. ArXiv, 2022.
  • [41] Kohei Shiraga, Yasushi Makihara, Daigo Muramatsu, Tomio Echigo, and Yasushi Yagi. Geinet: View-invariant gait recognition using a convolutional neural network. In ICB, 2016.
  • [42] K. Shiraga, Y. Makihara, D. Muramatsu, T. Echigo, and Y. Yagi. GEINet: View-invariant gait recognition using a convolutional neural network. In ICB, 2016.
  • [43] Anna Sokolova and Anton Konushin. Pose-based deep gait recognition. IET Biometrics, 2019.
  • [44] Yaniv Taigman, Ming Yang, Marc’Aurelio Ranzato, and Lior Wolf. DeepFace: Closing the gap to human-level performance in face verification. In CVPR, 2014.
  • [45] Noriko Takemura, Yasushi Makihara, Daigo Muramatsu, Tomio Echigo, and Yasushi Yagi. Multi-view large population gait dataset and its performance evaluation for cross-view gait recognition. CVA, 2018.
  • [46] Torben Teepe, Ali Khan, Johannes Gilg, Fabian Herzog, Stefan Hörmann, and Gerhard Rigoll. GaitGraph: Graph convolutional network for skeleton-based gait recognition. arXiv preprint arXiv:2101.11228, 2021.
  • [47] Torben Teepe, Ali Khan, Johannes Gilg, Fabian Herzog, Stefan Hörmann, and Gerhard Rigoll. Gaitgraph: Graph convolutional network for skeleton-based gait recognition. In ICIP. IEEE, 2021.
  • [48] Daksh Thapar, Gaurav Jaswal, Aditya Nigam, and Chetan Arora. Gait metric learning siamese network exploiting dual of spatio-temporal 3d-cnn intra and lstm based inter gait-cycle-segment features. PRL, 2019.
  • [49] Senmao Tian, Shunli Zhang, and Beibei Lin. Blind image deblurring based on dual attention network and 2d blur kernel estimation. In ICIP, 2021.
  • [50] Chen Wang, Junping Zhang, Liang Wang, Jian Pu, and Xiaoru Yuan. Human identification using temporal information preserving gait template. TPAMI, 2011.
  • [51] Guanshuo Wang, Yufeng Yuan, Xiong Chen, Jiwei Li, and Xi Zhou. Learning discriminative features with multiple granularities for person re-identification. In ACM MM, 2018.
  • [52] Hao Wang, Yitong Wang, Zheng Zhou, Xing Ji, Zhifeng Li, Dihong Gong, Jingchao Zhou, and Wei Liu. CosFace: Large margin cosine loss for deep face recognition. In CVPR, 2018.
  • [53] Ming Wang, Beibei Lin, Xianda Guo, Lincheng Li, Zheng Zhu, Jiande Sun, Shunli Zhang, Yu Liu, and Xin Yu. Gaitstrip: Gait recognition via effective strip-based feature representations and multi-level framework. In ACCV, 2022.
  • [54] Longhui Wei, Shiliang Zhang, Wen Gao, and Qi Tian. Person transfer GAN to bridge domain gap for person re-identification. In CVPR, 2018.
  • [55] Thomas Wolf, Mohammadreza Babaee, and Gerhard Rigoll. Multi-view gait recognition using 3d convolutional neural networks. In ICIP, 2016.
  • [56] Zifeng Wu, Yongzhen Huang, Liang Wang, Xiaogang Wang, and Tieniu Tan. A comprehensive study on cross-view gait based human identification with deep cnns. TPAMI, 2016.
  • [57] Qiqi Xiao, Hao Luo, and Chi Zhang. Margin sample mining loss: A deep learning based method for person re-identification. arXiv:1710.00478, 2017.
  • [58] Xianglei Xing, Kejun Wang, Tao Yan, and Zhuowen Lv. Complete canonical correlation analysis with application to multi-view gait recognition. PR, 2016.
  • [59] Dong Yi, Zhen Lei, Shengcai Liao, and Stan Z Li. Learning face representation from scratch. arXiv:1411.7923, 2014.
  • [60] Shiqi Yu, Yongzhen Huang, Liang Wang, Yasushi Makihara, Edel B García Reyes, Feng Zheng, Md Atiqur Rahman Ahad, Beibei Lin, Yuchao Yang, Haijun Xiong, et al. Hid 2021: Competition on human identification at a distance 2021. In IJCB, 2021.
  • [61] Shiqi Yu, Daoliang Tan, and Tieniu Tan. A framework for evaluating the effect of view angle, clothing and carrying condition on gait recognition. In ICPR, 2006.
  • [62] Weichen Yu, Hongyuan Yu, Yan Huang, Chunshui Cao, and Liang Wang. Cntn: Cyclic noise-tolerant network for gait recognition. ArXiv, 2022.
  • [63] Weichen Yu, Hongyuan Yu, Yan Huang, and Liang Wang. Generalized inter-class loss for gait recognition. In ACM MM, 2022.
  • [64] Cheng Zhang, Wu Liu, Huadong Ma, and Huiyuan Fu. Siamese neural network based gait recognition for human identification. In ICASSP, 2016.
  • [65] Xuan Zhang, Hao Luo, Xing Fan, Weilai Xiang, Yixiao Sun, Qiqi Xiao, Wei Jiang, Chi Zhang, and Jian Sun. AlignedReID: Surpassing human-level performance in person re-identification. arXiv:1711.08184, 2017.
  • [66] Yuqi Zhang, Yongzhen Huang, Liang Wang, and Shiqi Yu. A comprehensive study on gait biometrics using a joint cnn-based method. PR, 2019.
  • [67] Yuqi Zhang, Yongzhen Huang, Shiqi Yu, and Liang Wang. Cross-view gait recognition by discriminative feature learning. TIP, 2019.
  • [68] Ziyuan Zhang, Luan Tran, Feng Liu, and Xiaoming Liu. On learning disentangled representations for gait recognition. TPAMI, 2020.
  • [69] Jinkai Zheng, Xinchen Liu, Xiaoyan Gu, Yaoqi Sun, Chuang Gan, Jiyong Zhang, Wu Liu, and Chenggang Yan. Gait recognition in the wild with multi-hop temporal switch. In ACM MM, 2022.
  • [70] Jinkai Zheng, Xinchen Liu, Wu Liu, Lingxiao He, Chenggang Yan, and Tao Mei. Gait recognition in the wild with dense 3d representations and a benchmark. In CVPR, 2022.
  • [71] Liang Zheng, Zhi Bie, Yifan Sun, Jingdong Wang, Chi Su, Shengjin Wang, and Qi Tian. MARS: A video benchmark for large-scale person re-identification. In ECCV, 2016.
  • [72] Liang Zheng, Liyue Shen, Lu Tian, Shengjin Wang, Jingdong Wang, and Qi Tian. Scalable person re-identification: A benchmark. In ICCV, 2015.
  • [73] Liang Zheng, Yi Yang, and Alexander G Hauptmann. Person re-identification: Past, present and future. arXiv:1610.02984, 2016.
  • [74] Zhedong Zheng, Liang Zheng, and Yi Yang. Unlabeled samples generated by gan improve the person re-identification baseline in vitro. In ICCV, 2017.
  • [75] Zheng Zhu, Xianda Guo, Tian Yang, Junjie Huang, Jiankang Deng, Guan Huang, Dalong Du, Jiwen Lu, and Jie Zhou. Gait recognition in the wild: A benchmark. In ICCV, 2021.