
Multimodal Adaptive Fusion of Face and Gait Features using Keyless attention based Deep Neural Networks for Human Identification

Ashwin Prakash1, Thejaswin S1 and Athira Nambiar1*, Alexandre Bernardino2 1Department of Computational Intelligence, SRM Institute of Science and Technology, Chennai, India.
{ap4471, ts6959, athiram}@srmist.edu.in
2Department of Electrical and Computer Engineering, Instituto Superior Técnico, Univ. of Lisbon, Portugal.
[email protected]
Abstract

Biometrics plays a significant role in vision-based surveillance applications. Soft biometrics such as gait are widely used alongside the face in surveillance tasks such as person recognition and re-identification. Nevertheless, in practical scenarios, classical fusion techniques respond poorly to changes in individual users and in the external environment. To this end, we propose a novel adaptive multi-biometric fusion strategy for the dynamic incorporation of gait and face biometric cues by leveraging keyless attention deep neural networks. Various external factors, such as viewpoint and distance to the camera, are investigated in this study. Extensive experiments show the superior performance of the proposed model compared with the state-of-the-art model.

Index Terms:
Soft-biometrics, surveillance, Gait, Face, Adaptive fusion, person identification, Deep Learning, attention models, multimodal fusion.

I Introduction

Human biometrics refers to the unique intrinsic physical or behavioural traits that allow distinguishing between different individuals, e.g., face, fingerprint, hand geometry, iris, and gait. The use of biometrics helps in various surveillance applications such as access control, human recognition, and re-identification. Single biometric modalities are often affected by practical challenges such as noisy data, lack of distinctiveness, intra-/inter-class variability, error rates, and spoof attacks. A common method to overcome this issue is to combine multiple biometric modalities, known as multimodal biometric fusion.

A critical constraint that any biometric system confronts is the variation in the environment owing to external conditions. This includes user-induced variability, i.e., inherent distinctiveness, pose, distance, and expression, or environment-induced variability, i.e., lighting conditions, background noise, and weather conditions [1]. These constraints have not been adequately addressed in the literature on multimodal fusion. For instance, most existing works are based on static fusion strategies, wherein the fusion rules are fixed for certain external conditions such as pose, lighting, or distance, or are based on manual computations. As a result, when the environment changes, the biometric system performs sub-optimally. To overcome this issue, a novel context-aware adaptive multibiometric fusion strategy, which can dynamically adapt the fusion rules to external conditions, is proposed in this paper. In particular, the adaptive fusion of gait and face at different viewpoints is investigated using an attention-based deep learning technique.

Face is one of the predominant biometric traits commonly employed in human recognition. Similarly, gait is an important soft biometric commonly used in surveillance applications, because it is unobtrusive and perceivable from a distance [2]. While fusing gait and face, the most influential factors may be the view angle and the distance from the subject to the camera. Notably, gait can be clearly captured in the lateral view, whereas the face is best captured in the frontal view. Based on this rationale, a novel context-aware adaptive fusion mechanism is designed to assign weights to the gait and face biometric cues based on the context. The key notion of the proposed model is that when the person is in a far or lateral view, gait features should be given higher priority than the less visible facial cues, whereas when the person is in a near or frontal view, the face should receive more importance than the partially occluded gait features.

To facilitate the aforesaid context-aware adaptive fusion strategy, keyless attention-based deep learning fusion is leveraged in the multimodal biometric fusion framework. As mentioned in [3], keyless attention is a sophisticated and effective technique for better accounting for the sequential character of data without the need for supplementary input, thereby excelling in identifying relationships across modalities. Extensive experiments are conducted with individual biometric-based identification, naïve bilinear pooling [4] based multimodal fusion, and the keyless attention-based adaptive fusion mechanism. The results clearly highlight the superior performance of the proposed model.

The remainder of this paper is organized as follows. Related works on face and gait-based human recognition are detailed in Section 2. Section 3 presents the framework of the proposed context-aware adaptive multibiometric fusion method. The experiments and results are presented in Sections 4 and 5, respectively. Finally, conclusions and future directions are presented in Section 6.

II Related Work

One of the earliest face recognition systems was reported in [5], based on the manual marking of various facial landmarks. Face recognition from images gained popularity with [6], which introduced the eigenface method. Since then, various other techniques, e.g., Linear Discriminant Analysis (Fisherfaces), Gabor, LBP, and PCANet, have been reported [7]. Recently, deep learning-based techniques have also gained popularity; e.g., DeepFace, FaceNet, and BlazeFace approach human-level performance under unconstrained conditions [7] (DeepFace: 97.35% vs. human: 97.53%).

Classical gait-based identification methods follow either model-based or appearance-based approaches [2]. The former detect joints/body parts using 2D cameras or depth cameras. For example, [8] applied the Hough transform to detect legs in each frame, whereas [9] leveraged Procrustes shape analysis to calculate joint angles of body parts. Gait recognition/re-identification using a Kinect camera has also been proposed in some works [10]. In contrast to model-based approaches, appearance-based approaches use richer information, such as silhouettes of the human body in gait frames, to recognise gait, e.g., the gait energy image (GEI) [11] and GEI-based local multi-scale feature descriptors [12]. Recent deep learning works present more advanced techniques, e.g., view-invariant gait recognition using the convolutional neural network GEINet [13], and a comprehensive model with both LSTM and residual attention components for cross-view gait recognition [14].

On the fusion of gait and face for human identification, one of the early works [15] proposed a fusion strategy combining the results of gait and face recognition algorithms based on sequential importance sampling. A probabilistic combination of facial and gait cues was studied in [16]. Yet another work addresses the adaptive fusion of gait and face via score-level fusion [17]. All the aforementioned studies leverage either classical machine learning techniques using handcrafted features, static fusion rules, or manual computations. On the contrary, in this work, we present a deep learning technique based on a keyless attention-based adaptive fusion mechanism for human identification, one of the first of its kind to the best of our knowledge.

III Multimodal Adaptive Fusion Methodology

The proposed keyless attention-based adaptive fusion of face and gait towards human identification is shown in Fig. 1, in which all the symbols are introduced in the following subsections. The proposed framework maps spatio-temporal feature sequences corresponding to gait and face to a single label. First, the gait and face descriptors of the video sequence are extracted from each frame via a feature extractor module. Next, the attention & fusion block is employed to compute the feature importance and adaptively amalgamate the features. Finally, the class probabilities are generated by a classifier module using a fully connected (FC) layer, followed by a softmax layer.

Figure 1: Overall architecture of the proposed keyless attention-based adaptive fusion of face and gait for person recognition. The gait and face features are encoded by the gait and face feature extraction networks \mathcal{G} and \mathcal{F}, respectively. The outputs are subsequently weighted using keyless attention. Context-aware adaptive multimodal fusion is then employed to fuse the global gait and facial features. Finally, the outputs are passed through the classifier to determine the class (Person ID) of the person.

III-A Gait feature extractor

Gait recognition involves recognizing a person based on their gait features, i.e., movement patterns [18]. The temporal variation in human silhouettes is considered by calculating the cyclic pattern of movement, commonly referred to as the gait cycle. It can be observed that the size of the closed area between the legs and the aspect ratio of the human silhouette alternate periodically in a gait sequence (see Fig. 2(a) and 2(b)). Based on this notion, a complete gait cycle is determined by the number of frames between three consecutive local minima (the two red points in Fig. 2(b)). The corresponding frames are extracted from the RGB images. This gait cycle computation is applied to every person. Accordingly, the video is divided into the number of frames required for gait feature computation.
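As a concrete illustration of this step, the sketch below estimates one gait cycle from the silhouette aspect-ratio signal. It is a minimal example assuming binary silhouette masks are already available; the use of SciPy's find_peaks and the whole-sequence fallback are implementation choices, not prescribed by the paper.

```python
import numpy as np
from scipy.signal import find_peaks

def gait_cycle_span(silhouettes):
    """Estimate one gait cycle from a sequence of binary silhouette masks.

    The bounding-box aspect ratio (height/width) oscillates periodically;
    one cycle is taken as the span between the first and third local minima
    of that signal, mirroring the red points in Fig. 2(b).
    """
    ratios = []
    for mask in silhouettes:
        ys, xs = np.nonzero(mask)
        h = ys.max() - ys.min() + 1
        w = xs.max() - xs.min() + 1
        ratios.append(h / w)
    ratios = np.asarray(ratios)

    # Local minima of the aspect-ratio signal = peaks of its negation.
    minima, _ = find_peaks(-ratios)
    if len(minima) < 3:
        return 0, len(ratios) - 1        # fallback: use the whole sequence
    return minima[0], minima[2]          # three consecutive minima = one cycle
```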

The images are preprocessed and converted from RGB to grayscale for computational efficiency. Further, the extracted frames of gait silhouette images of height H and width W are fed into a Convolutional LSTM [19] architecture, depicted as the gait feature extractor network \mathcal{G} in Fig. 1, to obtain a gait feature descriptor G. Formally, the gait feature sequence of a video can be represented as G=\{g_{1},\cdots,g_{L}\}, g_{i}\in\mathbb{R}^{C}, where g_{i} denotes the gait feature of frame i, C denotes the feature dimension, and L denotes the number of frames.
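A minimal TensorFlow sketch of such a per-frame feature extractor is given below; the layer widths, pooling sizes, and two-block depth are illustrative assumptions, since the paper does not specify the exact ConvLSTM configuration.

```python
import tensorflow as tf

L, H, W = 24, 128, 128   # frames per gait cycle and frame size (see Sec. IV)

def build_sequence_extractor(name):
    """Sketch of a ConvLSTM extractor mapping an L-frame grayscale sequence
    to per-frame feature vectors (usable for both the gait and face streams)."""
    inp = tf.keras.Input(shape=(L, H, W, 1))
    x = tf.keras.layers.ConvLSTM2D(16, 3, padding="same", return_sequences=True)(inp)
    x = tf.keras.layers.TimeDistributed(tf.keras.layers.MaxPooling2D(4))(x)
    x = tf.keras.layers.ConvLSTM2D(8, 3, padding="same", return_sequences=True)(x)
    x = tf.keras.layers.TimeDistributed(tf.keras.layers.MaxPooling2D(4))(x)
    # Flatten each frame's feature map into a C-dimensional vector.
    out = tf.keras.layers.TimeDistributed(tf.keras.layers.Flatten())(x)
    return tf.keras.Model(inp, out, name=name)

gait_extractor = build_sequence_extractor("gait_extractor_G")
```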

Figure 2: (a) Human silhouette taken from a gait video sequence of CASIA-A. (b) Silhouette aspect ratio over the whole video; the marked points in red represent the start and end of one gait cycle. (c), (d) & (e) Glimpses from the CASIA-A dataset at angles 0°, 45°, and 90°, respectively.

III-B Face feature extractor

Face recognition involves recognizing a person by their facial features [5]. In our case, since the viewpoint and distance of the person vary significantly across frames, traditional face detection algorithms that rely on a frontal view do not work well. Hence, facial bounding boxes are initially cropped out of the video frames by leveraging the Google MediaPipe human pose detection framework [20]. The framework employs a two-step detector-tracker setup, where the detector locates the pose region-of-interest (ROI) within the frame and the tracker predicts all 33 keypoints from this ROI. For videos, the detector is run only on the first frame, and the ROI for subsequent images is derived from the pose keypoints of the previous frame.

Figure 3: Process of obtaining face images from Mediapipe pose estimation.

As shown in Fig. 3, from the estimated MediaPipe keypoints, the face region is cropped using fixed measurements relative to the facial keypoint coordinates.
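The snippet below sketches this cropping step with the MediaPipe Pose API. For simplicity it runs the detector per frame rather than the detector-tracker setup, and the nose-centred box sized from the ear-to-ear distance (and the margin value) is a hypothetical rule standing in for the paper's fixed measurements.

```python
import cv2
import mediapipe as mp

mp_pose = mp.solutions.pose

def crop_face(frame_bgr, margin=0.6):
    """Illustrative face crop from MediaPipe pose keypoints (single frame)."""
    with mp_pose.Pose(static_image_mode=True) as pose:
        res = pose.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    if res.pose_landmarks is None:
        return None
    lm = res.pose_landmarks.landmark
    h, w = frame_bgr.shape[:2]
    nose = lm[mp_pose.PoseLandmark.NOSE]
    l_ear, r_ear = lm[mp_pose.PoseLandmark.LEFT_EAR], lm[mp_pose.PoseLandmark.RIGHT_EAR]
    # Half-width of the crop: ear-to-ear distance plus a margin (assumed rule).
    half = abs(l_ear.x - r_ear.x) * w * (1 + margin)
    cx, cy = nose.x * w, nose.y * h
    x0, x1 = int(max(cx - half, 0)), int(min(cx + half, w))
    y0, y1 = int(max(cy - half, 0)), int(min(cy + half, h))
    return frame_bgr[y0:y1, x0:x1]
```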

The cropped face images are preprocessed, converted from RGB to grayscale, and resized to dimension H × W. Further, the images are fed into a Convolutional LSTM [19] architecture to extract a facial feature descriptor F per person, depicted as the face feature extractor network \mathcal{F} in Fig. 1. Formally, the face feature descriptor corresponding to an L-frame video is represented as F=\{f_{1},\cdots,f_{L}\}, f_{i}\in\mathbb{R}^{C}, where f_{i} represents the facial feature of frame i.

III-C Naïve fusion of Face and Gait via Bilinear pooling

As an initial fusion technique, we employ the naïve bilinear pooling (BLP) method [4] to fuse the features. The method takes the 3D tensor outputs from the final max-pooling layers of the face (\mathcal{F}) and gait (\mathcal{G}) feature extraction networks. The outputs are reshaped into matrices of dimensions p\times d and combined via bilinear pooling to obtain the fusion result Z, as follows:

Z = FG^{T}, \quad F\in\mathbb{R}^{p\times d}, \quad G\in\mathbb{R}^{p\times d}   (1)

The matrix Z is then flattened into a vector and passed to a softmax activation function, which computes the probability of class k out of the K classes.
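A minimal sketch of this naïve BLP fusion for a single sample is shown below; the final dense layer mapping the flattened outer product to K class scores is an assumption, since the paper only states that the flattened vector is passed to a softmax.

```python
import tensorflow as tf

def blp_probabilities(F, G, num_classes):
    """Naive bilinear pooling (Eq. 1) for one sample: F and G are the reshaped
    (p x d) face and gait feature matrices from the two extractor networks."""
    Z = tf.matmul(F, tf.transpose(G))                  # Z = F G^T, shape (p, p)
    z = tf.reshape(Z, [1, -1])                         # flatten to a row vector
    # Assumed classifier head: dense layer + softmax over the K identities.
    return tf.keras.layers.Dense(num_classes, activation="softmax")(z)
```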

III-D Keyless attention based Adaptive fusion of Face and Gait

Attention mechanisms are widely used in sequence models, enabling the modelling of dependencies regardless of their location in the input or output sequences [21]. In our case, not every frame in a video contributes equally to identifying a subject. In order to estimate importance weights for each frame, we adopt an attention mechanism. An attention function is a process that takes a query vector and a set of key-value pairs and produces an output. In existing soft attention mechanisms [21], the weight computation is not limited to the feature vectors but also incorporates an additional input, such as the previous hidden state vector of the LSTM or a vector representing a target entity, as in [22]. These additional inputs, used along with the feature vectors and referred to as key vectors, help to find the most relevant weighted average of the feature vectors. However, the weights in our work depend only on the feature vectors and do not require any additional input; hence the name keyless attention, following [3]. In our case, referring to Fig. 1, the gait feature descriptor G and face feature descriptor F are fed into two attention modules, viz. the gait attention block and the face attention block, respectively. Further, multimodal adaptive weights are computed via the fusion mechanism. Detailed explanations are given below.

III-D1 Face Attention

The facial feature is updated by incorporating the attention mechanism to assign weighted visual elements. Formally, face attention is computed as follows:

\bar{f_{i}} = \mathbf{W_{f}} f_{i} + \mathbf{b_{f}}   (2)
\bar{e_{i}} = \bar{u}^{T}\tanh(\bar{f_{i}})   (3)
\bar{\alpha_{i}} = \frac{\exp(\lambda\bar{e_{i}})}{\sum_{k=1}^{L}\exp(\lambda\bar{e_{k}})}   (4)

Here, \bar{f_{i}} is the low-dimensional representation of frame i, and \mathbf{W_{f}} and \mathbf{b_{f}} are learnable parameters. The importance weight \bar{e_{i}} of the element f_{i} is computed as the inner product between the new representation \bar{f_{i}} and a learnable vector \bar{u}. The normalized importance weight of the facial feature, \bar{\alpha_{i}}, is calculated using the softmax function, as shown in Eq. (4). \lambda is a scale factor, ranging between 0 and 1, that ensures that the importance weights are evenly distributed. Nevertheless, the \tanh(\cdot) non-linearity in Eq. (3) may not be effective for learning complicated relationships, since \tanh(x) is roughly linear for x \in [-1, 1]. Therefore, inspired by the method in [23], we leverage an effective gated mechanism, as shown in Eq. (5), to formulate a better normalized facial importance weight \bar{\alpha_{i}}.

\bar{\alpha_{i}} = \frac{\exp\{\lambda\bar{u}^{T}(\tanh(\bar{f_{i}})\odot \mathrm{sigm}(\bar{f_{i}}))\}}{\sum_{k=1}^{L}\exp\{\lambda\bar{u}^{T}(\tanh(\bar{f_{k}})\odot \mathrm{sigm}(\bar{f_{k}}))\}}   (5)
\bar{\alpha} = \sum_{i=1}^{L}\bar{\alpha_{i}}   (6)

where \mathrm{sigm}(\cdot) is the sigmoid non-linearity and \odot denotes element-wise multiplication. This new \bar{\alpha_{i}} is further used to compute the global facial attention weight \bar{\alpha} by combining the facial importance weights across all L frames (refer to Eq. (6)).
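The sketch below implements a literal reading of Eqs. (2), (5) and (6) as a small TensorFlow layer; the layer name, projection dimension, and lambda_scale argument are assumptions, and the same layer (with its own weights) applies equally to the gait stream of Sec. III-D2.

```python
import tensorflow as tf

class KeylessAttention(tf.keras.layers.Layer):
    """Gated keyless attention over per-frame features (Eqs. 2, 5, 6)."""

    def __init__(self, dim, lambda_scale=1.0):
        super().__init__()
        self.proj = tf.keras.layers.Dense(dim)              # W x_i + b
        self.u = self.add_weight(name="u", shape=(dim,))    # learnable vector u
        self.lam = lambda_scale                              # scale factor lambda

    def call(self, feats):                                   # feats: (batch, L, C)
        h = self.proj(feats)                                 # Eq. (2)
        gated = tf.tanh(h) * tf.sigmoid(h)                   # gated non-linearity
        e = tf.reduce_sum(gated * self.u, axis=-1)           # u^T(...), per frame
        attn = tf.nn.softmax(self.lam * e, axis=1)           # Eq. (5)
        global_weight = tf.reduce_sum(attn, axis=1)          # Eq. (6)
        pooled = tf.reduce_sum(feats * attn[..., None], axis=1)  # attended feature
        return pooled, global_weight
```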

III-D2 Gait Attention

Analogous to the face modality, the attention mechanism is incorporated in the gait counterpart as well. The global gait attention weight \bar{\beta} is computed by leveraging the weighted visual elements in the gait stream as follows:

\bar{g_{i}} = \mathbf{W_{g}} g_{i} + \mathbf{b_{g}}   (7)
\bar{\beta_{i}} = \frac{\exp\{\lambda\bar{u}^{T}(\tanh(\bar{g_{i}})\odot \mathrm{sigm}(\bar{g_{i}}))\}}{\sum_{k=1}^{L}\exp\{\lambda\bar{u}^{T}(\tanh(\bar{g_{k}})\odot \mathrm{sigm}(\bar{g_{k}}))\}}   (8)
\bar{\beta} = \sum_{i=1}^{L}\bar{\beta_{i}}   (9)

III-D3 Context-aware Adaptive Fusion

From Eq. (6) and Eq. (9), we obtain the values of \bar{\alpha} and \bar{\beta}, which are the global individual attention weights of the face and gait features, respectively. The weighted average of the face and gait features is computed using the adaptive weights:

\alpha = \frac{\|\bar{\alpha}\|}{\|\bar{\alpha}\| + \|\bar{\beta}\|}   (10)
\beta = \frac{\|\bar{\beta}\|}{\|\bar{\alpha}\| + \|\bar{\beta}\|}   (11)

The adaptive fusion is performed by combining the two features, each multiplied by its global attention weight, as follows:

\mathbb{Z} = \alpha F + \beta G   (12)

where \mathbb{Z} refers to the context-aware adaptively fused feature. \mathbb{Z} is passed to a fully-connected (FC) layer, followed by a softmax function that classifies the feature into one of the K classes. The resultant column vector R is then used to determine the class identifier (Person ID) of the subject for the fused feature \mathbb{Z} by

ID(\mathbb{Z}) = \arg\max(R)   (13)
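Putting Eqs. (10)–(13) together, a minimal per-sample sketch of the adaptive fusion and classification step could look as follows; the classifier argument is assumed to be the FC + softmax head over the K identities.

```python
import tensorflow as tf

def adaptive_fuse_and_classify(face_feat, gait_feat, alpha_bar, beta_bar, classifier):
    """Context-aware adaptive fusion (Eqs. 10-13) for a single sample.

    face_feat / gait_feat are the attended C-dim feature vectors, and
    alpha_bar / beta_bar the global attention weights from Eqs. (6) and (9).
    """
    a, b = tf.norm(alpha_bar), tf.norm(beta_bar)
    alpha = a / (a + b)                              # Eq. (10)
    beta = b / (a + b)                               # Eq. (11)
    z = alpha * face_feat + beta * gait_feat         # Eq. (12): fused feature Z
    probs = classifier(z[tf.newaxis, :])             # FC + softmax -> vector R
    person_id = int(tf.argmax(probs, axis=-1)[0])    # Eq. (13)
    return z, person_id
```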

III-E Objective functions

The model classifier employs the categorical cross-entropy loss, also known as softmax loss. This supervised loss calculates the classification error among the K classes. The number of nodes in the softmax layer depends on the number of identities in the training set. Considering t and w_{k} as the target vector and the learnable weight vector, respectively, the loss is computed as:

Loss = -\sum_{i}^{K} t_{i}\log(\mathrm{softmax}(w_{k}^{T}\mathbb{Z})_{i})   (14)
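In TensorFlow terms, this objective reduces to the standard categorical cross-entropy over the K identity logits; a minimal sketch is shown below.

```python
import tensorflow as tf

# Softmax cross-entropy over K identities (Eq. 14): t_onehot is the one-hot
# target vector t and logits are the pre-softmax scores w^T Z of the FC layer.
loss_fn = tf.keras.losses.CategoricalCrossentropy(from_logits=True)

def classification_loss(t_onehot, logits):
    return loss_fn(t_onehot, logits)
```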

IV Experimental Setup

Dataset: In this work, we use CASIA Gait Dataset A [24], which includes 19,139 images of 20 subjects. Each person has 12 image sequences, 4 sequences for each of the three directions, i.e., 0°, 45°, and 90° (see Fig. 2(c), 2(d), and 2(e)). Among the 4 sequences per angle, 2 are used for training and the remaining 2 for testing.

Evaluation protocols: Standard evaluation metrics, accuracy and log-loss, are employed to validate the performance of our model. Accuracy evaluates how well the algorithm performs across all classes, giving them equal importance, whereas log-loss is a crucial probability-based metric. Mathematically, log-loss is computed as:

log\_loss = -\frac{1}{N}\sum_{i=1}^{N}\left[y_{i}\ln p_{i} + (1-y_{i})\ln(1-p_{i})\right]   (15)

where N is the number of persons, y_{i} is the observed value, and p_{i} is the predicted probability.
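For reference, the metric of Eq. (15) can be computed with a few lines of NumPy; the clipping constant is a standard numerical safeguard, not part of the paper.

```python
import numpy as np

def log_loss(y_true, p_pred, eps=1e-15):
    """Log-loss as in Eq. (15): y_true are 0/1 observations and p_pred the
    predicted probabilities, averaged over the N samples."""
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)   # avoid log(0)
    y = np.asarray(y_true, dtype=float)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
```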

Implementation details: The proposed method is implemented using the TensorFlow framework. During training, video frames corresponding to one gait cycle across the three orientations 0°, 45°, and 90° are considered. In this work, the gait cycle corresponds to L = 24 frames, each of height H = 128 and width W = 128. The images are normalized using the RGB mean and standard deviation of ImageNet before being passed to the network. After dimension reduction, the resulting dimension of the gait and face feature descriptors is 588 each. In the experiments, we use Optuna [25], a hyperparameter optimization framework, to obtain the best hyperparameters for our models. We train the network for approximately 1000 iterations. The implementation runs on a machine with a Tesla V100 GPU and 12 GB RAM, and training the model takes around 1 hour.
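A hypothetical Optuna search loop in the spirit of this setup is sketched below; the search space, the build_fusion_model helper, and the train_ds / val_ds datasets are illustrative placeholders, not the exact configuration used in the paper.

```python
import optuna

def objective(trial):
    # Illustrative search space; parameter names and ranges are assumptions.
    lr = trial.suggest_float("learning_rate", 1e-5, 1e-2, log=True)
    dropout = trial.suggest_float("dropout", 0.1, 0.5)
    batch = trial.suggest_categorical("batch_size", [4, 8, 16])

    model = build_fusion_model(lr=lr, dropout=dropout)   # hypothetical builder
    history = model.fit(train_ds.batch(batch), epochs=10,
                        validation_data=val_ds.batch(batch), verbose=0)
    return history.history["val_accuracy"][-1]           # assumes an accuracy metric

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```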

V Experimental Results

To verify the effectiveness of our proposed approach, various experiments using single feature-based and multimodal fusion-based human identification are carried out. The result summary is shown in Tables I and II. Referring to Table I, the first two rows are single-modality based results, whereas the remaining are multi-modality results.
(i) Face feature based human recognition: Training the facial features separately on each angle (0°, 45°, and 90°) with custom parameters and hyperparameter tuning using Optuna [25] produces accuracies of up to 65%, 80%, and 85%, respectively (refer to Table II). The overall accuracy of the face model incorporating all orientations is 80% (refer to Table I).
(ii) Gait feature based human recognition: Training the gait features across the three viewpoints produces accuracies of 75%, 60%, and 55%, respectively (refer to Table II). Referring to Table I, the overall accuracy of the gait model across viewpoints is observed to be 70%.
One noteworthy observation from the aforesaid single-modality results is that the face and gait models perform best at the 90° and 0° viewpoints, respectively. This accentuates our initial intuition about the influence of viewpoint on the feature modalities. To incorporate the best of both modalities at different viewpoints, we employ fusion techniques. In particular, four fusion approaches are carried out.
(iii) Average based fusion: Average weight-based fusion uses a manual weight input to the face and gait models. For this technique, a weight of 0.5 is assigned to each of the face and gait features, achieving an accuracy of 75% on the test dataset.
(iv) Naïve fusion via BLP: Bilinear pooling fuses the gait and face models, as explained in Section III-C. The model achieves viewpoint-wise accuracies of 85%, 75%, and 85%, as shown in Table II. The overall fused model results in an accuracy of 80%. Compared to the average based fusion model, naïve fusion with BLP improves the accuracy by 5%.
(v) Attention Fusion: In this model, the keyless attention mechanism (discussed in Sec. III-D) is implemented to obtain the global face and gait attention weights \bar{\alpha} and \bar{\beta}. They are multiplied with the respective features F and G and concatenated to obtain a single feature vector. Note that no adaptive fusion strategy is employed in this scheme. This model achieves an overall accuracy of 85% incorporating features over all the viewpoints.
(vi) Context-aware Adaptive Fusion with attention: This strategy applies the proposed context-aware adaptive fusion strategy to the attention module, discussed in Sec. III-D. The viewpoint-wise accuracies attained by this method are 90%, 80%, and 90%, respectively. Referring to Table I, this model outperforms all other models by achieving an overall accuracy of 90%, highlighting the importance of the context-aware fusion of modalities across viewpoints. In terms of the log-loss metric, the adaptive fusion strategy achieves the lowest value, 0.389, compared to all other models.

TABLE I: Overall result summary of the models

Index Model Accuracy (%) Log Loss
(i) Face feature model 80 0.436
(ii) Gait feature model 70 1.641
(iii) Average based fusion 75 0.779
(iv) Naïve Fusion via BLP 80 0.519
(v) Attention Fusion 85 1.619
(vi) Adaptive Fusion Attention 90 0.389
(vii) Geng et al. [17] 86.67 -

To demonstrate the effectiveness of the proposed adaptive fusion of gait and face with attention, a comparative analysis against the state-of-the-art result on the CASIA-A dataset is carried out. The experimental result in Table I (vii) shows that our adaptive fusion result (90%) outperforms the state-of-the-art performance reported by Geng et al. [17], which had an overall average test accuracy of 86.67%. Our attention fusion model also achieves a competitive result (85%) with respect to the state of the art. These results clearly highlight the potential of our proposed attention-based adaptive multimodal fusion using deep neural networks, in contrast to their classical hand-crafted features and manual condition-based fusion approach.

TABLE II: Angle-wise accuracy (%) of all models at 0°, 45°, and 90° with respect to the camera

Angle(°) Face Gait Naïve Fusion Adaptive Fusion
0° 65 75 85 90
45° 75 60 75 80
90° 85 55 85 90

From the viewpoint-wise performance of the models in Table II, some key interpretations can also be drawn. For 0°, the gait model surpasses the face model, which aligns with our intuition that the model learns gait features better when the subject walks laterally. Similarly, the face model works well when the subject walks towards the camera at 90°. However, when the adaptive fusion strategy is applied, the best of both modalities is incorporated adaptively based on the context, resulting in high performance irrespective of the viewpoint.

Figure 4: Confusion matrix of adaptive fusion attention model on the test set

A visual interpretation of the attention-based adaptive fusion model is depicted in Fig. 4 in terms of a confusion matrix. A confusion matrix visualizes and summarizes the performance of a classification algorithm. The results over all three viewpoints on the test dataset of 20 subjects are depicted in Fig. 4. We can observe that 18 out of 20 subjects are correctly classified, the exceptions being subject IDs ‘1’ and ‘13’. Subject ID ‘1’ is most often classified as 14, and subject ID ‘13’ as 5 and 20 in most models. Further, we observe that subject 13 is identified correctly by the face model but not by the gait model, which might be ascribable to the suboptimal learning of the model from the grayscale features of the image.

VI Conclusion and Future Works

In this work, we proposed a multimodal adaptive fusion of face and gait for human identification. In particular, a keyless attention-based deep neural network for learning attention over the gait and face videos and a context-aware adaptive fusion strategy to efficiently extract and fuse the features were presented. Based on the observation that a single biometric modality yields suboptimal results, various studies leveraging average based fusion, naïve fusion, attention fusion, and context-aware adaptive fusion were investigated. Results of the proposed attention-based adaptive fusion strategy show superior performance compared to all the other models as well as the state-of-the-art result. Future improvements can be made by introducing better attention mechanisms, such as dense co-attention and spatial-channel attention, as well as advanced fusion mechanisms like Tucker fusion and block fusion.

References

  • [1] National Research Council “Biometric Recognition: Challenges and Opportunities” Washington, DC: The National Academies Press, 2010
  • [2] Jasvinder Pal Singh, Sanjeev Jain, Sakshi Arora and Uday Pratap Singh “A survey of behavioral biometric gait recognition: Current success and future perspectives” In Computational Methods in Engineering 28.1 Springer, 2021
  • [3] Xiang Long, Chuang Gan, Gerard Melo, Xiao Liu, Yandong Li, Fu Li and Shilei Wen “Multimodal keyless attention fusion for video classification” In AAAI conference on artificial intelligence 32.1, 2018
  • [4] Tsung-Yu Lin, Aruni RoyChowdhury and Subhransu Maji “Bilinear CNN models for fine-grained visual recognition” In IEEE international conference on computer vision, 2015, pp. 1449–1457
  • [5] Woody Bledsoe, Chan Wolf and Charles Bisson “Man-Machine Facial Recognition” Panoramic Research Inc, Palo Alto, 1966
  • [6] Matthew Turk and Alex Pentland “Eigenfaces for recognition” In Journal of Cognitive Neuroscience 3.1, MIT Press, 1991, pp. 71–86
  • [7] Mei Wang and Weihong Deng “Deep face recognition: A survey” In Neurocomputing 429 Elsevier, 2021, pp. 215–244
  • [8] David Cunado, Mark S Nixon and John N Carter “Automatic extraction and description of human gait models for recognition purposes” In Computer vision and image understanding 90.1 Elsevier, 2003, pp. 1–41
  • [9] Liang Wang, Huazhong Ning, Tieniu Tan and Weiming Hu “Fusion of static and dynamic body biometrics for gait recognition” In Transactions on circuits and systems for video technology 14.2 IEEE, 2004, pp. 149–158
  • [10] Athira M Nambiar, Alexandre Bernardino, Jacinto C Nascimento and Ana LN Fred “Towards View-point Invariant Person Re-identification via Fusion of Anthropometric and Gait Features from Kinect Measurements.” In VISIGRAPP (5: VISAPP), 2017, pp. 108–119
  • [11] Jinguang Han and Bir Bhanu “Individual recognition using gait energy image” In IEEE transactions on pattern analysis and machine intelligence 28.2 IEEE, 2005, pp. 316–322
  • [12] Ait O Lishani, Larbi Boubchir, Emad Khalifa and Ahmed Bouridane “Human gait recognition using GEI-based local multi-scale feature descriptors” In Multimedia Tools and Applications 78.5 Springer, 2019, pp. 5715–5730
  • [13] Kohei Shiraga, Yasushi Makihara, Daigo Muramatsu, Tomio Echigo and Yasushi Yagi “Geinet: View-invariant gait recognition using a convolutional neural network” In 2016 international conference on biometrics (ICB), 2016, pp. 1–8 IEEE
  • [14] Shuangqun Li, Wu Liu and Huadong Ma “Attentive spatial–temporal summary networks for feature learning in irregular gait recognition” In IEEE Transactions on Multimedia 21.9 IEEE, 2019, pp. 2361–2375
  • [15] Amit Kale, Amit K RoyChowdhury and Rama Chellappa “Fusion of gait and face for human identification” In IEEE international conference on acoustics, speech, and signal processing 5, 2004, pp. V–901 IEEE
  • [16] Gregory Shakhnarovich and Trevor Darrell “On probabilistic combination of face and gait cues for identification” In Int. Conf. on automatic face gesture recognition, 2002, pp. 176–181 IEEE
  • [17] Xin Geng, Liang Wang, Ming Li, Qiang Wu and Kate Smith-Miles “Adaptive fusion of gait and face for human identification in video” In IEEE Workshop on Applications of Computer Vision, 2008, pp. 1–6 IEEE
  • [18] M Pat Murray “Gait as a total pattern of movement: Including a bibliography on gait” In American Journal of Physical Medicine & Rehabilitation 46.1 LWW, 1967, pp. 290–333
  • [19] Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong and Wang-chun Woo “Convolutional LSTM network: A machine learning approach for precipitation nowcasting” In Advances in neural information processing systems 28, 2015
  • [20] Valentin Bazarevsky, Ivan Grishchenko, Karthik Raveendran, Tyler Zhu, Fan Zhang and Matthias Grundmann “Blazepose: On-device real-time body pose tracking” In arXiv:2006.10204, 2020
  • [21] Dzmitry Bahdanau, Kyunghyun Cho and Yoshua Bengio “Neural machine translation by jointly learning to align and translate” In arXiv preprint arXiv:1409.0473, 2014
  • [22] Linlin Wang, Zhu Cao, Gerard De Melo and Zhiyuan Liu “Relation classification via multi-level attention cnns”, 2016, pp. 1298–1307
  • [23] Yann N. Dauphin, Angela Fan, Michael Auli and David Grangier “Language Modeling with Gated Convolutional Networks” In CoRR abs/1612.08083, 2016 arXiv:1612.08083
  • [24] Liang Wang, Tieniu Tan, Huazhong Ning and Weiming Hu “Silhouette analysis-based gait recognition for human identification” In IEEE transactions on pattern analysis and machine intelligence 25.12 IEEE, 2003, pp. 1505–1518
  • [25] Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta and Masanori Koyama “Optuna: A next-generation hyperparameter optimization framework” In Inter. Conf. on knowledge discovery & data mining, 2019