Spatial-Temporal Decoupling Contrastive Learning for Skeleton-based Human Action Recognition

Shaojie Zhang¹ Jianqin Yin^1,2¹¹1Corresponding Author Yonghao Dang¹ ¹Beijing University of Posts and Telecommunications
²Queen Mary School Hainan, Beijing University of Posts and Telecommunications
{zsj, join, dyh2018}@bupt.edu.cn

Abstract

Skeleton-based action recognition is a central task in human-computer interaction. However, most previous methods suffer from two issues: (i) semantic ambiguity arising from spatial-temporal information mixture; and (ii) overlooking the explicit exploitation of the latent data distributions (i.e., the intra-class variations and inter-class relations), thereby leading to sub-optimum solutions of the skeleton encoders. To mitigate this, we propose a spatial-temporal decoupling contrastive learning (STD-CL) framework to obtain discriminative and semantically distinct representations from the sequences, which can be incorporated into various previous skeleton encoders and can be removed when testing. Specifically, we decouple the global features into spatial-specific and temporal-specific features to reduce the spatial-temporal coupling of features. Furthermore, to explicitly exploit the latent data distributions, we employ the attentive features to contrastive learning, which models the cross-sequence semantic relations by pulling together the features from the positive pairs and pushing away the negative pairs. Extensive experiments show that STD-CL with four various skeleton encoders (HCN, 2S-AGCN, CTR-GCN, and Hyperformer) achieves solid improvements on NTU60, NTU120, and NW-UCLA benchmarks. The code will be released at https://github.com/LibertyZsj/STD-CL soon.

1 Introduction

Human action recognition is an active task in the computer vision area. Due to its wide range of applications in human-computer interaction, video analysis, virtual reality, and so on, this task has been researched extensively in the past decade Ren et al. (2020). In recent years, with the development of depth sensors Cao et al. (2017) and human pose estimation algorithms Dang et al. (2019), the skeleton sequence consisting of coordinates of human joints can be easily acquired. The skeleton-based action recognition utilizing human joints has attracted much interest due to its robustness to background clutter, illumination, and viewpoint changes Zhang et al. (2019).

Refer to caption — Figure 1: The framework of action recognition in the previous methods. A skeleton encoder extracts the spatial-temporal co-occurrence feature, and a softmax-based linear classifier is used to project the feature to the category distributions.

Generally, most of the previous approaches Shi et al. (2019); Chen et al. (2021a, b) in skeleton-based action recognition follow a paradigm as shown in Figure 1. A skeleton encoder aggregates the spatial-temporal co-occurrence feature Li et al. (2018), and a softmax-based linear classifier is used to project the feature to the category distributions. However, this paradigm has some limitations. The main problems are semantic ambiguity arising from the spatial-temporal information mixture in the co-occurrence feature and the absence of explicit exploitation of the latent data distributions.

The movements of joints in the skeleton sequences can be decoupled into spatial dimension and temporal dimension. The spatial dimension indicates the topological relations and the temporal dimension indicates the trajectory of action, each reflecting distinct semantics at the corresponding dimension Xu et al. (2023). For example, temporal-specific information is more important when distinguishing the actions “put on a shoe” and “take off a shoe”, which differ more in the frame-level dimension. Conversely, the actions “point to something” and “throw”, have more spatial differences in the joint-level dimension, so spatial-specific information is more important to distinguish them. The skeleton encoders blend the spatial-temporal information in their aggregations, leading to a mixture of spatial-specific and temporal-specific features and indistinctiveness of semantics in the global features. Thus, the discriminative capability of existing skeleton encoders is limited by the semantic ambiguity arising from the spatial-temporal information mixture.

Nevertheless, as shown in Figure 2, both the action “put on a hat” and the action “put on glasses” contains a crucial sub-action “lift arms”, leading to semantic similarity and less distinct inter-class distances to some extent. Furthermore, due to individual differences and the unknowability of the external environment, actions within the same category exhibit intra-class variation. The linear classifier is typically trained to optimize accuracy only, paying less attention to exploiting the latent data distributions Wang et al. (2023). The t-SNE Van der Maaten and Hinton (2008) distribution of features extracted from CTR-GCN Chen et al. (2021a) is shown in Figure 2 (b), which can’t locate a distinguishable classification boundary. Therefore, the absence of explicit modeling of the intra-class variation and inter-class relation is insufficient for recognizing the ambiguous samples.

To alleviate these drawbacks, we propose a spatial-temporal decoupling contrastive learning (STD-CL) framework to encourage the skeleton encoders to learn more discriminative and semantically distinct representations. It’s worth noting that our STD-CL could be combined with previous skeleton encoders and can be removed naturally without additional consumption when testing. In the training phase, the global features are decoupled into spatial attentive features and temporal attentive features through the designed spatial-temporal feature decoupling (STFD) module. Moreover, we build two memory banks to store the decoupled spatial and temporal features respectively, and sample the positives and negatives from them according to labels. In particular, we employ contrastive learning to model the cross-sequence semantic relations by pulling together the decoupled features from positive pairs and pushing away the features from negative pairs. With constant contrast, the skeleton encoders are forced to explicitly explore the latent data distributions with distinct inter-class relations and reduced intra-class variations. Meanwhile, since the features used for contrast are decoupled, the degree of spatial-temporal information coupling within global features decreases so that the semantics become more explicit. Consequently, in the testing phase, the skeleton encoders can directly predict the categories with higher accuracy.

We summarize our main contributions as follows:

•

To extract more discriminative and semantically distinct features from the skeleton sequences, we proposed a novel STD-CL framework to decouple the features into spatial- and temporal-specific features and apply them for contrastive learning to explore the latent data distributions.
•

The proposed STD-CL can be seamlessly incorporated into various previous skeleton encoders, which can be regarded as plug-and-play in the training stage and can be removed at the testing stage.
•

Extensive experiments show that STD-CL achieves significant improvements based on four various state-of-the-art methods (HCN Li et al. (2018), 2S-AGCN Shi et al. (2019), CTR-GCN Chen et al. (2021a), and Hyperformer Zhou et al. (2022)) on three benchmark datasets (NTU60, NTU120, and NW-UCLA).

2 Related Work

2.1 Skeleton-based Action Recognition

Skeleton-based action recognition aims to classify actions from sequences of human keypoints. In early studies, RNN was a natural choice to handle the skeleton sequences. HBRNN Du et al. (2015) applied a hierarchical RNN architecture to model the long-range temporal dependency of the skeleton sequences. STA-LSTM Song et al. (2017) proposed an end-to-end spatial and temporal attention model to focus on discriminative joints of the skeleton within each frame of the inputs. Inspired by the success of CNN in image recognition, CNN-based methods have also been studied. HCN Li et al. (2018) treated the joint dimension as channels to aggregate different levels of spatial-temporal context information. TA-CNN Xu et al. (2022) adopted a pure CNN architecture to model the irregular skeleton topology.

The human body can be abstracted as a graph structure, which treats the joint as a node and the bone between two joints as an edge. Thus, GCN-based methods were introduced to this task. ST-GCN Yan et al. (2018) adopted GCN on the predefined spatial-temporal graphs to model the relations between joints. 2S-AGCN Shi et al. (2019) modeled the correlation between two joints given corresponding features with the self-attention mechanism. CTR-GCN Chen et al. (2021a) proposed a channel-wise topology refinement graph convolution to model fine-grained relation. InfoGCN Chi et al. (2022) combined a learning objective and an encoding method to break the information bottleneck.

With the popularity of transformers in computer vision, transformer-based methods have also been investigated for this task. ST-TR Plizzari et al. (2021) proposed a two-stream transformer architecture to model the spatial and temporal dimensions respectively. Hyperformer Zhou et al. (2022) applied transformer to model spatial-temporal features from the hyper-graph of the sequences.

2.2 Contrastive Learning

Recently, contrastive learning has been widely used in self-supervised representation learning. The core idea is to learn the discriminative features by pulling the positive pairs and pushing away the negative pairs in the embedded space. The positive pairs are features from transformed sample versions with different augmentations, and the negative pairs are from different samples. InstDisc Wu et al. (2018) maintained a memory bank for storing representations. SimCLR Chen et al. (2020) proposed to adopt a huge batch size to learn more useful representations. MoCo He et al. (2020) built a dynamic dictionary with a momentum update based on a memory bank to keep the stored representations consistent. Inspired by this, our work tries to introduce memory bank mechanisms for contrastive learning to obtain discriminative features.

Contrastive learning was usually adopted in prior works Xu et al. (2023) to learn the invariance from the skeleton sequences for unsupervised learning in this field. Moreover, several works are applying contrastive learning to supervised action recognition. GAP Xiang et al. (2023) proposed a framework for contrasting the features from visual and text modalities to enhance the representation by using knowledge about actions and human body parts. SkeletonGCL Huang et al. (2023) explicitly explored the rich cross-sequence relations by using graph contrastive learning. However, SkeletonGCL can only be incorporated into the GCN-based methods, which limits its generalization. FR-Head Zhou et al. (2023) employed multi-level contrastive learning in GCN-based methods to distinguish ambiguous actions.

However, the methods mentioned above are more like surgery for GCN-based skeleton encoders, leading to limitations on generalizability to other types of methods.

3 Methodology

3.1 Perliminary

Skeleton-based action recognition is to predict the categories from the input sequences. We define the human skeleton sequence with $J$ keypoints and $T_{0}$ frames in 3D space as $\boldsymbol{S}\in\mathbb{R}^{J\times T_{0}\times 3}$ . The sequence is fed into a skeleton encoder $\mathcal{F}$ to extract the features $\boldsymbol{X}\in\mathbb{R}^{J\times T\times C}$ . The skeleton encoder aggregates spatial and temporal information to obtain discriminative spatial-temporal features. After fully spatial-temporal aggregations, a global pooling (GP) layer is adapted to summarize the global features. Finally, a softmax-based fully-connected (FC) layer maps the global feature to a probability prediction $\hat{\boldsymbol{y}}\in\mathbb{R}^{K}$ of $K$ candidate categories. The process can be defined as:

\boldsymbol{X}=\mathcal{F}(\boldsymbol{S})

(1)

\boldsymbol{y}=FC(GP(\boldsymbol{X}))

(2)

In the process of optimization, a standard cross-entropy loss is applied to supervise the prediction distribution and the ground truth $\boldsymbol{y}$ as follows:

\mathcal{L}_{CE}=-\sum_{i}\boldsymbol{y}_{i}log\hat{\boldsymbol{y}_{i}}

(3)

Table 1: Top-1 action aclssification accuracy comparison (

\%

) with the state-of-the-art methods on NTU 60 and NTU120 datasets. Particularly, the 2S incidents the ensemble results of joint and bone streams. And the 4S incidents the ensemble results of joint, bone, joint motion and bone motion streams. * denotes the ensemble results of the streams reported in their papers.

Dataset	NTU 60								NTU 120
Setting	X-Sub				X-View				X-Sub				X-Set
Method/Streams	J	B	2S	4S	J	B	2S	4S	J	B	2S	4S	J	B	2S	4S
VA-LSTM Zhang et al. (2017)			79.4*				87.6*									-
AGC-LSTM Si et al. (2019)	-	-	89.2*	-	-	-	95.0*	-	-	-	-	-	-	-	-	-
ST-GCN Yan et al. (2018)	81.5	-	-	-	88.3	-	-	-	-	-	-	-	-	-	-	-
SGCN Zhang et al. (2020)	-	-	89.0	-	-	-	94.5	-	-	-	79.2	-	-	-	81.5	-
Shift-GCN Cheng et al. (2020b)	87.8	-	89.7	90.7	95.1	-	96.0	96.5	80.9	-	85.3	85.9	83.2	-	86.6	87.6
DC-GCN+ADG Cheng et al. (2020a)	-	-	90.8	-	-	-	96.6	-	-	-	86.5	-	-	-	88.1	-
Dynamic GCN Ye et al. (2020)	-	-	-	91.5	-	-	-	96.0	-	-	-	87.3	-	-	-	88.6
MS-G3D Liu et al. (2020)	89.4	90.1	91.5	-	95.0	95.3	96.2	-	-	-	86.9	-	-	-	88.4	-
DSTA Shi et al. (2020)	-	-	91.5	-	-	-	96.4	-	-	-	86.6	-	-	-	88.0	-
MST-GCN Chen et al. (2021b)	89.0	89.5	91.1	91.5	95.1	95.2	96.4	96.6	82.8	84.8	87.0	87.5	84.5	86.3	88.3	88.8
ST-TR Plizzari et al. (2021)	-	-	89.9	-	-	-	96.1	-	-	-	82.7	-	-	-	84.7	-
TA-CNN Xu et al. (2022)	88.9	89.2	91.0	91.5	94.5	94.1	95.7	95.9	84.0	85.1	87.8	88.2	85.3	86.3	89.0	89.6
EfficientGCN-B4 Song et al. (2022)	-	-	91.7	-	-	-	95.7	-	-	-	88.3	-	-	-	89.1	-
InfoGCN Chi et al. (2022)	89.4	90.6	91.3	92.3	95.2	95.4	96.2	96.7	84.2	86.9	88.2	89.2	86.3	88.5	89.4	90.7
HCN Li et al. (2018)	-	-	84.3*	-	-	-	89.9*	-	-	-	-	-	-	-	-	-
HCN w/ STD-CL	-	-	85.1*	-	-	-	91.2*	-	-	-	-	-	-	-	-	-
2S-AGCN Shi et al. (2019)	88.9	89.2	91.0	91.5	94.5	94.1	95.7	95.9	83.8	84.9	87.7	88.1	85.3	86.3	89.0	89.6
2S-AGCN w/STD-CL	89.4	89.8	91.5	91.9	94.7	94.6	96.1	96.3	84.1	85.1	87.9	88.2	86.0	86.6	89.4	89.9
CTR-GCN Chen et al. (2021a)	89.8	90.2	92.0	92.4	94.8	94.8	96.3	96.8	84.9	85.7	88.7	88.9	86.7	87.5	90.1	90.5
CTR-GCN w/STD-CL	90.4	90.8	92.2	92.7	95.2	95.0	96.4	96.9	85.6	86.7	89.3	89.5	86.8	88.2	90.4	90.9
Hyperformer Zhou et al. (2022)	90.3	91.1	92.0	92.7	94.5	94.4	95.5	96.2	86.1	87.4	88.9	89.9	87.8	89.0	90.6	91.2
Hyperformer w/STD-CL	90.8	91.3	92.3	92.9	95.0	94.7	95.9	96.4	86.8	88.1	89.2	90.1	88.3	89.2	90.7	91.4

3.2 Spatial-Temporal Decoupling Contrastive Learning

Human actions have the property of spatio-temporal coupling, the feature $\boldsymbol{X}$ after aggregation contains complex spatial and temporal relations, which may mix the spatial and temporal information. Thus, we decouple the feature into the spatial-aware and temporal-aware features to reduce the spatial-temporal information coupling. To further obtain more discriminative features, we apply the decoupled features to contrastive learning to explore the cross-sequence semantic relations. The core is to explore the latent data distributions by pulling together the features from the positive pairs and pushing away the negative pairs.

3.2.1 Spatial Temporal Feature Decoupling

The details of the proposed STFD are shown in (b) of Figure 3. The feature $\boldsymbol{X}$ is fed into two parallel branches for feature decoupling. In the spatial feature decoupling (SFD) module, a temporal pooling layer is adopted to obtain the spatial attentive feature $\boldsymbol{X}_{s}\in\mathbb{R}^{J\times C}$ . And the temporal attentive feature $\boldsymbol{X}_{t}\in\mathbb{R}^{T\times C}$ is obtained through a spatial pooling layer. Then, two linear transformation functions $\theta$ and $\phi$ with a channel reduction rate of $r$ are used to transform the attentive features into a neatly compact representation as follows:

\left\{\begin{matrix}\boldsymbol{X}_{s}^{*}=\theta(\boldsymbol{X}_{s})=\boldsymbol{X}_{s}\boldsymbol{W}_{\theta}\in\mathbb{R}^{J\times C/r}\\ \\ \boldsymbol{X}_{t}^{*}=\phi(\boldsymbol{X}_{t})=\boldsymbol{X}_{t}\boldsymbol{W}_{\phi}\in\mathbb{R}^{T\times C/r}\end{matrix}\right.

(4)

where, $\boldsymbol{W}_{\theta},\boldsymbol{W}_{\phi}\in R^{C\times C/r}$ are the weights of the spatial and temporal transformation function, respectively. After this, to retain the attentive information, we squeeze the spatial dimension into the channel dimension in the SFD module, and the TFD module is similar. In the end, two liner transformation functions $\zeta$ and $\sigma$ are used to transform the features to the latent space for further contrastive learning, which can be formulated as:

\left\{\begin{matrix}\boldsymbol{s}=\zeta(\boldsymbol{s}^{*})=\boldsymbol{s}^{*}\boldsymbol{W}_{\zeta}\in\mathbb{R}^{D}\\ \\ \boldsymbol{t}=\sigma(\boldsymbol{t}^{*})=\boldsymbol{t}^{*}\boldsymbol{W}_{\sigma}\in\mathbb{R}^{D}\end{matrix}\right.

(5)

where, $\boldsymbol{W}_{\zeta}\in\mathbb{R}^{JC/r\times D}$ , $\boldsymbol{W}_{\sigma}\in\mathbb{R}^{TC/r\times D}$ are the weights of final projection function. In this way, we decouple the spatial attentive feature $\boldsymbol{s}$ and the temporal attentive feature $\boldsymbol{t}$ from the spatial-temporal feature $\boldsymbol{X}$ .

3.2.2 Memory Bank

To obtain abundant negative pairs, we contrast two memory banks $\mathcal{M}_{spa}$ and $\mathcal{M}_{tem}$ to store the decoupled attentive features. Specifically, to avoid introducing hyperparameters and encourage the model to explore the cross-sequence context fully, we set the two memory banks to store the attentive features of all samples. Thus, the memory banks $\mathcal{M}_{spa},\mathcal{M}_{tem}\in\mathbb{R}^{L\times D}$ . $L$ is the number of the instance of the dataset. When an instance is fed into the model to extract features, the stored features in the memory banks are updated according to the corresponding index.

\left\{\begin{matrix}\mathcal{M}_{spa}[i]=\boldsymbol{s}_{i}\\ \\ \mathcal{M}_{tem}[i]=\boldsymbol{t}_{i}\end{matrix}\right.

(6)

where $i$ is the index of the instance in the dataset. Each element in $\mathcal{M}_{spa}$ and $\mathcal{M}_{tem}$ indicates a decoupled feature embedding from an instance. In the memory banks, the corresponding labels with the decoupled features are stored for the feature selection. Particularly, the decoupled features with the same category label as the current instance are selected as positive pairs, and negative pairs are sampled from the memory banks with other labels.

3.2.3 Training Objective

The positive pairs and negative pairs sampled from the two memory banks are used to achieve spatial-temporal decoupled contrastive learning. To pull together the positive pairs and push away the negative pairs, we define the distances between the two vectors with cosine similarity as follows:

\left\langle\boldsymbol{u},\boldsymbol{v}\right\rangle=\frac{\boldsymbol{u}\cdot\boldsymbol{v}^{T}}{||\boldsymbol{u}||\;||\boldsymbol{v}||}

(7)

where $\boldsymbol{u}$ , $\boldsymbol{v}$ are the features sampled from the memory banks. Then the InfoNCE loss adopted for spatial-specific feature contrasting and temporal-specific feature contrasting can be written as follows:

$\mathcal{L}_{NCE}^{spa}=-\sum\limits_{\boldsymbol{s}^{+}\in\mathcal{M}^{+}_{spa}}log\frac{\left\langle\boldsymbol{s},\boldsymbol{s}^{+}\right\rangle/\tau}{\left\langle\boldsymbol{s},\boldsymbol{s}^{+}\right\rangle/\tau+\sum\limits_{\boldsymbol{s}^{-}\in\mathcal{M}^{-}_{spa}}\left\langle\boldsymbol{s},\boldsymbol{s}^{-}\right\rangle/\tau}$

(8)

$\mathcal{L}_{NCE}^{tem}=-\sum\limits_{\boldsymbol{t}^{+}\in\mathcal{M}^{+}_{tem}}log\frac{\left\langle\boldsymbol{t},\boldsymbol{t}^{+}\right\rangle/\tau}{\left\langle\boldsymbol{t},\boldsymbol{t}^{+}\right\rangle/\tau+\sum\limits_{\boldsymbol{t}^{-}\in\mathcal{M}^{-}_{tem}}\left\langle\boldsymbol{t},\boldsymbol{t}^{-}\right\rangle/\tau}$

(9)

where $\tau$ is a temperature hyperparameter in contrastive learning. $\mathcal{M}^{+}_{spa}$ and $\mathcal{M}^{+}_{tem}$ are the sets of the decoupled features from the positive samples with the same label as the current feature in the memory banks. $\mathcal{M}^{-}_{spa}$ and $\mathcal{M}^{-}_{tem}$ are the sets of negative features with different lalbes.

Finally, the total training loss function of our proposed STD-CL is defined as follows:

\mathcal{L}=\mathcal{L}_{CE}+\mathcal{L}_{NCE}^{spa}+\mathcal{L}_{NCE}^{tem}

(10)

3.2.4 Sampling Strategy

To achieve efficient contrast across sequences, the strategy of sampling features from memory banks for contrastive learning has a large impact on the experimental performance. Therefore, in this paper, we follow the sampling strategy in Huang et al. (2023), which combines the hard mining strategy and random sampling strategy. The hard mining strategy is introduced to focus on the hard samples, which means the $N^{+}_{H}$ hardest positive samples with the lowest similarity and $N^{-}_{H}$ hardest negative samples with the highest similarity in equation 7. Furthermore, the random sampling strategy is adopted to maintain global random contrast. There are $N^{-}_{R}$ random negative samples selected to contrast with the input instance. In total, the sampling strategy adopted in our STD-CL takes contrast efficiency into account and contributes to the hard examples.

4 Experiment

Table 2: Top-1 action aclssification accuracy comparison (

\%

) with the state-of-the-art methods on the NW-UCLA dataset.

Dataset	NW-UCLA
Method/Streams	J	B	2S	4S
AGC-LSTM Si et al. (2019)	-	-	93.3*	-
DC-GCN+ADG Cheng et al. (2020a)	-	-	95.3	-
Shift-GCN Cheng et al. (2020b)	92.5	-	94.2	94.6
2S-AGCN Shi et al. (2019)	92.0	92.2	95.0	95.5
2S-AGCN w/STD-CL	93.8	93.1	95.9	97.0
CTR-GCN Chen et al. (2021a)	94.6	91.8	94.2	96.5
CTR-GCN w/STD-CL	94.8	94.6	95.7	97.2
InfoGCN Chi et al. (2022)	93.8	94.2	95.5	96.1
Info-GCN w/STD-CL	94.2	95.5	95.7	96.3
Hyperformer Zhou et al. (2022)	92.7	95.0	95.0	96.6
Hyperformer w/STD-CL	95.3	95.3	96.3	97.0

4.1 Dataset Settings

NTU RGB+D. NTU60 Shahroudy et al. (2016) is the most widely used dataset for skeleton-based action recognition. It contains 56,880 samples in 60 action classes. These samples are performed by 40 distinct subjects. It has been configured into two benchmarking structures: cross-subject (X-sub) and cross-view (X-view). In the X-sub, sequences from 20 individuals are allocated for training purposes, with the remaining 20 sets of sequences earmarked for validation. Meanwhile, under the X-view, skeleton sequences are divided based on camera views; two camera views are used for training while the remaining ones are utilized for evaluation.

NTU RGB+D 120. NTU120 dataset Liu et al. (2019) adds 57367 new skeleton sequences and 60 new action classes to the original NTU60 dataset. There are 32 various configurations in it, each of which depicts a different location and background. The authors offered the cross-subject (X-sub) and cross-setup (X-set) as two benchmark evaluations. In the X-sub, sequences from 53 subjects are for training, and the other 53 subjects are for testing. In the X-set, skeleton sequences are split by setup ID. Samples from even set-up IDs are used for training, and the odds are used for evaluation.

Northwestern-UCLA. NW-UCLA dataset Wang et al. (2014) contains 1494 video clips of 10 different actions captured from three Kinect cameras. We follow the same evaluation protocol in Wang et al. (2014): the first two cameras for training and the other for testing.

Table 3: Top-1 accuracy comparison (

\%

) with recent contrastive learning methods in supervised skeleton-based action recognition on the X-sub benchmark of NTU120 dataset.

Method	Acc (%)
Baseline Chen et al. (2021a)	84.9
w/GAP Xiang et al. (2023)	85.5
w/SkeletonGCL Huang et al. (2023)	85.6
w/FR Head Zhou et al. (2023)	85.5
w/STD-CL	85.6
STD-CL(Hyperformer Zhou et al. (2022))	86.8

4.2 Implementation Details

To fully validate the effectiveness and generalizability of STD-CL, we take four different model-based approaches as baseline models. (1) CNN-based method. We select HCN Li et al. (2018) as the CNN-based method to validate STD-CL. We reproduce the experimental results using the released code and conduct our STD-CL on it. And we follow the training recipes as the paper describes. (2) GCN-based method. 2S-AGCN Shi et al. (2019) and CTR-GCN Chen et al. (2021a), the widely used GCN-based model for skeleton-based action recognition. For CTR-GCN, we follow their training strategies. Particularly, for 2S-AGCN, we update the training strategies from CTR-GCN, which can improve the performance significantly. (3) Transformer-based method. For transformer-based methods, we choose Hyperformer as the baseline. More implementation details are presented in the supplementary material.

4.3 Comparsion With The State-of-The-Art

We compare our method with previous SOTA methods in Tables 1 and 2. To fully validate the proposed STD-CL, we report the experimental results in a single modality and the ensemble results of multi-modality. $J$ means the joint stream, $B$ denotes the bone stream, $2S$ denotes the ensemble results of the above two streams in default, and $4S$ incidents the ensemble results of additional joint motion and bone motion streams. Particularly, due to the design of their model architecture, the experimental results of VA-LSTM Zhang et al. (2017), ACG-LSTM Si et al. (2019) and HCN Li et al. (2018) are the ensemble results of two streams. Therefore, we classify the results into $2S$ .

As shown in Table 1, on NTU60 and NTU120, combined with STD-CL, all four various baselines achieve solid gains on these four benchmarks of the two datasets over different settings. Moreover, the improvement in a single stream can be generalized to the results of the multi-stream ensemble. Take CTR-AGCN as an example, it improves by 0.6% (89.8% to 90.4%) in the joint stream, 0.6% (90.2% to 90.8%) in the bone stream, and 0.2% (92.0% to 92.2%) in their fusion result.

As shown in Table 2, on NW-UCLA, STD-CL outperforms 2S-AGCN by 1.5%. It also outperforms CTR-CCN by 0.7%. And it outperforms the recent work InfoGCN and Hyperformer by 0.2% and 0.4%, respectively. Considering that the performance on this dataset is already very high, such improvement brought by STD-CL is significant.

4.4 Comparsion With Other Contrastive Learning Methods

We compare our method with the recent contrastive learning methods in supervised skeleton-based action recognition. The comparison is conducted on the X-sub benchmark of the NTU120 dataset. As shown in Table 3, our STD-CL achieves competitive improvements. Furthermore, most previous methods are only compatible with GCN-based methods. To take advantage of our method, we combine the STD-CL with Hyperformer Zhou et al. (2022), which is a transformer-based method. Our method outperforms previous contrastive learning methods by 1.2%, which is a promising result. For more comparisons, please refer to the supplementary materials.

4.5 Ablation Study

In this section, we evaluate the different experimental settings on the X-sub benchmark of the NTU120 dataset to verify the design of the proposed STD-CL. For more ablation studies, please refer to the supplementary materials.

Contrastive Strategies. In Table 3, we conduct experiments to validate the contrastive strategy. We employ the decoupled spatial features or decoupled temporal features to contrastive learning, and these lead to performance improvements, which are owed to the decrease in the degree of mixture in the global features. Furthermore, we find that global feature contrast can improve the performance on the baseline with 0.5% (84.9 % to 85.6 %), which owes to the explicit modeling of latent data distributions. Furthermore, we combine the global feature contrast and our STD-CL, and the performance drops slightly (85.6% to 85.5%), which may be due to that the global features contrast limits the feature decoupling to some extent.

Table 4: Ablation study about the contrastive strategies. SD-CL means applying spatial decoupled features to contrastive learning, and TD-CL means applying temporal decoupled features to contrastive learning. G-CL means applying the global features to contrastive learning.

Method	Acc(%)
Baseline (w/o contrast)	84.9
w/SD-CL	$\text{85.4}^{\uparrow{0.5}}$
w/TD-CL	$\text{85.3}^{\uparrow{0.4}}$
w/G-CL	$\text{85.4}^{\uparrow{0.5}}$
w/STD-CL + w/ G-CL	$\text{85.5}^{\uparrow{0.6}}$
w/STD-CL	$\text{85.6}^{\uparrow{0.7}}$

Performance on Different Level Actions. To evaluate the effectiveness of the proposed STD-CL, we conduct recognition accuracy for different action categories with different levels of difficulty. Specifically, we gather actions whose accuracy is over 90% as Easy Level, between 80% to 90% as Medium Level, and lower than 80% as Hard Level for their respective classification results. The experimental results are displayed in Table 5. From the results, we can observe that the proposed STD-CL achieves relatively significant improvements in the hard and medium-level actions, which include more fine-grained action categories. Moreover, the proposed STD-CL has made some gains in the easy-level actions, thanks to the explicit modeling for the latent data distributions of contrastive learning.

Table 5: Accuracy on different level actions.

Setting	Acc(%)
Method/Level	Hard	Medium	Easy
2S-AGCN	65.0	86.2	95.1
2S-AGCN w/STD-CL	$\text{65.9}^{\uparrow{0.9}}$	$\text{86.6}^{\uparrow{0.4}}$	$\text{95.2}^{\uparrow{0.1}}$
CTR-AGCN	65.9	85.8	95.4
CTR-AGCN w/STD-CL	$\text{66.1}^{\uparrow{0.2}}$	$\text{87.0}^{\uparrow{1.2}}$	$\text{95.9}^{\uparrow{0.5}}$
Hyperformer	68.3	85.9	95.2
Hyperformer w/STD-CL	$\text{69.6}^{\uparrow{1.3}}$	$\text{87.0}^{\uparrow{1.1}}$	$\text{95.4}^{\uparrow{0.2}}$

Furthermore, we compare our results with a SOTA model CTR-GCN Chen et al. (2021a), and the results of class with accuracy differences higher than 3% between CTR-GCN and our method are displayed in Figure 4. We can obverse that actions such as “hit with object”, “point to something”, “wipe face”, “reading” and “yawn” benefit from the proposed STD-CL due to the great capacity to obtain discriminative and semantically distinct features. Besides, the reason for the poor performance of actions “take object out of bag”, “put on a shoe”, and “tear up paper” is that such actions are strongly object-related, making it challenging to recognize and contrast.

Training Consumption. In Table 6, we report the training consumption on NTU120. With our method, the training memory usage only slightly increases with different baseline models, and the increase in training time remains within an acceptable range, which proves the efficiency of the design.

Table 6: The comparison of training consumption.

Method	Memory-Usage (G)	Time (min/epoch)
2S-AGCN	7.5	5.1
2S-AGCN w/STD-CL	7.8	8.8
CTR-GCN	11.5	19.9
CTR-AGCN w/STD-CL	11.7	23.5
Hyperformer	14.5	8.9
Hyperformer w/STD-CL	14.9	12.3

4.6 Qualitative Analysis

In this section, we validate our STD-CL through t-SNE Van der Maaten and Hinton (2008) distribution visualization of feature representations in the test set of the X-sub benchmark of the NTU120 dataset. As shown in Figure 5, our STD-CL can exploit the data distributions explicitly with more compact intra-class distances and clear classification boundaries. Thus, the features extracted from the proposed STD-CL are more discriminative, which shows that the decoupled feature contrast improves the feature extraction capacity of the skeleton encoders.

5 Conclusion

In this paper, we propose a novel spatial-temporal decoupling contrastive learning (STD-CL) framework for skeleton-based action recognition, which can be combined with most of the previous skeleton encoders. We employ contrastive learning for the spatial and temporal decoupled features to encourage the skeleton encoders to extract more discriminative and semantically distinct features, which are proven to be effective for recognizing fine-grained action classes and ambiguous samples. The extensive experimental results with four various skeleton encoders on three datasets demonstrate the effectiveness of the proposed framework.

References

Cao et al. [2017] Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7291–7299, 2017.
Chen et al. [2020] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In Proceedings of the International Conference on Machine Learning, pages 1597–1607. PMLR, 2020.
Chen et al. [2021a] Yuxin Chen, Ziqi Zhang, Chunfeng Yuan, Bing Li, Ying Deng, and Weiming Hu. Channel-wise topology refinement graph convolution for skeleton-based action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13359–13368, 2021.
Chen et al. [2021b] Zhan Chen, Sicheng Li, Bing Yang, Qinghan Li, and Hong Liu. Multi-scale spatial temporal graph convolutional network for skeleton-based action recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 1113–1122, 2021.
Cheng et al. [2020a] Ke Cheng, Yifan Zhang, Congqi Cao, Lei Shi, Jian Cheng, and Hanqing Lu. Decoupling gcn with dropgraph module for skeleton-based action recognition. In Proceedings of the European Conference on Computer Vision, pages 536–553. Springer, 2020.
Cheng et al. [2020b] Ke Cheng, Yifan Zhang, Xiangyu He, Weihan Chen, Jian Cheng, and Hanqing Lu. Skeleton-based action recognition with shift graph convolutional network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 183–192, 2020.
Chi et al. [2022] Hyung-gun Chi, Myoung Hoon Ha, Seunggeun Chi, Sang Wan Lee, Qixing Huang, and Karthik Ramani. Infogcn: Representation learning for human skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20186–20196, 2022.
Dang et al. [2019] Qi Dang, Jianqin Yin, Bin Wang, and Wenqing Zheng. Deep learning based 2d human pose estimation: A survey. Tsinghua Science and Technology, 24(6):663–676, 2019.
Du et al. [2015] Yong Du, Wei Wang, and Liang Wang. Hierarchical recurrent neural network for skeleton based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1110–1118, 2015.
He et al. [2020] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9729–9738, 2020.
Huang et al. [2023] Xiaohu Huang, Hao Zhou, Bin Feng, Xinggang Wang, Wenyu Liu, Jian Wang, Haocheng Feng, Junyu Han, Errui Ding, and Jingdong Wang. Graph contrastive learning for skeleton-based action recognition. 2023.
Li et al. [2018] Chao Li, Qiaoyong Zhong, Di Xie, and Shiliang Pu. Co-occurrence feature learning from skeleton data for action recognition and detection with hierarchical aggregation. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 786–792, 2018.
Liu et al. [2019] Jun Liu, Amir Shahroudy, Mauricio Perez, Gang Wang, Ling-Yu Duan, and Alex C Kot. Ntu rgb+ d 120: A large-scale benchmark for 3d human activity understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(10):2684–2701, 2019.
Liu et al. [2020] Ziyu Liu, Hongwen Zhang, Zhenghao Chen, Zhiyong Wang, and Wanli Ouyang. Disentangling and unifying graph convolutions for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 143–152, 2020.
Plizzari et al. [2021] Chiara Plizzari, Marco Cannici, and Matteo Matteucci. Spatial temporal transformer network for skeleton-based action recognition. In Pattern Recognition. ICPR International Workshops and Challenges: Virtual Event, January 10–15, 2021, pages 694–701. Springer, 2021.
Ren et al. [2020] Bin Ren, Mengyuan Liu, Runwei Ding, and Hong Liu. A survey on 3d skeleton-based action recognition using learning method. arXiv preprint arXiv:2002.05907, 2020.
Shahroudy et al. [2016] Amir Shahroudy, Jun Liu, Tian-Tsong Ng, and Gang Wang. Ntu rgb+ d: A large scale dataset for 3d human activity analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1010–1019, 2016.
Shi et al. [2019] Lei Shi, Yifan Zhang, Jian Cheng, and Hanqing Lu. Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12026–12035, 2019.
Shi et al. [2020] Lei Shi, Yifan Zhang, Jian Cheng, and Hanqing Lu. Decoupled spatial-temporal attention network for skeleton-based action-gesture recognition. In Proceedings of the Asian Conference on Computer Vision, 2020.
Si et al. [2019] Chenyang Si, Wentao Chen, Wei Wang, Liang Wang, and Tieniu Tan. An attention enhanced graph convolutional lstm network for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1227–1236, 2019.
Song et al. [2017] Sijie Song, Cuiling Lan, Junliang Xing, Wenjun Zeng, and Jiaying Liu. An end-to-end spatio-temporal attention model for human action recognition from skeleton data. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31, 2017.
Song et al. [2022] Yi-Fan Song, Zhang Zhang, Caifeng Shan, and Liang Wang. Constructing stronger and faster baselines for skeleton-based action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(2):1474–1488, 2022.
Van der Maaten and Hinton [2008] Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(11), 2008.
Wang et al. [2014] Jiang Wang, Xiaohan Nie, Yin Xia, Ying Wu, and Song-Chun Zhu. Cross-view action modeling, learning and recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2649–2656, 2014.
Wang et al. [2023] Wenguan Wang, Cheng Han, Tianfei Zhou, and Dongfang Liu. Visual recognition with deep nearest centroids. In Proceedings of International Conference on Learning Representations, 2023.
Wu et al. [2018] Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3733–3742, 2018.
Xiang et al. [2023] Wangmeng Xiang, Chao Li, Yuxuan Zhou, Biao Wang, and Lei Zhang. Generative action description prompts for skeleton-based action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10276–10285, 2023.
Xu et al. [2022] Kailin Xu, Fanfan Ye, Qiaoyong Zhong, and Di Xie. Topology-aware convolutional neural network for efficient skeleton-based action recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 2866–2874, 2022.
Xu et al. [2023] Binqian Xu, Xiangbo Shu, Jiachao Zhang, Guangzhao Dai, and Yan Song. Spatiotemporal decouple-and-squeeze contrastive learning for semisupervised skeleton-based action recognition. IEEE Transactions on Neural Networks and Learning Systems, 2023.
Yan et al. [2018] Sijie Yan, Yuanjun Xiong, and Dahua Lin. Spatial temporal graph convolutional networks for skeleton-based action recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
Ye et al. [2020] Fanfan Ye, Shiliang Pu, Qiaoyong Zhong, Chao Li, Di Xie, and Huiming Tang. Dynamic gcn: Context-enriched topology learning for skeleton-based action recognition. In Proceedings of the ACM International Conference on Multimedia, pages 55–63, 2020.
Zhang et al. [2017] Pengfei Zhang, Cuiling Lan, Junliang Xing, Wenjun Zeng, Jianru Xue, and Nanning Zheng. View adaptive recurrent neural networks for high performance human action recognition from skeleton data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2117–2126, 2017.
Zhang et al. [2019] Hong-Bo Zhang, Yi-Xiang Zhang, Bineng Zhong, Qing Lei, Lijie Yang, Ji-Xiang Du, and Duan-Sheng Chen. A comprehensive survey of vision-based human action recognition methods. Sensors, 19(5):1005, 2019.
Zhang et al. [2020] Pengfei Zhang, Cuiling Lan, Wenjun Zeng, Junliang Xing, Jianru Xue, and Nanning Zheng. Semantics-guided neural networks for efficient skeleton-based human action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1112–1121, 2020.
Zhou et al. [2022] Yuxuan Zhou, Zhi-Qi Cheng, Chao Li, Yanwen Fang, Yifeng Geng, Xuansong Xie, and Margret Keuper. Hypergraph transformer for skeleton-based action recognition. arXiv preprint arXiv:2211.09590, 2022.
Zhou et al. [2023] Huanyu Zhou, Qingjie Liu, and Yunhong Wang. Learning discriminative representations for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10608–10617, 2023.