\UseRawInputEncoding

A Two-stream Hybrid CNN-Transformer Network for Skeleton-based Human Interaction Recognition

Ruoqi Yin, Jianqin Yin *Corresponding author.

Abstract

Human Interaction Recognition (HIR) is the process of identifying and understanding interactive actions and activities between multiple participants in a specific environment or situation. The aim of this task is to recognise the action interactions between multiple people or entities and their meaning and purpose. Many single Convolutional Neural Network (CNN) has issues, such as the inability to capture global instance interaction features or difficulty in training, leading to ambiguity in action semantics. In addition, the computational complexity of the Transformer cannot be ignored, and its ability to capture local information and motion features in the image is poor. In this work, we propose a Two-stream Hybrid CNN-Transformer Network (THCT-Net), which exploits the local specificity of CNN and models global dependencies through the Transformer. CNN and Transformer simultaneously model the entity, time and space relationships between interactive entities respectively. Specifically, Transformer-based stream integrates 3D convolutions with multi-head self-attention to learn inter-token correlations; We propose a new multi-branch CNN framework for CNN-based streams that automatically learns joint spatio-temporal features from skeleton sequences. The convolutional layer independently learns the local features of each joint neighborhood and aggregates the features of all joints. And the raw skeleton coordinates as well as their temporal difference are integrated with a dual-branch paradigm to fuse the motion features of the skeleton. Besides, a residual structure is added to speed up training convergence. Finally, the recognition results of the two branches are fused using parallel splicing. Multi-grained information modelling is employed to enhance the accuracy and robustness of the action recognition system. Experimental results on diverse and challenging datasets, such as NTU-RGBD, H2O, and Assembly101, demonstrate that the proposed method can better comprehend and infer the meaning and context of various actions, outperforming state-of-the-art methods.

Index Terms:

human interaction recognition, CNN, Transformer, multi-grained context.

I Introduction

Human Interaction Recognition (HIR) has become a significant challenge and research focus in the field of computer vision for identifying and comprehending video content of human actions [1, 2, 3, 4]. The rapid development of fields such as social media, intelligent surveillance, and virtual reality has increased the demand for real-time recognition and analysis of human behaviour in videos. The aim of the interactive action recognition task is to extract and recognise human actions from video sequences. These actions may include various activities in daily life, social interactions, or professional actions in specific fields, such as sports or industrial operations [5, 6, 7, 8].

(a) Individual Actions

(b) Group Activities

Figure 1: Examples of individual actions (a), group activities (b) and interactive actions (c). (a) Pose of a single person raising his hand could depict the action Stretching. (b) Group activity Answering by raising hands is annotated regardless of the people. (c) In a scene of waving hello, each entity is an integral part of the interactive action.

The aim of the human interaction recognition task is to identify and comprehend human body movements, gestures, or behaviours, thereby inferring the interaction process between people and objects. Unlike individual actions that relate to the actions of a single subject, and unlike group activities that abstract overall activity events from different individual actions, each goal in an interactive action is essential for explaining the complete semantics. For interactive actions, the term ’individual’ refers to the identification of a single action, a sequence of actions, or the behaviour of a specific person. In contrast, the term ’whole’ refers to a broader context or scene that contains relationships between multiple individuals or actions. The relationship between these two terms is important because they complement each other and help to more fully understand and explain the meaning and intention of the action. When identifying individual actions, it is important to consider the overall context in order to accurately infer and interpret their meaning. For instance, as shown in Fig. 1, while a single person raising their hand in a video may seem like a simple action, understanding that it occurs during a greeting scene provides a much richer understanding of its significance. The relationships and interactions among multiple individuals are integral to the overall situation. Identifying these interactive actions aids in comprehending the purpose and significance of individual actions within the broader scenario. For instance, in a social setting, a sequence of actions may comprise a conversational exchange, and recognising the overall interaction can reveal meaning and emotion beyond words.

In this task, videos are considered as spatio-temporal sequences that contain rich information. Each frame represents a moment, and the sequence represents the evolution of these moments on the timeline. Therefore, accurately understanding and identifying actions in videos requires not only modeling spatial information, such as posture and object location, but also capturing and understanding temporal information, which is the evolution of actions.

The challenges of this complex task are twofold. Firstly, video sequences are typically high-dimensional and contain a vast amount of information. Secondly, there are various types of interactive actions, including people-to-people, hands-to-hands, and hands-to-objects. Different interacting entities have distinct physical structures and interaction modes, resulting in complexity and variability in interaction modelling. To tackle these challenges, researchers have focused on developing different computational models and techniques to achieve human interaction recognition [9, 10, 11, 12]. Among these methods, Convolutional Neural Networks (CNNs) are commonly used to extract spatial information from video frames and capture static features frame by frame. However, modelling the temporal dependence of long sequences has always been a challenge. The emergence of deep learning technology has led to the development of Transformer, a sequence modeling tool that utilizes self-attention mechanism to achieve better modeling of long-term dependencies. As a result, Transformer has been introduced into human interaction recognition tasks, demonstrating great potential in capturing long sequence temporal relationships and modeling action sequences.

The Vision-Transformer (ViT) is a vision model that is entirely based on the Transformer structure. In comparison to traditional CNN vision models, ViT has shortcomings in both model structure and feature representation. Specifically, ViT divides the image into several fixed-size patches and subsequently performs feature extraction and classification on them. However, this will cause the original ViT model to be sensitive to the size of the input image, which can limit its ability to utilize global image information and ultimately affect its performance. It is important to maintain a balanced approach to classification performance. Secondly, the original ViT does not include multi-layer convolution and pooling operations in CNN. This may limit its ability to extract certain image features, resulting in a weaker ability to extract detailed information such as texture and shape. CNN performs well in image processing and can handle complex image features, particularly local features. However, its performance is weaker when processing global information. In contrast, Transformer excels in the NLP field, particularly in modeling and generating sequence data, and has an advantage in processing global information. By combining CNN with Transformer, this model can effectively capture and process both local and global information in images, resulting in improved performance.

Overall, there have been some excellent works on interactive action recognition methods based on CNN or Transformer. However, there is still room for improvement. To address these issues and combine the advantages of CNN and Transformer networks, we propose a Two-stream Hybrid CNN-Transformer Network (THCT-Net), which exploits the local specificity of CNN and models global dependencies through the Transformer. CNN and Transformer simultaneously model the entity, time and space relationships between interactive entities respectively. Specifically, Transformer-based stream integrates 3D convolutions with multi-head self-attention to learn inter-token correlations; We propose a new multi-branch CNN framework for CNN-based streams that automatically learns joint spatio-temporal features from skeleton sequences. The convolutional layer independently learns the local features of each joint neighborhood and aggregates the features of all joints. And the raw skeleton coordinates as well as their temporal difference are integrated with a dual-branch paradigm to fuse the motion features of the skeleton. Besides, a residual structure is added to speed up training convergence. Finally, the recognition results of the two branches are fused using parallel splicing. Experiments on three popular datasets verify that this model has the best fusion effect.

The main contributions of this paper are as follows：

1) We propose a new Two-stream Hybrid CNN-Transformer Network (THCT-Net) for human interaction recognition tasks, which uses Transformer self-attention module and traditional convolutional layers to learn multi-granularity context.

2) We propose a new multi-branch CNN framework that automatically learns joint spatio-temporal features from skeleton sequences. The convolutional layer independently learns the local features of each joint neighborhood and aggregates the features of all joints. And the raw skeleton coordinates as well as their temporal difference are integrated with a dual-branch paradigm to fuse the motion features of the skeleton. Besides, a residual structure is added to speed up training convergence.

3) Extensive experiments on NTU RGB+D 120, H2O and Assembly101 datasets consistently verify the effectiveness of our method, which outperforms most interactive action recognition methods.

II Related Work

Human Interaction Recognition. For tasks involving human interaction recognition, TA-GCN [9] uses topology-aware graph convolutional networks to learn the interdependencies and connections between different graph entities. It also computes the topology of multi-graph structures to learn the interdependence between the two hands and objects. LSTM-IRN [10] exploits minimal prior knowledge about human body structure, uses different body parts in posture information as independent objects, and performs pairwise modeling of their relationships. Raptis et al. [11] propose to cast the learning in a max-margin discriminative framework where treat keyframes as latent variables. This allows model to jointly learn a set of the most discriminative keyframes while also learning the local temporal context between them.

IGFormer [12] is the first network to adopt a Transformer-based architecture and utilise prior knowledge of human body structure to design interactions. It builds interaction graphs based on semantic and distance correlations between interacting body parts and enhances each person’s representation by aggregating information of interacting body parts based on the learning graph. ISTA-Net [13] does not require subject-type-specific graph prior knowledge to model diverse interacting entities. By extending an additional entity dimension in attention tokens, it can simultaneously and also effectively capture interactive and spatiotemporal correlations of interactive actions.

In summary, there have been some excellent works for human interaction recognition tasks, each demonstrating their respective advantages. For instance, CNN extracts features through shared convolution kernels, which reduces the number of network parameters, improves model efficiency, and provides translation invariance. However, it has a limited receptive field. Subsequently, the Long Short-term Memory Network (LSTM) [14] gained popularity as a model for individual dynamics in single-person action recognition due to its capacity to capture temporal motion information within a specific range. However, existing Recurrent Neural Networks (RNNs) only concentrate on capturing the dynamics of human interactions by merely combining all individual dynamics or modelling them as a whole, disregarding the interconnected dynamics of how human interactions evolve over time. Vision Transformer (ViT) [15] uses a pure Transformer structure to replace CNN, enabling it to capture global information of an image and surpassing the CNN structure in many visual tasks. This paper aims to explore a new human interaction recognition model that effectively combines the advantages of previous work and further improves recognition performance.

Hybrid CNN-Transformer. In recent years, research on hybrid CNN-Transformer models in computer vision has become a hot topic. This model combines the advantages of both CNN and Transformer to improve performance in various computer vision tasks. The success of CNN is due to its inherent inductive biases, namely translation invariance and local correlation. However, the limited receptive field of CNN makes it difficult to capture global information. In contrast, Transformer can capture long-distance dependencies. Therefore, after the emergence of ViT, many works have attempted to combine CNN and Transformer. This allows the network structure to inherit the advantages of both CNN and Transformer, retaining global and local features to the greatest extent possible.

Theoretically, Transformers can achieve better model performance than CNNs. However, calculating global attention results in significant computational losses, particularly in shallow networks. The computational complexity increases with the size of the feature map. Therefore, some methods propose inserting the Transformer into the CNN backbone network or using a Transformer module to replace a specific convolution module. BoTNet [16] utilises Multi-Head Self-Attention to replace the $3\times 3$ convolution in ResNet Bottleneck, resulting in a new network structure called Bottleneck Transformer. This approach combines the local features of CNN with the overall image focusing features of Transformer, while also significantly reducing computational requirements.

CNN exhibits locality and translation invariance. Locality pertains to adjacent points in the feature map, while translation invariance involves using the same matching rules for different regions. While the inductive bias of CNN enhances its performance on small data sets, it can limit its performance on larger ones. Consequently, some researchers have attempted to incorporate the inductive bias of CNN into Transformers to expedite network convergence. To decrease ViT’s reliance on vast amounts of data, Touvron et al. [17] proposed the Data-efficient Image Transformer (DeIT). This approach enhances the network’s performance on small data sets by utilizing data augmentation and regularization techniques. Additionally, a distillation strategy is introduced, which employs a teacher network to guide the student network. Dai et al. [18] proposed CoAtNet, a Convolution and Attention Network that incorporates depth convolution into the attention module. In depth convolution, each convolution kernel is responsible for one channel, resulting in lower parameters and operation costs compared to normal convolution. In depth convolution, each convolution kernel is responsible for one channel, resulting in lower parameters and operation costs compared to normal convolution. CoAtNet employs shallow networks with stacked convolutional layers. However, we discovered that hybrid models for human interaction recognition are extremely rare. Therefore, in this work, we introduce a two-stream hybrid CNN-Transformer network to enhance the generalisation ability and convergence speed of the model through parallel splicing.

III METHODOLOGY

As shown in Fig. 2, THCT-Net consists of two parallel streams processing information differently: 1) CNN stream, which learns joint spatiotemporal features from skeleton sequences. The convolutional layer independently learns the local features of each joint neighborhood and aggregates the features of all joints. And the raw skeleton coordinates as well as their temporal difference are integrated with a dual-branch paradigm to fuse the motion features of the skeleton. Besides, a residual structure is added to speed up training convergence. 2) Transformer branch, where it integrates 3D convolutions with multi-head self-attention to learn inter-token correlations. The benefit of the proposed branch-in-parallel approach: by leveraging the merits of CNNs and Transformers, we argue that THCT-Net can capture global information while preserving sensitivity on low-level context.

Refer to caption — Figure 2: The overall architecture of the proposed THCT-Net for skeleton-based human interaction recognition.

III-A Transformer Stream

The design of Transformer stream follows ISTA-Net [13]. The input skeleton sequence ${X}_{input}\in∈{\mathbb{R}}^{3\times T\times V\times M}$ is defined based on the estimated 3D skeleton of $M$ interactive entities interacting within time $T$ , with each entity containing $V$ joints.

The first step is to rearrange the input entities. When dealing with interactive entities, some are semantically unordered and interchangeable, such as people. Therefore, they can be arranged in any order while still representing the same interaction. The input skeleton sequence of size $C\times T\times V\times M$ is divided into $M$ parts along the interaction dimension, with each part representing the joint motion of a body. This is achieved through the following equation:

\displaystyle=Split({X}_{input}),

(1)

where [1, 2, …, $i$ , …, $M$ ] represents the index of position order along the interaction dimension.

We could rearrange the original ${X}_{input}$ as follows:

\displaystyle{\widetilde{X}}_{input}=Concat([{X}_{{a}_{1}},{X}_{{a}_{2}},...,{X}_{{a}_{i}},...,{X}_{{a}_{M}}]),

(2)

where $[{a}_{1},{a}_{2},...,{a}_{i},...,{a}_{M}]$ is an arbitrary arrangement of indexes [1, 2, …, $i$ , …, $M$ ].

The input permutation ${\widetilde{X}}_{input}$ is selected in each training epoch, while the original input ${X}_{input}$ is used in validation and testing.

Subsequently, the skeleton tensors are tokenized via a 3D sliding window to obtain interactive spatiotemporal tokens. Non-overlapping 3D windows are used to partition the input data. A window $W$ of size ${T}_{w}\times{J}_{w}\times{E}_{w}$ slides along the temporal, spatial, and interaction dimensions. The input of size $C\times T\times V\times M$ is divided into $U=[T/{T}_{w}]\times[V/{V}_{w}]\times[M/{M}_{w}]$ patches of size $C\times{T}_{w}\times{V}_{w}\times{M}_{w}$ in total. They represent interactive spatiotemporal local features for interactive skeleton sequences. The 3D convolution operation, followed by the batch normalization and an activation function, serves as the embedding layer for interactive spatiotemporal tokens.

Then these tokens are fed to $L$ Multi-head Self-attention Blocks to learn high-level cross frame, joint and subject representations. Similar to standard multi-head self-attention, the input ${X}_{L_{i-1}}$ undergoes transformation into multiple sets of queries $Q$ , keys $K$ , and values $V$ as follows:

\displaystyle Q=Conv3D_{(1\times 1\times 1)}({X}_{L_{i-1}}+PE({X}_{L_{i-1}})),

(3)

\displaystyle K=Conv3D_{(1\times 1\times 1)}({X}_{L_{i-1}}+PE({X}_{L_{i-1}})),

(4)

\displaystyle V={X}_{L_{i-1}},

(5)

where positional encoding implemented with circular functions is $PE(\cdot)$ . The number of sets, namely heads, is denoted as $H$ .

Self-attention scores ${X}_{{L}_{i}}^{h}$ of the $h$ -th head could be calculated as:

\displaystyle{X}_{{L}_{i}}^{h}=(\alpha tanh(\frac{QK^{T}}{\sqrt{C_{\beta}}})+A)V,

(6)

where ${QK^{T}}$ is divided by the square root of the feature length ${C_{\beta}=T_{w}\times V_{w}\times M_{w}\times C_{{L}_{i}-qkv}}$ . A trainable regularized matrix ${A}\in∈{\mathbb{R}}^{U\times U}$ is added to the normalized attention map with a trainable balanced factor $\alpha$ , which can benefit correlation learning [22, 23]. All scores ${X}_{{L}_{i}}^{h}$ of $H$ heads are concatenated to get ${X}_{{L}_{i}}^{H}$ .

A 3D $1\times 1\times 1$ convolution with residual connections implements the feed forward network (FFN). The last component is the temporal aggregation layer, it uses 3D convolution with kernel size 5 in the temporal dimension to aggregate sequence features. Prediction is finally made through Global Average Pooling (GAP) following with a fully connected (FC) layer.

III-B CNN Stream

Convolutional Neural Networks (CNNs) have been highly successful in the field of deep learning and have played a crucial role in tasks such as image recognition and computer vision. Unlike sequential structures such as RNNs, CNNs can encode spatial and temporal context information simultaneously.

This section provides a detailed description of the proposed CNN framework, which aims to learn both the spatial global features and temporal evolution of skeleton sequences. Fig. 2 displays the network architecture of the proposed framework. The skeleton sequence $X$ can be represented by a $C\times T\times V\times M$ tensor, where $C$ represents the coordinate dimension of the joints (e.g. 3 for a 3D skeleton: $x,y,z$ ), $T$ represents the number of frames in the sequence, $V$ represents the number of joints in the skeleton, and $M$ represents the number of people. For interactive actions, activities such as hugging and shaking hands require the participation of multiple people. To ensure scalability in multi-person scenarios, we utilize early fusion to aggregate the joint points of all individuals. This involves stacking all joints from multiple individuals as the input of the network, resulting in a tensor of size $(C,T,V\times M)$ to represent the input skeleton sequence.

Firstly, we encode the data using convolutional layers with kernel sizes of $1\times 1$ and $3\times 1$ . By keeping the kernel size along the joint dimension to 1, the model is forced to learn point-level representations independently from the 3D coordinates of each joint. It is observed that the output of the convolutional layer represents the global response of all input channels. If a 3D tensor $F$ is represented as ${d}_{1}\times{d}_{2}\times{d}_{3}$ , with dimension ${d}_{i}$ specified as a channel and the other two dimensions encoding local context, any information from dimension ${d}_{i}$ can be globally aggregated. This allows for the assignment of different contexts by transposing the tensor. Previous CNN-based methods have specified joint coordinates as channels to learn local features of each joint neighbourhood [19, 20, 21] , which may result in the inability to capture some long-range joint interaction information. Therefore, the feature map is transposed using the parameters (2,1,0) to move the joint dimensions to the channels of the tensor, i.e. $(V\times M,T,C)$ . If we treat each joint of the skeleton as a channel, the convolutional layer can learn the global features of all joints more easily.

In addition to learning the spatial global features of skeleton sequences, it is important to consider the inter-frame representation of the skeleton’s temporal evolution as a clue for identifying potential actions. Motion information is introduced by calculating the difference between frames, which enhances action information in time series data. This approach helps to better capture and represent the characteristics of actions. An additional branch is introduced to learn skeleton motion information from an $C\times T\times V\times M$ tensor. For the skeleton of a person in frame $t$ , we formulate it as ${S}^{t}=\{{J}_{1}^{t},{J}_{2}^{t},...,{J}_{V}^{t}\}$ where $V$ is the number of joint and $J=(x,y,z)$ id a 3D joint coordinate. The skeleton motion is defined as the temporal difference of each joint between two consecutive frames:

	$\displaystyle{M}^{t}$	$\displaystyle={S}^{t+1}-{S}^{t}$		(7)
		$\displaystyle=\{{J}_{1}^{t+1}-{J}_{1}^{t},{J}_{2}^{t+1}-{J}_{2}^{t},...,{J}_{V}^{t+1}-{J}_{V}^{t}\}.$		(7)

The network processes the raw skeleton coordinates $S$ and the skeleton motion $M$ independently using a dual-branch paradigm. Both branches share the same architecture. However, their parameters are learned separately. After passing through the $(64,3\times 3)$ convolutional layer, the feature maps of the two branches are fused by concatenating across channel dimensions. The fused features learn a richer feature representation through a residual module. By combining $1\times 7$ and $7\times 1$ size convolutions, convolution operations can be performed on the input tensor in different directions, which is equivalent to using a larger size convolution kernel. This approach captures spatial and cross-channel information more efficiently. Residual connections enable the direct transfer of the difference between input and output, expanding the network’s receptive field and allowing it to capture a wider range of spatial information and semantic features. Additionally, reducing the model’s parameters helps mitigate the risk of overfitting and improves its trainability. Finally, the feature maps are flattened into vectors and passed through two fully connected layers for the final classification.

Finally, we late-fuse the classification scores of the two streams in a weighted manner to obtain the final recognition result.

IV EXPERIMENTS

TABLE I: COMPARISONS OF ACTION RECOGNITION METHODS ON THREE DIFFERENT INTERACTIVE ACTION DATASETS

Type

Methods

Year

NTU RGB+D 120

- 26 Mutual Actions(%)

H2O(%)

Assembly101(%)

X-Sub

X-Set

LSTM

Co-LSTM [27]

AAAI 2016

ST-LSTM [28]

ECCV 2016

63.00

66.60

GCA [29]

CVPR 2017

70.60

73.70

VA-LSTM [30]

ICCV 2017

2s-GCA [31]

TIP 2018

73.00

73.30

H+O [32]

CVPR 2019

68.88

LSTM-IRN [10]

TMM 2022

77.70

79.60

GCN

ST-GCN [33]

AAAI 2018

78.90

76.10

73.76

AS-GCN [34]

CVPR 2019

82.90

83.70

2s-AGCN [35]

CVPR 2019

26.70

MS-G3D [36]

CVPR 2020

26.86

CTR-GCN [37]

ICCV 2021

89.32

90.19

26.25

TA-GCN [9]

ICCV 2021

79.25

LST [38]

arXiv 2022

89.27

90.60

TCA-GCN [39]

arXiv 2022

88.37

89.30

HD-GCN [40]

arXiv 2022

88.25

90.08

InfoGCN [41]

CVPR 2022

90.22

91.13

25.63

Transformer

DSTA-Net [22]

ACCV 2020

88.92

90.10

STSA-Net [23]

Neurocomputing 2023

90.20

90.97

IGFormer [12]

ECCV 2022

85.40

86.50

22.33

ISTA-Net [13]

IROS 2023

90.56

91.72

89.09

28.01

CNN-Transformer

THCT-Net (Ours)

2023

91.00

91.86

92.98

28.42

IV-A Datasets

The detailed descriptions of three public datasets are as follows:

$\bullet$

NTU RGB+D 120 [24], the extension version of NTU RGB+D [25], is a widely-used action recognition dataset. It provides 114,480 samples of 120 human actions. In our experiments we focus on a subset of NTU RGB+D 120 Dataset, which consists of 26 kinds of mutual actions (named NTU Mutual, for short).
$\bullet$

H2O [9] is the first dataset constructured for egocentric 3D interaction recognition. The images of the H2O dataset are acquired in indoor settings in which the subjects interact with eight different objects using both of their hands. The dataset includes 571,645 RGBD frames, and features four participants performing 36 distinct action classes in three different environments. With 3D pose of both hands and pose of manipulated objects, H2O dataset facilitates hand-to-hand and hand-to-object interactions understanding.
$\bullet$

Assembly101 [26] is a large procedural activity dataset. 3D hand poses are provided to advance 3D interaction recognition from egocentric views. It’s a tough task due to the dataset’s complexity, which includes over 1,300 fine-grained classes of hand-to-object interactions. Each class consists of a single verb and an object that is manipulated. Additionally, the absence of object poses adds another layer of difficulty to judging the interactive actions.

Statistics and difficulties of these datasets are summarized in Table I and Fig. 3. For evaluation on NTU Mutual, we employ the Cross-subject (X-Sub) and Cross-set (X-Set) criteria [24], using only the joint modality to ensure fair comparisons without fusion. For H2O and Assembly101, we follow the training, validation, and test splits described in [9] and [26], respectively.

IV-B Implementation Details

All of our experiments are conducted on a machine equipped with four NVIDIA GeForce RTX 3090 GPUs and CUDA version 12.2. For training on NTU Mutual dataset, SGD optimizer is used with Nesterov momentum of 0.9, a initial learning rate of 0.1 and a decay rate 0.1. Window size is set to $[20,1,2]$ . Cross entropy is used as loss function with label smoothing factor 0.1 and temperature factor 1.0. Batch size is 32. Each training process was terminated after 110 epochs.

IV-C Results and Analyses

IV-C1 Comparison with Baselines

We used the Transformer-based method ISTA-Net [13] as the baseline. Table I shows the recognition accuracy of the proposed THCT-Net is better than baseline on three datasets. CNN is effective at extracting local image features, with superior generalization ability and faster convergence speed. On the other hand, Transformer excels at capturing global semantic information and can produce excellent results on large datasets. Concatenating CNN and Transformer models in parallel can lead to better performance compared to using either model alone.

IV-C2 Comparison with Related Methods

Table I reports the experimental results on NTU Mutual, H2O and Assembly101 datasets. The proposed THCT-Net outperforms many LSTM-, GCN-, Transformer-based action recognition methods and other human interaction recognition methods. THCT-Net achieves 0.44%, 0.14%, 4.07% and 0.41% gains over the most related interactive action recognition method, ISTA-Net [13], on NTU Mutual X-Sub, X-Set, H2O and Assembly101. THCT-Net also outperforms InfoGCN [41] by 0.78% and 0.73% on NTU Mutual, TA-GCN [9] by 13.73% on H2O, and MS-G3D [36] by 1.56% on Assembly101. THCT-Net utilises the local specificity of CNNs and models global dependencies through the use of a transformer. The CNNs and transformer work together to model the physical, temporal, and spatial relationships between interacting entities. The recognition results from both branches are then combined through concurrent splicing to improve accuracy and robustness by modelling information at multiple granularities.

V Conclusion

For the human interaction recognition task, we propose a Two-stream Hybrid CNN-Transformer network (THCT-Net). The CNN models the temporal relationships between entities, while the Transformer models the spatial relationships between interacting entities. This approach mitigates the problem of ambiguity in the semantics of actions caused by a single model. Specifically, Transformer-based stream integrates 3D convolutions with multi-head self-attention to learn inter-token correlations; We propose a new multi-branch CNN framework for CNN-based stream that automatically learns joint spatio-temporal features from skeleton sequences. The convolutional layer independently learns the local features of each joint neighborhood and aggregates the features of all joints. And the raw skeleton coordinates as well as their temporal difference are integrated with a dual-branch paradigm to fuse the motion features of the skeleton. Besides, a residual structure is added to speed up training convergence. Finally, the recognition results of the two branches are fused using parallel splicing. Multi-grained information modelling is employed to enhance the accuracy and robustness of the action recognition system. Extensive experiments on NTU RGB+D 120, H2O and Assembly101 datasets consistently verify the effectiveness of our method, which outperforms most interactive action recognition methods.

Acknowledgments

This work was supported partly by the National Natural Science Foundation of China (Grant No. 62173045, 62273054), partly by the Fundamental Research Funds for the Central Universities (Grant No. 2020XD-A04-3), and the Natural Science Foundation of Hainan Province (Grant No. 622RC675).

References

[1] Wang P, Liu J, Hou F, et al. “Organization and understanding of a tactile information dataset TacAct for physical human-robot interaction,” 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2021: 7328-7333.
[2] Feng N, Hu F, Wang H, et al. “Hybrid Graph Convolutional Networks for Skeleton-Based and EEG-Based Jumping Action Recognition,” 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2021: 4156-4161.
[3] Zheng Z H, Zhang H T, Zhang F L, et al. “Image-based clothes changing system,” Computational Visual Media, 2017, 3: 337-347.
[4] Xing H, Burschka D, “Understanding Spatio-Temporal Relations in Human-Object Interaction using Pyramid Graph Convolutional Network,” 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE,, 2022: 5195-5201.
[5] Roitberg A, Schneider D, Djamal A, et al. “Let’s play for action: Recognizing activities of daily living by learning from life simulation video games,” 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2021: 8563-8569.
[6] Zhao J, Ma Y, Dong J, et al. “Interactive mechanical arm control system based on Kinect,” 2016 35th Chinese Control Conference (CCC). IEEE,, 2016: 5976-5981.
[7] Zhang D, Vien N A, Van M, et al. “Non-local graph convolutional network for joint activity recognition and motion prediction,” 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2021: 2970-2977.
[8] Xing H, Xue Y, Zhou M, et al. “Robust event detection based on spatio-temporal latent action unit using skeletal information,” 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2021: 2941-2948.
[9] Kwon T, Tekin B, Stühmer J, et al. “H2o: Two hands manipulating objects for first person interaction recognition,” Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021: 10138-10148.
[10] Perez M, Liu J, Kot A C, “Interaction relational network for mutual action recognition,” IEEE Transactions on Multimedia, 2021, 24: 366-376.
[11] Raptis M, Sigal L, “Poselet key-framing: A model for human activity recognition,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013: 2650-2657.
[12] Pang Y, Ke Q, Rahmani H, et al. “Igformer: Interaction graph transformer for skeleton-based human interaction recognition,” European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2022: 605-622.
[13] Wen Y, Tang Z, Pang Y, et al. “Interactive spatiotemporal token attention network for skeleton-based general interactive action recognition,” 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2023: 7886-7892.
[14] Hochreiter S, Schmidhuber J, “Long short-term memory,” Neural computation, 1997, 9(8): 1735-1780.
[15] Dosovitskiy A, Beyer L, Kolesnikov A, et al. “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint, arXiv:2010.11929, 2020.
[16] Srinivas A, Lin T Y, Parmar N, et al. “Bottleneck transformers for visual recognition,” Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021: 16519-16529.
[17] Yuyao G, Yiting C, Jia W, et al. “Vision Transformer Based on Knowledge Distillation in TCM Image Classification,” 2022 IEEE 5th International Conference on Computer and Communication Engineering Technology (CCET). IEEE, 2022: 120-125.
[18] Dai Z, Liu H, Le Q V, et al. “Coatnet: Marrying convolution and attention for all data sizes,” Advances in neural information processing systems, 2021, 34: 3965-3977.
[19] Du Y, Fu Y, Wang L, “Skeleton based action recognition with convolutional neural network,” 2015 3rd IAPR Asian conference on pattern recognition (ACPR). IEEE, 2015: 579-583.
[20] Ke Q, Bennamoun M, An S, et al. “A new representation of skeleton sequences for 3d action recognition,” Proceedings of the IEEE conference on computer vision and pattern recognition, 2017: 3288-3297.
[21] Li C, Zhong Q, Xie D, et al. “Skeleton-based action recognition with convolutional neural networks,” 2017 IEEE international conference on multimedia & expo workshops (ICMEW). IEEE, 2017: 597-600.
[22] Shi L, Zhang Y, Cheng J, et al. “Decoupled spatial-temporal attention network for skeleton-based action-gesture recognition,” Proceedings of the Asian Conference on Computer Vision, 2020.
[23] Qiu H, Hou B, Ren B, et al. “Spatio-temporal segments attention for skeleton-based action recognition,” Neurocomputing, 2023, 518: 30-38.
[24] Liu J, Shahroudy A, Perez M, et al. “Ntu rgb+ d 120: A large-scale benchmark for 3d human activity understanding,” IEEE transactions on pattern analysis and machine intelligence, 2019, 42(10): 2684-2701.
[25] A. Shahroudy, J. Liu, T.-T. Ng, et al. “Ntu rgb+d: A large scale dataset for 3d human activity analysis,” 2016 IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp.1010–1019.
[26] Yun K, Honorio J, Chattopadhyay D, et al. “Two-person interaction detection using body-pose features and multiple instance learning,” 2012 IEEE computer society conference on computer vision and pattern recognition workshops. IEEE, 2012: 28-35.
[27] W. Zhu, C. Lan, J. Xing, W. Zeng, Y. Li, L. Shen, and X. Xie, “Co-occurrence feature learning for skeleton based action recognition using regularized deep lstm networks,” in Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, ser. AAAI’16, 2016, p. 3697–3703.
[28] J. Liu, A. Shahroudy, D. Xu, and G. Wang, “Spatio-temporal lstm with trust gates for 3d human action recognition,” in Computer Vision– ECCV 2016, Cham: Springer International Publishing, 2016, pp. 816–833.
[29] J. Liu, G. Wang, P. Hu, L.-Y. Duan, and A. C. Kot, “Global context-aware attention lstm networks for 3d action recognition,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 3671–3680.
[30] P. Zhang, C. Lan, J. Xing, W. Zeng, J. Xue, and N. Zheng, “View adaptive recurrent neural networks for high performance human action recognition from skeleton data,” in 2017 IEEE International Conference on Computer Vision (ICCV), 2017, pp. 2136–2145.
[31] J. Liu, G. Wang, L.-Y. Duan, K. Abdiyeva, and A. C. Kot, “Skeleton-based human action recognition with global context-aware attention lstm networks,” IEEE Transactions on Image Processing, vol. 27, no. 4, pp. 1586–1599, 2018.
[32] B. Tekin, F. Bogo, and M. Pollefeys, “H+o: Unified egocentric recognition of 3d hand-object poses and interactions,” in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 4506–4515.
[33] S. Yan, Y. Xiong, and D. Lin, “Spatial temporal graph convolutional networks for skeleton-based action recognition,” in Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence and Thirtieth Innovative Applications of Artificial Intelligence Conference and Eighth AAAI Symposium on Educational Advances in Artificial Intelligence, 2018.
[34] M. Li, S. Chen, X. Chen, Y. Zhang, Y. Wang, and Q. Tian, “Actional-structural graph convolutional networks for skeleton-based action recognitions,” in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 3590–3598.
[35] L. Shi, Y. Zhang, J. Cheng, and H. Lu, “Two-stream adaptive graph convolutional networks for skeleton-based action recognition,” in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 12 018–12 027.
[36] Z. Liu, H. Zhang, Z. Chen, Z. Wang, and W. Ouyang, “Disentangling and unifying graph convolutions for skeleton-based action recognition,” in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 140–149.
[37] Y. Chen, Z. Zhang, C. Yuan, B. Li, Y. Deng, and W. Hu, “Channel-wise topology refinement graph convolution for skeleton-based action recognition,” in 2021 IEEE International Conference on Computer Vision (ICCV), 2021, pp. 13 359–13 368.
[38] W. Xiang, C. Li, Y. Zhou, B. Wang, and L. Zhang, ““Language supervised training for skeleton-based action recognition,” arXiv preprint, arXiv:2208.05318, 2022.
[39] S. Wang, Y. Zhang, M. Zhao, H. Qi, K. Wang, F. Wei, and Y. Jiang, “Skeleton-based action recognition via temporal-channel aggregation,” arXiv preprint, arXiv:2205.15936, 2022.
[40] J. Lee, M. Lee, D. Lee, and S. Lee, “Hierarchically decomposed graph convolutional networks for skeleton-based action recognition,” Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023: 10444-10453.
[41] H.-G. Chi, M. H. Ha, S. Chi, S. W. Lee, Q. Huang, and K. Ramani, “Infogcn: Representation learning for human skeleton-based action recognition,” in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 20 154–20 164.