
Two-stream Multi-level Dynamic Point Transformer for Two-person Interaction Recognition

Yao Liu ([email protected]) and Gangfeng Cui ([email protected]), School of Computer Science and Engineering, University of New South Wales, Sydney, NSW 2052, Australia; Jiahui Luo ([email protected]), School of Computing and Information Systems, University of Melbourne, Melbourne, Victoria 3010, Australia; Xiaojun Chang ([email protected]), Faculty of Engineering and Information Technology, University of Technology Sydney, Sydney, NSW 2007, Australia; and Lina Yao ([email protected]), Data61, CSIRO and School of Computer Science and Engineering, University of New South Wales, Sydney, NSW 2015, Australia
Abstract.

As a fundamental aspect of human life, two-person interactions contain meaningful information about people's activities, relationships, and social settings. Human action recognition underpins many smart applications, in which personal privacy is a strong concern. However, recognizing two-person interactions poses more challenges than single-person actions due to increased body occlusion and overlap. In this paper, we propose a point cloud-based network named Two-stream Multi-level Dynamic Point Transformer for two-person interaction recognition. Our model addresses the challenge of recognizing two-person interactions by incorporating local-region spatial information, appearance information, and motion information. To achieve this, we introduce a frame selection method named Interval Frame Sampling (IFS), which efficiently samples frames from videos, capturing more discriminative information in a relatively short processing time. Subsequently, a frame features learning module and a two-stream multi-level feature aggregation module extract global and partial features from the sampled frames, effectively representing the local-region spatial information, appearance information, and motion information related to the interactions. Finally, we apply a transformer to perform self-attention on the learned features for the final classification. Extensive experiments are conducted on two large-scale datasets, the interaction subsets of NTU RGB+D 60 and NTU RGB+D 120. The results show that our network outperforms state-of-the-art approaches in most standard evaluation settings.

Two-person interaction recognition, Point cloud-based method, Frame sampling, Two-stream multi-level feature aggregation, Transformer
ccs: Computing methodologies Activity recognition and understanding

1. Introduction

Interactions between two people, such as handshaking and hugging (see Figure 1), represent a fundamental aspect of human life. Recognizing such interactions is crucial and stands as one of the most significant branches of human activity recognition. Its applications span various domains, including security, video retrieval, surveillance, and human-computer interfaces, promising substantial societal benefits (Stergiou and Poppe, 2019). A typical computer vision-based two-person interaction recognition task involves automatically identifying human interactions from image sequences or videos. Despite notable advancements in human activity recognition over the past decade (Sargano et al., 2017; Wang et al., 2022), recognizing two-person interactions remains challenging. Unlike single-person action recognition, identifying two-person interactions is more complex, primarily due to the involvement of multiple human subjects and the interdependence between them.

Earlier human action recognition heavily relied on wearable devices, which captured motion information to achieve action recognition (Hammerla et al., 2016). However, this approach required individuals to wear sensors continuously, making it costly for large-scale data collection and hindering its widespread adoption. In recent years, RGB-D-based two-person interaction recognition (Liu et al., 2019) has garnered significant attention, mainly due to the advancement of cost-effective RGB-D sensors. Several approaches in this domain rely on skeleton data alone (Xu et al., 2021; Tang et al., 2022; Dhiman et al., 2021) or on hybrid features from different channels (Huang et al., 2018). While these techniques can provide valuable information for recognition, they face critical challenges. Skeleton-based approaches are common in single-person action recognition, but they still face difficulties in human-to-human interaction recognition (Wang et al., 2018), where body overlaps and occlusions are more frequent. Moreover, when skeleton data are estimated from RGB-D data, errors in the skeleton estimation accumulate in the downstream action recognition and degrade performance (Wang et al., 2020). In addition, skeleton-based approaches rely on pre-designed skeleton points, which demands extra work for two-person interaction recognition. Regarding hybrid feature-based methods, the combination of features from multiple channels can result in heavy computation. Typically, these channels include RGB data (Shi et al., 2022; Xu et al., 2022), which contains texture and color information and thus raises privacy concerns, particularly in scenarios such as monitoring the activities of children and older individuals in a room. Action analysis through common video or image data becomes challenging in such privacy-sensitive cases.

In this work, we propose a novel network named the Two-stream Multi-level Dynamic Point Transformer (TMDPT) for two-person interaction recognition, utilizing depth videos as the sole input to the network. Depth videos can be captured using depth cameras and are also accessible through open datasets collected by RGB-D cameras. Unlike RGB data, depth videos lack texture and color features and are insensitive to illumination, ensuring a higher level of personal privacy. Moreover, our model directly performs action recognition from the raw data, making it more robust against occlusion issues encountered by skeleton-based methods. Taking inspiration from the recent success of point cloud-based action recognition approaches (Wang et al., 2020; Li et al., 2021), we convert depth videos into point cloud videos in this paper. An additional advantage of this approach is the reduced computation costs when analyzing point cloud videos compared to processing depth videos directly.

Refer to caption
Figure 1. Examples of interactions from the NTU RGB+D 120 dataset: (A) hugging, (B) punching, (C) shaking hands, (D) high-five, (E) kicking, (F) whispering, (G) taking a photo and (H) cheers and drink.

In interaction learning, frame sampling plays a critical role due to the high frame rate of videos, which typically contain dozens of frames per second. Adjacent or nearby frames often share a large amount of repetitive content, which constitutes redundant information. A well-designed frame sampling method is essential to balance efficiency and accuracy. However, in current interaction recognition methods, frame sampling is often not thoroughly investigated, and many methods simply employ uniform or random downsampling (Li et al., 2021). Such simplistic approaches may lead to performance degradation: selecting a small frame rate could cause the network to miss crucial partial information, while opting for a large value could introduce unnecessary computational burden. In this paper, we introduce a novel frame sampling technique called Interval Frame Sampling (IFS), which intelligently selects informative frames while removing redundancy from the original videos, ultimately enhancing recognition performance.

With the sampled frames, we construct an effective deep-learning model for interaction recognition. Firstly, following the approach in (Li et al., 2021), we generate frame-level features. Subsequently, we propose a novel Two-stream Multi-level Feature Aggregation module to jointly capture informative global and partial features. This module comprises a global stream and a partial stream. Within the partial stream, we implement a temporal split procedure to divide a video into multiple temporal segments, allowing for extraction of partial information. In both global and partial streams, we carefully design and extract local-region spatial information, appearance information, and motion information, effectively representing the crucial aspects of two-person interactions and addressing the challenges associated with interaction recognition. To further enhance the feature representations, we employ a transformer (Vaswani et al., 2017), leveraging self-attention on the learned global and partial features. This process enables the global feature to acquire finer partial motion and appearance details while empowering the partial features with a global perspective. The output of the transformer, in combination with the original global feature, is then used for the final classification.

The main contributions of our work include the following:

  • We propose a novel network named the Two-stream Multi-level Dynamic Point Transformer (TMDPT) for two-person interaction recognition. Notably, we are the pioneers in addressing interaction recognition tasks through depth video using a point cloud-based approach. This novel perspective expands the possibilities of depth-based interaction analysis.

  • We introduce the Interval Frame Sampling (IFS) technique for frame sampling, and leverage it to extract global and partial features through our two-stream multi-level feature aggregation module. By doing so, we effectively capture and represent local-region spatial information, appearance information, and motion information, all of which contribute to a comprehensive representation of two-person interactions.

  • Our network outperforms state-of-the-art approaches in most standard evaluation settings of the interaction subsets of the NTU RGB+D 60 and NTU RGB+D 120 datasets. This remarkable performance improvement validates the effectiveness of our proposed TMDPT model and its potential for practical applications. In addition to achieving state-of-the-art results, we conduct an extensive ablation study to demonstrate the effectiveness of each module in our network. This analysis further reinforces the contributions and benefits of our proposed method.

2. Related Works

2.1. Two-person interaction recognition

Most recent two-person interaction recognition approaches heavily rely on RGB-D data (Liu et al., 2019). These methods can be broadly classified into two major groups: skeleton-based interaction recognition and hybrid feature-based interaction recognition.

Skeleton-based methods are centered around utilizing skeleton data for interaction reasoning. This type of data contains 3D positions of body joints and is typically estimated from depth images. Various approaches within this category attempt to generate features based on each person’s joints and their interactive joints to represent the motion relation. For instance, in (Yun et al., 2012), an approach is proposed to use movements between all pairs of joints, joint distances, and velocity information as the motion information for interaction recognition. Another study, (Wu et al., 2018), extracts a human interaction feature descriptor encompassing the directional, static, and dynamic properties of the skeleton data. They subsequently employ a linear model with a sparse group lasso penalty enhancement to facilitate the interaction recognition task. Meanwhile, some researchers transform the interaction problem into a single-person action recognition problem. In (Bloom et al., 2016), a method is proposed to decompose a two-person interaction into two individuals’ actions, followed by separate classification of each person’s action. However, a significant challenge in using skeleton data is the robustness of the skeleton estimation, which is still not a trivial matter. Inaccurate and incomplete estimated data can adversely affect recognition performance, thereby posing a serious limitation to the effectiveness of skeleton-based methods.

Hybrid feature-based methods leverage combined features from different channels to recognize interactions. For instance, (Xia et al., 2015) integrates motion features, skeleton joints-based postures, and local appearance features from both depth and RGB data to reason about interactions. Similarly, (Trabelsi et al., 2017) combines the distance property of the 3D skeleton with dense optical features extracted from RGB and depth images jointly to predict the interaction class. On the other hand, (Gemeren et al., 2014) merges information from joints with poselets to select important frames and captures motion and appearance features for each interactive person from the bounding boxes where the human subjects are located. Although features from multimodal sources can provide valuable information for recognition, these methods often require substantial amounts of data, affecting the network’s efficiency. Additionally, many of these methods involve using RGB images as part of the input, which could expose excessive personal privacy information. Consequently, these methods may not be suitable for practical applications where some users prefer unobtrusive monitoring.

2.2. Point cloud-based deep learning methods

Point clouds contain abundant 3D geometrical information. PointNet (Qi et al., 2017a) is a groundbreaking work that employs deep learning techniques to address static point cloud problems, including object part segmentation, classification, and semantic scene segmentation. The primary concept behind this approach is to utilize a symmetric function constructed with shared-weight deep neural networks to extract information from the points. Building on this, PointNet++ (Qi et al., 2017b) is a subsequent work that captures spatial features from local partitions and then hierarchically merges them to form a frame-level global feature. The subsequent 3DGCN (Lin et al., 2020) and IGCN (Liu et al., 2022) improve the performance of the model by defining variable kernels and interpolating kernels for point cloud feature extraction, respectively.
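For illustration, a minimal Python sketch of the core PointNet idea referenced above is given below: a shared per-point MLP followed by a symmetric max-pooling, so the output is invariant to the ordering of the input points. The class name, layer sizes, and dummy inputs are illustrative and not taken from the original implementation.

```python
import torch
import torch.nn as nn

class TinyPointNet(nn.Module):
    """Illustrative PointNet-style model: shared point MLP + symmetric max pooling."""
    def __init__(self, in_dim=3, feat_dim=128, num_classes=10):
        super().__init__()
        # The same weights are applied to every point independently.
        self.point_mlp = nn.Sequential(
            nn.Linear(in_dim, 64), nn.ReLU(),
            nn.Linear(64, feat_dim), nn.ReLU(),
        )
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, points):                      # points: (batch, n_points, 3)
        per_point = self.point_mlp(points)          # (batch, n_points, feat_dim)
        global_feat, _ = per_point.max(dim=1)       # symmetric (order-invariant) aggregation
        return self.classifier(global_feat)

logits = TinyPointNet()(torch.randn(2, 1024, 3))    # (2, 10)
```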

Compared to static point cloud analysis, learning on dynamic point clouds presents more challenges. Recently, various point cloud-based approaches have been devised to address video-level human action recognition tasks. For instance, in the 3DV method (Wang et al., 2020), 3D points are initially converted into regular voxel sets to capture motion information. These voxel sets are then abstracted back to a point set, serving as the input to PointNet++ (Qi et al., 2017b) for action recognition. Similarly, SequentialPointNet (Li et al., 2021) flattens a point cloud video into a new data type named hyperpoint sequence, followed by feature mixing on the hyperpoint sequence to extract appearance and motion information. The advantage of using point clouds is that they can be directly derived from depth images, thereby mitigating the exposure of excessive texture and color information found in RGB images and effectively safeguarding privacy. Moreover, depth images aid in distinguishing between background and people, and employing advanced point cloud processing techniques can reduce computational costs and enhance efficiency. Taking inspiration from these methods, our paper proposes an effective point cloud-based network for two-person interaction recognition.

2.3. Impact of frame rate

In video data processing, it is common to convert the video into a sequence of frame-by-frame images for further analysis. Despite the extensive research conducted in the area of human activity recognition (Sargano et al., 2017), the impact of frame rates has received relatively little attention until recently. (Harjanto et al., 2016) is the first work to focus on this aspect. According to their findings, a lower frame rate (with fewer frames) results in a shorter running time, but it sacrifices the amount of information available for feature extraction, which can compromise the discriminability of the extracted features. On the other hand, a higher frame rate captures more temporal information, but it comes at the cost of longer processing times and potential redundancy. This excess redundancy can negatively affect the networks by introducing distracting effects. As a result, employing a simple frame sampling method to evenly or consecutively select frames with either a lower or higher frame rate may not be optimal. The frame rate and the sampling method of the video have a significant impact on subsequent tasks, and it is essential to carefully consider and choose the appropriate frame rate and sampling technique for optimal performance.

2.4. Transformer

The Transformer architecture (Vaswani et al., 2017) was initially developed to address language-related problems, such as text classification, machine translation, and question answering, by utilizing a self-attention mechanism to learn the relationships between elements in a sequence.

In recent years, the computer vision community has embraced transformer networks and adapted them for various vision-related tasks, achieving remarkable success. Various approaches have been created to tackle different computer vision problems, including image generation, object detection, segmentation, image recognition, video understanding, and visual question answering (Khan et al., 2022). For instance, (Dosovitskiy et al., 2021) introduces the Vision Transformer (ViT) for image classification. The method splits an image into multiple fixed-size patches and utilizes their embeddings along with positional information as input for a transformer. The transformer’s output is then used for classification. MViT (Fan et al., 2021a) introduces the concept of multiscale in ViT, constructing the model by gradually expanding the channel capacity and reducing spatio-temporal resolution. This approach addresses the pyramid structure through a multi-headed attention mechanism for recognition tasks in both images and videos. TimeSFormer (Bertasius et al., 2021) presents a convolution-free model proposing divided attention based on ViT. By extending ViT to video processing, TimeSFormer treats videos as sequences of patches extracted from frames. SequentialPointNet(Li et al., 2021) is a model for 3D human action recognition, processing point cloud sequences through two modules: the intra-frame appearance encoding module for spatial structure and the inter-frame motion encoding module for temporal changes. Transformer is used in the temporal position embedding layer to capture temporal features. P4Transformer (Fan et al., 2021b) employs a point 4D convolution to understand spatio-temporal structures and a transformer mechanism for grasping comprehensive appearance and motion details across videos, using self-attention on localized features instead of explicit tracking. Building on the success of these methods, we integrate a transformer into our network to enhance the information contained in global and partial features, thereby improving performance in two-person interaction recognition.

Refer to caption
Figure 2. Two-stream Multi-level Dynamic Point Transformer comprises four main components: a novel frame sampling scheme named Interval Frame Sampling, a frame features learning module, a two-stream multi-level feature aggregation module, and a transformer classification module.

3. Methodology

The overview of our Two-stream Multi-level Dynamic Point Transformer (TMDPT) is depicted in Figure 2. TMDPT comprises four main components: a novel frame sampling scheme, a frame features learning module, a feature aggregation module, and a transformer classification module. We solely utilize the depth video channel captured by the RGB-D camera as the input to our model. While the figure occasionally includes RGB picture depictions for illustrative purposes, please note that our model does not involve any RGB channels. First, we form 3D voxels by mapping depth values to coordinates and convert the 3D grid into a point cloud by determining whether the voxels are occupied (Wang et al., 2020; Li et al., 2021). Subsequently, Interval Frame Sampling (IFS) is applied to sample frames from the point cloud videos. The frame features learning module then extracts local-region features, frame spatial features, and frame temporal features from each frame. Next, a two-stream multi-level feature aggregation module combines these features into global and partial features. Lastly, the transformer classification module employs self-attention on the aggregated features and utilizes the output in conjunction with the original global feature for the final classification.
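The following is a minimal sketch of the depth-to-point-cloud conversion described above, assuming pinhole-camera intrinsics (fx, fy, cx, cy): back-project each depth pixel to 3D, voxelize the points, and keep one point per occupied voxel. Function names, intrinsics values, and the voxel size are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project an (H, W) depth map (metres) to an (N, 3) point set."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    valid = depth > 0                      # drop missing-depth pixels
    z = depth[valid]
    x = (u[valid] - cx) * z / fx
    y = (v[valid] - cy) * z / fy
    return np.stack([x, y, z], axis=1)

def occupied_voxel_points(points, voxel_size=0.05):
    """Quantize points into a 3D grid and return the centre of each occupied voxel."""
    voxel_idx = np.floor(points / voxel_size).astype(np.int64)
    unique_idx = np.unique(voxel_idx, axis=0)
    return (unique_idx + 0.5) * voxel_size

depth_frame = np.random.rand(424, 512) * 4.0   # stand-in for one Kinect-style depth frame
pts = depth_to_points(depth_frame, fx=365.0, fy=365.0, cx=256.0, cy=212.0)  # assumed intrinsics
cloud = occupied_voxel_points(pts)             # one point-cloud frame
```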

3.1. Interval Frame Sampling

We introduce a novel frame sampling technique called Interval Frame Sampling (IFS) (refer to Figure 3). IFS comprises two steps, with the first step executed before the training phase to sample $p$ frames per interaction video from the original videos. To elaborate, given a point cloud video, we evenly separate all frames into $p$ intervals and then randomly select one frame from each interval. This step balances the advantages of uniform and random sampling to reduce redundancy while retaining highly discriminative information. We denote $p$ as the Top frame rate in our method, usually selecting a large value to ensure that the interaction samples contain ample information. During the training phase, the second step is performed in each epoch. In this step, we apply the same sampling method again to evenly divide the updated video from the first step into $q$ intervals. We then randomly select one frame from each interval, resulting in a total of $q$ frames as a training instance. Essentially, this means the frames representing the same interaction sample are likely to vary from epoch to epoch. We refer to $q$ as the Bottom frame rate in this work.

Our frame selection technique offers several advantages. Firstly, by setting the interval size in the second step appropriately (not too large compared to the total training epochs), our network can effectively access a significant portion of the information contained within the $p$ frames. As mentioned earlier, $p$ can be set as a large number to provide our network with ample information. Moreover, the two rounds of interval sampling within IFS help eliminate redundant information that may be present in consecutive interaction frames. This ensures that the network receives only essential and non-redundant data. Additionally, IFS contributes to shorter processing times, as our network only utilizes $q$ frames per interaction sample during training. Here, $q$ can be set as a much smaller value compared to $p$. Lastly, the characteristic of IFS introducing slight changes to the same interaction sample in each epoch's training offers positive data augmentation-like effects. This variation enables the network to focus on learning discriminative partial information from different frames during the training phase. Overall, these advantages make IFS a valuable addition to our network, enhancing its efficiency and performance in two-person interaction recognition.
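A minimal sketch of IFS as described above is given below. Step 1 runs once per video before training ($p$ intervals); Step 2 runs every epoch on the retained frames ($q$ intervals). Function and variable names are ours and assume the video has at least $p$ frames; they are not taken from the released code.

```python
import numpy as np

def interval_sample(num_frames, num_intervals, rng):
    """Split [0, num_frames) into num_intervals equal chunks and pick one random index from each.
    Assumes num_frames >= num_intervals."""
    edges = np.linspace(0, num_frames, num_intervals + 1).astype(int)
    return np.array([rng.integers(lo, hi) for lo, hi in zip(edges[:-1], edges[1:])])

rng = np.random.default_rng(0)
total_frames, p, q = 140, 50, 24          # illustrative video length; p, q as in Section 4.2

# Step 1 (once, before training): keep p frames to represent this interaction sample.
kept = interval_sample(total_frames, p, rng)      # indices into the original video

# Step 2 (every epoch): re-sample q of the p kept frames, so the training instance
# representing this sample varies slightly from epoch to epoch.
for epoch in range(3):
    picked = kept[interval_sample(p, q, rng)]
    # `picked` holds the q frame indices fed to the network in this epoch
```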

Refer to caption
Figure 3. A schematic overview of Interval Frame Sampling for two-person interaction recognition. Starting with an original interaction video, in step 1, $p$ frames are sampled from $p$ intervals to represent the corresponding interaction sample during the whole learning process. In step 2, $q$ frames are further sampled from the $p$ frames for each epoch's interaction learning.

3.2. Frame features learning module

The frame features learning module outputs three types of features: local-region features, frame spatial features, and frame temporal features, which are obtained through a series of feature extraction steps (refer to Figure 4). This module takes a point cloud set $\{x_{1},x_{2},\dots,x_{n}\}$ as input, where each $x_{i}\in\mathbb{R}^{3}$ represents the 3-dimensional coordinates of a point within the point cloud frame.

Specifically, the spatial feature extraction component consists of two levels of set abstraction (SA) (Qi et al., 2017b). At each abstraction level, a sampling layer performs iterative Farthest Point Sampling (FPS) to select a fixed number of points as centroids from the input points. Subsequently, a grouping layer constructs local regions around these centroids using a ball query algorithm. Following that, a modified PointNet layer extracts local spatial features from each region. In the modified PointNet layer, the coordinates of all points in each local region are transformed into local coordinates relative to the centroid point. Additionally, the distance between each point and its corresponding centroid is included as an additional point feature to enhance the network’s performance in dealing with rotational motions. To further focus on learning crucial features, a Convolutional Block Attention Module (CBAM) (Woo et al., 2018) is applied to perform inter-feature attention on the point features. It is important to note that CBAM is not used in the first abstraction level within this work.
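As an illustration of the grouping step just described, the sketch below gathers, for each centroid, the neighbours within a ball-query radius, expresses them in centroid-relative coordinates, and appends the point-to-centroid distance as an extra feature. This is a hedged re-implementation with our own function names and a simple stand-in for FPS, not the authors' code.

```python
import torch

def ball_query_group(points, centroids, radius, n_samples):
    """points: (N, 3); centroids: (M, 3) -> grouped features: (M, n_samples, 4)."""
    dists = torch.cdist(centroids, points)                    # (M, N) pairwise distances
    groups = []
    for m in range(centroids.shape[0]):
        idx = torch.nonzero(dists[m] < radius).flatten()      # neighbours inside the ball
        if idx.numel() == 0:                                  # fall back to the nearest point
            idx = dists[m].argmin().view(1)
        idx = idx[torch.randint(idx.numel(), (n_samples,))]   # sample with replacement
        local = points[idx] - centroids[m]                    # centroid-relative coordinates
        dist_feat = local.norm(dim=1, keepdim=True)           # distance to the centroid
        groups.append(torch.cat([local, dist_feat], dim=1))
    return torch.stack(groups)                                # (M, n_samples, 4)

pts = torch.randn(512, 3)
cents = pts[:128]                                             # stand-in for FPS centroids
grouped = ball_query_group(pts, cents, radius=0.06, n_samples=32)
```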

Refer to caption
Figure 4. A frame features learning module extracts local-region features, a frame spatial feature, and a frame temporal feature from each point cloud frame. $n$ denotes the number of points and $m$ denotes the dimension of the points.

At the end of this layer, the feature vectors are combined with the coordinates of the corresponding centroid point, forming a spatial feature that represents a local region, which is referred to as local-region features. Subsequently, a Multi-Layer Perceptron (MLP) and a Max Pooling operation are applied to capture a spatial feature representation for the entire frame, which we call the frame spatial feature. The frame spatial feature extraction operation can be formulated as below:

(1) $FS_{t}=\underset{j=1,\dots,n_{2}}{MAX}\{MLP(e^{t}_{j})\}=MAX\{MLP(SA(SA^{\prime}(PC_{t})))\}$

In Equation 1, $FS_{t}$ is the frame spatial feature of the $t$-th point cloud frame ($PC_{t}$). $e^{t}_{j}$ is the abstracted feature of the $j$-th local region from the $t$-th frame. $SA$ denotes the set abstraction operation.
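A minimal sketch of Equation 1 follows: a shared MLP over the abstracted local-region features $e^{t}_{j}$ of one frame, followed by max pooling, yields the frame spatial feature $FS_{t}$. The layer sizes and the number of regions are illustrative assumptions.

```python
import torch
import torch.nn as nn

mlp = nn.Sequential(nn.Linear(256, 512), nn.ReLU(), nn.Linear(512, 1024))

e_t = torch.randn(32, 256)           # n2 = 32 local-region features from the t-th frame
fs_t = mlp(e_t).max(dim=0).values    # frame spatial feature FS_t, shape (1024,)
```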

After extracting the frame spatial features, a frame temporal feature extraction structure is constructed to capture temporal clues for each frame. In this step, a temporal positional feature vector, which contains the temporal positional information for each frame, is generated using a sinusoidal positional encoding technique (Vaswani et al., 2017). Each dimension of the vector is calculated as follows:

(2) $TP(t,2j)=\sin(\frac{t}{10000^{2j/m_{3}}})$
(3) $TP(t,2j+1)=\cos(\frac{t}{10000^{2j/m_{3}}})$

In Equations 2 and 3, $t$ is the temporal position, $2j$ and $2j+1$ denote the dimension indices, and $m_{3}$ is the size of a frame spatial feature. Each dimension of the sinusoidal positional vector corresponds to a sinusoid with a period of $10000^{2j/m_{3}}\times 2\pi$. For a given input feature, this technique can be used to generate an absolute positional vector.

The generated temporal positional feature vectors have the same dimension as the frame spatial features; hence the two can be summed. After the summation, an MLP is applied to extract the frame temporal features:

(4) $FT_{t}=MLP(FS_{t}+TP_{t})$

In Equation 4, $FT_{t}$, $FS_{t}$ and $TP_{t}$ are the $t$-th frame temporal feature, frame spatial feature, and temporal positional feature, respectively.
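The sketch below walks through Equations 2-4: it builds the sinusoidal temporal positional vector $TP_{t}$ for frame $t$, adds it to the frame spatial feature $FS_{t}$, and passes the sum through an MLP to obtain the frame temporal feature $FT_{t}$. The feature size and MLP shape are illustrative assumptions.

```python
import torch
import torch.nn as nn

def temporal_position(t, m3):
    """Sinusoidal positional vector of size m3 for temporal position t (Eqs. 2 and 3)."""
    j = torch.arange(m3 // 2, dtype=torch.float32)
    angles = t / (10000 ** (2 * j / m3))
    tp = torch.zeros(m3)
    tp[0::2] = torch.sin(angles)     # even dimensions (Eq. 2)
    tp[1::2] = torch.cos(angles)     # odd dimensions (Eq. 3)
    return tp

m3 = 1024                                           # assumed frame spatial feature size
mlp = nn.Sequential(nn.Linear(m3, m3), nn.ReLU())

fs_t = torch.randn(m3)                              # frame spatial feature of frame t
ft_t = mlp(fs_t + temporal_position(t=5, m3=m3))    # frame temporal feature (Eq. 4)
```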

Refer to caption
Figure 5. First, a two-stream multi-level feature aggregation module merges frame-level features into global and partial features. Then, a transformer classification module performs self-attention on these aggregated features and uses the output combined with the original global feature for interaction recognition.

3.3. Two-stream multi-level feature aggregation

With frame-level features generated, we propose a two-stream module to aggregate them into global and partial representations (the whole architecture with the following transformer classification module can be seen in Figure 5).

3.3.1. Two-stream feature aggregation

Both global and partial information is important for interaction recognition. Purely focusing on global information may cause a network to overlook crucial partial details necessary for distinguishing similar interactions. To address this, we propose a two-stream feature aggregation module, comprising a global stream and a partial stream, to jointly capture global and partial information. Specifically, in the partial stream, a temporal split procedure is applied to obtain fine partial clues. Each video is split into $ts$ consecutive temporal segments, with an overlap ratio of 0.

3.3.2. Multi-level feature aggregation

The multi-level feature aggregation technique has demonstrated its ability to enhance performance in tasks such as image segmentation and classification. In this work, we apply this technique to simultaneously aggregate local-region features, frame spatial features, and frame temporal features, providing our network with more discriminative information.

Under the two-stream structure, our model processes both the temporal segments and the entire video. Initially, we utilize the multi-level feature aggregation technique to merge local-region features, frame spatial features, and frame temporal features, resulting in a local-region spatial feature, an appearance feature, and a motion feature. These features are then concatenated into a single feature, effectively describing the appearance and motion patterns jointly. In total, our model generates one global feature and $ts$ partial features (corresponding to the $ts$ temporal segments). Finally, all of these global and partial features are combined to form an integrated feature, enabling our network to leverage both global and fine-grained temporal information for accurate two-person interaction recognition.

(5) $S_{g}=Concat(L_{g},A_{g},M_{g})=Concat(\underset{t=1,\dots,q}{MAX}\{\underset{j=1,\dots,n_{2}}{MAX}\{e^{t}_{j}\}\},\ \underset{t=1,\dots,q}{MAX}\{FS_{t}\},\ \underset{t=1,\dots,q}{MAX}\{FT_{t}\})$
(6) $S_{i}=Concat(L_{i},A_{i},M_{i})=Concat(\underset{t=i,\dots,i+u-1}{MAX}\{\underset{j=1,\dots,n_{2}}{MAX}\{e^{t}_{j}\}\},\ \underset{t=i,\dots,i+u-1}{MAX}\{FS_{t}\},\ \underset{t=i,\dots,i+u-1}{MAX}\{FT_{t}\})$
(7) $S=Concat(S_{g},S_{1},\dots,S_{ts})$

In Equations 5, 6 and 7, $S_{g}$ is the feature representation for the whole video, $S_{i}$ is the feature representation for the $i$-th temporal segment, and $S$ is an integrated feature vector containing both global and partial information. $S_{g}$ is composed of three sub-features: $L_{g}$, the global aggregated local-region spatial feature; $A_{g}$, the global appearance feature; and $M_{g}$, the global motion feature. Each partial feature also consists of three sub-partial aggregated features: $L_{i}$, $A_{i}$, and $M_{i}$ denote the partial aggregated local-region spatial feature, the partial appearance feature, and the partial motion feature of the $i$-th temporal segment, respectively. $e^{t}_{j}$ is the abstracted feature of the $j$-th local region from the $t$-th frame, i.e., the output of the second set abstraction level. $u$ is the size of each segment after temporal splitting.
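A minimal sketch of Equations 5-7 is shown below: the three frame-level feature types are max-pooled over all $q$ frames (global stream) and over each temporal segment of $u$ frames (partial stream), and the results are combined into the integrated feature $S$. All shapes are illustrative, and the partial features are stacked into a matrix here only to mimic the transformer's token-style input.

```python
import torch

q, ts = 24, 6
u = q // ts                              # frames per temporal segment
e = torch.randn(q, 32, 256)              # local-region features e_j^t (q frames, n2 = 32)
fs = torch.randn(q, 1024)                # frame spatial features FS_t
ft = torch.randn(q, 1024)                # frame temporal features FT_t

def aggregate(e_part, fs_part, ft_part):
    l = e_part.amax(dim=(0, 1))          # local-region spatial feature L
    a = fs_part.amax(dim=0)              # appearance feature A
    m = ft_part.amax(dim=0)              # motion feature M
    return torch.cat([l, a, m])          # Concat(L, A, M)

s_g = aggregate(e, fs, ft)                                                # global feature (Eq. 5)
s_parts = [aggregate(e[i*u:(i+1)*u], fs[i*u:(i+1)*u], ft[i*u:(i+1)*u])    # partial features (Eq. 6)
           for i in range(ts)]
s = torch.stack([s_g] + s_parts)                                          # integrated feature S (Eq. 7)
```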

3.4. Transformer classification

We employ a transformer to perform self-attention on the aggregated global and partial features. This step aims to enhance each feature’s information by allowing them to learn important relationships from other related features. Consequently, the partial features can gain a global view, and the global feature can capture more fine-grained partial motion and appearance details.

The transformer applies a multi-head structure to improve its performance. It consists of $n$ heads and receives the integrated feature $S$ as input. Specifically, within an arbitrary head $h_{i}$, the transformer uses three learnable matrices, $M_{ki}$, $M_{vi}$ and $M_{qi}$, to transform each sub-feature within $S$ into three vectors, representing a key, a value, and a query. All the keys, values, and queries are then packed together into matrices $K_{i}$, $V_{i}$, and $Q_{i}$, respectively:

(8) $K_{i}=SM_{ki},\ V_{i}=SM_{vi},\ Q_{i}=SM_{qi}$

After that, an attention weight matrix $A_{i}$, containing the relationships between each pair of sub-features within $S$, is computed using an attention function. Next, an updated feature vector $U_{i}$ is generated by multiplying the attention matrix $A_{i}$ with the values $V_{i}$:

(9) $A_{i}=Attention(Q_{i},K_{i})=Softmax(\frac{Q_{i}K_{i}^{T}}{d_{k}^{1/2}})$
(10) $U_{i}=A_{i}V_{i}$

The same calculation is performed in every head in parallel, producing $n$ different updated feature vectors. Eventually, a final updated vector $U$ is computed by concatenating these $n$ vectors:

(11) $U=Concat(U_{1},\cdots,U_{n})$

The updated vector after the transformer has the same shape as the input feature $S$. It consists of one updated global feature and $ts$ updated partial features, representing the enriched information from the self-attention process. In addition to the multi-head structure, the transformer incorporates several other components, including residual connections, LayerNorms, and linear layers. Moreover, to enhance performance, $b$ identical transformer blocks are stacked together, enabling the model to capture more complex and higher-order relationships within the data. Following the self-attention process, an MLP layer compresses the transformed features into a single feature. Another MLP layer then takes this single feature, combined with the original global feature, as input to predict the interaction class of the video. This residual-like structure helps make the model more stable and improves its ability to learn and represent complex interaction patterns.
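The following sketch illustrates Equations 8-11: per-head linear projections of the sub-features in $S$ into keys, values, and queries, scaled dot-product attention, and concatenation of the head outputs. The dimensions are illustrative, and to keep the sketch short the same projection weights are reused for every head, unlike a real multi-head layer.

```python
import torch
import torch.nn as nn

d_model, n_heads = 2304, 18          # illustrative feature size; 18 heads as in Section 4.2
d_k = d_model // n_heads

S = torch.randn(7, d_model)          # 1 global + ts = 6 partial sub-features
M_k = nn.Linear(d_model, d_k, bias=False)   # one head's learnable matrices M_ki, M_vi, M_qi
M_v = nn.Linear(d_model, d_k, bias=False)
M_q = nn.Linear(d_model, d_k, bias=False)

def head(S):
    K, V, Q = M_k(S), M_v(S), M_q(S)                       # Eq. 8
    A = torch.softmax(Q @ K.T / d_k ** 0.5, dim=-1)        # Eq. 9: attention weights
    return A @ V                                           # Eq. 10: updated features U_i

U = torch.cat([head(S) for _ in range(n_heads)], dim=-1)   # Eq. 11: concatenated head outputs
```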

Algorithm 1 Two-stream Multi-level Dynamic Point Transformer.
Input: Depth video
Output: Classification result
Initialisation:
1:  Point cloud frames ← Segment(Depth video)
2:  for each point cloud frame do
3:      Downsampling(Point cloud frame)
4:  end for
5:  IFS Step I
LOOP Process:
6:  IFS Step II
7:  for each point cloud frame do
8:      Local-region feature ← SA(SA′(Point cloud frame))
9:      Frame spatial feature ← MLP(MAX(Local-region feature))
10:     Frame temporal feature ← MLP(Frame spatial feature, Temporal positional feature)
11:  end for
12:  if Global stream then
13:     Local-region spatial feature (L_g) ← MAX(Local-region features)
14:     Appearance feature (A_g) ← MAX(Frame spatial features)
15:     Motion feature (M_g) ← MAX(Frame temporal features)
16:     Global feature (S_g) ← Concat(L_g, A_g, M_g)
17:  end if
18:  if Partial stream then
19:     for Temporal split (ts) ∈ [1, 6] do
20:        Local-region spatial feature (L_ts) ← MAX_ts(Local-region features)
21:        Appearance feature (A_ts) ← MAX_ts(Frame spatial features)
22:        Motion feature (M_ts) ← MAX_ts(Frame temporal features)
23:        Partial feature (S_ts) ← Concat(L_ts, A_ts, M_ts)
24:     end for
25:  end if
26:  Classification result ← Classifier(Transformer(S_g, S_1, …, S_6), S_g)
27:  return Classification result

3.5. Algorithm

The overall process of our Two-stream Multi-level Dynamic Point Transformer is depicted in Algorithm 1. Steps 1 to 5 pertain to the pre-processing data stage, primarily involving the conversion of the depth video into a collection of point cloud frames, downsampling of the point cloud frames, and the execution of IFS Step I. From Step 6 onwards, the algorithm is executed in loops during the training process and only once during the testing process. Step 6 corresponds to IFS Step II. Steps 7 to 11 represent the frame features learning module as described in Section 3.2. Steps 12 to 25 depict our two-stream multi-level feature aggregation module, responsible for obtaining both global and partial features, as discussed in Section 3.3. Finally, the last step is the classifier that utilizes Transformer in Section 3.4, employed to obtain the final recognition results.

4. Experiments

4.1. Datasets

4.1.1. NTU RGB+D 60

NTU RGB+D 60 (Shahroudy et al., 2016) is a large-scale dataset for 3D human activity understanding. It is collected from 40 distinct human subjects and contains 4 million frames and 56,000 samples. These samples are captured by Microsoft Kinect v2 from 80 different viewpoints. For the same action, 3 cameras are used simultaneously to capture 3 different views. NTU RGB+D 60 has a two-person interaction subset with 11 different classes and 10,428 interaction video samples. To set a standard for evaluating results on the dataset, the authors define two interaction recognition evaluation settings: cross-subject evaluation and cross-view evaluation. Under the cross-subject setting, the 40 human actors are separated into two groups of 20 actors each, used for training and testing, respectively. In the cross-view setting, the samples from camera 2 and camera 3 are used for training, and the remaining samples from camera 1 are used for testing.

4.1.2. NTU RGB+D 120

NTU RGB+D 120 (Liu et al., 2020a) is the largest dataset for 3D action recognition. It extends NTU RGB+D 60 with a large amount of extra data, adding 60 additional classes. In total, it consists of more than 114,000 video samples and 8 million frames, collected from 106 different human subjects. 32 collection setups are used to build the dataset, and the location and background change across setups. It contains a two-person interaction subset with 24,828 video samples and 26 different classes. The authors also suggest two standard evaluation settings for NTU RGB+D 120: cross-subject evaluation and cross-setup evaluation. The cross-subject evaluation follows the same rule as in NTU RGB+D 60, with a subset of human actors used for training and the rest used for testing. For the cross-setup evaluation, the sample videos from even setup IDs are used for training and the ones from odd setup IDs for testing.

4.2. Implementation details

The Top and Bottom frame rates are set to 50 and 24, respectively, in the Interval Frame Sampling (IFS). That is, 24 frames per video are used during training. We initially randomly sample 2048 points from each point cloud frame. Then, 512 points are further sampled from these 2048 points using the Farthest Point Sampling (FPS) method. Within the frame features learning module, FPS is applied again to select 128 centroids in the first set abstraction level and 32 centroids in the second level. In these two levels, the ball query radii are set to 0.06 and 0.1. In the partial stream of the two-stream feature aggregation module, a point cloud video (containing 24 frames) is split into 6 consecutive temporal segments with an overlap ratio of 0, so each temporal segment contains 4 frames from which a partial feature is generated. We stack 5 identical transformer blocks within the transformer, each with 18 heads. We train our network for 90 epochs in total. Adam (Kingma and Ba, 2015) is selected as the optimizer with a batch size of 32. The learning rate starts at 0.001 and decays by a factor of 0.5 every 10 epochs. The same data augmentation method as in (Wang et al., 2020) is applied in this work.
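For convenience, the settings above can be collected into a single configuration, as in the illustrative sketch below; the key names are ours and do not come from the released code.

```python
# Illustrative summary of the implementation settings listed in Section 4.2.
config = {
    "ifs": {"top_frame_rate": 50, "bottom_frame_rate": 24},
    "points_per_frame": {"initial_random": 2048, "fps": 512},
    "set_abstraction": {"centroids": [128, 32], "ball_radius": [0.06, 0.1]},
    "temporal_segments": {"count": 6, "frames_per_segment": 4, "overlap": 0.0},
    "transformer": {"blocks": 5, "heads": 18},
    "training": {"epochs": 90, "optimizer": "Adam", "batch_size": 32,
                 "lr": 1e-3, "lr_decay": 0.5, "lr_decay_every": 10},
}
```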

Table 1. Comparison of recognition accuracy (%) on the two-person interaction subset of NTU RGB+D 60.
Methods Year Cross-Subject Cross-View
ST-LSTM (Liu et al., 2016) 2016 83.0 87.3
GCA-LSTM (Liu et al., 2017) 2017 85.9 89.0
SPDNet (Huang and Gool, 2017) 2017 74.9 76.1
2S-GCA-LSTM (Liu et al., 2018) 2018 87.2 89.9
ST-GCN (Yan et al., 2018) 2018 83.3 87.1
AS-GCN (Li et al., 2019) 2019 89.3 93.0
2S-AGCN (Shi et al., 2019) 2019 92.4 95.8
FSNET (Liu et al., 2020b) 2020 74.0 80.5
MS-AGCN (Shi et al., 2020) 2020 94.1 96.7
3DV-PointNet++ (Wang et al., 2020) 2020 91.0 92.4
LSTM-IRN (Perez et al., 2022) 2021 90.5 93.5
SB-LSTM (Chiu et al., 2021) 2021 93.9 95.6
GeomNet (Nguyen, 2021) 2021 93.6 96.3
SequentialPointNet (Li et al., 2021) 2021 96.0 98.1
2s-AIGCN (Gao et al., 2022) 2022 95.3 98.0
3s-EGCN-IIG (Ito et al., 2022) 2022 96.6 98.7
JointContrast (Zhang et al., 2023) 2023 94.1 96.8
TMDPT (Ours) 2023 96.6 99.0

4.3. Comparison with the state-of-the-art methods

We compare our network’s performance to the state-of-the-art methods on the two large-scale 3D two-person interaction datasets, the interaction subsets of NTU RGB+D 60 and NTU RGB+D 120. The comparison results can be seen in Table 1 and Table 2. Impressively, our network achieves remarkable results, consistently outperforming most other methods in the four standard evaluation settings of these two challenging datasets.

  • ST-LSTM (Liu et al., 2016): The authors propose Spatio-temporal LSTM with trust gates for 3D human action recognition.

  • GCA-LSTM (Liu et al., 2017): The authors propose Global Context-aware Attention LSTM Networks for 3D action recognition.

  • SPDNet (Huang and Gool, 2017): The authors construct a Riemannian network structure with Symmetric Positive Definite (SPD) matrix nonlinear learning in a deep model for visual classification tasks.

  • 2S-GCA-LSTM (Liu et al., 2018): The authors propose a Two-stream Global Context-aware Attention LSTM network for human action recognition.

  • ST-GCN (Yan et al., 2018): The authors propose Spatio-temporal Graph Convolutional Networks for automatic learning of spatio-temporal patterns in data.

  • AS-GCN (Li et al., 2019): The authors introduce Actional Links in Graph Convolutional Networks to capture potential dependencies directly from actions, thus completing skeleton-based action recognition.

  • 2S-AGCN (Shi et al., 2019): The authors propose Two-stream Adaptive Graph Convolutional Networks for skeleton-based action recognition. The topology of the graph can be learned end-to-end by the BP algorithm.

  • FSNET (Liu et al., 2020b): The authors focus on online action prediction for streaming 3D skeleton sequences. The authors introduce expanded convolutional networks to model the dynamics by sliding windows on the time axis.

  • ST-GCN-PAM (Yang et al., 2020): The authors propose a method based on Spatial-temporal Graph Convolutional Networks, which uses the pairwise adjacency matrix to capture the relationship of person-person skeletons.

  • MS-AGCN (Shi et al., 2020): The authors propose Multi-stream Attention-enhanced Adaptive Graph Convolutional Neural Network for skeleton action recognition. Graph topologies can be learned uniformly or individually based on data in an end-to-end manner, and this data-driven approach increases the flexibility of the model.

  • 3DV-PointNet++ (Wang et al., 2020): The key to 3D Dynamic Voxel is to compactly encode the motion information in the depth video into a regular voxel set via a temporal rank pool, which is then transformed into point cloud data for processing.

  • LSTM-IRN (Perez et al., 2022): The authors propose an Interaction Relational Network that uses minimal prior knowledge about the structure of the human body to drive the network to identify the related body parts and use LSTM for inference.

  • SB-LSTM (Chiu et al., 2021): The Stacked Bidirectional LSTM uses two vectors to encode joint dynamics and spatial interaction information.

  • GeomNet (Nguyen, 2021): This is a method of two-person interaction recognition by using 3D skeleton sequences. Its key idea is to use Gaussian distributions to capture statistics on RnR^{n} and those on the space of Symmetric Positive Definite (SPD) matrices.

  • SequentialPointNet (Li et al., 2021): The authors propose a frame-level parallel point cloud sequence network; specifically, the authors flatten the point cloud sequence into hyperpoint sequences and then use the proposed Hyperpoint-Mixer module to process the spatial and temporal features of the frames.

  • 2s-AIGCN (Gao et al., 2022): The authors propose the Attention Interactive Graph Convolutional Network (AIGCN), which employs an Interactive Attention Encoding GCN (IAE-GCN) to extract interactive spatial structures and an Interactive-Attention Mask TCN (IAM-TCN) to discern temporal interactive features.

  • 3s-EGCN-IIG (Ito et al., 2022): The authors propose a method for recognizing interactions between two individuals by employing factorized convolution and analyzing the distances between joints.

  • JointContrast (Zhang et al., 2023): The authors propose the Interaction Information Embedding Skeleton Graph Representation (IE-Graph) model, which utilizes unsupervised pre-training.

4.3.1. NTU RGB+D 60

As Table 1 shows, our network achieves impressive results of 96.6% and 99.0% on this large-scale dataset under the cross-subject and cross-view evaluation settings, respectively. Both results are the best among all compared methods on this dataset. This shows that our network is robust against subject and view variation in interaction recognition tasks. One main reason for this success is that our network design enables TMDPT to obtain crucial global and partial information to complete interaction recognition tasks effectively.

Table 2. Comparison of recognition accuracy (%) on the two-person interaction subset of NTU RGB+D 120.
Methods Year Cross-Subject Cross-Setup
ST-LSTM (Liu et al., 2016) 2016 63.0 66.6
GCA-LSTM (Liu et al., 2017) 2017 70.6 73.7
SPDNet (Huang and Gool, 2017) 2017 60.7 62.1
2S-GCA-LSTM (Liu et al., 2018) 2018 73.0 73.3
ST-GCN (Yan et al., 2018) 2018 78.9 76.1
AS-GCN (Li et al., 2019) 2019 82.9 83.7
2S-AGCN (Shi et al., 2019) 2019 86.1 88.1
FSNET (Liu et al., 2020b) 2020 61.2 69.7
ST-GCN-PAM (Yang et al., 2020) 2020 83.3 88.4
MS-AGCN (Shi et al., 2020) 2020 87.7 90.5
3DV-PointNet++ (Wang et al., 2020) 2020 84.4 84.5
LSTM-IRN (Perez et al., 2022) 2021 77.7 79.6
SB-LSTM (Chiu et al., 2021) 2021 83.9 83.4
GeomNet (Nguyen, 2021) 2021 86.5 87.6
SequentialPointNet (Li et al., 2021) 2021 91.8 92.2
2s-AIGCN (Gao et al., 2022) 2022 90.7 90.7
3s-EGCN-IIG (Ito et al., 2022) 2022 92.4 95.5
JointContrast (Zhang et al., 2023) 2023 88.2 88.9
TMDPT (Ours) 2023 92.7 93.3

4.3.2. NTU RGB+D 120

Compared with NTU RGB+D 60, it is even harder to conduct interaction analysis on NTU RGB+D 120 (see point cloud interaction examples in Figure 6). This is not only because NTU RGB+D 120 contains a larger amount of data than NTU RGB+D 60, but also because it has much higher variability in many aspects, including action categories, subjects, camera views, and environments. As can be seen in Table 2, despite facing more challenges, our network still achieves excellent results. Achieving an accuracy of 92.7% under the cross-subject setting and 93.3% under the cross-setup setting, our network again outperforms most of the state-of-the-art approaches. This demonstrates the effectiveness of our network for handling two-person interaction recognition tasks under complex subject, environment, and viewpoint conditions.

Refer to caption


Figure 6. Point cloud two-person interaction examples converted from the depth dataset of NTU RGB+D 120.

4.4. Ablation study

A comprehensive ablation study is performed to extensively test the effectiveness of various aspects of our network design. All tests are conducted on NTU RGB+D 60 under the cross-view evaluation setting.

4.4.1. Effectiveness of Two-stream feature aggregation structure

In our network, a two-stream feature aggregation structure is proposed to aggregate global and partial information simultaneously. To evaluate its effectiveness, we create a variant network by removing the two-stream component together with the subsequent transformer and using only a global feature for the final classification. The performance comparison results can be seen in Table 3. As the results show, the proposed two-stream structure improves the performance of our network by 1.17%. Considering that both results are close to 100%, such a performance enhancement is a notable improvement. This result demonstrates the effectiveness of our two-stream structure and reveals that both global and partial information are important for interaction recognition.

Table 3. Effectiveness of two-stream feature aggregation structure on the two-person interaction subset of NTU RGB+D 60 under cross-view test setting.
Network Structure Accuracy
Without Two-stream aggregation structure 97.85
With Two-stream aggregation structure (proposed) 99.02

4.4.2. Effectiveness of Transformer

We propose to use a transformer to perform self-attention on the global and partial features. To demonstrate the effectiveness of the transformer, we create a test case in which the transformer is removed from our proposed network and compare its performance with that of our proposed one. From the results listed in Table 4, we observe that the recognition accuracy drops by 0.60% without the transformer. This shows that the self-attention process brings meaningful extra information for interaction reasoning.

Table 4. Effectiveness of transformer on the two-person interaction subset of NTU RGB+D 60 under cross-view test setting.
Network Structure Accuracy
Without Transformer 98.42
With Transformer (proposed) 99.02

4.4.3. Analysis of Transformer related design choices

To analyze the transformer input, we report the result of a variant network in which the transformer only takes the global and partial motion features as input (without the aggregated local-region spatial features and appearance features). The results in Table 5 demonstrate that performing self-attention on the more comprehensive multi-level features (proposed), instead of only on the motion features, helps our network achieve better performance. Although local-region features, frame spatial features, and frame temporal features are produced sequentially in the model, the aggregated local-region spatial features and appearance features still contain key information that is absent from the motion features, which explains the accuracy improvement.

Table 5. Comparison of recognition accuracy (%) on the two-person interaction subset of NTU RGB+D 60 under cross-view test setting with using different transformer inputs.
Transformer input Accuracy
Motion features 98.49
Multi-level features (proposed) 99.02

Additionally, we investigate the performance variation of our network when using different feature aggregation methods (Max pooling and MLP) for merging the output of the transformer. As the results in Table 6 indicate, the MLP (proposed) provides a better outcome than Max pooling. The simple Max pooling operation has no learnable parameters and therefore cannot adapt to the data samples as well. Thus, an MLP is used as the transformer aggregation method in our network.

Table 6. Comparison of recognition accuracy (%) on the two-person interaction subset of NTU RGB+D 60 under cross-view test setting with using different transformer output aggregation methods.
Transformer Aggregation Methods Accuracy
Max Pooling 98.57
MLP (proposed) 99.02

4.4.4. Effectiveness of Interval Frame Sampling (IFS)

Some advanced methods (Perez et al., 2022; Chiu et al., 2021; Nguyen, 2021; Li et al., 2021) use a fixed frame rate to sample videos. To verify the effectiveness of Interval Frame Sampling (IFS), we test five different cases. In the first four cases, we use a traditional frame-selection method to evenly sample a fixed 6, 12, 24, and 50 frames per interaction video, respectively. Here, the frames representing a particular interaction video do not change during the training period. In the last case, we use our proposed IFS with a Top frame rate of 50 and a Bottom frame rate of 24.

Refer to caption

Figure 7. Comparison of frame sampling methods on the two-person interaction subset of NTU RGB+D 60 under cross-view test setting: (A) recognition accuracy (%), (B) running time per video (ms).

We first compare the recognition accuracy of these five cases. As shown in Figure 7 (A), more frames lead to better performance when using the traditional frame selection method. However, once the number of frames exceeds 24, the performance plateaus. This is because a higher frame rate brings both extra helpful information and useless redundant information: while the former helps improve performance, the latter can distract the network as noise. The result of our proposed approach indicates that this redundancy issue can be alleviated by IFS. Compared with the two cases using fixed 24 and 50 frames, IFS yields extra accuracy gains of 0.37% and 0.35%, respectively. These are considerable performance enhancements, considering the accuracy is already close to 100%.

We also compare the running time (per video) of using fixed 24 and 50 frames with that of our IFS. Figure 7 (B) shows that our IFS has a processing time very similar to the case using a fixed 24 frames.

All these results demonstrate that IFS effectively improves both the efficiency and accuracy of recognition. IFS benefits from both sides: it surpasses the accuracy of using 50 fixed frames, while its running time is equivalent to using only 24 frames. This double advantage reflects the strength of IFS.

Refer to caption

Figure 8. Recognition accuracy (%) on the two-person interaction subset of NTU RGB+D 60 under cross-view test setting: (A) different numbers of transformer blocks, (B) different numbers of transformer heads per block.

4.4.5. Model hyperparameter analysis

A series of model hyperparameter analyses is conducted in this work. First, we investigate the impact of the number of transformer blocks. As shown in Figure 8 (A), the accuracy first improves as more blocks are used, reaching a peak when the transformer contains 5 blocks; it then starts to drop as additional blocks are added. This result indicates that TMDPT achieves better performance with multiple transformer blocks. However, if too many blocks are stacked together, issues such as exploding or vanishing gradients can arise and adversely affect performance.

Then, we analyze the influence of the number of heads on the performance. Figure 8 (B) shows that the multi-head design can improve TMDPT’s learning ability, as long as not too many heads are used. One main reason for this result is that a transformer with a reasonable number of heads can jointly attend to information from different representation subspaces at different positions to improve performance. However, when there are too many heads, each head may not carry enough information to perform self-attention effectively.

Table 7. Comparison of recognition accuracy (%) on the two-person interaction subset of NTU RGB+D 60 under cross-view test setting with using different temporal split methods.
Temporal split methods Accuracy
2 frames per segment with an overlap ratio of 0 98.21
6 frames per segment with an overlap ratio of 0 98.69
4 frames per segment with an overlap ratio of 0.5 98.36
4 frames per segment with an overlap ratio of 0 (Proposed) 99.02

Finally, we test 4 cases using different temporal split methods. As Table 7 shows, our proposed split method achieves the best result. Under this method, each video is split into 6 segments with an overlap ratio of 0. Segments that are too long or too short reduce accuracy due to underfitting or overfitting; moreover, we find that, after IFS sampling, the segments do not need to overlap. The result indicates that the proposed split method helps our network extract more useful partial information for interaction learning.

4.4.6. Model complexity analysis

Table 8. Model complexity analysis.
Methods FLOPs (billion) Parameters (million) Relative memory Relative time
SequentialPointNet (Li et al., 2021) 54.80 2.98 2.61 2.75
TMDPT w/o Transformer (Ours) 2.55 6.56 0.69 0.36
TMDPT (Ours) 7.09 47.06 1 1

We perform a model complexity analysis of our TMDPT and compare it with SequentialPointNet (Li et al., 2021), which also adopts a point cloud-based approach. As shown in Table 8, TMDPT has lower FLOPs but more parameters than SequentialPointNet. Because our model contains the two-stream multi-level feature aggregation and the transformer module, it has significantly more parameters than SequentialPointNet; nevertheless, thanks to the design of the network and its efficient operations and implementation, TMDPT still leads SequentialPointNet in FLOPs. We also compare the resource consumption in actual computation: SequentialPointNet consumes 2.61 times more memory and 2.75 times more time than TMDPT. We find that the multi-head mechanism in the transformer accounts for a large share of the model's parameters, so we also report the complexity of TMDPT w/o Transformer. Without the transformer, a large amount of computational resources can be saved. Comparing Tables 1 and 4, we find that on the two-person interaction subset of NTU RGB+D 60 under the cross-view test setting, even without the transformer, TMDPT w/o Transformer (98.4%) exceeds SequentialPointNet (98.1%). This shows that, in scenarios with limited computational resources, leading results can still be obtained with TMDPT w/o Transformer.
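For readers who want to reproduce such comparisons, the sketch below shows one way to measure parameter count, peak GPU memory, and average forward time with plain PyTorch; FLOP counting typically requires an external profiler and is omitted here. The helper name and measurement protocol are illustrative assumptions and may differ from how the numbers in Table 8 were obtained.

import time
import torch

def complexity_report(model, example_input, repeats=20):
    """Report parameter count, peak GPU memory, and average forward time."""
    params_m = sum(p.numel() for p in model.parameters()) / 1e6
    device = next(model.parameters()).device
    model.eval()
    if device.type == "cuda":
        torch.cuda.reset_peak_memory_stats(device)
    with torch.no_grad():
        start = time.perf_counter()
        for _ in range(repeats):
            model(example_input)
        if device.type == "cuda":
            torch.cuda.synchronize(device)
        avg_time = (time.perf_counter() - start) / repeats
    peak_mem_mb = (torch.cuda.max_memory_allocated(device) / 2**20
                   if device.type == "cuda" else float("nan"))
    return {"params (M)": params_m,
            "peak GPU memory (MB)": peak_mem_mb,
            "time per forward (s)": avg_time}

The relative memory and time in Table 8 can then be read as ratios of such measurements against the TMDPT row, which is normalized to 1.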

4.4.7. Model robustness analysis

Table 9. Model robustness analysis on the two-person interaction subset of NTU RGB+D 60 under cross-view test setting.
Methods 512 points (accuracy %) 256 points (accuracy %)
SequentialPointNet (Li et al., 2021) 98.1 84.6
TMDPT (Ours) 99.0 96.8

Our model takes point clouds converted from depth videos as input, which avoids the errors accumulated in skeleton estimation that affect skeleton-based methods. However, the number of points sampled from the point cloud directly affects the model's performance, so we compare sampling 512 points (baseline) with sampling 256 points. The experiments are performed on the two-person interaction subset of NTU RGB+D 60 under the cross-view test setting, and we compare against SequentialPointNet (Li et al., 2021), which is also point cloud-based. As seen in Table 9, SequentialPointNet is only 0.9% lower than TMDPT when sampling 512 points, but 12.2% lower when sampling 256 points. This shows that our model maintains strong performance with fewer sampled points and is more robust than SequentialPointNet.
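To illustrate the point-budget setting, the sketch below downsamples a per-frame point cloud to 512 or 256 points using random sampling without replacement. The random-sampling rule and the function name are assumptions for illustration; the actual pipeline may use a different strategy such as farthest point sampling.

import numpy as np

def random_downsample(points, num_points=512, seed=0):
    """Downsample an (N, 3) point cloud to num_points points.

    Random sampling is used here for illustration; farthest point sampling is a
    common, more structure-preserving alternative in point cloud pipelines.
    """
    rng = np.random.default_rng(seed)
    n = points.shape[0]
    # Sample without replacement when possible, otherwise pad by resampling.
    idx = rng.choice(n, size=num_points, replace=(n < num_points))
    return points[idx]

frame_cloud = np.random.rand(2048, 3)  # points converted from one depth frame
print(random_downsample(frame_cloud, 512).shape)  # (512, 3)
print(random_downsample(frame_cloud, 256).shape)  # (256, 3)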

5. Conclusion

In this study, we focus on the task of two-person interaction recognition and place a strong emphasis on privacy. To address this concern, we exclusively use depth video data for our research. We propose a novel point cloud-based network named Two-stream Multi-level Dynamic Point Transformer (TMDPT) for two-person interaction recognition. TMDPT is composed of four main components: a unique frame sampling scheme, a frame features learning module, a two-stream multi-level feature aggregation module, and a transformer-based classification module. The frame sampling scheme effectively selects key frames from the input video. The frame features learning module then extracts frame-level features, while the two-stream multi-level feature aggregation module combines global and partial features. Finally, the transformer-based classification module performs the two-person interaction classification. Our network is extensively evaluated on two large-scale 3D interaction datasets, namely the interaction subsets of NTU RGB+D 60 and NTU RGB+D 120, and the experimental results demonstrate the superiority of the proposed network for two-person interaction recognition.

References

  • Bertasius et al. (2021) Gedas Bertasius, Heng Wang, and Lorenzo Torresani. 2021. Is Space-Time Attention All You Need for Video Understanding?. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event (Proceedings of Machine Learning Research, Vol. 139), Marina Meila and Tong Zhang (Eds.). PMLR, 813–824. http://proceedings.mlr.press/v139/bertasius21a.html
  • Bloom et al. (2016) Victoria Bloom, Vasileios Argyriou, and Dimitrios Makris. 2016. Hierarchical transfer learning for online recognition of compound actions. Comput. Vis. Image Underst. 144 (2016), 62–72. https://doi.org/10.1016/j.cviu.2015.12.001
  • Chiu et al. (2021) Shian-Yu Chiu, Kun-Ru Wu, and Yu-Chee Tseng. 2021. Two-Person Mutual Action Recognition Using Joint Dynamics and Coordinate Transformation. In CAIP 2021: Proceedings of the 1st International Conference on AI for People: Towards Sustainable AI, CAIP 2021, 20-24 November 2021, Bologna, Italy. European Alliance for Innovation, 56.
  • Dhiman et al. (2021) Chhavi Dhiman, Dinesh Kumar Vishwakarma, and Paras Agarwal. 2021. Part-wise Spatio-temporal Attention Driven CNN-based 3D Human Action Recognition. ACM Trans. Multim. Comput. Commun. Appl. 17, 3 (2021), 86:1–86:24. https://doi.org/10.1145/3441628
  • Dosovitskiy et al. (2021) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net. https://openreview.net/forum?id=YicbFdNTTy
  • Fan et al. (2021a) Haoqi Fan, Bo Xiong, Karttikeya Mangalam, Yanghao Li, Zhicheng Yan, Jitendra Malik, and Christoph Feichtenhofer. 2021a. Multiscale Vision Transformers. In 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021. IEEE, 6804–6815. https://doi.org/10.1109/ICCV48922.2021.00675
  • Fan et al. (2021b) Hehe Fan, Yi Yang, and Mohan S. Kankanhalli. 2021b. Point 4D Transformer Networks for Spatio-Temporal Modeling in Point Cloud Videos. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2021, virtual, June 19-25, 2021. Computer Vision Foundation / IEEE, 14204–14213. https://doi.org/10.1109/CVPR46437.2021.01398
  • Gao et al. (2022) Feng Gao, Hailun Xia, and Zhihao Tang. 2022. Attention Interactive Graph Convolutional Network for Skeleton-Based Human Interaction Recognition. In IEEE International Conference on Multimedia and Expo, ICME 2022, Taipei, Taiwan, July 18-22, 2022. IEEE, 1–6. https://doi.org/10.1109/ICME52920.2022.9859618
  • Gemeren et al. (2014) Coert Van Gemeren, Robby T. Tan, Ronald Poppe, and Remco C. Veltkamp. 2014. Dyadic Interaction Detection from Pose and Flow. In Human Behavior Understanding - 5th International Workshop, HBU 2014, Zurich, Switzerland, September 12, 2014. Proceedings (Lecture Notes in Computer Science, Vol. 8749), Hyun Soo Park, Albert Ali Salah, Yong Jae Lee, Louis-Philippe Morency, Yaser Sheikh, and Rita Cucchiara (Eds.). Springer, 101–115. https://doi.org/10.1007/978-3-319-11839-0_9
  • Hammerla et al. (2016) Nils Y. Hammerla, Shane Halloran, and Thomas Plötz. 2016. Deep, Convolutional, and Recurrent Models for Human Activity Recognition Using Wearables. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI 2016, New York, NY, USA, 9-15 July 2016. IJCAI/AAAI Press, 1533–1540. http://www.ijcai.org/Abstract/16/220
  • Harjanto et al. (2016) Fredro Harjanto, Zhiyong Wang, Shiyang Lu, Ah Chung Tsoi, and David Dagan Feng. 2016. Investigating the impact of frame rate towards robust human action recognition. Signal Process. 124 (2016), 220–232. https://doi.org/10.1016/j.sigpro.2015.08.006
  • Huang et al. (2018) Min Huang, Song-Zhi Su, Hongbo Zhang, Guo-Rong Cai, Dong-Ying Gong, Donglin Cao, and Shao-Zi Li. 2018. Multifeature Selection for 3D Human Action Recognition. ACM Trans. Multim. Comput. Commun. Appl. 14, 2 (2018), 45:1–45:18. https://doi.org/10.1145/3177757
  • Huang and Gool (2017) Zhiwu Huang and Luc Van Gool. 2017. A Riemannian Network for SPD Matrix Learning. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA, Satinder Singh and Shaul Markovitch (Eds.). AAAI Press, 2036–2042. http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14633
  • Ito et al. (2022) Yoshiki Ito, Quan Kong, Kenichi Morita, and Tomoaki Yoshinaga. 2022. Efficient and Accurate Skeleton-Based Two-Person Interaction Recognition Using Inter-and Intra-Body Graphs. In 2022 IEEE International Conference on Image Processing, ICIP 2022, Bordeaux, France, 16-19 October 2022. IEEE, 231–235. https://doi.org/10.1109/ICIP46576.2022.9897250
  • Khan et al. (2022) Salman Khan, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir, Fahad Shahbaz Khan, and Mubarak Shah. 2022. Transformers in Vision: A Survey. ACM Comput. Surv. 54, 10s, Article 200 (sep 2022), 41 pages. https://doi.org/10.1145/3505244
  • Kingma and Ba (2015) Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Yoshua Bengio and Yann LeCun (Eds.). http://arxiv.org/abs/1412.6980
  • Li et al. (2019) Maosen Li, Siheng Chen, Xu Chen, Ya Zhang, Yanfeng Wang, and Qi Tian. 2019. Actional-Structural Graph Convolutional Networks for Skeleton-Based Action Recognition. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019. Computer Vision Foundation / IEEE, 3595–3603. https://doi.org/10.1109/CVPR.2019.00371
  • Li et al. (2021) Xing Li, Qian Huang, Zhijian Wang, Zhenjie Hou, and Tianjin Yang. 2021. SequentialPointNet: A strong parallelized point cloud sequence network for 3D action recognition. CoRR abs/2111.08492 (2021). arXiv:2111.08492 https://arxiv.org/abs/2111.08492
  • Lin et al. (2020) Zhi-Hao Lin, Sheng-Yu Huang, and Yu-Chiang Frank Wang. 2020. Convolution in the Cloud: Learning Deformable Kernels in 3D Graph Convolution Networks for Point Cloud Analysis. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020. Computer Vision Foundation / IEEE, 1797–1806. https://doi.org/10.1109/CVPR42600.2020.00187
  • Liu et al. (2019) Bangli Liu, Haibin Cai, Zhaojie Ju, and Honghai Liu. 2019. RGB-D sensing based human action and interaction analysis: A survey. Pattern Recognit. 94 (2019), 1–12. https://doi.org/10.1016/j.patcog.2019.05.020
  • Liu et al. (2020a) Jun Liu, Amir Shahroudy, Mauricio Perez, Gang Wang, Ling-Yu Duan, and Alex C. Kot. 2020a. NTU RGB+D 120: A Large-Scale Benchmark for 3D Human Activity Understanding. IEEE Trans. Pattern Anal. Mach. Intell. 42, 10 (2020), 2684–2701. https://doi.org/10.1109/TPAMI.2019.2916873
  • Liu et al. (2020b) Jun Liu, Amir Shahroudy, Gang Wang, Ling-Yu Duan, and Alex C. Kot. 2020b. Skeleton-Based Online Action Prediction Using Scale Selection Network. IEEE Trans. Pattern Anal. Mach. Intell. 42, 6 (2020), 1453–1467. https://doi.org/10.1109/TPAMI.2019.2898954
  • Liu et al. (2016) Jun Liu, Amir Shahroudy, Dong Xu, and Gang Wang. 2016. Spatio-Temporal LSTM with Trust Gates for 3D Human Action Recognition. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part III (Lecture Notes in Computer Science, Vol. 9907), Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling (Eds.). Springer, 816–833. https://doi.org/10.1007/978-3-319-46487-9_50
  • Liu et al. (2018) Jun Liu, Gang Wang, Ling-Yu Duan, Kamila Abdiyeva, and Alex C. Kot. 2018. Skeleton-Based Human Action Recognition With Global Context-Aware Attention LSTM Networks. IEEE Trans. Image Process. 27, 4 (2018), 1586–1599. https://doi.org/10.1109/TIP.2017.2785279
  • Liu et al. (2017) Jun Liu, Gang Wang, Ping Hu, Ling-Yu Duan, and Alex C. Kot. 2017. Global Context-Aware Attention LSTM Networks for 3D Action Recognition. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017. IEEE Computer Society, 3671–3680. https://doi.org/10.1109/CVPR.2017.391
  • Liu et al. (2022) Yao Liu, Lina Yao, Binghao Li, Claude Sammut, and Xiaojun Chang. 2022. Interpolation graph convolutional network for 3D point cloud analysis. Int. J. Intell. Syst. 37, 12 (2022), 12283–12304. https://doi.org/10.1002/INT.23087
  • Nguyen (2021) Xuan Son Nguyen. 2021. GeomNet: A Neural Network Based on Riemannian Geometries of SPD Matrix Space and Cholesky Space for 3D Skeleton-Based Interaction Recognition. In 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021. IEEE, 13359–13369. https://doi.org/10.1109/ICCV48922.2021.01313
  • Perez et al. (2022) Mauricio Perez, Jun Liu, and Alex C. Kot. 2022. Interaction Relational Network for Mutual Action Recognition. IEEE Trans. Multim. 24 (2022), 366–376. https://doi.org/10.1109/TMM.2021.3050642
  • Qi et al. (2017a) Charles Ruizhongtai Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. 2017a. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017. IEEE Computer Society, 77–85. https://doi.org/10.1109/CVPR.2017.16
  • Qi et al. (2017b) Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J. Guibas. 2017b. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett (Eds.). 5099–5108. https://proceedings.neurips.cc/paper/2017/hash/d8bf84be3800d12f74d8b05e9b89836f-Abstract.html
  • Sargano et al. (2017) Allah Bux Sargano, Plamen Angelov, and Zulfiqar Habib. 2017. A comprehensive review on handcrafted and learning-based action representation approaches for human activity recognition. applied sciences 7, 1 (2017), 110.
  • Shahroudy et al. (2016) Amir Shahroudy, Jun Liu, Tian-Tsong Ng, and Gang Wang. 2016. NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016. IEEE Computer Society, 1010–1019. https://doi.org/10.1109/CVPR.2016.115
  • Shi et al. (2019) Lei Shi, Yifan Zhang, Jian Cheng, and Hanqing Lu. 2019. Two-Stream Adaptive Graph Convolutional Networks for Skeleton-Based Action Recognition. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019. Computer Vision Foundation / IEEE, 12026–12035. https://doi.org/10.1109/CVPR.2019.01230
  • Shi et al. (2020) Lei Shi, Yifan Zhang, Jian Cheng, and Hanqing Lu. 2020. Skeleton-Based Action Recognition With Multi-Stream Adaptive Graph Convolutional Networks. IEEE Trans. Image Process. 29 (2020), 9532–9545. https://doi.org/10.1109/TIP.2020.3028207
  • Shi et al. (2022) Qinghongya Shi, Hongbo Zhang, Zhe Li, Ji-Xiang Du, Qing Lei, and Jing-Hua Liu. 2022. Shuffle-invariant Network for Action Recognition in Videos. ACM Trans. Multim. Comput. Commun. Appl. 18, 3 (2022), 69:1–69:18. https://doi.org/10.1145/3485665
  • Stergiou and Poppe (2019) Alexandros Stergiou and Ronald Poppe. 2019. Analyzing human-human interactions: A survey. Comput. Vis. Image Underst. 188 (2019). https://doi.org/10.1016/j.cviu.2019.102799
  • Tang et al. (2022) Yansong Tang, Xingyu Liu, Xumin Yu, Danyang Zhang, Jiwen Lu, and Jie Zhou. 2022. Learning from Temporal Spatial Cubism for Cross-Dataset Skeleton-based Action Recognition. ACM Trans. Multim. Comput. Commun. Appl. 18, 2 (2022), 46:1–46:24. https://doi.org/10.1145/3472722
  • Trabelsi et al. (2017) Rim Trabelsi, Jagannadan Varadarajan, Yong Pei, Le Zhang, Issam Jabri, Ammar Bouallegue, and Pierre Moulin. 2017. Robust Multi-Modal Cues for Dyadic Human Interaction Recognition. In Proceedings of the Workshop on Multimodal Understanding of Social, Affective and Subjective Attributes, MUSA2@MM 2017, Mountain View, CA, USA, October 27, 2017, Xavier Alameda-Pineda, Miriam Redi, Mohammad Soleymani, Nicu Sebe, Shih-Fu Chang, and Samuel D. Gosling (Eds.). ACM, 47–53. https://doi.org/10.1145/3132515.3132517
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett (Eds.). 5998–6008.
  • Wang et al. (2018) Pichao Wang, Wanqing Li, Philip Ogunbona, Jun Wan, and Sergio Escalera. 2018. RGB-D-based human motion recognition with deep learning: A survey. Comput. Vis. Image Underst. 171 (2018), 118–139. https://doi.org/10.1016/j.cviu.2018.04.007
  • Wang et al. (2022) Qiang Wang, Gan Sun, Jiahua Dong, Qianqian Wang, and Zhengming Ding. 2022. Continuous Multi-View Human Action Recognition. IEEE Trans. Circuits Syst. Video Technol. 32, 6 (2022), 3603–3614. https://doi.org/10.1109/TCSVT.2021.3112214
  • Wang et al. (2020) Yancheng Wang, Yang Xiao, Fu Xiong, Wenxiang Jiang, Zhiguo Cao, Joey Tianyi Zhou, and Junsong Yuan. 2020. 3DV: 3D Dynamic Voxel for Action Recognition in Depth Video. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020. Computer Vision Foundation / IEEE, 508–517. https://doi.org/10.1109/CVPR42600.2020.00059
  • Woo et al. (2018) Sanghyun Woo, Jongchan Park, Joon-Young Lee, and In So Kweon. 2018. CBAM: Convolutional Block Attention Module. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part VII (Lecture Notes in Computer Science, Vol. 11211), Vittorio Ferrari, Martial Hebert, Cristian Sminchisescu, and Yair Weiss (Eds.). Springer, 3–19. https://doi.org/10.1007/978-3-030-01234-2_1
  • Wu et al. (2018) Huimin Wu, Jie Shao, Xing Xu, Yanli Ji, Fumin Shen, and Heng Tao Shen. 2018. Recognition and Detection of Two-Person Interactive Actions Using Automatically Selected Skeleton Features. IEEE Trans. Hum. Mach. Syst. 48, 3 (2018), 304–310. https://doi.org/10.1109/THMS.2017.2776211
  • Xia et al. (2015) Lu Xia, Ilaria Gori, Jake K. Aggarwal, and Michael S. Ryoo. 2015. Robot-centric Activity Recognition from First-Person RGB-D Videos. In 2015 IEEE Winter Conference on Applications of Computer Vision, WACV 2015, Waikoloa, HI, USA, January 5-9, 2015. IEEE Computer Society, 357–364. https://doi.org/10.1109/WACV.2015.54
  • Xu et al. (2021) Chunyan Xu, Rong Liu, Tong Zhang, Zhen Cui, Jian Yang, and Chunlong Hu. 2021. Dual-Stream Structured Graph Convolution Network for Skeleton-Based Action Recognition. ACM Trans. Multim. Comput. Commun. Appl. 17, 4 (2021), 120:1–120:22. https://doi.org/10.1145/3450410
  • Xu et al. (2022) Haotian Xu, Xiaobo Jin, Qiufeng Wang, Amir Hussain, and Kaizhu Huang. 2022. Exploiting Attention-Consistency Loss For Spatial-Temporal Stream Action Recognition. ACM Trans. Multim. Comput. Commun. Appl. 18, 2s (2022), 119:1–119:15. https://doi.org/10.1145/3538749
  • Yan et al. (2018) Sijie Yan, Yuanjun Xiong, and Dahua Lin. 2018. Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, Sheila A. McIlraith and Kilian Q. Weinberger (Eds.). AAAI Press, 7444–7452. https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/17135
  • Yang et al. (2020) Chao-Lung Yang, Aji Setyoko, Hendrik Tampubolon, and Kai-Lung Hua. 2020. Pairwise Adjacency Matrix on Spatial Temporal Graph Convolution Network for Skeleton-Based Two-Person Interaction Recognition. In IEEE International Conference on Image Processing, ICIP 2020, Abu Dhabi, United Arab Emirates, October 25-28, 2020. IEEE, 2166–2170. https://doi.org/10.1109/ICIP40778.2020.9190680
  • Yun et al. (2012) Kiwon Yun, Jean Honorio, Debaleena Chattopadhyay, Tamara L. Berg, and Dimitris Samaras. 2012. Two-person interaction detection using body-pose features and multiple instance learning. In 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, Providence, RI, USA, June 16-21, 2012. IEEE Computer Society, 28–35. https://doi.org/10.1109/CVPRW.2012.6239234
  • Zhang et al. (2023) Ji Zhang, Xiangze Jia, Zhen Wang, Yonglong Luo, Fulong Chen, Gaoming Yang, and Lihui Zhao. 2023. JointContrast: Skeleton-Based Interaction Recognition with New Representation and Contrastive Learning. Algorithms 16, 4 (2023), 190. https://doi.org/10.3390/A16040190