Free-Form Composition Networks for Egocentric Action Recognition

Haoran Wang, Qinghua Cheng, Baosheng Yu, Yibing Zhan, Dapeng Tao, Liang Ding, and Haibin Ling H. Wang and Q. Cheng are with College of Information Science and Engineering, Northeastern University, Shenyang 110819, China. E-mail: [email protected] [email protected]. Yu is with School of Computer Science, Faculty of Engineering, the University of Sydney, Darlington, NSW 2008, Australia. E-mail: [email protected]. Zhan and L. Ding are with JD Explore Academy, Beijing 100176, China. E-mail: [email protected] [email protected]. Tao is with School of Information Science and Engineering, Yunnan University, Kunming 650091, Yunnan, China. E-mail: [email protected]. Ling is with Department of Computer Science, Stony Brook University, Stony Brook, USA. E-mail: [email protected].

Abstract

Egocentric action recognition is gaining significant attention in the field of human action recognition. In this paper, we address data scarcity issue in egocentric action recognition from a compositional generalization perspective. To tackle this problem, we propose a free-form composition network (FFCN) that can simultaneously learn disentangled verb, preposition, and noun representations, and then use them to compose new samples in the feature space for rare classes of action videos. First, we use a graph to capture the spatial-temporal relations among different hand/object instances in each action video. We thus decompose each action into a set of verb and preposition spatial-temporal representations using the edge features in the graph. The temporal decomposition extracts verb and preposition representations from different video frames, while the spatial decomposition adaptively learns verb and preposition representations from action-related instances in each frame. With these spatial-temporal representations of verbs and prepositions, we can compose new samples for those rare classes in a free-form manner, which is not restricted to a rigid form of a verb and a noun. The proposed FFCN can directly generate new training data samples for rare classes, hence significantly improve action recognition performance. We evaluated our method on three popular egocentric action recognition datasets, Something-Something V2, H2O, and EPIC-KITCHENS-100, and the experimental results demonstrate the effectiveness of the proposed method for handling data scarcity problems, including long-tailed and few-shot egocentric action recognition.

Index Terms:

compositional learning, data scarcity, egocentric action recognition.

I Introduction

Refer to caption — Figure 1: An illustration of the free-form compositional learning for egocentric action recognition. Existing compositional models usually decompose and compose action samples in a rigid combination, e.g., a verb and a noun, which is not suitable for the complex action analysis. In contrast, we extract the features of multiple verbs, prepositions, and nouns appearing in the action class, and then generate new training data for the rare actions in a relatively free-form manner.

Egocentric action recognition has been attracting increasing amount of attention due to its wide range of applications containing robotics, augmented reality, and surveillance. Different from the third-person view, egocentric video recognition mainly aims to identify the fine-grained hand-object interactions at a close distance, where the actions are usually described by a simple combination of a verb and a noun [1, 2]. Following such a formulation, two-stream architectures have been widely used to separately classify the action and the active object with deep neural networks [3] and are further improved by using either hand/object features [4, 5, 6] or attention mechanisms [7, 8]. However, existing methods mainly rely on a large amount of labelled data for training, leaving the data scarcity problem, such as long-tailed, few-shot, and zero-shot classifications, in egocentric action recognition under explored.

Data augmentation methods have been intensively explored for improving the variousness of training samples and preventing overfitting to a certain extent, especially in modern deep learning models. Vanilla augmentation strategies usually perform a set of operations on the original data, such as random rotation and random crop, while more advanced methods have been also proposed for the rare classes to augment the samples in multiple tasks such as semantic segmentation [9, 10], image classification [11, 12], and face recognition [13, 14]. Currently, compositional learning frameworks have proven effective in addressing the challenge of data scarcity in the detection of human-object interactions [15, 16, 17]. In these frameworks, all human-object interactions are broken down and reassembled to create new training samples, typically represented as a verb-noun pair. However, compositional action recognition remains relatively unexplored due to the inherent complexities involved in decomposing and composing action videos that contain both spatial and temporal semantic features [18]. To elaborate further, on one hand, it is challenging to pinpoint the specific spatial and temporal regions within a video that correspond to the verb associated with a particular action class. On the other hand, action videos tend to convey a much richer set of semantic information compared to still images, making it difficult to encapsulate human actions within a rigid template, such as a verb and a noun. These inherent challenges have hindered the exploration of compositional action recognition techniques. To address these issues, as shown in Figure 1, we learn the visual features of all the verbs, prepositions, and nouns from two actions “lifting the doll up then dropping it down” and “throwing the scissors into the box”, and then compose new samples of “throwing the scissors up then letting it drop down” with above verbs, prepositions, and nouns. Notably, compared with existing compositional models [19, 4, 15], rather than decomposing and composing action samples in a rigid combination of a verb and a noun, we simultaneously extract the features of multiple verbs, prepositions, and nouns from egocentric action videos, and then generate new training data in a relatively free-form manner for the rare classes.

To address the data scarcity problem in egocentric action recognition via compositional learning, we devise the free-form compositional networks (FFCN). Compared with traditional compositional learning models, FFCN decomposes/composes human actions with much richer semantic information in a flexible manner. Specifically, it represents each hand-object interaction video by a graph and extracts the semantic features (i.e., verbs, prepositions, and nouns) from action videos. Given the locations of hands and objects, we utilize the temporal dynamics of hand-object and object-object relationships with the graph edge features to represent verbs and prepositions. Since it is common that each action description may simultaneously contain multiple verbs and prepositions, we propose to decompose the graph model to separately extract these spatial-temporal verb/preposition representations. The temporal decomposition aims to distinguish the verbs/prepositions appearing in different video segments, while the spatial decomposition adaptively learns discriminative sub-graphs to characterize verbs and prepositions. After that, we can utilize popular CNNs to extract object features (nouns), and then shuffle these verb, preposition, and noun representations to compose new samples for the rare action classes in the feature space.

The main contributions of this paper can be summarized as follows:

•

We propose a new framework, referred to as the free-form compositional networks (FFCN), to simultaneously extract the feature representations of multiple verbs and prepositions appearing in the action description based on the hand/object locations;
•

We explore the proposed method for generating new training samples with verb, preposition, and noun representations in a flexible free-form manner to alleviate the data scarcity problem (e.g., long-tailed and few-shot) in egocentric action recognition;
•

We intensively test our FFCN for egocentric action recognition in multiple data scarcity tasks, and the experimental results show the effectiveness of the proposed method for improving the model generalizability.

II Related Work

II-A Egocentric Action Recognition

Compared with third-person views, egocentric action recognition relies on modeling the fine-grained hand-object interactions. Early methods mainly explore motion cues to distinguish egocentric actions. In particular, the optical flow-based global [20] and local [21] descriptors are designed to classify a variety of interactive actions. The motion descriptors are further combined with CNN-based appearance [22, 23] and depth [24] information to improve the action representations. Noticing that the attention mechanism is efficient to locate the region of interest from cluttered background, the gaze estimation [25, 26] is utilized to direct deep learning models to concentrate on the informative region in hand-object interaction videos. LSTA [27] extends traditional LSTM with the spatial attention based neural unit in order to concentrate on the action-related portion of the video sequence. In addition, the visual and audio features are fused by multi-scale temporal-binding [28] to obtain complementary information from egocentric actions.

According to the structure of egocentric actions, recent methods utilize two independent branches to separately recognize the associated verb and noun. The twin stream network [3] utilizes the hand segmentation to locate the action-related object, and then classifies the verb and noun by CNNs. To exploit the correlation between the verb and noun branches, the symbiotic attention [7, 8] uses the spatial locations and discriminative object features to focus on the occurring interactions. Based on the multi-task framework, H+O [5] estimates the hand and object poses in the 3D space from 2D RGB images, and then models the hand-object relationships with LSTM to recognize the action class and the action-related object. Several datasets [2, 6] also provide annotated 3D hand poses and 6D object poses to facilitate the study of egocentric actions. To avoid labor-intensive annotations, IPL [4] leverages the motion cues to learn the verb class, and then guide the noun classification by extracting the action-related object. Compared with the rigid combination of a verb and a noun, our method is able to simultaneously extract the features of multiple verbs, prepositions, and nouns from an egocentric action video, and then reassemble these features to generate new training data for the rare and unseen actions.

II-B Compositional Learning

Compositional learning has attracted increasing attention by synthesizing new samples based on the representations of partial images [29, 30, 31]. Compositional models are also utilized to ameliorate the detection of partly occluded objects with non-occluded portions [32, 33]. For visual question answering [34, 35], the questions are first decomposed into syntactic sub-modules, which are then regrouped by question-related compositional networks.

Currently, a variety of compositional models are proposed to focus on the data scarcity issue in human-object interaction (HOI) detection. External lexical knowledge [36] is embedded into graph convolutional networks to generate new action-object pairs for zero-shot HOI. The target objects in images are also replaced with similar intra-class instances in other images to improve the diversity of training data [37]. IDN [38] and VCL [39] extract shared action and object features from HOI images via the transformation function in the multi-branch network, and then regroup these action and object features to compose novel HOI samples in the feature space. In order to detect rare and unseen categories in the open world, ATL [40] and FCL [41] combine verb representations and novel object representations to generate large-scale HOI samples. ICompass [42] and GCA [43] further explore multiple verb-noun pairs from a single image via cross-attention mechanism and generative adversarial networks. Compared with existing HOI models decomposing/composing still images with the rigid template of a verb and a noun, we devise a flexible form to decompose/compose egocentric action videos with multiple verbs, prepositions, and nouns.

II-C Graphical Models

Graphical models are widely used to reason the long range relationships in multiple computer vision tasks, including semantic segmentation [44, 45], scene graph generation [43, 46], and group activity recognition [47]. By performing convolutional operations on the graph-structured data, graph convolutional networks (GCNs) [48] are adopted to learn the temporal evolution of human body skeletons. ST-GCN [49] adopts the spatial and temporal convolutions to model the evolution of human poses. 2s-AGCN [50] increases the flexibility of graph models and learn both the static and dynamic information from skeleton sequence. Besides skeleton data, GCNs are also employed to model the relationships between object proposals extracted from the action video. STRG [51] constructs the space-time region graph with object bounding boxes to reason the human-object and object-object relationships. LSTR [52] takes 3D tubelets as graph nodes, and models the long-term temporal dynamics via GCNs to localize human actions.

Compared with GCNs operating on a graph with its edge fixed as a similarity scalar, graph neural networks (GNNs) updates the node and edge features alternatively by message passing procedure. dNRI [53] predicts the states of the human body joints and their relations at every point in time based on GNNs and all the previous states. Similarly, [54] makes use of historical pose tracklets to predict the human poses in the following frame, which serves as a robust estimation in challenging scenarios. In addition to the human pose estimation, GNN-based models are combined with the pose-graph optimization to deal with the problem of camera pose estimation [55]. Different from above graphical models operating on the complete graph, MUSLE [56] explores compact sub-graphs to capture discriminative patterns of human actions for classification. In this paper, we adopt the edge features in GNNs to characterize the verbs and prepositions, and then distinguish different verbs/prepositions by learning from corresponding sub-graphs.

III Our Method

In this section, we will begin by presenting an overview of the primary framework for egocentric action recognition, known as the Free-Form Composition Networks. Subsequently, we will delve into a comprehensive explanation of our proposed approach, which encompasses the spatial-temporal action decomposition modules. Lastly, we will outline the process of creating new action representations in a free-form manner.

III-A Overview

We present an overview of the primary framework of the Free-Form Composition Networks (FFCN) in Figure 2. This framework is designed to combine semantic elements from various videos to generate new instances corresponding to specific action classes. To accomplish this, we start by considering each video frame, which contains $N$ instances, including hands and objects, and establish a graph with $N$ nodes to represent these instances (the red and yellow nodes separately represent hands and objects). Subsequently, we use Graph Neural Networks (GNNs) to concurrently enhance the features of both nodes and edges in the graph. The relationships between verbs, prepositions, hands, and objects are closely intertwined within the edge features of the graph, which capture substantial relational information among different nodes. We then partition the graph into two sub-graphs, namely, a verb graph and a preposition graph, to extract the representations of verbs and prepositions from the edge features. Recognizing that action descriptions may not always comprise just one verb and one preposition, we introduce temporal and spatial decomposition modules to distinguish multiple verb and preposition features from a sequence of verb and preposition graphs.

The detailed procedure to extract different verb representations is shown in Figure 3. Since the number of verbs varies between different action classes, we consistently extract ${n_{v}}$ verb representations from each video, where ${n_{v}}$ indicates the largest number of verbs appearing in an action description. Similarly, the preposition representations are extracted from the sequence of preposition graphs. To extract the object representations corresponding to the nouns in the action description, we employ vanilla CNNs. With the above-mentioned shared verb/preposition/noun representations, we then integrate those verb, preposition, and noun representations across different classes to generate new training samples in the composition branch, with a focus on the rare classes. That is, the verb, preposition, and noun representations are concatenated in terms of their order appearing in the action description.

It is important to note that the quantity of semantic components (verbs, prepositions, and nouns) varies significantly across different action classes. For the rare classes, which may contain ${n_{v}}$ verbs, ${n_{p}}$ prepositions, and ${n_{n}}$ nouns, we take the $d$ -dimensional features corresponding to verbs, prepositions, and nouns and individually reduce their dimensions to $d/{n_{v}}$ , $d/{n_{p}}$ , and $d/{n_{n}}$ using a multilayer perceptron (MLP). After this dimension reduction, these features are then concatenated based on their order of appearance in the action description, resulting in the creation of new samples with a dimension of $3d$ . This ensures that the features of all generated training samples are normalized to the same dimension.

III-B Graph Generation

In this subsection, we define the process of generating a graph as follows: Consider a video comprising $T$ frames, each containing $N$ instances. We represent all these instances’ feature information as $X=(x_{1}^{1},\cdots,x_{N}^{1},x_{1}^{2},\cdots,x_{N}^{2},\cdots,x_{1}^{T},\cdots,x_{N}^{T})$ , where $x_{i}^{t}$ represents the feature representation of the $i$ -th instance in frame $t$ . These instance representations can take various forms, such as 1) bounding box coordinates that include the center position, height, and width of each bounding box, and 2) appearance data obtained from the backbone CNNs. In this study, we choose to use the bounding box coordinates as the instance features for extracting verb and preposition characteristics. This choice is motivated by two reasons: firstly, these coordinates complement the widely adopted RGB-based methods, and secondly, verbs and prepositions have a closer association with the spatial positions of hands and objects than with RGB information. For example, two action videos of “pour juice” and “pour water” contain the same verb “pour”, but they may have different appearance features. The zero vectors are adopted as the features when there are fewer instances detected in each video frame. Therefore, we have the ability to build a graph denoted as $G=(V,E)$ where the graph nodes and edges symbolize the identified instances and their interconnections. To enhance this graph, we employ a graph neural network (GNN). In the process of message passing at step $s$ , we represent the node features as $h_{i,t}^{s}$ and the edge features as $f_{ij,t}^{s}$ . The architecture of the GNN, which concurrently improves both the graph nodes and edges, is structured as follows:

$\displaystyle h_{i,t}^{0}$	$\displaystyle=$	$\displaystyle{\phi_{\rm ini}}(x_{i}^{t}),$	(1)
$\displaystyle f_{ij,t}^{s}$	$\displaystyle=$	$\displaystyle{\phi_{\rm edge}}([h_{i,t}^{s-1},h_{j,t}^{s-1}]),~{}~{}$	(2)
$\displaystyle h_{i,t}^{s}$	$\displaystyle=$	$\displaystyle{\phi_{\rm node}}(\sum\limits_{j\neq i}{f_{ij,t}^{s}}).$	(3)

Here, the notation $[\cdot,\cdot]$ signifies the concatenation operation, while $h$ and $f$ correspond to the intermediate hidden states of the graph node and edge, respectively. Additionally, $\phi$ stands for the multilayer perceptrons (MLPs).

We utilize the output of the GNNs, which are the graph edge features, to gain insights into the interactive relationships among various instances. Recognizing that the verb and preposition characteristics can be respectively attributed to the hand-object and object-object relationships within egocentric videos, we proceed to break down the refined graph model into two distinct components: the verb graph and the preposition graph. This allows us to independently capture the features related to the verb and preposition. Within the verb graph, we establish connections between the hand node and all the object nodes to extract information regarding hand-object relationships. In contrast, in the preposition graph, we interconnect the object nodes themselves to capture the interactions among objects. This approach differs from most existing graph-based models, which predominantly focus on exploring the features of graph nodes. In our case, we leverage the robust edge features to extract semantic components, namely the verb and preposition features.

III-C Temporal Decomposition

In this subsection, we formulate the temporal decomposition as follows. Given the fact that the action description usually consists of multiple verbs and prepositions, which correspond to different segments in the video. Meanwhile, the dilated convolution is usually adopted to aggregate multi-scale contextual information, and increase the receptive field by stacking multiple layers. Therefore, we propose to explore dilated convolutions on the verb/preposition graph edges in the time dimension to extract the features of different verbs/prepositions from the corresponding video segments.

Given ${N_{e}}$ edges in the verb graph, the features of the $k$ -th edge across $T$ frames are denoted as ${F_{k}}=({f_{k1}},{f_{k2}},\cdots,{f_{kT}})$ , which are fed into several layers of dilated 1D convolutions. In this paper, we use four layers of dilated convolutions with dilation factors $d=1,2,3,4$ at each layer. All these layers have the same number of convolutional filters with the kernel size 3. Each layer utilizes a dilated convolution with the ReLU activation function, and the output is refined by the next layer. Inspired by MS-TCN [57], we also employ a residual connection [58] and the operations at each layer are formulated as follows:

	$\displaystyle{\hat{H}_{l}}=\text{ReLU}({W_{1}}*{H_{l-1}}+{b_{1}}),$		(4)
	$\displaystyle{H_{l}}={H_{l-1}}+{W_{2}}*{{\hat{H}}_{l}}+{b_{2}},~{}~{}$		(5)

where ${H_{l}}$ indicates the output of the $l$ -th layer, $*$ is the convolution operator, ${W_{1}}\in{R^{3\times D\times D}}$ are the weights of the dilated convolution with $D$ filters and kernel size 3, ${W_{2}}\in{R^{1\times D\times D}}$ are the weights of the $1\times 1$ convolution, and ${b_{1}},{b_{2}}\in{R^{D}}$ indicate the bias. The optimized graph edge features can be obtained from the last dilated convolution layer and the features of the $k$ -th edge across $T$ frames are then aggregated as follows:

\displaystyle{q_{k}}={\phi_{verb}}({f_{k1}},\cdots,{f_{kT}}),

(6)

where the function ${\phi_{verb}}$ indicates the MLP networks. All the edges in the verb graph share the same dilated temporal convolution layers, in order to extract the verb feature from the graph edges belonging to the same video segment. This way, all the verb graphs across $T$ frames are aggregated along the time dimension into a single graph model, where the edge features are denoted as ${Q_{v}}=[{q_{1}},{q_{2}},\cdots,{q_{{N_{e}}}}]$ . Similarly, the preposition graphs across $T$ frames are temporally aggregated into a single graph with the edge features ${Q_{p}}$ .

III-D Spatial Decomposition

In this subsection, we formulate the spatial decomposition as follows. Though the verb and preposition graphs are utilized to explore the verb and preposition features based on the temporal evolution of hand-object and object-object relationships, the verb and preposition graphs usually contain irrelevant objects due to the complex background, which may degrades the reasoning from spatial-temporal relationships. Instead of using a simple averaging pooling operation to aggregate the graph edge features, we propose to adaptively learn from the relationships between action-related instances to extract the verb and preposition features.

For the verb features, we apply the graph convolutional networks (GCNs) to encode the interactive relations between different instances as follows:

\displaystyle Z=A{Q_{v}^{T}}W,

(7)

where $A$ represents the adjacency matrix with ${N_{e}}\times{N_{e}}$ dimensions, ${Q_{v}}$ indicates the edge features in the verb graph with $D\times{N_{e}}$ dimensions, and $W$ is the weight matrix with the dimension $D\times D$ . Following the same strategy in [50], the elements $\theta$ in the adjacency matrix $A$ are trainable parameters in the networks and updated in the training process. Therefore, the output of each graph convolution layer $Z$ is in ${N_{e}}\times D$ dimensions, and multiple convolution layers can be stacked together. After the message passing within the verb graph, the final outputs of the GCNs then indicate the optimized relationships between different instances, which are aggregated by average pooling to obtain the verb feature. Notably, we always adopt a residual connection to facilitate the gradient flow. We can also extract the features of all the verbs and prepositions appearing in the action description in the same way. In addition, we adopt vanilla CNNs to extract the object features in each video frame, and then fuse the features of the same object across all the frames by MLPs to represent the nouns.

III-E Action Composition

In this subsection, we outline the process of creating new action samples as follows: We start with verb, preposition, and noun features, and from these, we generate features for new action samples, which are aimed at enriching the training data for the less common classes in the composition branch. To construct these samples for rare actions, we follow the sequence of verbs, prepositions, and nouns as they appear in the infrequent action descriptions. We concatenate the corresponding features and employ MLPs to reduce the dimension of these features, effectively representing the rare action samples within the feature space. These generated samples are then used for training in conjunction with the original training samples. It’s worth noting that both the composition and decomposition branches share the same weights in the fully connected and classifier layers. Finally, we simultaneously minimize two loss functions: ${L_{d}}$ for the decomposition branch and ${L_{c}}$ for the composition branch, expressed as $L={L_{d}}+\lambda\cdot{L_{c}}$ , where $\lambda\geq 0$ serves as a hyper-parameter to balance the two distinct branches.

IV Experiments

In this section, we perform experiments on three popular datasets for egocentric action recognition, including Something-Something V2 (SS-V2) [59], H2O [2], and EPIC-KITCHENS-100 (KITCHENS) [1]. Specifically, we first introduce the statistics of different datasets and implementation details on each dataset such as hyper-parameters. We then compare the proposed method with recent state-of-the-art methods. Lastly, we perform comprehensive ablation studies to better understand how the proposed method addresses data scarcity issues for robust egocentric action recognition.

IV-A Datasets

•

Something-Something V2 (SS-V2) [59] is a large-scale hand-object interaction dataset, which contains 174 action categories created in a crowdsourcing manner. The videos in the same class are captured by performing the same action with various objects, in order to make the methods identify human actions regardless of the interactive objects. It provides the action descriptions mainly consisting of multiple verbs and prepositions as well as video annotations, including bounding boxes and their identities. Overall, it includes 168,913 training videos and 24,777 validation videos. The number of videos in different action classes follows a long-tailed distribution ranging from 115 to 4,081.
•

H2O [2] is a recently introduced egocentric interaction dataset containing 36 action classes of two hands manipulating objects, which are captured by four participants in three different environments. The action labels are verb-noun pairs composed by 11 verb classes and 8 noun classes. This dataset includes 571,645 RGB-D frames with 344,645 training frames, 73,380 validation frames, and 153,620 testing frames. It also provides detailed annotations, containing hand and object poses separately in 3D and 6D space, object meshes, camera parameters, and point cloud data. We focus on the action recognition task, and provide the top-1 accuracy on the validation videos.
•

EPIC-KITCHENS-100 (KITCHENS) [1] is a large-scale egocentric video dataset, which includes 100 hours of daily activity videos captured in kitchens. The performed activities are unscripted and close to real-world data, which makes the number of videos unbalanced between different classes. All the videos are split into train/validation/test with a ratio of 75/10/15. In total, there are 89,977 video clips of fine-grained interactions annotated with 97 verb classes and 300 noun classes. It also provides annotations of hands and objects appearing in the video. Following the standard evaluation strategy, we separately predict the verb and noun classes of each video, and then obtain the action class by fusing the verb and noun classification results. We provide the top-1 accuracy on the validation videos.

IV-B Implementation Details

We implement the proposed free-form composition networks (FFCN) based on PyTorch [60], and sample 16 video frames from each clip. We adopt SGD [61] with the momentum value of 0.9 and the weight decay value of 1e-4.

On the Something-Something V2 dataset, we respectively report the settings in the standard, object-independent, and few-shot classification tasks. In the standard and object-independent protocols, each mini-batch contains 64 samples. The basic learning rate is set to 0.05, and divided by a factor of 10 after 15 epochs. A total of 25 epochs are utilized to train the networks. We generate new training data for 30 tail classes, and compose 10 samples in each mini-batch. In the few-shot protocol, each mini-batch contains 16 samples and the learning rate is set to 0.05 with a total of 8 epochs. We compose new training data for all the classes, and the number of composed samples is equal to original training samples in each class. We set the weight $\lambda{\rm{=}}0.1$ to balance two branches in the final loss function.

On the H2O dataset, each mini-batch contains 12 action samples. The basic learning rate is set to 0.005, and decreased by 0.1 at the 20-th epoch with a total of 80 epochs. We generate new training data for each action class, and compose 10 action samples in each mini-batch. We utilize the 3D hand poses and 6D object poses provided by this dataset as the features of each instance to learn the verb features, and employ the VGG-16 [62] to learn the noun features from the region of the objects. We set the weight $\lambda{\rm{=}}0.2$ to balance the decomposition and composition branches.

On the EPIC-KITCHENS-100 dataset, each mini-batch contains 256 videos. The basic learning rate is set to 0.001, and divided by a factor of 10 at the 20-th and 40-th epochs. A total of 50 epochs are utilized to train the networks. We compose new samples for the tail classes with fewer than 100 training samples, and generate 20 samples in each mini-batch. We use the bounding box positions of hands and objects to learn the verb features, and adopt the VGG-16 [62] to extract the noun features. We set the weight $\lambda{\rm{=}}0.1$ in the final loss function.

	Method	Year	Accuracy
Loc.	STIN [18]	2020	$48.4\%$
Loc.	FFCN	-	52.9%
RGB	I3D [63]	2017	$56.0\%$
	TRN Dual Atten. [64]	2019	$51.6\%$
	TSM [65]	2019	$61.7\%$
	STIN + I3D [18]	2020	$60.2\%$
	TDN [66]	2021	$67.0\%$
	DirecFormer [67]	2022	$64.9\%$
	TCM [68]	2022	$67.8\%$
	SIFA [69]	2022	$69.8\%$
	ViTTA [70]	2023	$66.4\%$
	Video-FocalNet [71]	2023	$71.1\%$
	FFCN + I3D	-	$67.3\%$
	FFCN + TDN	-	74.8%

TABLE I: Results on the original SS-V2 dataset

	Method	Year	Obj-ind	5-shot	10-shot
Loc.	STIN [18]	2020	$51.3\%$	$25.8\%$	$32.9\%$
	STIN + NL [18]	2020	$51.4\%$	$27.7\%$	$33.5\%$
	FFCN	-	54.3%	30.9%	35.8%
RGB	I3D [63]	2017	$51.9\%$	$23.5\%$	$27.1\%$
	STRG [51]	2018	$52.3\%$	$24.8\%$	$29.9\%$
	STIN + STRG [18]	2020	$56.2\%$	$29.1\%$	$34.6\%$
	STIN + I3D [18]	2020	$58.1\%$	$34.0\%$	$40.6\%$
	TDN [66]	2021	$62.5\%$	$30.8\%$	$38.5\%$
	SIFA [69]	2022	$44.6\%$	$19.4\%$	$23.7\%$
	TCM [68]	2022	$59.8\%$	$33.6\%$	$39.8\%$
	Video-FocalNet [71]	2023	$68.8\%$	$19.3\%$	$21.6\%$
	FFCN + I3D	-	$66.9\%$	$38.7\%$	$44.4\%$
	FFCN + TDN	-	70.5%	43.1%	53.5%

TABLE II: Results of object-independent and few-shot action recognition on the SS-V2 dataset

IV-C Results on Something-Something V2

Initially, we evaluate our proposed method using the standard approach, which involves a dataset consisting of 168,913 training videos and 24,777 validation videos. The performance comparison against state-of-the-art methods is presented in Table I. To ensure a fair assessment, we separately compare our Free-Form Composition Network (FFCN) with two different categories of existing methods that employ bounding box locations and video sequences as input to the networks. Both FFCN and STIN [18] utilize bounding box locations as input and can be integrated with RGB-based methods to enhance accuracy. In this scenario, FFCN outperforms STIN in both settings. Additionally, I3D [63] is a widely used RGB-based method for video action recognition. When combined with the same method, FFCN+I3D ( $67.3\%$ ) surpasses STIN+I3D ( $60.2\%$ ) by $7.1\%$ . As TDN achieves better results than I3D, we incorporate FFCN with TDN, resulting in an accuracy of $74.8\%$ , which is $7.8\%$ higher than TDN and notably superior to recently proposed alternatives.

Following the evaluation protocol outlined in [18], we also assess the performance of our FFCN in an object-independent scenario. In this setup, the training set consists of combinations of nouns and verbs that do not appear in the testing set. The training and validation sets contain 54,919 and 57,876 videos, respectively, encompassing a total of 174 action classes. As demonstrated in Table II, FFCN ( $54.3\%$ ) outperforms STIN ( $51.3\%$ ) by exclusively utilizing bounding box locations of hands and objects to capture the features of verbs and prepositions present in the action descriptions. Most existing methods rely on whole video frames as input to distinguish between different action classes, partially relying on scene appearance information rather than the motion characteristics between hands and objects. The proposed FFCN can be easily combined with RGB-based methods through simple late fusion, resulting in an improved performance. By fusing FFCN ( $54.3\%$ ) with I3D ( $51.9\%$ ), the final accuracy reaches $66.9\%$ . When combined with TDN ( $62.5\%$ ), we achieve the highest accuracy of all, at $70.5\%$ .

To further validate the effectiveness of composing new training data, we test the performance of FFCN in the task of few-shot action recognition. We also utilize the protocol in [18] to separate the training data into the base and novel sets. Our networks are first trained on the 88 base classes, and then finetuned on the other 86 novel classes with 5 or 10 training videos in each class. As illustrated in Table II, we separately compare the accuracies of 5-shot and 10-shot action recognition on the 86 novel classes. With only 5 training samples in each action class, our FFCN ( $30.9\%$ ) exceeds STIN ( $25.8\%$ ) by $5.1\%$ based on the locations of hands and objects. Combined with RGB-based models, FFCN+I3D ( $38.7\%$ ) outperforms I3D ( $23.5\%$ ) by $15.2\%$ , and FFCN+TDN ( $43.1\%$ ) outperforms TDN ( $30.8\%$ ) by $12.3\%$ . In the 10-shot action recognition, our FFCN also improves the performances of RGB-based models by more than $15\%$ . The above results in the few-shot learning validate that our FFCN can augment the training data to obviously improve the accuracies in egocentric action recognition.

IV-D Results on H2O

Method	Year	Accuracy
I3D [63]	2017	$87.2\%$
ST-GCN [49]	2018	$83.5\%$
H+O [5]	2019	$80.5\%$
SlowFast [72]	2019	$86.0\%$
X-ViT [73]	2021	$82.8\%$
TA-GCN [2]	2021	$86.8\%$
TAR [74]	2022	$90.9\%$
HTT [75]	2023	$86.4\%$
FFCN	-	93.3%
FFCN + X-ViT	-	95.0%
FFCN + I3D	-	95.9%

TABLE III: Results on the H2O dataset

We also test our FFCN on the H2O dataset, and the comparison results are illustrated in Table III. RGB-based methods are widely used to recognize hand-object interactions, such as I3D, SlowFast, and X-ViT. Moreover, TAR achieves excellent result ( $90.9\%$ ) in ECCV 2022 challenge on action recognition from egocentric camera. HTT adopts the locations of 3D hand joints and the RGB features of objects together to obtain a promising result ( $86.4\%$ ). The accuracy of our FFCN is $93.3\%$ , which outperforms other methods by a clear margin with only location information. Because the location- and RGB-based methods are complementary in representing hand-object interactions, we combine our FFCN ( $93.3\%$ ) with recently proposed X-ViT ( $82.8\%$ ) to further increase the accuracy to $95.0\%$ . We also fuse the results of FFCN and I3D, and the accuracy is improved to $95.9\%$ , which is the best performance on the H2O dataset.

IV-E Results on EPIC-KITCHENS-100

Different from the above two datasets designed to directly predict the action labels, the EPIC-KITCHENS-100 dataset expects the methods to separately predict the verb, noun, and action labels of each video. Our FFCN mainly utilizes the coordinates of hands and objects to extract the motion features in the video without the appearance information, so we separately report the prediction accuracies of the verb and action in FFCN and the accuracies of the verb, noun, and action by integrating FFCN with a RGB-based method. The comparison results are illustrated in Table IV. Because of the promising performance, SlowFast and TSM are both widely used to recognize egocentric actions. In first person videos, the active objects are usually surrounded by distracting objects, so IPL and SOS present two kinds of methods to capture the action-related objects in the video. Based on the transformer architecture, X-ViT exploits the spatial-temporal attention in action videos, and OMNIVORE jointly trains a single model from multiple modalities, such as images, videos, and 3D data, to achieve outstanding results in different classification tasks. All aforementioned methods are RGB-based models, and extract the video features from the appearance information. Our FFCN utilizes the locations of hands and objects to predict the verb and action labels, and the mean accuracies are separately $56.1\%$ and $52.2\%$ . Integrated with X-ViT, the verb and action accuracies are increased to $71.5\%$ and $60.5\%$ . Our FFCN is also fused with OMNIVORE, and the final verb and action prediction accuracies are $73.6\%$ and $69.1\%$ , which are separately $4.1\%$ and $19.2\%$ higher than OMNIVORE.

Method	Year	Verb	Noun	Action
SlowFast [72]	2019	$65.6\%$	$50.0\%$	$38.5\%$
TSM [65]	2019	$67.9\%$	$49.1\%$	$38.4\%$
IPL [4]	2021	$68.6\%$	$51.2\%$	$41.0\%$
X-ViT [73]	2021	$68.7\%$	$56.4\%$	$44.3\%$
SOS [76]	2022	$69.3\%$	$57.9\%$	$45.7\%$
OMNIVORE [77]	2022	$69.5\%$	$61.7\%$	$49.9\%$
Video-FocalNet [71]	2023	$66.2\%$	$46.7\%$	$40.9\%$
FFCN	-	$56.1\%$	-	$52.2\%$
FFCN+X-ViT	-	$71.5\%$	$56.4\%$	$60.5\%$
FFCN+OMNIVORE	-	73.6%	$61.7\%$	69.1%

TABLE IV: Results on the EPIC-KITCHENS-100 dataset

IV-F Ablation Studies

Dataset	Decomp. branch	Two branches
SS-V2	$51.3\%$	$52.9\%$
SS-V2-Subset	$55.1\%$	$64.7\%$
H2O	$91.7\%$	$93.3\%$
KITCHENS	$50.5\%$	$52.2\%$

TABLE V: The effectiveness of decomposition and composition branches in the free-form composition networks

The proposed free-form composition networks mainly consist of two branches, i.e., a decomposition branch and a composition branch. To demonstrate the effectiveness of two branches for egocentric action recognition, we separately report the results of the decomposition branch and two branches together. The accuracies corresponding to two kinds of settings are shown in Table V. On the something-something V2 dataset, the mean accuracy is $51.3\%$ based on the decomposition branch, and the result is increased to $52.9\%$ through generating new training data for the tail actions with the composition branch. We can see that some tail actions in this dataset can not be composed by the videos of other action classes. For instance, the verb “cover” only appears in the action “covering something with something”, so we can not extract its visual feature from other actions. We thus gather 50 action classes (i.e., SS-V2-Subset) with shared verbs and prepositions/adverbs from something-something V2 to evaluate the effectiveness of two branches. The number of training samples in each action class is illustrated in Figure 4, where the class label is consistent with original dataset. Based on the decomposition branch, the mean accuracy is $55.1\%$ using only the locations of hands and objects. Then, the accuracy is improved to $64.7\%$ by generating the features of new training samples for 20 action classes with the fewest training data. H2O is a relatively small dataset, and the number of videos for each class in the training set ranges from 12 to 25. We thus generate new training samples for all the action classes. After applying data augmentation, the mean accuracy is increased from $91.7\%$ with only the decomposition branch to $93.3\%$ with two branches together. The EPIC-KITCHENS-100 dataset expects the methods to separately predict the verb, noun, and action labels. It is worth noting that the two-branch architecture is designed to directly predict the action label. To test the effectiveness of the compositional branch, we only compare the action prediction accuracies whether or not generating new training data for the tail actions. The average accuracy is increased from $50.5\%$ to $52.2\%$ through data augmentation.

We also compare the performances of each action class whether or not composing new training samples. The comparison results on the SS-V2-Subset are shown in Figure 5. The class label and the number of training samples in each class are both consistent with Figure 4. With only the decomposition branch, none of the validation samples is correctly recognized in some tail classes, such as 136, 60, and 137. After composing new samples for 20 tail classes, the accuracies of 45 out of 50 actions are improved, and the result of action class 15 is increased from $50.0\%$ to $83.5\%$ . We further compare the results of the decomposition and composition branches on the H2O dataset. As shown in Figure 6, the decomposition branch of our FFCN obtains promising results, and the accuracies of most classes are $100\%$ . After composing new samples with two branches, the results of five classes are increased by a clear margin, and the accuracy of the action “close chips” is increased from 0 to $50.0\%$ . But some actions in this dataset are still easily confused with each other. For example, the action “take out espresso” is usually misclassified into the actions of “take out cappuccino” and “put in espresso”.

V Conclusion

In this paper, we propose the free-form composition networks to relieve the data scarcity problem in the long-tailed and few-shot egocentric action recognition. The proposed free-form composition networks (FFCN) works in a flexible way for action recognition, regardless of the different numbers of verbs/prepositions/nouns used in the action description: it first extracts the spatial-temporal features of multiple different verbs, prepositions, and nouns appearing in the action description of videos, and then generates new training samples for the rare action classes by integrating disentangled verb, preposition, and noun feature representations together. Extensive experiments on popular egocentric action recognition datasets show the effectiveness of the proposed method to address the challenging data scarcity issues, containing the long-tailed and few-shot egocentric action recognition. In the future, we will investigate to adaptively learn the semantic components from the raw text, which is not restricted to the combination of verbs, prepositions, and nouns.

References

[1] D. Damen, H. Doughty, G. M. Farinella, A. Furnari, E. Kazakos, J. Ma, D. Moltisanti, J. Munro, T. Perrett, W. Price, and M. Wray, “Rescaling egocentric vision: Collection, pipeline and challenges for epic-kitchens-100,” International Journal of Computer Vision, vol. 130, no. 1, pp. 33–55, 2022.
[2] T. Kwon, B. Tekin, J. Stühmer, F. Bogo, and M. Pollefeys, “H2o: Two hands manipulating objects for first person interaction recognition,” in Proceedings of the IEEE International Conference on Computer Vision, 2021, pp. 10 138–10 148.
[3] M. Ma, H. Fan, and K. M. Kitani, “Going deeper into first-person activity recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1894–1903.
[4] X. Wang, L. Zhu, H. Wang, and Y. Yang, “Interactive prototype learning for egocentric action recognition,” in Proceedings of the IEEE International Conference on Computer Vision, 2021, pp. 8168–8177.
[5] B. Tekin, F. Bogo, and M. Pollefeys, “H+o: Unified egocentric recognition of 3d hand-object poses and interactions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2019, pp. 4511–4520.
[6] G. Garcia-Hernando, S. Yuan, S. Baek, and T.-K. Kim, “First-person hand action benchmark with rgb-d videos and 3d hand pose annotations,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 409–419.
[7] X. Wang, L. Zhu, Y. Wu, and Y. Yang, “Symbiotic attention for egocentric action recognition with object-centric alignment,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
[8] X. Wang, Y. Wu, L. Zhu, and Y. Yang, “Symbiotic attention with privileged information for egocentric action recognition,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2020, pp. 12 249–12 256.
[9] J. Yuan, Y. Liu, C. Shen, Z. Wang, and H. Li, “A simple baseline for semi-supervised semantic segmentation with strong data augmentation,” in Proceedings of the IEEE International Conference on Computer Vision, 2021, pp. 8229–8238.
[10] Y. Zang, C. Huang, and C. C. Loy, “Fasa: Feature augmentation and sampling adaptation for long-tailed instance segmentation,” in Proceedings of the IEEE International Conference on Computer Vision, 2021, pp. 3457–3466.
[11] P. Chu, X. Bian, S. Liu, and H. Ling, “Feature space augmentation for long-tailed data,” in Proceedings of the European conference on computer vision, 2020, pp. 694–710.
[12] Z. Chen, Y. Fu, K. Chen, and Y.-G. Jiang, “Image block augmentation for one-shot learning,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2019, pp. 3379–3386.
[13] X. Yin, X. Yu, K. Sohn, X. Liu, and M. Chandraker, “Feature transfer learning for face recognition with under-represented data,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2019, pp. 5704–5713.
[14] J. Liu, Y. Sun, C. Han, Z. Dou, and W. Li, “Deep representation learning on long-tailed data: A learnable embedding augmentation perspective,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020, pp. 2970–2979.
[15] Z. Hou, B. Yu, Y. Qiao, X. Peng, and D. Tao, “Affordance transfer learning for human-object interaction detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021, pp. 495–504.
[16] D. Huynh and E. Elhamifar, “Interaction compass: Multi-label zero-shot learning of human-object interactions via spatial relations,” in Proceedings of the IEEE International Conference on Computer Vision, 2021, pp. 8472–8483.
[17] Z. Hou, B. Yu, Y. Qiao, X. Peng, and D. Tao, “Detecting human-object interaction via fabricated compositional learning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021, pp. 14 646–14 655.
[18] J. Materzynska, T. Xiao, R. Herzig, H. Xu, X. Wang, and T. Darrell, “Something-else: Compositional action recognition with spatial-temporal interaction networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020, pp. 1049–1059.
[19] B. Kim, J. Mun, K.-W. On, M. Shin, J. Lee, and E.-S. Kim, “Mstr: Multi-scale transformer for end-to-end human-object interaction detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 19 578–19 587.
[20] K. M. Kitani, T. Okabe, Y. Sato, and A. Sugimoto, “Fast unsupervised ego-action learning for first-person sports videos,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2011, pp. 3241–3248.
[21] M. S. Ryoo and L. Matthies, “First-person activity recognition: What are they doing to me?” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2013, pp. 2730–2737.
[22] M. S. Ryoo, B. Rothrock, and L. Matthies, “Pooled motion features for first-person videos,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 896–904.
[23] Y. Poleg, A. Ephrat, S. Peleg, and C. Arora, “Compact cnn for indexing egocentric videos,” in IEEE winter conference on applications of computer vision, 2016, pp. 1–9.
[24] Y. Tang, Y. Tian, J. Lu, J. Feng, and J. Zhou, “Action recognition in rgb-d egocentric videos,” in IEEE International Conference on Image Processing, 2017, pp. 3410–3414.
[25] Y. Li, M. Liu, and J. M. Rehg, “In the eye of beholder: Joint learning of gaze and actions in first person video,” in Proceedings of the European conference on computer vision, 2018, pp. 619–635.
[26] Y. Li, Z. Ye, and J. M. Rehg, “Delving into egocentric actions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 287–295.
[27] S. Sudhakaran, S. Escalera, and O. Lanz, “Lsta: Long short-term attention for egocentric action recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 9954–9963.
[28] E. Kazakos, A. Nagrani, A. Zisserman, and D. Damen, “Epic-fusion: Audio-visual temporal binding for egocentric action recognition,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 5492–5501.
[29] A. Alfassy, L. Karlinsky, A. Aides, J. Shtok, S. Harary, R. Feris, R. Giryes, and A. M. Bronstein, “Laso: Label-set operations networks for multi-label few-shot learning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 6548–6557.
[30] I. Misra, A. Gupta, and M. Hebert, “From red wine to red tomato: Composition with context,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1792–1801.
[31] S. Tulsiani, H. Su, L. J. Guibas, A. A. Efros, and J. Malik, “Learning shape abstractions by assembling volumetric primitives,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2635–2643.
[32] A. Wang, Y. Sun, A. Kortylewski, and A. L. Yuille, “Robust object detection under occlusion with context-aware compositionalnets,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020, pp. 12 645–12 654.
[33] A. Kortylewski, J. He, Q. Liu, and A. L. Yuille, “Compositional convolutional neural networks: A deep architecture with innate robustness to partial occlusion,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020, pp. 8940–8949.
[34] J. Andreas, M. Rohrbach, T. Darrell, and D. Klein, “Neural module networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 39–48.
[35] R. Hu, J. Andreas, M. Rohrbach, T. Darrell, and K. Saenko, “Learning to reason: End-to-end module networks for visual question answering,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 804–813.
[36] K. Kato, Y. Li, and A. Gupta, “Compositional learning for human object interaction,” in Proceedings of the European Conference on Computer Vision, 2018, pp. 234–251.
[37] H. Fang, Y. Xie, D. Shao, Y. Li, and C. Lu, “Decaug: Augmenting hoi detection via decomposition,” in Thirty-five AAAI conference on artificial intelligence, 2021.
[38] Y. Li, X. Liu, X. Wu, Y. Li, and L. Cewu, “Hoi analysis: Integrating and decomposing human-object interaction,” in Advances in neural information processing systems, 2020.
[39] Z. Hou, X. Peng, Y. Qiao, and D. Tao, “Visual compositional learning for human-object interaction detection,” in Proceedings of the European Conference on Computer Vision, 2020, pp. 584–600.
[40] Z. Hou, B. Yu, Y. Qiao, X. Peng, and D. Tao, “Affordance transfer learning for human-object interaction detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021, pp. 495–504.
[41] ——, “Detecting human-object interaction via fabricated compositional learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 14 646–14 655.
[42] D. Huynh and E. Elhamifar, “Interaction compass: Multi-label zero-shot learning of human-object interactions via spatial relations,” in Proceedings of the IEEE International Conference on Computer Vision, 2021, pp. 8472–8483.
[43] B. Knyazev, H. de Vries, C. Cangea, G. W. Taylor, A. Courville, and E. Belilovsky, “Generative compositional augmentations for scene graph prediction,” in Proceedings of the IEEE International Conference on Computer Vision, 2021, pp. 15 827–15 837.
[44] G.-S. Xie, J. Liu, H. Xiong, and L. Shao, “Scale-aware graph neural network for few-shot semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021, pp. 5475–5484.
[45] T. Wang, N. Xu, K. Chen, and W. Lin, “End-to-end video instance segmentation via spatial-temporal graph neural networks,” in Proceedings of the IEEE International Conference on Computer Vision, 2021, pp. 10 797–10 806.
[46] J. Zhang, K. J. Shih, A. Elgammal, A. Tao, and B. Catanzaro, “Graphical contrastive losses for scene graph parsing,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 11 535–11 543.
[47] Z. Deng, A. Vahdat, H. Hu, and G. Mori, “Structure inference machines: Recurrent neural networks for analyzing relations in group activity recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 4772–4781.
[48] T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” in Proceedings of the International Conference on Learning Representations, 2017, pp. 1–14.
[49] S. Yan, Y. Xiong, and D. Lin, “Spatial temporal graph convolutional networks for skeleton-based action recognition,” in Thirty-second AAAI conference on artificial intelligence, 2018.
[50] L. Shi, Y. Zhang, J. Cheng, and H. Lu, “Two-stream adaptive graph convolutional networks for skeleton-based action recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2019, pp. 12 026–12 035.
[51] X. Wang and A. Gupta, “Videos as space-time region graphs,” in Proceedings of the European conference on computer vision, 2018, pp. 399–417.
[52] D. Li, T. Yao, Z. Qiu, H. Li, and T. Mei, “Long short-term relation networks for video action detection,” in Proceedings of the 27th ACM International Conference on Multimedia, 2019, pp. 629–637.
[53] C. Graber and A. Schwing, “Dynamic neural relational inference,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020, pp. 8510–8519.
[54] Y. Yang, Z. Ren, H. Li, C. Zhou, X. Wang, and G. Hua, “Learning dynamics via graph neural networks for human pose estimation and tracking,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021, pp. 8074–8084.
[55] X. Li and H. Ling, “Pogo-net: Pose graph optimization with graph neural networks,” in Proceedings of the IEEE International Conference on Computer Vision, 2021, pp. 5895–5905.
[56] D. Li, Z. Qiu, Y. Pan, T. Yao, H. Li, and T. Mei, “Representing videos as discriminative sub-graphs for action recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021, pp. 3310–3319.
[57] Y. A. Farha and J. Gall, “Ms-tcn: Multi-stage temporal convolutional network for action segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 3575–3584.
[58] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
[59] R. Goyal, S. Ebrahimi Kahou, V. Michalski, J. Materzynska, S. Westphal, H. Kim, V. Haenel, I. Fruend, P. Yianilos, M. Mueller-Freitag et al., “The “something something” video database for learning and evaluating visual common sense,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5842–5850.
[60] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., “Pytorch: An imperative style, high-performance deep learning library,” Advances in neural information processing systems, vol. 32, 2019.
[61] L. Bottou, “Large-scale machine learning with stochastic gradient descent,” in Proceedings of COMPSTAT’2010: 19th International Conference on Computational StatisticsParis France, August 22-27, 2010 Keynote, Invited and Contributed Papers, 2010, pp. 177–186.
[62] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in International Conference on Learning Representations, 2015.
[63] J. Carreira and A. Zisserman, “Quo vadis, action recognition? a new model and the kinetics dataset,” in proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6299–6308.
[64] T. Xiao, Q. Fan, D. Gutfreund, M. Monfort, A. Oliva, and B. Zhou, “Reasoning about human-object interactions through dual attention networks,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 3919–3928.
[65] J. Lin, C. Gan, and S. Han, “Tsm: Temporal shift module for efficient video understanding,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 7083–7093.
[66] L. Wang, Z. Tong, B. Ji, and G. Wu, “Tdn: Temporal difference networks for efficient action recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021, pp. 1895–1904.
[67] T.-D. Truong, Q.-H. Bui, C. N. Duong, H.-S. Seo, S. L. Phung, X. Li, and K. Luu, “Direcformer: A directed attention in transformer approach to robust action recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2022, pp. 20 030–20 040.
[68] Y. Liu, J. Yuan, and Z. Tu, “Motion-driven visual tempo learning for video-based action recognition,” IEEE Transactions on Image Processing, vol. 31, pp. 4104–4116, 2022.
[69] F. Long, Z. Qiu, Y. Pan, T. Yao, J. Luo, and T. Mei, “Stand-alone inter-frame attention in video models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 3192–3201.
[70] W. Lin, M. J. Mirza, M. Kozinski, H. Possegger, H. Kuehne, and H. Bischof, “Video test-time adaptation for action recognition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 22 952–22 961.
[71] S. T. Wasim, M. U. Khattak, M. Naseer, S. Khan, M. Shah, and F. S. Khan, “Video-focalnets: Spatio-temporal focal modulation for video action recognition,” in arXiv:2307.06947, 2023.
[72] C. Feichtenhofer, H. Fan, J. Malik, and K. He, “Slowfast networks for video recognition,” in Proceedings of the IEEE international conference on computer vision, 2019, pp. 6202–6211.
[73] A. Bulat, J. M. Perez Rua, S. Sudhakaran, B. Martinez, and G. Tzimiropoulos, “Space-time mixing attention for video transformer,” in Advances in Neural Information Processing Systems, 2021, pp. 19 594–19 607.
[74] H. Cho and S. Baek, “Transformer-based action recognition in hand-object interacting scenarios,” arXiv preprint arXiv:2210.11387, 2022.
[75] Y. Wen, H. Pan, L. Yang, J. Pan, T. Komura, and W. Wang, “Hierarchical temporal transformer for 3d hand pose estimation and action recognition from egocentric rgb videos,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 21 243–21 253.
[76] V. Escorcia, R. Guerrero, X. Zhu, and B. Martinez, “Sos! self-supervised learning over sets of handled objects in egocentric action recognition,” in Proceedings of the European conference on computer vision, 2022, pp. 606–620.
[77] R. Girdhar, M. Singh, N. Ravi, L. van der Maaten, A. Joulin, and I. Misra, “Omnivore: A single model for many visual modalities,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2022, pp. 16 102–16 112.