
EPIC-KITCHENS-100 Unsupervised Domain Adaptation Challenge for Action Recognition 2022: Team HNU-FPV Technical Report

Nie Lin    Minjie Cai
College of Computer Science and Electronic Engineering
   Hunan University
Hunan
   China
{nielin,caiminjie}@hnu.edu.cn
Abstract

In this report, we present the technical details of our submission to the 2022 EPIC-Kitchens Unsupervised Domain Adaptation (UDA) Challenge. Existing UDA methods align global features extracted from whole video clips across the source and target domains, but they suffer from the spatial redundancy of feature matching in video recognition. Motivated by the observation that in most cases a small image region in each video frame is informative enough for the action recognition task, we propose to exploit informative image regions to perform efficient domain alignment. Specifically, we first use lightweight CNNs to extract the global information of the two-stream video inputs and select informative image patches with a differentiable interpolation-based selection strategy. Then the global information from video frames and the local information from image patches are processed by an existing video adaptation method, i.e., TA3N, to perform feature alignment between the source and target domains. Our method (without model ensemble) ranks 4th among this year’s teams on the test set of EPIC-KITCHENS-100.

1 Introduction

With the rapid development of deep learning techniques, understanding humans' daily interactions with their surroundings from the first-person perspective using deep neural networks has gained increasing interest from researchers. The EPIC-KITCHENS-100 dataset is a large-scale first-person video dataset whose videos record most of the common actions performed in kitchen scenes [2]. The dataset provides fine-grained action labels, and each action is composed of a verb label and a noun label. For the EPIC-KITCHENS-100 Unsupervised Domain Adaptation (UDA) Challenge for Action Recognition, the model needs to be trained on the labeled source domain (EPIC-KITCHENS-2018) and adapted to the unlabeled target domain (EPIC-KITCHENS-100). UDA for action recognition is more challenging than standard action recognition, since the adapted model must overcome the domain discrepancy between the complex video features of the source and target domains. Therefore, effectively modeling a shared feature representation of the source and target domains is one of the keys to solving this challenge.

Figure 1: Illustration of fine-grained action recognition on EPIC-KITCHENS-55 (source domain) and EPIC-KITCHENS-100 (target domain). (a) Due to differences in shooting time and indoor environment, video clips of the same action (e.g., “cutting onion”) contain many different background objects in the source and target domains, which are irrelevant to the action recognition task. (b) By selecting the most informative image regions for processing, the domain discrepancy between the source domain and the target domain can be effectively reduced.
Figure 2: Overview of the proposed method. The method includes two main parts: spatio-temporal feature extraction and video domain adaptation. The spatio-temporal feature extraction is composed of global and local feature extraction branches for both RGB and optical flow inputs. $f^{s}_{\text{G}}$, $f^{s}_{\text{F}}$ and $\pi^{s}$ denote the glancer, focuser and policy networks of the spatial local module, respectively; similar notations are used for the temporal local module. In the video domain adaptation, $\hat{G}_{sd}$, $\hat{G}_{td}$ and $\hat{G}^{n}_{rd}$ denote the spatial, temporal and relation domain classifiers, and $L_{sd}$, $L_{td}$ and $L^{n}_{rd}$ denote the corresponding domain classification losses. $f^{v}_{\text{C}}$ and $f^{n}_{\text{C}}$ denote the verb and noun classifiers, $L^{v}_{y}$ and $L^{n}_{y}$ the verb and noun classification losses, and $L^{v}_{ae}$ and $L^{n}_{ae}$ the attentive entropy losses for verb and noun, respectively.

Recorded by a wearable camera from the first-person perspective, egocentric video is characterized by rapidly changing backgrounds between consecutive actions and cluttered backgrounds containing multiple objects irrelevant to the ongoing action. Furthermore, for videos from different domains, the same action may present large differences in image appearance, especially in the background. As a result, directly modeling a shared feature representation between domains is challenging due to the spatial redundancy of the original video features. Figure 1 shows example video frames of the same action from two different domains. The action of “cutting onion” in the source domain shows a quite different visual appearance from that in the target domain. One exception is the region around the hands, which shows a certain consistency between the two domains. In fact, the information about the verb “cutting” and the noun “onion” is fully encoded in such informative regions of the video frames. The challenge of action recognition in UDA thus lies in the frequent scene switching between actions and the background differences of the same action across domains. Therefore, instead of straightforward domain alignment of the original video features, exploiting the most informative regions of video frames for feature extraction offers a promising way toward efficient domain adaptation for egocentric action recognition.

In this work, we incorporate a learning-based patch selection strategy into an existing video domain adaptation framework. The patch selection strategy is implemented as a lightweight CNN and a policy network, which together locate the task-related regions and extract local features for each video frame. We consider both RGB and optical flow images as input to capture the spatial and temporal characteristics of an action. After spatio-temporal fusion of both global and local features, we adopt an existing video domain adaptation method, TA3N [1], to perform feature alignment for the source and target domains. The experimental results on EPIC-KITCHENS-100 demonstrate the effectiveness of the proposed method in UDA for action recognition.

2 Method

An overview of our approach is shown in Figure 2; the overall model is divided into two parts. The first part extracts the spatio-temporal features of the video from the input RGB frames and optical flow frames and contains both global and local branches. For the local branch, inspired by recent work in video-based action recognition [6, 7], we build a spatio-temporal local feature extraction module. After extracting the global and local features of the original video, the model fuses the features extracted from each domain through spatio-temporal feature fusion. The second part aligns the spatio-temporal features extracted from the source domain and the target domain and finally produces the action predictions for the target domain. We introduce these components in detail in the following sections.

Table 1: Recognition performance of different models on the target validation set. FeatDim: the dimension of the shared features in TA3N. NumSeg: the number of input frames for the global branch / the local branch; within the local branch, the numbers to the left and right of “+” indicate the frames input to the glancer network and the focuser network, respectively. “-” indicates that the local branch is not used for feature extraction.
Method | Global Backbone | Local Backbone | FeatDim | NumSeg | Top-1 Verb (%) | Top-1 Noun (%) | Top-1 Action (%) | Top-5 Verb (%) | Top-5 Noun (%) | Top-5 Action (%)
TA3N | TBN | - | 512 | 6/- | 48.10 | 26.74 | 18.72 | 77.98 | 47.50 | 41.87
TA3N | TBN | - | 1024 | 6/- | 48.28 | 27.30 | 19.25 | 76.71 | 47.39 | 41.65
TA3N | TBN | MN2/RN | 1024 | 6/4+6 | 48.70 | 27.87 | 19.61 | 76.18 | 48.52 | 42.01
TA3N | TBN | MN2/RN | 2048 | 12/8+12 | 49.42 | 28.33 | 20.11 | 77.06 | 47.52 | 41.82
Table 2: Recognition performance of different models on the target test set. All results on the test set were evaluated on the test server. Column definitions are the same as in Table 1.
Method | Global Backbone | Local Backbone | FeatDim | NumSeg | Top-1 Verb (%) | Top-1 Noun (%) | Top-1 Action (%) | Top-5 Verb (%) | Top-5 Noun (%) | Top-5 Action (%)
TA3N | TBN | MN2/RN | 1024 | 6/4+6 | 47.71 | 27.74 | 19.41 | 73.38 | 48.91 | 31.26
TA3N | TBN | MN2/RN | 2048 | 12/8+12 | 48.87 | 28.72 | 19.88 | 74.61 | 49.70 | 32.32

2.1 The Spatio-temporal Feature Extraction

Given an RGB stream of video frames $\{\bm{v}^{s}_{1},\bm{v}^{s}_{2},\ldots\}$ and an optical flow stream of video frames $\{\bm{v}^{t}_{1},\bm{v}^{t}_{2},\ldots\}$, the model extracts the spatio-temporal features of the two video streams. For the local feature extraction, the model first takes a glance at each frame with the corresponding glancer network $f_{\text{G}}$. The resulting cheap and coarse feature is then fed into the corresponding policy network $\pi$ to select the region that contributes the most to the task:

$\tilde{\bm{v}}^{s}_{n}=\pi^{s}(f^{s}_{\text{G}}(\bm{v}^{s}_{n})), \quad \tilde{\bm{v}}^{t}_{n}=\pi^{t}(f^{t}_{\text{G}}(\bm{v}^{t}_{n})), \quad n=1,2,\ldots,$ (1)

where $\tilde{\bm{v}}^{s}_{n}$ and $\tilde{\bm{v}}^{t}_{n}$ are the patches selected from the $n$-th RGB frame and optical flow frame, respectively. The selected patches $\tilde{\bm{v}}^{s}_{n}$, $\tilde{\bm{v}}^{t}_{n}$ are then fed into the corresponding focuser network $f_{\text{F}}$ to extract the local feature maps $\bm{e}^{s}_{\text{L}}$, $\bm{e}^{t}_{\text{L}}$:

$\bm{e}^{s}_{\text{L}}=f^{s}_{\text{F}}(\tilde{\bm{v}}^{s}_{n}), \quad \bm{e}^{t}_{\text{L}}=f^{t}_{\text{F}}(\tilde{\bm{v}}^{t}_{n}), \quad n=1,2,\ldots,$ (2)
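For clarity, a minimal PyTorch-style sketch of the glance-and-focus step in Eqs. (1)–(2) for a single stream is given below; the patch-center parameterization and the module names are illustrative assumptions rather than the exact implementation of [7].

```python
# A minimal PyTorch-style sketch of the glance-and-focus step in Eqs. (1)-(2)
# for a single stream. The patch-center parameterization and the module names
# are illustrative assumptions, not the exact implementation of [7].
import torch
import torch.nn as nn
import torch.nn.functional as F

PATCH = 176  # size of the selected image patch (Sec. 3.1)

class LocalBranch(nn.Module):
    """Glance at a frame, predict a patch location, crop it differentiably, focus."""

    def __init__(self, glancer: nn.Module, policy: nn.Module, focuser: nn.Module):
        super().__init__()
        self.glancer = glancer  # lightweight CNN, e.g. a MobileNet-V2 trunk (f_G)
        self.policy = policy    # maps coarse features to a patch center in [-1, 1]^2 (pi)
        self.focuser = focuser  # heavier CNN, e.g. a ResNet-50 trunk (f_F)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, C, H, W), one video frame per sample
        coarse = self.glancer(frames)             # cheap, coarse global feature
        center = torch.tanh(self.policy(coarse))  # (B, 2), normalized patch center

        B, C, H, W = frames.shape
        # Build an affine grid that maps a PATCH x PATCH output window onto the
        # input region centered at `center` (differentiable bilinear sampling).
        zeros = torch.zeros_like(center[:, 0])
        sx = torch.full_like(zeros, PATCH / W)
        sy = torch.full_like(zeros, PATCH / H)
        theta = torch.stack([
            torch.stack([sx, zeros, center[:, 0]], dim=1),
            torch.stack([zeros, sy, center[:, 1]], dim=1),
        ], dim=1)                                  # (B, 2, 3)
        grid = F.affine_grid(theta, (B, C, PATCH, PATCH), align_corners=False)
        patch = F.grid_sample(frames, grid, align_corners=False)  # selected patch
        return self.focuser(patch)                 # local feature e_L
```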

Finally, the global spatio-temporal features $\bm{e}^{s}_{\text{G}}$, $\bm{e}^{t}_{\text{G}}$ and the audio feature $\bm{e}^{a}_{\text{G}}$ are extracted from the global branch. Note that our model also uses the global features of the audio modality, which are omitted from the figure for simplicity. The global features are then concatenated with the local spatio-temporal features $\bm{e}^{s}_{\text{L}}$, $\bm{e}^{t}_{\text{L}}$ extracted from the local branch to form the final feature $\bm{e}$ fed into the video domain adaptation:

$\bm{e}=\mathrm{Concat}(\bm{e}^{s}_{\text{G}},\bm{e}^{s}_{\text{L}},\bm{e}^{t}_{\text{G}},\bm{e}^{t}_{\text{L}},\bm{e}^{a}_{\text{G}}).$ (3)
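The fusion in Eq. (3) amounts to a simple concatenation along the feature dimension, as sketched below (the frame-level tensor shapes are an assumption for illustration).

```python
# A minimal sketch of the fusion step in Eq. (3): the global RGB, flow and audio
# features and the local RGB and flow features are concatenated along the
# feature dimension before being passed to the video domain adaptation module.
# The (B, T, D) frame-level shape is an assumption for illustration.
import torch

def fuse_features(e_G_rgb, e_L_rgb, e_G_flow, e_L_flow, e_G_audio):
    # each input: (B, T, D_i) frame-level features for one modality/branch
    return torch.cat([e_G_rgb, e_L_rgb, e_G_flow, e_L_flow, e_G_audio], dim=-1)
```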

2.2 The Video Domain Adaptation

With the global and local spatio-temporal features obtained, the model can perform domain alignment more efficiently. For the domain-adaptive training of the global-local features from the source and target domains, we adopt an existing video domain adaptation method for action recognition, i.e., TA3N [1]. As shown in Figure 2, the model first aligns the frame-level features from the source and target domains through the adversarial discriminator $\hat{G}_{sd}$, which produces the corresponding domain loss $L_{sd}$. At the same time, the frame-level features are modeled by the temporal relation module of TA3N, and the resulting relation features are aggregated to obtain video-level features. When aggregating these relation features, a domain attention mechanism is added to pay more attention to aligning the local temporal features with larger domain discrepancy. Within the domain attention mechanism, the adversarial discriminators $\hat{G}^{n}_{rd}$ are used to align the relation features from the source and target domains, producing the corresponding domain losses $L^{n}_{rd}$. Then, the adversarial discriminator $\hat{G}_{td}$ is used to align the video-level features from the source and target domains, producing the domain loss $L_{td}$. Finally, the model classifies the video-level features through the two classifiers $f^{v}_{\text{C}}$ and $f^{n}_{\text{C}}$, generating the verb and noun predictions.
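For intuition, each adversarial discriminator can be implemented as a small domain classifier trained through a gradient reversal layer; a minimal sketch in this spirit is given below (the hidden size and the two-way softmax formulation are assumptions, not the exact configuration of TA3N or of our submission).

```python
# A minimal sketch of one adversarial domain discriminator (e.g., \hat{G}_{sd})
# trained through a gradient reversal layer, in the spirit of TA3N [1]. The
# hidden size and the two-way softmax formulation are assumptions, not the
# exact configuration of TA3N or of our submission.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) the gradient flowing back to the feature extractor.
        return -ctx.lambd * grad_output, None

class DomainClassifier(nn.Module):
    def __init__(self, feat_dim: int, hidden: int = 512, lambd: float = 1.0):
        super().__init__()
        self.lambd = lambd
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, 2),  # two classes: source vs. target
        )

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # The reversed gradient pushes the feature extractor toward
        # domain-confusing (i.e., aligned) representations.
        return self.net(GradReverse.apply(feat, self.lambd))

# The domain losses L_sd, L_td and L_rd are cross-entropy losses against
# the binary source/target domain labels.
domain_criterion = nn.CrossEntropyLoss()
```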

3 Experiments

3.1 Implementation Details

Figure 3: Visualization results of the image patches selected from the spatial local module.

Spatio-temporal feature extraction. Since the spatial and temporal feature extraction networks share the same parameter settings, the following description does not distinguish between them. For the global features, we use the RGB, flow and audio features provided by the organizers, which were extracted with a Temporal Binding Network (TBN) [4] pretrained on the source domain. We follow the model setting in [7] to extract the local features, adopting MobileNet-V2 (MN2) [5] and ResNet-50 (RN) [3] as the glancer network $f_{\text{G}}$ and the focuser network $f_{\text{F}}$, respectively. The policy network selects the image patch that contributes the most to the task from the input video frames via differentiable bilinear interpolation. The network parameters are learned with the SGD optimizer with momentum 0.9 and weight decay $5\times 10^{-4}$. For the networks of the local branches, the initial learning rates of $f_{\text{G}}$, $f_{\text{F}}$ and $\pi$ are set to 0.005, 0.01 and $1\times 10^{-4}$, respectively. For the input video frames, we adopt the same preprocessing as [7] and set the size of the selected image patch to $176\times 176$.
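A minimal sketch of this optimizer setup is given below; the module variables are placeholders for the actual glancer, focuser and policy networks.

```python
# A minimal sketch of the optimizer setup described above: SGD with momentum 0.9,
# weight decay 5e-4, and per-network initial learning rates. The three modules
# below are placeholders standing in for the MobileNet-V2 glancer (f_G), the
# ResNet-50 focuser (f_F) and the policy network (pi).
import torch
import torch.nn as nn

glancer, focuser, policy = nn.Linear(8, 8), nn.Linear(8, 8), nn.Linear(8, 8)

optimizer = torch.optim.SGD(
    [
        {"params": glancer.parameters(), "lr": 5e-3},   # f_G: 0.005
        {"params": focuser.parameters(), "lr": 1e-2},   # f_F: 0.01
        {"params": policy.parameters(),  "lr": 1e-4},   # pi:  1e-4
    ],
    lr=1e-2,             # default learning rate, overridden per group above
    momentum=0.9,
    weight_decay=5e-4,
)
```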

Video domain adaptation. After obtaining the spatio-temporal features of the source and target domains, TA3N [1] is used to align the input features and generate the model predictions. The network parameters are also learned with the SGD optimizer with momentum 0.9 and weight decay $5\times 10^{-4}$. During training, the parameters of the spatio-temporal feature extraction are frozen. The initial learning rate is set to $3\times 10^{-3}$ and decayed by a factor of 0.1 at epochs 10 and 20.
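The adaptation-stage training schedule can be sketched as follows, with placeholder modules standing in for the frozen feature extractor and the TA3N network.

```python
# A minimal sketch of the adaptation-stage training schedule described above:
# the spatio-temporal feature extractor is frozen, and the TA3N part is trained
# with SGD (initial lr 3e-3, momentum 0.9, weight decay 5e-4), decaying the
# learning rate by 0.1 at epochs 10 and 20. Both modules are placeholders.
import torch
import torch.nn as nn

feature_extractor = nn.Linear(8, 8)  # stands in for the frozen global/local branches
ta3n = nn.Linear(8, 8)               # stands in for the TA3N adaptation network

for p in feature_extractor.parameters():
    p.requires_grad_(False)          # freeze during domain-adaptive training

optimizer = torch.optim.SGD(ta3n.parameters(), lr=3e-3, momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[10, 20], gamma=0.1)

# Per epoch: run the training loop, then step the scheduler.
# for epoch in range(num_epochs):
#     train_one_epoch(...)
#     scheduler.step()
```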

3.2 Result

Table 1 shows the action recognition performance of the model on the target validation set under different input and hyper-parameter settings. The table shows that, under the same hyper-parameter setting, the accuracy can be improved by using the local spatio-temporal branch to extract local features. We trained two groups of models under different hyper-parameters, and their performance on the target test set is shown in Table 2. Our proposed method performs favorably against TA3N, improving the top-1 action accuracy by 0.93%. In our final submission, we use the RGB, Flow and Audio modalities, the shared feature dimension is set to 2048, and the numbers of input frames of the glancer and focuser networks are set to 8 and 12, respectively. The visualization results of the image patches selected from the test set by the proposed method are shown in Figure 3. Each row shows a number of image patches selected from consecutive video frames by the spatial local module. Note that the spatial local module is fixed after training with source domain data. It can be seen that the model generalizes well to videos of the target domain.
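For reference, the settings of our final submission can be summarized by the following illustrative configuration sketch (the key names are hypothetical and not taken from our code).

```python
# Summary of the final submission settings described above, written as an
# illustrative configuration dictionary (the key names are hypothetical, not
# taken from our actual code).
final_submission_config = {
    "modalities": ["RGB", "Flow", "Audio"],
    "shared_feat_dim": 2048,   # shared feature dimension in TA3N
    "num_frames_glancer": 8,   # input frames for the glancer network
    "num_frames_focuser": 12,  # input frames for the focuser network
    "patch_size": 176,         # size of the selected image patch
}
```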

4 Conclusion

This paper presents the technical details of our solution for the EPIC-KITCHENS-100 UDA for Action Recognition Challenge. By incorporating a learning-based patch selection strategy into an existing video domain adaptation framework, the proposed method can effectively improve the domain adaptation performance of action recognition. Our work empirically verifies the importance of exploiting informative regions in egocentric videos and provides new inspiration for domain-adaptive action recognition.

References

  • [1] Min-Hung Chen, Zsolt Kira, Ghassan AlRegib, Jaekwon Yoo, Ruxin Chen, and Jian Zheng. Temporal attentive alignment for large-scale video domain adaptation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6321–6330, 2019.
  • [2] Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Antonino Furnari, Evangelos Kazakos, Jian Ma, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. Rescaling egocentric vision. arXiv preprint arXiv:2006.13256, 2020.
  • [3] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
  • [4] Evangelos Kazakos, Arsha Nagrani, Andrew Zisserman, and Dima Damen. Epic-fusion: Audio-visual temporal binding for egocentric action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5492–5501, 2019.
  • [5] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018.
  • [6] Yulin Wang, Zhaoxi Chen, Haojun Jiang, Shiji Song, Yizeng Han, and Gao Huang. Adaptive focus for efficient video recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16249–16258, 2021.
  • [7] Yulin Wang, Yang Yue, Yuanze Lin, Haojun Jiang, Zihang Lai, Victor Kulikov, Nikita Orlov, Humphrey Shi, and Gao Huang. Adafocus v2: End-to-end training of spatial dynamic networks for video recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20062–20072, 2022.