
NVIDIA-UNIBZ Submission for EPIC-KITCHENS-100 Action Anticipation Challenge 2022

Tsung-Ming Tai1,2, Oswald Lanz2, Giuseppe Fiameni1, Yi-Kwan Wong1    Sze-Sen Poon1, Cheng-Kuang Lee1, Ka-Chun Cheung1, Simon See1
1NVIDIA, 2Free University of Bozen-Bolzano
{tstai,oswald.lanz}@unibz.it
{gfiameni,gwong,spoon,ckl,chcheung,ssee}@nvidia.com
Abstract

In this report, we describe the technical details of our submission to the EPIC-KITCHENS-100 action anticipation challenge. Our models, the Higher-Order Recurrent Space-Time Transformer and the Message-Passing Neural Network with Edge Learning, are both recurrent architectures that observe only a 2.5-second inference context to form the action anticipation prediction. By averaging the prediction scores from a set of models compiled with our proposed training pipeline, we achieved strong performance on the test set: 19.61% overall mean top-5 recall, recorded as second place on the public leaderboard.

1 Introduction

Forecasting future events based on evidence of current conditions is an innate skill of human beings, and key to predicting the outcome of any decision making. Anticipating "what will happen next?" is natural for humans, but not for machines. In computer vision, the same question arises in video action anticipation. Recognizing human actions from a given video clip is a long-standing and widely studied problem. However, predicting the future action from the given observations has only recently attracted increasing interest. Unlike in action recognition, the target action in action anticipation only stands in causal relation to the signal in the observed sub-clip and is not directly observable; it must be forecast as one possible consequence of the already observed video context. EPIC-KITCHENS-100 [2] is the largest dataset defining the video action anticipation task. It considers 97 verbs and 300 nouns, whose unique verb-noun pairs define 3807 action categories. The dataset is provided with pre-extracted RGB, optical flow, object bounding box, and object mask modalities in this competition.

We participated in the video action anticipation challenge with two of our proposed models:

  • Higher-Order Recurrent Space-Time Transformer [8]: A recurrent network with space-time decomposition attention and a higher-order recurrence design.

  • Message-Passing Neural Network with Edge Learning [9]: A recurrent network based on the message-passing framework. It models the sequential structure as a graph with a set of vertices and edges and learns the edge connectivity by different strategies.

Both the Higher-Order Recurrent Space-Time Transformer (HORST) and the Message-Passing Neural Network with Edge Learning (MPNNEL) are recurrent architectures, but they learn spatial-temporal dependencies in different ways. HORST builds n-gram temporal modeling by considering higher-order recurrence with temporal attention, and dynamically attends to the relevant spatial information with spatial attention. MPNNEL, on the other hand, projects the spatial context of the frame input at each timestep onto an internal graph representation and leverages the message-passing framework to capture temporal propagation. MPNNEL also learns to augment the edge connectivity with different end-to-end learning strategies. Both models operate on features extracted by a frame-based 2D-CNN backbone. The final score for this competition was obtained by late fusion of all training variants of HORST and MPNNEL, averaging the prediction scores of the individual models across different modalities.

The remainder of this report is organized as follows. The applied models are described in Section 2 and the proposed training techniques in Section 3. Experimental results are reported in Section 4. Finally, Section 5 contains concluding remarks.

Figure 1: Overview of the HORST architecture. The HORST cell consists of a lightweight spatial-temporal attention and an internal first-in first-out queue that maintains the previous states for the higher-order recurrence design.
Figure 2: Left: The space-time decomposition attention used in HORST; Right: The self-attention proposed in [10].

2 Model Architecture

We briefly introduce the HORST and MPNNEL architectures, the two models we used in this competition.

2.1 HORST Model

To exploit the effective information in the space-time structure, we proposed space-time decomposition attention, a lightweight and computation-efficient attention that integrates spatial and temporal operators from separate branches, as shown in Figure 2. To define the spatial and temporal branch operators, a spatial filter is introduced that identifies the relevant spatial information from the max- and mean-pooled features of the input:

$f_{\mathcal{X}}(X) = \text{sigmoid}(\theta_{\mathcal{X}} * [X_{max}, X_{avg}] + b_{\mathcal{X}})$ (1)

where $*$ denotes convolution, $X_{avg}$ and $X_{max}$ are the channel-wise mean- and max-pooled features, and $\theta_{\mathcal{X}}$ and $b_{\mathcal{X}}$ are the convolution kernel and bias.
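A minimal PyTorch sketch of the spatial filter in Eq. (1); the kernel size and module names are our own illustrative assumptions, not taken from the released code:

```python
import torch
import torch.nn as nn

class SpatialFilter(nn.Module):
    """Sketch of Eq. (1): a sigmoid gate over channel-pooled features.

    The kernel size (7) and padding are illustrative assumptions; the
    released HORST code may use different hyper-parameters.
    """
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        # Two input channels: channel-wise max and mean maps; one output gate map.
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map from the backbone.
        x_max, _ = x.max(dim=1, keepdim=True)   # (B, 1, H, W)
        x_avg = x.mean(dim=1, keepdim=True)     # (B, 1, H, W)
        gate = torch.sigmoid(self.conv(torch.cat([x_max, x_avg], dim=1)))
        return gate                             # f_X(X) in Eq. (1), values in (0, 1)
```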

The general higher-order recurrent network [6, 7, 11] takes the following form:

$h_{t} = f(x_{t}, \phi(h_{t-1:t-S})),$ (2)

where the hidden state at time $t$, $h_{t}$, is computed by the cell function $f$ from the input $x_{t}$ and the $S$ previous states $h_{t-1:t-S}$ aggregated by the function $\phi$.

The HORST cell can be viewed as instantiating $\phi$ with space-time decomposition attention while maintaining the previous states $h_{t-1:t-S}$ in an internal queue under a first-in first-out update policy. The overall design is shown in Figure 1. At each step $t$, we process the video frame with a 2D-CNN backbone to obtain a feature map and encode it into an intermediate representation. This representation serves as the query and is cross-referenced with the historical states via the space-time decomposition attention. The attention output is then pushed to the queue while the oldest state is released. The cell output finally propagates to the classifier. More details are found in [8]; we build our HORST models for this competition on the codebase published at https://github.com/CorcovadoMing/HORST.
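The recurrence of Eq. (2) combined with the FIFO queue can be sketched as the control flow below; the attention module and all callables are placeholders, so this is an outline of the data flow rather than the actual HORST implementation:

```python
from collections import deque
import torch

def horst_rollout(frames, backbone, encoder, attention, classifier, order_S=4):
    """Sketch of the HORST recurrence: a FIFO queue holding the last S states.

    `backbone`, `encoder`, `attention` and `classifier` stand in for the
    2D-CNN, the intermediate projection, the space-time decomposition
    attention and the prediction head; only the control flow is shown.
    """
    queue = deque(maxlen=order_S)          # drops the oldest state automatically
    logits = None
    for frame in frames:                   # frame: (B, 3, H, W)
        feat = encoder(backbone(frame))    # query at step t
        if len(queue) > 0:
            history = torch.stack(list(queue), dim=1)   # stacked h_{t-1:t-S}
            state = attention(feat, history)            # f(x_t, phi(...)) in Eq. (2)
        else:
            state = feat                                # no history at t = 0
        queue.append(state)                # FIFO update: push new, release oldest
        logits = classifier(state)
    return logits                          # anticipation scores at the last step
```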

Figure 3: Left: Implicit edge estimation by multi-head self-attention; Middle: Augmented edge learning by the outer product of class tokens supervised by the verb and noun annotations; Right: Augmented edge learning by introducing a joint learnable template bank.
Figure 4: Overview of the MPNNEL architecture. The message function is extendable with explicit edge estimation via different edge learning strategies.

2.2 MPNNEL Model

MPNNEL translates the anticipation problem into a message-passing scheme, producing a graph-structured space-time representation. The connectivity of the graph structure is inferred from the input at each time step, and the readout function is called whenever a prediction is required. The proposed model utilizes only multi-head self-attention for information routing between vertices. The overall architecture is illustrated in Figure 4. Note that the resulting spatial graph is bi-directed when an adjacency matrix $A$ is provided, and un-directed otherwise.

Without any prior knowledge, we assume that each vertex in the graph can be accessed by any other vertex. In this case, the scaled dot-product in the self-attention computes the pairwise similarity of all vertices from the inputs, which can be viewed as an implicit edge estimation. This can be extended by optionally providing the edge estimation explicitly with one of the following strategies, also shown in Figure 3 (a sketch of the first strategy is given after this list):

  • Template Bank (TB), which forms the estimation of edge connections by soft-fusing a set of learnable templates using weights computed from the frame input.

  • Class Token Projection (CTP), which constructs the edge estimation from the outer product of class tokens; the class tokens are supervised by the provided verb and noun labels.
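As an illustration of the Template Bank strategy, a minimal sketch under our own assumptions about shapes, bank size, and the pooling used for the fusion weights (the released MPNNEL code may differ):

```python
import torch
import torch.nn as nn

class TemplateBank(nn.Module):
    """Sketch of edge estimation by soft-fusing learnable templates.

    N is the number of graph vertices, K the bank size; both values and the
    mean pooling used to compute the fusion weights are illustrative assumptions.
    """
    def __init__(self, num_vertices: int, feat_dim: int, bank_size: int = 16):
        super().__init__()
        self.templates = nn.Parameter(torch.randn(bank_size, num_vertices, num_vertices))
        self.to_weights = nn.Linear(feat_dim, bank_size)

    def forward(self, vertex_feats: torch.Tensor) -> torch.Tensor:
        # vertex_feats: (B, N, D) frame input projected onto the graph vertices.
        pooled = vertex_feats.mean(dim=1)                    # (B, D) frame summary
        w = torch.softmax(self.to_weights(pooled), dim=-1)   # (B, K) fusion weights
        edges = torch.einsum('bk,knm->bnm', w, self.templates)
        return edges                                         # (B, N, N) edge estimation
```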

More details are found in [9]; we build our MPNNEL models for this competition on the codebase published at https://github.com/CorcovadoMing/MPNNEL.

3 Model Training

In this section, we describe the four-phase training pipeline used to efficiently train all our models for this competition, as well as the class weightings applied in the loss function to cope with the imbalanced class distribution.

3.1 Training Phases

Figure 5: Illustration of the different training phases. The backbone model is only trainable in the warmup phase and remains frozen in the rest of the phases.

We trained all of our models through four learning phases: (i) warmup; (ii) ordinary training; (iii) finetuning; and (iv) finetuning with the joint validation set. The phases are illustrated in Figure 5.

The details of the different training phases are:

  • Warmup Phase: The model is trained end-to-end on the target dataset. Only in this phase may the model access the actual action frames beyond the anticipation limit.

  • Ordinary Phase: The model is trained with the backbone frozen; the action frames are not accessible in this or any following stage.

  • Finetune Phase: The model is trained with the backbone frozen, under a lower learning rate, and with the class weightings adjusted in the loss functions.

  • Finetune with Joint Validation Set: The model is trained with the backbone frozen, under a lower learning rate, and with the class weightings adjusted. Additionally, the validation samples are included in the supervised training.

The warmup phase aims to build a strong feature extractor for the competition: the backbone receives gradients and the action frames are observable exclusively in this phase. The ordinary phase focuses on the anticipation task and trains the HORST and MPNNEL architectures with the feature extractor kept frozen. The finetune phases learn to distinguish hard samples and tail cases through the class weighting adjustments, with the validation set additionally included in the last phase. Each phase resumes from the previous one, and every model training goes through the complete learning rate schedule.
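The per-phase backbone freezing and learning-rate choices can be summarized in a small helper; this is a sketch assuming the model exposes a `backbone` submodule, with the concrete values taken from Section 4.1:

```python
def configure_phase(model, phase: str):
    """Freeze/unfreeze the backbone and pick learning-rate bounds per phase.

    Only the warmup phase trains the backbone end-to-end; later phases keep it
    frozen, and the finetune phases use a 10x lower learning rate (Section 4.1).
    `model.backbone` is an assumed attribute name for illustration.
    """
    train_backbone = (phase == "warmup")
    for p in model.backbone.parameters():
        p.requires_grad = train_backbone

    if phase in ("warmup", "ordinary"):
        lr_start, lr_end, epochs = 1e-4, 1e-6, 50
    else:  # "finetune", "finetune_joint_val"
        lr_start, lr_end, epochs = 1e-5, 1e-7, 20
    return lr_start, lr_end, epochs
```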

3.2 Class Weightings

We adjusted the class weightings of the cross-entropy loss for individual verbs, nouns, and actions during the finetuning stages. The adjustment is based on the label frequencies summarized from the training set. Note that the action distribution defined in EPIC-KITCHENS-100 is a joint distribution over verbs and nouns, so the label frequency of an action class can differ from the individual frequencies of its verb and noun. We empirically found that this adjustment brings additional regularization to the model learning and yields noticeable gains of up to 4% on the validation set.
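Since the exact weighting formula is not spelled out above, the sketch below uses a plain inverse-frequency weighting as an illustrative assumption of how such class weightings can be attached to the cross-entropy loss:

```python
import torch
import torch.nn as nn

def make_weighted_ce(labels: torch.Tensor, num_classes: int) -> nn.CrossEntropyLoss:
    """Build a cross-entropy loss weighted by inverse label frequency.

    `labels` are the training-set annotations for one task (verb, noun or
    action); classes never observed get weight 1.0 by convention here.
    """
    counts = torch.bincount(labels, minlength=num_classes).float()
    weights = torch.where(counts > 0,
                          counts.sum() / (num_classes * counts),
                          torch.ones_like(counts))
    return nn.CrossEntropyLoss(weight=weights)
```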

4 Experiments

We provide implementation details and discuss the choices that led to our public record in the competition leaderboard.

4.1 Implementation Details

We prepared each input modality as follows. RGB frames are resized to 224x224 and the pixel values are scaled from [0, 255] to [-1, 1]. The Flow modality comes with two maps describing horizontal and vertical optical displacement; we stacked the two maps along the channel dimension and resized them to 224x224, with pixel intensities scaled to [-1, 1]. Inspired by [3], the Obj modality is formed by summarizing the object detection confidences from the officially provided object features, discarding the location information of the bounding boxes. Masked-RGB is the modality obtained by multiplying a mask, extracted from a pretrained Mask R-CNN, with the RGB input.
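A PyTorch-style sketch of the RGB and Flow preprocessing described above; the interpolation mode and the assumption that flow maps are stored as 8-bit images are our own illustrative choices:

```python
import torch
import torch.nn.functional as F

def preprocess_rgb(frame_uint8: torch.Tensor) -> torch.Tensor:
    # frame_uint8: (3, H, W) with values in [0, 255]; resize to 224x224, scale to [-1, 1].
    frame = frame_uint8.float().unsqueeze(0)
    frame = F.interpolate(frame, size=(224, 224), mode='bilinear', align_corners=False)
    return frame.squeeze(0) / 127.5 - 1.0

def preprocess_flow(u: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    # u, v: (1, H, W) horizontal/vertical flow maps; stack on the channel axis.
    flow = torch.cat([u, v], dim=0).float().unsqueeze(0)
    flow = F.interpolate(flow, size=(224, 224), mode='bilinear', align_corners=False)
    return flow.squeeze(0) / 127.5 - 1.0   # assumes flow is stored as [0, 255] images
```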

RandAugment [1] is applied to the RGB, Masked-RGB, and Flow inputs. Video clips for training and inference are sampled at 4 FPS (i.e., a 0.25s step), as inherited from the RU-LSTM baseline [3]. During training, each sample contains 14 sequential frames, from 3.5s to 0.25s before the action starts. However, the last 3 frames must not be accessed in this competition since the anticipation time is set to 1s. The total length of the inference context in our models is therefore 2.5s (observed from 3.5s to 1s before the action).
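The observed context can be made explicit with a small helper that enumerates the frame timestamps relative to the annotated action start; the values follow the text above, the helper itself is illustrative:

```python
def context_timestamps(action_start: float, anticipation: float = 1.0,
                       fps: float = 4.0, num_frames: int = 14):
    """Return all training timestamps and the subset observable at inference.

    With a 0.25 s step and 14 frames the full window spans 3.5 s to 0.25 s
    before the action; at 1 s anticipation only the first 11 frames
    (3.5 s down to 1.0 s, i.e. a 2.5 s context) remain observable.
    """
    step = 1.0 / fps
    all_ts = [action_start - step * (num_frames - i) for i in range(num_frames)]
    observable = [t for t in all_ts if action_start - t >= anticipation]
    return all_ts, observable
```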

We trained our models with batch size 32 on 4 × NVIDIA A100 GPUs. AdaBelief [13] in combination with the lookahead optimizer [12] is adopted. Weight decay is set to 0.001. The learning rate is set to 1e-4 and decreased to 1e-6 for the warmup and ordinary training phases, and set to 1e-5 and decreased to 1e-7 for the finetune phases. The learning rate scheduling uses FlatCosine, which keeps the initial learning rate for the first 75% of the total epochs and switches to a cosine schedule for the last 25% (see also Figure 5). The total number of epochs is 50 for warmup and ordinary training, and 20 for the finetune phases.
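The FlatCosine schedule reduces to a simple function of the epoch index; a minimal sketch (our actual code may wrap a library scheduler instead):

```python
import math

def flat_cosine_lr(epoch: int, total_epochs: int, lr_init: float, lr_final: float,
                   flat_ratio: float = 0.75) -> float:
    """Keep lr_init for the first `flat_ratio` of epochs, then cosine-decay to lr_final."""
    flat_epochs = int(total_epochs * flat_ratio)
    if epoch < flat_epochs:
        return lr_init
    progress = (epoch - flat_epochs) / max(1, total_epochs - flat_epochs)
    return lr_final + 0.5 * (lr_init - lr_final) * (1.0 + math.cos(math.pi * progress))
```

For example, with `total_epochs=50`, `lr_init=1e-4`, and `lr_final=1e-6`, the learning rate stays at 1e-4 until epoch 37 and then decays to 1e-6 by epoch 50.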

4.2 Individual Models

Table 1: Individual model performance on the validation set, measured in mean top-5 action recall (MT5R) at 1s anticipation, for various modalities, models, and backbones.
Model Modality Backbone MT5R (%)
HORST RGB Swin-B 18.42
HORST RGB ConvNeXt 17.09
MPNNEL RGB Swin-B 17.05
MPNNEL (CTP) RGB Swin-B 18.18
MPNNEL (TB) RGB Swin-B 17.05
MPNNEL RGB ConvNeXt 17.18
MPNNEL (CTP) RGB ConvNeXt 18.54
MPNNEL (TB) RGB ConvNeXt 18.09
HORST Flow Swin-B 7.95
HORST Flow ConvNeXt 7.36
HORST Flow (Snippets) Swin-B 6.61
HORST Flow (Snippets) ConvNeXt 8.06
MPNNEL Flow Swin-B -
MPNNEL (CTP) Flow Swin-B 6.66
MPNNEL (TB) Flow Swin-B -
MPNNEL Flow ConvNeXt 7.59
MPNNEL (CTP) Flow ConvNeXt 8.74
MPNNEL (TB) Flow ConvNeXt 8.18
HORST Obj None 8.72
MPNNEL Obj None 9.69
MPNNEL (CTP) Obj None 8.80
MPNNEL (TB) Obj None 8.99
HORST Masked-RGB Swin-B 12.03
HORST Masked-RGB ConvNeXt 11.30
MPNNEL Masked-RGB Swin-B 9.22
MPNNEL (CTP) Masked-RGB Swin-B 7.87
MPNNEL (TB) Masked-RGB Swin-B 9.57
MPNNEL Masked-RGB ConvNeXt 9.65
MPNNEL (CTP) Masked-RGB ConvNeXt 8.53
MPNNEL (TB) Masked-RGB ConvNeXt 10.30

Unlike the other modalities, which have a spatial-temporal structure, the Obj modality is presented as a temporal sequence of frame vectors, where each vector contains the frame-level object scores computed by a pretrained object detector. We modified the HORST and MPNNEL models to support this 1D object vector representation: in HORST, the 2D convolutions are replaced with fully-connected layers; in MPNNEL, the vertices are defined by learnable vectors multiplied by the corresponding object scores. Some models operate on the Flow modality as snippets, as suggested in [3], where the 5 previous sequential Flow features are stacked.
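A sketch of how the Obj vector can define the MPNNEL vertices: each object class owns a learnable embedding that is scaled by its detection confidence. Shapes and names are our own assumptions for illustration:

```python
import torch
import torch.nn as nn

class ObjVertices(nn.Module):
    """Map a frame-level object-score vector to graph vertices."""
    def __init__(self, num_objects: int, feat_dim: int):
        super().__init__()
        # One learnable embedding per object class.
        self.embeddings = nn.Parameter(torch.randn(num_objects, feat_dim))

    def forward(self, obj_scores: torch.Tensor) -> torch.Tensor:
        # obj_scores: (B, num_objects) detection confidences for one frame.
        return obj_scores.unsqueeze(-1) * self.embeddings   # (B, num_objects, feat_dim)
```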

For all of our models we considered the Swin Transformer (base configuration, Swin-B) [4] and ConvNeXt [5] as backbones. We show the validation results of each representative category in Table 1. Note that the results reported in Table 1 are obtained before training with the joint validation set, in order to keep the numbers meaningful.

4.3 Model Ensemble

Table 2: Test set performance of model ensembles, measured in overall mean top-5 recall (MT5R).
Model MT5R (%)
(a) HORST Family with all modalities 17.47
(b) MPNNEL Family with all modalities 18.19
(a) + (b) 19.52
(a) + (b) and weightings 1.2x on all RGB models 19.61

We manually selected the strongest models from each individual variant and tried to balance HORST and MPNNEL instances to maintain diversity among the ensembled models. Our best submission, 19.61% overall mean top-5 recall, was achieved by an ensemble of 54 models in total: 30 RGB models, 10 Flow models, 8 Obj models, and 6 Masked-RGB models.

Table 2 reports the trajectory that led to our highest-scoring submission. Averaging the prediction scores within the HORST family resulted in 17.47% overall mean top-5 recall on the test set, and 18.19% within the MPNNEL family. Combining HORST and MPNNEL further improved the score significantly, to 19.52%, indicating some degree of complementarity between the two recurrent models. We also empirically found that emphasizing the prediction scores of the RGB models brings additional performance gains: in our best submission we weighted all RGB models by a factor of 1.2x relative to the other modalities.
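The late fusion itself reduces to a weighted average of per-model prediction scores; a minimal sketch with the 1.2x RGB weighting (model grouping and names are illustrative):

```python
import torch

def late_fusion(scores_by_model: dict, rgb_models: set, rgb_weight: float = 1.2) -> torch.Tensor:
    """Average per-model action scores, up-weighting RGB models by `rgb_weight`.

    `scores_by_model` maps a model name to a (num_samples, num_actions) tensor
    of prediction scores for the test set.
    """
    total, weight_sum = None, 0.0
    for name, scores in scores_by_model.items():
        w = rgb_weight if name in rgb_models else 1.0
        total = scores * w if total is None else total + scores * w
        weight_sum += w
    return total / weight_sum
```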

5 Conclusion

In this report, we presented the technical details of our submission, which achieves an overall 19.61% mean top-5 recall in the EPIC-KITCHENS-100 action anticipation challenge 2022. Our method considered the Higher-Order Recurrent Space-Time Transformer (HORST) and the Message-Passing Neural Network with Edge Learning (MPNNEL), which are both recurrent architectures and observe only a 2.5s inference context for action anticipation. Combined with the proposed training pipeline and by averaging the prediction scores of models trained on various modalities, our submission recorded second place on the public leaderboard.

References

  • [1] Ekin D Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V Le. Randaugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 702–703, 2020.
  • [2] Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Antonino Furnari, Evangelos Kazakos, Jian Ma, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, et al. Rescaling egocentric vision: Collection, pipeline and challenges for epic-kitchens-100. International Journal of Computer Vision, 130(1):33–55, 2022.
  • [3] Antonino Furnari and Giovanni Maria Farinella. Rolling-unrolling lstms for action anticipation from first-person video. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(11):4021–4036, 2020.
  • [4] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022, 2021.
  • [5] Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s. arXiv:2201.03545, 2022.
  • [6] Rohollah Soltani and Hui Jiang. Higher order recurrent neural networks. arXiv:1605.00064, 2016.
  • [7] Jiahao Su, Wonmin Byeon, Jean Kossaifi, Furong Huang, Jan Kautz, and Anima Anandkumar. Convolutional tensor-train lstm for spatio-temporal learning. In Advances in Neural Information Processing Systems, volume 33, pages 13714–13726, 2020.
  • [8] Tsung-Ming Tai, Giuseppe Fiameni, Cheng-Kuang Lee, and Oswald Lanz. Higher order recurrent space-time transformer for video action prediction. arXiv:2104.08665, 2021.
  • [9] Tsung-Ming Tai, Giuseppe Fiameni, Cheng-Kuang Lee, Simon See, and Oswald Lanz. Unified recurrence modeling for video action anticipation. arXiv:2206.01009, 2022.
  • [10] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, 2017.
  • [11] Rose Yu, Stephan Zheng, Anima Anandkumar, and Yisong Yue. Long-term forecasting using tensor-train rnns. arXiv:1711.00073, 2017.
  • [12] Michael Zhang, James Lucas, Jimmy Ba, and Geoffrey E Hinton. Lookahead optimizer: k steps forward, 1 step back. In Advances in Neural Information Processing Systems, volume 32, 2019.
  • [13] Juntang Zhuang, Tommy Tang, Yifan Ding, Sekhar C. Tatikonda, Nicha C. Dvornek, Xenophon Papademetris, and James S. Duncan. Adabelief optimizer: Adapting stepsizes by the belief in observed gradients. In Advances in Neural Information Processing Systems, 2020.