Exploiting Semantic Role Contextualized Video Features
for Multi-Instance Text-Video Retrieval
EPIC-KITCHENS-100 Multi-Instance Retrieval Challenge 2022
Abstract
In this report, we present our approach for the EPIC-KITCHENS-100 Multi-Instance Retrieval Challenge 2022. We first parse sentences into semantic roles corresponding to verbs and nouns; we then use self-attention to exploit semantic-role-contextualized video features, aligning them with textual features via triplet losses in multiple embedding spaces. Our method surpasses the strong baseline in normalized Discounted Cumulative Gain (nDCG), which better reflects semantic similarity. Our submission is ranked 3rd in nDCG and 4th in mAP.
1 Introduction
With the rise of videos uploaded by users via social media channels, cross-modal retrieval of video data and natural language descriptions has gained popularity. The goal of video-to-text retrieval, given a query action segment, is to rank captions in a gallery set so that those with a higher rank are more semantically related to the video action. Text-to-video retrieval, on the other hand, ranks videos based on a query caption.
While most methods [8, 10, 3] use a single joint embedding space to align video and text features, recent methods [1, 15] match video features to textual features in separate noun and verb embedding spaces, but they do not consider the interactions between these spaces. Moreover, parsing text into verb and noun levels is relatively easy, since an off-the-shelf toolkit can be used. However, mapping a video feature into the action and object levels, which correspond to the verb and noun levels in text, remains challenging.
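As a concrete illustration of the text side, the minimal sketch below shows how an off-the-shelf parser such as spaCy (used in Section 3.1) can split a caption into verb and noun tokens. The function name and the simple PoS grouping are illustrative assumptions, not the exact parsing code used by the baseline.

```python
# Minimal sketch: splitting a caption into verb and noun tokens with spaCy.
# Assumes the small English pipeline (en_core_web_sm) is installed; JPoSE applies
# its own post-processing on top of the raw PoS tags.
import spacy

nlp = spacy.load("en_core_web_sm")  # English pipeline with a PoS tagger

def split_caption(caption: str):
    doc = nlp(caption)
    verbs = [tok.lemma_ for tok in doc if tok.pos_ == "VERB"]
    nouns = [tok.lemma_ for tok in doc if tok.pos_ in ("NOUN", "PROPN")]
    return verbs, nouns

print(split_caption("cut the onion on the chopping board"))
# e.g. (['cut'], ['onion', 'board'])
```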
Inspired by [11, 12, 5], we add self-attention on top of the baseline, JPoSE [15], to exploit visual features by leveraging the context from the nouns and verbs of the text query; details are given in the following section. While we outperform the strong baseline in normalized Discounted Cumulative Gain (nDCG), which better reflects semantic similarity, we fall short in mean Average Precision (mAP), a traditional measure of binary relevance. Our approach is ranked third in nDCG and fourth in mAP. We also analyze several failure cases to save time for researchers who follow up on this task.
2 Method
We follow the baseline work [15]: given a video and a query text, we learn a pair of functions that map videos and texts into a joint embedding space, in which embeddings of matched text-video pairs are close together and embeddings of mismatched pairs are far apart. A suitable embedding space should also keep related videos and texts close to one another.
With this motivation, we first parse each caption into the noun and verb levels, followed by linear layers. We embed the corresponding video features with linear layers and use a self-attention layer to obtain contextualized features. We then concatenate the textual and visual features and compute the distance between these representations, which is optimized with triplet losses. More details on the loss functions are given in Eq. 1 and in the baseline paper [15], and the architecture details are shown in Fig. 1.
In Eq. 1, the first two rows refer to cross-modal losses, and the last two rows indicate within-modal losses. The text embedding function g denotes two fully connected layers, while the video embedding function f denotes two linear layers followed by one self-attention layer. γ refers to the constant margin (one per loss term), and d is the distance function. v refers to the selected video, t to its caption, and the superscripts + and − denote positive and negative samples, respectively.
$$
\begin{aligned}
\mathcal{L} \;=\; & \textstyle\sum_{i} \big[\,\gamma + d\big(f(v_i),\, g(t_i^{+})\big) - d\big(f(v_i),\, g(t_i^{-})\big)\big]_{+} \\
+\; & \textstyle\sum_{i} \big[\,\gamma + d\big(g(t_i),\, f(v_i^{+})\big) - d\big(g(t_i),\, f(v_i^{-})\big)\big]_{+} \\
+\; & \textstyle\sum_{i} \big[\,\gamma + d\big(f(v_i),\, f(v_i^{+})\big) - d\big(f(v_i),\, f(v_i^{-})\big)\big]_{+} \\
+\; & \textstyle\sum_{i} \big[\,\gamma + d\big(g(t_i),\, g(t_i^{+})\big) - d\big(g(t_i),\, g(t_i^{-})\big)\big]_{+}
\end{aligned}
\tag{1}
$$
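For clarity, here is a minimal PyTorch-style sketch of one cross-modal term and one within-modal term of Eq. 1. The tensor names, the Euclidean distance, and the equal weighting of terms are illustrative assumptions, not the exact training code.

```python
# Minimal sketch of the triplet losses in Eq. 1 (one cross-modal and one
# within-modal term). Names, Euclidean distance, and equal loss weights are
# illustrative assumptions.
import torch
import torch.nn.functional as F

def triplet(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss with a pairwise distance d (Euclidean here)."""
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    return torch.clamp(margin + d_pos - d_neg, min=0.0).mean()

# f_v: embedded videos; g_t_pos / g_t_neg: embedded matching / non-matching captions
f_v, g_t_pos, g_t_neg = (torch.randn(64, 256) for _ in range(3))
f_v_pos, f_v_neg = torch.randn(64, 256), torch.randn(64, 256)

cross_modal = triplet(f_v, g_t_pos, g_t_neg)    # video anchor, text pos/neg
within_modal = triplet(f_v, f_v_pos, f_v_neg)   # video anchor, video pos/neg
loss = cross_modal + within_modal               # Eq. 1 sums four such terms
```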
Eq. 2 shows that the visual features v are fed into the self-attention layer to obtain the encoded representation h; a feed-forward layer FF then outputs the final contextualized appearance feature. Layer normalization is applied by the Norm function.
$$
\begin{aligned}
h &= \mathrm{Norm}\big(v + \mathrm{MultiHead}(v, v, v)\big) \\
\tilde{v} &= \mathrm{Norm}\big(h + \mathrm{FF}(h)\big)
\end{aligned}
\tag{2}
$$
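A minimal PyTorch sketch of the encoder block in Eq. 2 is given below; the embedding size, the number of heads, and the feed-forward width are assumptions for illustration.

```python
# Minimal sketch of the self-attention encoder block in Eq. 2 (post-norm
# residual ordering). Embedding size and head count are illustrative assumptions.
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    def __init__(self, dim=1024, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, v):  # v: (batch, seq, dim) video features
        h = self.norm1(v + self.attn(v, v, v, need_weights=False)[0])  # Eq. 2, row 1
        return self.norm2(h + self.ff(h))                              # Eq. 2, row 2

out = ContextEncoder()(torch.randn(2, 3, 1024))  # contextualized features, same shape
```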
For the multi-headed attention layer, we follow [13], as formulated in Eq. 3. All of the W matrices are learned during training. Since it is a self-attention layer, the query Q is the same as the key K and the value V. After each attention layer, layer normalization and a residual connection are applied.
$$
\begin{aligned}
\mathrm{Attention}(Q, K, V) &= \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V \\
\mathrm{head}_i &= \mathrm{Attention}\big(Q W_i^{Q},\, K W_i^{K},\, V W_i^{V}\big) \\
\mathrm{MultiHead}(Q, K, V) &= \mathrm{Concat}\big(\mathrm{head}_1, \ldots, \mathrm{head}_h\big)\, W^{O}
\end{aligned}
\tag{3}
$$
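The sketch below spells out the scaled dot-product attention of Eq. 3 directly in PyTorch tensor operations for a single head; the shapes and random projection matrices are illustrative.

```python
# Minimal sketch of Eq. 3: scaled dot-product attention for a single head.
# Shapes and the random projection matrices are illustrative assumptions;
# MultiHead concatenates several heads and applies the output matrix W^O.
import math
import torch

def attention(Q, K, V):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # (seq, seq) similarities
    return torch.softmax(scores, dim=-1) @ V           # weighted sum of values

x = torch.randn(3, 1024)                                    # self-attention: Q = K = V come from x
W_q, W_k, W_v = (torch.randn(1024, 64) for _ in range(3))   # learned in practice
head = attention(x @ W_q, x @ W_k, x @ W_v)                 # one head of MultiHead
```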
Table 1: Comparison to baselines (top) and to other challenge participants (bottom), in mean Average Precision (mAP) and normalised Discounted Cumulative Gain (nDCG). Leaderboard ranks are given in parentheses.

Comparison to Baselines

| Method | mAP Avg. | mAP T2V | mAP V2T | nDCG Avg. | nDCG T2V | nDCG V2T |
|---|---|---|---|---|---|---|
| MI-MM | 27.58 | 23.08 | 32.09 | 42.10 | 40.48 | 43.72 |
| JPoSE* | 43.95 | 38.18 | 49.71 | 53.40 | 51.60 | 55.21 |
| JPoSE [15] | 44.01 | 38.11 | 49.91 | 53.53 | 51.55 | 55.51 |
| DCRL [6] | 44.23 | 38.49 | 49.96 | 53.56 | 51.83 | 55.28 |
| Our Method | 42.81 | 38.10 | 47.52 | 55.33 | 54.12 | 56.55 |

Comparison to Other Users

| User | mAP Avg. | mAP T2V | mAP V2T | nDCG Avg. | nDCG T2V | nDCG V2T |
|---|---|---|---|---|---|---|
| haoxiaoshuai | 44.02 (3) | 38.34 (3) | 49.69 (3) | 53.06 (4) | 51.31 (4) | 54.82 (4) |
| Our Method | 42.81 (4) | 38.10 (4) | 47.52 (4) | 55.33 (3) | 54.12 (3) | 56.55 (3) |
| afalcon | 49.77 (1) | 44.39 (1) | 55.15 (1) | 61.02 (2) | 58.88 (2) | 63.16 (2) |
| kevin.lin | 47.39 (2) | 40.95 (2) | 53.84 (2) | 61.44 (1) | 59.60 (1) | 63.29 (1) |
3 Experiments
3.1 Implementation Details
One of the constant margins is set to 2.0, while the others are set to 1.0. The batch size is 64, and the learning rate is 0.01.
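For reference, these hyperparameters can be collected as in the sketch below; which loss term uses the 2.0 margin is not specified in this report, so it is left unassigned here.

```python
# Hyperparameters as reported in this section. The mapping of the 2.0 margin to a
# specific loss term is not stated, so it is kept generic here.
config = {
    "margin_special": 2.0,   # used by one loss term (term unspecified)
    "margin_default": 1.0,   # used by all other loss terms
    "batch_size": 64,
    "learning_rate": 0.01,
}
```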
Dataset. We conduct experiments on the EPIC-KITCHENS-100 dataset [2], a collection of unscripted egocentric action videos recorded around the world, to demonstrate the effectiveness of our approach.
Features. We use the video features extracted by TBN [7]. They are provided as a Python dictionary containing RGB, flow, and audio features, each stored as an n×25×1024 matrix, where n is the number of video clips. The training and test sets contain 67,217 and 9,668 pairs, respectively. Each feature is temporally mean-pooled, reducing its shape to n×1×1024. We use the textual features provided by [15], obtained with a Word2Vec model trained on the Wikipedia corpus. The spaCy parser [4] is used to disentangle each caption into different PoS tags. The model is trained with the default values of the baseline.
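The sketch below illustrates the temporal mean pooling applied to these features; the dictionary keys are assumed for illustration, since the report only states the feature shapes.

```python
# Minimal sketch of temporal mean pooling over the TBN features described above.
# The dictionary keys ('rgb', 'flow', 'audio') are assumed; only the
# n x 25 x 1024 shape is stated in the report.
import numpy as np

features = {name: np.random.randn(100, 25, 1024) for name in ("rgb", "flow", "audio")}

pooled = {name: feat.mean(axis=1, keepdims=True)   # n x 25 x 1024 -> n x 1 x 1024
          for name, feat in features.items()}

print(pooled["rgb"].shape)  # (100, 1, 1024)
```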
Evaluation Metrics. We use two metrics, mAP and nDCG, on the test set to evaluate submissions for action retrieval. Mean Average Precision (mAP) has been employed for retrieval baselines because it evaluates the whole ranking under binary relevance. nDCG has previously been used for this retrieval task [14]; it requires semantic similarity scores across the entire test set.
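To make the distinction concrete, the sketch below computes nDCG for a single query from graded relevance values; it follows the standard nDCG definition rather than the official challenge evaluation script.

```python
# Minimal sketch of nDCG for one query, using graded (non-binary) relevance
# values. Follows the standard definition; the official evaluation script may
# differ in details such as how relevance is derived.
import numpy as np

def dcg(relevance):
    ranks = np.arange(1, len(relevance) + 1)
    return np.sum(relevance / np.log2(ranks + 1))

def ndcg(relevance_in_ranked_order):
    ideal = np.sort(relevance_in_ranked_order)[::-1]   # best possible ordering
    return dcg(relevance_in_ranked_order) / dcg(ideal)

# Semantic relevance of retrieved items, in the order the model ranked them:
print(ndcg(np.array([0.8, 0.2, 1.0, 0.0])))  # equals 1.0 only for a perfect ranking
```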
3.2 Results
Table 1 shows the comparison between our method and the baselines, as well as the other entries in this year's challenge. While our method surpasses all baselines on nDCG, it falls short on mAP. The MI-MM approach projects both modalities onto a shared action space using linear layers trained with a max-margin loss; it is a simplified version of [9]. The JPoSE approach [15] uses triplet losses to disentangle captions into verb and noun spaces; JPoSE* refers to our re-implementation. DCRL [6] considers inter-modal and intra-modal constraints at the same time to retain both cross-modal semantic similarity and modality-specific consistency in the embedding space.
Failure cases. We also share failure cases that could be helpful for other researchers. For every experiment, we report approximate differences relative to the JPoSE baseline [15]. 1) If we apply self-attention to the textual features in the same way as to the video features, the results decrease by around 2-3%. 2) When we increase the batch size or the embedding size, the results decrease by 1-2%. 3) We obtain 1-2% lower results when applying temporal max pooling rather than mean pooling.
4 Conclusion
In this report, we propose an approach that exploits contextualized video features via self-attention and disentangles them into multiple embedding spaces. Text is parsed into the corresponding embedding spaces, and the similarity between representations is optimized via triplet losses. While our strategy outperforms the strong baseline in normalized Discounted Cumulative Gain (nDCG), a measure of semantic similarity, it falls short in mean Average Precision (mAP), a standard measure of binary relevance. Our submission is ranked third in nDCG and fourth in mAP. As future work, we plan to exploit each video feature separately via novel fusion methods and to utilize domain-specific cues such as hand-object relations.
References
- [1] S. Chen, Y. Zhao, Q. Jin, and Q. Wu. Fine-grained video-text retrieval with hierarchical graph reasoning. In CVPR, 2020.
- [2] Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Antonino Furnari, Jian Ma, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. Rescaling egocentric vision: Collection, pipeline and challenges for epic-kitchens-100. International Journal of Computer Vision (IJCV), 2021.
- [3] J. Dong, X. Li, C. Xu, S. Ji, Y. He, G. Yang, and X. Wang. Dual encoding for zero-example video retrieval. In CVPR, pages 9338–9347, 2019.
- [4] English spaCy parser. https://spacy.io/.
- [5] Valentin Gabeur, Chen Sun, Karteek Alahari, and Cordelia Schmid. Multi-modal Transformer for Video Retrieval. In European Conference on Computer Vision (ECCV), 2020.
- [6] Xiaoshuai Hao, Yucan Zhou, Dayan Wu, Wanqian Zhang, Bo Li, and Weiping Wang. Multi-feature graph attention network for cross-modal video-text retrieval. In Proceedings of the 2021 International Conference on Multimedia Retrieval, ICMR ’21, page 135–143, New York, NY, USA, 2021. Association for Computing Machinery.
- [7] Evangelos Kazakos, Arsha Nagrani, Andrew Zisserman, and Dima Damen. Epic-fusion: Audio-visual temporal binding for egocentric action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019.
- [8] Y. Liu, S. Albanie, A. Nagrani, and A. Zisserman. Use what you have: Video retrieval using representations from collaborative experts. In arXiv, 2019.
- [9] Antoine Miech, Jean-Baptiste Alayrac, Lucas Smaira, Ivan Laptev, Josef Sivic, and Andrew Zisserman. End-to-End Learning of Visual Representations from Uncurated Instructional Videos. In CVPR, 2020.
- [10] N. C. Mithun, J. Li, F. Metze, and A. K. Roy-Chowdhury. Learning joint embedding with multimodal cues for cross-modal video-text retrieval. In Proceedings of the 2018 ACM on ICMR, page 19–27, 2018.
- [11] Burak Satar, Zhu Hongyuan, Xavier Bresson, and Joo Hwee Lim. Semantic role aware correlation transformer for text to video retrieval. In 2021 IEEE International Conference on Image Processing (ICIP), pages 1334–1338, 2021.
- [12] Burak Satar, Hongyuan Zhu, Hanwang Zhang, and Joo Hwee Lim. Rome: Role-aware mixture-of-expert transformer for text-to-video retrieval. arXiv:2206.12845, 2022.
- [13] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In NIPS, 2017.
- [14] Michael Wray, Hazel Doughty, and Dima Damen. On semantic similarity in video retrieval. In CVPR, 2021.
- [15] M. Wray, D. Larlus, G. Csurka, and D. Damen. Fine-grained action retrieval through multiple parts-of-speech embeddings. In ICCV, 2019.