Semantic Role Aware Correlation Transformer
for Text to Video Retrieval
Abstract
With the emergence of social media, voluminous video clips are uploaded every day, and retrieving the most relevant visual content with a language query becomes critical. Most approaches aim to learn a joint embedding space for plain textual and visual contents without adequately exploiting their intra-modality structures and inter-modality correlations. This paper proposes a novel transformer that explicitly disentangles text and video into the semantic roles of objects, spatial contexts and temporal contexts, with an attention scheme that learns the intra- and inter-role correlations among the three roles to discover discriminative features for matching at different levels. Preliminary results on the popular YouCook2 benchmark indicate that our approach surpasses a current state-of-the-art method by a large margin on all metrics, and also outperforms two other state-of-the-art methods on two of the metrics.
Index Terms— Video understanding, text-to-video retrieval, transformer, multi-modal, hierarchical, cross-modal
1 Introduction
With the popularity of TikTok, YouTube and Instagram, millions of videos are uploaded every minute, making video retrieval an important function for users to find relevant content. Conventional models are based on keyword queries [1, 2]. However, unstructured keywords are limited and insufficient to retrieve fine-grained and compositional events. Therefore, the community is shifting its attention to cross-modal video-text retrieval [3, 4, 5, 6, 7], which retrieves videos using natural language descriptions that include more structured details.
Most existing cross-modal text-video retrieval methods map each modality into a joint embedding space to measure their similarities. Some works [8, 4, 9] directly embed whole videos and texts into flat vectors for matching, thereby losing fine-grained details in texts and videos. To avoid losing those details, other works [10, 11] align sequences of frames and words to compute overall similarities. Although these approaches have achieved some progress, aligning video and text remains an open problem given the huge semantic gap between the two modalities.
Recently, there have been different attempts to bridge the video-text semantic gap. Chen et al. [3] propose decomposing text into three semantic roles (events, actions and entities) and then embedding 2D video features into these three spaces for matching. Another line of research uses BERT-like transformers [12, 13, 14] to learn the text-video correspondence, building on the mixture-of-experts embedding [5], which requires large-scale datasets for pre-training.
We propose a novel transformer architecture for video-text matching inspired by [3] and [12, 13, 14]. Different from [3], which only considers multi-head embedding of spatial frames and ignores the interaction between different visual contexts, our method explicitly performs fine-grained visual encoding of object, spatial and temporal contexts by embedding RoI regions, 2D frames and video sequences into the corresponding spaces while modelling their interactions. Different from [12, 13, 14], which only use self-attention, our method combines a self-attention scheme that discovers modality-specific discriminative features with cross-modal attention that models the interactions between object, spatial and temporal contexts to discover modality-complement features, allowing video and text to be better aligned.
We evaluate our approach on the YouCook2 dataset for the text-to-video retrieval task. The results show that our approach surpasses a recent state-of-the-art method by a large margin. While our approach also gives better results than two other approaches on two metrics (R@1, R@5), it falls slightly behind them on the other two (R@10, MedR).
2 Related Work
The text-video retrieval task is challenging because there are significant semantic and structural gaps between videos and texts. Mithun et al. [4] employ image, motion and audio modalities in video. Liu et al. [8] further utilize all modalities that can be extracted from videos, such as speech content and scene text, for video encoding. Dong et al. [9] use biGRUs and CNNs to encode sequential videos and texts. Yu et al. [11] propose to fuse the sequential interaction of videos and texts for video-text retrieval. Song et al. [10] propose to align encoded video and text elements for matching. Tan et al. [15] use LSTMs and graph neural networks to capture cross-modal relations. Chen et al. [3] disentangle phrases into different parts of speech, such as verbs and nouns, for fine-grained retrieval, and use GNNs to encode the interactions between the roles in the text modality. Our work complements Chen et al. [3] by focusing on the video encoding part, using a transformer to learn interactions between object contexts, spatial contexts and temporal contexts.
Some methods [16, 17] extend BERT-like models with self-supervised learning to obtain better video-text representations. For example, Sun et al. [16] build upon BERT to learn bidirectional joint distributions over sequences of visual and linguistic tokens, while Zhu et al. [17] encode global actions, local regions, objects and linguistic descriptions with self-supervised learning, without considering the linguistic and visual structures. Our work exploits a transformer-like architecture to learn interactions for better object-, spatial- and temporal-context encoding.
Most recent papers [5, 13] use large-scale datasets such as HowTo100M [6] for pre-training to obtain more general textual and visual embeddings. Alayrac et al. [18] even use an audio dataset [19] to obtain audio embeddings. In this work we do not exploit pre-training on a huge dataset, both to limit computational cost and to study the effect of its absence, although our model can readily incorporate pre-trained backbones.
3 Method
Figure 1 illustrates our semantic role aware correlation transformer model, which consists of three main blocks: textual encoding, visual encoding, and cross-modal matching.
3.1 Textual Encoding
Text descriptions internally have a hierarchical structure. For example, the whole sentence describes the global context, while nouns and verbs define objects and actions. We use an off-the-shelf semantic role labelling toolkit [20] to parse these components and the semantic relationships between them into a graph structure. Verb nodes are directly connected to the global node, capturing the temporal relations of the various actions, while noun nodes are connected to verb nodes, with the edges expressing the semantic relationship between actions and the objects they involve. For the graph representation, we follow the method proposed by Chen et al. [3]. This approach, shown in Eq. 1, uses factorized weights in the GCN, where $W_t$ is a transformation matrix shared among all relationship types and $W_{r_{ij}}$ denotes a matrix unique to each semantic role. $g_i$ and $g_j$ denote node embeddings, which can correspond to sentences, verbs or objects, $\alpha_{ij}$ is the attention weight between connected nodes $i$ and $j$, and $\tilde{g}_i$ is the outcome after attention is applied to the nodes:

$$\tilde{g}_i = \sum_{j \in \mathcal{N}(i)} \alpha_{ij}\, W_t\, W_{r_{ij}}\, g_j \tag{1}$$
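To make the factorized formulation concrete, the following PyTorch sketch implements a role-aware graph layer in the spirit of Eq. 1; the class name, tensor layout and the simple edge-attention weighting are our assumptions rather than the reference implementation of [3].

```python
import torch
import torch.nn as nn

class RoleAwareGraphLayer(nn.Module):
    """Factorized graph layer in the spirit of Eq. 1 (names are illustrative).

    W_t is shared across all relationship types; each semantic role r has its
    own matrix W_r, so the role-specific transform is W_t @ W_r.
    """
    def __init__(self, dim, num_roles):
        super().__init__()
        self.w_t = nn.Linear(dim, dim, bias=False)                        # shared W_t
        self.w_r = nn.Parameter(torch.randn(num_roles, dim, dim) * 0.02)  # per-role W_r

    def forward(self, nodes, adj, roles):
        # nodes: (N, dim) embeddings of sentence / verb / noun nodes
        # adj:   (N, N) adjacency mask, 1 where an edge exists
        # roles: (N, N) integer semantic-role id of each edge
        role_w = self.w_r[roles]                               # (N, N, dim, dim)
        msgs = torch.einsum('ijdk,jk->ijd', role_w, nodes)     # role-specific messages
        msgs = self.w_t(msgs)                                  # shared transform
        # Simple scaled dot-product edge attention (our assumption).
        scores = torch.einsum('id,ijd->ij', nodes, msgs) / nodes.size(-1) ** 0.5
        scores = scores.masked_fill(adj == 0, float('-inf'))
        alpha = torch.softmax(scores, dim=-1).nan_to_num(0.0)  # isolated nodes -> 0
        return nodes + torch.einsum('ij,ijd->id', alpha, msgs) # attended update
```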
3.2 Visual Encoding
Parsing videos into hierarchical semantic features is more challenging than parsing texts, since it involves detection, action segmentation and so on. We extract spatial context features with a 2D CNN, ResNet-152 [21] pre-trained on ImageNet [22]. The temporal context features are extracted with a 3D CNN, ResNeXt-101 [23] pre-trained on Kinetics [24], followed by temporal max-pooling on the feature maps. We extract object context features with Faster R-CNN [25] pre-trained on MS COCO [26], using a ResNet-101 [21] backbone. Following the settings of Zhu et al. [17], we extract the features at 1 FPS after RoI-pooling. The confidence threshold is set to 0.4, and each frame contains up to ten boxes. The dimension of all three expert embeddings is 2048.
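As a small illustration of the object-context pipeline, the sketch below filters detector outputs with the thresholds stated above; the function name and tensor shapes are hypothetical, and it assumes the detector already returns per-box confidences and 2048-d RoI-pooled features.

```python
import torch  # inputs are assumed to be torch tensors

def select_object_features(boxes, scores, roi_feats,
                           score_thresh=0.4, max_boxes=10):
    """Keep at most `max_boxes` detections per frame above `score_thresh`.

    boxes:     (N, 4) detected boxes for one frame sampled at 1 FPS
    scores:    (N,)   detector confidences
    roi_feats: (N, 2048) RoI-pooled features from the Faster R-CNN head
    """
    keep = scores >= score_thresh
    boxes, scores, roi_feats = boxes[keep], scores[keep], roi_feats[keep]
    top = scores.argsort(descending=True)[:max_boxes]   # highest-confidence boxes
    return boxes[top], roi_feats[top]                    # object-context features
```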
To capture modality-complement and modality-specific information, we adapt a transformer with attention modules [12, 13, 14, 27]. For example, to generate the spatial context-specific feature $E_s^{spec}$, the temporal context features $E_t$ and object context features $E_o$ are concatenated in the feature dimension and fed into an encoder consisting of a multi-head self-attention layer and a feed-forward linear layer FF, as shown in Eq. 2:

$$E_s^{spec} = \mathrm{FF}\big(\mathrm{MHA}([E_t; E_o])\big) \tag{2}$$

where $[\cdot\,;\cdot]$ denotes concatenation along the feature dimension and the MHA is applied as self-attention.
The attention layers are multi-headed dot-product attention as in Vaswani et al. [12], and each layer is followed by layer normalization and a residual connection, as shown in Eq. 3, where all $W$ matrices are trainable parameters:

$$\begin{aligned} \mathrm{head}_i &= \mathrm{softmax}\!\left(\frac{Q W_i^{Q} (K W_i^{K})^{\top}}{\sqrt{d_k}}\right) V W_i^{V} \\ \mathrm{MHA}(Q, K, V) &= \mathrm{Norm}\big(Q + [\mathrm{head}_1; \dots; \mathrm{head}_h]\, W^{O}\big) \end{aligned} \tag{3}$$
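A minimal PyTorch sketch of the attention sub-layer in Eq. 3, built on `nn.MultiheadAttention`; the head count and the post-norm placement of the residual connection are our assumptions.

```python
import torch.nn as nn

class ResidualMHA(nn.Module):
    """Multi-head dot-product attention followed by a residual connection
    and layer normalization, as in Eq. 3 (post-norm placement assumed)."""
    def __init__(self, dim=1024, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query, key=None, value=None):
        # Self-attention when only `query` is given; cross-modal otherwise.
        key = query if key is None else key
        value = key if value is None else value
        out, _ = self.attn(query, key, value)
        return self.norm(query + out)        # residual + LayerNorm
```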
Eq. 4 shows how the spatial context-complement feature is generated. The spatial context feature $E_s$ is first fed into the self-attention layer to be encoded into $\hat{E}_s$, which is then conditioned, through cross-modal attention, on the modality-specific feature $E_s^{spec}$ derived from $E_t$ and $E_o$ to generate the complement feature $E_s^{comp}$. Next, it is passed through the feed-forward linear layer FF and combined with the modality-specific feature to deliver the final embedding for the spatial context, $E_s^{final}$. 'Norm' refers to layer normalization.

$$\begin{aligned} \hat{E}_s &= \mathrm{MHA}(E_s, E_s, E_s) \\ E_s^{comp} &= \mathrm{MHA}(\hat{E}_s, E_s^{spec}, E_s^{spec}) \\ E_s^{final} &= \mathrm{Norm}\big(\mathrm{FF}(E_s^{comp}) + E_s^{spec}\big) \end{aligned} \tag{4}$$
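Putting Eqs. 2-4 together, the following sketch processes one context stream (e.g. the spatial one), reusing the `ResidualMHA` sub-layer from the previous sketch. Since parts of the original wiring are ambiguous, the feature-dimension fusion, the pooling of the specific feature and the final combination are our reconstruction, not the authors' exact design.

```python
import torch
import torch.nn as nn

class ContextCorrelationBlock(nn.Module):
    """Sketch of Eqs. 2 and 4 for one context stream (e.g. the spatial one).

    The modality-specific feature is encoded from the two other contexts
    (Eq. 2); the modality-complement feature comes from cross-modal attention
    of the target context over it (Eq. 4).
    """
    def __init__(self, dim=1024, num_heads=8):
        super().__init__()
        self.fuse = nn.Linear(2 * dim, dim)            # after feature-dim concatenation
        self.enc_attn = ResidualMHA(dim, num_heads)    # Eq. 2: self-attention encoder
        self.enc_ff = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.dec_self = ResidualMHA(dim, num_heads)    # Eq. 4: self-attention on target
        self.dec_cross = ResidualMHA(dim, num_heads)   # Eq. 4: cross-modal attention
        self.dec_ff = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.norm = nn.LayerNorm(dim)

    def forward(self, target, other_a, other_b):
        # target: (B, T, dim); other_a / other_b: (B, T', dim), assumed aligned
        others = self.fuse(torch.cat([other_a, other_b], dim=-1))  # concat in feature dim
        specific = self.enc_ff(self.enc_attn(others))              # modality-specific (Eq. 2)
        hidden = self.dec_self(target)                             # self-attention (Eq. 4)
        complement = self.dec_cross(hidden, specific)              # modality-complement
        # Pool the specific feature before combining (pooling is our assumption).
        final = self.norm(self.dec_ff(complement) + specific.mean(dim=1, keepdim=True))
        return final                                               # final context embedding
```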
The same process is applied to generate the final embeddings $E_t^{final}$ and $E_o^{final}$ for the temporal context and the object context, respectively, for matching.
3.3 Cross-modal Matching
We use the cosine similarity between the corresponding visual and textual embeddings to calculate the cross-modal matching score at each level:

$$s(v, t) = \frac{v^{\top} t}{\|v\|\,\|t\|} \tag{5}$$
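For reference, Eq. 5 corresponds to the following few lines in batched matrix form; the function name and shapes are ours.

```python
import torch.nn.functional as F

def level_similarity(video_emb, text_emb):
    """Cosine similarity (Eq. 5) between all video/text pairs at one level.

    video_emb: (B, dim) video embeddings at one level (object/spatial/temporal)
    text_emb:  (B, dim) textual embeddings at the matching level
    returns:   (B, B) similarity matrix whose diagonal holds the positive pairs
    """
    v = F.normalize(video_emb, dim=-1)   # divide by the L2 norm
    t = F.normalize(text_emb, dim=-1)
    return v @ t.t()                     # dot products of unit vectors = cosine
```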
Then, we average the similarities over the three levels and use a contrastive ranking loss as the training objective. The aim is for the positive pair $(v, t)$ to be separated from the negative pairs $(v, \tilde{t})$ and $(\tilde{v}, t)$ by at least a set margin, where $v$ and $t$ refer to the video and textual representations, respectively. This matching applies to all three levels of embeddings. $\Delta$ represents the pre-defined margin used for training with the contrastive loss:

$$\mathcal{L} = \sum_{\tilde{t}} \big[\Delta + s(v, \tilde{t}) - s(v, t)\big]_{+} + \sum_{\tilde{v}} \big[\Delta + s(\tilde{v}, t) - s(v, t)\big]_{+} \tag{6}$$
Table 1: Comparison with state-of-the-art methods for text-to-video retrieval on YouCook2.

| Method | Pre-training | Visual Backbone | Batch Size | R@1↑ | R@5↑ | R@10↑ | MedR↓ |
|---|---|---|---|---|---|---|---|
| Random | No | - | - | 0.03 | 0.15 | 0.3 | 1675 |
| Miech et al. [6] | No | ResNeXt-101 | - | 4.2 | 13.7 | 21.5 | 65 |
| HGLMM [28] | No | - | - | 4.6 | 14.3 | 21.6 | 75 |
| HGR [3] | No | ResNeXt-101 | 32 | 4.7 | 14.1 | 20.0 | 87 |
| Ours | No | ResNeXt-101 | 32 | 5.3 | 14.5 | 20.8 | 77 |
| Miech et al. + FT [6] | HowTo100M | ResNeXt-101 | - | 8.2 | 24.5 | 35.3 | 24 |
| ActBert [17] | HowTo100M | ResNet-3D | - | 9.6 | 26.7 | 38.0 | 19 |
| MMV FAC [18] | HowTo100M + AudioSet | TSM-50 | 4096 | 11.5 | 30.2 | 41.5 | 16 |
| MIL-NCE [7] | HowTo100M | S3D | 8192 | 15.1 | 38.0 | 51.2 | 10 |
4 Experiments
We compare our model with state-of-the-art methods on the text-to-video retrieval task in Table 1. We also report an ablation study using various expert embeddings at the different hierarchical levels in Table 2.
4.1 Dataset and Metrics
Dataset. We evaluate our model on YouCook2 [29], a cooking video dataset gathered from YouTube. The videos cover a diverse range of cooking styles and methods. The dataset includes 89 recipe types and 14k video clips annotated with imperative English captions describing the actions. Since annotations for the test set are not published yet, we evaluate the task on the validation clips, which total around 3.5k.
Evaluation metrics. The task is to retrieve video clips based on text queries. We evaluate our model using the common recall-at-K and median-rank metrics. R@1, R@5 and R@10 give the percentage of queries for which the correct clip appears in the top 1, 5 or 10 of the ranking list, and MedR gives the median rank of the correct clips. Higher is better for the recall metrics, while lower is better for the median rank.
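For completeness, these metrics can be computed from a query-by-clip similarity matrix as in the following sketch (NumPy; names are ours).

```python
import numpy as np

def retrieval_metrics(sims):
    """R@1/5/10 and MedR from a (num_queries, num_clips) similarity matrix,
    where sims[i, i] is the score of the ground-truth clip for query i."""
    order = np.argsort(-sims, axis=1)              # best-scoring clip first
    gt = np.arange(sims.shape[0])[:, None]
    ranks = np.argmax(order == gt, axis=1) + 1     # 1-based rank of the correct clip
    return {
        'R@1': 100.0 * np.mean(ranks <= 1),
        'R@5': 100.0 * np.mean(ranks <= 5),
        'R@10': 100.0 * np.mean(ranks <= 10),
        'MedR': float(np.median(ranks)),
    }
```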
Table 2: Ablation study on the visual expert features fed to the three hierarchical levels (HGR [3] : Ours). 'Appearance', 'Action' and 'Object' denote the features fed to the spatial, temporal and object levels, respectively.

| Method | Appearance | Action | Object | Feature Dimension | R@1↑ | R@5↑ | R@10↑ | MedR↓ |
|---|---|---|---|---|---|---|---|---|
| HGR [3] : Ours | 2D | 2D | 2D | 2048 | 4.7 : 4.2 | 13.8 : 13.7 | 19.7 : 19.4 | 86 : 86 |
| HGR [3] : Ours | 2D + 3D | 2D + 3D | 2D + 3D | 2048 | 4.8 : 4.5 | 14.0 : 13.2 | 20.3 : 20.0 | 85 : 85 |
| HGR [3] : Ours | 2D + 3D | 2D + 3D | 2D + 3D | 4096 | 4.8 : 4.5 | 14.0 : 13.2 | 20.3 : 20.0 | 85 : 85 |
| HGR [3] : Ours | 2D | 3D | RoI | 2048 | 4.7 : 5.3 | 14.1 : 14.5 | 20.0 : 20.8 | 87 : 77 |
4.2 Implementation Details
Training. We use GloVe embeddings [30] with a word embedding size of 300 for text encoding, and adopt the text encoding approach of [3] to disentangle the text embeddings. The graph convolution has two layers and outputs 1024-dimensional features. A linear layer is applied to each visual expert embedding to transform its dimension to 1024. We use the cross-modal attention mechanism at each level of the visual encoding. For training, we use a fixed margin; each experiment runs for 100 epochs with a mini-batch size of 32.
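A brief sketch of the projection layers and the hyperparameters listed above; the variable names are illustrative and the margin value is omitted because it is not given in the text.

```python
import torch.nn as nn

# Linear layers that project each 2048-d visual expert embedding into the
# 1024-d space shared with the two-layer graph-convolution text features.
project_object   = nn.Linear(2048, 1024)
project_spatial  = nn.Linear(2048, 1024)
project_temporal = nn.Linear(2048, 1024)

# Training hyperparameters from the text; the margin value is not specified here.
TRAIN_CFG = {'word_embedding_dim': 300,   # GloVe
             'text_feature_dim': 1024,
             'epochs': 100,
             'batch_size': 32}
```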
4.3 Comparison
Table 1 shows the comparison with state-of-the-art methods. While 'Visual Backbone' in the table only refers to the 3D CNN features, ResNet-based models are also used in all methods as the 2D CNN backbone. We focus on training only with the YouCook2 dataset; however, we also include methods that use pre-training for reference. 'FT' denotes fine-tuning on the YouCook2 dataset. We see that the accelerator type, which directly determines the maximum batch size, and the use of pre-training affect the results sharply. There are even differences for the same model when trained on different accelerators, epochs and batch sizes, as reported in the corresponding papers. For example, the MIL-NCE method reaches only 50% of its accuracy when trained with a smaller batch size and fewer epochs. Our method surpasses the HGR method [3] by a large margin on all metrics. We also outperform two other state-of-the-art methods, Miech et al. [6] and HGLMM [28], on the first two metrics. We believe that the modality-specific and modality-complement features improve accuracy at R@1 and R@5, which are the more demanding metrics and the most useful for real-world applications.
4.4 Ablation Studies
Since we adapt our model from HGR [3], we conduct an extensive ablation to show the improvement of our approach in the visual encoding part. We aim to find how the combination of expert features affects the result. Our model falls short when all three levels are fed with only 2D features. The same decrease persists with 2D and 3D features concatenated along dimension zero (keeping 2048-dimensional features) as well as along dimension one (yielding 4096-dimensional features). However, when the three levels are fed with 2D, 3D and RoI features respectively, the HGR model shows a slight decrease while our model achieves a better result by a clear margin. This confirms our insight that inter-modal correlation can be exploited with our proposed cross-modal attention mechanism to achieve better results.
5 Conclusion
Retrieving relevant videos for a textual query becomes harder as the number of videos on the internet increases. Most works use a single joint embedding space for the text-to-video retrieval task without fully exploiting cross-modal features. We propose a hierarchical model that represents complex textual and visual features in three joint embedding spaces, utilizing self-attention and cross-modal attention to exploit modality-specific and modality-complement visual embeddings. Our model surpasses a strong baseline by a large margin and also outperforms other state-of-the-art methods on the R@1 and R@5 metrics.
References
- [1] X. Chang, Y. Yang, A. Hauptmann, E. Xing, and Y. Yu, “Semantic concept discovery for large-scale zero-shot event detection,” in IJCAI, 2015.
- [2] A. Habibian, Thomas Mensink, and Cees G. M. Snoek, “Composite concept discovery for zero-shot video event detection,” ICMR, 2014.
- [3] S. Chen, Y. Zhao, Q. Jin, and Q. Wu, “Fine-grained video-text retrieval with hierarchical graph reasoning,” in CVPR, 2020.
- [4] N. C. Mithun, J. Li, F. Metze, and A. K. Roy-Chowdhury, “Learning joint embedding with multimodal cues for cross-modal video-text retrieval,” in Proceedings of the 2018 ACM on ICMR, 2018, p. 19–27.
- [5] A. Miech, I. Laptev, and J. Sivic, “Learning a text-video embedding from incomplete and heterogeneous data,” arXiv:1804.02516, 2018.
- [6] A. Miech, D. Zhukov, J.-B. Alayrac, M. Tapaswi, I. Laptev, and J. Sivic, “Howto100m: Learning a text-video embedding by watching hundred million narrated video clips,” in ICCV, 2019.
- [7] A. Miech, J. B. Alayrac, L. Smaira, I. Laptev, J. Sivic, and A. Zisserman, “End-to-end learning of visual representations from uncurated instructional videos,” in CVPR, 2020.
- [8] Y. Liu, S. Albanie, A. Nagrani, and A. Zisserman, “Use what you have: Video retrieval using representations from collaborative experts,” arXiv preprint, 2019.
- [9] J. Dong, X. Li, C. Xu, S. Ji, Y. He, G. Yang, and X. Wang, “Dual encoding for zero-example video retrieval,” in CVPR, 2019, pp. 9338–9347.
- [10] Y. Song and M. Soleymani, “Polysemous visual-semantic embedding for cross-modal retrieval,” 2019.
- [11] Y. Yu, J. Kim, and G. Kim, “A joint sequence fusion model for video question answering and retrieval,” in ECCV, 2018.
- [12] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in NIPS, 2017.
- [13] V. Gabeur, C. Sun, K. Alahari, and C. Schmid, “Multi-modal Transformer for Video Retrieval,” in ECCV, 2020.
- [14] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in HLT-NAACL, 2019.
- [15] R. Tan, H. Xu, K. Saenko, and B. A. Plummer, “wMAN: Weakly-supervised moment alignment network for text-based video segment retrieval,” 2020.
- [16] C. Sun, A. Myers, C. Vondrick, K. Murphy, and C. Schmid, “Videobert: A joint model for video and language representation learning,” in ICCV, 2019.
- [17] L. Zhu and Y. Yang, “Actbert: Learning global-local video-text representations,” in CVPR, 2020.
- [18] J.-B. Alayrac, A. Recasens, R. Schneider, R. Arandjelović, J. Ramapuram, J. De Fauw, L. Smaira, S. Dieleman, and A. Zisserman, “Self-supervised multimodal versatile networks,” 2020.
- [19] J. F. Gemmeke, D. P. W. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, “Audio set: An ontology and human-labeled dataset for audio events,” in ICASSP, 2017, pp. 776–780.
- [20] P. Shi and J. Lin, “Simple bert models for relation extraction and semantic role labeling,” 2019.
- [21] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016.
- [22] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in CVPR, 2009, pp. 248–255.
- [23] K. Hara, H. Kataoka, and Y. Satoh, “Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet?,” in CVPR, 2018, pp. 6546–6555.
- [24] J. Carreira and A. Zisserman, “Quo vadis, action recognition? a new model and the kinetics dataset,” in CVPR, 2017, pp. 4724–4733.
- [25] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” IEEE TPAMI, vol. 39, pp. 1137–1149, 2017.
- [26] T. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, C. L. Zitnick, and P. Dollár, “Microsoft COCO: Common objects in context,” 2015.
- [27] V. Iashin and E. Rahtu, “Multi-modal dense video captioning,” in CVPR Workshops, 2020.
- [28] B. Klein, G. Lev, G. Sadeh, and L. Wolf, “Associating neural word embeddings with deep image representations using fisher vectors,” in CVPR, 2015.
- [29] L. Zhou, C. Xu, and J. Corso, “Towards automatic learning of procedures from web instructional videos,” in AAAI, 2018, pp. 7590–7598.
- [30] J. Pennington, R. Socher, and C. Manning, “Glove: Global vectors for word representation,” in EMNLP. 2014, pp. 1532–1543, ACL.