
NITS-VC System for VATEX Video Captioning Challenge 2020

Alok Singh, Thoudam Doren Singh and Sivaji Bandyopadhyay
Center for Natural Language Processing & Department of Computer Science and Engineering
National Institute of Technology Silchar
Assam, India
{alok.rawat478, thoudam.doren, sivaji.cse.ju}@gmail.com
Abstract

Video captioning is the process of summarising the content, events and actions of a video into a short textual form, which can be helpful in many research areas such as video-guided machine translation, video sentiment analysis and providing aid to needy individuals. In this paper, a system description of the framework used for the VATEX-2020 video captioning challenge is presented. We employ an encoder-decoder based approach in which the visual features of the video are encoded using a 3D convolutional neural network (C3D), and in the decoding phase two Long Short Term Memory (LSTM) recurrent networks are used in which the visual features and input captions are fused separately and the final output is generated by performing an element-wise product between the outputs of both LSTMs. Our model achieves BLEU-4 scores of 0.20 and 0.22 on the public and private test sets respectively.

Keywords: C3D, Encoder-decoder, LSTM, VATEX, Video captioning.

1 Introduction

Nowadays, the amount of multimedia data (especially video) over the Internet is increasing day by day. This leads to the problem of automatic classification, indexing and retrieval of videos [14]. Video captioning is the task of automatically captioning a video by understanding the actions and events in it, which can help in retrieving the video efficiently through text. By addressing the task of video captioning effectively, the gap between computer vision and natural language can also be narrowed. The approaches proposed for video captioning so far can be classified into two categories, namely template-based language models [2] and sequence learning models [12]. Template-based approaches use predefined templates for generating captions by fitting the attributes identified in the video. These approaches need proper alignment between the words generated for the video and the predefined templates. In contrast, sequence learning based approaches learn the sequence of words conditioned on the previously generated words and the visual feature vector of the video. This formulation is commonly used in Machine Translation (MT), where the target language (T) is conditioned on the source language (S).

Video is a rich source of information consisting of a large number of continuous frames, sound and motion. The presence of many similar frames, complex actions and events makes the task of video captioning challenging. A number of datasets have been introduced so far, covering a variety of domains such as cooking [5], social media [8] and human action [4]. Based on the datasets available, the approaches proposed for video captioning can be categorized into open domain video captioning [19] and domain-specific video captioning [5]. Generating a caption for an open domain video is more challenging than for a domain-specific video due to the presence of various interrelated actions, events and scenes. In this paper, an encoder-decoder based approach is used to address the problem of English video captioning on the dataset provided by the VATEX community for the multilingual video captioning challenge 2020 [19].

2 Background Knowledge

The background of video captioning approaches can be divided into three phases [1]. The classical video captioning phase involves detecting entities in the video (such as objects, actions and scenes) and then mapping them to predefined templates. In the statistical methods phase, the video captioning problem is addressed by employing statistical methods. The last one is the deep learning phase; in this phase, many state-of-the-art video captioning frameworks have been proposed, and it is believed that this phase has the capability of solving the problem of automatic open domain video captioning.

Some classical video captioning approaches are proposed in [3, 10], where the motion of vehicles in a video and series of actions in a movie, respectively, are characterized using natural language. The successful application of Deep Learning (DL) techniques in the fields of computer vision and natural language processing has attracted researchers to incorporate DL based techniques for text generation. Most of these DL based approaches for video captioning are inspired by deep image captioning approaches [20]. They mainly consist of two stages, namely a visual feature encoding stage and a sequence decoding stage. [17] proposed a model based on stacked LSTMs in which the first LSTM layer takes a sequence of frames as input and the second LSTM layer generates the description using the first LSTM's hidden state representation. The shortcoming of the approach proposed in [17] is that all frames need to be processed for generating the description, and since a video consists of many similar frames, there is a chance that the final representation of the visual features contains information of little relevance for video captioning [20]. Some other DL based techniques are successfully implemented in [7, 18]. In this paper, the proposed architecture and the performance of the system evaluated on the VATEX video captioning challenge 2020 dataset (https://competitions.codalab.org/competitions/24360) are presented. Furthermore, Section 3 describes the system used for the challenge, followed by Section 4 and Section 5 on the discussion of the performance of the system and the conclusion respectively.

3 VATEX Task

In this section, the details of the visual feature extraction from the video and the implementation of the system are discussed.

3.1 System Description

Visual Feature Encoding: For this task, a traditional encoder-decoder based approach is used. To encode the visual feature vector of the input video, the video is first evenly segmented into n segments at an interval of 16 frames; then, using a pre-trained 3D Convolutional Neural Network (C3D), a visual feature vector f = \{s_{1}, s_{2}, \dots, s_{n}\} is extracted for the video. The dimension of the feature vector for each video is f \in \mathbb{R}^{n \times m_{x}}, where m_{x} is the dimension of the feature vector for each segment. Since a high dimensional feature vector is prone to carry redundant information and affect the quality of the features [20], average pooling with filter size 5 is performed to reduce the feature dimension. The average pooled features (f_{a}) of all segments are concatenated in a sequence to preserve the temporal relation between them and passed to the decoder for caption generation.
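For illustration, the following is a minimal sketch of the pooling and sequencing step described above, assuming the per-segment C3D features are already available as an n × m_x array; the function name and the choice of pooling along the feature axis are our assumptions, while the 16-frame segmentation and the filter size of 5 come from the description.

```python
import numpy as np

def pool_segment_features(c3d_features: np.ndarray, pool_size: int = 5) -> np.ndarray:
    """Average-pool per-segment C3D features (hypothetical helper).

    c3d_features: array of shape (n, m_x), one C3D vector per 16-frame
    segment; the C3D extraction itself (pre-trained network) is not shown.
    """
    n, m_x = c3d_features.shape
    usable = (m_x // pool_size) * pool_size        # trim so m_x divides evenly
    trimmed = c3d_features[:, :usable]
    # Non-overlapping average pooling with window `pool_size` along the
    # feature axis reduces each segment vector from m_x to m_x // pool_size.
    pooled = trimmed.reshape(n, usable // pool_size, pool_size).mean(axis=-1)
    # Segments stay in temporal order; this sequence (f_a) goes to the decoder.
    return pooled

# Example: 10 segments of 4096-dim C3D features -> pooled shape (10, 819).
f_a = pool_segment_features(np.random.rand(10, 4096).astype(np.float32))
```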

Caption generator: For decoding, two separate Long Short Term Memory (LSTM) recurrent networks are used as the decoder. In this stage, all the input captions are first passed to an embedding layer to get a dense representation for each word in the caption. The embedding representation is then passed to both LSTMs separately. The first LSTM takes the encoded visual feature vector as its initial state; for the second LSTM, the visual feature vector is concatenated with the output of the embedding layer; finally, the element-wise product is performed between the outputs of both LSTMs. The unrolling procedure of the system is given below, where [ ; ] represents concatenation:

\tilde{y} = W_{e}X + b_{e}   (1)
\tilde{z_{1}} = LSTM_{1}(\tilde{y}, h_{i})   (2)
\tilde{z_{2}} = LSTM_{2}([\tilde{y}; f_{a}])   (3)
y_{t} = softmax(\tilde{z_{1}} \odot \tilde{z_{2}})   (4)

Equation 1 represents the embedding of the input captions (X), where W_{e} and b_{e} are the learnable weight and bias parameters respectively. \tilde{z_{1}} in Equation 2 and \tilde{z_{2}} in Equation 3 are the outputs of the LSTM layers, where h_{i} (h_{i} = f_{a}) is the average pooled feature vector passed as the hidden state of LSTM_{1}; for LSTM_{2}, it is concatenated with the embedding vector instead. \odot represents the element-wise product of \tilde{z_{1}} and \tilde{z_{2}}, which is finally passed to the softmax layer.
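The sketch below is one possible PyTorch realisation of Equations 1-4, with 512 hidden units as in Section 4.3. The projection of the pooled features to the hidden size and the output projection to the vocabulary before the softmax are our additions (the paper states h_{i} = f_{a} and applies the softmax directly to the product), so treat the wiring as an approximation rather than the exact implementation.

```python
import torch
import torch.nn as nn

class TwoStreamDecoder(nn.Module):
    """Caption decoder following Equations (1)-(4); dimensions are illustrative."""

    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512, feat_dim=819):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)         # Eq. (1): W_e X + b_e
        self.feat_proj = nn.Linear(feat_dim, hidden_dim)         # assumption: match f_a to the hidden size
        self.lstm1 = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.lstm2 = nn.LSTM(embed_dim + feat_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)             # assumption: projection to the vocabulary

    def forward(self, captions, f_a):
        # captions: (batch, seq_len) word indices; f_a: (batch, feat_dim) pooled visual features.
        y = self.embed(captions)                                 # (batch, seq_len, embed_dim)
        # Eq. (2): the visual features initialise the hidden state of LSTM_1 (h_i = f_a).
        h0 = torch.tanh(self.feat_proj(f_a)).unsqueeze(0)
        c0 = torch.zeros_like(h0)
        z1, _ = self.lstm1(y, (h0, c0))
        # Eq. (3): the visual features are concatenated with the embeddings for LSTM_2.
        f_rep = f_a.unsqueeze(1).expand(-1, y.size(1), -1)
        z2, _ = self.lstm2(torch.cat([y, f_rep], dim=-1))
        # Eq. (4): element-wise product of both streams, then softmax over the vocabulary.
        return torch.softmax(self.out(z1 * z2), dim=-1)
```

During training, the resulting word distribution would be compared against the ground-truth caption with the cross-entropy loss described in Section 4.3.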

Table 1: Statistics of the dataset used
Dataset Split       #Videos   #English Captions   #Chinese Captions
Training            25,991    259,910             259,910
Validation          3,000     30,000              30,000
Public test set     6,000     30,000              30,000
Private test set    6,287     62,780              62,780
Table 2: Sample output captions generated by the system (video frames omitted).
(a) VideoId-2-40BG6NPaY: A group of people are skiing down a hill and one of them falls down
(b) VideoId-fjErVZXd9e0: A man is lifting a heavy weight over his head and then drops it
(c) VideoId-IwjWR7VJiYY: A woman is demonstrating how to apply mascara to her eyelashes

4 Experimental Results and Discussion

In this section, a brief discussion of the dataset used, the experimental setup and the output generated by the system is presented.

4.1 Dataset

For the evaluation of the system performance, the VATEX video captioning dataset is used. This dataset was proposed to motivate multilingual video captioning; each video is associated with 10 captions in both English and Chinese. The dataset contains two test sets, namely a public test set for which reference captions are publicly available and a private test set for which reference captions are not publicly available and are held out for the evaluation in the VATEX video captioning challenge 2020. The detailed statistics of the dataset are given in Table 1. The system described in this paper is implemented using a Quadro P2000 GPU and tested on both test sets.
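As an aside, a minimal sketch for reading the VATEX caption annotations is given below; the JSON field names ("videoID", "enCap", "chCap") and the example file name are assumptions based on the publicly released annotation files and may need adjusting.

```python
import json

def load_vatex_captions(annotation_path, language="en"):
    """Map each video id to its list of reference captions (field names assumed)."""
    key = "enCap" if language == "en" else "chCap"
    with open(annotation_path, encoding="utf-8") as fp:
        entries = json.load(fp)
    return {entry["videoID"]: entry[key] for entry in entries}

# Example with a hypothetical file name:
# train_captions = load_vatex_captions("vatex_training_v1.0.json", language="en")
```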

4.2 Evaluation Metrics

For the evaluation of generated captions, various evaluation metrics have been proposed. In this paper, for the quantitative evaluation of the generated output, the following evaluation metrics are used: BLEU [13] (https://github.com/tylin/coco-caption), METEOR [6], CIDEr [16] and ROUGE-L. BLEU evaluates the proportion of n-grams (up to four) that are shared between the hypothesis and a reference (or a set of references). CIDEr also measures the n-grams that are common to the hypothesis and the references, but weights the n-grams using term frequency-inverse document frequency. METEOR first finds the common unigrams and then calculates a score for this matching using unigram precision, unigram recall and a measure of fragmentation that evaluates how well-ordered the matched words in the generated caption are with respect to the reference caption.
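For illustration only, the snippet below computes corpus-level BLEU-4 with NLTK as a stand-in; the scores reported in Table 3 are obtained with the coco-caption toolkit linked above, whose exact invocation is not reproduced here.

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

def bleu4(references, hypotheses):
    """references: one list of tokenized reference captions per video;
    hypotheses: one tokenized generated caption per video."""
    smooth = SmoothingFunction().method1  # avoids zero scores for short outputs
    return corpus_bleu(references, hypotheses,
                       weights=(0.25, 0.25, 0.25, 0.25),
                       smoothing_function=smooth)

# Toy example with two references for one video:
refs = [[["a", "man", "lifts", "a", "heavy", "weight"],
         ["a", "man", "is", "lifting", "weights"]]]
hyps = [["a", "man", "is", "lifting", "a", "heavy", "weight"]]
print(round(bleu4(refs, hyps), 2))
```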

4.3 Experimental Setup

Following the system architecture discussed in Section 3.1, the spatio-temporal features are extracted using a C3D model pre-trained on the Sports-1M dataset [9, 15] and then passed to the caption generator module.

In the training process, each caption is wrapped with two special markers, <BOS> and <EOS>, to inform the model about the beginning and end of the caption generation process. We restrict the maximum number of words in a caption to 30; if a caption is shorter than the desired length, it is padded with 0. The captions are tokenized using the Stanford tokenizer [11] and only the 15K most frequent words are retained. For out-of-vocabulary words, a special tag UKN is used. In the testing phase, caption generation starts after observing the start marker and the input visual feature vector; in each iteration, the most probable word is sampled and passed to the next iteration together with the previously generated words and the visual feature vector until the special end marker is generated. For loss minimization, the cross-entropy loss function is used with the Adam optimizer and the learning rate is set to 2×10^-4. To reduce overfitting, a dropout of 0.5 is used and the hidden units of both LSTMs are set to 512. The system is trained with two batch sizes, 64 and 256, for 50 epochs each. On analysing the evaluation metric scores and the quality of the generated captions in terms of coherence and relevancy, the smaller batch size performs better.
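A minimal sketch of the caption preprocessing described above follows, assuming whitespace-tokenised input for brevity (the paper uses the Stanford tokenizer); the index assignments for the special tokens are our own choice, with padding mapped to 0 as stated.

```python
from collections import Counter

BOS, EOS, PAD, UNK_TAG = "<BOS>", "<EOS>", "<PAD>", "UKN"
MAX_LEN, VOCAB_SIZE = 30, 15000

def build_vocab(tokenized_captions):
    """Keep the 15K most frequent words; all other words map to the UKN tag."""
    counts = Counter(tok for caption in tokenized_captions for tok in caption)
    vocab = {PAD: 0, BOS: 1, EOS: 2, UNK_TAG: 3}
    for i, (word, _) in enumerate(counts.most_common(VOCAB_SIZE)):
        vocab[word] = i + 4
    return vocab

def encode_caption(tokens, vocab):
    """Wrap with <BOS>/<EOS>, map OOV words to UKN, pad or truncate to 30 tokens."""
    ids = [vocab[BOS]] + [vocab.get(t, vocab[UNK_TAG]) for t in tokens] + [vocab[EOS]]
    ids = ids[:MAX_LEN]
    return ids + [vocab[PAD]] * (MAX_LEN - len(ids))

# Toy example:
captions = [["a", "man", "is", "lifting", "weights"],
            ["a", "woman", "applies", "mascara"]]
vocab = build_vocab(captions)
print(encode_caption(captions[0], vocab))
```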

4.4 Results

The performance of the system on the public and private test sets is given in Table 3.

Table 3: Performance of the system on the public and private test sets
Evaluation Metrics   Public test set   Private test set
CIDEr                0.24              0.27
BLEU-1               0.63              0.65
BLEU-2               0.43              0.45
BLEU-3               0.30              0.32
BLEU-4               0.20              0.22
METEOR               0.18              0.18
ROUGE-L              0.42              0.43

In Table 2, sample output captions generated by the system are given along with their video IDs.

5 Conclusion

In this paper, a description of the system used for the VATEX 2020 video captioning challenge is presented. We have used an encoder-decoder based video captioning framework for the generation of English captions. Our system scored 0.20 and 0.22 BLEU-4 on the public and private video captioning test sets respectively.

Acknowledgments

This work is supported by Scheme for Promotion of Academic and Research Collaboration (SPARC) Project Code: P995 of No: SPARC/2018-2019/119/SL(IN) under MHRD, Govt of India.

References

  • [1] Nayyer Aafaq, Ajmal Mian, Wei Liu, Syed Zulqarnain Gilani, and Mubarak Shah. Video description: A survey of methods, datasets, and evaluation metrics. ACM Computing Surveys (CSUR), 52(6):1–37, 2019.
  • [2] Andrei Barbu, Alexander Bridge, Zachary Burchill, Dan Coroian, Sven Dickinson, Sanja Fidler, Aaron Michaux, Sam Mussman, Siddharth Narayanaswamy, Dhaval Salvi, et al. Video in sentences out. arXiv preprint arXiv:1204.2742, 2012.
  • [3] Matthew Brand. The "inverse Hollywood problem": From video to scripts and storyboards via causal analysis. In AAAI/IAAI, pages 132–137. Citeseer, 1997.
  • [4] David L Chen and William B Dolan. Collecting highly parallel data for paraphrase evaluation. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1, pages 190–200. Association for Computational Linguistics, 2011.
  • [5] Pradipto Das, Chenliang Xu, Richard F. Doell, and Jason J. Corso. A thousand frames in just a few words: Lingual description of videos through latent topics and sparse object stitching. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2013.
  • [6] Michael Denkowski and Alon Lavie. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the ninth workshop on statistical machine translation, pages 376–380, 2014.
  • [7] Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2625–2634, 2015.
  • [8] Spandana Gella, Mike Lewis, and Marcus Rohrbach. A dataset for telling the stories of social media videos. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 968–974, 2018.
  • [9] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 1725–1732, 2014.
  • [10] Dieter Koller, N Heinze, and Hans-Hellmut Nagel. Algorithmic characterization of vehicle trajectories from image sequences by motion verbs. In Proceedings. 1991 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 90–95. IEEE, 1991.
  • [11] Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55–60, 2014.
  • [12] Pingbo Pan, Zhongwen Xu, Yi Yang, Fei Wu, and Yueting Zhuang. Hierarchical recurrent neural encoder for video representation with application to captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1029–1038, 2016.
  • [13] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pages 311–318. Association for Computational Linguistics, 2002.
  • [14] Alok Singh, Dalton Meitei Thounaojam, and Saptarshi Chakraborty. A novel automatic shot boundary detection algorithm: robust to illumination and motion effect. Signal, Image and Video Processing, 2019.
  • [15] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision, pages 4489–4497, 2015.
  • [16] Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. Cider: Consensus-based image description evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4566–4575, 2015.
  • [17] Subhashini Venugopalan, Marcus Rohrbach, Jeffrey Donahue, Raymond Mooney, Trevor Darrell, and Kate Saenko. Sequence to sequence-video to text. In Proceedings of the IEEE international conference on computer vision, pages 4534–4542, 2015.
  • [18] Subhashini Venugopalan, Huijuan Xu, Jeff Donahue, Marcus Rohrbach, Raymond Mooney, and Kate Saenko. Translating videos to natural language using deep recurrent neural networks. arXiv preprint arXiv:1412.4729, 2014.
  • [19] Xin Wang, Jiawei Wu, Junkun Chen, Lei Li, Yuan-Fang Wang, and William Yang Wang. Vatex: A large-scale, high-quality multilingual dataset for video-and-language research. In Proceedings of the IEEE International Conference on Computer Vision, pages 4581–4591, 2019.
  • [20] Yuecong Xu, Jianfei Yang, and Kezhi Mao. Semantic-filtered soft-split-aware video captioning with audio-augmented feature. Neurocomputing, 357:24–35, 2019.