
A Multimodal Sentiment Dataset for Video Recommendation

Hongxuan Tang,    Hao Liu,    Xinyan Xiao,    Hua Wu
Baidu Inc., Beijing, China
{tanghongxuan, liuhao24, xiaoxinyan, wu_hua}@Baidu.com
Abstract

Recently, multimodal sentiment analysis has seen remarkable advances, and many datasets have been proposed for its development. In general, current multimodal sentiment analysis datasets follow the traditional sentiment/emotion systems, with labels such as positive and negative. However, when applied to video recommendation, the traditional sentiment/emotion system is difficult to leverage for representing the content of videos from the perspective of visual senses and language understanding. Motivated by this, we propose a multimodal sentiment analysis dataset, named the baiDu Video Sentiment dataset (DuVideoSenti), and introduce a new sentiment system designed to describe the sentimental style of a video in the recommendation scenario. Specifically, DuVideoSenti consists of 5,630 videos displayed on Baidu (one of the most popular applications in China, featuring both information retrieval and news recommendation), and each video is manually annotated with a sentimental-style label that describes the user's real feeling about the video. Furthermore, we adopt UNIMO Li et al. (2020) as our baseline for DuVideoSenti. Experimental results show that DuVideoSenti brings new challenges to multimodal sentiment analysis and can serve as a new benchmark for evaluating approaches designed for video understanding and multimodal fusion. We expect the proposed DuVideoSenti to further advance the development of multimodal sentiment analysis and its application to video recommendation.

1 Introduction

Sentiment analysis is an important research area in Natural Language Processing (NLP) with wide applications, such as opinion mining, dialogue generation, and recommendation. Previous work Liu and Zhang (2012); Tian et al. (2020) mainly focused on textual sentiment analysis and achieved promising results. Recently, with the development of short video applications, multimodal sentiment analysis has attracted more attention Tsai et al. (2019); Li et al. (2020); Yu et al. (2021), and many datasets Li et al. (2017); Poria et al. (2018); Yu et al. (2020) have been proposed to advance its development. However, current multimodal sentiment analysis datasets usually follow the traditional sentiment system (positive, neutral, and negative) or emotion system (happy, sad, surprise, and so on), which is far from satisfactory, especially for the video recommendation scenario.

In industrial video recommendation, content-based recommendation methods Pazzani and Billsus (2007) are widely used because they improve the explainability of recommendations and support effective online interventions. Specifically, each video is first represented by a set of tags, which are automatically predicted by neural network models Wu et al. (2015, 2016); Rehman and Belhaouari (2021). Then, a user profile is constructed by aggregating the tags of the videos he/she has watched. Finally, candidate videos are recommended based on the relevance between their tags and the user profile. In general, tags are divided into topic-level and entity-level. As shown in Figure 1, videos about traveling in The Palace Museum may be associated with topic-level tags such as "Tourism" and entity-level tags such as "The Palace Museum". However, in real applications, topic-level and entity-level tags are not sufficient to summarize the content of a video comprehensively. The reason is that videos sharing the same topics or similar entities often give a user different visual and sentimental perceptions. For example, although the two videos in Figure 1 share the same topics, the first one conveys a feeling of great momentum, while the second one feels generic.
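To make this pipeline concrete, the following is a minimal sketch of tag-based content recommendation as described above. The function names, the counting-based profile, and the overlap relevance score are illustrative assumptions, not the production system.

```python
# Minimal sketch of tag-based, content-based recommendation (illustrative only).
from collections import Counter

def build_user_profile(watched_videos):
    """Aggregate the tags of watched videos into a weighted user profile."""
    profile = Counter()
    for video in watched_videos:
        profile.update(video["tags"])
    return profile

def relevance(candidate_tags, profile):
    """Score a candidate video by the overlap between its tags and the user profile."""
    return sum(profile.get(tag, 0) for tag in candidate_tags)

watched = [
    {"tags": ["Tourism", "The Palace Museum"]},
    {"tags": ["Tourism", "Beijing"]},
]
candidate = {"tags": ["Tourism", "The Palace Museum", "History"]}
profile = build_user_profile(watched)
print(relevance(candidate["tags"], profile))  # 3
```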

Based on these observations, we propose a multimodal sentiment analysis dataset, named the baiDu Video Sentiment dataset (DuVideoSenti), and construct a new sentimental-style system consisting of a set of sentimental-style tags (listed in Section 3). These tags are designed to describe a user's real feeling after he/she has watched a video.

In detail, we collect 5,630 videos displayed on Baidu, and each video is manually annotated with a sentimental-style label selected from the above system. In addition, we provide a baseline for the DuVideoSenti corpus based on UNIMO. We expect that the proposed DuVideoSenti corpus can further advance the development of multimodal sentiment analysis and its application to video recommendation.

Figure 1: Two videos displayed on Baidu, which record their authors' trips to The Palace Museum. Although the two videos share the same topic-level and entity-level tags, such as "Tourism" and "The Palace Museum", they give the audience different visual and sentimental perceptions: the first video conveys a feeling of great momentum, while the other one feels generic.

2 Related Work

In this section, we briefly review previous work on multimodal datasets and multimodal sentiment analysis.

2.1 Multimodal Datasets

To promote the development of multimodal sentiment analysis and emotion detection, a variety of multimodal datasets have been proposed, including IEMOCAP Busso et al. (2008), YouTube Morency et al. (2011), MOUD Pérez-Rosas et al. (2013), ICT-MMMO Wöllmer et al. (2013), MOSI Zadeh et al. (2016), CMU-MOSEI Zadeh et al. (2018b), CHEAVD Li et al. (2017), MELD Poria et al. (2018), and CH-SIMS Yu et al. (2020). As mentioned above, these datasets follow the traditional sentiment/emotion system, which does not fit the video recommendation scenario. In this paper, our DuVideoSenti dataset defines a new sentimental-style system customized for video recommendation.

2.2 Multimodal Sentiment Analysis

In multimodal sentiment analysis, intra-modal representation and inter-modal fusion are two essential and challenging subtasks. For intra-modal representation, previous work pays attention to the temporal and spatial characteristics of the different modalities; Convolutional Neural Networks (CNNs), Long Short-Term Memory (LSTM), and Deep Neural Networks (DNNs) are representative approaches for extracting multimodal features Cambria et al. (2017); Zadeh et al. (2017, 2018a). For inter-modal fusion, numerous methods have been proposed recently, for example, concatenation Cambria et al. (2017), the Tensor Fusion Network (TFN) Zadeh et al. (2017), Low-rank Multimodal Fusion (LMF) Liu et al. (2018), the Memory Fusion Network (MFN) Zadeh et al. (2018a), and the Dynamic Fusion Graph (DFG) Zadeh et al. (2018b). Following the recent trend of pre-training, we select UNIMO Li et al. (2020) to build our baseline.
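To illustrate the simplest of these fusion strategies, the sketch below implements feature concatenation followed by a small classifier in PyTorch. The module name, hidden size, and feature dimensions are illustrative assumptions and do not correspond to any of the cited systems.

```python
import torch
import torch.nn as nn

class ConcatFusionClassifier(nn.Module):
    """Minimal late-fusion baseline: concatenate per-modality features, then classify."""

    def __init__(self, text_dim: int, visual_dim: int, num_classes: int):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(text_dim + visual_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, text_feat: torch.Tensor, visual_feat: torch.Tensor) -> torch.Tensor:
        # Fuse modalities by simple concatenation along the feature dimension.
        fused = torch.cat([text_feat, visual_feat], dim=-1)
        return self.classifier(fused)

# Example: a batch of 2 videos with 768-d text and 2048-d visual features, 11 classes.
model = ConcatFusionClassifier(text_dim=768, visual_dim=2048, num_classes=11)
logits = model(torch.randn(2, 768), torch.randn(2, 2048))  # shape: (2, 11)
```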

3 Data Construction

In this section, we introduce the construction details of our DuVideoSenti dataset. DuVideoSenti contains 5,630 video examples, all of which are selected from Baidu. In addition, to ensure data quality, all sentiment labels are annotated by experts. We release DuVideoSenti with the data structure shown in Table 1, which presents a real example containing fields such as the url, title, video features, and the corresponding sentiment label. For video feature extraction, we first sample four frames from the given video at equal time intervals. Then, we use Faster R-CNN Ren et al. (2015) to detect the salient image regions and extract the visual features (pooled ROI features) of each region, following Chen et al. (2020); Li et al. (2020). Specifically, the number of image-region features is set to 100.
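A rough sketch of this preprocessing step is shown below, using OpenCV for equal-interval frame sampling and a generic torchvision Faster R-CNN as a stand-in detector. The detector weights, region scoring, and ROI pooling used for DuVideoSenti follow Chen et al. (2020); Li et al. (2020) and are not reproduced here; this is only an assumed approximation.

```python
import cv2
import torch
import torchvision

def sample_frames(video_path: str, num_frames: int = 4):
    """Sample `num_frames` frames from a video at equal time intervals."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = [int((i + 1) * total / (num_frames + 1)) for i in range(num_frames)]
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames

# A generic pre-trained detector as a placeholder for the Faster R-CNN used in the paper.
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

def detect_regions(frame, max_regions: int = 100):
    """Return bounding boxes of up to `max_regions` salient regions in one frame."""
    tensor = torch.from_numpy(frame).permute(2, 0, 1).float() / 255.0
    with torch.no_grad():
        output = detector([tensor])[0]
    return output["boxes"][:max_regions]
```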

Region | Example
url    | http://quanmin.baidu.com/sv?source=share-h5&pd=qm_share_search&vid=5093910907173814607
title  | 那小孩不让我吃口香糖… (Translated: That kid won't let me chew gum…)
label  | 呆萌可爱 (Cute)
frame1 | [frame image]
frame4 | [frame image]
Table 1: Data structure of the released dataset. (Note that we do not provide the original frame images in the dataset, in order to protect copyright. Instead, we provide the visual features extracted from each frame image, together with the url for watching the video online.)
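For concreteness, one released example might be represented in memory as follows. The field names and placeholder feature values below are assumptions for illustration, not the exact release format; each frame entry stands for the up to 100 pooled ROI feature vectors extracted by Faster R-CNN.

```python
# A hypothetical in-memory view of one DuVideoSenti example (field names assumed).
example = {
    "url": "http://quanmin.baidu.com/sv?source=share-h5&pd=qm_share_search&vid=5093910907173814607",
    "title": "那小孩不让我吃口香糖…",   # "That kid won't let me chew gum…"
    "label": "呆萌可爱",                # Cute
    # One entry per sampled frame; each holds up to 100 pooled-ROI feature vectors.
    "frame_features": {
        "frame1": [[0.12, 0.08, 0.33]],  # placeholder values; real vectors are much longer
        "frame4": [[0.05, 0.41, 0.09]],
    },
}
```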

In addition, we randomly split DuVideoSenti into a training set and a test set, whose sizes are 4,500 and 1,130, respectively. The sentimental-style labels are: “文艺清新 (Hipsterism)”, “时尚炫酷 (Fashion)”, “舒适温馨 (Warm and Sweet)”, “客观理性 (Objective and Rationality)”, “家长里短 (Daily)”, “土里土气 (Old-fashion)”, “呆萌可爱 (Cute)”, “奇葩低俗 (Vulgar)”, “正能量 (Positive Energy)”, “负能量 (Negative Energy)”, and “其他 (Other)”. The class distribution and an example for each sentiment label in DuVideoSenti are shown in Table 2.

Label | Example | #
文艺清新 (Hipsterism) | 令箭荷花的开放过程 (Translated: The opening process of Nopalxochia) | 520
时尚炫酷 (Fashion) | 古驰中国新年特别系列 (Translated: Gucci Chinese New Year limited edition) | 193
舒适温馨 (Warm and Sweet) | 美好的一天,从有爱的简笔画开始 (Translated: A beautiful day starts with simple strokes of love) | 485
客观理性 (Objective and Rationality) | 吹气球太累?是你没找对方法! (Translated: Tired of blowing balloons? Try this method.) | 1,204
家长里短 (Daily) | 今天太阳炎热夏天以到来 (Translated: It's very hot today, summer is coming) | 1,334
土里土气 (Old-fashion) | 广场舞,梨花飞情人泪32步 (Translated: The 32 steps of public square dancing) | 71
呆萌可爱 (Cute) | #创意简笔画#可爱小猫咪怎么画? (Translated: #Creative stick figure# How to draw a cute kitten?) | 522
奇葩低俗 (Vulgar) | 撒网是我的本事,入网是你的荣幸 (Translated: It is your honour to love me) | 282
正能量 (Positive Energy) | 山东齐鲁医院130名医护人员集体出征 (Translated: 130 medical staff of Shandong Qilu Hospital set out together) | 81
负能量 (Negative Energy) | 黑社会被打屁股 (Translated: A gangster gets spanked) | 5
其他 (Other) | 速记英语,真的很快记住 (Translated: English shorthand, really quick to remember) | 933
Total | | 5,630
Table 2: Distribution and examples of the different sentiment classes in DuVideoSenti.

4 Experiments

4.1 Experiment Setting

In our experiments, we select UNIMO-large, which consists of 24 layers of Transformer blocks, as our baseline. The maximum lengths of the text token sequence and the image-region feature sequence are set to 512 and 100, respectively. The learning rate is set to 1e-5, the batch size to 8, and the number of epochs to 20. All experiments are conducted on a single Tesla V100 GPU. We use accuracy to evaluate the performance of the baseline systems.
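The setting above can be summarized as the following configuration, together with the accuracy metric used for evaluation. The dictionary keys are illustrative names rather than the actual options of the UNIMO training scripts.

```python
# Hyperparameters of the UNIMO-large baseline, as reported above (key names assumed).
config = {
    "num_transformer_layers": 24,
    "max_text_tokens": 512,
    "max_image_regions": 100,
    "learning_rate": 1e-5,
    "batch_size": 8,
    "num_epochs": 20,
}

def accuracy(predictions, labels):
    """Fraction of test examples whose predicted sentimental-style label matches the gold label."""
    correct = sum(int(p == g) for p, g in zip(predictions, labels))
    return correct / len(labels)
```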

4.2 Experimental Results

Label | 1 frame | 4 frames | 20 frames
文艺清新 (Hipsterism) | 50.91 | 52.88 | 45.19
时尚炫酷 (Fashion) | 2.56 | 28.20 | 15.38
舒适温馨 (Warm and Sweet) | 27.85 | 27.83 | 37.11
客观理性 (Objective and Rationality) | 62.65 | 67.91 | 65.14
家长里短 (Daily) | 64.41 | 67.69 | 74.90
土里土气 (Old-fashion) | 6.66 | 0.00 | 6.66
呆萌可爱 (Cute) | 60.95 | 39.04 | 68.57
奇葩低俗 (Vulgar) | 29.82 | 33.33 | 10.52
正能量 (Positive Energy) | 5.88 | 41.17 | 47.05
负能量 (Negative Energy) | 0.00 | 0.00 | 0.00
其他 (Other) | 57.75 | 59.89 | 56.14
All | 52.65 | 54.56 | 56.46
Table 3: UNIMO baseline performance (accuracy, %) with different numbers of key frames.
Label | Visual | Textual | Multimodal
文艺清新 (Hipsterism) | 52.88 | 32.69 | 52.88
时尚炫酷 (Fashion) | 17.94 | 17.94 | 28.20
舒适温馨 (Warm and Sweet) | 18.55 | 27.83 | 27.83
客观理性 (Objective and Rationality) | 66.39 | 56.84 | 67.91
家长里短 (Daily) | 61.79 | 65.54 | 67.69
土里土气 (Old-fashion) | 6.66 | 0.00 | 0.00
呆萌可爱 (Cute) | 49.52 | 49.52 | 39.04
奇葩低俗 (Vulgar) | 35.08 | 28.07 | 33.33
正能量 (Positive Energy) | 11.76 | 47.05 | 41.17
负能量 (Negative Energy) | 0.00 | 0.00 | 0.00
其他 (Other) | 56.68 | 41.71 | 59.89
All | 51.85 | 47.25 | 54.56
Table 4: Baseline performance (accuracy, %) using visual-only, textual-only, and multimodal inputs.

Table 3 shows the baseline performance on our dataset; specifically, the UNIMO-large model achieves an accuracy of 54.56% on the test set. This suggests that simply applying a strong and widely used pre-trained model does not yield promising results, which underlines the need for further development of multimodal sentiment analysis, especially for videos.

We are also interested in the effect of the number of sampled key frames. We further test the performance of the baseline when {1, 4, 20} frames are selected. The results are listed in Table 3, which shows that performance decreases as fewer frames are used. This suggests that improving the visual representation can further boost classification performance.

Finally, we evaluate the improvement brought by multimodal fusion. We compare the accuracy of systems that use visual, textual, and multimodal information, respectively. The results are shown in Table 4: fusing the visual and textual information of a video yields the best performance. Furthermore, the visual-only model performs better than the textual-only one; we suspect this is because our sentimental-style system is much more related to a user's visual feelings after watching a video.

5 Conclusion

In this paper, we propose a new multimodal sentiment analysis dataset named DuVideoSenti, which is designed for the video recommendation scenario. Furthermore, we adopt UNIMO as our baseline and test its accuracy under a variety of settings. We expect that DuVideoSenti can further advance multimodal sentiment analysis and promote its application to video recommendation.

References

  • Busso et al. (2008) Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N Chang, Sungbok Lee, and Shrikanth S Narayanan. 2008. Iemocap: Interactive emotional dyadic motion capture database. Language resources and evaluation, 42(4):335–359.
  • Cambria et al. (2017) Erik Cambria, Devamanyu Hazarika, Soujanya Poria, Amir Hussain, and RBV Subramanyam. 2017. Benchmarking multimodal sentiment analysis. In International Conference on Computational Linguistics and Intelligent Text Processing, pages 166–179. Springer.
  • Chen et al. (2020) Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. 2020. Uniter: Universal image-text representation learning. In European conference on computer vision, pages 104–120. Springer.
  • Li et al. (2020) Wei Li, Can Gao, Guocheng Niu, Xinyan Xiao, Hao Liu, Jiachen Liu, Hua Wu, and Haifeng Wang. 2020. Unimo: Towards unified-modal understanding and generation via cross-modal contrastive learning. arXiv preprint arXiv:2012.15409.
  • Li et al. (2017) Ya Li, Jianhua Tao, Linlin Chao, Wei Bao, and Yazhu Liu. 2017. Cheavd: a chinese natural emotional audio–visual database. Journal of Ambient Intelligence and Humanized Computing, 8(6):913–924.
  • Liu and Zhang (2012) Bing Liu and Lei Zhang. 2012. A survey of opinion mining and sentiment analysis. In Mining text data, pages 415–463. Springer.
  • Liu et al. (2018) Zhun Liu, Ying Shen, Varun Bharadhwaj Lakshminarasimhan, Paul Pu Liang, Amir Zadeh, and Louis-Philippe Morency. 2018. Efficient low-rank multimodal fusion with modality-specific factors. arXiv preprint arXiv:1806.00064.
  • Morency et al. (2011) Louis-Philippe Morency, Rada Mihalcea, and Payal Doshi. 2011. Towards multimodal sentiment analysis: Harvesting opinions from the web. In Proceedings of the 13th international conference on multimodal interfaces, pages 169–176.
  • Pazzani and Billsus (2007) Michael J Pazzani and Daniel Billsus. 2007. Content-based recommendation systems. In The adaptive web, pages 325–341. Springer.
  • Pérez-Rosas et al. (2013) Verónica Pérez-Rosas, Rada Mihalcea, and Louis-Philippe Morency. 2013. Utterance-level multimodal sentiment analysis. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 973–982.
  • Poria et al. (2018) Soujanya Poria, Devamanyu Hazarika, Navonil Majumder, Gautam Naik, Erik Cambria, and Rada Mihalcea. 2018. Meld: A multimodal multi-party dataset for emotion recognition in conversations. arXiv preprint arXiv:1810.02508.
  • Rehman and Belhaouari (2021) Atiq Rehman and Samir Brahim Belhaouari. 2021. Deep learning for video classification: A review.
  • Ren et al. (2015) Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems, 28:91–99.
  • Tian et al. (2020) Hao Tian, Can Gao, Xinyan Xiao, Hao Liu, Bolei He, Hua Wu, Haifeng Wang, and Feng Wu. 2020. Skep: Sentiment knowledge enhanced pre-training for sentiment analysis. arXiv preprint arXiv:2005.05635.
  • Tsai et al. (2019) Yao-Hung Hubert Tsai, Shaojie Bai, Paul Pu Liang, J Zico Kolter, Louis-Philippe Morency, and Ruslan Salakhutdinov. 2019. Multimodal transformer for unaligned multimodal language sequences. In Proceedings of the conference. Association for Computational Linguistics. Meeting, volume 2019, page 6558. NIH Public Access.
  • Wöllmer et al. (2013) Martin Wöllmer, Felix Weninger, Tobias Knaup, Björn Schuller, Congkai Sun, Kenji Sagae, and Louis-Philippe Morency. 2013. Youtube movie reviews: Sentiment analysis in an audio-visual context. IEEE Intelligent Systems, 28(3):46–53.
  • Wu et al. (2016) Zuxuan Wu, Yu-Gang Jiang, Xi Wang, Hao Ye, and Xiangyang Xue. 2016. Multi-stream multi-class fusion of deep networks for video classification. In Proceedings of the 24th ACM international conference on Multimedia, pages 791–800.
  • Wu et al. (2015) Zuxuan Wu, Yu-Gang Jiang, Xi Wang, Hao Ye, Xiangyang Xue, and Jun Wang. 2015. Fusing multi-stream deep networks for video classification. arXiv preprint arXiv:1509.06086.
  • Yu et al. (2020) Wenmeng Yu, Hua Xu, Fanyang Meng, Yilin Zhu, Yixiao Ma, Jiele Wu, Jiyun Zou, and Kaicheng Yang. 2020. Ch-sims: A chinese multimodal sentiment analysis dataset with fine-grained annotation of modality. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3718–3727.
  • Yu et al. (2021) Wenmeng Yu, Hua Xu, Ziqi Yuan, and Jiele Wu. 2021. Learning modality-specific representations with self-supervised multi-task learning for multimodal sentiment analysis. arXiv preprint arXiv:2102.04830.
  • Zadeh et al. (2017) Amir Zadeh, Minghai Chen, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. 2017. Tensor fusion network for multimodal sentiment analysis. arXiv preprint arXiv:1707.07250.
  • Zadeh et al. (2018a) Amir Zadeh, Paul Pu Liang, Navonil Mazumder, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. 2018a. Memory fusion network for multi-view sequential learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32.
  • Zadeh et al. (2016) Amir Zadeh, Rowan Zellers, Eli Pincus, and Louis-Philippe Morency. 2016. Mosi: multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos. arXiv preprint arXiv:1606.06259.
  • Zadeh et al. (2018b) AmirAli Bagher Zadeh, Paul Pu Liang, Soujanya Poria, Erik Cambria, and Louis-Philippe Morency. 2018b. Multimodal language analysis in the wild: Cmu-mosei dataset and interpretable dynamic fusion graph. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2236–2246.