Contrastive Graph Multimodal Model for Text Classification in Videos
Abstract
The extraction of text information from videos is a critical step towards the semantic understanding of videos. It usually involves two steps: (1) text recognition and (2) text classification. To localize texts in videos, one can resort to a large number of OCR-based text recognition methods. However, to our knowledge, there is no existing work focused on the second step of video text classification, which limits the guidance available to downstream tasks such as video indexing and browsing. In this paper, we are the first to address this new task of video text classification by fusing multimodal information to deal with the challenging scenario where different types of video texts may be confused due to various colors, unknown fonts and complex layouts. In addition, we tailor a specific module called CorrelationNet to reinforce the feature representation by explicitly extracting layout information. Furthermore, contrastive learning is utilized to explore inherent connections between samples using plentiful unlabeled videos. Finally, we construct a new well-defined industrial dataset from the news domain, called TI-News, which is dedicated to building and evaluating video text recognition and classification applications. Extensive experiments on TI-News demonstrate the effectiveness of our method.
Index Terms— Multimodal Classification, Video Understanding, Graph Neural Network, Contrastive Learning
1 Introduction
With massive video data generated every day, the extraction of textual information from videos is essential to many video applications. Texts embedded in videos usually carry a rich semantic description of the video content, and this information provides a high-level index for content-based video indexing and browsing. The extraction of video text information proceeds in two main steps: text recognition followed by text classification. To localize texts in videos, one can resort to a large number of OCR-based text detection methods. A number of deep neural network (DNN) based methods [1, 2, 3] have been proposed to automatically learn effective text features and localize text in an image using various DNN models such as convolutional neural networks (CNN) and recurrent neural networks (RNN). To make use of complementary text cues in multiple related video frames, some multi-frame video text detection methods [4, 5], such as spatial-temporal analysis, have been proposed to further improve the overall detection performance on top of single-image text detection methods.
In this paper, we concentrate on the second step of video text classification, which plays a more important role in downstream tasks such as video indexing and browsing. For example, a caption summarizes the whole video and conveys the most important or relevant information in the original video content, while a subtitle expresses what the speaker says and contains abundant details describing video events. However, to our knowledge, there is no existing work attempting to address this task, which leads us to turn to similar tasks such as text classification and scene/caption classification. Text classification is a fundamental problem behind many research topics in Natural Language Processing (NLP), such as topic categorization, sentiment analysis and relation extraction. Existing text classification methods can be categorized into bag-of-words/n-gram models [6], CNNs [7, 8], RNNs [9, 10], and Transformers [11, 12]. Scene/caption classification aims to decide whether a text in an image belongs to the background or the foreground. Roy et al. [13] proposed temporal integration for word-wise caption and scene text identification. Ghosh et al. [14] proposed identifying the presence of graphical text in scene images using a CNN.
Unfortunately, these approaches have important shortcomings and fail to achieve satisfactory performance. Text classification based methods take only text as input, ignoring other modalities that carry critical discriminative information for classes such as caption and subtitle. Fig.1(a) and Fig.1(b) show an example of a caption and a subtitle on one video frame, reading “it is not necessary to wear a mask in an outdoor area with a dispersing crowd during Chinese May Day” and “now tourists travel rationally”, respectively. In this case, it is difficult to distinguish one from the other based only on text information, so text classification methods usually collapse due to weak discrimination. Furthermore, Fig.1 shows that these two texts appear in different colors and positions, inspiring us to exploit the multi-modality of visual and coordinate information. As for scene/caption classification, it is not a real classification task: it only identifies whether a text belongs to the background or the foreground, so this coarse-grained task does not assign category labels to texts. Consequently, a number of useless texts, such as rolling text in news videos, are kept and have no effect, or even a negative effect, on downstream video tasks.
In order to effectively assign a label to each text in a video, we propose a novel Contrastive Graph Multimodal Model, called CGMM, to leverage multimodal information in a harmonious manner. A specific module called CorrelationNet is established to explicitly extract layout information and enhance the feature representation; meanwhile, contrastive learning is employed to learn general features from plentiful unlabeled samples. Due to the lack of public datasets related to our task, we construct a new large-scale dataset, called TI-News, from the news domain.
Fig.1: (a) an example of a caption; (b) an example of a subtitle.
2 Methodology
The schematic diagram of our CGMM algorithm is shown in Fig.2. The architecture of CGMM consists of four parts. Firstly, multimodal features of vision, position and text are extracted, where the corresponding backbones are pre-trained with contrastive learning on unlabeled videos. Then, the features of different modalities are fused to obtain the multimodal representations of text boxes. Furthermore, we develop a new module called CorrelationNet to aggregate spatial information among neighboring texts. Finally, a classification head tags each text to determine its category.

2.1 Multimodal Feature Extraction
As mentioned in Section 1, multimodal signals provide abundant discriminative information, so we need to extract multimodal representations of the texts. In this paper, we mainly extract visual, textual and coordinate features.
For the visual modality, in contrast to conventional approaches that usually adopt a classical VGG [15] or ResNet-based [16] network, we construct a shallow neural network as the backbone. As observed in Fig.1, texts differ in low-level features such as color, font and size, so the deeper networks above are abandoned: they mainly extract high-level semantic information, which is not what we need. Correspondingly, BERT [17] is used as the backbone for textual extraction. The upper-left and bottom-right coordinates represent the position of a text in the frame.
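A minimal PyTorch-style sketch of the two backbones described above is given below; the channel widths, strides, and the BERT checkpoint name are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
from transformers import BertModel


class ShallowVisualBackbone(nn.Module):
    """A shallow 3-layer CNN that keeps low-level cues (color, font, size)."""

    def __init__(self, out_channels: int = 256):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, out_channels, kernel_size=3, stride=2, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, 3, H, W) -> frame feature map: (B, out_channels, H/8, W/8)
        return self.features(frames)


# Textual backbone: a BERT encoder (the paper truncates it to 4 layers);
# "bert-base-chinese" is an assumed checkpoint used here only for illustration.
text_encoder = BertModel.from_pretrained("bert-base-chinese")
```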
2.2 Multimodal Feature Fusion
The fusion of multimodal features is a critical step to obtain the multimodal representation of video text. The visual and positional modalities are processed as below:
$f_{vp} = \mathrm{Transformer}\big(\mathrm{ROIAlign}(f_v, f_p)\big)$   (1)
where $f_v$ and $f_p$ denote the raw visual and positional features, respectively. Firstly, ROIAlign [18] is utilized to extract a patch feature with $f_v$ and $f_p$ as its inputs. Then, a transformer [19] is used to learn the implicit relation between a patch box and its corresponding full frame. Finally, the resulting visual-positional feature $f_{vp}$ and the textual feature are simply concatenated.
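To make this fusion step concrete, the hedged sketch below pools a patch feature for each text box with torchvision's `roi_align`, relates it to a pooled full-frame token through a small transformer encoder, and concatenates the result with the BERT text feature. The 7×7 pooling size, the projections, and the two-token sequence layout are assumptions, not the paper's exact code.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align

d_model = 256
patch_proj = nn.Linear(256 * 7 * 7, d_model)     # pooled patch -> token (assumed dims)
frame_proj = nn.Linear(256, d_model)             # globally pooled frame -> token
relation_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=8, dim_feedforward=256),
    num_layers=2)


def fuse(frame_feat: torch.Tensor, boxes: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
    """frame_feat: (1, 256, H', W') full-frame feature map; boxes: (N, 4) text-box
    coordinates in feature-map scale; text_feat: (N, 768) BERT sentence embeddings."""
    rois = torch.cat([torch.zeros(len(boxes), 1), boxes], dim=1)        # (N, 5): batch index + xyxy
    patch = roi_align(frame_feat, rois, output_size=(7, 7))             # (N, 256, 7, 7) patch features
    patch_tok = patch_proj(patch.flatten(1))                            # (N, d_model)
    frame_tok = frame_proj(frame_feat.mean(dim=(2, 3)))                 # (1, d_model) global frame token

    # Length-2 sequence per box: [patch token, full-frame token]; the encoder lets
    # each patch attend to its frame, approximating the relation module of Eq. (1).
    seq = torch.stack([patch_tok, frame_tok.expand_as(patch_tok)], dim=0)   # (2, N, d_model)
    f_vp = relation_encoder(seq)[0]                                     # visual-positional feature (N, d_model)
    return torch.cat([f_vp, text_feat], dim=-1)                         # concatenate with the textual feature
```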
2.3 Contrastive Learning
Due to limited labeled data, we utilize a contrastive learning [20] strategy to fully explore inherent connections between samples using plentiful unlabeled videos. Fig.3 illustrates the overview of the contrastive learning framework. It learns a representation that embeds positive input pairs nearby while pushing negative pairs far apart.

Given a batch of samples $\{x_i\}_{i=1}^{N}$, the positive sample $x_i^{+}$ of an anchor sample $x_i$ is constructed by disturbing its coordinates, changing the image color, or replacing text with synonyms in lexical combinations. Thus a positive input pair is formed as $(x_i, x_i^{+})$. A negative pair is formed as $(x_i, x_j)$ with $j \neq i$. Positive and negative sample pairs are fed into the contrastive learning framework.
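The sketch below illustrates one way such positive pairs could be generated from an unlabeled text-box sample; the jitter ranges, the `ColorJitter` parameters, and the `synonyms` table are hypothetical choices for illustration only.

```python
import random
import torchvision.transforms as T

color_jitter = T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4)


def make_positive(sample: dict, synonyms: dict) -> dict:
    """sample: dict with 'patch' (PIL image), 'box' (x1, y1, x2, y2) and 'text' (str)."""
    pos = dict(sample)
    x1, y1, x2, y2 = sample["box"]
    dx, dy = random.randint(-5, 5), random.randint(-5, 5)        # disturb the coordinates
    pos["box"] = (x1 + dx, y1 + dy, x2 + dx, y2 + dy)
    pos["patch"] = color_jitter(sample["patch"])                 # change the image color
    # Schematic word-level synonym replacement (tokenization is language dependent).
    pos["text"] = " ".join(synonyms.get(w, w) for w in sample["text"].split())
    return pos

# Any other sample in the same batch serves as a negative for the anchor.
```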
Based on the unimodal backbones trained with contrastive learning, we train the video text classification model on labeled videos with two strategies: fine-tuning and joint learning. Fine-tuning uses only traditional supervised learning for our task, while joint learning combines self-supervised contrastive learning with the original supervised learning. The joint loss is defined as follows:
$\mathcal{L} = \mathcal{L}_{ce} + \lambda \mathcal{L}_{con}$   (2)
where $\mathcal{L}_{con}$ is the contrastive loss defined in [20], $\mathcal{L}_{ce}$ is the classical cross-entropy loss, and $\lambda$ is the weight balancing the two losses.
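A minimal sketch of this joint objective is shown below, pairing the supervised cross-entropy term with an in-batch NT-Xent-style contrastive term (one common instantiation of the family of losses surveyed in [20]); the temperature and the value of $\lambda$ are assumptions.

```python
import torch
import torch.nn.functional as F


def nt_xent(z_anchor: torch.Tensor, z_positive: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """In-batch contrastive loss: each anchor's positive is its augmented view;
    all other samples in the batch act as negatives."""
    z_a = F.normalize(z_anchor, dim=-1)
    z_p = F.normalize(z_positive, dim=-1)
    logits = z_a @ z_p.t() / temperature                 # (N, N) similarity matrix
    targets = torch.arange(len(z_a), device=z_a.device)  # matching index is the positive
    return F.cross_entropy(logits, targets)


def joint_loss(logits, labels, z_anchor, z_positive, lam: float = 0.5) -> torch.Tensor:
    l_ce = F.cross_entropy(logits, labels)   # supervised classification loss
    l_con = nt_xent(z_anchor, z_positive)    # self-supervised contrastive loss
    return l_ce + lam * l_con                # Eq. (2)
```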
2.4 CorrelationNet
Edited texts in news videos, such as captions and subtitles, usually have regular layouts. Fig.1 shows examples of a caption and a subtitle on video frames. It is observed that the caption text appears above the person-information text, as displayed in Fig.1(a), while the person-information text appears above the subtitle text, as shown in Fig.1(b). This phenomenon inspires us to utilize structural information to reinforce the feature representation. Motivated by Graph Neural Networks [21], CorrelationNet is proposed to model the relations among adjacent texts.
For a certain text box $v_i$, consider its adjacent neighbors $\{v_j\}_{j\in\mathcal{N}(i)}$. These text boxes are used together to generate the aggregated feature of $v_i$: the features of neighboring text boxes are weighted by an attention mechanism and then summed to obtain the fused feature. The graph constructed from the text boxes is illustrated in Fig.4.

$w_{ij}$ denotes the weight between two text boxes $v_i$ and $v_j$. Concretely, the features are transformed by a fully connected network and then subtracted to obtain a representation measuring their difference. Finally, a multilayer perceptron (MLP) is applied to obtain the weight.
After calculating all the weights between $v_i$ and its neighboring texts, the aggregated feature $\tilde{f}_i$ is obtained:
$\tilde{f}_i = \sum_{j\in\mathcal{N}(i)} w_{ij} f_j$   (3)
where $j\in\mathcal{N}(i)$ indexes the neighbors of $v_i$, and $f_j$ is the original feature of text box $v_j$.
By using this weighted aggregation strategy, CorrelationNet can learn the contribution of each modality by highlighting its importance. The final multimodal feature $\hat{f}_i$, which is fed to the classification head, is:
$\hat{f}_i = \mathrm{Concat}(f_i, \tilde{f}_i)$   (4)
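The following sketch shows one possible implementation of CorrelationNet consistent with Eqs. (3)-(4): pairwise weights come from an MLP over transformed feature differences, neighbor features are aggregated by a weighted sum, and the result is fused with the box's own feature. The hidden sizes, the softmax normalization of the weights, and the concatenation-style fusion are assumptions.

```python
import torch
import torch.nn as nn


class CorrelationNet(nn.Module):
    def __init__(self, dim: int, hidden: int = 128):
        super().__init__()
        self.fc = nn.Linear(dim, hidden)                         # shared feature transform
        self.mlp = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))           # difference -> scalar weight

    def forward(self, feats: torch.Tensor, neighbors: list) -> torch.Tensor:
        """feats: (N, dim) multimodal features of all boxes in a frame;
        neighbors[i]: list of indices of the boxes adjacent to box i."""
        h = self.fc(feats)
        out = []
        for i, nbrs in enumerate(neighbors):
            diff = h[i].unsqueeze(0) - h[nbrs]                           # difference representation
            w = torch.softmax(self.mlp(diff).squeeze(-1), dim=0)         # attention weights w_ij (assumed softmax)
            agg = (w.unsqueeze(-1) * feats[nbrs]).sum(dim=0)             # Eq. (3): weighted sum over neighbors
            out.append(torch.cat([feats[i], agg], dim=-1))               # Eq. (4): fuse with the box's own feature
        return torch.stack(out)                                          # (N, 2*dim), fed to the classification head
```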
3 Experiments
3.1 Dataset
There are no publicly available datasets with unified annotations for video text recognition and classification. To promote this new area, we construct a new well-defined dataset of news videos, called TI-News, dedicated to studying video text recognition and classification jointly. Over 450,000 text-box samples are annotated in TI-News, extracted from over 100 videos across 23 news programs. For training, 350,000 samples come from 90 videos belonging to 8 programs. For testing, 100,000 samples are selected for evaluation: 50,000 samples are collected from the same programs as the training data to form the TI-News standard set, while the other 50,000 samples are collected from 15 additional distinct videos to form the TI-News generalization set for generalization testing.
We use a general OCR engine to locate and recognize all the texts in the videos. Then, an annotation system for the video text classification task is designed. Three representative categories are defined: Caption, Subtitle and Person information. In addition, the class Others is used to denote text boxes that do not belong to the above categories. Besides, a variety of pre-trained models for visual, audio and text feature extraction are provided, pre-trained on millions of short videos using contrastive learning; these pre-trained models therefore have a strong feature representation ability.
3.2 Training Details
For the visual modality, each video frame is first resized to a fixed shape. A simple model built from a 3-layer CNN is used as the feature extractor. Correspondingly, a 4-layer BERT with an embedding dimension of 768 is used to extract textual features. A stacked transformer block with 2 TransformerEncoderLayers (d_model=256, nhead=8, dim_feedforward=256) is used to implicitly learn the relation between a patch box and the full image. Adam is chosen as the optimizer with a learning rate of 0.0002.
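The stated transformer-block and optimizer settings translate directly into PyTorch; the sketch below mirrors only those settings, with the relation block standing in for the full CGMM parameter set.

```python
import torch
import torch.nn as nn

# Patch <-> full-frame relation module: 2 stacked encoder layers with the stated sizes.
relation_block = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=8, dim_feedforward=256),
    num_layers=2)

# Adam with the stated learning rate of 0.0002 (applied here to the relation block
# alone as a stand-in for the complete model's parameters).
optimizer = torch.optim.Adam(relation_block.parameters(), lr=2e-4)
```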
3.3 Evaluation Criteria
We use the standard precision, recall and F1 score to evaluate the performance of CGMM. To examine the effectiveness of the designed modules, ablation studies are conducted on both the TI-News standard set and the generalization set.
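Assuming per-text-box category predictions, these metrics can be computed with scikit-learn as sketched below; the macro averaging is an assumption, as the paper does not specify how scores are averaged over categories.

```python
from sklearn.metrics import precision_recall_fscore_support


def evaluate(y_true, y_pred):
    """y_true / y_pred: per-text-box category labels (e.g. Caption, Subtitle, ...)."""
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0)
    return {"precision": p, "recall": r, "f1": f1}
```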
3.4 Results on TI-News dataset
Results on the TI-News standard dataset and the TI-News generalization dataset are shown in Table 1 and Table 2, respectively. In Table 1, the precision and recall of CGMM reach 89.05% and 92.34% on the standard set, with an F1 score of 90.67%, a strong performance on the video text classification task. Moreover, it is easy to see that good results cannot be obtained with a single modality. In Table 2, the precision and recall of CGMM reach 84.33% and 87.44% on the generalization set, with an F1 score of 85.86%, which shows that our method generalizes well. When only the visual modality is considered, the results are close to a random distribution. By incorporating the text modality, precision increases considerably for the caption and subtitle categories. Furthermore, CorrelationNet aggregates the features of neighboring text boxes, which greatly helps the classification model. Contrastive learning is also effective in enhancing the modality representations.
Table 1: Classification results on the TI-News standard set (%).

| Methods | Precision | Recall | F1 |
|---|---|---|---|
| CGMM w/o CV | 76.14 | 86.65 | 81.05 |
| CGMM w/o NLP | 40.63 | 32.11 | 35.87 |
| CGMM w/o POS | 86.41 | 92.15 | 89.19 |
| CGMM w/o CorrelationNet | 87.50 | 91.44 | 89.43 |
| CGMM w/o Contrastive learning | 86.96 | 92.02 | 89.42 |
| CGMM (proposed) | 89.05 | 92.34 | 90.67 |
Table 2: Classification results on the TI-News generalization set (%).

| Methods | Precision | Recall | F1 |
|---|---|---|---|
| CGMM w/o CV | 72.10 | 82.05 | 76.76 |
| CGMM w/o NLP | 38.47 | 30.41 | 33.97 |
| CGMM w/o POS | 81.83 | 89.55 | 85.52 |
| CGMM w/o CorrelationNet | 82.85 | 86.60 | 84.69 |
| CGMM w/o Contrastive learning | 82.34 | 87.92 | 85.04 |
| CGMM (proposed) | 84.33 | 87.44 | 85.86 |
3.5 Ablation Study
Backbone. We study how changing the backbone affects the performance of the proposed CGMM, as shown in Table 3. In the proposed CGMM, a simple 3-layer CNN and a 4-layer BERT are employed as the backbones of the visual and textual modalities, respectively. First, the 3-layer CNN is replaced by a MobileNetV2 network, which leads to a drop of 7.5% in F1 score compared with the proposed CGMM. Then, for the backbone of the textual stream, we deepen the BERT network from 4 layers to 8 layers. The 4-layer BERT is observed to perform better.
Table 3: Ablation study on the backbone (%).

| Methods | Backbone | Precision | Recall | F1 |
|---|---|---|---|---|
| CGMM | CV (MobileNetV2) | 80.13 | 86.34 | 83.12 |
| CGMM | NLP (8 layers) | 88.65 | 92.01 | 90.30 |
| CGMM (proposed) | – | 89.05 | 92.34 | 90.67 |
Contrastive Learning. We empirically analyze the effect of contrastive learning. Table 4 presents the results for three aspects: CV, NLP and position. We construct variants of positive samples by disturbing their coordinates, changing the image color, or replacing text with synonyms in lexical combinations. The results indicate that applying contrastive learning to all three aspects achieves the best performance.
Table 4: Ablation study on contrastive learning (%).

| Methods | Contrastive | Precision | Recall | F1 |
|---|---|---|---|---|
| CGMM | CV | 88.23 | 90.98 | 89.58 |
| CGMM | CV+POS | 88.47 | 91.77 | 90.09 |
| CGMM (proposed) | – | 89.05 | 92.34 | 90.67 |
4 Conclusion
This paper addresses a new video text classification problem, which aims to assign a label to each text in a video. To improve the discrimination ability, we propose a multimodal network, called CGMM, that fuses the multimodal information of vision, text and position into a unified framework. To exploit layout information, a new module, CorrelationNet, is developed to use latent correlations among neighboring texts. Furthermore, contrastive learning is used to strengthen the representations using unlabeled videos. For the news program application, a new dataset, TI-News, is established. Experiments on TI-News verify the effectiveness of our method.
References
- [1] Minghui Liao, Zhen Zhu, Baoguang Shi, Gui-song Xia, and Xiang Bai, “Rotation-sensitive regression for oriented scene text detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 5909–5918.
- [2] Zhi Tian, Weilin Huang, Tong He, Pan He, and Yu Qiao, “Detecting text in natural image with connectionist text proposal network,” in European conference on computer vision. Springer, 2016, pp. 56–72.
- [3] Xinyu Zhou, Cong Yao, He Wen, Yuzhi Wang, Shuchang Zhou, Weiran He, and Jiajun Liang, “East: an efficient and accurate scene text detector,” in Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 2017, pp. 5551–5560.
- [4] Lan Wang, Yang Wang, Susu Shan, and Feng Su, “Scene text detection and tracking in video with background cues,” in Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval, 2018, pp. 160–168.
- [5] Chun Yang, Xu-Cheng Yin, Wei-Yi Pei, Shu Tian, Ze-Yu Zuo, Chao Zhu, and Junchi Yan, “Tracking based multi-orientation scene text detection: A unified framework with dynamic programming,” IEEE Transactions on Image Processing, vol. 26, no. 7, pp. 3235–3248, 2017.
- [6] Sida I Wang and Christopher D Manning, “Baselines and bigrams: Simple, good sentiment and topic classification,” in Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2012, pp. 90–94.
- [7] Hoa T Le, Christophe Cerisara, and Alexandre Denis, “Do convolutional networks need to be deep for text classification?,” in Workshops at the Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
- [8] Qi Li, Pengfei Li, Kezhi Mao, and Edmond Yat-Man Lo, “Improving convolutional neural network for text classification by recursive data pruning,” Neurocomputing, vol. 414, pp. 143–152, 2020.
- [9] Dani Yogatama, Chris Dyer, Wang Ling, and Phil Blunsom, “Generative and discriminative text classification with recurrent neural networks,” arXiv preprint arXiv:1703.01898, 2017.
- [10] Honglun Zhang, Liqiang Xiao, Yongkun Wang, and Yaohui Jin, “A generalized recurrent neural architecture for text classification with multi-task learning,” arXiv preprint arXiv:1707.02892, 2017.
- [11] Maosheng Guo, Yu Zhang, and Ting Liu, “Gaussian transformer: a lightweight approach for natural language inference,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2019, vol. 33, pp. 6489–6496.
- [12] Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon, “Unified language model pre-training for natural language understanding and generation,” arXiv preprint arXiv:1905.03197, 2019.
- [13] Sangheeta Roy, Palaiahnakote Shivakumara, Umapada Pal, Tong Lu, and Ainuddin Wahid Bin Abdul Wahab, “Temporal integration for word-wise caption and scene text identification,” in 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR). IEEE, 2017, vol. 1, pp. 349–354.
- [14] Mridul Ghosh, Himadri Mukherjee, Sk Md Obaidullah, KC Santosh, Nibaran Das, and Kaushik Roy, “Identifying the presence of graphical texts in scene images using cnn,” in 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW). IEEE, 2019, vol. 1, pp. 86–91.
- [15] Karen Simonyan and Andrew Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
- [16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
- [17] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
- [18] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick, “Mask r-cnn,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2961–2969.
- [19] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin, “Attention is all you need,” arXiv preprint arXiv:1706.03762, 2017.
- [20] Phuc H Le-Khac, Graham Healy, and Alan F Smeaton, “Contrastive representation learning: A framework and review,” IEEE Access, 2020.
- [21] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini, “The graph neural network model,” IEEE transactions on neural networks, vol. 20, no. 1, pp. 61–80, 2008.