Incorporating Domain Knowledge Graph into Multimodal Movie Genre Classification with Self-Supervised Attention and Contrastive Learning
Abstract.
Multimodal movie genre classification has always been regarded as a demanding multi-label classification task due to the diversity of multimodal data such as posters, plot summaries, trailers and metadata. Although existing works have made great progress in modeling and combining each modality, they still face three issues: 1) unutilized group relations in metadata, 2) unreliable attention allocation, and 3) indiscriminative fused features. Given that knowledge graphs have been proven to contain rich information, we present a novel framework that exploits the knowledge graph from various perspectives to address the above problems. As a preparation, the metadata is processed into a domain knowledge graph, and a translation-based knowledge graph embedding model is adopted to capture the relations between entities. First, we use the group relations in metadata to retrieve the relevant embedding from the knowledge graph and integrate it with the other modalities. Next, we introduce an Attention Teacher module for reliable attention allocation based on self-supervised learning. It learns the distribution of the knowledge graph and produces rational attention weights. Finally, a Genre-Centroid Anchored Contrastive Learning module is proposed to strengthen the discriminative ability of the fused features. The embedding space of the anchors is initialized from the genre entities in the knowledge graph. To verify the effectiveness of our framework, we collect MM-IMDb 2.0, a larger and more challenging dataset than MM-IMDb. The experimental results on both datasets demonstrate that our model is superior to state-of-the-art methods. Our code and dataset are available at IDKG.git.
1. Introduction
Movie genre classification is a fundamental task for downstream applications such as movie recommendation (Choi et al., 2012), understanding (Islam and Bertasius, 2022), editing (Bruckert et al., 2022), and description (Rohrbach et al., 2017). Previous studies (Ertugrul and Karagoz, 2018; Yadav and Vishwakarma, 2020) have achieved strong results in movie genre classification with a single modality such as posters, plot summaries, movie trailers, audio, or metadata. Nowadays more researchers (Cascante et al., 2019; Huang et al., 2020; Bain et al., 2020; Zhang et al., 2022a; Vielzeuf et al., 2018; Kiela et al., 2019; Braz et al., 2021; Sankaran et al., 2021) focus on multimodal sources, which can be an arbitrary combination of multiple modalities. By taking advantage of multimodal information, existing methods have made great progress in movie genre classification. However, they still leave three issues unsolved:

1) Unutilized group relations in metadata. As illustrated in Figure 1, group relations indicate that entities belonging to the same group usually appear together. To give two real-world examples: if Nolan is the director of a movie, it is likely to be a science-fiction film; if Emma Watson starred in a movie, it is probably not a comedy. However, recent works typically ignore group relations when modeling metadata. Behrouzi et al. (Behrouzi et al., 2022) extract features from metadata and fuse them with other modality features through a random forest classifier. Seo et al. (Seo et al., 2022) train a graph attention network on an undirected graph composed of movie nodes. In our opinion, these methods could be further improved if group relations were taken into consideration.
2) Unreliable attention allocation. Intuitively, different samples should assign varying weights to each modality to boost movie genre prediction with multimodal data. Previous works (Arevalo et al., 2017; Behrouzi et al., 2022) adopt an attention module to assign different weights to the various modalities (e.g., plot summary, poster, trailer, audio). Nevertheless, reliability is not guaranteed because the attention module is trained without supervision. Consequently, the produced attention can be irrational.
3) Indiscriminative fused features. Existing methods (Arevalo et al., 2017; Yu et al., 2022; Vielzeuf et al., 2018; Braz et al., 2021; Sankaran et al., 2021; Xu et al., 2023) typically utilize a pre-trained model for each modality to obtain discriminative single-modality features, which are then fused for genre classification. However, the fused feature space tends to drift away from the original feature space of each modality, which harms its discriminative ability. As a result, fused features tend to be inefficient for predicting genres.
Inspired by previous methods (Yao et al., 2017; Bevilacqua and Navigli, 2020; Wang et al., 2022b) that leverage knowledge graphs, we propose a novel framework named IDKG (Incorporating Domain Knowledge Graph) for movie genre classification. Our motivation is to exploit the knowledge graph from different perspectives to solve the aforementioned issues. To begin with, we construct a domain knowledge graph from the metadata, which includes directors, casts, titles and genres. Moreover, we adopt a translation-based knowledge graph embedding model (such as TransH (Wang et al., 2014) or TransR (Lin et al., 2015)) to capture the relations among entities in the knowledge graph. For the first issue, we leverage the group relations present in the metadata to retrieve the pertinent embedding from the knowledge graph. The retrieved embedding is then integrated with the other modalities to improve classification accuracy. To alleviate the unreliable attention allocation problem, we propose an Attention Teacher (AT) module that guides the attention module to produce rational attention scores based on self-supervised learning. Our AT module captures the distribution of the knowledge graph to generate pseudo labels for attention scores and utilizes a suitably designed loss function to train the attention module. As for the indiscriminative fused features, we propose a Genre-Centroid Anchored Contrastive Learning (G-CACL) module to strengthen the discriminative ability of the features. It can be hard to select positive and negative pairs for samples with multiple genres in contrastive learning. To solve this problem, our G-CACL module defines the centroid of multiple genre embeddings as the positive anchor. The enlarged genre space provides feasible optimization directions for the fused features to enhance their discriminative ability.
To verify the effectiveness of our proposed IDKG, we further create a new dataset, MM-IMDb 2.0, which is more challenging than the MM-IMDb dataset. It comprises 33,742 movies collected from the IMDb website, with the same genre set as MM-IMDb. Notably, the ratio between the numbers of head and tail genres is enlarged, thereby increasing the task difficulty.
Our main contributions can be summarized as follows:
• We propose a novel framework called IDKG that subtly exploits the knowledge graph from different perspectives. To the best of our knowledge, we are the first to incorporate a knowledge graph into the multimodal movie genre classification task.
• We utilize group relations in metadata to obtain relevant embedding from the knowledge graph. With selected embedding as an additional modality source, performance is significantly improved.
• We propose an AT module to alleviate the unreliable attention allocation problem. It obtains pseudo labels from the distribution of the knowledge graph and trains the attention module in a self-supervised manner. Owing to more reliable attention scores, each modality is assigned a more reasonable weight.
• We propose a G-CACL module to alleviate the indiscriminative fused features problem. Centroids of genre embedding from the knowledge graph are regarded as positive anchors. The contrastive learning strategy is applied to enhance fused feature representation.
• We create a new and more challenging dataset MM-IMDb 2.0 to verify the effectiveness of our proposed method. Extensive experiments are conducted to compare our IDKG with current state-of-the-art methods. Experimental results demonstrate that our method outperforms them by a huge margin.
2. Related Work
2.1. Movie Genre Classification
Movie genre classification methods can be divided into two categories: single-modality-based and multimodal-based. The former predicts genres from one kind of modality, such as posters, plot summaries, movie trailers, audio, or metadata. (Simões et al., 2016; Yadav and Vishwakarma, 2020) extract features from movie trailers, while (Wi et al., 2020) and (Ertugrul and Karagoz, 2018) focus on using only poster images or plot summaries. As single-modality data is relatively simple to process, these methods have achieved excellent performance. However, for multimodal-based methods (Behrouzi et al., 2022; Ben-Ahmed and Huet, 2018; Fish et al., 2020), which combine two or more modalities, the results have not been as good due to the complexity of the data features.
2.2. Self-supervised Learning
Self-supervised learning is a learning paradigm without a direct supervision signal; the supervision signal is instead generated from the features of the dataset itself. Nowadays, self-supervised learning is attracting more attention from researchers, with pretext tasks such as Next Word Prediction (Iter et al., 2020; Devlin et al., 2018) and Automated Text Augmentation (Meng et al., 2021; Giorgi et al., 2021) in natural language processing, and Colorization (Vondrick et al., 2018; Zhang et al., 2016) and Context Prediction (Noroozi and Favaro, 2016; Misra et al., 2016) in computer vision. Moreover, self-supervised learning (Misra and Maaten, 2020; Hendrycks et al., 2019; Baevski et al., 2022; Chen et al., 2022) has great potential to replace fully supervised learning in the representation learning domain.
In this paper, we provide a novel perspective on self-supervised learning: we supervise the attention module with pseudo labels summarized from the distribution of the constructed domain knowledge graph.

2.3. Supervised Contrastive Learning in Multi-label Classification
Contrastive learning is a technique that trains a model to differentiate between similar and dissimilar examples, and it can be used to learn representations of data. Initially, this approach (Park et al., 2022; Wang et al., 2022a; Han et al., 2021) was explored in the self-supervised setting, where the feature embedding is learned without explicit labels by solving a pretext task. Supervised contrastive learning (Khosla et al., 2020; Gunel et al., 2021) is another form of contrastive learning that employs annotated data to generate positive pairs by selecting samples from various instances of a specific category.
In the contrastive learning paradigm, positive and negative pairs are defined by semantic similarity. Nevertheless, it is hard to apply to multi-label classification because each sample carries multiple semantics. Recently, several works have attempted to combine multi-label classification with contrastive learning. (Dao et al., 2021; Wang et al., 2022a) compute the contrastive loss by determining a single label that best matches each sample; however, such methods also increase the distance between the sample and its other ground-truth labels. (Zhang et al., 2022b) proposes a multi-label contrastive learning framework that utilizes a hierarchical structure to leverage all available labels, but in movie genre classification there is no hierarchical label structure. (Hassanin et al., 2022; Bai et al., 2022) aim to utilize the label embedding space: (Hassanin et al., 2022) adopts a center loss but fails to construct negative pairs, and (Bai et al., 2022) focuses on the similarity between label embeddings, which may harm the discriminative ability of fused features. To alleviate the problem of defining positive pairs, our proposed G-CACL module enlarges the genre embedding space, and a centroid of genres is generated as the positive anchor for each sample. Negative samples are carefully designed to ensure the effectiveness of our module.
3. Approach
Problem Definition. In the multimodal movie genre classification task, the $i$-th sample is composed of a text $T_i$, an image $I_i$ and metadata $M_i$. $M_i$ includes the directors, casts, title and genres of the movie. Each sample is annotated with a multi-label vector $y_i \in \{0,1\}^C$, where $C$ is the number of genres in the dataset. The dataset is split into a train set $D_{train}$, a test set $D_{test}$ and a validation set $D_{val}$. The goal of our task is to train a strong classifier on $D_{train}$ to predict the multiple genres of each sample in $D_{test}$.
Model Overview. The overall architecture of our framework is illustrated in Figure 2. Firstly, a CLIP model (Radford et al., 2021) extracts features $f_v$ and $f_t$ from the input image $I$ and text $T$. Then, in Section 3.1, we process the metadata from $D_{train}$ into a knowledge graph. To capture the relations between entities, an embedding matrix of knowledge graph entities is trained by adopting a translation-based knowledge graph embedding model. We retrieve all the directors and actors in the metadata and take their embeddings from the embedding matrix as a new modality feature $f_{kg}$. Next, in Section 3.2, we describe an AT module which ensures the rational allocation of the attention module. The multi-modality features are fused according to their attention weights. Finally, in Section 3.3, we introduce a G-CACL module that enhances the discriminability of the fused features.
3.1. Knowledge Graph Feature Formation
Knowledge Graph Construction. A domain knowledge graph is constructed fully automatically from the metadata. Our knowledge graph schema is derived from the fields of the metadata, and we define four types of entities named after these fields:
$\mathcal{T} = \{\textit{director},\ \textit{title},\ \textit{cast},\ \textit{genre}\}$ (1)
which correspond to directors, titles, casts and genres respectively. Moreover, we define six kinds of relations:
$\mathcal{R} = \{r_{d\text{-}t},\ r_{c\text{-}t},\ r_{g\text{-}t},\ r_{d\text{-}c},\ r_{d\text{-}g},\ r_{c\text{-}g}\}$ (2)
which represent the relations between directors and titles, casts and titles, genres and titles, directors and casts, directors and genres, and casts and genres. Notably, we only use the metadata from $D_{train}$ to avoid data leakage. We visit the fields of the metadata and extract each value as an entity with a unique matching id. Furthermore, we traverse all entity pairs in the metadata, and each entity pair forms a triplet with its corresponding relation in $\mathcal{R}$.
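The construction step above can be summarized with a short sketch. The metadata layout, field names and relation names below are illustrative assumptions rather than the exact schema used by IDKG.

```python
from itertools import combinations

# Assumed relation names for the six entity-type pairs described above.
RELATIONS = {
    ("director", "title"): "directs",
    ("cast", "title"): "acts_in",
    ("genre", "title"): "is_genre_of",
    ("director", "cast"): "works_with",
    ("director", "genre"): "directs_genre",
    ("cast", "genre"): "acts_in_genre",
}

def build_triplets(train_metadata):
    """Turn each training sample's metadata into (head_id, relation, tail_id) triplets."""
    entity2id, triplets = {}, []

    def eid(value, etype):
        # Assign a unique id to every (type, value) entity the first time it is seen.
        return entity2id.setdefault((etype, value), len(entity2id))

    for sample in train_metadata:
        # e.g. {"title": "...", "director": [...], "cast": [...], "genre": [...]}
        typed = [("title", sample["title"])]
        typed += [(t, v) for t in ("director", "cast", "genre") for v in sample[t]]
        for (t1, v1), (t2, v2) in combinations(typed, 2):
            rel = RELATIONS.get((t1, t2)) or RELATIONS.get((t2, t1))
            if rel is not None:  # same-type pairs have no relation and are skipped
                triplets.append((eid(v1, t1), rel, eid(v2, t2)))
    return entity2id, triplets
```

Only training-set metadata is passed in, mirroring the data-leakage constraint above.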
Knowledge Graph Embedding. In order to capture the relations between entities, we apply a translation-based knowledge graph embedding model, for instance TransH (Wang et al., 2014). Finally, we obtain the embedding matrix of all entities $E \in \mathbb{R}^{N \times d}$, where $N$ denotes the number of entities in the knowledge graph and $d$ is the dimension of the entity embedding.
Utilization of Group Relations in Metadata. The group relations in metadata can aid the prediction of movie genres, as illustrated in Section 1. Since the translation model has enabled the knowledge graph to capture the relations between entities, for the $i$-th sample we traverse all the director and cast entities in its metadata. We then obtain the set of their embeddings from $E$:
$S_i = \{e_{d_1}, \dots, e_{d_{n_d}},\ e_{c_1}, \dots, e_{c_{n_c}}\}$ (3)
where $e_{d_j}$ and $e_{c_j}$ denote the embeddings of a director entity and a cast entity, and $n_d$ and $n_c$ are the numbers of directors and casts of the $i$-th sample, respectively. Finally, the knowledge graph feature $f_{kg}$ is defined as the sum of all embeddings in $S_i$. Notably, in the test phase the directors and casts of many samples do not exist in $E$, because $E$ is trained only on the entities of $D_{train}$. In this case $f_{kg}$ becomes the 0 vector, which can be formulated as:
$f_{kg} = \begin{cases} \sum_{e \in S_i} e, & S_i \neq \varnothing \\ \mathbf{0}, & S_i = \varnothing \end{cases}$ (4)
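A minimal sketch of this retrieval step follows, assuming the entity-to-id mapping built during knowledge graph construction and a trained embedding matrix; the helper name and metadata layout are illustrative.

```python
import torch

def kg_feature(metadata, entity2id, entity_emb, dim=200):
    """Sum the embeddings of a sample's director/cast entities (Formulas (3)-(4));
    fall back to a zero vector when none of them appear in the training-set KG."""
    ids = [entity2id[(etype, name)]
           for etype in ("director", "cast")
           for name in metadata.get(etype, [])
           if (etype, name) in entity2id]      # test-time entities may be unseen
    if not ids:
        return torch.zeros(dim)                # Formula (4): no known entity -> 0 vector
    return entity_emb[torch.tensor(ids)].sum(dim=0)
```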
3.2. Attention Teacher Module
We adopt an attention module to balance the weight of each modality for each sample. It is composed of a linear function followed by a sigmoid function. Notably, the parameters of the attention module are shared across the three modality features.
As illustrated in Figure 2, for the multi-modality features $f_t \in \mathbb{R}^{d_t}$, $f_v \in \mathbb{R}^{d_v}$ and $f_{kg} \in \mathbb{R}^{d}$, where $d_t$ and $d_v$ are the dimensions of the text and image features extracted from the CLIP model, we apply a batchnorm1d and a linear projection function to convert them to the same shape. After alignment, the three feature maps are fed into the attention module to obtain their attention scores $s_t$, $s_v$ and $s_{kg}$. Next, the multi-modality features are multiplied by their corresponding attention scores and added up as the ultimate fused feature $f_m$:
$f_m = s_t \odot \phi(\mathrm{BN}(f_t)) + s_v \odot \phi(\mathrm{BN}(f_v)) + s_{kg} \odot \phi(\mathrm{BN}(f_{kg}))$ (5)
where $\mathrm{BN}(\cdot)$ and $\phi(\cdot)$ denote the batchnorm1d and linear projection function, respectively.
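A minimal PyTorch sketch of this fusion step is given below. The module and variable names are our own, and the vector-valued attention (averaged to a scalar only for the AT loss later in this section) is an assumption consistent with the description above.

```python
import torch
import torch.nn as nn

class SharedAttentionFusion(nn.Module):
    """Projects each modality to a shared space, scores it with one shared
    attention head (linear + sigmoid), and fuses by the weighted sum in (5)."""
    def __init__(self, d_text, d_image, d_kg, d_model=512):
        super().__init__()
        self.proj = nn.ModuleDict({
            "text":  nn.Sequential(nn.BatchNorm1d(d_text),  nn.Linear(d_text,  d_model)),
            "image": nn.Sequential(nn.BatchNorm1d(d_image), nn.Linear(d_image, d_model)),
            "kg":    nn.Sequential(nn.BatchNorm1d(d_kg),    nn.Linear(d_kg,    d_model)),
        })
        # one attention head shared by all three modalities
        self.attn = nn.Sequential(nn.Linear(d_model, d_model), nn.Sigmoid())

    def forward(self, f_text, f_image, f_kg):
        scores, fused = {}, 0.0
        for name, feat in (("text", f_text), ("image", f_image), ("kg", f_kg)):
            h = self.proj[name](feat)      # align to shape (B, d_model)
            s = self.attn(h)               # attention weights in (0, 1)
            scores[name] = s
            fused = fused + s * h          # Formula (5)
        return fused, scores
```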
To ensure reliable attention allocation, we introduce the Attention Teacher (AT) module based on self-supervised learning. Unlike previous methods which mainly create pretext tasks in the vision or language domain, we mine the distribution feature from the knowledge graph. We observe that different samples contain different numbers of entities when forming $f_{kg}$. Thus it is natural to assign varying attention scores to $f_{kg}$ for different samples. Specifically, an $f_{kg}$ composed of a small number of entities deserves a lower attention score. In particular, in the testing phase, if $f_{kg}$ is the 0 vector as in Formula (4), its attention score should be close to 0. Moreover, we consider the degree of entities in the knowledge graph, because an entity is less important if it has very few neighbours in the knowledge graph structure.
Taking the above discussion into consideration, we define the pseudo label $\hat{s}_{kg}$ of the attention score for $f_{kg}$ as follows:
(6)
In Formula (6), we introduce the average number of directors and actors and the average entity degree over all samples in $D_{train}$, where the degree of an entity is its number of neighbours in the knowledge graph. Our intention is that if the number of directors and actors and the degree of the current sample both exceed the corresponding averages over $D_{train}$, the pseudo label should become higher; otherwise it becomes lower. Formula (6) also constrains the range of $\hat{s}_{kg}$ to match the range of the produced attention weights, which are limited by a sigmoid function.
We then design an appropriate loss function to make the self-supervised attention scores converge normally. Since $s_{kg}$ is a vector while $\hat{s}_{kg}$ is a scalar according to Formula (6), $s_{kg}$ is averaged to a scalar to compute the loss with $\hat{s}_{kg}$. Since the goal of the AT module is to guide $s_{kg}$ toward $\hat{s}_{kg}$, a regression loss is adopted. Moreover, it is worth noting that $s_{kg}$ and $\hat{s}_{kg}$ are both designed to lie between 0 and 1, so directly applying an l1 or l2 regression loss would limit the loss value to the range (0, 1) and risk underfitting. To ensure a large enough gradient, we apply a logarithmic function instead of directly adopting an l1 or l2 regression loss. Considering a batch of inputs with batch size $B$, the self-supervised attention loss $\mathcal{L}_{AT}$ is defined as follows:
(7)
In this way, the domain of the loss function is limited to (0, 1) with a large enough gradient and a monotonically increasing trend. From Formula (7), it can be observed that only the knowledge graph attention score $s_{kg}$ is supervised by its pseudo label to train the attention module, while $s_t$ and $s_v$ do not participate in the training procedure. Nevertheless, in our experiments (Section 4.5) we find that the attention module trained with Formula (7) still produces reasonable scores $s_t$ and $s_v$ for texts and images.
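Since Formulas (6) and (7) are described here only by their properties, the sketch below shows one plausible instantiation: a pseudo label that rises above 0.5 when both the entity count and the degree of a sample exceed the training-set averages, and a logarithmic regression loss with large gradients on (0, 1). Both forms are assumptions, not the paper's exact definitions.

```python
import torch

def at_pseudo_label(n_entities, degree, avg_entities, avg_degree):
    # Assumed form of Formula (6): all arguments are float tensors of shape (B,).
    # Compare the sample's entity count and degree with the training-set averages
    # and squash the result into (0, 1); the label equals 0.5 exactly at the averages.
    ratio = 0.5 * (n_entities / avg_entities + degree / avg_degree)
    return torch.sigmoid(ratio - 1.0)

def at_loss(s_kg, pseudo):
    # Assumed form of Formula (7): the vector-valued KG attention score (B, d_model)
    # is averaged to a scalar, and -log(1 - |error|) keeps gradients large on (0, 1).
    diff = (s_kg.mean(dim=-1) - pseudo).abs().clamp(max=1 - 1e-6)
    return -torch.log(1.0 - diff).mean()
```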
3.3. Genre-Centroid Anchored Contrastive Learning Module
We propose a Genre-Centroid Anchored Contrastive Learning (G-CACL) module which exploits the genre embeddings from the knowledge graph to strengthen the discriminative ability of $f_m$. In contrastive learning, each sample is typically assigned a single semantic label, and positive pairs are defined by whether two samples share that label. However, in our task each sample is annotated with multiple genres. If we regard each genre embedding as a positive anchor, it is hardly achievable to push the feature of a sample close to all the positive anchors simultaneously. To overcome this limitation, we represent the semantics of multiple genres in a single anchor.
Specifically, for a batch of fused features $\{f_m^i\}_{i=1}^{B}$, the genre embedding set of the $i$-th feature is:
$G_i = \{g_1, g_2, \dots, g_{m_i}\}$ (8)
where $m_i$ is the number of annotated genres of the $i$-th sample. We enlarge the genre embedding space by computing the centroid of $G_i$:
$c_i = \frac{1}{m_i} \sum_{j=1}^{m_i} g_j$ (9)
which is regarded as the positive anchor for $f_m^i$. We then obtain the union of the genre embeddings of all samples in the current batch:
$G = \bigcup_{i=1}^{B} G_i = \{g_1, g_2, \dots, g_M\}$ (10)
where $M$ is the number of annotated genres of all samples in the current batch. We define the complement of $G_i$ in $G$ as the set of negative samples for $f_m^i$:
$N_i = G \setminus G_i$ (11)
Before computing the loss, the centroid $c_i$ and the genre embeddings in $N_i$ go through a linear function $\psi(\cdot)$ to be transformed into the same shape as $f_m^i$:
$\tilde{c}_i = \psi(c_i), \quad \tilde{g} = \psi(g),\ g \in N_i$ (12)
The loss function $\mathcal{L}_{CL}$ of the G-CACL module is defined as follows:
(13)
where $\tau$ is the temperature coefficient following (Khosla et al., 2020). In this loss function, we push $f_m^i$ close to its corresponding positive anchor $\tilde{c}_i$ and away from the embeddings of the negative samples, where the negative term sums the similarities between $f_m^i$ and each embedding in $N_i$.
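The following sketch mirrors this module for one batch. The InfoNCE-style form, the cosine similarity, and all names (`genre_ids` for each sample's annotated genre indices, `genre_emb` for the KG genre embeddings, `proj` for the linear map of Formula (12)) are assumptions consistent with the description above rather than the exact loss used in the paper.

```python
import torch
import torch.nn.functional as F

def gcacl_loss(fused, genre_ids, genre_emb, proj, tau=0.1):
    """fused: (B, d_model) fused features; genre_ids: list of B lists of genre indices;
    genre_emb: (num_genres, d_kg) KG genre embeddings; proj: nn.Linear(d_kg, d_model)."""
    batch_genres = sorted({g for gs in genre_ids for g in gs})   # union over the batch
    losses = []
    for i, gs in enumerate(genre_ids):
        centroid = proj(genre_emb[torch.tensor(gs)].mean(dim=0))  # positive anchor (9)+(12)
        negatives = [g for g in batch_genres if g not in gs]      # complement set (11)
        pos = torch.exp(F.cosine_similarity(fused[i], centroid, dim=0) / tau)
        neg = sum(torch.exp(F.cosine_similarity(fused[i],
                                                proj(genre_emb[g]), dim=0) / tau)
                  for g in negatives)
        losses.append(-torch.log(pos / (pos + neg + 1e-8)))
    return torch.stack(losses).mean()
```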
Furthermore, for multi-label classification we adopt the binary cross-entropy loss. For a batch of outputs after the classifier, the multi-label classification loss $\mathcal{L}_{cls}$ is:
$\mathcal{L}_{cls} = -\frac{1}{B} \sum_{i=1}^{B} \sum_{j=1}^{C} \big[ y_{ij} \log p_{ij} + (1 - y_{ij}) \log(1 - p_{ij}) \big]$ (14)
where $p_{ij}$ denotes the predicted probability that the $i$-th sample belongs to the $j$-th genre.
Finally, we combine $\mathcal{L}_{cls}$, $\mathcal{L}_{AT}$ and $\mathcal{L}_{CL}$ as the ultimate training loss of IDKG:
$\mathcal{L} = \mathcal{L}_{cls} + \mathcal{L}_{AT} + \mathcal{L}_{CL}$ (15)
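Putting the pieces together, a training step could compose the three losses as sketched below; the helper names and the option to weight the auxiliary terms are assumptions (the paper describes a plain combination).

```python
import torch.nn.functional as F

def total_loss(logits, targets, loss_at, loss_cl, w_at=1.0, w_cl=1.0):
    # Formula (14): multi-label binary cross-entropy over the genre logits.
    loss_cls = F.binary_cross_entropy_with_logits(logits, targets)
    # Formula (15): combine classification, AT and G-CACL losses
    # (weights default to 1.0; whether IDKG weights them is not stated).
    return loss_cls + w_at * loss_at + w_cl * loss_cl
```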
Genre | MM-IMDb | MM-IMDb 2.0 | Genre | MM-IMDb | MM-IMDb 2.0
---|---|---|---|---|---
Drama | 4188 | 4773 | Fantasy | 498 | 789
Comedy | 2609 | 2861 | Music | 415 | 752
Action | 1617 | 1950 | History | 418 | 754
Adventure | 1588 | 1673 | Western | 344 | 704
Romance | 1148 | 1364 | Sci-Fi | 280 | 687
Crime | 1081 | 1298 | Musical | 292 | 603
Horror | 835 | 1179 | Sport | 245 | 472
Thriller | 823 | 1096 | Short | 211 | 428
Biography | 584 | 846 | War | 164 | 417
Animation | 664 | 874 | Documentary | 139 | 272
Family | 646 | 817 | Film-Noir | 92 | 73
Mystery | 591 | 800 | | |
4. Experiment
Type | Model | MM-IMDb | | | | MM-IMDb 2.0 | | |
 | | Micro | Macro | Weighted | Samples | Micro | Macro | Weighted | Samples
---|---|---|---|---|---|---|---|---|---
Multimodal | GMU (Arevalo et al., 2017) | 0.630 | 0.541 | 0.617 | 0.630 | 0.617 | 0.575 | 0.607 | 0.588
 | CentralNet (Vielzeuf et al., 2018) | 0.639 | 0.561 | 0.631 | 0.639 | 0.622 | 0.594 | 0.619 | 0.606
 | MMBT (Kiela et al., 2019) | 0.669 | 0.618 | - | - | 0.635 | 0.607 | 0.652 | 0.650
 | MFM (Braz et al., 2021) | 0.675 | 0.616 | 0.675 | 0.673 | 0.656 | 0.608 | 0.664 | 0.671
 | ReFNet (Sankaran et al., 2021) | 0.680 | 0.587 | - | - | - | - | - | -
 | COCA (Yu et al., 2022) | 0.677 | 0.626 | 0.668 | 0.681 | 0.659 | 0.623 | 0.649 | 0.670
 | BLIP (Li et al., 2022) | 0.674 | 0.628 | 0.663 | 0.675 | 0.661 | 0.618 | 0.635 | 0.663
 | BridgeTower (Xu et al., 2023) | 0.682 | 0.633 | 0.676 | 0.680 | 0.668 | 0.627 | 0.684 | 0.679
Graphical | MM-GATBT (Seo et al., 2022) | 0.685 | 0.645 | 0.683 | 0.686 | 0.674 | 0.632 | 0.697 | 0.685
Graphical+Multimodal | IDKG | 0.849 | 0.832 | 0.848 | 0.839 | 0.828 | 0.811 | 0.827 | 0.807
4.1. Experiment Setup
Dataset. We evaluate IDKG on two datasets, MM-IMDb and MM-IMDb 2.0. The MM-IMDb dataset is a multi-label movie genre classification dataset released by (Arevalo et al., 2017) which contains 25,959 films. Each sample is composed of a poster, a plot summary and metadata covering the directors, actors, publication year, etc. Following (Arevalo et al., 2017), we randomly split the dataset into train, test and validation sets at a ratio of 0.6, 0.3 and 0.1.
To further verify our framework, we create a novel and more challenging dataset, MM-IMDb 2.0. We collect 33,742 movies with their posters, plots and metadata from the IMDb website. As illustrated in Table 1, the ratio of the number of Drama movies to the number of Film-Noir movies is nearly 65:1. Besides, we also limit the sizes of several other tail genres. Moreover, we partition our dataset with the same proportions as the MM-IMDb dataset.
Domain Knowledge Graph. As presented in Section 3.1, we process the metadata of $D_{train}$ into a domain knowledge graph. For the MM-IMDb dataset, there are 2,231,455 triplets and 264,271 entities in its domain knowledge graph. Since the MM-IMDb 2.0 dataset contains many more movies, the numbers of triplets and entities are 3,512,803 and 318,299, respectively.

Implementation Details. IDKG is trained with PyTorch on a single 2080 Ti GPU. For the translation-based knowledge graph embedding model, we use the OpenKE toolkit released by (Han et al., 2018). We adopt the SGD optimizer with a learning rate of 0.5, and the embedding dimension is 200. Moreover, we train the translation model for 500 epochs with a batch size of 100. For the second stage, we use the AdamW (Loshchilov and Hutter, 2019) optimizer with a learning rate of 1e-3 and train for 15 epochs with a batch size of 64. To extract the image and text features, we apply the CLIP (Radford et al., 2021) model with its parameters frozen in our experiments. The unified vector space dimension is 512, the same as in (Arevalo et al., 2017). The temperature parameter $\tau$ for the G-CACL module is chosen from [0.05, 0.1, 0.3, 0.5, 0.7, 0.9] following (Khosla et al., 2020).
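For reference, frozen CLIP features for the poster and plot summary can be obtained as in the sketch below; the ViT-B/32 checkpoint is an assumption, since the backbone used by the paper is not stated.

```python
import torch
import clip  # OpenAI CLIP package

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # backbone choice is assumed
model.eval()  # CLIP parameters stay frozen throughout training

@torch.no_grad()
def clip_features(poster_pil, plot_text):
    image = preprocess(poster_pil).unsqueeze(0).to(device)
    text = clip.tokenize([plot_text], truncate=True).to(device)
    return model.encode_image(image), model.encode_text(text)  # f_v, f_t
```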
4.2. Experimental Results and Analyses
We compare IDKG with state-of-the-art methods on the two datasets, and our approach achieves far superior performance. We categorize the existing methods into three types: 1) The Multimodal type is a straightforward strategy for this problem. These works (Arevalo et al., 2017; Vielzeuf et al., 2018; Kiela et al., 2019; Braz et al., 2021; Sankaran et al., 2021; Xu et al., 2023; Li et al., 2022; Yu et al., 2022) mainly focus on extracting well-represented features from each modality by adopting corresponding pre-trained models. Note that we also select several recent outstanding Multimodal methods (Xu et al., 2023; Li et al., 2022; Yu et al., 2022) as our competitors. 2) The Graphical type explores the potential of structural sources. MM-GATBT (Seo et al., 2022) leverages graph neural networks to learn the relational semantics of entities by using encoded images as node features. 3) Our method is a comprehensive framework which fully takes advantage of the two types above. Notably, the translation model used for Tables 2, 3 and 5 is RotatE (Sun et al., 2019), and the ablation study on translation models is shown in Section 4.3. Moreover, we compare our IDKG with GMU (Arevalo et al., 2017) on head and tail genres on the two datasets with Macro-F1 as the evaluation metric.
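All reported scores are F1-based; assuming the standard multi-label protocol, the four averages in the tables can be computed with scikit-learn as follows (the helper name is ours).

```python
from sklearn.metrics import f1_score

def evaluate_f1(y_true, y_pred):
    """y_true and y_pred are binary matrices of shape (num_samples, num_genres)."""
    return {avg: f1_score(y_true, y_pred, average=avg, zero_division=0)
            for avg in ("micro", "macro", "weighted", "samples")}
```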
Model | MM-IMDb | | | | MM-IMDb 2.0 | | |
 | Micro | Macro | Weighted | Samples | Micro | Macro | Weighted | Samples
---|---|---|---|---|---|---|---|---
IDKG | 0.849 | 0.832 | 0.848 | 0.839 | 0.828 | 0.811 | 0.827 | 0.807
IDKG - AT | 0.842 | 0.829 | 0.841 | 0.831 | 0.813 | 0.794 | 0.812 | 0.792
IDKG - AT - G-CACL | 0.828 | 0.816 | 0.825 | 0.817 | 0.796 | 0.779 | 0.789 | 0.783
IDKG - AT - G-CACL - KG | 0.677 | 0.625 | 0.661 | 0.667 | 0.668 | 0.597 | 0.652 | 0.631
IDKG (GMU) | 0.832 | 0.816 | 0.832 | 0.824 | 0.797 | 0.779 | 0.795 | 0.783
IDKG (GMU) - G-CACL | 0.819 | 0.804 | 0.817 | 0.810 | 0.782 | 0.773 | 0.790 | 0.772
IDKG (GMU) - G-CACL - KG | 0.630 | 0.541 | 0.617 | 0.630 | 0.617 | 0.575 | 0.607 | 0.588
Results on MM-IMDb Dataset. The comparison results on the MM-IMDb dataset are shown in Table 2. We observe that the Graphical method MM-GATBT (Seo et al., 2022) outperforms all the Multimodal methods on every metric, which demonstrates that the semantic relations in metadata greatly benefit the capacity to predict genres. Table 2 also shows that IDKG surpasses MM-GATBT by at least 15% on all evaluation metrics. One possible reason is that in MM-GATBT the graph nodes are composed of image embeddings, so no additional knowledge is injected into the graph nodes. IDKG instead incorporates the knowledge graph embedding with the other modalities by using the group relations in metadata, which enriches the features for genre prediction. Moreover, two effective modules are designed to address the unreliable attention allocation and indiscriminative fused feature issues, further boosting the performance of our model.
Results on MM-IMDb 2.0 Dataset. The overall results show that the performance of all methods on the MM-IMDb 2.0 dataset is inferior to that on the MM-IMDb dataset, which can be attributed to MM-IMDb 2.0 being a more challenging dataset. As shown in Table 2, our proposed IDKG also succeeds in beating all competitors by at least 12% on all metrics, which is a large margin. This result demonstrates that taking advantage of both the Multimodal and Graphical types can remarkably boost performance.
Dataset | Translation Model | Micro | Macro | Weighted | Samples | Hit@10
---|---|---|---|---|---|---
MM-IMDb | TransH (Wang et al., 2014) | 0.833 | 0.820 | 0.831 | 0.827 | 0.507
 | TransR (Lin et al., 2015) | 0.839 | 0.823 | 0.838 | 0.835 | 0.519
 | TransD (Ji et al., 2015) | 0.835 | 0.827 | 0.828 | 0.833 | 0.508
 | ComplEx (Trouillon et al., 2016) | 0.827 | 0.813 | 0.825 | 0.821 | 0.485
 | ConvE (Dettmers et al., 2018) | 0.829 | 0.822 | 0.836 | 0.825 | 0.506
 | RotatE (Sun et al., 2019) | 0.849 | 0.832 | 0.848 | 0.839 | 0.549
MM-IMDb 2.0 | TransH (Wang et al., 2014) | 0.813 | 0.806 | 0.815 | 0.792 | 0.507
 | TransR (Lin et al., 2015) | 0.814 | 0.809 | 0.820 | 0.791 | 0.519
 | TransD (Ji et al., 2015) | 0.817 | 0.803 | 0.823 | 0.796 | 0.508
 | ComplEx (Trouillon et al., 2016) | 0.806 | 0.797 | 0.812 | 0.784 | 0.485
 | ConvE (Dettmers et al., 2018) | 0.814 | 0.805 | 0.822 | 0.792 | 0.506
 | RotatE (Sun et al., 2019) | 0.828 | 0.811 | 0.827 | 0.807 | 0.549
Macro-F1 score analysis. As shown in Table 1, the distribution of genres is imbalanced in the MM-IMDb dataset. Toward a more severe imbalance problem, we enlarge the ratio between the numbers of head and tail genres when collecting the MM-IMDb 2.0 dataset, as mentioned in Section 4.1. As Macro-F1 ignores the proportion of each label, it is more sensitive to the imbalance of the genre distribution than the other evaluation metrics. Thus we compare the Macro-F1 of GMU and IDKG on six sampled head genres and six sampled tail genres on both datasets, as shown in Figure 3, where the genres are arranged in descending order of quantity from left to right. We observe that for GMU the Macro-F1 of head genres is generally far larger than that of tail genres on both datasets due to the imbalanced distribution. However, for IDKG the Macro-F1 of tail genres is close to that of head genres, and the Macro-F1 of almost all genres is around 80%, which demonstrates the distinguished classification ability of IDKG.
4.3. Ablation Study
To evaluate the effectiveness of each component of IDKG, we conduct extensive ablation experiments on the two datasets. In particular, we incorporate the domain knowledge graph and the G-CACL module into GMU (Arevalo et al., 2017) one by one for more solid proof; after assembling all modules, we name the new model IDKG (GMU). Notably, since GMU provides its own feature fusion strategy, the AT module is not included in its ablation experiments. Moreover, we compare different translation models for knowledge graph embedding and show their influence on IDKG performance. Finally, we analyze the temperature parameter $\tau$ of the G-CACL module.

Results on IDKG. Table 3 illustrates that discarding the domain knowledge graph causes the largest performance loss (at least 10%) among all modules, which proves that the group relations in metadata contribute greatly to model performance. With the removal of the AT module, the performance of IDKG declines on all metrics to various degrees on both datasets. This may be because the AT module ensures reliable attention allocation, thus improving accuracy. The performance also drops when IDKG is not equipped with the G-CACL module, presumably because the G-CACL module improves the discriminative ability of the fused features, which boosts the effectiveness of our model.
Results on IDKG (GMU). As can be seen from Table 3, the trend of performance change for IDKG (GMU) is the same as for IDKG, which illustrates the effectiveness of our proposed modules. It is noted that the performance drops significantly when the G-CACL module is removed from IDKG (GMU). This demonstrates that GMU is weak in the discriminative ability of its fused features and that our G-CACL module compensates for this shortcoming.
Effect of Translation Models. We compare different translation models, including TransH (Wang et al., 2014), TransR (Lin et al., 2015), TransD (Ji et al., 2015), ComplEx (Trouillon et al., 2016), ConvE (Dettmers et al., 2018) and RotatE (Sun et al., 2019), to observe their effects on IDKG performance. Moreover, we list the Hit@10 results for predicting missing links on the WN18 dataset, as reported by the toolkit (Han et al., 2018). As shown in Table 4, RotatE achieves the best performance and ComplEx performs worst. Combining the Hit@10 results, we conclude that the performance of IDKG corresponds to the capacity of the translation model, since stronger embedding representations better capture the relations between entities.
Analysis on the parameter $\tau$. As illustrated in (Wang and Liu, 2021), a smaller $\tau$ is more sensitive to difficult negative samples, and a $\tau$ that is too small can destroy the embedding space when negative samples are improperly constructed. According to Figure 4, the performance of IDKG on the MM-IMDb dataset is best when $\tau$ is 0.1. The reason may be that in our method the negative pairs are rationally constructed, so a small $\tau$ benefits the contrastive loss.

4.4. Comparison with Multi-label Contrastive Learning Methods
To demonstrate the effectiveness of the G-CACL module, we not only conduct the extensive ablation study in Section 4.3 but also replace it with existing multi-label classification methods that adopt contrastive learning. The competitors are MulCon (Dao et al., 2021), MCL (Hassanin et al., 2022), MLTC (Wang et al., 2022a) and C-GMVAE (Bai et al., 2022), all of which are introduced in Section 2.3. The results in Table 5 verify that our G-CACL module is superior to these contrastive learning methods. We also observe that an anchor embedding space initialized randomly performs worse than one initialized from the genre embeddings of our knowledge graph. We assume this is because the knowledge graph embedding captures the discrimination between genre semantics, thus improving the effectiveness of our module.
4.5. Analysis on Attention Teacher Module
To test the validity of the AT module, we present a case study in Figure 5. We record $s_t$, $s_v$ and $s_{kg}$ in the testing phase and sample three cases, where the modalities that contribute more to the prediction are marked in red. We observe that the attention module outputs reliable scores for each modality. Notably, even though $s_t$ and $s_v$ are not directly supervised, they are still rational. For example, the plot summaries of the second and third samples should be allocated more attention, and their $s_t$ values are relatively high. We infer that $s_t$ and $s_v$ are implicitly supervised because the parameters of the attention module are shared across modalities.
Dataset | Method | Micro | Macro | Weighted | Samples
---|---|---|---|---|---
MM-IMDb | MulCon (Dao et al., 2021) | 0.830 | 0.818 | 0.827 | 0.816
 | MCL (Hassanin et al., 2022) | 0.832 | 0.813 | 0.829 | 0.822
 | MLTC (Wang et al., 2022a) | 0.835 | 0.821 | 0.825 | 0.826
 | C-GMVAE (Bai et al., 2022) | 0.846 | 0.828 | 0.846 | 0.840
 | Ours (random) | 0.841 | 0.828 | 0.842 | 0.835
 | Ours | 0.849 | 0.832 | 0.848 | 0.839
MM-IMDb 2.0 | MulCon (Dao et al., 2021) | 0.806 | 0.785 | 0.792 | 0.784
 | MCL (Hassanin et al., 2022) | 0.811 | 0.794 | 0.802 | 0.787
 | MLTC (Wang et al., 2022a) | 0.815 | 0.799 | 0.812 | 0.792
 | C-GMVAE (Bai et al., 2022) | 0.825 | 0.807 | 0.824 | 0.806
 | Ours (random) | 0.821 | 0.808 | 0.821 | 0.802
 | Ours | 0.828 | 0.811 | 0.827 | 0.807
5. Conclusion
In this paper, we proposed an effective and novel framework named IDKG. To the best of our knowledge, we are the first to apply knowledge graph technology to multimodal movie genre classification. Firstly, IDKG utilized the group relations in metadata to obtain embeddings from the knowledge graph and incorporated them with the other modalities. Furthermore, an Attention Teacher module was proposed to learn the distribution of the knowledge graph and guide the attention module to allocate more reliable weights. Finally, a Genre-Centroid Anchored Contrastive Learning module enhanced the discriminative ability of the fused feature. We also collected a new large-scale dataset named MM-IMDb 2.0 for movie genre classification, which has a more severe class-imbalance problem than the MM-IMDb dataset. Extensive experiments on the MM-IMDb and MM-IMDb 2.0 datasets demonstrated that our model is superior to existing methods. For future work, we plan to construct a multimodal domain knowledge graph for the movie field and apply it to more downstream tasks.
Acknowledgements.
We thank the Big Data Computing Center of Southeast University for providing the facility support for the numerical calculations in this paper.
References
- Arevalo et al. (2017) John Arevalo, Thamar Solorio, Manuel Montes-y Gómez, and Fabio A González. 2017. Gated multimodal units for information fusion. In ICLR.
- Baevski et al. (2022) Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, and Michael Auli. 2022. Data2vec: A general framework for self-supervised learning in speech, vision and language. In ICML.
- Bai et al. (2022) Junwen Bai, Shufeng Kong, and Carla P Gomes. 2022. Gaussian Mixture Variational Autoencoder with Contrastive Learning for Multi-Label Classification. In PMLR.
- Bain et al. (2020) Max Bain, Arsha Nagrani, Andrew Brown, and Andrew Zisserman. 2020. Condensed movies: Story based retrieval with contextual embeddings. In ACCV.
- Behrouzi et al. (2022) Tina Behrouzi, Ramin Toosi, and Mohammad Ali Akhaee. 2022. Multimodal movie genre classification using recurrent neural network. Springer MULTIMED TOOLS APPL (2022).
- Ben-Ahmed and Huet (2018) Olfa Ben-Ahmed and Benoit Huet. 2018. Deep multimodal features for movie genre and interestingness prediction. In CBMI.
- Bevilacqua and Navigli (2020) Michele Bevilacqua and Roberto Navigli. 2020. Breaking through the 80% glass ceiling: Raising the state of the art in word sense disambiguation by incorporating knowledge graph information. In ACL.
- Braz et al. (2021) Leodécio Braz, Vinícius Teixeira, Helio Pedrini, and Zanoni Dias. 2021. Image-Text Integration Using a Multimodal Fusion Network Module for Movie Genre Classification. In ICPRS.
- Bruckert et al. (2022) Alexandre Bruckert, Marc Christie, and Olivier Le Meur. 2022. Where to look at the movies: Analyzing visual attention to understand movie editing. BEHAV RES METHODS (2022).
- Cascante et al. (2019) Paola Cascante, Kalpathy Sitaraman, Mengjia Luo, and Vicente Ordonez. 2019. Moviescope: Large-scale analysis of movies using multiple modalities. arXiv preprint arXiv:1908.03180 (2019).
- Chen et al. (2022) Richard J Chen, Chengkuan Chen, Yicong Li, Tiffany Y Chen, Andrew D Trister, Rahul G Krishnan, and Faisal Mahmood. 2022. Scaling vision transformers to gigapixel images via hierarchical self-supervised learning. In CVPR.
- Choi et al. (2012) Sang-Min Choi, Sang-Ki Ko, and Yo-Sub Han. 2012. A movie recommendation algorithm based on genre correlations. Elsevier Expert Syst. Appl. (2012).
- Dao et al. (2021) Son D Dao, Zhao Ethan, Phung Dinh, and Cai Jianfei. 2021. Contrast learning visual attention for multi label classification. arXiv preprint arXiv:2107.11626 (2021).
- Dettmers et al. (2018) Tim Dettmers, Pasquale Minervini, Pontus Stenetorp, and Sebastian Riedel. 2018. Convolutional 2d knowledge graph embeddings. In AAAI.
- Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
- Ertugrul and Karagoz (2018) Ali Mert Ertugrul and Pinar Karagoz. 2018. Movie genre classification from plot summaries using bidirectional LSTM. In ICSC.
- Fish et al. (2020) Edward Fish, Jon Weinbren, and Andrew Gilbert. 2020. Rethinking movie genre classification with fine-grained semantic clustering. arXiv preprint arXiv:2012.02639 (2020).
- Giorgi et al. (2021) John Giorgi, Osvald Nitski, Bo Wang, and Gary Bader. 2021. DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations. In ACL.
- Gunel et al. (2021) Beliz Gunel, Jingfei Du, Alexis Conneau, and Veselin Stoyanov. 2021. Supervised Contrastive Learning for Pre-trained Language Model Fine-tuning. In ICLR.
- Han et al. (2021) Junlin Han, Mehrdad Shoeiby, Lars Petersson, and Mohammad Ali Armin. 2021. Dual contrastive learning for unsupervised image-to-image translation. In CVPR.
- Han et al. (2018) Xu Han, Shulin Cao, Xin Lv, Yankai Lin, Zhiyuan Liu, Maosong Sun, and Juanzi Li. 2018. OpenKE: An Open Toolkit for Knowledge Embedding. In EMNLP.
- Hassanin et al. (2022) Mohammed Hassanin, Ibrahim Radwan, Salman Khan, and Murat Tahtali. 2022. Learning discriminative representations for multi-label image recognition. JVCIR (2022).
- Hendrycks et al. (2019) Dan Hendrycks, Mantas Mazeika, Saurav Kadavath, and Dawn Song. 2019. Using self-supervised learning can improve model robustness and uncertainty. NIPS (2019).
- Huang et al. (2020) Qingqiu Huang, Yu Xiong, Anyi Rao, Jiaze Wang, and Dahua Lin. 2020. Movienet: A holistic dataset for movie understanding. In ECCV.
- Islam and Bertasius (2022) Md Mohaiminul Islam and Gedas Bertasius. 2022. Long movie clip classification with state-space video models. arXiv preprint arXiv:2204.01692 (2022).
- Iter et al. (2020) Dan Iter, Kelvin Guu, Larry Lansing, and Dan Jurafsky. 2020. Pretraining with Contrastive Sentence Objectives Improves Discourse Performance of Language Models. In ACL.
- Ji et al. (2015) Guoliang Ji, Shizhu He, Liheng Xu, Kang Liu, and Jun Zhao. 2015. Knowledge graph embedding via dynamic mapping matrix. In IJCNLP.
- Khosla et al. (2020) Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. 2020. Supervised contrastive learning. NIPS (2020).
- Kiela et al. (2019) Douwe Kiela, Suvrat Bhooshan, Hamed Firooz, Ethan Perez, and Davide Testuggine. 2019. Supervised multimodal bitransformers for classifying images and text. arXiv preprint arXiv:1909.02950 (2019).
- Li et al. (2022) Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. 2022. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In ICML.
- Lin et al. (2015) Yankai Lin, Zhiyuan Liu, Maosong Sun, Yang Liu, and Xuan Zhu. 2015. Learning entity and relation embeddings for knowledge graph completion. In AAAI.
- Loshchilov and Hutter (2019) Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In ICLR.
- Meng et al. (2021) Yu Meng, Chenyan Xiong, Payal Bajaj, Paul Bennett, Jiawei Han, Xia Song, et al. 2021. Coco-lm: Correcting and contrasting text sequences for language model pretraining. NIPS (2021).
- Misra and Maaten (2020) Ishan Misra and Laurens van der Maaten. 2020. Self-supervised learning of pretext-invariant representations. In CVPR.
- Misra et al. (2016) Ishan Misra, C Lawrence Zitnick, and Martial Hebert. 2016. Shuffle and learn: unsupervised learning using temporal order verification. In ECCV.
- Noroozi and Favaro (2016) Mehdi Noroozi and Paolo Favaro. 2016. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV.
- Park et al. (2022) Sungho Park, Jewook Lee, Pilhyeon Lee, Sunhee Hwang, Dohyung Kim, and Hyeran Byun. 2022. Fair contrastive learning for facial attribute classification. In CVPR.
- Radford et al. (2021) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In ICML.
- Rohrbach et al. (2017) Anna Rohrbach, Atousa Torabi, Marcus Rohrbach, Niket Tandon, Christopher Pal, Hugo Larochelle, Aaron Courville, and Bernt Schiele. 2017. Movie description. IJCV (2017).
- Sankaran et al. (2021) Sethuraman Sankaran, David Yang, and Ser-Nam Lim. 2021. Refining Multimodal Representations using a modality-centric self-supervised module. (2021).
- Seo et al. (2022) Seung Byum Seo, Hyoungwook Nam, and Payam Delgosha. 2022. MM-GATBT: Enriching Multimodal Representation Using Graph Attention Network. In ACL(Workshop).
- Simões et al. (2016) Gabriel S Simões, Jônatas Wehrmann, Rodrigo C Barros, and Duncan D Ruiz. 2016. Movie genre classification with convolutional neural networks. In IJCNN.
- Sun et al. (2019) Zhiqing Sun, Zhi-Hong Deng, Jian-Yun Nie, and Jian Tang. 2019. RotatE: Knowledge graph embedding by relational rotation in complex space. In ICLR.
- Trouillon et al. (2016) Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume Bouchard. 2016. Complex embeddings for simple link prediction. In PMLR.
- Vielzeuf et al. (2018) Valentin Vielzeuf, Alexis Lechervy, Stéphane Pateux, and Frédéric Jurie. 2018. Centralnet: a multilayer approach for multimodal fusion. In ECCV(Workshop).
- Vondrick et al. (2018) Carl Vondrick, Abhinav Shrivastava, Alireza Fathi, Sergio Guadarrama, and Kevin Murphy. 2018. Tracking emerges by colorizing videos. In ECCV.
- Wang and Liu (2021) Feng Wang and Huaping Liu. 2021. Understanding the behaviour of contrastive loss. In CVPR.
- Wang et al. (2022a) Ran Wang, Xinyu Dai, et al. 2022a. Contrastive learning-enhanced nearest neighbor mechanism for multi-label text classification. In ACL.
- Wang et al. (2022b) Xiting Wang, Kunpeng Liu, Dongjie Wang, Le Wu, Yanjie Fu, and Xing Xie. 2022b. Multi-level recommendation reasoning over knowledge graphs with reinforcement learning. In WWW.
- Wang et al. (2014) Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. 2014. Knowledge graph embedding by translating on hyperplanes. In AAAI.
- Wi et al. (2020) Jeong A Wi, Soojin Jang, and Youngbin Kim. 2020. Poster-based multiple movie genre classification using inter-channel features. Access (2020).
- Xu et al. (2023) Xiao Xu, Chenfei Wu, Shachar Rosenman, Vasudev Lal, and Nan Duan. 2023. Bridge-Tower: Building Bridges Between Encoders in Vision-Language Representation Learning. In AAAI.
- Yadav and Vishwakarma (2020) Ashima Yadav and Dinesh Kumar Vishwakarma. 2020. A unified framework of deep networks for genre classification using movie trailer. APPL SOFT COMPUT (2020).
- Yao et al. (2017) Liang Yao, Yin Zhang, Baogang Wei, Zhe Jin, Rui Zhang, Yangyang Zhang, and Qinfei Chen. 2017. Incorporating knowledge graph embeddings into topic modeling. In AAAI.
- Yu et al. (2022) Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu. 2022. Coca: Contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917 (2022).
- Zhang et al. (2016) Richard Zhang, Phillip Isola, and Alexei A Efros. 2016. Colorful image colorization. In ECCV. Springer.
- Zhang et al. (2022b) Shu Zhang, Ran Xu, Caiming Xiong, and Chetan Ramaiah. 2022b. Use all the labels: A hierarchical multi-label contrastive learning framework. In CVPR.
- Zhang et al. (2022a) Zhongping Zhang, Yiwen Gu, Bryan A Plummer, Xin Miao, Jiayi Liu, and Huayan Wang. 2022a. Effectively leveraging Multi-modal Features for Movie Genre Classification. arXiv preprint arXiv:2203.13281 (2022).