
1 Academy for Engineering and Technology, Fudan University, Shanghai, China
2 Jilin Provincial Key Laboratory of Intelligence Science and Engineering, Changchun, China
3 Engineering Research Center of AI and Robotics, Ministry of Education, Shanghai, China
{yxlei22, mingchengli21, jwchen22}@m.fudan.edu.cn
{dkyang20, slwang19, lihuazhang}@fudan.edu.cn

Text-oriented Modality Reinforcement Network for Multimodal Sentiment Analysis from Unaligned Multimodal Sequences

Yuxuan Lei 1, Dingkang Yang 1, Mingcheng Li 1, Shunli Wang 1, Jiawei Chen 1, Lihua Zhang 1,2,3,*
Abstract

Multimodal Sentiment Analysis (MSA) aims to mine sentiment information from text, visual, and acoustic modalities. Previous works have focused on representation learning and feature fusion strategies. However, most of these efforts ignore the disparity in semantic richness across modalities and treat each modality in the same manner, which may lead to strong modalities being neglected and weak modalities being overvalued. Motivated by these observations, we propose a Text-oriented Modality Reinforcement Network (TMRN), which focuses on the dominance of the text modality in MSA. More specifically, we design a Text-Centered Cross-modal Attention (TCCA) module to enable full interaction between the text/acoustic and text/visual pairs, and a Text-Gated Self-Attention (TGSA) module to guide the self-reinforcement of the other two modalities. Furthermore, we present an adaptive fusion mechanism to determine the proportion of each modality involved in the fusion process. Finally, we aggregate the feature matrices into vectors to obtain the final representation for the downstream tasks. Experimental results show that our TMRN outperforms state-of-the-art methods on two MSA benchmarks.

Keywords:
Multimodal sentiment analysis · Attention mechanism · Representation learning · Multimodal fusion · Modality reinforcement
* Corresponding author

1 Introduction

Recognizing the research value of sentiments, numerous studies [20, 23, 22, 25, 3] in recent years have focused on identifying and analyzing human sentiments. Compared with traditional unimodal sentiment analysis, Multimodal Sentiment Analysis (MSA) attempts to mine sentiment information from multiple data sources to more comprehensively and accurately understand and predict a wide range of complex human emotions.

While data from multiple modalities can be complementary, the asynchrony between different modality sequences makes fusion difficult. To address this problem, most prior works have manually aligned visual and acoustic sequences at the resolution of text words [16, 18], but this incurs high labor costs and ignores long-term dependencies between elements of different modalities. Recent efforts such as [15, 9] handle unaligned multimodal sequences with cross-modal attention, capturing inter-modality correlations through exhaustive interactions between every pair of modalities. However, this causes a surge in the number of parameters and in the redundant information carried across modalities. These methods also weight all modalities equally, ignoring the fact that distinct modalities differ in semantic richness, which may lead to strong modalities being neglected and weak modalities being overvalued. Observing previous works [19, 1], we find that the text modality dominates the MSA task: on the one hand, text is naturally highly structured and semantically condensed; on the other hand, thanks to mature natural language processing techniques, modeling text data is comparatively well understood. In this situation, it is crucial to balance the contributions of different modalities. Moreover, the vanilla Transformer [17] has drawbacks of its own. First, its self-attention mechanism incorporates redundancy and noise while focusing on intra-modality information, especially for the visual and acoustic modalities: unlike spoken words, which can be encoded directly, acoustic and visual signals are pre-processed before being fed into the network, and noise is inevitably introduced during this pre-processing [1]. Second, the temporal redundancy within visual and acoustic sequences is very high.

Inspired by the above observations, we propose a Text-oriented Modality Reinforcement Network (TMRN) to refine multimodal representations effectively. The core strategy of TMRN is to perform inter-modal interaction with the text modality at the center and to let the text modality guide the reinforcement of the other two modalities. For the inter-modal interaction, we propose a text-centered cross-modal attention module that enables full interaction within the text/acoustic and text/visual pairs. We also present an adaptive fusion mechanism to measure the weights of the different modalities during fusion. For the intra-modal reinforcement, we design a text-gated self-attention module that introduces prior knowledge of textual semantics into the feature reinforcement of the visual/acoustic modalities. This aims to better mine temporal semantic information and to suppress the noise of the visual/acoustic modalities. Overall, we make the following three contributions:

  • We propose TMRN, a method that focuses on the dominance of the text modality in MSA tasks. With the text modality as the main thread, TMRN interacts with and reinforces the other two modalities to obtain low-redundancy, denoised feature representations.

  • We present a Text-Centred Cross-modal Attention (TCCA) module and a Text-Gated Self-Attention (TGSA) module to mine inter-modal and intra-modal contextual relationships.

  • We perform a comprehensive set of experiments on two human multimodal language benchmarks, MOSI [29] and MOSEI [30]. The results show that our method achieves state-of-the-art performance on both datasets.

2 Related Work

Human multimodal sentiment analysis aims to infer human emotional attitudes from the information of various modalities in video clips. Compared with multimodal fusion over static modalities such as images [10], the key challenge of this task is how to fuse time-series sequences from different modalities such as natural language, video frames, and acoustic signals [16], especially when these sequences are temporally unaligned. Some recent works [16, 18] have focused on manually aligning the visual and acoustic sequences at the resolution of textual words before training. However, manual word alignment is costly, and some information is inevitably lost in the multimodal fusion after alignment.

Furthermore, some researchers have worked on unaligned multimodal data. These works can be classified into two categories: those that discard the time-series dimension and those that retain it in the subsequent modal interactions. The former usually take one row of the two-dimensional features as a feature vector for subsequent interaction and fusion [4, 27, 24]. [4, 24] learned modality-invariant and modality-specific representations to give a comprehensive and disentangled view of the multimodal data. [27] jointly trained multimodal and unimodal tasks to learn their consistency and difference, respectively. The latter tend to use attention mechanisms to implement interactions between unaligned sequences [15, 9].

The Transformer [17] has drawn a great deal of attention to attention mechanisms. Transformer networks have been successfully applied to many tasks such as semantic role labeling [13] and word sense disambiguation [14], and they are now also widely used in the multimodal field. [15] presented the Multimodal Transformer (MulT), which uses cross-modal attention to capture bimodal interactions without manually aligning the three modalities. [9] proposed PMR, which further improves the interaction among the three modalities on top of MulT. Following the latter approaches, our work is also based on the attention mechanism.

3 Problem Statement and Model Structure

3.1 Problem Statement

Figure 1: The overall architecture of the proposed model TMRN.

In this work, the multimodal sentiment analysis task takes as inputs the text (t), visual (v), and acoustic (a) modalities of the same video clip, represented as X_{m} \in R^{T_{m}\times d_{m}} for each modality m \in \{t, v, a\}. For the rest of the paper, T_{m} and d_{m} denote the sequence length and feature dimension of modality m, respectively. The goal of our model is to fully explore and fuse sentiment-related information from these unaligned multimodal input sequences to obtain a text-driven multimodal representation, and thus predict the final sentiment analysis results.

3.2 Overall Architecture

The overall architecture of our TMRN is shown in Fig. 1, which consists of three main components: 1) Unimodal feature extraction module: we utilize pre-trained BERT [2] to generate rich representations of the input words and process visual and acoustic features with Bi-LSTM [5]; 2) Modality reinforcement: this part is composed of cross-stacked TCCA and TGSA modules that interact with and reinforce the features. We divide the features into visual-text and acoustic-text pairs for cross-attention with the text modality as the query, while self-attention is performed on the text modality. Then, we fuse the pairs with an adaptive fusion mechanism. After that, we use the text modality as a gate to add prior knowledge to the self-reinforcement of the visual/acoustic modalities; 3) Fusion and output module: we aggregate the final two-dimensional features into one-dimensional vectors and concatenate them for the downstream tasks. Our aim is to further guide and interact with the acoustic and visual modalities through the text modality to obtain a text-dominated, implicitly aligned fusion feature.

Figure 2: The architecture of the Text-Centred Cross-modal Attention (TCCA) module.

3.2.1 Unimodal Feature Extraction.

To obtain a stronger feature representation of the text, we use a pre-trained BERT [2] model to extract sentence features:

F_{t} = BERT\left(X_{t}; \theta_{t}^{BERT}\right) \in R^{T_{t}\times d_{t}}.  (1)

In the acoustic and visual modalities, following [28, 26], we use pre-trained toolkits to extract the initial features X_{m} from the raw data. Then, we use a Bi-directional Long Short-Term Memory (BiLSTM) [5] network to capture the temporal characteristics:

F_{a} = BiLSTM\left(X_{a}; \theta_{a}^{LSTM}\right) \in R^{T_{a}\times d_{a}},  (2)
F_{v} = BiLSTM\left(X_{v}; \theta_{v}^{LSTM}\right) \in R^{T_{v}\times d_{v}}.  (3)

For subsequent calculations, we use one fully connected layer to project the features into a fixed dimension as F_{m} \in R^{T_{m}\times d}, where m \in \{t, a, v\}.
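For concreteness, a minimal PyTorch sketch of this extraction stage is given below. The raw feature sizes, the BiLSTM hidden size, and the class name are illustrative assumptions rather than the released implementation; the text features are assumed to be BERT outputs computed upstream.

import torch
import torch.nn as nn

class UnimodalEncoder(nn.Module):
    # Sketch of the per-modality encoders in Sec. 3.2.1 (assumed layer sizes).
    def __init__(self, d_a=74, d_v=35, bert_dim=768, d=128):
        super().__init__()
        # Visual/acoustic streams go through a BiLSTM (Eqs. (2)-(3)).
        self.lstm_a = nn.LSTM(d_a, d_a, batch_first=True, bidirectional=True)
        self.lstm_v = nn.LSTM(d_v, d_v, batch_first=True, bidirectional=True)
        # One fully connected layer per modality projects to the shared dimension d.
        self.proj_t = nn.Linear(bert_dim, d)
        self.proj_a = nn.Linear(2 * d_a, d)   # BiLSTM doubles the feature size
        self.proj_v = nn.Linear(2 * d_v, d)

    def forward(self, F_t, X_a, X_v):
        F_a, _ = self.lstm_a(X_a)             # (B, T_a, 2*d_a)
        F_v, _ = self.lstm_v(X_v)             # (B, T_v, 2*d_v)
        return self.proj_t(F_t), self.proj_a(F_a), self.proj_v(F_v)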

3.2.2 Modality Reinforcement.

This part includes two key modules: a Text-Centred Cross-modal Attention (TCCA) module and a Text-Gated Self-Attention (TGSA) module. The architecture of TCCA is shown in Fig. 2. Unlike [9], the visual and acoustic modalities share the same text self-attention block to reduce the amount of computation in our TCCA module. This unit is composed of two cross-attention blocks and one self-attention block. The cross-attention block takes F_{t}^{[i]} and F_{m\rightarrow t}^{[i]} as its inputs, where m \in \{a, v\} and the superscript [i] indicates the i-th modality reinforcement process. First, we perform layer normalization (LN) on the features, i.e., F_{m\rightarrow t}^{[i]} = LN(F_{m\rightarrow t}^{[i]}) and F_{t}^{[i]} = LN(F_{t}^{[i]}), and then we feed them into a Cross-Attention (CA) block:

F_{m\rightarrow t}^{[i]} = CA_{m\rightarrow t}^{[i]}\left(F_{m\rightarrow t}^{[i]}, F_{t}^{[i]}\right) = softmax\left(\frac{F_{t}^{[i]} W_{Q_{t}} W_{K_{m}}^{T} {F_{m\rightarrow t}^{[i]}}^{T}}{\sqrt{d}}\right) F_{m\rightarrow t}^{[i]} W_{V_{m}},  (4)

where F_{m\rightarrow t}^{[0]} = F_{m} \in R^{T_{m}\times d} and F_{t}^{[0]} = F_{t} \in R^{T_{t}\times d}. Note that the sequence length of F_{m\rightarrow t}^{[i]} is updated to T_{t} after the first CA block. The Self-Attention (SA) block takes F_{t}^{[i]} as input to obtain F_{t}^{[i+1]} \in R^{T_{t}\times d}:

F_{t}^{[i+1]} = SA_{t}^{[i]}\left(F_{t}^{[i]}\right) = softmax\left(\frac{F_{t}^{[i]} W_{Q_{t}} W_{K_{t}}^{T} {F_{t}^{[i]}}^{T}}{\sqrt{d}}\right) F_{t}^{[i]} W_{V_{t}}.  (5)
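For illustration, the following single-head PyTorch sketch mirrors the CA block of Eq. (4) under assumed dimensions; multi-head splitting and dropout are omitted, and the class and variable names are our own.

import torch
import torch.nn as nn

class TextCenteredCrossAttention(nn.Module):
    # Single-head sketch of Eq. (4): the text stream provides the queries,
    # the acoustic/visual stream provides the keys and values.
    def __init__(self, d=128):
        super().__init__()
        self.W_q = nn.Linear(d, d, bias=False)
        self.W_k = nn.Linear(d, d, bias=False)
        self.W_v = nn.Linear(d, d, bias=False)
        self.ln_t = nn.LayerNorm(d)
        self.ln_m = nn.LayerNorm(d)
        self.scale = d ** 0.5

    def forward(self, F_t, F_m2t):
        F_t, F_m2t = self.ln_t(F_t), self.ln_m(F_m2t)
        q = self.W_q(F_t)                                  # (B, T_t, d)
        k, v = self.W_k(F_m2t), self.W_v(F_m2t)            # (B, T_m, d)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.scale, dim=-1)
        return attn @ v                                    # (B, T_t, d): length becomes T_t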
Figure 3: The architecture of the Text-Gated Self-Attention (TGSA) module.

Then the reinforced features F_{t}^{[i+1]} and F_{m\rightarrow t}^{[i]} are processed via the following adaptive fusion mechanism:

G^{[i]} = \sigma\left(F_{t}^{[i+1]} W_{t}^{[i]} + F_{m\rightarrow t}^{[i]} W_{m\rightarrow t}^{[i]} + b^{[i]}\right),  (6)
F_{m\rightarrow t}^{[i]} = G^{[i]} \odot F_{t}^{[i+1]} + \left(1 - G^{[i]}\right) \odot F_{m\rightarrow t}^{[i]},  (7)

where \sigma denotes the sigmoid non-linearity and \odot denotes element-wise multiplication. The passed proportions of F_{t}^{[i+1]} and F_{m\rightarrow t}^{[i]} are determined by the learnable parameters W_{t}^{[i]}, W_{m\rightarrow t}^{[i]}, and b^{[i]}. This operation filters out incorrect information produced by the cross-modal interactions and controls the fusion ratio of the two modalities. After that, we process F_{t}^{[i+1]} and F_{m\rightarrow t}^{[i]} with a Position-wise Feed-Forward layer (PFF) and a skip connection, as in the Transformer [17]:

F_{m\rightarrow t}^{[i]} = PFF\left(LN\left(F_{m\rightarrow t}^{[i]}\right)\right) + F_{m\rightarrow t}^{[i]},  (8)
F_{t}^{[i+1]} = PFF\left(LN\left(F_{t}^{[i+1]}\right)\right) + F_{t}^{[i+1]}.  (9)
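A minimal PyTorch sketch of the adaptive fusion step of Eqs. (6)-(7) is shown below, assuming the gate parameters are realized as linear layers (with the bias of the second layer playing the role of b^{[i]}); this is an illustration, not the authors' released code.

import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    # Sketch of Eqs. (6)-(7): a sigmoid gate G decides, element-wise, how much of
    # the reinforced text feature versus the cross-attended feature is passed on.
    def __init__(self, d=128):
        super().__init__()
        self.W_t = nn.Linear(d, d, bias=False)
        self.W_m = nn.Linear(d, d, bias=True)   # bias plays the role of b^[i]

    def forward(self, F_t, F_m2t):              # both (B, T_t, d)
        G = torch.sigmoid(self.W_t(F_t) + self.W_m(F_m2t))   # Eq. (6)
        return G * F_t + (1.0 - G) * F_m2t                    # Eq. (7)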

After the TCCA module, we obtain features of the three modalities with a unified dimension. We believe that the relationships within each modality are complementary to the cross-modal relations, so we apply self-attention to F_{v\rightarrow t}^{[i]} and F_{a\rightarrow t}^{[i]}, while using F_{t}^{[i+1]} as a gate to activate or deactivate the corresponding query and key channels:

g^{[i]} = \sigma\left(Linear\left(F_{t}^{[i+1]}; \theta_{g}\right)\right),  (10)
gF_{m\rightarrow t}^{[i]} = \left(1 + g^{[i]}\right) \odot F_{m\rightarrow t}^{[i]}.  (11)

The query and key from visual/acoustic modalities are then modulated by the gate from the text modality:

F_{m\rightarrow t}^{[i+1]} = TGSA_{m}^{[i]}\left(F_{m\rightarrow t}^{[i]}, gF_{m\rightarrow t}^{[i]}\right) = softmax\left(\frac{gF_{m\rightarrow t}^{[i]} W_{Q_{m\rightarrow t}} W_{K_{m\rightarrow t}}^{T} {gF_{m\rightarrow t}^{[i]}}^{T}}{\sqrt{d}}\right) F_{m\rightarrow t}^{[i]} W_{V_{m\rightarrow t}} + F_{m\rightarrow t}^{[i]}.  (12)

The architecture of the TGSA is shown in Fig. 3.
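The following PyTorch sketch summarizes the TGSA computation of Eqs. (10)-(12) in single-head form; dimensions, names, and the single-head simplification are assumptions rather than the authors' code.

import torch
import torch.nn as nn

class TextGatedSelfAttention(nn.Module):
    # Sketch of Eqs. (10)-(12): a gate computed from the text feature rescales the
    # acoustic/visual stream, the gated stream supplies queries and keys, the
    # ungated stream supplies values, and a skip connection is added.
    def __init__(self, d=128):
        super().__init__()
        self.gate = nn.Linear(d, d)             # Linear(.; theta_g)
        self.W_q = nn.Linear(d, d, bias=False)
        self.W_k = nn.Linear(d, d, bias=False)
        self.W_v = nn.Linear(d, d, bias=False)
        self.scale = d ** 0.5

    def forward(self, F_m2t, F_t):              # both (B, T_t, d) after TCCA
        g = torch.sigmoid(self.gate(F_t))       # Eq. (10)
        gF = (1.0 + g) * F_m2t                  # Eq. (11)
        q, k = self.W_q(gF), self.W_k(gF)       # gated queries and keys
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.scale, dim=-1)
        return attn @ self.W_v(F_m2t) + F_m2t   # Eq. (12), with skip connection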

3.2.3 Fusion and Output Module.

Here, we utilize a simple attention approach to aggregate the reinforced features of the three modalities. Specifically, given the feature F_{m}^{[n]} \in R^{T_{m}\times d} of modality m output by the last TGSA module, we compute the attention weight matrix:

a_{m} = softmax\left(\frac{F_{m}^{[n]} W_{m}}{\sqrt{d}}\right)^{T} \in R^{1\times T_{m}},  (13)

where W_{m} \in R^{d} denotes the linear projection parameter, and a_{m} denotes the attention weight matrix for the feature F_{m}^{[n]}. Then we aggregate the features with the attention weights:

f_{m} = a_{m} F_{m}^{[n]} \in R^{1\times d}.  (14)

Eventually, we concatenate the features of all three modalities as f = \left[f_{t}; f_{a}; f_{v}\right] \in R^{1\times 3d}, and this fused feature is passed through a Multi-Layer Perceptron (MLP) to make the final prediction y_{pred}:

y_{pred} = \Phi\left(f; \theta_{\Phi}\right),  (15)

where \Phi(\cdot) is an MLP parameterized by \theta_{\Phi}.
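A compact PyTorch sketch of this fusion-and-output head follows, with an assumed two-layer MLP standing in for \Phi(\cdot); layer widths and names are illustrative.

import torch
import torch.nn as nn

class FusionOutputHead(nn.Module):
    # Sketch of Eqs. (13)-(15): per-modality attention pooling over the time axis,
    # concatenation, and an MLP regressor.
    def __init__(self, d=128):
        super().__init__()
        self.score = nn.ModuleDict({m: nn.Linear(d, 1, bias=False) for m in ("t", "a", "v")})
        self.mlp = nn.Sequential(nn.Linear(3 * d, d), nn.ReLU(), nn.Linear(d, 1))
        self.scale = d ** 0.5

    def forward(self, feats):                   # feats: {"t": (B, T_t, d), "a": ..., "v": ...}
        pooled = []
        for m in ("t", "a", "v"):
            F_m = feats[m]
            a = torch.softmax(self.score[m](F_m) / self.scale, dim=1)   # Eq. (13), (B, T_m, 1)
            pooled.append((a * F_m).sum(dim=1))                         # Eq. (14), (B, d)
        return self.mlp(torch.cat(pooled, dim=-1))                      # Eq. (15)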

4 Experiments

In this section, we empirically evaluate our model on two datasets that are frequently used to benchmark the MSA task in prior works, and we introduce the datasets, implementation details, and the results of our experiments.

4.1 Datasets and Evaluation Metrics

The MOSI [29] dataset is a widely used benchmark for the MSA task. It comprises 2,199 short monologue video clips taken from 93 YouTube movie review videos. Its predetermined data partition has 1,284 samples in the training set, 229 in the validation set, and 686 in the testing set. The MOSEI [30] dataset is an improvement over MOSI. It contains 22,856 annotated video segments (utterances) from 5,000 videos, 1,000 distinct speakers, and 250 different topics. Its predetermined data partition has 16,326 samples in the training set, 1,871 in the validation set, and 4,659 in the testing set. Each sample in both MOSI and MOSEI is manually annotated with a sentiment score in [-3, 3], which indicates the polarity (positive/negative) and the relative strength (absolute value) of the expressed sentiment. As in previous works [7, 9], we evaluate model performance with the 7-class accuracy (Acc_7), the binary accuracy (Acc_2), the mean absolute error (MAE), the correlation of the model's predictions with human annotations (Corr), and the F1 score.
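For reference, the following sketch computes these metrics in the way commonly adopted by prior MSA work (e.g., excluding neutral samples for the positive/negative Acc_2 and F1); it is not the authors' exact evaluation script.

import numpy as np
from sklearn.metrics import f1_score

def msa_metrics(preds, labels):
    # Sketch of common MSA metric definitions on [-3, 3] regression outputs.
    preds, labels = np.asarray(preds, dtype=float), np.asarray(labels, dtype=float)
    mae = np.abs(preds - labels).mean()
    corr = np.corrcoef(preds, labels)[0, 1]
    # Acc_7: round both to the nearest integer class in {-3, ..., 3}.
    acc7 = (np.clip(np.round(preds), -3, 3) == np.clip(np.round(labels), -3, 3)).mean()
    # Positive/negative Acc_2 and F1: exclude neutral (label == 0) samples;
    # the non-negative/negative variant instead keeps label == 0 as non-negative.
    nz = labels != 0
    acc2 = ((preds[nz] > 0) == (labels[nz] > 0)).mean()
    f1 = f1_score(labels[nz] > 0, preds[nz] > 0, average="weighted")
    return {"MAE": mae, "Corr": corr, "Acc_7": acc7, "Acc_2": acc2, "F1": f1}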

Table 1: Comparison results on the MOSI. For Acc_2 and F1, we report two sets of results: non-negative/negative (left) and positive/negative (right).
Method  MAE↓  Corr↑  Acc_7↑  Acc_2↑  F1↑
TFN 0.901 0.698 34.9 -/80.8 -/80.7
LMF 0.917 0.695 33.2 -/82.5 -/82.4
MulT 0.861 0.711 - 81.5/84.1 80.6/83.9
MISA 0.783 0.761 42.3 81.8/83.4 81.7/83.6
MAG-BERT 0.731 0.789 - 82.5/84.3 82.6/84.3
Self-MM 0.718 0.796 46.04 82.62/84.45 82.55/84.44
TMRN(ours) 0.704 0.784 48.68 83.67/85.67 83.45/85.52
Table 2: Comparison results on the MOSEI.
Method  MAE↓  Corr↑  Acc_7↑  Acc_2↑  F1↑
TFN 0.593 0.700 50.2 -/82.5 -/82.1
LMF 0.623 0.677 48.0 -/82.0 -/82.1
MulT 0.580 0.703 - 82.5 -/82.9
MISA 0.568 0.724 - 82.59/84.23 82.67/83.97
MAG-BERT 0.539 0.753 - 83.8/85.2 83.7/85.1
Self-MM 0.536 0.763 54.5 82.59/84.95 82.9/84.85
TMRN(ours) 0.535 0.762 53.65 83.39/86.19 83.67/86.08
Table 3: Ablation results of our TMRN on the MOSI.
Model  MAE↓  Corr↑  Acc_7↑  Acc_2↑  F1↑
Full method 0.7041 0.7844 48.68 83.67/85.67 83.45/85.52
w/o A 0.8114 0.7426 45.48 81.05/81.86 81.09/81.96
w/o V 0.8452 0.7382 41.69 80.61/81.71 80.67/81.82
Acoustic-oriented 0.7508 0.7658 43.00 81.92/83.38 81.85/83.36
Visual-oriented 0.7956 0.7309 41.69 82.07/83.23 82.06/83.27
w/o TCCA 0.7498 0.7817 44.75 83.09/84.76 83.03/84.75
w/o TGSA 0.7467 0.7824 45.33 80.45/81.71 80.50/81.79

4.2 Implementation Details

All models are built on the PyTorch toolbox [11] with two Quadro RTX 8000 GPUs. The Adam optimizer [6] is adopted for network optimization. For the MOSI and MOSEI datasets, respectively, the training settings are as follows: the batch sizes are {128, 64}, the epochs are {100, 40}, the learning rates are {1e-3, 2e-3}, and the hidden dimension d is 128. The number of stacked TCCA and TGSA modules is N = 3.
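These settings can be summarized in a hypothetical configuration dictionary; the key names are our own and do not correspond to a released training script.

# Per-dataset training settings from Sec. 4.2 (illustrative key names).
CONFIGS = {
    "mosi":  {"batch_size": 128, "epochs": 100, "lr": 1e-3, "hidden_dim": 128, "N": 3, "optimizer": "adam"},
    "mosei": {"batch_size": 64,  "epochs": 40,  "lr": 2e-3, "hidden_dim": 128, "N": 3, "optimizer": "adam"},
}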

4.3 Comparison with State-of-the-Art Methods

The proposed approach is compared with existing state-of-the-art (SOTA) baselines, including TFN [28], LMF [8], MulT [15], MISA [4], MAG-BERT [12], and Self-MM [27]. Tables 1 and 2 show the comparison results on MOSI and MOSEI, respectively. The result of Self-MM [27] is reproduced from the open-source code with the hyper-parameters provided in the original paper.

The proposed TMRN significantly outperforms most previous methods [28, 8, 15, 4, 12] by considerable margins on all metrics in both datasets, demonstrating the superiority of our method. In addition, our model surpasses the current SOTA Self-MM [27] on most metrics (i.e., MAE, Acc_7, Acc_2, and F1 on MOSI, and MAE, Acc_2, and F1 on MOSEI) with better or competitive performance, suggesting the effectiveness of our text-oriented design philosophy.

4.4 Ablation Study

The overall performance has proven the superiority of TMRN. To understand the necessity of the different components and the dominance of the text modality, we conduct systematic ablation experiments on the MOSI, as shown in Table 3.

4.4.1 Importance of Modality.

We remove one modality at a time to explore the performance of our model. Performance declines when either the visual or the acoustic sequences are removed, indicating the importance of both modalities. Furthermore, the degradation is more severe when the visual modality is removed, which is in line with the previous work [21]. These results suggest that the information in the nonverbal modalities complements the text modality.

4.4.2 Importance of Center Modality.

To demonstrate the dominance of the text modality, we conduct experiments in which each of the other two modalities serves as the dominant modality instead. The acoustic- and visual-oriented models invariably suffer significant performance degradation. These observations demonstrate that the text modality is richer in semantics and less noisy, which leads to better feature reinforcement of the other two modalities.

4.4.3 Importance of Module.

Finally, we explore the importance of the proposed components by removing the TCCA and TGSA modules separately. For the TCCA module, we remove the cross-attention block and only perform self-attention on the text modality. Performance degrades when either module is removed. These observations suggest that adequate guidance from the text modality is necessary and indispensable.

Figure 4: Performance of TMRN with different values of the parameter N on MOSI and MOSEI.

4.5 Sensitivity of Parameter

To explore the effect of the parameter N on model performance, we conduct experiments on the MOSI and MOSEI datasets with different values of N. The results are summarized in Fig. 4. As N increases, the F1 scores first increase and then decrease, and the network performs best when N = 3. We conjectured that a larger N would yield better modality reinforcement; however, the experiments show that too many layers may bottleneck the ability of the text modality to guide the other two modalities. The appropriate depth should therefore be chosen for each dataset, which is exactly what our proposed TMRN can do flexibly. When migrating our model to a more complex dataset, the number of TCCA and TGSA modules can be increased appropriately to achieve the best performance.

5 Conclusion

This paper presents a text-oriented multimodal sequence reinforcement network to achieve interaction and fusion over unaligned sequences of three modalities in the context of multimodal human sentiment analysis. The work is based on inter- and intra-modal attention mechanisms, and the attention of the other two modalities is guided throughout by sequences from the text modality, enabling the alternate transfer of information within and across modalities. We experimentally observe that our approach can achieve better performance than the baselines in MSA benchmarks.

5.0.1 Acknowledgments

This work was supported in part by the National Key R&D Program of China (2021ZD0113503), and in part by the Shanghai Municipal Science and Technology Major Project (2021SHZDZX0103).

References

  • [1] Chen, M., Wang, S., Liang, P.P., Baltrušaitis, T., Zadeh, A., Morency, L.P.: Multimodal sentiment analysis with word-level fusion and reinforcement learning. In: Proceedings of the 19th ACM International Conference on Multimodal Interaction. pp. 163–171 (2017)
  • [2] Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
  • [3] Du, Y., Yang, D., Zhai, P., Li, M., Zhang, L.: Learning associative representation for facial expression recognition. In: IEEE International Conference on Image Processing. pp. 889–893 (2021)
  • [4] Hazarika, D., Zimmermann, R., Poria, S.: Misa: Modality-invariant and-specific representations for multimodal sentiment analysis. In: Proceedings of the 28th ACM International Conference on Multimedia. pp. 1122–1131 (2020)
  • [5] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997)
  • [6] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
  • [7] Liang, T., Lin, G., Feng, L., Zhang, Y., Lv, F.: Attention is not enough: Mitigating the distribution discrepancy in asynchronous multimodal sequence fusion. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 8148–8156 (2021)
  • [8] Liu, Z., Shen, Y., Lakshminarasimhan, V.B., Liang, P.P., Zadeh, A., Morency, L.P.: Efficient low-rank multimodal fusion with modality-specific factors. arXiv preprint arXiv:1806.00064 (2018)
  • [9] Lv, F., Chen, X., Huang, Y., Duan, L., Lin, G.: Progressive modality reinforcement for human multimodal emotion recognition from unaligned multimodal sequences. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2554–2562 (2021)
  • [10] Mittal, T., Guhan, P., Bhattacharya, U., Chandra, R., Bera, A., Manocha, D.: Emoticon: Context-aware multimodal emotion recognition using frege’s principle. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 14234–14243 (2020)
  • [11] Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al.: Pytorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems 32 (2019)
  • [12] Rahman, W., Hasan, M.K., Lee, S., Zadeh, A., Mao, C., Morency, L.P., Hoque, E.: Integrating multimodal information in large pretrained transformers. In: Proceedings of the conference. Association for Computational Linguistics. Meeting. vol. 2020, p. 2359. NIH Public Access (2020)
  • [13] Strubell, E., Verga, P., Andor, D., Weiss, D., McCallum, A.: Linguistically-informed self-attention for semantic role labeling. arXiv preprint arXiv:1804.08199 (2018)
  • [14] Tang, G., Müller, M., Rios, A., Sennrich, R.: Why self-attention? a targeted evaluation of neural machine translation architectures. arXiv preprint arXiv:1808.08946 (2018)
  • [15] Tsai, Y.H.H., Bai, S., Liang, P.P., Kolter, J.Z., Morency, L.P., Salakhutdinov, R.: Multimodal transformer for unaligned multimodal language sequences. In: Proceedings of the conference. Association for Computational Linguistics. Meeting. vol. 2019, p. 6558. NIH Public Access (2019)
  • [16] Tsai, Y.H.H., Liang, P.P., Zadeh, A., Morency, L.P., Salakhutdinov, R.: Learning factorized multimodal representations. arXiv preprint arXiv:1806.06176 (2018)
  • [17] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in Neural Information Processing Systems 30 (2017)
  • [18] Wang, Y., Shen, Y., Liu, Z., Liang, P.P., Zadeh, A., Morency, L.P.: Words can shift: Dynamically adjusting word representations using nonverbal behaviors. In: Proceedings of the AAAI Conference on Artificial Intelligence. pp. 7216–7223 (2019)
  • [19] Wu, Y., Lin, Z., Zhao, Y., Qin, B., Zhu, L.N.: A text-centered shared-private framework via cross-modal prediction for multimodal sentiment analysis. In: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. pp. 4730–4738 (2021)
  • [20] Yang, D., Chen, Z., Wang, Y., Wang, S., Li, M., Liu, S., Zhao, X., Huang, S., Dong, Z., Zhai, P., Zhang, L.: Context de-confounded emotion recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 19005–19015 (June 2023)
  • [21] Yang, D., Huang, S., Kuang, H., Du, Y., Zhang, L.: Disentangled representation learning for multimodal emotion recognition. In: Proceedings of the 30th ACM International Conference on Multimedia. pp. 1642–1651 (2022)
  • [22] Yang, D., Huang, S., Liu, Y., Zhang, L.: Contextual and cross-modal interaction for multi-modal speech emotion recognition. IEEE Signal Processing Letters 29, 2093–2097 (2022)
  • [23] Yang, D., Huang, S., Wang, S., Liu, Y., Zhai, P., Su, L., Li, M., Zhang, L.: Emotion recognition for multiple context awareness. In: Proceedings of the European Conference on Computer Vision. vol. 13697, pp. 144–162 (2022)
  • [24] Yang, D., Kuang, H., Huang, S., Zhang, L.: Learning modality-specific and -agnostic representations for asynchronous multimodal language sequences. In: Proceedings of the 30th ACM International Conference on Multimedia. p. 1708–1717 (2022)
  • [25] Yang, D., Liu, Y., Huang, C., Li, M., Zhao, X., Wang, Y., Yang, K., Wang, Y., Zhai, P., Zhang, L.: Target and source modality co-reinforcement for emotion understanding from asynchronous multimodal sequences. Knowledge-Based Systems 265, 110370 (2023)
  • [26] Yu, W., Xu, H., Meng, F., Zhu, Y., Ma, Y., Wu, J., Zou, J., Yang, K.: Ch-sims: A chinese multimodal sentiment analysis dataset with fine-grained annotation of modality. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. pp. 3718–3727 (2020)
  • [27] Yu, W., Xu, H., Yuan, Z., Wu, J.: Learning modality-specific representations with self-supervised multi-task learning for multimodal sentiment analysis. In: Proceedings of the AAAI conference on artificial intelligence. pp. 10790–10797 (2021)
  • [28] Zadeh, A., Chen, M., Poria, S., Cambria, E., Morency, L.P.: Tensor fusion network for multimodal sentiment analysis. arXiv preprint arXiv:1707.07250 (2017)
  • [29] Zadeh, A., Zellers, R., Pincus, E., Morency, L.P.: Mosi: multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos. arXiv preprint arXiv:1606.06259 (2016)
  • [30] Zadeh, A.B., Liang, P.P., Poria, S., Cambria, E., Morency, L.P.: Multimodal language analysis in the wild: Cmu-mosei dataset and interpretable dynamic fusion graph. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics. pp. 2236–2246 (2018)