
MUFM: A Mamba-Enhanced Feedback Model for Micro Video Popularity Prediction

Jiacheng Lu1, Mingyuan Xiao1, Weijian Wang1, Yuxin Du1,
Yi Cui2, Jingnan Zhao1, Cheng Hua1
Corresponding author: [email protected]
Abstract

The surge in micro videos is transforming the concept of popularity. As researchers delve into vast multi-modal datasets, there is a growing interest in understanding the origins of this popularity and the forces driving its rapid expansion. Recent studies suggest that the virality of short videos is tied not only to their inherent multi-modal content but also to the strength of platform recommendations driven by audience feedback. In this paper, we introduce a framework for capturing long-term dependencies in user feedback and dynamic event interactions, based on the Mamba Hawkes process. Our experiments on the large-scale open-source multi-modal dataset MicroLens show that our model significantly outperforms state-of-the-art approaches across various metrics, improving on the best baseline by 23.2%. We believe our model's ability to map the relationships within user feedback behavior sequences will not only contribute to the evolution of next-generation recommendation algorithms and platform applications but also deepen our understanding of micro video dissemination and its broader societal impact.

Introduction

The widespread adoption of portable devices has significantly contributed to the success of micro video platforms like TikTok. These devices make it easy for users to share their experiences, opinions, and thoughts in various formats, such as text, images, audio, and video. The resulting increase in user participation has led to the emergence of an important research area: micro video popularity prediction (MVPP).

Figure 1: TikTok Micro Videos

We recognize that current approaches to predicting short video popularity often fall short by underutilizing the wealth of available data, particularly in overlooking the role of event-driven propagation within social networks. To address these shortcomings, we introduce the Mamba-Enhanced User Feedback Capture Model for Micro Video Popularity Prediction (MUFM), which advances popularity prediction by integrating refined recommendation systems with models that account for social network dynamics.

MUFM is designed to make better use of the diverse data linked to short videos. It starts by using a retrieval system to find relevant micro videos from a multi-modal database, filtering content based on all available information including visual, audio, and text features. To better understand user responses, we use the Mamba Hawkes process, training on a dataset of 19,000 scraped comments to show how recommendation systems and user behavior influence video popularity. Specifically, the model processes video, audio, and related text data using the Mamba Hawkes process to recreate the video’s spread and produce a recommendation index. Additionally, we use a cross-attention mechanism to detect connections between the target video and similar content, further boosting the model’s prediction performance.

Contributions: Overall, the main contributions of this work are summarized as follows:

  • We present a Mamba-based framework that captures long-term dependencies and dynamic event interactions driven by user feedback, aimed at predicting micro video popularity. Designed for robustness against noise and uncertainties, it ensures more accurate and reliable predictions.

  • We use a Hawkes Process Model, built on the Mamba architecture, to quantify the impact of user reactions. Additionally, we integrate this model with a user-focused SASRec model to analyze and infer the influence of TikTok’s recommendation system on the popularity of micro videos.

  • Our evaluation on the MicroLens-100k dataset, the largest available for this task, shows that our MUFM model outperforms the current state-of-the-art by 23.2%.

Related Works

Micro Video Popularity Prediction

Micro Video Popularity Prediction (MVPP) has been studied using a variety of approaches. One prevalent method is feature engineering, which involves designing specific features to predict popularity. While this approach is widely adopted, it relies heavily on expert knowledge and high-quality feature selection, which can limit the scalability and flexibility of the models (Li et al. 2013; Roy et al. 2013; Wu et al. 2016).

Alternatively, deep learning methods have emerged as powerful tools for modeling multi-modal data. Techniques such as HMMVED (Xie, Zhu, and Chen 2023) and MTFM (Zhao et al. 2024) harness the capabilities of neural networks like ResNet (He et al. 2015) and ViT (Dosovitskiy et al. 2021) for visual data, and BERT (Devlin et al. 2019) and AnglE (Li and Li 2024) for textual data. These models excel in capturing cross-modal correlations and predicting popularity by leveraging the strengths of different data modalities. However, these methods focus on a limited subset of video information, fail to exploit the user behavior signals that drive social network dissemination, and do not make full use of multi-modal data. As a result, the accuracy of existing methods on the MVPP task remains relatively low.

Hawkes Processes

The Hawkes Process is a self-exciting point process that defines the event intensity as a function of past events, capturing dependencies in time-series data (Hawkes 1971). The Hawkes Process is distinguished by its capability to model self-excitation, where events increase the likelihood of similar future occurrences, as well as mutual excitation, where distinct event types exert influence over one another (Laub, Taimre, and Pollett 2015). This dual functionality renders it particularly effective for applications in finance, social media analysis (Zhou, Zha, and Song 2013) (Rizoiu et al. 2017), and seismic activity modeling.

Recent advancements have enhanced the original model’s capabilities. Mamba Hawkes Process (MHP) combines the Mamba state space architecture with the Hawkes process to effectively capture long-range dependencies and dynamic interactions in temporal data (Gao, Dai, and Hu 2024). While traditional Hawkes processes model self-exciting event sequences, they often struggle with complex patterns, particularly in irregular or asynchronous data. MHP improves this by effectively modeling both history-dependent and independent components, naturally incorporating temporal differences to better predict event timing.

Preliminaries

Definitions & Problem Statement

Let $V=\{v_{1},v_{2},\dots,v_{N}\}$ represent the set of $N$ micro videos available on an online video platform. Each video $v_{i}$ consists of $L$ modalities of content, denoted as $M_{i}=\{m_{1}^{i},m_{2}^{i},\dots,m_{L}^{i}\}$, where $L\geq 2$. The goal of micro video popularity prediction is to forecast meta parameters related to video popularity, such as cumulative views, comments, and likes. These parameters are represented as $\mathbf{Y}_{i}=[y_{1}^{i},y_{2}^{i},\dots,y_{h}^{i}]$, where $h$ denotes the number of different popularity metrics for each video $v_{i}$. The task is to predict the future values of these parameters, $\mathbf{Y}_{i}$, using all relevant modalities that influence the video's popularity trend after its release.

In this study, we focus on predicting comments as the primary popularity metric, since comments are less prone to manipulation and provide more detailed temporal information than other metrics such as total views. The problem we address is to accurately predict $y_{c}^{i}$, the comment parameter for a given video $v_{i}$, using its multi-modal content information $M_{i}$. Specifically, the goal is to forecast the future values of $\mathbf{Y}_{i}$ from the available modalities with our model $F(v_{i}, M_{i})$. We use the normalized Mean Squared Error (nMSE) to evaluate the loss of our prediction model.

Methodology

Our method comprises two modules: a video information extraction module based on multi-modal information processing and database retrieval, and a social network dissemination simulation module. In the multi-modal processing module, we introduce an LLM-based retriever to find related videos and process the video's various modalities with architectures such as ViT. We then apply a cross-attention mechanism to explore the interactions between the retrieved videos and the target video. In the social network dissemination simulation module, we establish an event-focused Hawkes process model based on the Mamba architecture to reproduce the dissemination of the video within a user group. A prediction network integrates the outputs of both modules and predicts the dissemination intensity of the video.

Figure 2: Overall Structure of MUFM

Micro Video Retrieval

In micro video dissemination, a video's popularity is often related to the performance of similar videos and is significantly influenced by recommendation algorithms and user behavior. We posit that incorporating similar-video information can enhance micro video popularity prediction accuracy. To achieve this, we first employ a module to summarize and extract multi-modal information from micro videos, converting the data into a textual description used for identifying similar content. To retrieve instances that are valuable for predicting the target video's popularity, we craft a video-to-text transformation process. This process uses vision and audio-to-text models to generate captions for the video content, which are combined with the original textual descriptions. The resulting composite text is used as an input prompt for LLMs and encoded into a retrieval vector representing the corresponding video. This method aligns the micro video's audio-visual and textual modalities, addressing potential inconsistencies.

The micro video memory repository is a collection of reference pairs, denoted as $\langle frames, text\rangle$, with each element encoded for efficient retrieval. To generate a retrieval vector for a given micro video $V_{i}$, we use BLIP-2 (Li et al. 2023) and CLAP (Wu et al. 2024) to generate descriptive textual annotations, denoted as $C_{i}$ and $A_{i}$. The synthetic captions $C_{i}$, the audio annotations $A_{i}$, and the text description $T_{i}$ are then combined using the concatenation operator $\oplus$, leading to the synthesized text prompt $P_{i}$:

$P_{i}=C_{i}\oplus A_{i}\oplus T_{i}.$ (1)

$P_{i}$ is then processed through a pre-trained semantic extraction model to generate a retrieval vector $R_{i}$ that encapsulates the key attributes of video $V_{i}$. By aggregating features from the top-$S$ most similar videos, we obtain the full set of retrieved features. The specific process and formulas for this part of the model can be found in Appendix 1.
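For illustration, the sketch below shows this retrieval step end to end, assuming the BLIP-2/CLAP captions and the memory-bank embeddings are already available as strings and vectors; the function names, the plain string concatenation, and the inner-product scoring are illustrative assumptions rather than the exact implementation.

```python
import numpy as np

def build_prompt(frame_captions, audio_caption, description):
    # P_i = C_i ⊕ A_i ⊕ T_i: concatenate frame captions, the audio caption,
    # and the original text description into one prompt for the semantic extractor.
    return " ".join(frame_captions) + " " + audio_caption + " " + description

def retrieve_top_s(query_vec, bank_vecs, s=20):
    # Score every repository video by inner product with the query retrieval
    # vector R_i and return the indices of the S most similar instances.
    scores = bank_vecs @ query_vec            # (num_videos,)
    top_idx = np.argsort(-scores)[:s]
    return top_idx, scores[top_idx]
```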

Multi-Modal Process

For MVPP, it is crucial to extract the required features from the video's multi-modal information. We process all target videos frame by frame, sample frames at fixed intervals to extract image and audio information, and transform them into frame feature vectors through ViT and AST architectures. We then feed the video's image-audio features and its text description into a model with both forward (positive) and reverse (negative) attention mechanisms to capture its content features, and further enhance these features by comparing the target video with the videos matched by the retriever in the database.

Multi-Modal Feature Extraction

To capture the key features of a micro video $v_{i}$, we extract visual, audio, and textual information. We start by selecting key frames $F_{1}^{i},F_{2}^{i},\dots,F_{K}^{i}$, where $K$ is the total number of sampled frames. Each frame $F_{j}^{i}\in\mathbb{R}^{H\times W\times C}$ is divided into fixed-size patches and reshaped into ${F_{j}^{i}}^{\ast}\in\mathbb{R}^{N\times P^{2}\times C}$.

After applying the Audio Spectrogram Transformer (AST) structure (Gong, Chung, and Glass 2021) to obtain the spectrum matrix $M_{S}^{n_{mel}\times n_{a}}$, we split it into patches and linearly embed each visual and audio patch, obtaining $K$ patch sequences. Passing these through a transformer structure yields $E_{i}^{v}$ and $E_{i}^{a}$, respectively. These feature matrices are concatenated to emphasize the relationship between visual and audio information:

$E_{i}^{v,a}=E_{i}^{v}\oplus E_{i}^{a}\in\mathbb{R}^{K\times(d_{v}+d_{a})}.$ (2)

The concatenated features $E_{i}^{v,a}$ are then passed through a linear layer $W_{c}\in\mathbb{R}^{(d_{v}+d_{a})\times d}$ with a ReLU activation, generating the audio-visual input token $X^{c}$. Similarly, we obtain the textual input token $X^{t}$ by processing the textual embedding $E^{t}_{i}$ through the layer $W_{t}\in\mathbb{R}^{d_{t}\times d}$:

$X^{c}\in\mathbb{R}^{K\times d}:\ X_{i}^{c}=\mathrm{ReLU}(E_{i}^{v,a}W_{c}).$ (3)
$X^{t}\in\mathbb{R}^{n_{w}\times d}:\ X_{i}^{t}=\mathrm{ReLU}(E_{i}^{t}W_{t}).$ (4)

The specific process and formula description of this part of the model can be found in Appendix 2.
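As a hedged illustration of the projection step in Eqs. (2)-(4), the sketch below assumes the ViT, AST, and text embeddings are already computed; all dimensions and names are placeholders, not the exact implementation.

```python
import torch
import torch.nn as nn

class ModalityProjector(nn.Module):
    def __init__(self, d_v=768, d_a=768, d_t=1024, d=256):
        super().__init__()
        self.w_c = nn.Linear(d_v + d_a, d)   # W_c for the fused audio-visual token
        self.w_t = nn.Linear(d_t, d)         # W_t for the textual token

    def forward(self, e_v, e_a, e_t):
        # e_v: (K, d_v) visual features, e_a: (K, d_a) audio features,
        # e_t: (n_w, d_t) textual embeddings.
        e_va = torch.cat([e_v, e_a], dim=-1)   # Eq. (2): E^{v,a} = E^v ⊕ E^a
        x_c = torch.relu(self.w_c(e_va))       # Eq. (3): audio-visual token X^c
        x_t = torch.relu(self.w_t(e_t))        # Eq. (4): textual token X^t
        return x_c, x_t
```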

Cross-modal Bipolar Interaction

Aligning audio-visual and textual modalities in micro videos presents challenges due to potential inconsistencies between textual descriptions and video content. To address this, we implement a cross-attention network comprising both positive and negative attention modules, designed to capture the similarities and differences between multi-modal information. The positive attention module focuses on identifying the most consistent features across different modalities, while the negative attention module highlights any inconsistent or contradictory information.

Within the positive attention module, the most aligned features between modalities are calculated using cross-modal attention vectors. For a given video $v_{i}$, the audio-visually guided positive textual features $T_{i}^{\mathcal{P}}$ and the textually guided positive audio-visual features $C_{i}^{\mathcal{P}}$ are derived as follows:

$T_{i}^{\mathcal{P}} =ATT^{\mathcal{P}}\left(X_{i}^{c}W^{\mathcal{Q}}_{\mathcal{P}},\ X_{i}^{t}W^{\mathcal{K}}_{\mathcal{P}},\ X_{i}^{t}W^{\mathcal{C}}_{\mathcal{P}}\right)$ (5)
$=\mathrm{Softmax}\left(\alpha\frac{\mathcal{Q}\mathcal{K}^{T}}{\sqrt{d}}\right)\mathcal{C},$ (6)

where $W^{\mathcal{Q}}_{\mathcal{P}},W^{\mathcal{K}}_{\mathcal{P}},W^{\mathcal{C}}_{\mathcal{P}}$ denote the query, key, and value projection matrices, respectively. The textually guided positive audio-visual features $C_{i}^{\mathcal{P}}$ are obtained in the same way. The parameter $\alpha$ balances the influence of positive and negative attention. Similarly, the negative audio-visually guided textual features $T_{i}^{\mathcal{N}}$ and the textually guided negative audio-visual features $C_{i}^{\mathcal{N}}$ can be obtained with the same method.

Following the hidden-state integration of MMRA (Geva et al. 2021; Zhong et al. 2024), we incorporate audio-visual hidden states into textual hidden states within the FFN layers, producing comprehensive textual and audio-visual representations $\widetilde{T_{i}}$ and $\widetilde{C_{i}}$. We then obtain expressive representations $T_{i}$ and $C_{i}$ via an attentive pooling strategy (Sun and Lu 2020).

This dual-path approach ensures a synthesis of the audio-visual and textual modalities; the detailed process and formulas for this part of the model can be found in Appendix 3.
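A minimal sketch of one positive/negative cross-attention pair (in the spirit of Eqs. 5-6 and their negative counterparts) is shown below; the class layout, dimensions, and the scaling convention $\gamma=-(1-\alpha)$ from Appendix 3 are the only assumptions.

```python
import torch
import torch.nn as nn

class BipolarCrossAttention(nn.Module):
    """Positive and negative cross-modal attention over one query/context pair."""
    def __init__(self, d=256, alpha=0.2):
        super().__init__()
        self.q = nn.Linear(d, d)
        self.k = nn.Linear(d, d)
        self.v = nn.Linear(d, d)
        self.alpha = alpha              # balance between positive / negative attention
        self.gamma = -(1.0 - alpha)     # negative-attention scale (Appendix 3)

    def forward(self, x_query, x_context):
        d = x_query.size(-1)
        logits = self.q(x_query) @ self.k(x_context).T / d ** 0.5
        pos = torch.softmax(self.alpha * logits, dim=-1) @ self.v(x_context)
        neg = torch.softmax(self.gamma * logits, dim=-1) @ self.v(x_context)
        return pos, neg   # e.g. T_i^P and T_i^N when queried by audio-visual tokens
```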

Retrieval Interaction Enhancement

We focus on extracting valuable insights from the retrieved relevant instances to improve micro video popularity prediction (MVPP). To achieve this, we use an aggregation function that combines the comprehensive representations of these instances from the memory bank. This function uses the normalized similarity scores obtained during retrieval as attention weights to construct the retrieved embeddings $X_{i}^{r,c}$ and $X_{i}^{r,t}$. Instances with higher similarity scores are prioritized, highlighting their relevance to the target micro video.

Similarly, we calculate $\widetilde{T_{i}^{r}}$ and $\widetilde{C_{i}^{r}}$, and then derive the final representations $T_{i}^{r}$ and $C_{i}^{r}$ using an attentive pooling strategy. To capture the popularity trends of the retrieved instances, we encode the label information through linear layers and aggregation, resulting in the retrieved label embedding $L_{i}^{r}$. Finally, AUG-MMRA integrates all features from $T_{i}^{r}$ and $C_{i}^{r}$ to model cross-sample interactions. These feature interactions are constructed as

$\mathcal{I}=[\Phi(C_{i},C_{i}^{r}),\Phi(C_{i},T_{i}^{r}),\dots,\Phi(T_{i},L_{i}^{r})],$ (7)

where $\Phi$ denotes the inner product. A detailed description of this part of the model can be found in Appendix 4.
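The sketch below illustrates how the pairwise interactions in Eq. (7) can be assembled; the specific lists of features passed in are illustrative.

```python
import torch

def cross_sample_interactions(target_feats, retrieved_feats):
    # target_feats: pooled target representations, e.g. [C_i, T_i], each of shape (d,);
    # retrieved_feats: aggregated retrieved representations, e.g. [C_i^r, T_i^r, L_i^r].
    # Φ is an inner product; the interaction vector I collects every pairing.
    return torch.stack([torch.dot(a, b)
                        for a in target_feats
                        for b in retrieved_feats])
```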

Multi-modal Information Based SASRec

In our micro-video popularity prediction model, we use SASRec, a Transformer-based architecture, to capture users’ sequential behavior patterns. It models short and long-term user preferences and incorporates multi-modal item features.

The model processes video frames, cover images, and titles extracted with pre-trained encoders. User interaction histories are fed into the SASRec module, where self-attention layers encode sequential behaviors. The multi-modal features are integrated to form the item scoring representation, which is then combined with the user embedding.

For each user, a score for each item is calculated by a dot product. The final score for an item across all users is the average of individual scores. These are converted to probabilities with a sigmoid function. The model integrates multi-modal features and optimizes through backpropagation based on user behavior sequences. More details can be found in Appendix 5.
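As a hedged sketch of this scoring step (the per-user dot product, the averaging across users, and the sigmoid described above), with tensor shapes as assumptions:

```python
import torch

def sas_score(user_embs, item_scoring):
    # user_embs: (N, d) sequence-encoded user embeddings from SASRec;
    # item_scoring: (d,) multi-modal representation of the candidate item.
    scores = user_embs @ item_scoring      # dot product per user
    avg_score = scores.mean()              # average over all N users
    return torch.sigmoid(avg_score)        # Score_SAS, the predicted interaction probability
```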

Mamba Hawkes Process (MHP) Architecture

We define the user reaction sequence as $\mathcal{S}=\{(t_{1},r_{1}),(t_{2},r_{2}),\dots,(t_{n},r_{n})\}$, where $\Delta_{i}=t_{i}-t_{i-1}$ is the time difference between consecutive events, collected as $\Delta=(\Delta_{1},\Delta_{2},\dots,\Delta_{n})$. Each event is represented as a one-hot vector $\mathbf{r}_{i}$. We construct the MHP model using the following equation:

$h(t+\Delta t)=\overline{\mathbf{A}}h(t)+\overline{\mathbf{B}}u(t).$ (8)

MHP Model Architecture

Let $W^{e}$ be the event embedding matrix of dimension $D\times R$; the event embedding is defined as $x_{t_{i}}=\mathbf{r}_{i}(W^{e})^{T}$, and the sequence embedding is $(x_{t_{1}},x_{t_{2}},\dots,x_{t_{n}})$. We define the time-dependent matrices $\mathbf{B}(t_{i})$ and $\mathbf{C}(t_{i})$ as linear transformations. The state transition formulas for MHP are:

$z_{t_{i}} =\overline{\mathbf{A}}(t_{i})z_{t_{i-1}}+\overline{\mathbf{B}}(t_{i})x_{t_{i}},$ (9)
$y_{t_{i}} =\mathbf{C}(t_{i})z_{t_{i}},$

where the time-dependent coefficients are:

$\overline{\mathbf{A}}(t_{i}) =\exp(\Delta_{i}\mathbf{A}),$ (10)
$\overline{\mathbf{B}}(t_{i}) =(\Delta_{i}\mathbf{A})^{-1}(\exp(\Delta_{i}\mathbf{A})-\mathbf{I})\cdot\Delta_{i}\mathbf{B}(t_{i}).$

The final output $\mathbf{O}=(\mathbf{o}_{1},\mathbf{o}_{2},\dots,\mathbf{o}_{n})$ is passed through a neural network to generate the hidden representations $\mathbf{h}(t)$:

$\mathbf{H}=\mathrm{ReLU}(\mathbf{O}W_{1}+b_{1})W_{2}+b_{2},\quad\mathbf{h}(t_{j})=\mathbf{H}(j,:).$ (11)

The resulting matrix $\mathbf{H}$ contains hidden representations of all the events in the input sequence, where each row corresponds to a particular event.
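For concreteness, the following is a minimal sketch of one recurrence step of Eqs. (9)-(10), under the simplifying assumption of a diagonal $\mathbf{A}$ (a common state space parameterization) so that the matrix exponential and inverse reduce to elementwise operations; shapes and names are illustrative, not the paper's exact implementation.

```python
import torch

def mhp_step(z_prev, x_t, delta, A_diag, B_t, C_t):
    # z_prev, x_t, A_diag, B_t, C_t: (D,) tensors; delta: scalar time gap Δ_i.
    dA = delta * A_diag                        # Δ_i A
    A_bar = torch.exp(dA)                      # A̅(t_i) = exp(Δ_i A)
    B_bar = (A_bar - 1.0) / dA * delta * B_t   # (Δ_i A)^{-1}(exp(Δ_i A) − I) · Δ_i B(t_i)
    z = A_bar * z_prev + B_bar * x_t           # z_{t_i}
    y = C_t * z                                # y_{t_i} = C(t_i) z_{t_i}
    return z, y
```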

Figure 3: Mamba Hawkes Process (MHP) Architecture

Intensity Function and Log-Likelihood

The intensity function of the MHP is given by

$\lambda(t)=\sum_{r=1}^{R}\lambda_{r}(t),$ (12)

where

$\lambda_{r}=f_{r}\left(\alpha_{r}(t-t_{j})+\mathbf{w}_{r}^{T}\mathbf{h}(t_{j})+b_{r}\right),$ (13)

and $f_{r}(x)=\beta_{r}\log\left(1+\exp(x/\beta_{r})\right)$ is the Softplus function. The log-likelihood function based on the sequence $\mathcal{S}$ is given by

$\ell(\mathcal{S})=\underbrace{\sum_{j=1}^{n}\log\lambda(t_{j}|\mathcal{H}_{j})}_{\text{event log-likelihood}}-\underbrace{\int_{t_{1}}^{t_{n}}\lambda(t|\mathcal{H}_{t})\,dt}_{\text{non-event log-likelihood}}.$ (14)

Note that the log-likelihood function plays two important roles here:

  • as the loss function for our pre-trained model

  • to evaluate the reaction ratio from the users

We learn the model parameters by maximizing the log-likelihood across all sequences:

$\max~\sum_{i=1}^{N}\ell(\mathcal{S}_{i}),$ (15)

using the ADAM optimization algorithm for an efficient solution.
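As an illustration of how Eqs. (12)-(15) become a training objective, the sketch below evaluates the intensities only at the event times (where the $\alpha_{r}(t-t_{j})$ term vanishes) and approximates the non-event integral with a trapezoidal rule; the shapes, parameter names, and this approximation scheme are assumptions rather than the exact implementation.

```python
import torch

def mhp_log_likelihood(h, times, w, b, beta):
    # h: (n, d) hidden states h(t_j); times: (n,) event timestamps;
    # w: (d, R), b: (R,), beta: (R,) Softplus scale parameters β_r.
    x = h @ w + b                                    # (n, R) pre-activations
    lam = beta * torch.log1p(torch.exp(x / beta))    # λ_r(t_j) = β_r log(1 + exp(x/β_r))
    total = lam.sum(dim=-1)                          # λ(t_j) = Σ_r λ_r(t_j)
    event_ll = torch.log(total).sum()                # Σ_j log λ(t_j | H_j)
    dt = times[1:] - times[:-1]                      # trapezoidal approximation of ∫ λ(t) dt
    integral = (0.5 * (total[1:] + total[:-1]) * dt).sum()
    return event_ll - integral                       # ℓ(S), maximized (e.g. with Adam)
```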

Then we use the MHP model as an evaluation module, combining it with multi-modal information to assign a comprehensive score that reflects the impact of platform recommendations on the popularity of short videos.

$Likelihood_{MHP}=\ell(v_{i}),\ v_{i}\in\mathbf{V}.$ (16)

Further implementation details can be found in Appendix 6.

Prediction Network

For each micro-video $v_{i}$, the output layer of AUG-MMRA receives a concatenated vector of the feature components, which is passed through a fully connected layer with weights $W_{output}\in\mathbb{R}^{10d\times d}$:

$Output=\mathrm{concat}([C_{i},T_{i},C_{i}^{r},T_{i}^{r},\mathcal{I}])\,W_{output}.$ (17)

Additionally, for each micro-video we obtain the recommendation score $Score_{SAS}$ from SASRec and the recommendation likelihood $Likelihood_{MHP}$ from MHP.

$Prediction=[Output,\ Score_{SAS},\ Likelihood_{MHP}]\,W_{pred},$ (18)

where $W_{pred}$ is a trainable parameter matrix. In the model training phase, we employ mean squared error (MSE) as our loss function.
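A minimal sketch of this fusion head (Eqs. 17-18) is given below; the dimensions and the use of linear layers for $W_{output}$ and $W_{pred}$ are assumptions.

```python
import torch
import torch.nn as nn

class PredictionHead(nn.Module):
    # Fuses the AUG-MMRA features with the SASRec score and the MHP likelihood.
    def __init__(self, d=256):
        super().__init__()
        self.w_output = nn.Linear(10 * d, d)   # W_output in Eq. (17)
        self.w_pred = nn.Linear(d + 2, 1)      # W_pred over [Output, Score_SAS, Likelihood_MHP]

    def forward(self, feats, score_sas, likelihood_mhp):
        # feats: list of 1-D feature tensors [C_i, T_i, C_i^r, T_i^r, I] totalling 10d dims.
        output = self.w_output(torch.cat(feats, dim=-1))
        fused = torch.cat([output,
                           score_sas.reshape(1),
                           likelihood_mhp.reshape(1)], dim=-1)
        return self.w_pred(fused).squeeze(-1)  # trained with an MSE loss
```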

Experiments

Research Question

In this section, we present experiments conducted to evaluate the effectiveness of MUFM on a real-world micro-video dataset, with the aim of addressing the following research questions:

  • RQ1: How does MUFM perform compared to existing models and state-of-the-art methods?

  • RQ2: What is the contribution of each component of MUFM to its overall performance in MVPP?

  • RQ3: How do key hyperparameters affect the model’s performance?

  • RQ4: What insights can be gained from the results of MUFM?

Table 1: Statistics of the datasets.
Dataset          Video    Train    Val     Test
MicroLens-100k   19738    15790    1974    1974
pMicroLens       17382*   13905    1739    1738

*: 2342 videos are unused in pMicroLens due to a lack of valid comments, and 14 videos are discarded due to file corruption.

Datasets

Based on the open-source dataset MicroLens-100k (Ni et al. 2023), we construct pMicroLens (permuted MicroLens) by excluding videos with fewer than five comments and keeping only the five earliest published comments for the remaining videos. Consequently, no information directly reveals the total number of comments, and the Mamba Hawkes Process can only learn the sequential structure of the earliest five comments. To evaluate the effectiveness of MUFM, we conduct experiments on pMicroLens to predict the popularity score, represented by the number of comments.

Evaluation Metrics

Since micro video popularity prediction is essentially a regression task, performance is determined by how well the predicted popularity parameters fit their actual values. We adopt the normalized Mean Squared Error (nMSE) as the primary performance metric, defined as follows:

$nMSE=\frac{1}{N\sigma_{y_{i}}^{2}}\sum_{i=1}^{N}(y_{i}-\hat{y}_{i})^{2},$ (19)

where $N$ represents the total number of micro-video samples, $y_{i}$ and $\hat{y}_{i}$ are the target and predicted popularity scores for the $i$-th micro-video sample, and $\sigma_{y_{i}}$ is the standard deviation of the target popularity.

In addition, we measure Spearman’s Rank Correlation (SRC) coefficient, Pearson linear correlation coefficient (PLCC), and Mean Absolute Error (MAE) as complementary metrics. SRC, PLCC, and MAE are defined as follows:

$SRC=1-\frac{6\sum_{i=1}^{N}d_{i}^{2}}{N(N^{2}-1)},$ (20)
$PLCC=\frac{1}{N}\sum_{i=1}^{N}\frac{1}{T}\sum_{j=1}^{T}\left(\frac{y_{i}^{j}-\bar{y}_{i}}{\sigma_{y_{i}}}\right)\left(\frac{\hat{y}_{i}^{j}-\bar{\hat{y}}_{i}}{\sigma_{\hat{y}_{i}}}\right),$ (21)
$MAE=\frac{1}{N}\sum_{i=1}^{N}\lvert y_{i}-\hat{y}_{i}\rvert,$ (22)

where $d_{i}$ is the rank difference of a micro-video between the prediction and the popularity target, and $\sigma_{y_{i}}$ and $\sigma_{\hat{y}_{i}}$ denote the standard deviations of the target and predicted popularity sequences for the $i$-th micro-video sample.

A higher SRC value indicates a stronger monotonic correlation between targets and predictions, a higher PLCC value indicates a stronger linear correlation, and a lower nMSE or MAE indicates a more precise prediction. More details on the baseline models and experiments are provided in the Appendix.
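For reference, a simple sketch of these metrics is shown below; note that the PLCC here is computed at the sample level (a common simplification), whereas Eq. (21) defines it per video over time steps.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def evaluate(y_true, y_pred):
    # nMSE normalizes the squared error by the variance of the targets (Eq. 19).
    nmse = np.mean((y_true - y_pred) ** 2) / np.var(y_true)
    src, _ = spearmanr(y_true, y_pred)        # rank correlation (Eq. 20)
    plcc, _ = pearsonr(y_true, y_pred)        # linear correlation
    mae = np.mean(np.abs(y_true - y_pred))    # Eq. (22)
    return {"nMSE": nmse, "SRC": src, "PLCC": plcc, "MAE": mae}
```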

Implementation

Device

All our experiments are conducted on a Linux server with 16 NVIDIA A100 Tensor Core GPUs using a multi-thread dataloader.

Table 2: Main hyper-parameters in the training process
$lr$: Learning rate
$batch\_size$: Batch size
$\alpha$: Balance parameter between positive and negative attention
$wdecay$: Weight decay
$dropout$: Dropout rate in the neural network
$K$: Number of frames captured per video
$S$: Number of similar videos retrieved

Hyper-Parameter Settings

There are several main parameters in the training process, listed in Table 2. We adopt Bayesian Optimization (Jones, Schonlau, and Welch 1998) to search the hyper-parameter space and find the lowest nMSE when setting $[lr,\ batch\_size,\ wdecay,\ dropout,\ \alpha,\ K,\ S]=[1\mathrm{e}{-4},\ 64,\ 0.001,\ 0,\ 0.2,\ 10,\ 20]$. We apply this set of parameters to obtain our results and compare MUFM with the other baselines. Hyper-parameter optimization and sensitivity analysis are discussed in detail in the following paragraphs.
Running details: We have set the random seed to 2024 for reproducibility. The best performance is achieved after 6 epochs.

Performance Comparison

Baseline models

To evaluate the superiority of the model, we conduct experiments against 9 competitive baselines covering different methods: SVR (Khosla, Das Sarma, and Hamid 2014), HyFea (Lai, Zhang, and Zhang 2020), CLSTM (Ghosh et al. 2016), TMALL (Chen et al. 2016), MASSL (Zhang et al. 2023), CBAN (hin Cheung and man Lam 2022), HMMVED, MTFM, and MMRA (Zhong et al. 2024). More details about the baseline models are provided in the Appendix, and all baseline code is included in the code appendix.

Performance Comparison

The performance of the baseline models and our MUFM model on the dataset is presented in Table 3. The results demonstrate that MUFM consistently outperforms all baselines on the pMicroLens dataset.

These findings confirm the value of constructing a multi-modal pipeline integrated with user feedback to boost prediction accuracy. Notably, compared to the current state-of-the-art model MMRA, our approach shows significant improvements by incorporating user-targeted recommendations via SASRec and item-targeted reaction sequences via MHP. The additional learned meta-knowledge in our framework allows for a more nuanced decoupling and representation of the factors driving micro-video popularity, which naturally leads to superior performance.

Table 3: Performance comparison among baselines and MUFM
Model      nMSE      SRC       MAE      PLCC
CBAN 0.8705 0.2780 27.771 0.3808
CLSTM 0.8868 0.3758 27.326 0.3608
HMMVED 0.9076 0.2726 28.342 0.2586
MASSL 0.9436 0.2920 26.984 0.2917
SVR 1.0445 0.3147 26.848 0.2685
MFTM 0.8985 0.3143 27.935 0.3207
Hyfea 0.9050 0.3059 27.928 0.3085
TMALL 0.8729 0.3353 26.297 0.3458
MMRA 0.9059 0.4305 25.498 0.4323
MUFM 0.8269 0.5308 24.643 0.4919

\ast: MMRA is the current state-of-the-art model.

Hyper-parameter Analysis

After initial adjustments, the model's performance showed low sensitivity to $wdecay$, $batch\_size$, and $dropout$. Therefore, we focused on optimizing and analyzing the other key parameters: $K$, $S$, $\alpha$, and $lr$.

As shown in Figure 4, aggregating more videos in the retrieval process enables the model to pay more attention to the characteristics shared by similar videos and obtain more effective information, which ultimately improves the performance of MUFM; the optimal value for $S$ is 20. In terms of $lr$, increasing the learning rate requires more epochs to learn the information and achieve the lowest nMSE. The optimal learning rate is found to be 1e-4.

Figure 5 illustrates the best nMSE and SRC obtained when training with different values of $K$ and $\alpha$. The model's performance is relatively insensitive both to the number of frames captured ($K$) and to the balance between positive and negative attention ($\alpha$). MUFM performed best with $K=10$ and $\alpha=0.8$.

Overall, our model demonstrated strong robustness once the value of $S$ was optimized. The best-performing hyper-parameters are $[lr,\ batch\_size,\ wdecay,\ dropout,\ \alpha,\ K,\ S]=[1\mathrm{e}{-4},\ 64,\ 0.001,\ 0,\ 0.2,\ 10,\ 20]$.

Figure 4: nMSE when training with different $lr$ or $S$
Figure 5: nMSE when training with different $K$ or $\alpha$

Ablation Study

In this section, we conduct an ablation study on MUFM to assess the impact of three critical components. We create the following variants for evaluation.

  • noSAS: This variant removes the SASRec, thereby eliminating user-targeted recommendations and leaving MHP as the sole recommendation mechanism.

  • noMamba: In this variant, the Mamba model in MHP is removed, eliminating the context encoder and dependency capture provided by selective state space models (SSMs). The predicted recommendation likelihoods are therefore obtained from the user reaction sequence using only the traditional Hawkes Process.

  • noHP: This variant excludes the Hawkes Process (HP) methodology. The predictive algorithm does not follow the Hawkes Process model but relies solely on Mamba itself, which prevents modeling self-excitation or mutual inhibition between events.

In this analysis, we selected the top 50, 100, and 200 videos in the test samples with the highest ground truth popularity scores, as well as the bottom 50, 100, and 200 videos with the lowest scores. We then reported the average predicted popularity scores based on the number of comments. Additionally, we evaluated the entire test dataset using metrics such as nMSE, SRC, and MAE across four different model variants.

Table 4: Ablation study among four variants: MUFM, noSAS, noMamba and noHP
Prediction noSAS noMamba noHP MUFM
Top50 83.456 83.389 76.389 83.433
Top100 70.690 69.565 65.573 71.349
Top200 65.111 62.445 60.835 65.869
Bottom200 19.714 29.030 31.332 17.940
Bottom100 15.638 26.547 28.565 12.493
Bottom50 12.409 25.855 27.921 10.865
nMSE 0.8231 0.8474 0.8756 0.8219
SRC 0.5216 0.4438 0.4214 0.5308
MAE 24.890 25.882 26.498 24.643

As shown in Table 4, the predicted popularity scores for the six groups followed the order Top50 > Top100 > Top200 > Bottom200 > Bottom100 > Bottom50, indicating that the prediction model behaves as expected. Furthermore, compared to the actual number of comments in each group, MUFM achieved the best performance in most cases. As for the evaluation metrics, MUFM outperformed the other three variants in nMSE, SRC, and MAE, validating the effectiveness of incorporating SASRec and MHP.

Notably, removing either the Mamba architecture or the Hawkes Process leads to a significant decline in performance. This finding suggests that the mutual inhibition and the context dependency captured between events both contribute to extracting effective meta-knowledge from the user reaction sequence. When the Mamba architecture is combined with the Hawkes process, the model's ability to capture critical information and its interpretability are both enhanced, improving overall performance.

This is a meaningful indication that recommendation systems driven by user reactions influence video popularity more strongly than previously understood.

Conclusion

In this work, we propose MUFM, a multi-modal model for MVPP. MUFM introduces a retrieval-augmented framework and a Hawkes process model for the MVPP task. We align the visual, audio, and textual modalities to find similar videos. Cross-modal bipolar interactions are applied to address inconsistent information between modalities, together with a retrieval interaction enhancement method that captures meaningful knowledge from relevant instances. The Mamba-based Hawkes process model captures user behavior and improves the model's precision. Experiments on a real-world micro-video dataset show that our method is effective and outperforms the current state-of-the-art in MVPP.

References

  • Chen et al. (2016) Chen, J.; Song, X.; Nie, L.; Wang, X.; Zhang, H.; and Chua, T.-S. 2016. Micro Tells Macro: Predicting the Popularity of Micro-Videos via a Transductive Model. In Proceedings of the 24th ACM International Conference on Multimedia, MM ’16, 898–907. New York, NY, USA: Association for Computing Machinery. ISBN 9781450336031.
  • Devlin et al. (2019) Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805.
  • Dosovitskiy et al. (2021) Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; Uszkoreit, J.; and Houlsby, N. 2021. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv:2010.11929.
  • Gao, Dai, and Hu (2024) Gao, A.; Dai, S.; and Hu, Y. 2024. Mamba Hawkes Process. arXiv preprint arXiv:2407.05302.
  • Geva et al. (2021) Geva, M.; Schuster, R.; Berant, J.; and Levy, O. 2021. Transformer Feed-Forward Layers Are Key-Value Memories. arXiv:2012.14913.
  • Ghosh et al. (2016) Ghosh, S.; Vinyals, O.; Strope, B.; Roy, S.; Dean, T.; and Heck, L. 2016. Contextual LSTM (CLSTM) models for Large scale NLP tasks. arXiv:1602.06291.
  • Gong, Chung, and Glass (2021) Gong, Y.; Chung, Y.-A.; and Glass, J. 2021. AST: Audio Spectrogram Transformer. arXiv:2104.01778.
  • Hawkes (1971) Hawkes, A. G. 1971. Spectra of some self-exciting and mutually exciting point processes. Biometrika, 58(1): 83–90.
  • He et al. (2015) He, K.; Zhang, X.; Ren, S.; and Sun, J. 2015. Deep Residual Learning for Image Recognition. arXiv:1512.03385.
  • hin Cheung and man Lam (2022) hin Cheung, T.; and man Lam, K. 2022. Crossmodal bipolar attention for multimodal classification on social media. Neurocomputing, 514: 1–12.
  • Jones, Schonlau, and Welch (1998) Jones, D. R.; Schonlau, M.; and Welch, W. J. 1998. Efficient Global Optimization of Expensive Black-Box Functions.
  • Khosla, Das Sarma, and Hamid (2014) Khosla, A.; Das Sarma, A.; and Hamid, R. 2014. What makes an image popular? In Proceedings of the 23rd International Conference on World Wide Web, WWW ’14, 867–876. New York, NY, USA: Association for Computing Machinery. ISBN 9781450327442.
  • Lai, Zhang, and Zhang (2020) Lai, X.; Zhang, Y.; and Zhang, W. 2020. HyFea: Winning Solution to Social Media Popularity Prediction for Multimedia Grand Challenge 2020. In Proceedings of the 28th ACM International Conference on Multimedia, MM ’20, 4565–4569. New York, NY, USA: Association for Computing Machinery. ISBN 9781450379885.
  • Laub, Taimre, and Pollett (2015) Laub, P. J.; Taimre, T.; and Pollett, P. K. 2015. Hawkes processes. arXiv preprint arXiv:1507.02822.
  • Li et al. (2013) Li, H.; Ma, X.; Wang, F.; Liu, J.; and Xu, K. 2013. On popularity prediction of videos shared in online social networks. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, CIKM ’13, 169–178. New York, NY, USA: Association for Computing Machinery. ISBN 9781450322638.
  • Li et al. (2023) Li, J.; Li, D.; Savarese, S.; and Hoi, S. 2023. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. arXiv:2301.12597.
  • Li and Li (2024) Li, X.; and Li, J. 2024. AnglE-optimized Text Embeddings. arXiv:2309.12871.
  • Ni et al. (2023) Ni, Y.; Cheng, Y.; Liu, X.; Fu, J.; Li, Y.; He, X.; Zhang, Y.; and Yuan, F. 2023. A Content-Driven Micro-Video Recommendation Dataset at Scale. arXiv:2309.15379.
  • Rizoiu et al. (2017) Rizoiu, M.-A.; Lee, Y.; Mishra, S.; and Xie, L. 2017. Hawkes processes for events in social media, 191–218. Association for Computing Machinery and Morgan & Claypool. ISBN 9781970001075.
  • Roy et al. (2013) Roy, S. D.; Mei, T.; Zeng, W.; and Li, S. 2013. Towards Cross-Domain Learning for Social Video Popularity Prediction. IEEE Transactions on Multimedia, 15(6): 1255–1267.
  • Sun and Lu (2020) Sun, X.; and Lu, W. 2020. Understanding Attention for Text Classification. In Jurafsky, D.; Chai, J.; Schluter, N.; and Tetreault, J., eds., Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 3418–3428. Online: Association for Computational Linguistics.
  • Wu et al. (2016) Wu, B.; Mei, T.; Cheng, W.-H.; and Zhang, Y. 2016. Unfolding Temporal Dynamics: Predicting Social Media Popularity Using Multi-scale Temporal Decomposition. Proceedings of the AAAI Conference on Artificial Intelligence, 30(1).
  • Wu et al. (2024) Wu, Y.; Chen, K.; Zhang, T.; Hui, Y.; Nezhurina, M.; Berg-Kirkpatrick, T.; and Dubnov, S. 2024. Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation. arXiv:2211.06687.
  • Xie, Zhu, and Chen (2023) Xie, J.; Zhu, Y.; and Chen, Z. 2023. Micro-Video Popularity Prediction Via Multimodal Variational Information Bottleneck. IEEE Transactions on Multimedia, 25: 24–37.
  • Zhang et al. (2023) Zhang, Z.; Xu, S.; Guo, L.; and Lian, W. 2023. Multi-modal Variational Auto-Encoder Model for Micro-video Popularity Prediction. In Proceedings of the 8th International Conference on Communication and Information Processing, ICCIP ’22, 9–16. New York, NY, USA: Association for Computing Machinery. ISBN 9781450397100.
  • Zhao et al. (2024) Zhao, L.; Li, Y.; Chen, X.; Sun, L.; and Xue, Z. 2024. MFTM-Informer: A multi-step prediction model based on multivariate fuzzy trend matching and Informer.
  • Zhong et al. (2024) Zhong, T.; Lang, J.; Zhang, Y.; Cheng, Z.; Zhang, K.; and Zhou, F. 2024. Predicting Micro-video Popularity via Multi-modal Retrieval Augmentation. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’24, 2579–2583. New York, NY, USA: Association for Computing Machinery. ISBN 9798400704314.
  • Zhou, Zha, and Song (2013) Zhou, K.; Zha, H.; and Song, L. 2013. Learning social infectivity in sparse low-rank networks using multi-dimensional hawkes processes. In Artificial intelligence and statistics, 641–649. PMLR.

Appendix 1: Micro Video Retrieval

The micro video memory repository is a collection of reference pairs, denoted as $\langle frames, text\rangle$, with each element encoded for efficient retrieval. To generate a retrieval vector for a given micro video $V_{i}$, we begin by using BLIP-2, a pre-trained large language model, to analyze the video's content and produce descriptive captions for its frames, resulting in the set $C_{i}=\{c^{i}_{1},\dots,c^{i}_{L}\}$, where $L$ represents the total number of frames in video $V_{i}$.

For the audio component, we apply a similar process using the CLAP (Contrastive Language-Audio Pretraining) model, which generates descriptive textual annotations for the audio, denoted as $A_{i}$. These annotations are derived from a predefined set of audio descriptions. The resulting synthetic captions $C_{i}$ are then combined with the audio descriptions $A_{i}$ and the original textual descriptions $T_{i}$ to form a comprehensive text prompt. This is achieved using the concatenation operator $\oplus$, leading to

$P_{i}=C_{i}\oplus A_{i}\oplus T_{i}.$ (23)

This synthesized text prompt $P_{i}$ is then processed through a pre-trained semantic extraction model, specifically the UAE-Large architecture, to generate a retrieval vector $R_{i}$ that encapsulates the key attributes of video $V_{i}$. This process is repeated for each micro video in the memory repository, assigning each one a unique retrieval vector that serves as its identifier within the system.

This method unifies the visual, audio, and textual modalities into a cohesive representation, enhancing the model’s ability to predict micro video popularity. Additionally, it improves the accuracy and efficiency of content retrieval and analysis in the domain of micro form video dissemination.

The Bank Retriever mechanism is designed to identify the Top-$S$ most similar videos from a memory bank by evaluating the similarity scores between the target video and stored instances. The process begins by receiving a query video $V_{q}=\{v_{1}^{q},v_{2}^{q},\dots,v_{S}^{q}\}$ and searching through the memory bank, which holds a collection of frame-text pairs. The retriever employs a maximum inner product search algorithm to scan the memory bank, selecting the most relevant videos based on their similarity to the query.
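As a hedged sketch of this retrieval step, the snippet below builds an exact inner-product index over the memory bank with FAISS (any maximum inner product search library would do); the dimension and the random data are placeholders.

```python
import numpy as np
import faiss  # exact / approximate maximum inner product search

d = 1024                                            # retrieval-vector dimension (placeholder)
bank = np.random.rand(10000, d).astype("float32")   # encoded <frames, text> repository
index = faiss.IndexFlatIP(d)                        # exact inner-product index
index.add(bank)

query = np.random.rand(1, d).astype("float32")      # retrieval vector R_i of the target video
scores, top_s = index.search(query, 20)             # Top-S most similar videos (S = 20)
```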

Appendix 2: Multimodal Process

To capture the key features of a micro video $V_{i}$, we extract visual, audio, and textual information. We start by selecting a series of key frames, denoted as $F_{1}^{i},F_{2}^{i},\dots,F_{K}^{i}$, where $K$ represents the total number of frames.

For visual feature extraction, we apply a pre-trained visual model. Each frame $F_{j}^{i}\in\mathbb{R}^{H\times W\times C}$ is divided into fixed-size patches and reshaped into a sequence of flattened 2D patches ${F_{j}^{i}}^{\ast}\in\mathbb{R}^{N\times P^{2}\times C}$, where $(H,W)$ is the resolution of the original frame, $C$ is the number of channels, $(P,P)$ is the patch size, and $N$ is the number of patches, which serves as the input sequence length for the Vision Transformer (ViT).

Figure 6: Multimodal Feature Extraction and Embedding

For audio features, we follow a similar approach using the Audio Spectrogram Transformer (AST). We first align the timestamps of the sampled frames with corresponding audio segments $W=\{w_{1},w_{2},\dots,w_{K}\}$ extracted from the video's soundtrack. These audio segments are processed using a Fast Fourier Transform (FFT) to generate Mel-frequency spectral coefficients (MFCC), resulting in a spectrum matrix $M_{S}^{n_{mel}\times n_{a}}$, where $n_{mel}$ is the number of mel-filters and $n_{a}$ is the number of time frames, determined by the audio length and stride. This spectrum matrix, treated as a single-channel image, is then processed using the same patching method applied to visual frames.
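A minimal sketch of producing this spectrum matrix for one audio segment is shown below, using torchaudio's mel-spectrogram transform as a stand-in; the file name, sample rate, and filterbank sizes are illustrative assumptions.

```python
import torchaudio

# Hypothetical segment file; n_fft / hop_length / n_mels are placeholders.
waveform, sr = torchaudio.load("segment_w1.wav")
to_mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sr, n_fft=400, hop_length=160, n_mels=128)
spectrum = to_mel(waveform)   # (channels, n_mels, n_a): the matrix M_S, patched like an image
```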

Next, we linearly embed each visual and audio patch, adding sequential position embeddings to capture positional relationships within the frames. The resulting $K$ patch sequences are then processed by ViT and AST, producing visual and audio representations $E_{i}^{v}$ and $E_{i}^{a}$, respectively. These feature matrices are concatenated to emphasize the relationship between visual and audio information:

$E_{i}^{v,a}=E_{i}^{v}\oplus E_{i}^{a}\in\mathbb{R}^{K\times(d_{v}+d_{a})},$ (24)

where $\oplus$ denotes the concatenation operation, combining visual and audio features. Here, $d_{v}$ and $d_{a}$ represent the dimensions of the visual and audio features, respectively. The concatenated features $E_{i}^{v,a}$ are then passed through a linear layer $W_{c}\in\mathbb{R}^{(d_{v}+d_{a})\times d}$ with a ReLU activation function, generating an audio-visual input token $X^{c}$:

$X^{c}\in\mathbb{R}^{K\times d}:\ X_{i}^{c}=\mathrm{ReLU}(E_{i}^{v,a}W_{c}).$ (25)

For textual information, we feed the micro video's textual description into the pre-trained language model AnglE to produce a sequence of textual embeddings $E^{t}_{i}\in\mathbb{R}^{n_{w}\times d_{t}}$, where $n_{w}$ is the word length of the textual description. The textual embedding $E^{t}_{i}$ is then passed through a linear layer $W_{t}\in\mathbb{R}^{d_{t}\times d}$, generating the textual input token $X^{t}$:

$X^{t}\in\mathbb{R}^{n_{w}\times d}:\ X_{i}^{t}=\mathrm{ReLU}(E_{i}^{t}W_{t}).$ (26)

Appendix 3: Cross-modal Bipolar Interaction

Aligning audio-visual and textual modalities in micro videos presents challenges due to potential inconsistencies between textual descriptions and video content. To address this, we implement a cross-attention network comprising both positive and negative attention modules, designed to capture the similarities and differences between multi-modal information. The positive attention module focuses on identifying the most consistent features across different modalities, while the negative attention module highlights any inconsistent or contradictory information.

Within the positive attention module, the most aligned features between modalities are calculated using cross-modal attention vectors. For a given video $v_{i}$, the audio-visually guided positive textual features $T_{i}^{\mathcal{P}}$ and the textually guided positive audio-visual features $C_{i}^{\mathcal{P}}$ are derived as follows:

$T_{i}^{\mathcal{P}} =ATT^{\mathcal{P}}\left(X_{i}^{c}W^{\mathcal{Q}}_{\mathcal{P}},\ X_{i}^{t}W^{\mathcal{K}}_{\mathcal{P}},\ X_{i}^{t}W^{\mathcal{C}}_{\mathcal{P}}\right)$ (27)
$=\mathrm{Softmax}\left(\alpha\frac{\mathcal{Q}\mathcal{K}^{T}}{\sqrt{d}}\right)\mathcal{C},$ (28)
$C_{i}^{\mathcal{P}} =ATT^{\mathcal{P}}\left(X_{i}^{t}W^{\mathcal{Q}}_{\mathcal{P}},\ X_{i}^{c}W^{\mathcal{K}}_{\mathcal{P}},\ X_{i}^{c}W^{\mathcal{T}}_{\mathcal{P}}\right)$ (29)
$=\mathrm{Softmax}\left(\alpha\frac{\mathcal{Q}\mathcal{K}^{C}}{\sqrt{d}}\right)\mathcal{T},$ (30)

where $W^{\mathcal{Q}}_{\mathcal{P}},W^{\mathcal{K}}_{\mathcal{P}},W^{\mathcal{C}}_{\mathcal{P}},W^{\mathcal{T}}_{\mathcal{P}}$ denote the query, key, and two value projection matrices. The parameter $\alpha$ is used to balance the influence of positive and negative attention. Similarly, the positive audio-visual features guided by textual information can be obtained using the same method.

Negative attention is designed to highlight inconsistencies across different modalities. These negative attention scores are calculated using scaled dot-product attention, with a negative constant applied before the Softmax function. The process can be summarized as follows:

$T_{i}^{\mathcal{N}} =ATT^{\mathcal{N}}\left(X_{i}^{c}W^{\mathcal{Q}}_{\mathcal{N}},\ X_{i}^{t}W^{\mathcal{K}}_{\mathcal{N}},\ X_{i}^{t}W^{\mathcal{C}}_{\mathcal{N}}\right)$ (31)
$=\mathrm{Softmax}\left(\gamma\frac{\mathcal{Q}\mathcal{K}^{T}}{\sqrt{d}}\right)\mathcal{C},$ (32)
$C_{i}^{\mathcal{N}} =ATT^{\mathcal{N}}\left(X_{i}^{t}W^{\mathcal{Q}}_{\mathcal{N}},\ X_{i}^{c}W^{\mathcal{K}}_{\mathcal{N}},\ X_{i}^{c}W^{\mathcal{T}}_{\mathcal{N}}\right)$ (33)
$=\mathrm{Softmax}\left(\gamma\frac{\mathcal{Q}\mathcal{K}^{C}}{\sqrt{d}}\right)\mathcal{T},$ (34)

where $W^{\mathcal{Q}}_{\mathcal{N}},W^{\mathcal{K}}_{\mathcal{N}},W^{\mathcal{C}}_{\mathcal{N}},W^{\mathcal{T}}_{\mathcal{N}}$ denote the query, key, and two value projection matrices, respectively, and $\gamma$ is defined as $\gamma=-(1-\alpha)$. Following the hidden-state integration of the MMRA model, we incorporate audio-visual hidden states into textual hidden states within the FFN layers to generate comprehensive textual and audio-visual modal representations $\widetilde{T_{i}}$ and $\widetilde{C_{i}}$, which modify the FFN computation as follows:

$\widetilde{T_{i}} =\mathrm{ReLU}\left(X_{i}^{t}+\left(C_{i}^{\mathcal{P}}\oplus C_{i}^{\mathcal{N}}\right)W^{t}_{1}\right)W^{t}_{2},$ (35)
$\widetilde{C_{i}} =\mathrm{ReLU}\left(X_{i}^{c}+\left(T_{i}^{\mathcal{P}}\oplus T_{i}^{\mathcal{N}}\right)W^{c}_{1}\right)W^{c}_{2},$ (36)

where $W^{t}_{1},W^{c}_{1}\in\mathbb{R}^{2d\times d}$ and $W^{t}_{2},W^{c}_{2}\in\mathbb{R}^{d\times d}$ are learnable weight matrices, and $\oplus$ denotes the concatenation operation. Finally, we obtain the expressive representations $T_{i}$ and $C_{i}$ via the attentive pooling strategy.

This dual-path approach ensures a synthesis of audio-visual and textual modalities, each subjected to a consistent and rigorous extraction process. This method not only captures the intrinsic characteristics of micro videos but also lays a solid foundation for subsequent multifaceted analyses and applications, thereby enhancing the depth and breadth of our understanding of video content.

Figure 7: Cross-modal Attention Structure

Appendix 4: Retrieval Interaction Enhancement

We focus on extracting valuable insights from the retrieved relevant instances to improve micro video popularity prediction (MVPP). To achieve this, we use an aggregation function that combines the comprehensive representations of these instances from the memory bank. This function uses the normalized similarity scores obtained during retrieval as attention weights to construct the retrieved embeddings $X_{i}^{r,c}$ and $X_{i}^{r,t}$. Instances with higher similarity scores are prioritized, highlighting their relevance to the target micro video.

Similarly, we calculate $\widetilde{T_{i}^{r}}$ and $\widetilde{C_{i}^{r}}$, and then derive the final representations $T_{i}^{r}$ and $C_{i}^{r}$ using an attentive pooling strategy. To capture the popularity trends of the retrieved instances, we encode the label information through linear layers and aggregation, resulting in the retrieved label embedding $L_{i}^{r}$. Finally, AUG-MMRA integrates all features from $T_{i}^{r}$ and $C_{i}^{r}$ to model cross-sample interactions. These feature interactions are constructed as

$\mathcal{I}=[\Phi(C_{i},C_{i}^{r}),\Phi(C_{i},T_{i}^{r}),\dots,\Phi(T_{i},L_{i}^{r})],$ (37)

where $\Phi$ denotes the inner product.

Appendix 5: Multi-modal Information Based SASRec

In our micro-video popularity prediction model, we adopt SASRec (Self-Attention Sequence Recommender), a Transformer-based architecture, to capture users’ sequential behavior patterns. SASRec is crucial for modeling both short-term and long-term user preferences, incorporating multimodal item features to enhance recommendation accuracy.

Our model processes various modalities, including video frames ($F_{i}$), cover images ($I_{i}$), and titles ($T_{i}$), which are extracted using pre-trained encoders. The user interaction histories are fed into the SASRec module, where self-attention layers encode sequential user behaviors, producing a sequence-based representation of user-item interactions. Simultaneously, the multimodal features are integrated to construct the item scoring representation item_scoring. The user embedding $U_{i}$, which encapsulates the user's behavior sequence, is then combined with this multimodal item representation.

For each user, we calculate a score for each item by computing the dot product between the user embedding $U_{i}$ and the multimodal item scoring representation item_scoring:

$\text{score}_{j}=U_{i}\cdot\text{item\_scoring}$ (38)

The final score for an item across all users is obtained by averaging these individual scores:

$\text{avg\_score}_{j}=\frac{1}{N}\sum_{i=1}^{N}\text{score}_{j}$ (39)

where $N$ represents the total number of users. These averaged scores are converted into probabilities using a sigmoid function to predict the likelihood of interaction:

$\text{Score}_{\text{SAS}}=\sigma(\text{avg\_score}_{j})$ (40)

The model is trained by minimizing the cross-entropy loss between the predicted scores and actual user interactions (positive and negative samples). By integrating multimodal item features (video, image, text), SASRec provides a rich representation of user-item interactions. The model optimizes its predictions through backpropagation, adjusting parameters based on user behavior sequences.

Appendix 6: Mamba Architecture

Denote the sequence $\mathcal{S}=\{(t_{1},r_{1}),(t_{2},r_{2}),\dots,(t_{n},r_{n})\}$ as the temporal users' reaction sequence, where $R$ is the number of event types. Let $\Delta_{i}=t_{i}-t_{i-1}$ represent the temporal differences, with $\Delta_{1}=t_{1}$ by convention, so the temporal sequence is represented by

$\Delta=(\Delta_{1},\Delta_{2},\dots,\Delta_{n}).$

Additionally, let $\mathbf{r}_{i}$ be the one-hot vector of the events. Inspired by the discretization

$h(t+\Delta t)=\overline{\mathbf{A}}h(t)+\overline{\mathbf{B}}u(t),$ (41)

which relates hidden states separated by a time gap $\Delta t$, we insert the temporal differences directly into this equation to construct our Mamba Hawkes Process (MHP) structure. We now state the construction.

Algorithm 1: The architecture of Mamba Hawkes Process

Input: The temporal sequence $\{(t_{1},r_{1}),(t_{2},r_{2}),\dots,(t_{n},r_{n})\}$
Output: The hidden state $\mathbf{h}$

1: $A:(\mathcal{B},\mathcal{L},\mathcal{D})\leftarrow\text{Parameter}$ // $N\times N$ matrix
2: $B:(\mathcal{L},\mathcal{D})\leftarrow\text{Linear}_{B}(x_{t_{i}})$
3: $C:(\mathcal{L},\mathcal{D})\leftarrow\text{Linear}_{C}(x_{t_{i}})$ // $B$ and $C$ are time dependent
4: $\Delta\leftarrow t_{i}-t_{i-1}$ // $\Delta$ records the temporal information
5: $\bar{A},\bar{B}\leftarrow\text{Discretize}(A,B,\Delta)$
6: $\mathbf{h}\leftarrow\mathbf{SSM}(\bar{A},\bar{B},C,\Delta)(x)$
7: return $\mathbf{h}$

$^{\ast}$ $\mathcal{B}$: batch size, $\mathcal{L}$: sequence length, $\mathcal{D}$: input vector size

Let WeW^{e} be the event embedding matrix with dimensions D×RD\times R, where DD is the dimension of the hidden layers of Mamba blocks. The event embedding is defined as xti=𝐫i(We)Tx_{t_{i}}=\mathbf{r}_{i}(W^{e})^{T}

(xt1,xt2,,xtn)=(𝐫1,𝐫2,,𝐫n)(We)T(x_{t_{1}},x_{t_{2}},...,x_{t_{n}})=(\mathbf{r}_{1},\mathbf{r}_{2},...,\mathbf{r}_{n})(W^{e})^{T}

In the Mamba architecture, the matrices Δ,𝐁\Delta,\mathbf{B} and 𝐂\mathbf{C} are time-dependent and are obtained by linear projection from xtx_{t}. However, for the Hawkes Process, the approach is different, as it requires the use of temporal features. Specifically, we make 𝐁\mathbf{B} and 𝐂\mathbf{C} time-dependent, and Δ=(Δ1,Δ2,,Δn)\Delta=(\Delta_{1},\Delta_{2},...,\Delta_{n}) is defined by the temporal differences as above. Following Equation 41, we have Δi=titi1\Delta_{i}=t_{i}-t_{i-1}, thus we replace tt and t+Δtt+\Delta t by ti1t_{i-1} and tit_{i}. We define

𝐁(ti)=Linear𝐁(xti),𝐂(ti)=Linear𝐂(xti)\mathbf{B}(t_{i})=\text{Linear}_{\mathbf{B}}(x_{t_{i}}),\mathbf{C}(t_{i})=\text{Linear}_{\mathbf{C}}(x_{t_{i}})

which are obtained by linear transformations of the vector x_{t_{i}}. The transition formulas for the Mamba Hawkes process are:

z_{t_{i}} = \overline{\mathbf{A}}(t_{i})\,z_{t_{i-1}}+\overline{\mathbf{B}}(t_{i})\,x_{t_{i}}, \qquad (42)
y_{t_{i}} = \mathbf{C}(t_{i})\,z_{t_{i}},

where

\overline{\mathbf{A}}(t_{i}) = \exp(\Delta_{i}\mathbf{A}), \qquad (43)
\overline{\mathbf{B}}(t_{i}) = (\Delta_{i}\mathbf{A})^{-1}\left(\exp(\Delta_{i}\mathbf{A})-\mathbf{I}\right)\cdot\Delta_{i}\mathbf{B}(t_{i}),

are the time-dependent coefficients. Hence, the temporal information is incorporated into our recurrence process.
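To make Equations 42 and 43 concrete, here is a minimal sketch assuming (as is standard in Mamba) a diagonal state matrix A, so that \exp(\Delta_{i}\mathbf{A}) reduces to an elementwise exponential; the names A_diag, B, C, x, and delta are illustrative placeholders rather than the authors' code.

```python
import torch

def mhp_recurrence(A_diag, B, C, x, delta):
    """Recurrence of Eqs. 42-43 with a diagonal state matrix A.

    A_diag : (D_state,)   diagonal entries of A (typically negative for stability)
    B, C   : (n, D_state) time-dependent projections B(t_i), C(t_i)
    x      : (n,)         input x_{t_i} (a single channel for brevity)
    delta  : (n,)         temporal differences Delta_i = t_i - t_{i-1}
    """
    n, d = B.shape
    z = torch.zeros(d)
    outputs = []
    for i in range(n):
        A_bar = torch.exp(delta[i] * A_diag)            # exp(Delta_i A)
        B_bar = (A_bar - 1.0) / A_diag * B[i]           # (Delta_i A)^{-1}(exp(Delta_i A) - I) Delta_i B(t_i)
        z = A_bar * z + B_bar * x[i]                    # Eq. 42: state update z_{t_i}
        outputs.append((C[i] * z).sum())                # Eq. 42: output y_{t_i} = C(t_i) z_{t_i}
    return torch.stack(outputs)
```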

We then obtain the final output, denoted \mathbf{O}=(\mathbf{o}_{1},\mathbf{o}_{2},\dots,\mathbf{o}_{n}). The output \mathbf{O} is fed through a neural network, generating hidden representations \mathbf{h}(t) of the input event sequence:

\mathbf{H}=\mathrm{ReLU}(\mathbf{O}W_{1}+b_{1})W_{2}+b_{2},\qquad\mathbf{h}(t_{j})=\mathbf{H}(j,:) \qquad (44)

where W_{i}, b_{i}\ (i=1,2) are the parameters of the neural network. The resulting matrix \mathbf{H} contains hidden representations of all the events in the input sequence, where each row corresponds to a particular event.
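A minimal sketch of Equation 44 (with placeholder layer sizes, not those used in the paper) is simply a two-layer feed-forward network applied row-wise to \mathbf{O}:

```python
import torch
import torch.nn as nn

d_model, d_hidden = 64, 128            # placeholder sizes; not specified above

ffn = nn.Sequential(
    nn.Linear(d_model, d_hidden),      # O W_1 + b_1
    nn.ReLU(),
    nn.Linear(d_hidden, d_model),      # (.) W_2 + b_2
)

O = torch.randn(10, d_model)           # illustrative output sequence with n = 10 events
H = ffn(O)                             # H[j, :] corresponds to h(t_j)
```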

To avoid “peeking into the future”, our algorithm is equipped with masks. That is, when computing the output \mathbf{h}(t_{j})=\mathbf{H}(j,:) (the j-th row of \mathbf{H}), we mask all future positions; in practice, we use only the first 5 user-item interactions for training. This prevents the model from assigning dependencies to future events.

The intensity function of the Mamba Hawkes process is given by

\lambda(t)=\sum_{r=1}^{R}\lambda_{r}(t), \qquad (45)

where \lambda_{r} is the intensity function of the r-th event type, and

\lambda_{r}(t)=f_{r}\!\left(\alpha_{r}(t-t_{j})+\mathbf{w}_{r}^{T}\mathbf{h}(t_{j})+b_{r}\right), \qquad (46)

where t is defined on the interval t\in[t_{j},t_{j+1}), and f_{r}(x)=\beta_{r}\log\left(1+\exp(x/\beta_{r})\right) is the Softplus function. The log-likelihood of a sequence \mathcal{S} is given by

\ell(\mathcal{S})=\underbrace{\sum_{j=1}^{n}\log\lambda(t_{j}\,|\,\mathcal{H}_{j})}_{\text{event log-likelihood}}-\underbrace{\int_{t_{1}}^{t_{n}}\lambda(t\,|\,\mathcal{H}_{t})\,dt}_{\text{non-event log-likelihood}}. \qquad (47)
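The non-event integral in Equation 47 generally has no closed form; a common choice for Hawkes-type models is a Monte Carlo estimate. The sketch below illustrates Equations 45-47 under that assumption, with placeholder names (alpha, w, b, beta, lam_fn) that are not taken from the released code.

```python
import torch
import torch.nn.functional as F

def intensities(t, t_j, h_tj, alpha, w, b, beta):
    """Eqs. 45-46: per-type intensities lambda_r(t) for t in [t_j, t_{j+1})."""
    z = alpha * (t - t_j) + w @ h_tj + b       # argument of f_r for each event type r
    lam_r = beta * F.softplus(z / beta)        # f_r(x) = beta_r log(1 + exp(x / beta_r))
    return lam_r, lam_r.sum()                  # lambda_r(t) and total lambda(t)

def log_likelihood(event_times, lam_at_events, lam_fn, n_mc=100):
    """Eq. 47: event term minus a Monte Carlo estimate of the non-event integral."""
    event_term = torch.log(lam_at_events).sum()
    t1, tn = event_times[0], event_times[-1]
    samples = t1 + (tn - t1) * torch.rand(n_mc)                      # uniform samples in [t_1, t_n]
    integral = (tn - t1) * torch.stack([lam_fn(t) for t in samples]).mean()
    return event_term - integral
```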

Log-likelihood function

Here, the log-likelihood function plays two important roles:

  • as the loss function for our pretrained model

  • as a measure for evaluating the reaction ratio of users

We scraped a total of 19,000 user interaction records (i.e., likes, comments, shares, etc.) from 275 micro-videos on TikTok. These interaction sequences

\mathcal{S}_{i}(\text{type},\text{timestamp})\quad(i=1,2,3,\dots,N)

which are timestamped, are used to train the Mamba Hawkes process module. Model parameters are learned by maximizing the log-likelihood across all sequences. Concretely, given N sequences \mathcal{S}_{1},\mathcal{S}_{2},\ldots,\mathcal{S}_{N}, the goal is to find parameters that solve

\max\ \sum_{i=1}^{N}\ell(\mathcal{S}_{i}). \qquad (48)

This optimization problem can be solved efficiently with stochastic gradient algorithms such as Adam. Additionally, techniques that help stabilize training, such as layer normalization and residual connections, are also applied.
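A minimal training sketch for Equation 48, assuming a hypothetical model object that returns \ell(\mathcal{S}_{i}) for one interaction sequence, could look as follows (Adam minimizes the negative total log-likelihood):

```python
import torch

def train_mhp(model, sequences, epochs=10, lr=1e-3):
    """Maximize the summed log-likelihood (Eq. 48) by minimizing its negative.

    `model` and `sequences` are placeholders: model(seq) is assumed to return
    the log-likelihood ell(S_i) of Eq. 47 for one interaction sequence.
    """
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for seq in sequences:
            loss = -model(seq)         # negative log-likelihood of one sequence
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```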

We assume that the TikTok platform has a stable recommendation algorithm \mathcal{A} that adjusts the recommendation strength of a video i based on the frequency and intensity of user feedback, among other factors, which we denote \mathbf{H}_{i}. The resulting recommendation intensity \mathbf{I}_{i}=\mathcal{A}(\mathbf{H}_{i}), in turn, affects the feedback strength of other users regarding the video. Therefore, under the combined effect of platform recommendations and user choices, the user feedback sequence for each video exhibits a stable distribution \mathcal{F}.

After training on more than 19,000 such feedback interactions, we believe that this pre-trained model has successfully learned this stable distribution. Thus, we treat the pretrained Mamba Hawkes process as an evaluation module. For a video v_{i} in the MicroLens-100k set \mathbf{V}, we assign it the likelihood:

\text{Likelihood}_{\mathrm{MHP}}=\ell(v_{i}),\quad v_{i}\in\mathbf{V}. \qquad (49)

We use this likelihood to assign a comprehensive score, based on the user interaction sequences corresponding to videos in the MicroLens-100k dataset, that reflects the influence of platform recommendations on the popularity of short videos. We then use this score as a weighting factor, concatenating it with the multimodal information for subsequent training.
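One plausible way to realize this weighting-and-concatenation step is sketched below; the fusion details (sigmoid normalization, tensor names) are illustrative assumptions rather than the exact MUFM design.

```python
import torch

def fuse_features(multimodal_feat, mhp_loglik):
    """Scale multimodal features by a normalized MHP score and append the score.

    multimodal_feat : (B, D) fused video/audio/text representation
    mhp_loglik      : (B,)   per-video log-likelihood from the pretrained MHP
    """
    weight = torch.sigmoid(mhp_loglik).unsqueeze(-1)   # map the score into (0, 1)
    weighted = multimodal_feat * weight                # use the score as a weighting factor
    return torch.cat([weighted, weight], dim=-1)       # concatenate the score for downstream layers
```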

Appendix 7: Datasets

Table 5: Statistics of the dataset.
Dataset          #Video         #User     #Train   #Val    #Test
MicroLens-100k   19,724^{\ast}  100,000   15,779   1,973   1,972

^{\ast}: 14 videos were discarded due to file corruption.

To evaluate the effectiveness of RecMMR, we conduct experiments on the micro-video dataset MicroLens-100k; its descriptive statistics are summarized in Table 5. MicroLens-100k consists of 19,738 unique micro-videos viewed by 100,000 users from various online video platforms.

Table 6: Dataset comparison. “r-Image” refers to images with raw image pixels. “Audio” and “Video” mean the original full-length audio and video content.
Dataset      Text  r-Image  Audio  Video  #User  #Item
Tenrec        ✗      ✗       ✗      ✗     6.41M  4.11M
Flickr        ✗      ✗       ✗      ✗     8K     105K
Pinterest     ✗      ✓       ✗      ✗     46K    880K
WikiMedia     ✗      ✓       ✗      ✗     1K     10K
Behance       ✗      ✗       ✗      ✗     63K    179K
KuaiRand      ✗      ✗       ✗      ✗     27K    32.03M
KuaiRec       ✗      ✗       ✗      ✗     7K     11K
Reasoner      ✓      ✓       ✗      ✗     3K     5K
MicroLens     ✓      ✓       ✓      ✓     30M    1M

We attempted to find other datasets to enrich our experiments, but among the publicly acknowledged datasets it is currently difficult to find a well-structured multi-modal micro-video dataset at the scale of MicroLens-100k. A detailed comparison of the datasets is provided in Table 6.

Appendix 8: Baseline Models

Here we elaborate on how the baseline models work.

  • SVR: Uses Gaussian kernel-based Support Vector Regression (SVR) to predict micro-video popularity.

  • HyFea: Employs the CatBoost tree model, leveraging multiple features (image, category, space-time, user profile, tags) for accurate prediction.

  • Contextual-LSTM: Integrates contextual features into the LSTM model, capturing long-range context to improve prediction accuracy.

  • TMALL: Introduces a common space to handle modality relatedness and limitations, enhancing popularity prediction.

  • MASSL: Utilizes a multi-modal variational auto-encoder model, capturing cross-modal correlations to predict popularity.

  • MTFM: Combines fuzzy trend matching and Informer in a multi-step prediction model, forecasting popularity trends.

  • HMMVED: Extracts and fuses multi-modal features through a variational information bottleneck for better prediction.

  • CBAN: Applies cross-modal bipolar attention mechanisms, effectively capturing correlations in multi-modal data to enhance prediction.

  • MMRA: Enhances prediction by retrieving relevant instances from a multi-modal memory bank and augmenting the prediction process.