
MUFM: A Mamba-Enhanced Feedback Model for Micro Video Popularity Prediction

Jiacheng Lu1, Mingyuan Xiao1, Weijian Wang1, Yuxin Du1,
Yi Cui2, Jingnan Zhao1, Cheng Hua1
Corresponding author: [email protected]
Abstract

The surge in micro videos is transforming the concept of popularity. As researchers delve into vast multi-modal datasets, there is a growing interest in understanding the origins of this popularity and the forces driving its rapid expansion. Recent studies suggest that the virality of short videos is tied not only to their inherent multi-modal content but also to the strength of platform recommendations driven by audience feedback. In this paper, we introduce a framework for capturing long-term dependencies in user feedback and dynamic event interactions, based on the Mamba Hawkes process. Our experiments on the large-scale open-source multi-modal dataset MicroLens show that our model significantly outperforms state-of-the-art approaches across various metrics, improving on the best baseline by 23.2%. We believe our model's ability to map the relationships within user feedback behavior sequences will not only contribute to the evolution of next-generation recommendation algorithms and platform applications but also deepen our understanding of micro video dissemination and its broader societal impact.

Introduction

The widespread adoption of portable devices has significantly contributed to the success of micro video platforms like TikTok. These devices make it easy for users to share their experiences, opinions, and thoughts in various formats, such as text, images, audio, and video. The resulting increase in user participation has led to the emergence of an important research area: micro video popularity prediction (MVPP).

Figure 1: TikTok Micro Videos

We recognize that current approaches to predicting short video popularity often fall short by underutilizing the wealth of available data, particularly in overlooking the role of event-driven propagation within social networks. To address these shortcomings, we introduce the Mamba-Enhanced User Feedback Capture Model for Micro Video Popularity Prediction (MUFM), which advances popularity prediction by integrating refined recommendation systems with models that account for social network dynamics.

MUFM is designed to make better use of the diverse data linked to short videos. It starts by using a retrieval system to find relevant micro videos from a multi-modal database, filtering content based on all available information including visual, audio, and text features. To better understand user responses, we use the Mamba Hawkes process, training on a dataset of 19,000 scraped comments to show how recommendation systems and user behavior influence video popularity. Specifically, the model processes video, audio, and related text data using the Mamba Hawkes process to recreate the video’s spread and produce a recommendation index. Additionally, we use a cross-attention mechanism to detect connections between the target video and similar content, further boosting the model’s prediction performance.

Contributions: Overall, the main contributions of this work are summarized as follows:

  • We present a Mamba-based framework that captures long-term dependencies and dynamic event interactions driven by user feedback, aimed at predicting micro video popularity. Designed for robustness against noise and uncertainties, it ensures more accurate and reliable predictions.

  • We use a Hawkes Process Model, built on the Mamba architecture, to quantify the impact of user reactions. Additionally, we integrate this model with a user-focused SASRec model to analyze and infer the influence of TikTok’s recommendation system on the popularity of micro videos.

  • Our evaluation on the MicroLens-100k dataset, the largest available for this task, shows that our MUFM model outperforms the current state-of-the-art by 23.2%.

Related Works

Micro Video Popularity Prediction

Micro Video Popularity Prediction (MVPP) has been studied using a variety of approaches. One prevalent method is feature engineering, which involves designing specific features to predict popularity. While this approach is widely adopted, it relies heavily on expert knowledge and high-quality feature selection, which can limit the scalability and flexibility of the models (Li et al. 2013; Roy et al. 2013; Wu et al. 2016).

Alternatively, deep learning methods have emerged as powerful tools for modeling multi-modal data. Techniques such as HMMVED (Xie, Zhu, and Chen 2023) and MTFM (Zhao et al. 2024) harness the capabilities of neural networks like ResNet (He et al. 2015) and ViT (Dosovitskiy et al. 2021) for visual data, and BERT (Devlin et al. 2019) and AnglE (Li and Li 2024) for textual data. These models excel in capturing cross-modal correlations and predicting popularity by leveraging the strengths of different data modalities. However, these methods focus on a limited subset of video information, fail to exploit the user behavior signals that drive social network dissemination, and do not make full use of multi-modal data. As a result, the accuracy of existing methods on the MVPP task remains relatively low.

Hawkes Processes

The Hawkes Process is a self-exciting point process that defines the event intensity as a function of past events, capturing dependencies in time-series data (Hawkes 1971). The Hawkes Process is distinguished by its capability to model self-excitation, where events increase the likelihood of similar future occurrences, as well as mutual excitation, where distinct event types exert influence over one another (Laub, Taimre, and Pollett 2015). This dual functionality renders it particularly effective for applications in finance, social media analysis (Zhou, Zha, and Song 2013) (Rizoiu et al. 2017), and seismic activity modeling.

Recent advancements have enhanced the original model’s capabilities. Mamba Hawkes Process (MHP) combines the Mamba state space architecture with the Hawkes process to effectively capture long-range dependencies and dynamic interactions in temporal data (Gao, Dai, and Hu 2024). While traditional Hawkes processes model self-exciting event sequences, they often struggle with complex patterns, particularly in irregular or asynchronous data. MHP improves this by effectively modeling both history-dependent and independent components, naturally incorporating temporal differences to better predict event timing.

Preliminaries

Definitions & Problem Statement

Let $V=\{v_{1},v_{2},\dots,v_{N}\}$ represent the set of $N$ micro videos available on an online video platform. Each video $v_{i}$ consists of $L$ modalities of content, denoted as $M_{i}=\{m_{1}^{i},m_{2}^{i},\dots,m_{L}^{i}\}$, where $L\geq 2$. The goal of micro video popularity prediction is to forecast meta parameters related to video popularity, such as cumulative views, comments, and likes. These parameters are represented as $\mathbf{Y}_{i}=[y_{1}^{i},y_{2}^{i},\dots,y_{h}^{i}]$, where $h$ denotes the number of different popularity metrics for each video $v_{i}$. The task is to predict the future values of these parameters, $\mathbf{Y}_{i}$, using all relevant modalities that influence the video's popularity trend after its release.

In this study, we focus on predicting comments as the primary popularity metric, since comments are less prone to manipulation and provide more detailed temporal information than other metrics such as total views. The problem we address is to accurately predict $y_{c}^{i}$, the comment parameter for a given video $v_{i}$, using its multi-modal content information $M_{i}$. Specifically, the goal is to forecast the future values of $\mathbf{Y}_{i}$ from the available modalities with our model $F(v_{i}, M_{i})$. We use the normalized Mean Squared Error (nMSE) to evaluate the loss of our prediction model.

Methodology

Our method comprises two modules: a video information extraction module based on multi-modal information processing and database retrieval, and a social network dissemination simulation module. In the multi-modal processing module, we introduce an LLM-based retriever to find related videos and process the video's various modalities with architectures such as ViT. We then apply a cross-attention mechanism to explore the interactions between the retrieved videos and the target video. In the social network dissemination simulation module, we establish an event-focused Hawkes process model based on the Mamba architecture to reproduce the dissemination of the video within a user group. A prediction network integrates the outputs of both modules and predicts the dissemination intensity of the video.

Figure 2: Overall Structure of MUFM

Micro Video Retrieval

In micro video dissemination, a video's popularity is often related to the performance of similar videos and is significantly influenced by recommendation algorithms and user behavior. We posit that incorporating similar-video information can enhance micro video popularity prediction accuracy. To achieve this, we first employ a module to summarize and extract multi-modal information from micro videos, converting the data into a textual description used for identifying similar content. To retrieve instances that are valuable for predicting the target video's popularity, we craft a video-to-text transformation process. This process uses vision and audio-to-text models to generate captions for the video content, which are combined with the original textual descriptions. The resulting composite text is used as an input prompt for LLMs and encoded into a retrieval vector representing the corresponding video. This method aligns the micro video's audio-visual and textual modalities, addressing potential inconsistencies.

The micro video memory repository is a collection of reference pairs, denoted as $\langle frames, text\rangle$, with each element encoded for efficient retrieval. To generate a retrieval vector for a given micro video $V_{i}$, we use BLIP-2 (Li et al. 2023) and CLAP (Wu et al. 2024) to generate descriptive textual annotations, denoted as $C_{i}$ and $A_{i}$. The synthetic captions $C_{i}$, the audio annotations $A_{i}$, and the text description $T_{i}$ are then combined using the concatenation operator $\oplus$, leading to the synthesized text prompt $P_{i}$:

$P_{i}=C_{i}\oplus A_{i}\oplus T_{i}.$ (1)

$P_{i}$ is then processed through a pre-trained semantic extraction model to generate a retrieval vector $R_{i}$ that encapsulates the key attributes of video $V_{i}$. By aggregating features from the top-$S$ most similar videos, we obtain the full set of retrieved features. The specific process and formulas for this part of the model can be found in Appendix 1.
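For illustration, the sketch below shows this retrieval step end to end, assuming the BLIP-2/CLAP captions and the memory-bank embeddings are already available as strings and vectors; the function names, the plain string concatenation, and the inner-product scoring are illustrative assumptions rather than the exact implementation.

```python
import numpy as np

def build_prompt(frame_captions, audio_caption, description):
    # P_i = C_i ⊕ A_i ⊕ T_i: concatenate frame captions, the audio caption,
    # and the original text description into one prompt for the semantic extractor.
    return " ".join(frame_captions) + " " + audio_caption + " " + description

def retrieve_top_s(query_vec, bank_vecs, s=20):
    # Score every repository video by inner product with the query retrieval
    # vector R_i and return the indices of the S most similar instances.
    scores = bank_vecs @ query_vec            # (num_videos,)
    top_idx = np.argsort(-scores)[:s]
    return top_idx, scores[top_idx]
```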

Multi-Modal Process

For MVPP, it is crucial to extract the required features from the video's multi-modal information. We process all target videos frame by frame, sample frames at fixed intervals to extract image and audio information, and transform them into frame feature vectors through ViT and AST architectures. We then feed the video's image-audio features and its text description into a model with both forward (positive) and reverse (negative) attention mechanisms to capture its content features, and further enhance these features by comparing the target video with the videos matched by the retriever in the database.

Multi-Modal Feature Extraction

To capture the key features of a micro video $v_{i}$, we extract visual, audio, and textual information. We start by selecting key frames $F_{1}^{i},F_{2}^{i},\dots,F_{K}^{i}$, where $K$ is the total number of sampled frames. Each frame $F_{j}^{i}\in\mathbb{R}^{H\times W\times C}$ is divided into fixed-size patches and reshaped into ${F_{j}^{i}}^{\ast}\in\mathbb{R}^{N\times P^{2}\times C}$.

After applying the Audio Spectrogram Transformer (AST) structure (Gong, Chung, and Glass 2021) to obtain the spectrum matrix $M_{S}^{n_{mel}\times n_{a}}$, we split it into patches and linearly embed each visual and audio patch, obtaining $K$ patch sequences. Passing these through a transformer structure yields $E_{i}^{v}$ and $E_{i}^{a}$, respectively. These feature matrices are concatenated to emphasize the relationship between visual and audio information:

$E_{i}^{v,a}=E_{i}^{v}\oplus E_{i}^{a}\in\mathbb{R}^{K\times(d_{v}+d_{a})}.$ (2)

The concatenated features $E_{i}^{v,a}$ are then passed through a linear layer $W_{c}\in\mathbb{R}^{(d_{v}+d_{a})\times d}$ with a ReLU activation, generating the audio-visual input token $X^{c}$. Similarly, we obtain the textual input token $X^{t}$ by processing the textual embedding $E^{t}_{i}$ through the layer $W_{t}\in\mathbb{R}^{d_{t}\times d}$:

$X^{c}\in\mathbb{R}^{K\times d}:\ X_{i}^{c}=\mathrm{ReLU}(E_{i}^{v,a}W_{c}).$ (3)
$X^{t}\in\mathbb{R}^{n_{w}\times d}:\ X_{i}^{t}=\mathrm{ReLU}(E_{i}^{t}W_{t}).$ (4)

The specific process and formula description of this part of the model can be found in Appendix 2.
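As a hedged illustration of the projection step in Eqs. (2)-(4), the sketch below assumes the ViT, AST, and text embeddings are already computed; all dimensions and names are placeholders, not the exact implementation.

```python
import torch
import torch.nn as nn

class ModalityProjector(nn.Module):
    def __init__(self, d_v=768, d_a=768, d_t=1024, d=256):
        super().__init__()
        self.w_c = nn.Linear(d_v + d_a, d)   # W_c for the fused audio-visual token
        self.w_t = nn.Linear(d_t, d)         # W_t for the textual token

    def forward(self, e_v, e_a, e_t):
        # e_v: (K, d_v) visual features, e_a: (K, d_a) audio features,
        # e_t: (n_w, d_t) textual embeddings.
        e_va = torch.cat([e_v, e_a], dim=-1)   # Eq. (2): E^{v,a} = E^v ⊕ E^a
        x_c = torch.relu(self.w_c(e_va))       # Eq. (3): audio-visual token X^c
        x_t = torch.relu(self.w_t(e_t))        # Eq. (4): textual token X^t
        return x_c, x_t
```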

Cross-modal Bipolar Interaction

Aligning audio-visual and textual modalities in micro videos presents challenges due to potential inconsistencies between textual descriptions and video content. To address this, we implement a cross-attention network comprising both positive and negative attention modules, designed to capture the similarities and differences between multi-modal information. The positive attention module focuses on identifying the most consistent features across different modalities, while the negative attention module highlights any inconsistent or contradictory information.

Within the positive attention module, the most aligned features between modalities are calculated using cross-modal attention vectors. For a given video $v_{i}$, the audio-visually guided positive textual features $T_{i}^{\mathcal{P}}$ and the textually guided positive audio-visual features $C_{i}^{\mathcal{P}}$ are derived as follows:

$T_{i}^{\mathcal{P}} =ATT^{\mathcal{P}}\left(X_{i}^{c}W^{\mathcal{Q}}_{\mathcal{P}},\ X_{i}^{t}W^{\mathcal{K}}_{\mathcal{P}},\ X_{i}^{t}W^{\mathcal{C}}_{\mathcal{P}}\right)$ (5)
$=\mathrm{Softmax}\left(\alpha\frac{\mathcal{Q}\mathcal{K}^{T}}{\sqrt{d}}\right)\mathcal{C},$ (6)

where $W^{\mathcal{Q}}_{\mathcal{P}},W^{\mathcal{K}}_{\mathcal{P}},W^{\mathcal{C}}_{\mathcal{P}}$ denote the query, key, and value projection matrices, respectively. The textually guided positive audio-visual features $C_{i}^{\mathcal{P}}$ are obtained in the same way. The parameter $\alpha$ balances the influence of positive and negative attention. Similarly, the negative audio-visually guided textual features $T_{i}^{\mathcal{N}}$ and the textually guided negative audio-visual features $C_{i}^{\mathcal{N}}$ can be obtained with the same method.

Following the hidden-state integration of MMRA (Geva et al. 2021; Zhong et al. 2024), we incorporate audio-visual hidden states into textual hidden states within the FFN layers, producing comprehensive textual and audio-visual representations $\widetilde{T_{i}}$ and $\widetilde{C_{i}}$. We then obtain expressive representations $T_{i}$ and $C_{i}$ via an attentive pooling strategy (Sun and Lu 2020).

This dual-path approach ensures a synthesis of the audio-visual and textual modalities; the detailed process and formulas for this part of the model can be found in Appendix 3.
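A minimal sketch of one positive/negative cross-attention pair (in the spirit of Eqs. 5-6 and their negative counterparts) is shown below; the class layout, dimensions, and the scaling convention $\gamma=-(1-\alpha)$ from Appendix 3 are the only assumptions.

```python
import torch
import torch.nn as nn

class BipolarCrossAttention(nn.Module):
    """Positive and negative cross-modal attention over one query/context pair."""
    def __init__(self, d=256, alpha=0.2):
        super().__init__()
        self.q = nn.Linear(d, d)
        self.k = nn.Linear(d, d)
        self.v = nn.Linear(d, d)
        self.alpha = alpha              # balance between positive / negative attention
        self.gamma = -(1.0 - alpha)     # negative-attention scale (Appendix 3)

    def forward(self, x_query, x_context):
        d = x_query.size(-1)
        logits = self.q(x_query) @ self.k(x_context).T / d ** 0.5
        pos = torch.softmax(self.alpha * logits, dim=-1) @ self.v(x_context)
        neg = torch.softmax(self.gamma * logits, dim=-1) @ self.v(x_context)
        return pos, neg   # e.g. T_i^P and T_i^N when queried by audio-visual tokens
```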

Retrieval Interaction Enhancement

We focus on extracting valuable insights from the retrieved relevant instances to improve micro video popularity prediction (MVPP). To achieve this, we use an aggregation function that combines the comprehensive representations of these instances from the memory bank. This function uses the normalized similarity scores obtained during retrieval as attention weights to construct the retrieved embeddings $X_{i}^{r,c}$ and $X_{i}^{r,t}$. Instances with higher similarity scores are prioritized, highlighting their relevance to the target micro video.

Similarly, we calculate $\widetilde{T_{i}^{r}}$ and $\widetilde{C_{i}^{r}}$, and then derive the final representations $T_{i}^{r}$ and $C_{i}^{r}$ using an attentive pooling strategy. To capture the popularity trends of the retrieved instances, we encode the label information through linear layers and aggregation, resulting in the retrieved label embedding $L_{i}^{r}$. Finally, AUG-MMRA integrates all features from $T_{i}^{r}$ and $C_{i}^{r}$ to model cross-sample interactions. These feature interactions are constructed as

$\mathcal{I}=[\Phi(C_{i},C_{i}^{r}),\Phi(C_{i},T_{i}^{r}),\dots,\Phi(T_{i},L_{i}^{r})],$ (7)

where $\Phi$ denotes the inner product. A detailed description of this part of the model can be found in Appendix 4.
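The sketch below illustrates how the pairwise interactions in Eq. (7) can be assembled; the specific lists of features passed in are illustrative.

```python
import torch

def cross_sample_interactions(target_feats, retrieved_feats):
    # target_feats: pooled target representations, e.g. [C_i, T_i], each of shape (d,);
    # retrieved_feats: aggregated retrieved representations, e.g. [C_i^r, T_i^r, L_i^r].
    # Φ is an inner product; the interaction vector I collects every pairing.
    return torch.stack([torch.dot(a, b)
                        for a in target_feats
                        for b in retrieved_feats])
```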

Multi-modal Information Based SASRec

In our micro-video popularity prediction model, we use SASRec, a Transformer-based architecture, to capture users’ sequential behavior patterns. It models short and long-term user preferences and incorporates multi-modal item features.

The model processes video frames, cover images, and titles extracted with pre-trained encoders. User interaction histories are fed into the SASRec module, where self-attention layers encode sequential behaviors. The multi-modal features are integrated to form the item scoring representation, which is then combined with the user embedding.

For each user, a score for each item is calculated by a dot product. The final score for an item across all users is the average of individual scores. These are converted to probabilities with a sigmoid function. The model integrates multi-modal features and optimizes through backpropagation based on user behavior sequences. More details can be found in Appendix 5.
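As a hedged sketch of this scoring step (the per-user dot product, the averaging across users, and the sigmoid described above), with tensor shapes as assumptions:

```python
import torch

def sas_score(user_embs, item_scoring):
    # user_embs: (N, d) sequence-encoded user embeddings from SASRec;
    # item_scoring: (d,) multi-modal representation of the candidate item.
    scores = user_embs @ item_scoring      # dot product per user
    avg_score = scores.mean()              # average over all N users
    return torch.sigmoid(avg_score)        # Score_SAS, the predicted interaction probability
```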

Mamba Hawkes Process (MHP) Architecture

We define the user reaction sequence as $\mathcal{S}=\{(t_{1},r_{1}),(t_{2},r_{2}),\dots,(t_{n},r_{n})\}$, where $\Delta_{i}=t_{i}-t_{i-1}$ is the time difference between consecutive events, collected as $\Delta=(\Delta_{1},\Delta_{2},\dots,\Delta_{n})$. Each event is represented as a one-hot vector $\mathbf{r}_{i}$. We construct the MHP model using the following equation:

$h(t+\Delta t)=\overline{\mathbf{A}}h(t)+\overline{\mathbf{B}}u(t).$ (8)

MHP Model Architecture

Let $W^{e}$ be the event embedding matrix of dimension $D\times R$; the event embedding is defined as $x_{t_{i}}=\mathbf{r}_{i}(W^{e})^{T}$, and the sequence embedding is $(x_{t_{1}},x_{t_{2}},\dots,x_{t_{n}})$. We define the time-dependent matrices $\mathbf{B}(t_{i})$ and $\mathbf{C}(t_{i})$ as linear transformations. The state transition formulas for MHP are:

$z_{t_{i}} =\overline{\mathbf{A}}(t_{i})z_{t_{i-1}}+\overline{\mathbf{B}}(t_{i})x_{t_{i}},$ (9)
$y_{t_{i}} =\mathbf{C}(t_{i})z_{t_{i}},$

where the time-dependent coefficients are:

$\overline{\mathbf{A}}(t_{i}) =\exp(\Delta_{i}\mathbf{A}),$ (10)
$\overline{\mathbf{B}}(t_{i}) =(\Delta_{i}\mathbf{A})^{-1}(\exp(\Delta_{i}\mathbf{A})-\mathbf{I})\cdot\Delta_{i}\mathbf{B}(t_{i}).$

The final output $\mathbf{O}=(\mathbf{o}_{1},\mathbf{o}_{2},\dots,\mathbf{o}_{n})$ is passed through a neural network to generate the hidden representations $\mathbf{h}(t)$:

$\mathbf{H}=\mathrm{ReLU}(\mathbf{O}W_{1}+b_{1})W_{2}+b_{2},\quad\mathbf{h}(t_{j})=\mathbf{H}(j,:).$ (11)

The resulting matrix $\mathbf{H}$ contains hidden representations of all the events in the input sequence, where each row corresponds to a particular event.
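For concreteness, the following is a minimal sketch of one recurrence step of Eqs. (9)-(10), under the simplifying assumption of a diagonal $\mathbf{A}$ (a common state space parameterization) so that the matrix exponential and inverse reduce to elementwise operations; shapes and names are illustrative, not the paper's exact implementation.

```python
import torch

def mhp_step(z_prev, x_t, delta, A_diag, B_t, C_t):
    # z_prev, x_t, A_diag, B_t, C_t: (D,) tensors; delta: scalar time gap Δ_i.
    dA = delta * A_diag                        # Δ_i A
    A_bar = torch.exp(dA)                      # A̅(t_i) = exp(Δ_i A)
    B_bar = (A_bar - 1.0) / dA * delta * B_t   # (Δ_i A)^{-1}(exp(Δ_i A) − I) · Δ_i B(t_i)
    z = A_bar * z_prev + B_bar * x_t           # z_{t_i}
    y = C_t * z                                # y_{t_i} = C(t_i) z_{t_i}
    return z, y
```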

Figure 3: Mamba Hawkes Process (MHP) Architecture

Intensity Function and Log-Likelihood

The intensity function of the MHP is given by

$\lambda(t)=\sum_{r=1}^{R}\lambda_{r}(t),$ (12)

where

$\lambda_{r}=f_{r}\left(\alpha_{r}(t-t_{j})+\mathbf{w}_{r}^{T}\mathbf{h}(t_{j})+b_{r}\right),$ (13)

and $f_{r}(x)=\beta_{r}\log\left(1+\exp(x/\beta_{r})\right)$ is the Softplus function. The log-likelihood function based on the sequence $\mathcal{S}$ is given by

$\ell(\mathcal{S})=\underbrace{\sum_{j=1}^{n}\log\lambda(t_{j}|\mathcal{H}_{j})}_{\text{event log-likelihood}}-\underbrace{\int_{t_{1}}^{t_{n}}\lambda(t|\mathcal{H}_{t})\,dt}_{\text{non-event log-likelihood}}.$ (14)

Note that the log-likelihood function plays two important roles here:

  • as the loss function for our pre-trained model

  • to evaluate the reaction ratio from the users

We learn the model parameters by maximizing the log-likelihood across all sequences:

$\max~\sum_{i=1}^{N}\ell(\mathcal{S}_{i}),$ (15)

using the ADAM optimization algorithm for an efficient solution.
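As an illustration of how Eqs. (12)-(15) become a training objective, the sketch below evaluates the intensities only at the event times (where the $\alpha_{r}(t-t_{j})$ term vanishes) and approximates the non-event integral with a trapezoidal rule; the shapes, parameter names, and this approximation scheme are assumptions rather than the exact implementation.

```python
import torch

def mhp_log_likelihood(h, times, w, b, beta):
    # h: (n, d) hidden states h(t_j); times: (n,) event timestamps;
    # w: (d, R), b: (R,), beta: (R,) Softplus scale parameters β_r.
    x = h @ w + b                                    # (n, R) pre-activations
    lam = beta * torch.log1p(torch.exp(x / beta))    # λ_r(t_j) = β_r log(1 + exp(x/β_r))
    total = lam.sum(dim=-1)                          # λ(t_j) = Σ_r λ_r(t_j)
    event_ll = torch.log(total).sum()                # Σ_j log λ(t_j | H_j)
    dt = times[1:] - times[:-1]                      # trapezoidal approximation of ∫ λ(t) dt
    integral = (0.5 * (total[1:] + total[:-1]) * dt).sum()
    return event_ll - integral                       # ℓ(S), maximized (e.g. with Adam)
```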

Then we use the MHP model as an evaluation module, combining it with multi-modal information to assign a comprehensive score that reflects the impact of platform recommendations on the popularity of short videos.

$Likelihood_{MHP}=\ell(v_{i}),\ v_{i}\in\mathbf{V}.$ (16)

Further implementation details can be found in Appendix 6.

Prediction Network

For each micro-video $v_{i}$, the output layer of AUG-MMRA receives a concatenated vector of the feature components, which is passed through a fully connected layer with weights $W_{output}\in\mathbb{R}^{10d\times d}$:

$Output=\mathrm{concat}([C_{i},T_{i},C_{i}^{r},T_{i}^{r},\mathcal{I}])\,W_{output}.$ (17)

Additionally, for each micro-video we obtain the recommendation score $Score_{SAS}$ from SASRec and the recommendation likelihood $Likelihood_{MHP}$ from MHP.

$Prediction=[Output,\ Score_{SAS},\ Likelihood_{MHP}]\,W_{pred},$ (18)

where $W_{pred}$ is a trainable parameter matrix. In the model training phase, we employ mean squared error (MSE) as our loss function.
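A minimal sketch of this fusion head (Eqs. 17-18) is given below; the dimensions and the use of linear layers for $W_{output}$ and $W_{pred}$ are assumptions.

```python
import torch
import torch.nn as nn

class PredictionHead(nn.Module):
    # Fuses the AUG-MMRA features with the SASRec score and the MHP likelihood.
    def __init__(self, d=256):
        super().__init__()
        self.w_output = nn.Linear(10 * d, d)   # W_output in Eq. (17)
        self.w_pred = nn.Linear(d + 2, 1)      # W_pred over [Output, Score_SAS, Likelihood_MHP]

    def forward(self, feats, score_sas, likelihood_mhp):
        # feats: list of 1-D feature tensors [C_i, T_i, C_i^r, T_i^r, I] totalling 10d dims.
        output = self.w_output(torch.cat(feats, dim=-1))
        fused = torch.cat([output,
                           score_sas.reshape(1),
                           likelihood_mhp.reshape(1)], dim=-1)
        return self.w_pred(fused).squeeze(-1)  # trained with an MSE loss
```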

Experiments

Research Question

In this section, we present experiments conducted to evaluate the effectiveness of MUFM on a real-world micro-video dataset, with the aim of addressing the following research questions:

  • RQ1: How does MUFM perform compared to existing models and state-of-the-art methods?

  • RQ2: What is the contribution of each component of MUFM to its overall performance in MVPP?

  • RQ3: How do key hyperparameters affect the model’s performance?

  • RQ4: What insights can be gained from the results of MUFM?

Table 1: Statistics of the datasets.
Dataset          Video    Train    Val     Test
MicroLens-100k   19738    15790    1974    1974
pMicroLens       17382*   13905    1739    1738

*: 2342 videos are unused in pMicroLens due to a lack of valid comments, and 14 videos are discarded due to file corruption.

Datasets

Based on the open-source dataset MicroLens-100k (Ni et al. 2023), we construct pMicroLens (permuted MicroLens) by excluding videos with fewer than five comments and keeping only the five earliest published comments for the remaining videos. Consequently, no information directly reveals the total number of comments, and the Mamba Hawkes Process can only learn the sequential structure of the earliest five comments. To evaluate the effectiveness of MUFM, we conduct experiments on pMicroLens to predict the popularity score, represented by the number of comments.

Evaluation Metrics

Since micro video popularity prediction is essentially a regression task, performance is determined by how well the predicted popularity parameters fit their actual values. We adopt the normalized Mean Squared Error (nMSE) as the primary performance metric, defined as follows:

$nMSE=\frac{1}{N\sigma_{y_{i}}^{2}}\sum_{i=1}^{N}(y_{i}-\hat{y}_{i})^{2},$ (19)

where $N$ represents the total number of micro-video samples, $y_{i}$ and $\hat{y}_{i}$ are the target and predicted popularity scores for the $i$-th micro-video sample, and $\sigma_{y_{i}}$ is the standard deviation of the target popularity.

In addition, we measure Spearman’s Rank Correlation (SRC) coefficient, Pearson linear correlation coefficient (PLCC), and Mean Absolute Error (MAE) as complementary metrics. SRC, PLCC, and MAE are defined as follows:

$SRC=1-\frac{6\sum_{i=1}^{N}d_{i}^{2}}{N(N^{2}-1)},$ (20)
$PLCC=\frac{1}{N}\sum_{i=1}^{N}\frac{1}{T}\sum_{j=1}^{T}\left(\frac{y_{i}^{j}-\bar{y}_{i}}{\sigma_{y_{i}}}\right)\left(\frac{\hat{y}_{i}^{j}-\bar{\hat{y}}_{i}}{\sigma_{\hat{y}_{i}}}\right),$ (21)
$MAE=\frac{1}{N}\sum_{i=1}^{N}\lvert y_{i}-\hat{y}_{i}\rvert,$ (22)

where $d_{i}$ is the rank difference of a micro-video between the prediction and the popularity target, and $\sigma_{y_{i}}$ and $\sigma_{\hat{y}_{i}}$ denote the standard deviations of the target and predicted popularity sequences for the $i$-th micro-video sample.

A higher SRC value indicates a stronger monotonic correlation between targets and predictions, a higher PLCC value indicates a stronger linear correlation, and a lower nMSE or MAE indicates a more precise prediction. More details on the baseline models and experiments are provided in the Appendix.
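For reference, a simple sketch of these metrics is shown below; note that the PLCC here is computed at the sample level (a common simplification), whereas Eq. (21) defines it per video over time steps.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def evaluate(y_true, y_pred):
    # nMSE normalizes the squared error by the variance of the targets (Eq. 19).
    nmse = np.mean((y_true - y_pred) ** 2) / np.var(y_true)
    src, _ = spearmanr(y_true, y_pred)        # rank correlation (Eq. 20)
    plcc, _ = pearsonr(y_true, y_pred)        # linear correlation
    mae = np.mean(np.abs(y_true - y_pred))    # Eq. (22)
    return {"nMSE": nmse, "SRC": src, "PLCC": plcc, "MAE": mae}
```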

Implementation

Device

All our experiments are conducted on a Linux server with 16 NVIDIA A100 Tensor Core GPUs using a multi-thread dataloader.

Table 2: Main hyper-parameters in the training process
$lr$: Learning rate
$batch\_size$: Batch size
$\alpha$: Balance parameter between positive and negative attention
$wdecay$: Weight decay
$dropout$: Dropout rate in the neural network
$K$: Number of frames captured per video
$S$: Number of similar videos retrieved

Hyper-Parameter Settings

There are several main parameters in the training process, listed in Table 2. We adopt Bayesian Optimization (Jones, Schonlau, and Welch 1998) to search the hyper-parameter space and find the lowest nMSE when setting $[lr,\ batch\_size,\ wdecay,\ dropout,\ \alpha,\ K,\ S]=[1\mathrm{e}{-4},\ 64,\ 0.001,\ 0,\ 0.2,\ 10,\ 20]$. We apply this set of parameters to obtain our results and compare MUFM with the other baselines. Hyper-parameter optimization and sensitivity analysis are discussed in detail in the following paragraphs.
Running details: We have set the random seed to 2024 for reproducibility. The best performance is achieved after 6 epochs.

Performance Comparison

Baseline models

To evaluate the superiority of the model, we conduct experiments against 9 competitive baselines covering different methods: SVR (Khosla, Das Sarma, and Hamid 2014), HyFea (Lai, Zhang, and Zhang 2020), CLSTM (Ghosh et al. 2016), TMALL (Chen et al. 2016), MASSL (Zhang et al. 2023), CBAN (hin Cheung and man Lam 2022), HMMVED, MTFM, and MMRA (Zhong et al. 2024). More details about the baseline models are provided in the Appendix, and all baseline code is included in the code appendix.

Performance Comparison

The performance of the baseline models and our MUFM model on the dataset is presented in Table 3. The results demonstrate that MUFM consistently outperforms all baselines on the pMicroLens dataset.

These findings confirm the value of constructing a multi-modal pipeline integrated with user feedback to boost prediction accuracy. Notably, compared to the current state-of-the-art model MMRA, our approach shows significant improvements by incorporating user-targeted recommendations via SASRec and item-targeted reaction sequences via MHP. The additional learned meta-knowledge in our framework allows for a more nuanced decoupling and representation of the factors driving micro-video popularity, which naturally leads to superior performance.

Table 3: Performance comparison among baselines and MUFM
Model      nMSE      SRC       MAE      PLCC
CBAN 0.8705 0.2780 27.771 0.3808
CLSTM 0.8868 0.3758 27.326 0.3608
HMMVED 0.9076 0.2726 28.342 0.2586
MASSL 0.9436 0.2920 26.984 0.2917
SVR 1.0445 0.3147 26.848 0.2685
MFTM 0.8985 0.3143 27.935 0.3207
Hyfea 0.9050 0.3059 27.928 0.3085
TMALL 0.8729 0.3353 26.297 0.3458
MMRA 0.9059 0.4305 25.498 0.4323
MUFM 0.8269 0.5308 24.643 0.4919

\ast: MMRA is the current state-of-the-art model.

Hyper-parameter Analysis

After initial adjustments, the model's performance showed low sensitivity to $wdecay$, $batch\_size$, and $dropout$. Therefore, we focused on optimizing and analyzing the other key parameters: $K$, $S$, $\alpha$, and $lr$.

As shown in Figure 4, aggregating more videos in the retrieval process enables the model to pay more attention to the characteristics shared by similar videos and obtain more effective information, which ultimately improves the performance of MUFM; the optimal value for $S$ is 20. In terms of $lr$, increasing the learning rate requires more epochs to learn the information and achieve the lowest nMSE. The optimal learning rate is found to be 1e-4.

Figure 5 illustrates the best nMSE and SRC obtained when training with different values of $K$ and $\alpha$. The model's performance is relatively insensitive both to the number of frames captured ($K$) and to the balance between positive and negative attention ($\alpha$). MUFM performed best with $K=10$ and $\alpha=0.8$.

Overall, our model demonstrated strong robustness once the value of $S$ was optimized. The best-performing hyper-parameters are $[lr,\ batch\_size,\ wdecay,\ dropout,\ \alpha,\ K,\ S]=[1\mathrm{e}{-4},\ 64,\ 0.001,\ 0,\ 0.2,\ 10,\ 20]$.

Figure 4: nMSE when training with different $lr$ or $S$
Figure 5: nMSE when training with different $K$ or $\alpha$

Ablation Study

In this section, we conduct an ablation study on MUFM to assess the impact of three critical components. We create the following variants for evaluation.

  • noSAS: This variant removes the SASRec, thereby eliminating user-targeted recommendations and leaving MHP as the sole recommendation mechanism.

  • noMamba: In this variant, the Mamba model in MHP is removed, eliminating the context encoder and dependency capture provided by selective state space models (SSMs). The predicted recommendation likelihoods are therefore obtained from the user reaction sequence using only the traditional Hawkes Process.

  • noHP: This variant excludes the Hawkes Process (HP) methodology. The predictive algorithm does not follow the Hawkes Process model but relies solely on Mamba itself, which prevents modeling self-excitation or mutual inhibition between events.

In this analysis, we selected the top 50, 100, and 200 videos in the test samples with the highest ground truth popularity scores, as well as the bottom 50, 100, and 200 videos with the lowest scores. We then reported the average predicted popularity scores based on the number of comments. Additionally, we evaluated the entire test dataset using metrics such as nMSE, SRC, and MAE across four different model variants.

Table 4: Ablation study among four variants: MUFM, noSAS, noMamba and noHP
Prediction noSAS noMamba noHP MUFM
Top50 83.456 83.389 76.389 83.433
Top100 70.690 69.565 65.573 71.349
Top200 65.111 62.445 60.835 65.869
Bottom200 19.714 29.030 31.332 17.940
Bottom100 15.638 26.547 28.565 12.493
Bottom50 12.409 25.855 27.921 10.865
nMSE 0.8231 0.8474 0.8756 0.8219
SRC 0.5216 0.4438 0.4214 0.5308
MAE 24.890 25.882 26.498 24.643

As shown in Table 4, the predicted popularity scores for the six groups followed the order Top50 > Top100 > Top200 > Bottom200 > Bottom100 > Bottom50, indicating that the prediction model behaves as expected. Furthermore, compared to the actual number of comments in each group, MUFM achieved the best performance in most cases. As for the evaluation metrics, MUFM outperformed the other three variants in nMSE, SRC, and MAE, validating the effectiveness of incorporating SASRec and MHP.

Notably, removing either the Mamba architecture or the Hawkes Process leads to a significant decline in performance. This finding suggests that the mutual inhibition and the context dependency captured between events both contribute to extracting effective meta-knowledge from the user reaction sequence. When the Mamba architecture is combined with the Hawkes process, the model's ability to capture critical information and its interpretability are both enhanced, improving overall performance.

This is a meaningful indication that recommendation systems driven by user reactions influence video popularity more strongly than previously understood.

Conclusion

In this work, we propose MUFM, a multi-modal model for MVPP. MUFM introduces a retrieval-augmented framework and a Hawkes process model for the MVPP task. We align the visual, audio, and textual modalities to find similar videos. Cross-modal bipolar interactions are applied to address inconsistent information between modalities, together with a retrieval interaction enhancement method that captures meaningful knowledge from relevant instances. The Mamba-based Hawkes process model captures user behavior and improves the model's precision. Experiments on a real-world micro-video dataset show that our method is effective and outperforms the current state-of-the-art in MVPP.

References

  • Chen et al. (2016) Chen, J.; Song, X.; Nie, L.; Wang, X.; Zhang, H.; and Chua, T.-S. 2016. Micro Tells Macro: Predicting the Popularity of Micro-Videos via a Transductive Model. In Proceedings of the 24th ACM International Conference on Multimedia, MM ’16, 898–907. New York, NY, USA: Association for Computing Machinery. ISBN 9781450336031.
  • Devlin et al. (2019) Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805.
  • Dosovitskiy et al. (2021) Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; Uszkoreit, J.; and Houlsby, N. 2021. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv:2010.11929.
  • Gao, Dai, and Hu (2024) Gao, A.; Dai, S.; and Hu, Y. 2024. Mamba Hawkes Process. arXiv preprint arXiv:2407.05302.
  • Geva et al. (2021) Geva, M.; Schuster, R.; Berant, J.; and Levy, O. 2021. Transformer Feed-Forward Layers Are Key-Value Memories. arXiv:2012.14913.
  • Ghosh et al. (2016) Ghosh, S.; Vinyals, O.; Strope, B.; Roy, S.; Dean, T.; and Heck, L. 2016. Contextual LSTM (CLSTM) models for Large scale NLP tasks. arXiv:1602.06291.
  • Gong, Chung, and Glass (2021) Gong, Y.; Chung, Y.-A.; and Glass, J. 2021. AST: Audio Spectrogram Transformer. arXiv:2104.01778.
  • Hawkes (1971) Hawkes, A. G. 1971. Spectra of some self-exciting and mutually exciting point processes. Biometrika, 58(1): 83–90.
  • He et al. (2015) He, K.; Zhang, X.; Ren, S.; and Sun, J. 2015. Deep Residual Learning for Image Recognition. arXiv:1512.03385.
  • hin Cheung and man Lam (2022) hin Cheung, T.; and man Lam, K. 2022. Crossmodal bipolar attention for multimodal classification on social media. Neurocomputing, 514: 1–12.
  • Jones, Schonlau, and Welch (1998) Jones, D. R.; Schonlau, M.; and Welch, W. J. 1998. Efficient Global Optimization of Expensive Black-Box Functions.
  • Khosla, Das Sarma, and Hamid (2014) Khosla, A.; Das Sarma, A.; and Hamid, R. 2014. What makes an image popular? In Proceedings of the 23rd International Conference on World Wide Web, WWW ’14, 867–876. New York, NY, USA: Association for Computing Machinery. ISBN 9781450327442.
  • Lai, Zhang, and Zhang (2020) Lai, X.; Zhang, Y.; and Zhang, W. 2020. HyFea: Winning Solution to Social Media Popularity Prediction for Multimedia Grand Challenge 2020. In Proceedings of the 28th ACM International Conference on Multimedia, MM ’20, 4565–4569. New York, NY, USA: Association for Computing Machinery. ISBN 9781450379885.
  • Laub, Taimre, and Pollett (2015) Laub, P. J.; Taimre, T.; and Pollett, P. K. 2015. Hawkes processes. arXiv preprint arXiv:1507.02822.
  • Li et al. (2013) Li, H.; Ma, X.; Wang, F.; Liu, J.; and Xu, K. 2013. On popularity prediction of videos shared in online social networks. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, CIKM ’13, 169–178. New York, NY, USA: Association for Computing Machinery. ISBN 9781450322638.
  • Li et al. (2023) Li, J.; Li, D.; Savarese, S.; and Hoi, S. 2023. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. arXiv:2301.12597.
  • Li and Li (2024) Li, X.; and Li, J. 2024. AnglE-optimized Text Embeddings. arXiv:2309.12871.
  • Ni et al. (2023) Ni, Y.; Cheng, Y.; Liu, X.; Fu, J.; Li, Y.; He, X.; Zhang, Y.; and Yuan, F. 2023. A Content-Driven Micro-Video Recommendation Dataset at Scale. arXiv:2309.15379.
  • Rizoiu et al. (2017) Rizoiu, M.-A.; Lee, Y.; Mishra, S.; and Xie, L. 2017. Hawkes processes for events in social media, 191–218. Association for Computing Machinery and Morgan & Claypool. ISBN 9781970001075.
  • Roy et al. (2013) Roy, S. D.; Mei, T.; Zeng, W.; and Li, S. 2013. Towards Cross-Domain Learning for Social Video Popularity Prediction. IEEE Transactions on Multimedia, 15(6): 1255–1267.
  • Sun and Lu (2020) Sun, X.; and Lu, W. 2020. Understanding Attention for Text Classification. In Jurafsky, D.; Chai, J.; Schluter, N.; and Tetreault, J., eds., Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 3418–3428. Online: Association for Computational Linguistics.
  • Wu et al. (2016) Wu, B.; Mei, T.; Cheng, W.-H.; and Zhang, Y. 2016. Unfolding Temporal Dynamics: Predicting Social Media Popularity Using Multi-scale Temporal Decomposition. Proceedings of the AAAI Conference on Artificial Intelligence, 30(1).
  • Wu et al. (2024) Wu, Y.; Chen, K.; Zhang, T.; Hui, Y.; Nezhurina, M.; Berg-Kirkpatrick, T.; and Dubnov, S. 2024. Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation. arXiv:2211.06687.
  • Xie, Zhu, and Chen (2023) Xie, J.; Zhu, Y.; and Chen, Z. 2023. Micro-Video Popularity Prediction Via Multimodal Variational Information Bottleneck. IEEE Transactions on Multimedia, 25: 24–37.
  • Zhang et al. (2023) Zhang, Z.; Xu, S.; Guo, L.; and Lian, W. 2023. Multi-modal Variational Auto-Encoder Model for Micro-video Popularity Prediction. In Proceedings of the 8th International Conference on Communication and Information Processing, ICCIP ’22, 9–16. New York, NY, USA: Association for Computing Machinery. ISBN 9781450397100.
  • Zhao et al. (2024) Zhao, L.; Li, Y.; Chen, X.; Sun, L.; and Xue, Z. 2024. MFTM-Informer: A multi-step prediction model based on multivariate fuzzy trend matching and Informer.
  • Zhong et al. (2024) Zhong, T.; Lang, J.; Zhang, Y.; Cheng, Z.; Zhang, K.; and Zhou, F. 2024. Predicting Micro-video Popularity via Multi-modal Retrieval Augmentation. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’24, 2579–2583. New York, NY, USA: Association for Computing Machinery. ISBN 9798400704314.
  • Zhou, Zha, and Song (2013) Zhou, K.; Zha, H.; and Song, L. 2013. Learning social infectivity in sparse low-rank networks using multi-dimensional hawkes processes. In Artificial intelligence and statistics, 641–649. PMLR.

Appendix 1: Micro Video Retrieval

The micro video memory repository is a collection of reference pairs, denoted as $\langle frames, text\rangle$, with each element encoded for efficient retrieval. To generate a retrieval vector for a given micro video $V_{i}$, we begin by using BLIP-2, a pre-trained large language model, to analyze the video's content and produce descriptive captions for its frames, resulting in the set $C_{i}=\{c^{i}_{1},\dots,c^{i}_{L}\}$, where $L$ represents the total number of frames in video $V_{i}$.

For the audio component, we apply a similar process using the CLAP (Contrastive Language-Audio Pretraining) model, which generates descriptive textual annotations for the audio, denoted as $A_{i}$. These annotations are derived from a predefined set of audio descriptions. The resulting synthetic captions $C_{i}$ are then combined with the audio descriptions $A_{i}$ and the original textual descriptions $T_{i}$ to form a comprehensive text prompt. This is achieved using the concatenation operator $\oplus$, leading to

$P_{i}=C_{i}\oplus A_{i}\oplus T_{i}.$ (23)

This synthesized text prompt $P_{i}$ is then processed through a pre-trained semantic extraction model, specifically the UAE-Large architecture, to generate a retrieval vector $R_{i}$ that encapsulates the key attributes of video $V_{i}$. This process is repeated for each micro video in the memory repository, assigning each one a unique retrieval vector that serves as its identifier within the system.

This method unifies the visual, audio, and textual modalities into a cohesive representation, enhancing the model’s ability to predict micro video popularity. Additionally, it improves the accuracy and efficiency of content retrieval and analysis in the domain of micro form video dissemination.

The Bank Retriever mechanism is designed to identify the Top-$S$ most similar videos from a memory bank by evaluating the similarity scores between the target video and stored instances. The process begins by receiving a query video $V_{q}=\{v_{1}^{q},v_{2}^{q},\dots,v_{S}^{q}\}$ and searching through the memory bank, which holds a collection of frame-text pairs. The retriever employs a maximum inner product search algorithm to scan the memory bank, selecting the most relevant videos based on their similarity to the query.
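As a hedged sketch of this retrieval step, the snippet below builds an exact inner-product index over the memory bank with FAISS (any maximum inner product search library would do); the dimension and the random data are placeholders.

```python
import numpy as np
import faiss  # exact / approximate maximum inner product search

d = 1024                                            # retrieval-vector dimension (placeholder)
bank = np.random.rand(10000, d).astype("float32")   # encoded <frames, text> repository
index = faiss.IndexFlatIP(d)                        # exact inner-product index
index.add(bank)

query = np.random.rand(1, d).astype("float32")      # retrieval vector R_i of the target video
scores, top_s = index.search(query, 20)             # Top-S most similar videos (S = 20)
```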

Appendix 2: Multimodal Process

To capture the key features of a micro video $V_{i}$, we extract visual, audio, and textual information. We start by selecting a series of key frames, denoted as $F_{1}^{i},F_{2}^{i},\dots,F_{K}^{i}$, where $K$ represents the total number of frames.

For visual feature extraction, we apply a pre-trained visual model. Each frame $F_{j}^{i}\in\mathbb{R}^{H\times W\times C}$ is divided into fixed-size patches and reshaped into a sequence of flattened 2D patches ${F_{j}^{i}}^{\ast}\in\mathbb{R}^{N\times P^{2}\times C}$, where $(H,W)$ is the resolution of the original frame, $C$ is the number of channels, $(P,P)$ is the patch size, and $N$ is the number of patches, which serves as the input sequence length for the Vision Transformer (ViT).

Figure 6: Multimodal Feature Extraction and Embedding

For audio features, we follow a similar approach using the Audio Spectrogram Transformer (AST). We first align the timestamps of the sampled frames with corresponding audio segments $W=\{w_{1},w_{2},\dots,w_{K}\}$ extracted from the video's soundtrack. These audio segments are processed using a Fast Fourier Transform (FFT) to generate Mel-frequency spectral coefficients (MFCC), resulting in a spectrum matrix $M_{S}^{n_{mel}\times n_{a}}$, where $n_{mel}$ is the number of mel-filters and $n_{a}$ is the number of time frames, determined by the audio length and stride. This spectrum matrix, treated as a single-channel image, is then processed using the same patching method applied to visual frames.
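A minimal sketch of producing this spectrum matrix for one audio segment is shown below, using torchaudio's mel-spectrogram transform as a stand-in; the file name, sample rate, and filterbank sizes are illustrative assumptions.

```python
import torchaudio

# Hypothetical segment file; n_fft / hop_length / n_mels are placeholders.
waveform, sr = torchaudio.load("segment_w1.wav")
to_mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sr, n_fft=400, hop_length=160, n_mels=128)
spectrum = to_mel(waveform)   # (channels, n_mels, n_a): the matrix M_S, patched like an image
```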

Next, we linearly embed each visual and audio patch, adding sequential position embeddings to capture positional relationships within the frames. The resulting $K$ patch sequences are then processed by ViT and AST, producing visual and audio representations $E_{i}^{v}$ and $E_{i}^{a}$, respectively. These feature matrices are concatenated to emphasize the relationship between visual and audio information:

$E_{i}^{v,a}=E_{i}^{v}\oplus E_{i}^{a}\in\mathbb{R}^{K\times(d_{v}+d_{a})},$ (24)

where $\oplus$ denotes the concatenation operation, combining visual and audio features. Here, $d_{v}$ and $d_{a}$ represent the dimensions of the visual and audio features, respectively. The concatenated features $E_{i}^{v,a}$ are then passed through a linear layer $W_{c}\in\mathbb{R}^{(d_{v}+d_{a})\times d}$ with a ReLU activation function, generating an audio-visual input token $X^{c}$:

$X^{c}\in\mathbb{R}^{K\times d}:\ X_{i}^{c}=\mathrm{ReLU}(E_{i}^{v,a}W_{c}).$ (25)

For textual information, we feed the micro video's textual description into the pre-trained language model AnglE to produce a sequence of textual embeddings $E^{t}_{i}\in\mathbb{R}^{n_{w}\times d_{t}}$, where $n_{w}$ is the word length of the textual description. The textual embedding $E^{t}_{i}$ is then passed through a linear layer $W_{t}\in\mathbb{R}^{d_{t}\times d}$, generating the textual input token $X^{t}$:

$X^{t}\in\mathbb{R}^{n_{w}\times d}:\ X_{i}^{t}=\mathrm{ReLU}(E_{i}^{t}W_{t}).$ (26)

Appendix 3: Cross-modal Bipolar Interaction

Aligning audio-visual and textual modalities in micro videos presents challenges due to potential inconsistencies between textual descriptions and video content. To address this, we implement a cross-attention network comprising both positive and negative attention modules, designed to capture the similarities and differences between multi-modal information. The positive attention module focuses on identifying the most consistent features across different modalities, while the negative attention module highlights any inconsistent or contradictory information.

Within the positive attention module, the most aligned features between modalities are calculated using cross-modal attention vectors. For a given video $v_{i}$, the audio-visually guided positive textual features $T_{i}^{\mathcal{P}}$ and the textually guided positive audio-visual features $C_{i}^{\mathcal{P}}$ are derived as follows:

$T_{i}^{\mathcal{P}} =ATT^{\mathcal{P}}\left(X_{i}^{c}W^{\mathcal{Q}}_{\mathcal{P}},\ X_{i}^{t}W^{\mathcal{K}}_{\mathcal{P}},\ X_{i}^{t}W^{\mathcal{C}}_{\mathcal{P}}\right)$ (27)
$=\mathrm{Softmax}\left(\alpha\frac{\mathcal{Q}\mathcal{K}^{T}}{\sqrt{d}}\right)\mathcal{C},$ (28)
$C_{i}^{\mathcal{P}} =ATT^{\mathcal{P}}\left(X_{i}^{t}W^{\mathcal{Q}}_{\mathcal{P}},\ X_{i}^{c}W^{\mathcal{K}}_{\mathcal{P}},\ X_{i}^{c}W^{\mathcal{T}}_{\mathcal{P}}\right)$ (29)
$=\mathrm{Softmax}\left(\alpha\frac{\mathcal{Q}\mathcal{K}^{C}}{\sqrt{d}}\right)\mathcal{T},$ (30)

where $W^{\mathcal{Q}}_{\mathcal{P}},W^{\mathcal{K}}_{\mathcal{P}},W^{\mathcal{C}}_{\mathcal{P}},W^{\mathcal{T}}_{\mathcal{P}}$ denote the query, key, and two value projection matrices. The parameter $\alpha$ is used to balance the influence of positive and negative attention. Similarly, the positive audio-visual features guided by textual information can be obtained using the same method.

Negative attention is designed to highlight inconsistencies across different modalities. These negative attention scores are calculated using scaled dot-product attention, with a negative constant applied before the Softmax function. The process can be summarized as follows:

$T_{i}^{\mathcal{N}} =ATT^{\mathcal{N}}\left(X_{i}^{c}W^{\mathcal{Q}}_{\mathcal{N}},\ X_{i}^{t}W^{\mathcal{K}}_{\mathcal{N}},\ X_{i}^{t}W^{\mathcal{C}}_{\mathcal{N}}\right)$ (31)
$=\mathrm{Softmax}\left(\gamma\frac{\mathcal{Q}\mathcal{K}^{T}}{\sqrt{d}}\right)\mathcal{C},$ (32)
$C_{i}^{\mathcal{N}} =ATT^{\mathcal{N}}\left(X_{i}^{t}W^{\mathcal{Q}}_{\mathcal{N}},\ X_{i}^{c}W^{\mathcal{K}}_{\mathcal{N}},\ X_{i}^{c}W^{\mathcal{T}}_{\mathcal{N}}\right)$ (33)
$=\mathrm{Softmax}\left(\gamma\frac{\mathcal{Q}\mathcal{K}^{C}}{\sqrt{d}}\right)\mathcal{T},$ (34)

where $W^{\mathcal{Q}}_{\mathcal{N}},W^{\mathcal{K}}_{\mathcal{N}},W^{\mathcal{C}}_{\mathcal{N}},W^{\mathcal{T}}_{\mathcal{N}}$ denote the query, key, and two value projection matrices, respectively, and $\gamma$ is defined as $\gamma=-(1-\alpha)$. Following the hidden-state integration of the MMRA model, we incorporate audio-visual hidden states into textual hidden states within the FFN layers to generate comprehensive textual and audio-visual modal representations $\widetilde{T_{i}}$ and $\widetilde{C_{i}}$, which modify the FFN computation as follows:

$\widetilde{T_{i}} =\mathrm{ReLU}\left(X_{i}^{t}+\left(C_{i}^{\mathcal{P}}\oplus C_{i}^{\mathcal{N}}\right)W^{t}_{1}\right)W^{t}_{2},$ (35)
$\widetilde{C_{i}} =\mathrm{ReLU}\left(X_{i}^{c}+\left(T_{i}^{\mathcal{P}}\oplus T_{i}^{\mathcal{N}}\right)W^{c}_{1}\right)W^{c}_{2},$ (36)

where $W^{t}_{1},W^{c}_{1}\in\mathbb{R}^{2d\times d}$ and $W^{t}_{2},W^{c}_{2}\in\mathbb{R}^{d\times d}$ are learnable weight matrices, and $\oplus$ denotes the concatenation operation. Finally, we obtain the expressive representations $T_{i}$ and $C_{i}$ via the attentive pooling strategy.

This dual-path approach ensures a synthesis of audio-visual and textual modalities, each subjected to a consistent and rigorous extraction process. This method not only captures the intrinsic characteristics of micro videos but also lays a solid foundation for subsequent multifaceted analyses and applications, thereby enhancing the depth and breadth of our understanding of video content.

Figure 7: Cross-modal Attention Structure

Appendix 4: Retrieval Interaction Enhancement

We focus on extracting valuable insights from the retrieved relevant instances to improve micro video popularity prediction (MVPP). To achieve this, we use an aggregation function that combines the comprehensive representations of these instances from the memory bank. This function uses the normalized similarity scores obtained during retrieval as attention weights to construct the retrieved embeddings $X_{i}^{r,c}$ and $X_{i}^{r,t}$. Instances with higher similarity scores are prioritized, highlighting their relevance to the target micro video.

Similarly, we calculate $\widetilde{T_{i}^{r}}$ and $\widetilde{C_{i}^{r}}$, and then derive the final representations $T_{i}^{r}$ and $C_{i}^{r}$ using an attentive pooling strategy. To capture the popularity trends of the retrieved instances, we encode the label information through linear layers and aggregation, resulting in the retrieved label embedding $L_{i}^{r}$. Finally, AUG-MMRA integrates all features from $T_{i}^{r}$ and $C_{i}^{r}$ to model cross-sample interactions. These feature interactions are constructed as

$\mathcal{I}=[\Phi(C_{i},C_{i}^{r}),\Phi(C_{i},T_{i}^{r}),\dots,\Phi(T_{i},L_{i}^{r})],$ (37)

where $\Phi$ denotes the inner product.

Appendix 5: Multi-modal Information Based SASRec

In our micro-video popularity prediction model, we adopt SASRec (Self-Attention Sequence Recommender), a Transformer-based architecture, to capture users’ sequential behavior patterns. SASRec is crucial for modeling both short-term and long-term user preferences, incorporating multimodal item features to enhance recommendation accuracy.

Our model processes various modalities, including video frames ($F_{i}$), cover images ($I_{i}$), and titles ($T_{i}$), which are extracted using pre-trained encoders. The user interaction histories are fed into the SASRec module, where self-attention layers encode sequential user behaviors, producing a sequence-based representation of user-item interactions. Simultaneously, the multimodal features are integrated to construct the item scoring representation item_scoring. The user embedding $U_{i}$, which encapsulates the user's behavior sequence, is then combined with this multimodal item representation.

For each user, we calculate a score for each item by computing the dot product between the user embedding $U_{i}$ and the multimodal item scoring representation item_scoring:

$\text{score}_{j}=U_{i}\cdot\text{item\_scoring}$ (38)

The final score for an item across all users is obtained by averaging these individual scores:

$\text{avg\_score}_{j}=\frac{1}{N}\sum_{i=1}^{N}\text{score}_{j}$ (39)

where $N$ represents the total number of users. These averaged scores are converted into probabilities using a sigmoid function to predict the likelihood of interaction:

$\text{Score}_{\text{SAS}}=\sigma(\text{avg\_score}_{j})$ (40)

The model is trained by minimizing the cross-entropy loss between the predicted scores and actual user interactions (positive and negative samples). By integrating multimodal item features (video, image, text), SASRec provides a rich representation of user-item interactions. The model optimizes its predictions through backpropagation, adjusting parameters based on user behavior sequences.

Appendix 6: Mamba Architecture

Denote the sequence $\mathcal{S}=\{(t_{1},r_{1}),(t_{2},r_{2}),\dots,(t_{n},r_{n})\}$ as the temporal users' reaction sequence, where $R$ is the number of event types. Let $\Delta_{i}=t_{i}-t_{i-1}$ represent the temporal differences, with $\Delta_{1}=t_{1}$ by convention, so the temporal sequence is represented by

$\Delta=(\Delta_{1},\Delta_{2},\dots,\Delta_{n}).$

Additionally, let $\mathbf{r}_{i}$ be the one-hot vector of the events. Inspired by the discretization

$h(t+\Delta t)=\overline{\mathbf{A}}h(t)+\overline{\mathbf{B}}u(t),$ (41)

which relates hidden states separated by a time gap $\Delta t$, we insert the temporal differences directly into this equation to construct our Mamba Hawkes Process (MHP) structure. We now state the construction.

Algorithm 1: The architecture of Mamba Hawkes Process

Input: The temporal sequence $\{(t_{1},r_{1}),(t_{2},r_{2}),\dots,(t_{n},r_{n})\}$
Output: The hidden state $\mathbf{h}$

1: $A:(\mathcal{B},\mathcal{L},\mathcal{D})\leftarrow\text{Parameter}$ // $N\times N$ matrix
2: $B:(\mathcal{L},\mathcal{D})\leftarrow\text{Linear}_{B}(x_{t_{i}})$
3: $C:(\mathcal{L},\mathcal{D})\leftarrow\text{Linear}_{C}(x_{t_{i}})$ // $B$ and $C$ are time dependent
4: $\Delta\leftarrow t_{i}-t_{i-1}$ // $\Delta$ records the temporal information
5: $\bar{A},\bar{B}\leftarrow\text{Discretize}(A,B,\Delta)$
6: $\mathbf{h}\leftarrow\mathbf{SSM}(\bar{A},\bar{B},C,\Delta)(x)$
7: return $\mathbf{h}$

$^{\ast}$ $\mathcal{B}$: batch size, $\mathcal{L}$: sequence length, $\mathcal{D}$: input vector size

Let WeW^{e} be the event embedding matrix with dimensions D×RD\times R, where DD is the dimension of the hidden layers of Mamba blocks. The event embedding is defined as xti=𝐫i(We)Tx_{t_{i}}=\mathbf{r}_{i}(W^{e})^{T}

(xt1,xt2,,xtn)=(𝐫1,𝐫2,,𝐫n)(We)T(x_{t_{1}},x_{t_{2}},...,x_{t_{n}})=(\mathbf{r}_{1},\mathbf{r}_{2},...,\mathbf{r}_{n})(W^{e})^{T}

In the Mamba architecture, the matrices Δ,𝐁\Delta,\mathbf{B} and 𝐂\mathbf{C} are time-dependent and are obtained by linear projection from xtx_{t}. However, for the Hawkes Process, the approach is different, as it requires the use of temporal features. Specifically, we make 𝐁\mathbf{B} and 𝐂\mathbf{C} time-dependent, and Δ=(Δ1,Δ2,,Δn)\Delta=(\Delta_{1},\Delta_{2},...,\Delta_{n}) is defined by the temporal differences as above. Following Equation 41, we have Δi=titi1\Delta_{i}=t_{i}-t_{i-1}, thus we replace tt and t+Δtt+\Delta t by ti1t_{i-1} and tit_{i}. We define

𝐁(ti)=Linear𝐁(xti),𝐂(ti)=Linear𝐂(xti)\mathbf{B}(t_{i})=\text{Linear}_{\mathbf{B}}(x_{t_{i}}),\mathbf{C}(t_{i})=\text{Linear}_{\mathbf{C}}(x_{t_{i}})

which are obtained by linear transformations of the vector x_{t_{i}}. The transition formulas for the Mamba Hawkes process are:

z_{t_{i}} = \overline{\mathbf{A}}(t_{i})\,z_{t_{i-1}}+\overline{\mathbf{B}}(t_{i})\,x_{t_{i}}, \qquad (42)
y_{t_{i}} = \mathbf{C}(t_{i})\,z_{t_{i}},

where

\overline{\mathbf{A}}(t_{i}) = \exp(\Delta_{i}\mathbf{A}), \qquad (43)
\overline{\mathbf{B}}(t_{i}) = (\Delta_{i}\mathbf{A})^{-1}\left(\exp(\Delta_{i}\mathbf{A})-\mathbf{I}\right)\cdot\Delta_{i}\mathbf{B}(t_{i}),

are the time-dependent coefficients. Hence, the temporal information is incorporated into our recurrence process.
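To make Equations 42 and 43 concrete, here is a minimal sketch assuming (as is standard in Mamba) a diagonal state matrix A, so that \exp(\Delta_{i}\mathbf{A}) reduces to an elementwise exponential; the names A_diag, B, C, x, and delta are illustrative placeholders rather than the authors' code.

```python
import torch

def mhp_recurrence(A_diag, B, C, x, delta):
    """Recurrence of Eqs. 42-43 with a diagonal state matrix A.

    A_diag : (D_state,)   diagonal entries of A (typically negative for stability)
    B, C   : (n, D_state) time-dependent projections B(t_i), C(t_i)
    x      : (n,)         input x_{t_i} (a single channel for brevity)
    delta  : (n,)         temporal differences Delta_i = t_i - t_{i-1}
    """
    n, d = B.shape
    z = torch.zeros(d)
    outputs = []
    for i in range(n):
        A_bar = torch.exp(delta[i] * A_diag)            # exp(Delta_i A)
        B_bar = (A_bar - 1.0) / A_diag * B[i]           # (Delta_i A)^{-1}(exp(Delta_i A) - I) Delta_i B(t_i)
        z = A_bar * z + B_bar * x[i]                    # Eq. 42: state update z_{t_i}
        outputs.append((C[i] * z).sum())                # Eq. 42: output y_{t_i} = C(t_i) z_{t_i}
    return torch.stack(outputs)
```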

We then obtain the final output, denoted \mathbf{O}=(\mathbf{o}_{1},\mathbf{o}_{2},\dots,\mathbf{o}_{n}). The output \mathbf{O} is fed through a neural network, generating hidden representations \mathbf{h}(t) of the input event sequence:

\mathbf{H}=\mathrm{ReLU}(\mathbf{O}W_{1}+b_{1})W_{2}+b_{2},\qquad\mathbf{h}(t_{j})=\mathbf{H}(j,:) \qquad (44)

where W_{i}, b_{i}\ (i=1,2) are the parameters of the neural network. The resulting matrix \mathbf{H} contains hidden representations of all the events in the input sequence, where each row corresponds to a particular event.
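A minimal sketch of Equation 44 (with placeholder layer sizes, not those used in the paper) is simply a two-layer feed-forward network applied row-wise to \mathbf{O}:

```python
import torch
import torch.nn as nn

d_model, d_hidden = 64, 128            # placeholder sizes; not specified above

ffn = nn.Sequential(
    nn.Linear(d_model, d_hidden),      # O W_1 + b_1
    nn.ReLU(),
    nn.Linear(d_hidden, d_model),      # (.) W_2 + b_2
)

O = torch.randn(10, d_model)           # illustrative output sequence with n = 10 events
H = ffn(O)                             # H[j, :] corresponds to h(t_j)
```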

To avoid “peeking into the future”, our algorithm is equipped with masks. That is, when computing the output \mathbf{h}(t_{j})=\mathbf{H}(j,:) (the j-th row of \mathbf{H}), we mask all future positions; in practice, we use only the first 5 user-item interactions for training. This prevents the model from assigning dependencies to future events.

The intensity function of the Mamba Hawkes process is given by

\lambda(t)=\sum_{r=1}^{R}\lambda_{r}(t), \qquad (45)

where \lambda_{r} is the intensity function of the r-th event type, and

\lambda_{r}(t)=f_{r}\!\left(\alpha_{r}(t-t_{j})+\mathbf{w}_{r}^{T}\mathbf{h}(t_{j})+b_{r}\right), \qquad (46)

where t is defined on the interval t\in[t_{j},t_{j+1}), and f_{r}(x)=\beta_{r}\log\left(1+\exp(x/\beta_{r})\right) is the Softplus function. The log-likelihood of a sequence \mathcal{S} is given by

\ell(\mathcal{S})=\underbrace{\sum_{j=1}^{n}\log\lambda(t_{j}\,|\,\mathcal{H}_{j})}_{\text{event log-likelihood}}-\underbrace{\int_{t_{1}}^{t_{n}}\lambda(t\,|\,\mathcal{H}_{t})\,dt}_{\text{non-event log-likelihood}}. \qquad (47)
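The non-event integral in Equation 47 generally has no closed form; a common choice for Hawkes-type models is a Monte Carlo estimate. The sketch below illustrates Equations 45-47 under that assumption, with placeholder names (alpha, w, b, beta, lam_fn) that are not taken from the released code.

```python
import torch
import torch.nn.functional as F

def intensities(t, t_j, h_tj, alpha, w, b, beta):
    """Eqs. 45-46: per-type intensities lambda_r(t) for t in [t_j, t_{j+1})."""
    z = alpha * (t - t_j) + w @ h_tj + b       # argument of f_r for each event type r
    lam_r = beta * F.softplus(z / beta)        # f_r(x) = beta_r log(1 + exp(x / beta_r))
    return lam_r, lam_r.sum()                  # lambda_r(t) and total lambda(t)

def log_likelihood(event_times, lam_at_events, lam_fn, n_mc=100):
    """Eq. 47: event term minus a Monte Carlo estimate of the non-event integral."""
    event_term = torch.log(lam_at_events).sum()
    t1, tn = event_times[0], event_times[-1]
    samples = t1 + (tn - t1) * torch.rand(n_mc)                      # uniform samples in [t_1, t_n]
    integral = (tn - t1) * torch.stack([lam_fn(t) for t in samples]).mean()
    return event_term - integral
```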

Log-likelihood function

Here, the log-likelihood function plays two important roles:

  • as the loss function for our pretrained model

  • as a measure for evaluating the reaction ratio of users

We scraped a total of 19,000 user interaction records (i.e., likes, comments, shares, etc.) from 275 micro-videos on TikTok. These interaction sequences

\mathcal{S}_{i}(\text{type},\text{timestamp})\quad(i=1,2,3,\dots,N)

which are timestamped, are used to train the Mamba Hawkes process module. Model parameters are learned by maximizing the log-likelihood across all sequences. Concretely, given N sequences \mathcal{S}_{1},\mathcal{S}_{2},\ldots,\mathcal{S}_{N}, the goal is to find parameters that solve

\max\ \sum_{i=1}^{N}\ell(\mathcal{S}_{i}). \qquad (48)

This optimization problem can be solved efficiently with stochastic gradient algorithms such as Adam. Additionally, techniques that help stabilize training, such as layer normalization and residual connections, are also applied.
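A minimal training sketch for Equation 48, assuming a hypothetical model object that returns \ell(\mathcal{S}_{i}) for one interaction sequence, could look as follows (Adam minimizes the negative total log-likelihood):

```python
import torch

def train_mhp(model, sequences, epochs=10, lr=1e-3):
    """Maximize the summed log-likelihood (Eq. 48) by minimizing its negative.

    `model` and `sequences` are placeholders: model(seq) is assumed to return
    the log-likelihood ell(S_i) of Eq. 47 for one interaction sequence.
    """
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for seq in sequences:
            loss = -model(seq)         # negative log-likelihood of one sequence
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```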

We assume that the TikTok platform has a stable recommendation algorithm \mathcal{A} that adjusts the recommendation strength of a video i based on the frequency and intensity of user feedback, among other factors, which we denote \mathbf{H}_{i}. The resulting recommendation intensity \mathbf{I}_{i}=\mathcal{A}(\mathbf{H}_{i}), in turn, affects the feedback strength of other users regarding the video. Therefore, under the combined effect of platform recommendations and user choices, the user feedback sequence for each video exhibits a stable distribution \mathcal{F}.

After training on more than 19,000 such feedback interactions, we believe that this pre-trained model has successfully learned this stable distribution. Thus, we treat the pretrained Mamba Hawkes process as an evaluation module. For a video v_{i} in the MicroLens-100k set \mathbf{V}, we assign it the likelihood:

\text{Likelihood}_{\mathrm{MHP}}=\ell(v_{i}),\quad v_{i}\in\mathbf{V}. \qquad (49)

We use this likelihood to assign a comprehensive score, based on the user interaction sequences corresponding to videos in the MicroLens-100k dataset, that reflects the influence of platform recommendations on the popularity of short videos. We then use this score as a weighting factor, concatenating it with the multimodal information for subsequent training.
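One plausible way to realize this weighting-and-concatenation step is sketched below; the fusion details (sigmoid normalization, tensor names) are illustrative assumptions rather than the exact MUFM design.

```python
import torch

def fuse_features(multimodal_feat, mhp_loglik):
    """Scale multimodal features by a normalized MHP score and append the score.

    multimodal_feat : (B, D) fused video/audio/text representation
    mhp_loglik      : (B,)   per-video log-likelihood from the pretrained MHP
    """
    weight = torch.sigmoid(mhp_loglik).unsqueeze(-1)   # map the score into (0, 1)
    weighted = multimodal_feat * weight                # use the score as a weighting factor
    return torch.cat([weighted, weight], dim=-1)       # concatenate the score for downstream layers
```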

Appendix 7: Datasets

Table 5: Statistics of the dataset.
Dataset          #Video         #User     #Train   #Val    #Test
MicroLens-100k   19,724^{\ast}  100,000   15,779   1,973   1,972

^{\ast}: 14 videos were discarded due to file corruption.

To evaluate the effectiveness of RecMMR, we conduct experiments on the micro-video dataset MicroLens-100k; its descriptive statistics are summarized in Table 5. MicroLens-100k consists of 19,738 unique micro-videos viewed by 100,000 users from various online video platforms.

Table 6: Dataset comparison. “r-Image” refers to images with raw image pixels. “Audio” and “Video” mean the original full-length audio and video content.
Dataset      Text  r-Image  Audio  Video  #User  #Item
Tenrec        ✗      ✗       ✗      ✗     6.41M  4.11M
Flickr        ✗      ✗       ✗      ✗     8K     105K
Pinterest     ✗      ✓       ✗      ✗     46K    880K
WikiMedia     ✗      ✓       ✗      ✗     1K     10K
Behance       ✗      ✗       ✗      ✗     63K    179K
KuaiRand      ✗      ✗       ✗      ✗     27K    32.03M
KuaiRec       ✗      ✗       ✗      ✗     7K     11K
Reasoner      ✓      ✓       ✗      ✗     3K     5K
MicroLens     ✓      ✓       ✓      ✓     30M    1M

We attempted to find other datasets to enrich our experiments, but among the publicly acknowledged datasets it is currently difficult to find a well-structured multi-modal micro-video dataset at the scale of MicroLens-100k. A detailed comparison of the datasets is provided in Table 6.

Appendix 8: Baseline Models

Here we elaborate on how the baseline models work.

  • SVR: Uses Gaussian kernel-based Support Vector Regression (SVR) to predict micro-video popularity.

  • HyFea: Employs the CatBoost tree model, leveraging multiple features (image, category, space-time, user profile, tags) for accurate prediction.

  • Contextual-LSTM: Integrates contextual features into the LSTM model, capturing long-range context to improve prediction accuracy.

  • TMALL: Introduces a common space to handle modality relatedness and limitations, enhancing popularity prediction.

  • MASSL: Utilizes a multi-modal variational auto-encoder model, capturing cross-modal correlations to predict popularity.

  • MTFM: Combines fuzzy trend matching and Informer in a multi-step prediction model, forecasting popularity trends.

  • HMMVED: Extracts and fuses multi-modal features through a variational information bottleneck for better prediction.

  • CBAN: Applies cross-modal bipolar attention mechanisms, effectively capturing correlations in multi-modal data to enhance prediction.

  • MMRA: Enhances prediction by retrieving relevant instances from a multi-modal memory bank and augmenting the prediction process.