Mamba Hawkes Process

Anningzhe Gao²²2The first two authors contributed equally to this work.
Shenzhen Research Institute of Big Data
[email protected]
&Shan Dai²²2The first two authors contributed equally to this work.
Shenzhen Research Institute of Big Data
[email protected]
&Yan Hu¹¹1Corresponding author.
The Chinese University of Hong Kong, Shenzhen
[email protected]

Abstract

Irregular and asynchronous event sequences are prevalent in many domains, such as social media, finance, and healthcare. Traditional temporal point processes (TPPs), like Hawkes processes, often struggle to model mutual inhibition and nonlinearity effectively. While recent neural network models, including RNNs and Transformers, address some of these issues, they still face challenges with long-term dependencies and computational efficiency. In this paper, we introduce the Mamba Hawkes Process (MHP), which leverages the Mamba state space architecture to capture long-range dependencies and dynamic event interactions. Our results show that MHP outperforms existing models across various datasets. Additionally, we propose the Mamba Hawkes Process Extension (MHP-E), which combines Mamba and Transformer models to enhance predictive capabilities. We present the novel application of the Mamba architecture to Hawkes processes, a flexible and extensible model structure, and a theoretical analysis of the synergy between state space models and Hawkes processes. Experimental results demonstrate the superior performance of both MHP and MHP-E, advancing the field of temporal point process modeling.

1 Introduction

Humans and natural phenomena often produce large volumes of irregular and asynchronous event sequences. Examples of these include user activities on social media platforms [33, 8], transaction histories in financial markets [1, 16], electronic health records [29], and earthquake occurrences with aftershocks in geophysics [26]. Unlike traditional sequential data, such as time series, asynchronous event sequences are characterized by irregular intervals between events, which are as critical as the sequence order in describing their dynamics.

Temporal point processes (TPPs) [4, 2] are a common modeling approach for asynchronous event sequences, defined by their intensity functions. Among TPPs, Hawkes processes are particularly notable due to their non-negative intensity functions that capture the triggering effects of previous events. However, traditional Hawkes processes have significant limitations. They overlook mutual inhibition between events, an essential factor in many real-world scenarios, and they lack robust nonlinear fitting capabilities, restricting their expressive power. To overcome such limitations, researchers have proposed likelihood-free methods [30, 21] and non-parametric models like kernel methods and splines [38].

The advancement of neural networks and deep learning has further revolutionized sequence modeling. Recurrent Neural Networks (RNNs) have been particularly effective, leading to the development of RNN-based Hawkes process models. This approach allows for neurally self-modulating multivariate point processes by not requiring historical contributions to be additive, and it enables the modeling of complex memory effects, such as delays.

Despite these advances above, RNN-based models have notable drawbacks. Even with mechanisms like Long Short-Term Memory (LSTM) [17] and Gated Recurrent Units (GRUs) [3], RNNs struggle with long-term dependencies. Additionally, training deep RNNs, including LSTMs, is notoriously difficult due to challenges like gradient explosion and vanishing gradients [27]. To address these issues, the Transformer Hawkes Process (THP) model was proposed [39], leveraging a pure transformer architecture without RNNs or CNNs, and achieving state-of-the-art performance. However, attention-based transformers also encounter limitations in modeling long input sequences, especially when dependencies extend far beyond the attention window. This kind of in-context constraint has proved to be more crucial in prediction tasks of long sequence data, as shown in [10].

Recently, structured state space sequence models (SSMs) [13] have emerged as a promising class of architectures for sequence modeling. The Mamba model [11], a selective state space model, addresses data-dependent context compression in sequence modeling. Unlike attention mechanisms, Mamba utilizes state space constructs to encode context using hidden states during recurrent scans. The selection mechanism determines which parts of the input influence the hidden states, thereby guiding subsequent embedding updates. The dynamics of temporal point processes are described by a continuous conditional intensity function. Mamba shares the property of continuous dynamic mechanism modeling, which matches the continuous conditional intensity nature of TPPs. However, adapting Mamba to model Hawkes processes requires careful architectural design. Thus, inspired by series of work [14, 13, 11], we propose Mamba Hawkes Process model. Our contributions are as follows:

•

Mamba for Hawkes Process: Our primary contribution lies in the innovative application of the Mamba architecture to model Hawkes Processes. To the best of our knowledge, this is the first instance in the literature where the Mamba framework, renowned for its ability to capture long sequence dependencies, has been adapted to address the unique challenges of temporal point processes. The resulting Mamba Hawkes Process (MHP) model demonstrates state-of-the-art performance, surpassing existing benchmarks across a variety of datasets.
•

Flexibility and Extensibility of the Model Structure: Our architecture is versatile and can be integrated into other models. As an extension, we have combined the Mamba architecture with the Transformer model, creating a hybrid encoder that concurrently processes temporal and event-based features. This novel architecture, named the Mamba Hawkes Process Extension (MHP-E), offers an advanced representation of Hawkes Processes and demonstrates enhanced predictive capabilities. The MHP-E highlights the potential of combining sequence modeling techniques of Mamba with attention mechanisms or other neural network architectures to improve the analysis of complex temporal patterns.

2 Related work

2.1 Neural Hawkes Process

Hawkes processes are widely used for temporal prediction in various fields. To enhance their performance, many deep learning approaches have been applied. [7] introduced the Recurrent Marked Temporal Point Process (RMTPP) model, which uses recurrent neural networks (RNNs) to learn the influence of event history on the intensity function. [31] employed two RNNs to model event sequences: one for background intensity and another for the impact of historical events, enabling effective end-to-end training. Similarly, [24] proposed a continuous-time LSTM model to capture the self-modulating nature of Hawkes processes, addressing the inhibiting and exciting effects of prior events.

With the development of self-attention mechanisms, self-attention-based neural Hawkes processes were proposed. [36] utilized self-attention to enhance these processes, while [39] used the transformer encoder to convert sequence data into continuous conditional intensity functions. UTHP [34] incorporated RNNs and CNNs to address issues in THP, such as parallel processing and recursive learning. TAA-THP [35] improved attention structures by incorporating temporal encoding. Lastly, [32] proposed Hawkesformer, linking hierarchical attention mechanisms to Hawkes processes for information cascade prediction.

2.2 State Space Models

State space models (SSMs) were initially developed as mathematical tools to describe dynamic systems in modern control theory. With the introduction of HiPPO [12] initialization, the Linear State-Space Layer (LSSL) [14] demonstrated the capability to handle long-range dependencies in sequential data. However, LSSL’s computational and memory overhead make it impractical for large-scale applications. Structured state space models (S4) [13] address these issues by using reparameterization to enhance computational efficiency, providing an effective alternative to traditional attention mechanisms.

Several recent variants of S4 have been proposed to achieve linear time attention, including H3 [9], Gated State Space [23], Hyena [25], and RWKV [28]. Mamba [11] introduces a data-dependent selection mechanism to S4, improving the capture of long-range context as sequence length increases. Mamba not only achieves linear time efficiency in long-sequence modeling but also outperforms Transformers across various benchmarks. Recently, Jamba [22] has been introduced as a novel hybrid model that combines Transformer and Mamba layers in a mixture-of-experts (MoE) architecture. Jamba interleaves blocks of Transformer and Mamba layers, harnessing the strengths of both model families.

3 Background

3.1 Temporal Point Processes

A Temporal Point Process (TPP) [6, 5] is a stochastic process that defines a probability distribution over event sequences. In this process, the number of points (events) $K$ and their locations (arrival times) $t_{i}$ are random. The realization of a TPP can be represented as a sequence of discrete events with $\{t_{i}\}\in\mathbb{R}^{+}$ and $i\in\mathbb{Z}^{+}$ abstracted as points on a timeline. We can represent a TPP realization by a counting measure $N(t)=\sum_{i}^{n}\mathbb{I}(t_{i}<t)$ , for $t\in[0,T]$ . The intensity characterzing a TPP can be interprested as the expected number of events per unit of time and is defined as:

\lambda(t|\mathcal{H}_{t})=\lim\limits_{\Delta t\downarrow 0}\frac{\mathbb{E}[N(t+\Delta t)-N(t)|\mathcal{H}_{t}]}{\Delta t},

(1)

where $\mathcal{H}_{t}=\{t_{i}:t_{i}<t\}$ is the event history until time $t$ , which acts as a filtration to the process.

3.2 Hawkes Processes.

The Hawkes process [15, 19] is a typical temporal point process, and it models past events and predicts the timestamp of the following event by its conditional intensity function, which is defined as:

\lambda(t)=\mu+\sum\limits_{j:t_{i}<t}\psi(t-t_{i}),

(2)

where $\mu\geq 0$ , named base intensity, is an exogenous component of the intensity function independent of the history, while $\psi(t)>0$ is an endogenous component dependent on the history. Besides, $\psi(t)$ is a triggering kernel containing the peer influence of past events. Due to the self-exciting charities, the Hawkes process has recently received wide attention in event sequence modeling.

3.3 State Space Models

The structured state space sequence models (S4) [13] is defined by the simple equation 3, which maps a 1-dimensional function or sequence $x(t)\in\mathbb{R}\mapsto y(t)\in\mathbb{R}$ through an implicit latent state $h(t)\in\mathbb{R}^{N}$ :

	$\displaystyle h^{\prime}(t)$	$\displaystyle=\mathbf{A}h(t)+\mathbf{B}x(t),$		(3)
	$\displaystyle y(t)$	$\displaystyle=\mathbf{C}h(t),$		(3)

where $\mathbf{A}\in\mathbb{R}^{N\times N},\mathbf{B}\in\mathbb{R}^{N\times 1},\mathbf{C}\in\mathbb{R}^{1\times N}$ are parameters of neural networks in deep learning. To deal with the discrete input sequence $\boldsymbol{x}=(x_{0},x_{1},...)\in\mathbb{R}^{L}$ , S4 discretizes these parameters in Eq. (3) using a step size $\Delta$ , where the continuous parameters $\mathbf{A},\mathbf{B}$ are converted into discrete parameters $\overline{\mathbf{A}}=f_{A}(\Delta,\mathbf{A}),\overline{\mathbf{B}}=f_{B}(\Delta,\mathbf{B})$ , where the pair $(f_{A},f_{B})$ is called a discretization rule [13]. Various rules can be used such as the zero-order hold (ZOH) defined as follows:

	$\displaystyle\overline{\mathbf{A}}$	$\displaystyle=\exp(\Delta\mathbf{A}),$		(4)
	$\displaystyle\overline{\mathbf{B}}$	$\displaystyle=(\Delta\mathbf{A})^{-1}(\exp(\Delta\mathbf{A})-\mathbf{I})\cdot\Delta\mathbf{B}.$		(4)

Then, this recurrent rule can be expanded as:

	$\displaystyle\overline{\mathbf{K}}$	$\displaystyle=(\mathbf{C}\overline{\mathbf{B}},\mathbf{C}\overline{\mathbf{AB}},\ldots,\mathbf{C}\overline{\mathbf{A}}^{L-1}\overline{\mathbf{B}}),$		(5)
	$\displaystyle\boldsymbol{y}$	$\displaystyle=\boldsymbol{x}*\overline{\mathbf{K}},$		(5)

where $L$ denotes the length of the input sequence $\boldsymbol{x}$ and $\overline{\mathbf{K}}\in\mathbb{R}^{L}$ is the convolution kernel.

A recent development in state space layers is selective SSMs [11] (S6). These models utilize time-variant SSMs, namely, where the discrete matrices $\bar{A},\bar{B},$ and $C$ of each channel are modified over $L$ time steps depending on the input sequence. Unlike traditional state-space layers, which operate individually on each channel, selective state-space layers compute the SSM matrices $\bar{A}_{i},\bar{B}_{i},C_{i}$ for all $i\leq L$ based on all the channels, and then apply the time-variant recurrent rule individually for each channel.

Thus, we denote the entire input sequence by $\hat{x}:=(\hat{x}_{1},\cdots,\hat{x}_{L})\in\mathbb{R}^{L\times D}$ where $\hat{x}_{i}\in\mathbb{R}^{D}$ . The per-time discrete matrices $\bar{A_{i}},\bar{B_{i}},$ and $C_{i}$ are defined as follows:

B_{i}=S_{B}(\hat{x}_{i}),\quad C_{i}=S_{C}(\hat{x}_{i}),\quad\Delta_{i}=\text{Softplus}(S_{\Delta}(\hat{x}_{i}))

(6)

f_{A}(\Delta_{i},A)=\exp(\Delta_{i}A),\quad f_{B}(\Delta_{i},A,B_{i})=(\Delta_{i}{A})^{-1}(\exp(\Delta_{i}{A})-\mathbf{I})\cdot\Delta_{i}B_{i},

(7)

\bar{A}_{i}=f_{A}(\Delta_{i},A),\quad\bar{B}_{i}=f_{B}(\Delta_{i},A,B_{i})

(8)

where $f_{A},f_{B}$ represents the discretization rule, $S_{B},S_{C},S_{\Delta}$ are linear projection layers, and softplus is an elementwise function that is a smooth approximation of ReLU.

4 Methodology

The primary challenge in modeling event sequence data revolves around several aspects. Firstly, there is the question of how to efficiently handle extended event sequences while effectively capturing the complex evolving dynamics of the intensity function that drives the sequence. Additionally, it is crucial to capture long-range event transition dependencies within the sequence, particularly considering the self-exciting properties inherent to a large class of point processes that may exhibit interactions between events located far apart in the temporal domain. To address these challenges, we propose the Mamba Hawkes Process model.

4.1 Mamba Hawkes Process

Denote the sequence $\mathcal{S}=\{(t_{1},k_{1}),(t_{2},k_{2}),...,(t_{n},k_{n})\}$ as the temporal event sequence, where $K$ is the number of events. Let $\Delta_{i}=t_{i}-t_{i-1}$ represent the temporal differences, with $\Delta_{1}=t_{1}$ for convention, hence the temporal sequence is represented by

\Delta=(\Delta_{1},\Delta_{2},...,\Delta_{n})

Additionally, let $\mathbf{k}_{i}$ be the one-hot vector of the events. Inspired by the discretization

h(t+\Delta t)=\overline{\mathbf{A}}h(t)+\overline{\mathbf{B}}u(t)

(9)

the hidden states deffer by time gap $\Delta t$ are related by the formula, we put the temporal differences into the equation directly to construct our Mamba Hawkes Process (MHP) structure. Now we state our construction.

Algorithm 1 The architecture of Mamba Hawkes Process

Input: The temporal sequence $\{(t_{1},k_{1}),(t_{2},k_{2}),...,(t_{n},k_{n})\}$
Output: The hidden state $\mathbf{h}$

A:(B,L,D)\leftarrow\text{Parameter}

N\times N

matrix

B:(L,D)\leftarrow\text{Linear}_{B}(x_{t_{i}})

C:(L,D)\leftarrow\text{Linear}_{C}(x_{t_{i}})

B

and

C

are time-dependent

\Delta\leftarrow t_{i}-t_{i-1}

\Delta

records the temporal information

\bar{A},\bar{B}\leftarrow\text{Discretize}(A,B,\Delta)

\mathbf{h}\leftarrow\textbf{SSM}(\bar{A},\bar{B},C,\Delta)(x)

7: return

\mathbf{h}

Let $W^{e}$ be the event embedding matrix with dimensions $D\times K$ , where $D$ is the dimension of the hidden layers of Mamba blocks. The event embedding is defined as $x_{t_{i}}=\mathbf{k}_{i}(W^{e})^{T}$

(x_{t_{1}},x_{t_{2}},...,x_{t_{n}})=(\mathbf{k}_{1},\mathbf{k}_{2},...,\mathbf{k}_{n})(W^{e})^{T}

In the Mamba architecture, the matrices $\Delta,\mathbf{B}$ and $\mathbf{C}$ are time-dependent and are obtained by linear projection from $x_{t}$ . However, for the Hawkes Process, the approach is different, as it requires the use of temporal features. Specifically, we make $\mathbf{B}$ and $\mathbf{C}$ time-dependent, and $\Delta=(\Delta_{1},\Delta_{2},...,\Delta_{n})$ is defined by the temporal differences as above. Following Equation 9, we have $\Delta_{i}=t_{i}-t_{i-1}$ , thus we replace $t$ and $t+\Delta t$ by $t_{i-1}$ and $t_{i}$ . We define

\mathbf{B}(t_{i})=\text{Linear}_{\mathbf{B}}(x_{t_{i}}),\mathbf{C}(t_{i})=\text{Linear}_{\mathbf{C}}(x_{t_{i}})

are obtained by a linear transformation of the vector $x_{t_{i}}$ . We have the transition formulas for the Mamba Hawkes process:

	$\displaystyle z_{t_{i}}$	$\displaystyle=\overline{\mathbf{A}}(t_{i})z_{t_{i-1}}+\overline{\mathbf{B}}(t_{i})x_{t_{i}},$		(10)
	$\displaystyle y_{t_{i}}$	$\displaystyle=\mathbf{C}(t_{i})z_{t_{i}}.$		(10)

where

	$\displaystyle\overline{\mathbf{A}}(t_{i})$	$\displaystyle=\exp(\Delta_{i}\mathbf{A}),$		(11)
	$\displaystyle\overline{\mathbf{B}}(t_{i})$	$\displaystyle=(\Delta_{i}\mathbf{A})^{-1}(\exp(\Delta_{i}\mathbf{A})-\mathbf{I})\cdot\Delta_{i}\mathbf{B}(t_{i}).$		(11)

is the temporal-dependent coefficients. Hence, the temporal information is incorporated into our recurrence process.

The final output state is denoted by $\mathbf{O}=(\mathbf{o}_{1},\mathbf{o}_{2},...,\mathbf{o}_{n})$ , we thus have

\mathbf{H}=\text{Activate}(\mathbf{O}W_{1}+b_{1})W_{2}+b_{2}

where $W_{i},b_{i}$ are the parameters for the two layers MLP. We have $\mathbf{h}(t_{j})=\mathbf{H}(j,:)$ in our case.

The intensity function of Neural Hawkes process is given by

\lambda(t)=\sum_{k=1}^{K}\lambda_{k}(t),

(12)

where $\lambda_{k}$ is the intensity function of the $k$ -th event, and

\lambda_{k}=f_{k}(\alpha_{k}(t-t_{j})+\mathbf{w}_{k}^{T}\mathbf{h}(t_{j})+b_{k}),

(13)

where $t$ is defined on interval $t\in\left[t_{j},t_{j+1}\right)$ , and $f_{k}(x)=\beta_{k}\log\left(1+\exp\left(x/\beta_{k}\right)\right)$ is the Softplus function. For the log-likelihood, it is given by

\sum_{i=1}^{n}\log\lambda\left(t_{i}\right)-\int_{t_{1}}^{t_{n}}\lambda(t)dt.

(14)

Denote the log-likelihood of the event sequence $\mathcal{S}$ as $\mathcal{L}$ .

For the prediction of next event type and timestamp, we train two linear layers $P^{e},P^{t}$

	$\displaystyle\hat{k}_{j+1}$	$\displaystyle=\text{argmax}(\text{Softmax}(P^{e}\mathbf{h}(t_{j})),$
	$\displaystyle\hat{t}_{j+1}$	$\displaystyle=P^{t}\mathbf{h}(t_{j}).$		(15)

For the sequence $\mathcal{S}=\{(t_{1},k_{1}),(t_{2},k_{2}),...,(t_{n},k_{n})\}$ , we define

	$\displaystyle\mathcal{L}_{event}(\mathcal{S})$	$\displaystyle=\sum_{j=1}^{n-1}-\log(\text{Softmax}(P^{e}\mathbf{h}(t_{j}))_{k_{j+1}}),$
	$\displaystyle\mathcal{L}_{time}(\mathcal{S})$	$\displaystyle=\sum_{j=1}^{n-1}((t_{j+1}-t_{j})-(\hat{t}_{j+1}-\hat{t}_{j}))^{2},$		(16)

where ${t}_{j}$ is the true timestamp of event $j$ . $\mathcal{L}_{event}$ is the cross-entropy loss that measures the accuracy of the event type prediction, and $\mathcal{L}_{time}$ is the MSE loss that measures the accuracy of the time prediction. The training loss can then be defined as

\mathcal{L}(\mathcal{S})=-\mathcal{L}+\beta\mathcal{L}_{event}(\mathcal{S})+\gamma\mathcal{L}_{time}(\mathcal{S}),

(17)

where $\beta,\gamma$ are hyper-parameters to control the range of event and time losses.

Next we provide a proposition to show that, with specific choices of parameters, the Mamba Hawkes Process recurrence can degenerate to an RNN architecture similar to RMTPP as in [7].

Proposition 4.1

When $N=1,\boldsymbol{A}=-1,\boldsymbol{B}=1$ , the Mamba Hawkes Process recurrence takes the form

		$\displaystyle g_{t_{i}}=\exp(t_{i-1}-t_{i})$		(18)
		$\displaystyle z_{t_{i}}=g_{t_{i}}z_{t_{i-1}}+(1-g_{t_{i}})x_{t_{i}},$		(18)

Proof: We consider that if a given input $x_{t}$ should be completely ignored (as necessary in the synthetic tasks), all $D$ channels should ignore it, and so we project the input down to 1 dimension before repeating/broadcasting with $\Delta$ .

In Mamba Hawkes Process, we set $\Delta_{i}=t_{i}-t_{i-1}$ , when $N=1,\boldsymbol{A}=-1,\boldsymbol{B}=1$ . By applying the zero-order hold $(\mathrm{ZOH})$ discretization formulas:

	$\displaystyle\overline{\mathbf{A}}(t_{i})$	$\displaystyle=\exp(\Delta_{i}\mathbf{A})=\exp(t_{i-1}-t_{i}),$		(19)
	$\displaystyle\overline{\mathbf{B}}(t_{i})$	$\displaystyle=(\Delta_{i}\mathbf{A})^{-1}(\exp(\Delta_{i}\mathbf{A})-\mathbf{I})\cdot\Delta_{i}\mathbf{B}(t_{i})=-(\exp(\Delta_{i}\boldsymbol{A})-\boldsymbol{I})=1-\overline{\boldsymbol{A}}(t_{i})$		(19)

Denote that $g_{t_{i}}=\exp(t_{i-1}-t_{i})$ , thus the final discrete recurrence is

\displaystyle z_{t_{i}}=g_{t_{i}}z_{t_{i-1}}+(1-g_{t_{i}})x_{t_{i}}

as desired. We finish the proof.

From the construction we can see that the temporal differences have a canonical way as the time scale variables. This architecture can inherent the temporal information naturally.

4.2 MHP-extension

In [22], the authors combine the Mamba layer and Transformer layer to create a Jamba structure, demonstrating impressive capabilities in large language models. Inspired by this, we propose combining the Mamba structure and Transformer structure to develop a new model architecture, which we call the Mamba Hawkes Process Extension (MHP-E).

Explicitly, Let MambaBlock be the Mamba structure defined above, TransformerBlock be the Transformer blocks. Given the temporal sequence $\mathcal{S}=\{(t_{1},k_{1}),(t_{2},k_{2}),...,(t_{n},k_{n})\}$ and the corresponding event one-hot vector $\mathbf{k}_{i}$ , we first apply the embedding layer:

x_{i}=\mathbf{k}_{i}(W^{e})^{T},

(20)

and we let the encoding vector pass through the Mamba layers and Transformer layers

y=\text{MambaBlock}(x),

h=\text{TransformerBlock}(y),

and then we can use $h$ as the hidden layer to compute the log-likelihood $\mathcal{L}$ , event cross entropy loss $\mathcal{L}_{event}$ and the mean square temporal loss $\mathcal{L}_{time}$ .

We need to analyze the construction of the above architecture to understand its implicit ideas. In this architecture, we did not use temporal encoding for the Transformer, neither absolute nor relative. Instead, we used the Mamba layer as an encoder to simultaneously encode the temporal and event features. Consequently, the hidden layer obtained from the Mamba layer inherently incorporates this information.

5 Experiments

5.1 Baselines

Recurrent Marked Temporal Point Process

[7] The RMTPP is a traditional model that employs a Recurrent Neural Network (RNN) architecture to predict the timing of the next event. It uses the RNN mechanism to give the representation of the temporal information.

Neural Hawkes Process

[24] Propose the Neural Hawkes Process (NHP) incorporating neural networks into the Hawkes process to improve the ability of the prediction.

Self-attentive Hawkes Process

[36] This model utilizes an attention mechanism to predict the Hawkes process. It incorporates a hybrid positional encoding in its model construction, which fuses the temporal positional encoding and the absolute positional encoding.

Transformer Hawkes Process

[39] The THP applys the transformer architecture to the Hawkes process. THP regards the time stamps as the position of the event vectors and apply absolute positional encoding in the architecture.

5.2 Datasets

Synthetic Dataset

This dataset, created using Python, is based on the methodology described in [36]. It is a result of a Hawkes process, making it an excellent fit for our investigation. The dataset includes 5 types of events, with sequences averaging 60 in length. The shortest sequence is 20, while the longest is 100.

Financial Transactions

[7] This dataset contains a day’s worth of stock transaction records. The sequences in this dataset are extensive, with events divided into two categories: "Buy" and "Sell". With an average sequence length of 2074, this dataset is well-suited to our experiment.

StackOverFlow

[20] This dataset is a compilation of user interaction data from the Q&A platform, Stackoverflow. We view the history of user interactions as a time-ordered sequence. The dataset’s sequences have an average length of 72, ranging from 41 to 736, and encompass 22 event types.

Retweet

[37] This dataset is a collection of various tweet threads. Each thread includes an original tweet and all subsequent response tweets from users. The dataset also records the timing of each tweet and the user’s ID. The sequences average 109 in length, ranging from 50 to 264. The event types are categorized into three groups based on the number of followers: "small", "medium", and "large".

MIMIC-II

[18] The MIMIC-II dataset contains data from patients’ ICU admissions over a seven-year period. Each patient’s visits are treated as separate sequences, with each event in the sequence marked by a timestamp and a diagnosis.

5.3 Implementation

We design our MHP models and MHP-E models as follows: For MPH, we set the dimension in our construction of the embedding as describing in the following table:

Table 1: Hyper-parameters of different dataset

Dataset	Financial	SO	Synthetic	Retweet	Mimic-II
d_model	128	512	64	64	64
learning rate	1e-4	1e-4	1e-4	1e-2	5e-4
batch size	1	4	4	16	1

We set all other architecture hyper-parameters as suggested in [11]:

Table 2: Architecture of MHP

d_inner	d_state	d_conv	expand factor	n_layers
2 $\times$ d_model	16	4	2	4

For the MHP-E, the Mamba part we use the same construction as MHP except we choose n_layers to be 2 since we only need it to encode the temporal and event features. For the Transformer blocks, we apply the same architecture as the Transformer Hawkes Process (THP) as described in [39]. Similarly, we follow the code from [39] and use $\beta=1$ and $\gamma=$ 1e-4 for the loss function. To avoid NaN values during our training, we apply Softplus function and Clamp function to the temporal difference $\Delta$ .

All experiments are performed using GPU RTX A6000 with 48GB memory, and spend less than a minute for the training process each epoch.

5.4 Experimental Results

Log-likelihood

Table 3: Log-likelihood

Models	Financial	SO	Synthetic	Retweet	Mimic-II
RMTPP	-3.89	-2.6	-1.33	-5.99	-1.35
NHP	-3.6	-2.55	-	-5.6	-1.38
SAHP	-	-1.86	0.59	-4.56	-0.52
THP	-1.11	-0.039	0.791	-2.04	0.48
MHP	0.974	0.391	0.993	0.180	0.996
MHP-E	0.966	0.407	0.956	2.016	1.186

We first see the log-likelihood of the models on these datasets. From the table, we can observe that the MHP and MHP-E models generally outperform the other models in most categories.

•

The MHP model performs exceptionally well in the Financial and Synthetic categories, achieving the highest log-likelihood scores of 0.974 and 0.993 respectively. This suggests that the MHP model has a strong fitting ability for the tasks. Also for the SO, Retweet and Mimin-II datasets, MHP also gives high log-likelihood.
•

The MHP-E model shows a strong performance in the SO, Retweet and Mimic-II categories, achieving the highest log-likelihood scores of 0.407, 2.016, and 1.186 respectively. This suggests that the MHP-E model is particularly effective for these types of data. In the Financial and Synthetic categories, the MHP-E model’s performance is slightly lower than the MHP model, but still competitive. This shows the advantage of the MHP-E model.

Refer to caption — Figure 1: Comparison of log-likelihood between THP and MHP. Green lines are MHP, red lines are THP. The left figure represent the training process of Mimic-II, the middle figure is for the synthetic dataset and the right is for the stackoverflow dataset. We can see in both figures MHP outperforms THP

In conclusion, both the MHP and MHP-E models have their strengths. They outperform other models in these tasks and show strong abilities in the Hawkes process case.

Accuracy and RMSE

Table 4: Accuracy

Models	Financial	Mimic-II	SO
RMTPP	61.95	81.2	45.9
NHP	62.20	83.2	46.3
THP	62.23	84.9	46.4
MHP	62.5	85.5	45.4
MHP-E	62.7	85.5	46.5

Table 5: RMSE

Models	Financial	Mimic-II	SO
RMTPP	1.56	6.12	9.78
NHP	1.56	6.13	9.83
SAHP	-	3.89	5.57
THP	0.93	0.82	4.99
MHP	0.592	0.687	1.429
MHP-E	0.556	0.588	1.372

Table 5.4 and 5.4 provide a comparative analysis of the performance of various models across three different datasets: Financial, Mimic-II, and SO. The performance is evaluated based on two metrics: Accuracy and RMSE.

•

In terms of accuracy, MHP-E model outperforms all other models across all three datasets. It achieves the highest accuracy of 62.7% on the Financial dataset, ties for the highest accuracy of 85.5% on the Mimic-II dataset, and also leads with an accuracy of 46.5% on the SO dataset. This consistent performance across different datasets underscores the robustness and generalizability of the MHP-E model.

The MHP model also shows strong performance, particularly when compared to the other models excluding MHP-E. It achieves an accuracy of 62.5% on the Financial dataset, which is the second-highest after MHP-E. On the Mimic-II dataset, it performs slightly lower than MHP-E, achieving an accuracy of 85.5%. However, on the SO dataset, its performance drops to 45.4%, which is lower than MHP-E, THP, and NHP. Despite this, the MHP model’s overall performance is commendable and it can be considered a good model.
•

In terms of RMSE, a lower value is preferable as it indicates a closer fit to the data. Again, the MHP-E model outshines all others with the lowest RMSE across all datasets: 0.556 for Financial, 0.588 for Mimic-II, and 1.372 for SO.

The MHP model also performs well in terms of RMSE, securing the second-lowest values on all datasets after MHP-E. It achieves an RMSE of 0.592 on the Financial dataset, 0.687 on the Mimic-II dataset, and 1.429 on the SO dataset. This further reinforces the effectiveness of the MHP model, as it not only maintains high accuracy but also keeps the prediction error relatively low.

In conclusion, the MHP-E model stands out as the best in terms of both accuracy and RMSE, the MHP model also demonstrates strong performance in these two metrics.

6 Limitation

We first propose incorporating the Mamba structure into the Hawkes process in our paper, and it achieves impressive performance in our experiments. However, it is important to note that our construction is specifically designed for the Hawkes process. For general temporal point processes, our architecture may not work directly and may require further modifications.

7 Conclusion

In this paper, we propose the Mamba Hawkes Process, a new framework for modeling event sequence data. By adopting a time-variant recurrent rule, our model utilizes context-dependent reasoning to capture long dependencies while scaling linearly with sequence length. Moreover, the Mamba Hawkes Process is versatile enough to integrate with transformer architectures. We further introduce the Mamba Hawkes Process Extension (MHP-E) to illustrate the potential of combining the Mamba Hawkes Process with attention mechanisms to improve the analysis of complex temporal patterns.

Through comprehensive analysis, we articulate the compatibility and mutual reinforcement between state-space models (SSMs) and Hawkes processes. Experiments on various real-world datasets demonstrate that the Mamba Hawkes Processes exhibit state-of-the-art performance, outperforming existing benchmarks in both likelihood and event prediction accuracy. Our construction and analysis may pave the way for new experimental and theoretical directions, contributing to the real-world applications of Hawkes Processes.

References

[1] Emmanuel Bacry, Iacopo Mastromatteo, and Jean-François Muzy. Hawkes processes in finance. Market Microstructure and Liquidity, 1(01):1550005, 2015.
[2] David R Brillinger, Peter M Guttorp, Frederic Paik Schoenberg, Abdel H El-Shaarawi, and Walter W Piegorsch. Point processes, temporal. Encyclopedia of environmetrics, 3:1577–1581, 2002.
[3] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.
[4] David Roxbee Cox and Valerie Isham. Point processes, volume 12. CRC Press, 1980.
[5] Daryl J Daley and David Vere-Jones. An introduction to the theory of point processes: volume II: general theory and structure. Springer Science & Business Media, 2007.
[6] Daryl J Daley, David Vere-Jones, et al. An introduction to the theory of point processes: volume I: elementary theory and methods. Springer, 2003.
[7] Nan Du, Hanjun Dai, Rakshit Trivedi, Utkarsh Upadhyay, Manuel Gomez-Rodriguez, and Le Song. Recurrent marked temporal point processes: Embedding event history to vector. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pages 1555–1564, 2016.
[8] Guanhua Fang, Ganggang Xu, Haochen Xu, Xuening Zhu, and Yongtao Guan. Group network hawkes process. Journal of the American Statistical Association, pages 1–17, 2023.
[9] Daniel Y Fu, Tri Dao, Khaled K Saab, Armin W Thomas, Atri Rudra, and Christopher Ré. Hungry hungry hippos: Towards language modeling with state space models. arXiv preprint arXiv:2212.14052, 2022.
[10] Anningzhe Gao and Shan Dai. Rothp: Rotary position embedding-based transformer hawkes process. arXiv preprint arXiv:2405.06985, 2024.
[11] Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023.
[12] Albert Gu, Tri Dao, Stefano Ermon, Atri Rudra, and Christopher Ré. Hippo: Recurrent memory with optimal polynomial projections. Advances in neural information processing systems, 33:1474–1487, 2020.
[13] Albert Gu, Karan Goel, and Christopher Ré. Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396, 2021.
[14] Albert Gu, Isys Johnson, Karan Goel, Khaled Saab, Tri Dao, Atri Rudra, and Christopher Ré. Combining recurrent, convolutional, and continuous-time models with linear state space layers. Advances in neural information processing systems, 34:572–585, 2021.
[15] Alan G Hawkes. Spectra of some self-exciting and mutually exciting point processes. Biometrika, 58(1):83–90, 1971.
[16] Alan G Hawkes. Hawkes processes and their applications to finance: a review. Quantitative Finance, 18(2):193–198, 2018.
[17] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
[18] Alistair EW Johnson, Tom J Pollard, Lu Shen, Li-wei H Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. Mimic-iii, a freely accessible critical care database. Scientific data, 3(1):1–9, 2016.
[19] Patrick J Laub, Thomas Taimre, and Philip K Pollett. Hawkes processes. arXiv preprint arXiv:1507.02822, 2015.
[20] Jure Leskovec and Rok Sosič. Snap: A general-purpose network analysis and graph-mining library. ACM Transactions on Intelligent Systems and Technology (TIST), 8(1):1–20, 2016.
[21] Shuang Li, Shuai Xiao, Shixiang Zhu, Nan Du, Yao Xie, and Le Song. Learning temporal point processes via reinforcement learning. Advances in neural information processing systems, 31, 2018.
[22] Opher Lieber, Barak Lenz, Hofit Bata, Gal Cohen, Jhonathan Osin, Itay Dalmedigos, Erez Safahi, Shaked Meirom, Yonatan Belinkov, Shai Shalev-Shwartz, et al. Jamba: A hybrid transformer-mamba language model. arXiv preprint arXiv:2403.19887, 2024.
[23] Harsh Mehta, Ankit Gupta, Ashok Cutkosky, and Behnam Neyshabur. Long range language modeling via gated state spaces. In The Eleventh International Conference on Learning Representations, 2022.
[24] Hongyuan Mei and Jason M Eisner. The neural hawkes process: A neurally self-modulating multivariate point process. Advances in neural information processing systems, 30, 2017.
[25] Eric Nguyen, Michael Poli, Marjan Faizi, Armin Thomas, Michael Wornow, Callum Birch-Sykes, Stefano Massaroli, Aman Patel, Clayton Rabideau, Yoshua Bengio, et al. Hyenadna: Long-range genomic sequence modeling at single nucleotide resolution. Advances in neural information processing systems, 36, 2024.
[26] Yosihiko Ogata. Space-time point-process models for earthquake occurrences. Annals of the Institute of Statistical Mathematics, 50:379–402, 1998.
[27] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In International conference on machine learning, pages 1310–1318. Pmlr, 2013.
[28] Bo Peng, Eric Alcaide, Quentin Gregory Anthony, Alon Albalak, Samuel Arcadinho, Stella Biderman, Huanqi Cao, Xin Cheng, Michael Nguyen Chung, Leon Derczynski, et al. Rwkv: Reinventing rnns for the transformer era. In The 2023 Conference on Empirical Methods in Natural Language Processing, 2023.
[29] Lu Wang, Wei Zhang, Xiaofeng He, and Hongyuan Zha. Supervised reinforcement learning with recurrent neural network for dynamic treatment recommendation. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining, pages 2447–2456, 2018.
[30] Shuai Xiao, Mehrdad Farajtabar, Xiaojing Ye, Junchi Yan, Le Song, and Hongyuan Zha. Wasserstein learning of deep generative point process models. Advances in neural information processing systems, 30, 2017.
[31] Shuai Xiao, Junchi Yan, Xiaokang Yang, Hongyuan Zha, and Stephen Chu. Modeling the intensity function of point process via recurrent neural networks. In Proceedings of the AAAI conference on artificial intelligence, volume 31, 2017.
[32] Liu Yu, Xovee Xu, Goce Trajcevski, and Fan Zhou. Transformer-enhanced hawkes process with decoupling training for information cascade prediction. Knowledge-Based Systems, 255:109740, 2022.
[33] Jingfei Zhang, Biao Cai, Xuening Zhu, Hansheng Wang, Ganggang Xu, and Yongtao Guan. Learning human activity patterns using clustered point processes with active and inactive states. Journal of business & economic statistics, 41(2):388–398, 2023.
[34] Lu-ning Zhang, Jian-wei Liu, Zhi-yan Song, and Xin Zuo. Universal transformer hawkes process with adaptive recursive iteration. Engineering Applications of Artificial Intelligence, 105:104416, 2021.
[35] Lu-ning Zhang, Jian-wei Liu, Zhi-yan Song, and Xin Zuo. Temporal attention augmented transformer hawkes process. Neural Computing and Applications, pages 1–15, 2022.
[36] Qiang Zhang, Aldo Lipani, Omer Kirnap, and Emine Yilmaz. Self-attentive hawkes process. In International conference on machine learning, pages 11183–11193. PMLR, 2020.
[37] Qingyuan Zhao, Murat A Erdogdu, Hera Y He, Anand Rajaraman, and Jure Leskovec. Seismic: A self-exciting point process model for predicting tweet popularity. In Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining, pages 1513–1522, 2015.
[38] Ke Zhou, Hongyuan Zha, and Le Song. Learning triggering kernels for multi-dimensional hawkes processes. In International conference on machine learning, pages 1301–1309. PMLR, 2013.
[39] Simiao Zuo, Haoming Jiang, Zichong Li, Tuo Zhao, and Hongyuan Zha. Transformer hawkes process. In International conference on machine learning, pages 11692–11702. PMLR, 2020.