HyperHawkes: Hypernetwork based Neural Temporal Point Process
Abstract
Temporal point processes serve as an essential tool for modeling time-to-event data in continuous time. Despite the massive amounts of event sequence data available from domains like social media and healthcare, real-world applications of temporal point processes face two major challenges: 1) they do not generalize to predict events from unseen sequences in dynamic environments, and 2) they are not capable of thriving in continually evolving environments with minimal supervision while retaining previously learnt knowledge. To tackle these issues, we propose HyperHawkes, a hypernetwork based temporal point process framework which is capable of modeling the time of occurrence of events for unseen sequences, thereby addressing zero-shot learning for time-to-event modeling. We also develop a hypernetwork based continually learning temporal point process for continuous modeling of time-to-event sequences with minimal forgetting. In this way, HyperHawkes augments the temporal point process with zero-shot modeling and continual learning capabilities. We demonstrate the application of the proposed framework through experiments on two real-world datasets. Our results show the efficacy of the proposed approach in predicting future events for unseen event sequences under the zero-shot regime. We also show that the proposed model can learn sequences continually while retaining information from previous event sequences, hence mitigating catastrophic forgetting for time-to-event data.
Introduction
Various applications in daily life, like earthquake occurrences, social networks, financial transactions, and user activity logs, are associated with collections of discrete asynchronous events whose occurrences are represented by timestamps. Each event sequence, consisting of a series of timestamps, is associated with a separate entity. For example, in social media, each user can be associated with the times at which they post tweets, and each tweet can be viewed as an event. Similarly, in financial transactions, each stock can be associated with the times of buy and sell orders. The ability to model such sequences is of vital importance for creating intelligent systems, as these sequences often contain rich information that can predict their future evolution.
A principled mathematical framework to model such sequences in continuous time is the temporal point process (Valkeila 2008). The Hawkes process (Hawkes 1971), a self-exciting point process, has a rich theoretical literature and has been widely used in a wide array of practical applications such as epidemic modeling (Diggle, Rowlingson, and Su 2005), earthquake prediction (Hainzl, Steacy, and Marsan 2010), financial modeling (Bacry, Mastromatteo, and Muzy 2015), and crime prediction (Mohler et al. 2011). Recent works improve upon the classical Hawkes process by using neural networks to model such event sequences (Mei and Eisner 2016; Du et al. 2016; Xiao et al. 2017; Omi, Ueda, and Aihara 2019). Neural Hawkes processes have proved capable of learning complex dependencies beyond their classical counterparts and are one of the cornerstones of recent progress in time-to-event modeling.
Despite their improved performance, neural Hawkes processes face two practical limitations. First, a neural Hawkes process typically needs to be trained on a large time-to-event dataset for a specific domain or entity. This restricts prediction for a new, unseen entity with limited or no data, which can be critical for certain applications. Moreover, acquiring time-of-occurrence data for a new sequence is expensive, and for sequences with a low frequency of occurrence it may take a long time to accumulate enough data for predicting future occurrences. Second, real-world events occur sequentially in continuous streams. Therefore, a realistic and challenging problem is to continually learn time-to-event models in an ever-changing environment while retaining previously learnt knowledge.
Motivated by the above limitations, we consider a practical and under-explored setting for time-to-event modeling, called zero-shot event modeling. We also consider a continual learning setup where time-to-event prediction tasks arrive sequentially in an online manner. We aim to develop neural Hawkes process models which can generalize to time-to-event prediction tasks with no data and can learn continually while retaining previous knowledge. To this end, we introduce HyperHawkes, a hypernetwork based Hawkes process that generates sequence-specific parameters for the neural Hawkes process. The hypernetwork is essentially a meta-network that generates the parameters of the neural Hawkes process network for modeling continuous-time events. By conditioning the hypernetwork on sequence descriptors, we learn event-sequence-specific parameters and improve the model's ability to generalize, enabling prediction for unseen sequences with the help of a descriptor alone. For a more pragmatic setup, we extend our model to continually arriving sequences, where each sequence can be considered a separate task. For continually learning the event sequences, we recast the descriptor-conditioned hypernetwork to include a hypernetwork output regularizer, which penalizes changes in previously learnt parameters and hence retains previously learnt time-to-event modeling capabilities. We provide two variants of the proposed approach, encompassing 1) zero-shot modeling and 2) continual learning capabilities within the framework of the neural Hawkes process. To the best of our knowledge, there is no prior work on zero-shot or continual learning for time-to-event modeling.
Our contributions can be summarized as follows:
- We propose two novel problems of zero-shot learning and continual learning in the paradigm of time-to-event modeling.
- We propose HyperHawkes, a descriptor-conditioned hypernetwork based neural Hawkes process which can generate event-sequence-specific parameters, hence learning at the level of individual sequences. We present two variants of HyperHawkes based on which components of the neural Hawkes process the hypernetwork generates.
- The proposed methods can predict the times of occurrence of events in unseen sequences, hence performing zero-shot time-to-event modeling.
- We augment the model with continual learning abilities by employing a hypernetwork output regularizer, hence avoiding catastrophic forgetting for successively arriving time-to-event sequences.
- We present an experimental setup for evaluating zero-shot learning and continual learning for time-to-event modeling, and demonstrate the effectiveness of the proposed models on two real-world datasets.
Related Work
Hawkes Process
The Hawkes process (Hawkes 1971) is a point process (Valkeila 2008) with a self-triggering property, i.e., the occurrence of previous events triggers occurrences of future events. Hawkes processes have been used in earthquake modeling (Hainzl, Steacy, and Marsan 2010), crime forecasting (Mohler et al. 2011), social media (Rizoiu et al. 2017), finance (Bacry, Mastromatteo, and Muzy 2015; Embrechts, Liniger, and Lin 2011), and epidemic forecasting (Diggle, Rowlingson, and Su 2005; Chiang, Liu, and Mohler 2021), and they provide a solid mathematical framework for modeling event sequences. Earlier works on point process modeling specify a parametric form for the intensity function characterizing the point process; however, parametric models may not capture complex event dynamics. To address this, several works (Du et al. 2016; Mei and Eisner 2017; Omi, Ueda, and Aihara 2019; Zuo et al. 2020) model the intensity function using neural networks, which are better at learning complex event dynamics. Recently, (Zuo et al. 2020) and (Zhang et al. 2020) proposed to use the positional encodings of transformer language models (Vaswani et al. 2017) to model point processes. There are some efforts toward learning Hawkes processes from small data (Xie et al. 2019; Salehi et al. 2019); however, they are based on the statistical Hawkes process (not the neural Hawkes process) and are not applicable to a zero-shot learning setting.
Hypernetwork, Continual Learning and ZSL
Zero-Shot Learning:
ZSL (Palatucci et al. 2009; Lampert, Nickisch, and Harmeling 2009) aims to predict classes which are not present in the training samples. Such classes are known as unseen classes, while classes present in the training samples are known as seen classes. Methods to address zero-shot learning include mapping functions (Frome et al. 2013), generative models (Felix et al. 2018), and graph neural networks (Wang, Ye, and Gupta 2018). A large body of literature has addressed zero-shot learning across vision and natural language processing tasks such as text classification and relation extraction.
Continual Learning:
aims to create a learning paradigm which is able to model a stream of tasks while avoiding catastrophic forgetting. Different techniques (Kirkpatrick et al. 2017; Li and Hoiem 2017; Lopez-Paz and Ranzato 2017; Von Oswald et al. 2019) have been proposed in this regard, consolidating knowledge in data, weight, or meta space. (Von Oswald et al. 2019) performs continual learning through a hypernetwork which learns task-conditioned weights of the base model.
Hypernetworks:
They were introduced as meta-networks which can generate weights for another network (Ha, Dai, and Le 2017) and have been used for various tasks such as meta-learning (Zhao et al. 2020), neural architecture search (Zoph and Le 2016), and natural language understanding (He et al. 2022).
Despite the extensive literature in these domains, to the best of our knowledge there is no work at the intersection of zero-shot learning and time-to-event modeling, nor any effort along the lines of continual learning for time-to-event modeling. We therefore address novel and essential problems in this direction which can benefit several applications.
Preliminary
Problem Definition
- Zero-shot learning for time-to-event modeling: Assume we are given a collection of $N$ seen sequences $\mathcal{D}^{seen} = \{(c_i, \mathcal{S}_i)\}_{i=1}^{N}$, where $c_i$ represents the descriptor (meta-information) of the $i$-th sequence and $\mathcal{S}_i$ represents the times of occurrence of events in the sequence, i.e. $\mathcal{S}_i = \{t^i_1, t^i_2, \dots, t^i_{n_i}\}$. Our goal is to predict the times of event occurrences for unseen sequences with the help of the sequence descriptor.
- Continual learning for time-to-event modeling: Assume we are given a collection of $N$ sequences $\{(c_i, \mathcal{S}_i)\}_{i=1}^{N}$, where $c_i$ represents the sequence descriptor and $\mathcal{S}_i$ represents the times of occurrence of events in the sequence, i.e. $\mathcal{S}_i = \{t^i_1, t^i_2, \dots, t^i_{n_i}\}$, and we assume these sequences arrive one after the other in the order of their index. Our goal is to continually learn the sequences while avoiding catastrophic forgetting of the previous sequences. That is, we aim to learn an NHP model which, after training on sequence $\mathcal{S}_i$, is able to predict future event occurrences in all sequences $\mathcal{S}_j$ with $j \leq i$.
Hawkes Process
Point processes are useful for modeling the distribution of points over some space and are defined using an underlying intensity function. A Hawkes process (Hawkes 1971) is a point process with a self-triggering property, i.e., the occurrence of previous events triggers occurrences of future events. The conditional intensity function of a univariate Hawkes process at time $t$, given the history $\mathcal{H}_t$ of the sequence, is defined as
$\lambda(t \mid \mathcal{H}_t) = \mu(t) + \sum_{t_i < t} \kappa(t - t_i) \qquad (1)$
where $\mu(t)$ is the base intensity function and $\kappa(\cdot)$ is the triggering kernel function capturing the influence of previous events. The summation represents the effect of all events prior to time $t$, which contribute to the intensity at time $t$. The probability density function at time $t$, given the past event times $\mathcal{H}_t = \{t_1, \dots, t_n\}$, is obtained as follows:
$p(t \mid \mathcal{H}_t) = \lambda(t \mid \mathcal{H}_t) \exp\left(-\int_{t_n}^{t} \lambda(s \mid \mathcal{H}_s)\, ds\right) \qquad (2)$
where the exponential term on the right-hand side represents the probability that no events occur in the interval $[t_n, t)$.
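For illustration, the following minimal sketch shows how the intensity in Equation (1) and the corresponding log-likelihood can be evaluated, assuming an exponential triggering kernel $\kappa(s) = \alpha e^{-\beta s}$; the function names and parameter values (`mu`, `alpha`, `beta`) are only illustrative and not part of our model.

```python
import numpy as np

def hawkes_intensity(t, history, mu=0.2, alpha=0.8, beta=1.0):
    """Eq. (1) with an exponential triggering kernel:
    lambda(t) = mu + sum_{t_i < t} alpha * exp(-beta * (t - t_i))."""
    past = history[history < t]
    return mu + np.sum(alpha * np.exp(-beta * (t - past)))

def hawkes_log_likelihood(events, T, mu=0.2, alpha=0.8, beta=1.0):
    """Log-likelihood of event times on [0, T]: sum of log-intensities at the
    events minus the compensator (integral of the intensity), which has a
    closed form for the exponential kernel."""
    events = np.asarray(events, dtype=float)
    log_term = sum(np.log(hawkes_intensity(t, events[:i], mu, alpha, beta))
                   for i, t in enumerate(events))
    compensator = mu * T + np.sum((alpha / beta) * (1.0 - np.exp(-beta * (T - events))))
    return log_term - compensator

# Example: three events observed on the interval [0, 5].
print(hawkes_log_likelihood([0.5, 1.2, 3.0], T=5.0))
```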
[Figure 1: Overview of the proposed HyperHawkes architecture.]
Neural Hawkes process
The standard Hawkes process assumes a parametric form for the intensity function, which is not generalizable to every event prediction problem: the influences between events can be complex and need not be exponentially decaying. Various recent works introduced neural Hawkes processes (Du et al. 2016; Mei and Eisner 2016; Omi, Ueda, and Aihara 2019), which model the intensity function as a nonlinear function of the history using a neural network. The central idea of these works is to use recurrent neural networks (RNNs) to model the intensity function, capturing the influence of past events. The conditional intensity function is modeled as
$\lambda(t \mid \mathcal{H}_t) = f(\boldsymbol{h}_t)$, where $\boldsymbol{h}_t$ represents a hidden state updated using the RNN and $f(\cdot)$ is a positive-valued function ensuring the positivity of the intensity function.
Proposed Model
We propose HyperHawkes, a hypernetwork based neural Hawkes process for time-to-event modeling, with the neural Hawkes process (NHP) (Omi, Ueda, and Aihara 2019) as the base model. Integrating a hypernetwork with the neural Hawkes process, we introduce a descriptor-conditioned hypernetwork that generates sequence-specific weights for time-to-event modeling; that is, the hypernetwork learns separate NHP weights for each sequence. We leverage this framework for zero-shot event modeling, where the hypernetwork produces weights for unseen tasks using the sequence descriptor and the NHP predicts future events. Inspired by (Von Oswald et al. 2019), we also use this framework for continual learning of tasks through a hypernetwork based regularizer. We discuss each of these pieces in detail in the following subsections.
Base Model: Neural Hawkes Process
In particular, we employ the fully neural Hawkes process (Omi, Ueda, and Aihara 2019) as the base model for time-to-event modeling. It uses a combination of a recurrent neural network and a feedforward neural network to model the intensity function. We represent the history using hidden representations generated by a recurrent neural network (RNN) at each time step. The hidden representation at time $t_i$ is obtained as
$\boldsymbol{h}_i = \mathrm{RNN}(\tau_i, \boldsymbol{h}_{i-1}; \Theta_{rnn}) \qquad (3)$
where $\tau_i = t_i - t_{i-1}$ is the inter-arrival time and $\Theta_{rnn}$ represents the parameters associated with the RNN for the sequence, such as the input weight matrix, the recurrent weight matrix, and the bias. The hidden representation $\boldsymbol{h}_i$ is obtained by repeated application of the RNN block on the sequence formed from previous inter-arrival times. It is used as input to a feedforward neural network that computes the intensity function (hazard function) and, consequently, the cumulative hazard function for computing the likelihood of event occurrences. In the proposed model, the inputs to the feed-forward neural network are i) the hidden representation generated from the RNN and ii) the elapsed time from the most recent event. We model the conditional intensity as a function of the elapsed time from the most recent event as
$\lambda(t \mid \mathcal{H}_t) = \phi(t - t_i \mid \boldsymbol{h}_i), \quad t > t_i \qquad (4)$
where $\phi(\cdot \mid \boldsymbol{h}_i)$ is a non-negative function referred to as the hazard function. Therefore, we define the cumulative hazard function in terms of the inter-event interval $\tau = t - t_i$ as $\Phi(\tau \mid \boldsymbol{h}_i) = \int_0^{\tau} \phi(s \mid \boldsymbol{h}_i)\, ds$. The cumulative hazard function is modeled using a feed-forward neural network (FNN):
$\Phi(\tau \mid \boldsymbol{h}_i) = \mathrm{FNN}(\tau, \boldsymbol{h}_i; \Theta_{fnn}) \qquad (5)$
However, the cumulative hazard function needs to fulfill two properties: firstly, it has to be a monotonically increasing function of $\tau$, and secondly, it has to be positive valued. We achieve these by maintaining positive weights and positive activation functions in the neural network (Chilinski and Silva 2020; Omi, Ueda, and Aihara 2019). The hazard function itself can then be obtained by differentiating the cumulative hazard function with respect to $\tau$ as
$\phi(\tau \mid \boldsymbol{h}_i) = \dfrac{\partial \Phi(\tau \mid \boldsymbol{h}_i)}{\partial \tau} \qquad (6)$
The log-likelihood of observing the event times $\{t_1, \dots, t_n\}$ is defined as follows using the cumulative hazard function:
$\log L(\Theta) = \sum_{i} \left[ \log \dfrac{\partial \Phi(\tau_i \mid \boldsymbol{h}_{i-1})}{\partial \tau} - \Phi(\tau_i \mid \boldsymbol{h}_{i-1}) \right] \qquad (7)$
where $\tau_i = t_i - t_{i-1}$ and $\Theta = \{\Theta_{rnn}, \Theta_{fnn}\}$ represents the combined weights associated with the RNN and FNN. In NHP, the weights of the networks are learnt by maximizing the likelihood given by (7). The gradient of the log-likelihood function is calculated using backpropagation.
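As an illustrative sketch (not our exact implementation), the cumulative hazard FNN of Equation (5) and the log-likelihood of Equation (7) can be realized in TensorFlow as follows. The names `CumulativeHazardNet` and `nhp_log_likelihood` are hypothetical, constraining all kernel weights to be non-negative is a simplification of the positivity scheme described above, and the 16-unit layers follow the sizes reported in the implementation details.

```python
import tensorflow as tf

class CumulativeHazardNet(tf.keras.Model):
    """FNN of Eq. (5): maps (elapsed time tau, hidden state h) to the cumulative
    hazard Phi(tau | h).  Non-negative weights and softplus activations keep Phi
    positive and non-decreasing in tau."""
    def __init__(self, units=16):
        super().__init__()
        nonneg = tf.keras.constraints.NonNeg()
        self.d1 = tf.keras.layers.Dense(units, activation="softplus", kernel_constraint=nonneg)
        self.d2 = tf.keras.layers.Dense(units, activation="softplus", kernel_constraint=nonneg)
        self.out = tf.keras.layers.Dense(1, activation="softplus", kernel_constraint=nonneg)

    def call(self, inputs):
        tau, h = inputs
        return self.out(self.d2(self.d1(tf.concat([tau, h], axis=-1))))

def nhp_log_likelihood(model, taus, hs):
    """Eq. (7): sum_i [ log dPhi/dtau(tau_i | h_{i-1}) - Phi(tau_i | h_{i-1}) ].
    The hazard is obtained by differentiating Phi with respect to tau."""
    with tf.GradientTape() as tape:
        tape.watch(taus)
        Phi = model((taus, hs))            # shape (num_events, 1)
    hazard = tape.gradient(Phi, taus)      # elementwise derivative dPhi/dtau
    return tf.reduce_sum(tf.math.log(hazard + 1e-8) - Phi)
```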
HyperHawkes: Hypernetwork based Neural Hawkes Process
A hypernetwork is a meta-network which produces parameters used by other networks (Ha, Dai, and Le 2017). As discussed in the above section, the neural Hawkes process comprises two building blocks, an RNN and an FNN, so we use hypernetworks to produce the weights of these two components. Each hypernetwork is itself a feed-forward neural network that produces parameters of the NHP. Since the natures of the RNN parameters and FNN parameters are different, we use two different hypernetworks, $g_{rnn}$ producing $\Theta_{rnn}$ and $g_{fnn}$ producing $\Theta_{fnn}$. Given a sequence descriptor $c_i$, the parameters of the RNN are generated as follows:
$\Theta^i_{rnn} = g_{rnn}(c_i; \Psi_{rnn}) \qquad (8)$
where $\Psi_{rnn}$ denotes the parameters of the hypernetwork $g_{rnn}$ (the weights of a neural network). Note that the hypernetwork parameters are the same across sequences; the descriptor $c_i$ is what generates the sequence-specific parameters. As discussed above, the cumulative hazard function must be a monotonically increasing function of $\tau$ and positive-valued, so the hypernetwork that generates the parameters of the cumulative hazard function has to respect these properties. This is achieved by letting the hypernetwork generate only positive weights, for which we use a positive activation function at its output. The hypernetwork for the FNN can be written as:
$\Theta^i_{fnn} = g_{fnn}(c_i; \Psi_{fnn}) \qquad (9)$
where $\Psi_{fnn}$ denotes the parameters of the hypernetwork $g_{fnn}$. We propose the following variants of HyperHawkes:
- HyperHawkes-FNN: The hypernetwork is considered only for the FNN modeling the cumulative hazard function.
- HyperHawkes-FNN-RNN: This variant uses two separate hypernetworks, one generating the parameters of the RNN that models the history and the other generating the parameters of the FNN.
An overview of the proposed architecture is shown in Fig. 1.
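For concreteness, the following sketch shows a descriptor-conditioned hypernetwork in the spirit of Equation (9), corresponding to the HyperHawkes-FNN variant. The names (`HazardHyperNet`, `TARGET_SHAPES`, `generated_cumulative_hazard`) and the single hidden layer of the generated hazard network are illustrative assumptions; the 32-unit hypernetwork layer and the softplus output (which keeps the generated weights positive, as required for a monotone cumulative hazard) follow the implementation details.

```python
import numpy as np
import tensorflow as tf

# Shapes of the target hazard FNN whose parameters are generated
# (one hidden layer for brevity; input is [tau, h] with a 16-dimensional h).
TARGET_SHAPES = [(17, 16), (16,), (16, 1), (1,)]

class HazardHyperNet(tf.keras.Model):
    """g_fnn of Eq. (9): maps a sequence descriptor c to the parameters
    Theta_fnn of the cumulative hazard network for that sequence."""
    def __init__(self, shapes=TARGET_SHAPES, hidden=32):
        super().__init__()
        self.shapes = shapes
        n_out = sum(int(np.prod(s)) for s in shapes)
        self.body = tf.keras.layers.Dense(hidden, activation="relu")
        self.head = tf.keras.layers.Dense(n_out, activation="softplus")  # positive generated weights

    def call(self, c):
        flat = self.head(self.body(c))                 # c has shape (1, descriptor_dim)
        params, start = [], 0
        for s in self.shapes:                          # slice and reshape into layer weights
            size = int(np.prod(s))
            params.append(tf.reshape(flat[0, start:start + size], s))
            start += size
        return params

def generated_cumulative_hazard(params, tau, h):
    """Evaluate the hazard FNN with the generated, sequence-specific weights."""
    w1, b1, w2, b2 = params
    x = tf.nn.softplus(tf.matmul(tf.concat([tau, h], axis=-1), w1) + b1)
    return tf.nn.softplus(tf.matmul(x, w2) + b2)
```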
HyperHawkes for Zero-shot modeling
In this section, we discuss how we employ HyperHawkes for zero-shot event modeling. Our goal is to train the model on the seen sequences, using their event times and sequence descriptors, and then predict the event times of an unseen sequence given only its descriptor $c_*$. The central idea of the proposed approach is to predict the parameters of the neural Hawkes process for the unseen sequence. This is achieved using the hypernetworks, which take the sequence descriptor as input and output the parameters of the neural Hawkes process, as discussed in the previous section. Consequently, we obtain the parameters of the RNN ($\Theta_{rnn}$) using Equation 8 and of the FNN ($\Theta_{fnn}$) using Equation 9 and use them to model the cumulative hazard function for an unseen sequence. These parameters can then be used for predicting events in that sequence.
Training: We train the hypernetworks using maximum likelihood estimation for the NHP model. For each seen sequence, we sample a mini-batch of events. The descriptor of this sequence is used to generate the parameters of the neural Hawkes process through the hypernetworks and their parameters. These generated values are then used in Equation 7 to obtain the log-likelihood of the event times of the seen sequences. The log-likelihood described in Equation 7 thus becomes:
$\log L(\Psi_{rnn}, \Psi_{fnn}) = \sum_{j} \left[ \log \dfrac{\partial \Phi(\tau_j \mid \boldsymbol{h}_{j-1}; \Theta^i_{fnn})}{\partial \tau} - \Phi(\tau_j \mid \boldsymbol{h}_{j-1}; \Theta^i_{fnn}) \right], \quad \Theta^i_{fnn} = g_{fnn}(c_i; \Psi_{fnn}), \ \Theta^i_{rnn} = g_{rnn}(c_i; \Psi_{rnn}) \qquad (10)$
The difference in training lies in the fact that by maximizing this log-likelihood we learn the weights of the hypernetworks rather than those of the neural Hawkes process. Thus, we update $\Psi_{rnn}$ and $\Psi_{fnn}$ using the gradient of the log-likelihood function computed via backpropagation.
Prediction: To predict events from an unseen sequence, we employ the trained hypernetworks to produce the weights of the neural Hawkes process from the sequence descriptor $c_*$. The neural Hawkes process then uses the bisection method (Omi, Ueda, and Aihara 2019) to predict the time of the next event: the median $\tau^*$ of the predictive distribution over the next inter-event time satisfies $\Phi(\tau^* \mid \boldsymbol{h}) = \log 2$, i.e., the probability that no event occurs within $\tau^*$ is $\exp(-\Phi(\tau^* \mid \boldsymbol{h})) = 0.5$.
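A possible sketch of this prediction step is given below; the helper name `predict_next_event_time` and the search bounds (`tau_max`, `iters`) are illustrative, and the routine assumes the cumulative hazard is increasing and exceeds $\log 2$ within the search interval.

```python
import numpy as np

def predict_next_event_time(cum_hazard, t_last, tau_max=100.0, iters=60):
    """Median of the predictive distribution over the next event time: find
    tau* with Phi(tau*) = log 2 by bisection (so exp(-Phi(tau*)) = 0.5) and
    return t_last + tau*.  `cum_hazard` is any callable tau -> Phi(tau | h),
    e.g. the hypernetwork-generated hazard network with the hidden state fixed."""
    lo, hi, target = 0.0, tau_max, np.log(2.0)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if cum_hazard(mid) < target:
            lo = mid
        else:
            hi = mid
    return t_last + 0.5 * (lo + hi)
```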
HyperHawkes for Continual Learning
In a more realistic setup of event time modeling, sequences appear one after the other, and it is unrealistic to store the data and models associated with all previous sequences. Nevertheless, we need to predict correctly on these past sequences even though we only have data from new sequences; we want the NHP models to retain information from past sequences while learning from new ones. The standard training of the NHP model adapts it to the new sequence data (by updating parameters to optimize the loss on the new sequence) and results in forgetting what was learnt from past sequences. The inability of neural network models to retain knowledge from past data is known as catastrophic forgetting, and continual learning techniques have been proposed to address it. Though continual learning is well studied in the vision community, to the best of our knowledge there is no prior work on the time-to-event prediction problem with NHP models.
We address the novel problem of learning time-to-event sequences continually while retaining knowledge from past time-to-event sequences. Inspired by (Von Oswald et al. 2019), we use descriptor-conditioned hypernetworks for continually learning from event sequences. Ideally, we want our model to remember the parameters of the neural Hawkes process for each sequence. A naive approach to achieve this is to store and replay previous data, which is memory expensive and unrealistic. However, HyperHawkes, being conditioned on the sequence descriptor, can be modified to handle this problem. Direct use of the HyperHawkes training objective in (10) would result in the hypernetworks forgetting how to generate the NHP parameters corresponding to past event sequences. We overcome this by incorporating a regularization on the hypernetwork parameters that penalizes any change to the NHP parameters produced for old sequences.
Given the descriptor $c_i$ of a sequence, our descriptor-conditioned hypernetworks $g_{rnn}$ and $g_{fnn}$ can generate the parameters $\Theta^i_{rnn}$ and $\Theta^i_{fnn}$. To perform continual learning, we use regularization to penalize changes in the parameters generated for past sequences, in order to retain information from those sequences and learn continually. The regularization is applied to the hypernetwork parameters while learning a new event sequence, which prevents the hypernetwork parameters from adapting completely to the new event sequence. For a new event sequence $\mathcal{S}_T$ with descriptor $c_T$, the hypernetwork parameters are learnt by minimizing the following continual learning loss over the events in the sequence:
$\mathcal{L}(\Psi) = -\log L_T(\Psi) + \dfrac{\beta}{T-1} \sum_{s=1}^{T-1} \left\lVert g(c_s; \tilde{\Psi}) - g(c_s; \Psi) \right\rVert^2 \qquad (11)$
where $\tilde{\Psi}$ represents the stored hypernetwork parameters after learning up to sequence $\mathcal{S}_{T-1}$, $\Psi$ represents the hypernetwork parameters being learnt on the event sequence $\mathcal{S}_T$ with the regularization to avoid forgetting, and $g(c_s; \cdot)$ denotes the NHP parameters generated for descriptor $c_s$. The regularization term ensures that the newly learnt hypernetwork parameters can still produce the required main-network parameters for past event sequences given their descriptors, and the regularization constant $\beta$ captures the importance associated with it. In this way, we retain information from previous sequences at a meta level. By including a simple regularization term within the framework of HyperHawkes, our model is capable of learning sequences continually without forgetting knowledge learnt from previous sequences. We achieve this through the use of a sequence-conditioned hypernetwork on top of the neural Hawkes process, emphasizing its usefulness for continual learning over event sequences in addition to zero-shot learning.
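The sketch below illustrates one way to compute the loss in Equation (11). The function name `continual_loss` and the arguments `nll_new` (negative log-likelihood of the new sequence under the generated parameters) and `past_descriptors` are illustrative assumptions, and `hypernet` / `stored_hypernet` stand for the current and stored copies of a descriptor-conditioned hypernetwork such as the one sketched earlier.

```python
import tensorflow as tf

def continual_loss(hypernet, stored_hypernet, nll_new, past_descriptors, beta=0.5):
    """Eq. (11): negative log-likelihood on the new sequence plus a penalty that
    keeps the NHP parameters generated for past descriptors close to those
    produced by the stored (frozen) hypernetwork.  Only the descriptors of past
    sequences are needed, not their event data."""
    reg = tf.constant(0.0)
    for c in past_descriptors:
        old_params = [tf.stop_gradient(p) for p in stored_hypernet(c)]
        new_params = hypernet(c)
        reg += tf.add_n([tf.reduce_sum(tf.square(o - n))
                         for o, n in zip(old_params, new_params)])
    return nll_new + beta * reg / max(len(past_descriptors), 1)
```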
Table 1: MNLL and MAE of the baselines (FNHP, FNHP-Descriptor) and the proposed variants (HH-FNN = HyperHawkes-FNN, HH-FNN-RNN = HyperHawkes-FNN-RNN) under the zero-shot, generalized zero-shot, and standard event modeling setups. Lower is better.

| Setup | Dataset | FNHP MNLL | FNHP-Descriptor MNLL | HH-FNN MNLL | HH-FNN-RNN MNLL | FNHP MAE | FNHP-Descriptor MAE | HH-FNN MAE | HH-FNN-RNN MAE |
|---|---|---|---|---|---|---|---|---|---|
| Zero-shot | Yelp | -4.2023 | -3.3334 | -5.7045 | -5.4934 | 0.0025 | 0.0015 | 0.0013 | 0.0014 |
| Zero-shot | Meme | -3.9011 | -2.9421 | -5.4339 | -5.3403 | 0.0056 | 0.0072 | 0.0036 | 0.0018 |
| Generalized zero-shot | Yelp | -4.9279 | -4.2075 | -5.2473 | -5.6869 | 0.0025 | 0.0016 | 0.0021 | 0.0014 |
| Generalized zero-shot | Meme | -4.2735 | -4.0250 | -5.2475 | -5.2922 | 0.0036 | 0.0033 | 0.0030 | 0.0026 |
| Standard event modeling | Yelp | -4.2700 | -3.8543 | -4.9030 | -4.9475 | 0.0047 | 0.0042 | 0.0024 | 0.0025 |
| Standard event modeling | Meme | -4.6991 | -2.9633 | -2.5325 | -4.8796 | 0.0046 | 0.0072 | 0.0078 | 0.0027 |
Experiments
Datasets
Due to the paucity of standard event modeling datasets that also contain meta-descriptions, we use the following two datasets. 1) Yelp (https://www.kaggle.com/datasets/yelp-dataset/): a dataset comprising business information and check-in records. Each business is associated with 82 attributes like Wheelchair Accessible, Accepts Insurance, By Appointment Only, Business Category, Business Timings etc., as well as a latitude-longitude pair. Moreover, each business is associated with a fine-grained category; we convert these into 22 broad categories using the hierarchy given on the Yelp website (https://www.yelp.com/developers/documentation/v3/all_category_list). Using these attributes, we create a vector of length 1229 representing a business, which acts as its descriptor. We select businesses with more than 5000 check-ins. For continual learning, we consider another sample of the dataset consisting of businesses with more than 10k check-ins, giving 26 business sequences; this is done to reduce the number of sequences for better visualization of per-sequence performance. 2) Meme (https://snap.stanford.edu/data/memetracker9.html): this dataset (Leskovec, Backstrom, and Kleinberg 2009) tracks popular phrases and quotes which appear most frequently over time in news media and blogs. Each meme is associated with its content and the timestamps at which it was quoted in the media. We select the top 200 English phrases, and a doc2vec representation of the meme content of length 100 is used as the descriptor. We consider memes from April 2009, with an average of 970 events per sequence.
Baselines
To the best of our knowledge, this is the first work along this direction, so we construct our own baselines: 1) FNHP: the fully neural Hawkes process (Omi, Ueda, and Aihara 2019), which does not incorporate the sequence descriptor. 2) FNHP-Descriptor: a variant where the descriptor concatenated with the time is used as input to the RNN and FNN. For the continual learning setup, we compare against HyperHawkes without any regularization as the baseline.
Implementation Details
For the zero-shot setup, we perform a 60-20-20 split, where 60% of sequences are used as seen sequences, 20% as validation unseen sequences, and the remaining 20% as test unseen sequences. We use a single layer with 32 units for the hypernetwork, with a softplus activation function for modeling the cumulative hazard function. For the neural Hawkes process, we use a recurrent neural network with one layer of 16 units and a 2-layer feed-forward neural network with 16 units in each layer. We use the Adam optimizer with learning rate 0.0001, $\beta_1 = 0.90$, and $\beta_2 = 0.99$ for the reported results. We perform single-step lookahead prediction, where the actual times of occurrence of past events are used as the historical information for predicting future events. All experiments were performed on an Intel(R) Xeon(R) Silver 4208 CPU @ 2.10GHz, a GeForce RTX 2080 Ti GPU, and 128 GB RAM, and our code is implemented in Tensorflow 2.2.0 (Abadi et al. 2015). For the continual learning setup, we test regularization parameters in the range [0.001, 0.9]. For the reported results, $\beta$ is set to 0.5 for HyperHawkes-FNN and 0.6 for HyperHawkes-FNN-RNN on Yelp, and to 0.6 for HyperHawkes-FNN and 0.1 for HyperHawkes-FNN-RNN on Meme. More implementation details and hyperparameter settings are presented in the supplementary.
Table 2: Average MNLL and MAE over all sequences for HyperHawkes-FNN (HH-FNN) and HyperHawkes-FNN-RNN (HH-FNN-RNN), trained without and with the continual learning (CL) regularizer. Lower is better.

| Dataset | HH-FNN MNLL (w/o CL) | HH-FNN MNLL (w/ CL) | HH-FNN MAE (w/o CL) | HH-FNN MAE (w/ CL) | HH-FNN-RNN MNLL (w/o CL) | HH-FNN-RNN MNLL (w/ CL) | HH-FNN-RNN MAE (w/o CL) | HH-FNN-RNN MAE (w/ CL) |
|---|---|---|---|---|---|---|---|---|
| Yelp | -4.8627 | -5.6928 | 0.00159 | 0.00149 | -5.2629 | -5.7727 | 0.00152 | 0.00150 |
| Meme | -2.8550 | -5.1462 | 0.00548 | 0.00471 | -3.8254 | -5.1192 | 0.00508 | 0.00471 |
[Figure 2: Continual learning results. (a) Average MNLL over previous sequences on Yelp; (b) average MAE over previous sequences on Meme; (c) sequence-wise MNLL for HyperHawkes-FNN-RNN with and without regularization; (d) effect of the regularization constant for HyperHawkes-FNN on Yelp and Meme.]
Experimental Setup
We consider the following experimental setups to evaluate the performance of our model: 1) Zero-Shot: training is done on seen sequences and testing on unseen sequences. 2) Generalized Zero-Shot: testing is done by randomly sampling 20% of events from both seen and unseen sequences. 3) Standard Event Modeling: training is done on the first 70% of the events from all sequences, and testing on the last 20% of events of the unseen sequences. Mean negative log-likelihood (MNLL) and mean absolute error (MAE) are the evaluation metrics for both the zero-shot and continual learning setups; lower MNLL and MAE indicate better performance.
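For reference, the two metrics can be computed as sketched below; the per-event normalization of MNLL is an assumption about the exact averaging used.

```python
import numpy as np

def mnll(sequence_log_likelihoods, num_events):
    """Mean negative log-likelihood per event (lower is better)."""
    return -np.sum(sequence_log_likelihoods) / num_events

def mae(predicted_times, true_times):
    """Mean absolute error between predicted and actual event times (lower is better)."""
    predicted_times, true_times = np.asarray(predicted_times), np.asarray(true_times)
    return np.mean(np.abs(predicted_times - true_times))
```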
Results and Analysis
Zero-Shot Learning
Results for the zero-shot setup are presented in Table 1; a better model is expected to have lower MNLL and MAE. We observe that the proposed HyperHawkes performs better than the baselines in terms of both evaluation metrics. In the zero-shot setup, which measures predictive performance on unseen sequences, HyperHawkes-FNN exhibits significantly lower MNLL and MAE than FNHP and FNHP-Descriptor. HyperHawkes-FNN-RNN yields lower MAE on the Meme dataset, while the MNLL of the two proposed variants is close. In the generalized zero-shot setup, where we consider instances from seen as well as unseen sequences, HyperHawkes-FNN-RNN performs better on both datasets. The final section of the table reports results for standard event modeling, where we test on the last 20% of the events of unseen sequences; HyperHawkes-FNN-RNN performs well here as well, except for MAE on the Yelp dataset. Overall, HyperHawkes-FNN-RNN performs slightly better in the generalized zero-shot and standard event modeling setups. Moreover, it is interesting to note that the inclusion of sequence descriptors in FNHP-Descriptor has not helped much in predicting unseen sequences; in several cases it performs worse than FNHP itself. This confirms the necessity of an event-sequence-specific framework that can predict well for unseen sequences. Therefore, our results indicate that the proposed variants of HyperHawkes perform consistently and significantly better than the baselines on both datasets.
Continual Learning
Table 2 presents the results averaged over all tasks when HyperHawkes is enabled for continual learning. We observe that the averaged performance of the proposed model is better than when no regularization is used; both proposed variants outperform the model without regularization (corresponding to setting $\beta$ in Equation 11 to 0). Hence, the use of regularization within the HyperHawkes framework helps avoid catastrophic forgetting. Fig. 2 displays sequence-wise performance on both datasets for the proposed variants HyperHawkes-FNN and HyperHawkes-FNN-RNN. Fig. 2(a) shows the average MNLL over previous sequences for both models on Yelp: when training without regularization on new sequences, the network is unable to retain information learnt from previous sequences, so MNLL increases as new sequences are trained; with regularization, catastrophic forgetting is avoided and MNLL remains lower for successive tasks. Similar behavior is observed in Fig. 2(b), which shows the average MAE over previous sequences for both variants, with and without CL, on Meme. This corroborates that using regularization with HyperHawkes helps backward transfer. Fig. 2(c) shows the MNLL for each sequence using HyperHawkes-FNN-RNN; the model with regularization attains lower MNLL than the model without regularization, reflecting that the proposed model is also able to forward transfer knowledge learnt from previous sequences. So, the model achieves both forward and backward transfer, which are important continual learning desiderata. Fig. 2(d) displays the effect of various regularization parameters on Yelp and Meme for HyperHawkes-FNN. A possible explanation is that, for Meme, a small $\beta$ prevents the model from retaining information from previous sequences, while a large $\beta$ might prevent it from learning from new sequences. To conclude, the presented results suggest that the proposed framework helps avoid catastrophic forgetting while learning continually.
Conclusion
In this work, we address two novel and practical problems in time-to-event modeling. First, we address zero-shot event modeling for predicting the times of events in unseen sequences. Second, we propose an approach for continual learning in time-to-event modeling, where event sequences arrive continually and the model learns from them while retaining previous knowledge. To address both of these problems, we propose HyperHawkes, a descriptor-conditioned hypernetwork based neural Hawkes process which generates event-sequence-specific parameters. The proposed approach can predict the times of occurrence of events in unseen sequences, hence performing zero-shot time-to-event modeling. Subsequently, we augment HyperHawkes with a regularizer which aids in learning time-to-event sequences continually by avoiding catastrophic forgetting. Our experiments on two real-world datasets demonstrate the effectiveness of the proposed approach for both problems. In this way, we augment the neural Hawkes process with the ability to perform two unexplored and practical tasks: zero-shot and continual time-to-event modeling.
References
- Abadi et al. (2015) Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G. S.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Goodfellow, I.; Harp, A.; Irving, G.; Isard, M.; Jia, Y.; Jozefowicz, R.; Kaiser, L.; Kudlur, M.; Levenberg, J.; Mané, D.; Monga, R.; Moore, S.; Murray, D.; Olah, C.; Schuster, M.; Shlens, J.; Steiner, B.; Sutskever, I.; Talwar, K.; Tucker, P.; Vanhoucke, V.; Vasudevan, V.; Viégas, F.; Vinyals, O.; Warden, P.; Wattenberg, M.; Wicke, M.; Yu, Y.; and Zheng, X. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. Software available from tensorflow.org.
- Bacry, Mastromatteo, and Muzy (2015) Bacry, E.; Mastromatteo, I.; and Muzy, J.-F. 2015. Hawkes processes in finance. Market Microstructure and Liquidity, 1(01): 1550005.
- Chiang, Liu, and Mohler (2021) Chiang, W.-H.; Liu, X.; and Mohler, G. 2021. Hawkes process modeling of COVID-19 with mobility leading indicators and spatial covariates. International journal of forecasting.
- Chilinski and Silva (2020) Chilinski, P.; and Silva, R. 2020. Neural likelihoods via cumulative distribution functions. In Conference on Uncertainty in Artificial Intelligence, 420–429. PMLR.
- Diggle, Rowlingson, and Su (2005) Diggle, P.; Rowlingson, B.; and Su, T.-l. 2005. Point process methodology for on-line spatio-temporal disease surveillance. Environmetrics: The official journal of the International Environmetrics Society, 16(5): 423–434.
- Du et al. (2016) Du, N.; Dai, H.; Trivedi, R.; Upadhyay, U.; Gomez-Rodriguez, M.; and Song, L. 2016. Recurrent marked temporal point processes: Embedding event history to vector. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1555–1564.
- Embrechts, Liniger, and Lin (2011) Embrechts, P.; Liniger, T.; and Lin, L. 2011. Multivariate Hawkes processes: an application to financial data. Journal of Applied Probability, 48(A): 367–378.
- Felix et al. (2018) Felix, R.; Reid, I.; Carneiro, G.; et al. 2018. Multi-modal cycle-consistent generalized zero-shot learning. In Proceedings of the European Conference on Computer Vision (ECCV), 21–37.
- Frome et al. (2013) Frome, A.; Corrado, G. S.; Shlens, J.; Bengio, S.; Dean, J.; Ranzato, M.; and Mikolov, T. 2013. Devise: A deep visual-semantic embedding model. Advances in neural information processing systems, 26.
- Ha, Dai, and Le (2017) Ha, D.; Dai, A. M.; and Le, Q. V. 2017. HyperNetworks. In 5th International Conference on Learning Representations, ICLR 2017, Conference Track Proceedings. OpenReview.net.
- Hainzl, Steacy, and Marsan (2010) Hainzl, S.; Steacy, D.; and Marsan, S. 2010. Seismicity models based on Coulomb stress calculations. Community Online Resource for Statistical Seismicity Analysis.
- Hawkes (1971) Hawkes, A. G. 1971. Spectra of some self-exciting and mutually exciting point processes. Biometrika, 58(1): 83–90.
- He et al. (2022) He, Y.; Zheng, S.; Tay, Y.; Gupta, J.; Du, Y.; Aribandi, V.; Zhao, Z.; Li, Y.; Chen, Z.; Metzler, D.; et al. 2022. Hyperprompt: Prompt-based task-conditioning of transformers. In International Conference on Machine Learning, 8678–8690. PMLR.
- Kirkpatrick et al. (2017) Kirkpatrick, J.; Pascanu, R.; Rabinowitz, N.; Veness, J.; Desjardins, G.; Rusu, A. A.; Milan, K.; Quan, J.; Ramalho, T.; Grabska-Barwinska, A.; et al. 2017. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13): 3521–3526.
- Lampert, Nickisch, and Harmeling (2009) Lampert, C. H.; Nickisch, H.; and Harmeling, S. 2009. Learning to detect unseen object classes by between-class attribute transfer. In 2009 IEEE conference on computer vision and pattern recognition, 951–958. IEEE.
- Leskovec, Backstrom, and Kleinberg (2009) Leskovec, J.; Backstrom, L.; and Kleinberg, J. 2009. Meme-tracking and the dynamics of the news cycle. In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, 497–506.
- Li and Hoiem (2017) Li, Z.; and Hoiem, D. 2017. Learning without forgetting. IEEE transactions on pattern analysis and machine intelligence, 40(12): 2935–2947.
- Lopez-Paz and Ranzato (2017) Lopez-Paz, D.; and Ranzato, M. 2017. Gradient episodic memory for continual learning. Advances in neural information processing systems, 30.
- Mei and Eisner (2016) Mei, H.; and Eisner, J. 2016. The neural hawkes process: A neurally self-modulating multivariate point process. arXiv preprint arXiv:1612.09328.
- Mei and Eisner (2017) Mei, H.; and Eisner, J. M. 2017. The Neural Hawkes Process: A Neurally Self-Modulating Multivariate Point Process. Advances in Neural Information Processing Systems, 30.
- Mohler et al. (2011) Mohler, G. O.; Short, M. B.; Brantingham, P. J.; Schoenberg, F. P.; and Tita, G. E. 2011. Self-exciting point process modeling of crime. Journal of the American Statistical Association, 106(493): 100–108.
- Omi, Ueda, and Aihara (2019) Omi, T.; Ueda, N.; and Aihara, K. 2019. Fully neural network based model for general temporal point processes. arXiv preprint arXiv:1905.09690.
- Palatucci et al. (2009) Palatucci, M.; Pomerleau, D.; Hinton, G. E.; and Mitchell, T. M. 2009. Zero-shot learning with semantic output codes. Advances in neural information processing systems, 22.
- Rizoiu et al. (2017) Rizoiu, M.-A.; Lee, Y.; Mishra, S.; and Xie, L. 2017. A tutorial on hawkes processes for events in social media. arXiv preprint arXiv:1708.06401.
- Salehi et al. (2019) Salehi, F.; Trouleau, W.; Grossglauser, M.; and Thiran, P. 2019. Learning hawkes processes from a handful of events. Advances in Neural Information Processing Systems, 32.
- Valkeila (2008) Valkeila, E. 2008. An Introduction to the Theory of Point Processes, Volume II: General Theory and Structure, by Daryl J. Daley, David Vere-Jones.
- Vaswani et al. (2017) Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. Advances in neural information processing systems, 30.
- Von Oswald et al. (2019) Von Oswald, J.; Henning, C.; Sacramento, J.; and Grewe, B. F. 2019. Continual learning with hypernetworks. arXiv preprint arXiv:1906.00695.
- Wang, Ye, and Gupta (2018) Wang, X.; Ye, Y.; and Gupta, A. 2018. Zero-shot recognition via semantic embeddings and knowledge graphs. In Proceedings of the IEEE conference on computer vision and pattern recognition, 6857–6866.
- Xiao et al. (2017) Xiao, S.; Yan, J.; Yang, X.; Zha, H.; and Chu, S. 2017. Modeling the intensity function of point process via recurrent neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31.
- Xie et al. (2019) Xie, Y.; Jiang, H.; Liu, F.; Zhao, T.; and Zha, H. 2019. Meta learning with relational information for short sequences. Advances in neural information processing systems, 32.
- Zhang et al. (2020) Zhang, Q.; Lipani, A.; Kirnap, O.; and Yilmaz, E. 2020. Self-attentive Hawkes process. In International conference on machine learning, 11183–11193. PMLR.
- Zhao et al. (2020) Zhao, D.; von Oswald, J.; Kobayashi, S.; Sacramento, J.; and Grewe, B. F. 2020. Meta-learning via hypernetworks.
- Zoph and Le (2016) Zoph, B.; and Le, Q. V. 2016. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578.
- Zuo et al. (2020) Zuo, S.; Jiang, H.; Li, Z.; Zhao, T.; and Zha, H. 2020. Transformer hawkes process. In International conference on machine learning, 11692–11702. PMLR.