
Temporal Point Process Graphical Models

Yalong Lyu, Huiyuan Wang, and Wei Lin ([email protected])
School of Mathematical Sciences, Peking University, China
The first and second authors contributed equally to this paper.
Abstract

Many real-world systems can be modeled as streams of events on the nodes of a graph. In this paper, we propose a class of graphical event models named temporal point process graphical models for representing the temporal dependencies among different components of a multivariate point process. In our model, the intensity of an event stream can depend on historical events in a nonlinear way. We provide a procedure that allows us to estimate the parameters in the model with a convex loss function in the high-dimensional setting. We also establish an error bound for our estimators that accounts for the approximation error introduced during implementation. We demonstrate the performance of our method with extensive simulations and a spike train data set.

keywords: graphical models; temporal point processes; high-dimensionality

1 Introduction

Graphical modeling of point processes is drawing increasing attention, as data in the form of point processes on a graph are emerging in scientific and business applications. Examples include biological neural networks with neural spike train data (see, e.g., Paninski et al., 2007; Pillow et al., 2008; Bolding and Franks, 2018); social media data recording timestamps of each user’s actions (Perry and Wolfe, 2013; Zhou et al., 2013; Fox et al., 2016); and high-frequency financial data recording times of market orders (Bacry et al., 2013; Aït-Sahalia et al., 2015; Bacry et al., 2015). Understanding the connectivity structure of these point processes, i.e., the edges of the graphs, is a challenging problem of growing interest. Our work is motivated by a recent spike train data set from Bolding and Franks (2018), which was collected from two brain regions important for olfaction in mice, namely the olfactory bulb and the piriform cortex. The neural activity in the olfactory bulb is highly concentration-dependent, while that in the piriform cortex is concentration-invariant. The study of Bolding and Franks (2018) suggests that there exists a complex excitation-inhibition mechanism between the olfactory bulb and the piriform cortex that allows the nervous system to produce a concentration-invariant representation of odors.

In this article, we propose a new temporal point process graphical model to address this problem. We model the spike train data of the neurons as a multivariate temporal point process, where each component of the point process represents the activity of one neuron. We allow future events to be excited or inhibited by past events in a non-additive manner. We organize the transfer coefficients into a matrix and impose a sparse structure, which aids interpretation in neuroscience and other applications (Zhao and Yu, 2006; Zou, 2006). Our loss function is convex, so many highly scalable algorithms, such as the alternating direction method of multipliers and coordinate descent, can be applied to parameter estimation. We establish a non-asymptotic error bound for the estimator and allow the number of nodes in the graph to diverge. We also analyze the approximation error introduced during the implementation phase.

A great number of works have focused on graphical models, including the Gaussian graphical model (Yuan and Lin, 2007; Liu, 2013) and directed acyclic graphs (Kalisch and Bühlman, 2007; Van de Geer et al., 2013; Zheng et al., 2018). In these models, each node of the graph corresponds to a random variable and the data are realizations of the random variables. The graphs of classical graphical models represent the structure of conditional independence among the random variables. In contrast, we consider the case where each node of the graph represents a point process, so the graph can be viewed as a multivariate point process. In our setting, the graph encodes the local independence relationships among the components of the multivariate point process (Didelez, 2008). The basic idea of local independence is that once we know about specific past events, the intensity of future events is independent of other past events. In our model, local independence is fully characterized by the parameters (see Proposition 1).

A family of models targets temporal point processes. A Hawkes process (Hawkes, 1971) assumes that historical events can trigger future events and has been widely used in modeling neuronal spike train data and social network data. The model we propose differs from existing research on the Hawkes process in many ways. Most prior art has focused on parameter estimation in the linear case, where the transfer function of the multivariate Hawkes process is the identity function. Hansen et al. (2015) considered estimating parameters in a linear Hawkes process with a fixed number of nodes using a least-square loss function and \ell_{1} regularization. Bacry et al. (2020) organized the transfer coefficients as a matrix and imposed a low-rank structure. A crucial weakness of the linear Hawkes process is that it does not allow inhibitory relations, i.e., past events can only trigger more future events, not silence them. Our model extends to a nonlinear setting where inhibitory influences are allowed and the influences of past events from different nodes can accumulate in a non-additive manner.

Tang and Li (2021) also considered parameter estimation of a nonlinear Hawkes process. Our work differs from theirs in several aspects. First, we focus on the estimation of sparse graphical models while they focus on the estimation of a low-rank tensor of regression parameters. Second, we provide a stronger theoretical guarantee by proving that the global minimum of our penalized loss function has a \sqrt{\frac{s\log p}{T}} convergence rate, while their theory only shows that there exists a local minimum of the penalized loss function in the neighborhood of the true parameter. Finally, our loss function is convex while theirs is not. A convex loss function with restricted strong convexity allows us to find the global minimum efficiently.

The paper is organized as follows. We finish this section with notation. We briefly introduce the temporal point process and our model in Section 2. In Section 3, we present the construction of our estimator and our main theorem. In Section 4, we analyze the approximation error introduced during the implementation phase. We show the results of simulations in Section 5 and an application to a spike train data set in Section 6. We conclude with a discussion in Section 7. Technical proofs are provided in the Appendix.

Some notation will be employed throughout this paper. Let [n],\ n\in\mathbb{N} denote the set \{1,\ldots,n\}. For a set S\subset[p] and a vector v\in\mathbb{R}^{p}, we define v_{S}\in\mathbb{R}^{p} by (v_{S})_{i}=v_{i}\mathbb{I}(i\in S). Let \|\cdot\|_{q},\ q\in[1,\infty) denote the \ell_{q}-norm of a vector and \|\cdot\|_{0} denote the number of non-zero entries of a vector. For a matrix A\in\mathbb{R}^{m\times n}, let \|A\|_{F}=(\sum_{i=1}^{m}\sum_{j=1}^{n}a_{ij}^{2})^{1/2} denote its Frobenius norm. We use S_{q}^{p}(r) for a \|\cdot\|_{q} ball with radius r in \mathbb{R}^{p}. We use c_{1},c_{2},\ldots to denote positive numbers that do not depend on p or T but may depend on quantities that we treat as constants in this paper; their values can change from line to line.

2 The Temporal Point Process Graphical Model

We will begin with a very brief review of temporal point processes. For a more systematic and comprehensive discussion of point processes, we refer to Brémaud (1981) and Daley and Vere-Jones (2007).

Let N(t),\ t\geq 0 be a simple point process. N(t) corresponds to a sequence of real-valued random variables \{t_{k}\}_{k\in\mathbb{Z}} with t_{k+1}>t_{k} and t_{1}\geq 0 almost surely; \{t_{k}\}_{k\in\mathbb{Z}} are called the event times of N(t). For a set A in the Borel \sigma-field of the real line, N(A)=\sum_{k}\mathbb{I}_{\{t_{k}\in A\}}. We write N([t,t+\mathrm{d}t)) as \mathrm{d}N(t), where \mathrm{d}t denotes an arbitrarily small increment of t. Let \mathcal{H}_{t} denote the \sigma-field generated by N(\tau),\ \tau\in[0,t]. The intensity of N(t) is defined as

\lambda(t)\mathrm{d}t=\mathbb{P}(\mathrm{d}N(t)=1\,|\,\mathcal{H}_{t}).

Now suppose we have p simple point processes N_{1},\ldots,N_{p}, with t_{j,1}<t_{j,2}<\cdots<t_{j,n}<\cdots denoting the event times of the process N_{j},\ j\in[p]. We employ \mathcal{N}=(N_{1},\ldots,N_{p}) to denote a p-variate point process. Let \mathcal{H}_{t} denote the \sigma-field generated by \mathcal{N}(\tau),\ \tau\in[0,t]. The intensity process \bm{\lambda}(t)=(\lambda_{1}(t),\ldots,\lambda_{p}(t))^{\mathrm{T}} is a p-variate \mathcal{H}_{t}-predictable process defined by

\lambda_{j}(t)\mathrm{d}t=\mathbb{P}\big{(}\mathrm{d}N_{j}(t)=1\,|\,\mathcal{H}_{t}\big{)},\ j\in[p].

In our model, we assume this intensity process takes the form

\lambda_{j}(t)=h_{j}\Big{(}\mu_{j}+\sum_{k=1}^{p}\int_{0}^{t}\omega_{j,k}(\tau)\,\mathrm{d}N_{k}(t-\tau)\Big{)},\ j\in[p], (1)

where h_{j}(\cdot),\ j\in[p] are link functions and can be nonlinear; \mu_{j},\ j\in[p] are the background intensities of the components of the multivariate point process; and \omega_{j,k}(\cdot):\mathbb{R}^{+}\rightarrow\mathbb{R},\ j,k\in[p] are the transfer functions. \bm{\lambda}(t)=\big{(}\lambda_{1}(t),\cdots,\lambda_{p}(t)\big{)} is called the intensity process. The convolution \int_{0}^{t}\omega_{j,k}(\tau)\,\mathrm{d}N_{k}(t-\tau) summarizes the historical information of the k-th component. In this paper, we consider parameterized transfer functions that can be written as

\omega_{j,k}(t)=\beta_{j,k}\kappa_{j,k}(t),\ j,k\in[p],

where \kappa_{j,k}(t),\ j,k\in[p] are known kernel functions.
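To make the model concrete, the following minimal sketch evaluates the intensity in (1) for a single node under the parameterized transfer functions \omega_{j,k}=\beta_{j,k}\kappa_{j,k}; the function names and the shared kernel are our own illustrative choices, not part of the model specification.

```python
import numpy as np

def intensity(t, events, mu, beta, kernel, h):
    """Sketch of lambda_j(t) in Eq. (1) for one node j with omega = beta * kappa.

    events : list of p arrays of event times, one array per node
    mu     : background intensity mu_j
    beta   : length-p array of transfer coefficients (beta_{j,1}, ..., beta_{j,p})
    kernel : shared kernel kappa(tau), vanishing outside [0, kappa_supp]
    h      : bounded link function h_j
    """
    drive = mu
    for k, times_k in enumerate(events):
        lags = t - times_k[times_k < t]        # tau = t - (event time of node k)
        drive += beta[k] * kernel(lags).sum()  # int_0^t omega_{j,k}(tau) dN_k(t - tau)
    return h(drive)

# toy usage with the arctan link and the triangular kernel of Section 5
events = [np.array([0.2, 0.9]), np.array([0.5])]
lam = intensity(
    1.0, events, mu=0.5, beta=np.array([0.3, -0.3]),
    kernel=lambda u: np.where((u >= 0) & (u <= 1), 1.0 - u, 0.0),
    h=lambda x: 4 + (8 / np.pi) * np.arctan(x),
)
```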

In graphical models, the edges encode conditional independence structures among the random variables (Maathuis et al., 2018; Qiao et al., 2019; Engelke and Hitz, 2020). For multivariate point processes, the question of interest is whether the intensity of one component depends on the history of another component, which can be represented by local independence for multivariate point processes (Didelez, 2008).

Definition 1 (local independence for multivariate point process)

Let \mathcal{N}=(N_{1},\ldots,N_{p}) be a multivariate point process, and let A,B and C be disjoint subsets of [p]. We say that a subprocess \mathcal{N}_{B} is locally independent of \mathcal{N}_{A} given \mathcal{N}_{C} over a time interval \mathcal{T} if \mathbb{E}[\lambda_{k}(t)\,|\,\mathcal{F}_{t}^{A\cup B\cup C}]=\mathbb{E}[\lambda_{k}(t)\,|\,\mathcal{F}_{t}^{B\cup C}] holds for all t\in\mathcal{T} and k\in B, where \mathcal{F}_{t}^{A} denotes the \sigma-field generated by \{N_{i}([0,t]),\ i\in A\}.

In temporal point process graphical models, the local independence structure is encoded in the transfer matrix B=(\beta_{j,k})\in\mathbb{R}^{p\times p}, because \lambda_{j}(t)=h_{j}\big{(}\mu_{j}+\sum_{k=1}^{p}\int_{0}^{t}\beta_{j,k}\kappa_{j,k}(\tau)\,\mathrm{d}N_{k}(t-\tau)\big{)} is \mathcal{F}_{t}^{A}-measurable for any index set A such that \{k:\beta_{j,k}\neq 0\}\subset A. We have the following proposition.

Proposition 1

For a temporal point process graphical model \mathcal{N}=(N_{1},\ldots,N_{p}), let A,B be disjoint subsets of [p] and \mathcal{N}_{A}=\{N_{i},\ i\in A\}. Then \mathcal{N}_{A} is locally independent of \mathcal{N}_{B} given \mathcal{N}_{(A\cup B)^{c}} if and only if \beta_{i,j}=0 for all i\in A,\ j\in B.

Thus, by estimating the transfer matrix B, we can recover the local independence relationships in \mathcal{N}(t).

3 Estimation

In this section, we present an approach for estimating temporal point process graphical models. We begin with the construction of our estimator in Section 3.1, then provide a detailed comparison of our method with the methods widely employed in prior art, namely least-square estimation and maximum likelihood estimation, in Section 3.2. In Section 3.3, we present a theoretical analysis that guarantees the performance of our method.

3.1 Our Method

In this paper, we focus on temporal point process graphical models with bounded link functions and compactly supported transfer functions, as stated in Assumption 1.

Assumption 1

There exists a constant h_{\max} such that h_{j}(\cdot)<h_{\max},\ j\in[p], and h_{j},\ j\in[p] have continuous derivatives. The kernels \kappa_{j,k}(t),\ j,k\in[p] are supported on [0,\kappa_{\operatorname{supp}}] for some constant \kappa_{\operatorname{supp}} and have bounded derivatives |\kappa_{j,k}^{\prime}(t)|\leq\kappa_{\max},\ j,k\in[p] (and naturally we can pick \kappa_{\max} such that |\kappa_{j,k}(t)|\leq\kappa_{\max},\ j,k\in[p] also holds).

Similar assumptions on the link function and transfer kernels are required in existing theory (Hansen et al., 2015; Chen et al., 2017b; Bacry et al., 2020). As we allow a nonlinear link function, the transfer kernels are no longer restricted to be non-negative. For simplicity of the theory, we assume that h_{j}(\cdot) is bounded; this assumption covers most cases in applications.

Under Assumption 1, the temporal point process is Markovian, i.e., given the event times in [t,t+\kappa_{\operatorname{supp}}], the intensity process \{\bm{\lambda}(s),\ s>t+\kappa_{\operatorname{supp}}\} is independent of \mathcal{H}_{t}. Together with the fact that the intensity process is bounded by h_{\max}, it is straightforward to show that the empty state \phi, in which there is no event on [t,t+\kappa_{\operatorname{supp}}], is positive recurrent and the temporal point process is stable in distribution. For more discussion of stability for temporal point processes, we refer interested readers to Brémaud and Massoulié (1996) and Costa et al. (2018).

Suppose that we observe the event times of a stationary process on [0,T]. Under Assumption 1, we can rewrite the intensity in a compact form:

\lambda_{j}(t)=h_{j}\big{(}\mu_{j}+\langle\bm{x}_{j}(t),\bm{\beta}_{j}\rangle\big{)},\ j\in[p],

where \bm{x}_{j}(t)=\big{(}x_{j,1}(t),\cdots,x_{j,p}(t)\big{)}^{\mathrm{T}}, x_{j,k}(t)=\int_{0}^{t}\kappa_{j,k}(t-\tau)\,\mathrm{d}N_{k}(\tau) and \bm{\beta}_{j}=(\beta_{j,1},\beta_{j,2},\ldots,\beta_{j,p}). Inspired by the log-linear models in Negahban et al. (2012), we consider the following loss function for estimating (\mu_{j},\bm{\beta}_{j}):

L_{T}(\bm{\mu},B)=\frac{1}{T}\sum_{j=1}^{p}\int_{0}^{T}W_{j}(t)\Big{(}H_{j}\big{(}\mu_{j}+\langle\bm{\beta}_{j},\bm{x}_{j}(t)\rangle\big{)}\,\mathrm{d}t-\big{(}\mu_{j}+\langle\bm{\beta}_{j},\bm{x}_{j}(t)\rangle\big{)}\,\mathrm{d}N_{j}(t)\Big{)}, (2)

where H_{j}(\cdot) is a primitive function of h_{j}(\cdot), i.e., H_{j}^{\prime}(\cdot)=h_{j}(\cdot), \bm{\mu}=(\mu_{1},\ldots,\mu_{p}), B=(\bm{\beta}_{1}^{\mathrm{T}},\ldots,\bm{\beta}_{p}^{\mathrm{T}}), and \bm{W}(t)=\big{(}W_{1}(t),\ldots,W_{p}(t)\big{)} is a weighting process. If \bm{\mu} and B are intrinsically sparse for practical reasons, we can adopt the following \ell_{1}-penalized loss minimization (Tibshirani, 1996):

(\hat{\bm{\mu}},\hat{B})\in\mathop{\arg\min}_{\bm{\mu},B}\Big{\{}\frac{1}{T}\sum_{j=1}^{p}\int_{0}^{T}W_{j}(t)\Big{(}H_{j}\big{(}\mu_{j}+\langle\bm{\beta}_{j},\bm{x}_{j}(t)\rangle\big{)}\,\mathrm{d}t-\big{(}\mu_{j}+\langle\bm{\beta}_{j},\bm{x}_{j}(t)\rangle\big{)}\,\mathrm{d}N_{j}(t)\Big{)}+\lambda_{T}(\|\bm{\mu}\|_{1}+\|B\|_{1})\Big{\}}. (3)

In generalized linear models, Negahban et al. (2012) employed a similar method to derive a convex loss function. The negative log-likelihood loss function for the generalized linear model is convex but not strongly convex when p>T, i.e., there are some directions with zero gradient and the Hessian matrix is not of full rank. Negahban et al. (2012) addressed this problem by introducing the concept of restricted strong convexity. For a decomposable penalty (such as \|\cdot\|_{1}) and a convex loss function, the solution of the optimization always falls in a "restricted set", on which the loss function is strongly convex. We show that our loss function also admits a restricted strong convexity condition. This implies that the obtained estimator, regardless of the algorithm used or the initial value, attains the convergence rate we prove. In contrast, the estimator provided by Tang and Li (2021) is a local estimator, which retains its statistical properties only when the initial value is close enough to the true parameters.

3.2 Examples

For estimating a temporal point process graphical model with linear link functions, there are two widely used estimation strategies: the least-square estimator (Bacry et al., 2020; Hansen et al., 2015),

L^{\mathrm{LS}}_{T}(\bm{\mu},B)=\frac{1}{T}\sum_{j=1}^{p}\int_{0}^{T}\lambda_{j}^{2}(t;\mu_{j},\bm{\beta}_{j})\,\mathrm{d}t-\frac{2}{T}\int_{0}^{T}\lambda_{j}(t;\mu_{j},\bm{\beta}_{j})\,\mathrm{d}N_{j}(t), (4)

and the maximum likelihood estimator (Ozaki, 1979; Tang and Li, 2021):

L^{\mathrm{MLE}}_{T}(\bm{\mu},B)=\frac{1}{T}\sum_{j=1}^{p}\int_{0}^{T}\lambda_{j}(t;\mu_{j},\bm{\beta}_{j})\,\mathrm{d}t-\frac{1}{T}\int_{0}^{T}\log\big{(}\lambda_{j}(t;\mu_{j},\bm{\beta}_{j})\big{)}\,\mathrm{d}N_{j}(t). (5)

The score functions for the least-square loss and the negative log-likelihood loss are

\nabla_{\bm{\beta}_{j}}L_{T}^{\mathrm{LS}}=\frac{2}{T}\int_{0}^{T}h^{\prime}_{j}(\mu_{j}+\langle\bm{\beta}_{j},\bm{x}_{j}(t)\rangle)\,\bm{x}_{j}(t)\Big{[}h_{j}(\mu_{j}+\langle\bm{\beta}_{j},\bm{x}_{j}(t)\rangle)\,\mathrm{d}t-\mathrm{d}N_{j}(t)\Big{]}

and

\nabla_{\bm{\beta}_{j}}L_{T}^{\mathrm{MLE}}=\frac{1}{T}\int_{0}^{T}\frac{h^{\prime}_{j}(\mu_{j}+\langle\bm{\beta}_{j},\bm{x}_{j}(t)\rangle)}{h_{j}(\mu_{j}+\langle\bm{\beta}_{j},\bm{x}_{j}(t)\rangle)}\,\bm{x}_{j}(t)\Big{[}h_{j}(\mu_{j}+\langle\bm{\beta}_{j},\bm{x}_{j}(t)\rangle)\,\mathrm{d}t-\mathrm{d}N_{j}(t)\Big{]}.

It can be seen that optimizing the least-square risk function is equivalent to optimizing (2) with the iterative reweighting \hat{W}^{\mathrm{LS}}_{j}(t)=h^{\prime}_{j}(\hat{\mu}_{j}+\langle\hat{\bm{\beta}}_{j},\bm{x}_{j}(t)\rangle), and optimizing the negative log-likelihood risk function is equivalent to optimizing (2) with the iterative reweighting \hat{W}^{\mathrm{MLE}}_{j}(t)=\frac{h^{\prime}_{j}(\hat{\mu}_{j}+\langle\hat{\bm{\beta}}_{j},\bm{x}_{j}(t)\rangle)}{h_{j}(\hat{\mu}_{j}+\langle\hat{\bm{\beta}}_{j},\bm{x}_{j}(t)\rangle)}. Directly optimizing the risk functions of the least-square estimator and the maximum likelihood estimator are non-convex problems, but our iterative reweighting procedure permits us to approximate each non-convex problem with a sequence of convex problems.
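The reweighting scheme can be sketched as a loop that alternates between refreshing the weights at the current fit and re-solving the resulting convex problem. In the sketch below, `solve_weighted` and `eta_of` are hypothetical helpers standing in for any convex solver of the weighted loss (2) plus an \ell_{1} penalty (e.g., the proximal-gradient sketch in Section 4) and for the evaluation of the linear predictor; they are not part of our method's specification.

```python
import numpy as np

def irls_mle(solve_weighted, eta_of, h, h_prime, n_iter=5):
    """Sketch: approximate the non-convex MLE by a sequence of convex problems.

    solve_weighted(W) -> (mu, beta): hypothetical convex solver of (2) + l1
        penalty for fixed weights W (W=None means W_j(t) = 1).
    eta_of(mu, beta) -> array of eta_j(t) = mu_j + <beta_j, x_j(t)> on a grid.
    """
    W = None                        # first pass: naive weights W_j(t) = 1
    for _ in range(n_iter):
        mu, beta = solve_weighted(W)
        eta = eta_of(mu, beta)
        W = h_prime(eta) / h(eta)   # refresh W^MLE at the current fit
    return mu, beta
```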

3.3 Theoretical Guarantees

We now return to the \ell_{1}-penalized loss minimization (3). Our theoretical results indicate that the estimator is consistent even in the high-dimensional case. Without restrictions, it is possible to construct extreme graphs where, for instance, some components of the influence process \bm{x}(t) are constantly pushed towards infinity, which makes the model hardly identifiable. To avoid such cases, we pose the following regularity conditions.

Assumption 2

(1) The mean influences \mu^{*}_{j}+\sum_{k\in S_{j}}\beta^{*}_{j,k}\mathbb{E}[x_{j,k}],\ j\in[p], are uniformly bounded:

\sup_{j\in[p]}\Big{(}\mu^{*}_{j}+\sum_{k\in S_{j}}\beta^{*}_{j,k}\mathbb{E}[x_{j,k}]\Big{)}<\sigma_{0}

for some constant \sigma_{0}>0.

(2) The covariance matrix of the influence process, \Sigma_{j}=\mathrm{Cov}\big{(}\bm{x}_{j}(t)\big{)}=\mathrm{Cov}\big{(}x_{j,1}(t),\cdots,x_{j,p}(t)\big{)}, is positive definite and has bounded eigenvalues:

0<\Sigma_{\min}\leq\Gamma_{\min}(\Sigma_{j})\leq\Gamma_{\max}(\Sigma_{j})\leq\Sigma_{\max}<\infty,\ \forall j\in[p],

where \Gamma_{\min}(\cdot) and \Gamma_{\max}(\cdot) denote the smallest and largest eigenvalues of a matrix. That is, for any (\delta_{\mu},\bm{\Delta}^{\mathrm{T}})\in\mathbb{S}^{p+1}_{2}(1) we have

\mathbb{E}\big{(}\delta_{\mu}+\langle\bm{\Delta},\bm{x}_{j}(t)\rangle\big{)}^{2}>\sigma_{1},

and there exists some R>0 such that

\mathbb{E}\Big{[}\big{(}\delta_{\mu}+\langle\bm{\Delta},\bm{x}_{j}(t)\rangle\big{)}^{2}\,\mathbb{I}\big{(}|\mu_{j}^{*}+\langle\bm{\beta}_{j}^{*},\bm{x}_{j}(t)\rangle|\leq R\big{)}\,\mathbb{I}\big{(}|\delta_{\mu}+\langle\bm{\Delta},\bm{x}_{j}(t)\rangle|\leq R\big{)}\Big{]}>\sigma_{1}. (6)

(3) The derivative is bounded away from 0 on a compact interval: there exists \sigma_{2}>0 such that

\inf_{-2R<t<2R}h_{j}^{\prime}(t)\geq\sigma_{2}>0,\ \forall j\in[p],

where R is introduced above.

(4) There exist 0<\sigma_{3}<\sigma_{4} such that \sigma_{3}<W_{j}(t)<\sigma_{4},\ \forall j\in[p], with probability 1 for T large enough.

We briefly discuss these assumptions. Assumption 2 (1) restricts the average influence of the process N_{j}(t) and is equivalent to having bounded intercept terms in vanilla Lasso problems (Tibshirani, 1996). Having non-degenerate covariates (Assumption 2 (2)) is a standard requirement for estimation problems (condition GLM1 in Negahban et al., 2012; regularity condition 4 in Tang and Li, 2021). Here (6) asks for a version of non-degeneracy that looks stronger, but it is not truly stronger: if we let R\rightarrow\infty, we have

\mathbb{E}\Big{[}\big{(}\delta_{\mu}+\langle\bm{\Delta},\bm{x}_{j}(t)\rangle\big{)}^{2}\,\mathbb{I}\big{(}|\mu_{j}^{*}+\langle\bm{\beta}_{j}^{*},\bm{x}_{j}(t)\rangle|\leq R\big{)}\,\mathbb{I}\big{(}|\delta_{\mu}+\langle\bm{\Delta},\bm{x}_{j}(t)\rangle|\leq R\big{)}\Big{]}\rightarrow\mathbb{E}\big{(}\delta_{\mu}+\langle\bm{\Delta},\bm{x}_{j}(t)\rangle\big{)}^{2}.

This involves a trade-off between the constants \sigma_{1} and \sigma_{2}. One can show that (6) holds if \bm{x}_{j}(t) has a sub-exponential tail. Assumption 2 (3) is employed to ensure identifiability. In summary, Assumptions 2 (2) and 2 (3) are required mainly for technical reasons, and we propose Assumption 2 (4) to make sure that the weights are well-behaved.

Now we are ready to introduce the main theorem.

Theorem 1

For the estimators in (3), under Assumptions 1 and 2, there exist constants c_{1},c_{2},c_{3}>0 such that, with \lambda_{T}=c_{1}\sqrt{\frac{\tau\log(p\vee T)}{T}}, we have

p^{-1/2}\|(\hat{\bm{\mu}},\hat{B})-(\bm{\mu}^{*},B^{*})\|_{F}\leq c_{2}\sqrt{s}\,\lambda_{T}, (7)
\|(\hat{\bm{\mu}},\hat{B})-(\bm{\mu}^{*},B^{*})\|_{\infty}\leq c_{2}s\lambda_{T}, (8)

which hold with probability greater than 1-\frac{c_{3}}{(p\vee T)^{\tau}}, where s=1+\max_{j\in[p]}\|\bm{\beta}_{j}^{*}\|_{0} and a star * as a superscript denotes the true value of a parameter.

Theorem 1 guarantees that, with high probability, the estimator converges at the rate \sqrt{\frac{s\log(p\vee T)}{T}}, which is similar to the rates established in prior works (Bacry et al., 2020; Hansen et al., 2015; Tang and Li, 2021). It is important to note that Theorem 1 considers an oracle procedure, where the tuning parameter depends on unknown parameters. When the link functions are linear, Hansen et al. (2015) explored selection guidelines in the case where the number of nodes p is fixed but the dictionary for sieving transfer kernels grows. A general theory for tuning parameter selection when p\rightarrow\infty remains an open problem. A detailed proof of Theorem 1 is provided in the Appendix.

4 Implementation

4.1 Theory for Approximation

Implementation lies between beautiful statistical theory and the successful application of statistical methods. The concrete details of implementation are important but seldom discussed in prior articles. In this section, we present a comprehensive analysis of the implemented version of our estimator.

For temporal point processes, the algorithm usually proceeds in a discrete manner (Tang and Li, 2021). The observation interval [0,T] is divided into M subintervals of equal length \frac{T}{M}, and the integral is replaced by a numerical integral. Let y_{j,m}=N_{j}(\frac{(m+1)T}{M})-N_{j}(\frac{mT}{M}) denote the number of events of node j in (\frac{mT}{M},\frac{(m+1)T}{M}] and let x_{j,k,m}=x_{j,k}(\frac{mT}{M}) for m=0,\ldots,M-1. Let \bm{x}_{j,m}=(x_{j,1,m},\cdots,x_{j,p,m}). Then our loss function in implementation becomes

\tilde{L}_{T}(\bm{\mu},B)=\frac{1}{T}\sum_{j=1}^{p}\sum_{m=0}^{M-1}W_{j}\Big{(}\frac{mT}{M}\Big{)}\Big{(}\frac{T}{M}H_{j}(\mu_{j}+\langle\bm{\beta}_{j},\bm{x}_{j,m}\rangle)-y_{j,m}(\mu_{j}+\langle\bm{\beta}_{j},\bm{x}_{j,m}\rangle)\Big{)}. (9)
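The following sketch builds the discretized quantities y_{j,m} and x_{j,k,m} from raw event times; it assumes, as in our simulations, that all pairs share one kernel \kappa, so the covariate contributed by node k is the same for every target node j. The function name is ours.

```python
import numpy as np

def discretize(events, T, M, kernel):
    """Sketch of the binning in Section 4.1 under a shared kernel kappa.

    events : list of p arrays of event times on [0, T]
    returns y, x of shape (p, M):
        y[k, m] = number of events of node k in (mT/M, (m+1)T/M]
        x[k, m] = x_{j,k}(mT/M) = sum of kappa(mT/M - event) over past events
    """
    p = len(events)
    grid = np.arange(M) * (T / M)              # left endpoints mT/M
    y = np.empty((p, M), dtype=int)
    x = np.empty((p, M))
    for k, tk in enumerate(events):
        y[k] = np.histogram(tk, bins=np.append(grid, T))[0]
        lags = grid[:, None] - tk[None, :]     # mT/M minus each event time
        x[k] = np.where(lags > 0, kernel(lags), 0.0).sum(axis=1)
    return y, x
```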

Generally, the discrete version (9) introduces additional approximation error relative to (2). To characterize this additional error, we pose the following error decomposition:

y_{j,m}=N_{j}\Big{(}\frac{(m+1)T}{M}\Big{)}-N_{j}\Big{(}\frac{mT}{M}\Big{)}=\mathbb{E}\Big{[}N_{j}\Big{(}\frac{(m+1)T}{M}\Big{)}-N_{j}\Big{(}\frac{mT}{M}\Big{)}\,\Big{|}\,\lambda_{j}|_{(0,\frac{(m+1)T}{M}]}\Big{]}+\epsilon_{1,j,m}

and

\mathbb{E}\Big{[}N_{j}\Big{(}\frac{(m+1)T}{M}\Big{)}-N_{j}\Big{(}\frac{mT}{M}\Big{)}\,\Big{|}\,\lambda_{j}|_{(0,\frac{(m+1)T}{M}]}\Big{]}=\int_{\frac{mT}{M}}^{\frac{(m+1)T}{M}}h_{j}(\mu_{j}^{*}+\langle\bm{x}_{j}(\tau),\bm{\beta}_{j}^{*}\rangle)\,\mathrm{d}\tau=\frac{T}{M}h_{j}(\mu_{j}^{*}+\langle\bm{x}_{j,m},\bm{\beta}^{*}_{j}\rangle)+\epsilon_{2,j,m},

where \bm{x}_{j,m}=(x_{j,1,m},x_{j,2,m},\ldots,x_{j,p,m}). Thus we have

y_{j,m}=\frac{T}{M}h_{j}\Big{(}\mu_{j}^{*}+\langle\bm{x}_{j,m},\bm{\beta}_{j}^{*}\rangle\Big{)}+\epsilon_{j,m},\ j\in[p],\ m=0,\ldots,M-1,

where \epsilon_{j,m}=\epsilon_{1,j,m}+\epsilon_{2,j,m}. The choice of M is of great importance because it induces a trade-off between computational efficiency and approximation accuracy. To ensure that the approximation error \epsilon_{2,j,m} is dominated by the statistical noise \epsilon_{1,j,m}, we assume the following.

Assumption 3

The number of subintervals M is large enough that \frac{\tau^{1/4}s^{1/2}T^{5/4}\log^{2}(p\vee T)}{M}=o(1).

The \ell_{1}-penalized loss minimization in implementation is

(\tilde{\bm{\mu}},\tilde{B})\in\mathop{\arg\min}_{\bm{\mu},B}\Big{\{}\frac{1}{T}\sum_{j=1}^{p}\sum_{m=0}^{M-1}W_{j}\Big{(}\frac{mT}{M}\Big{)}\Big{(}\frac{T}{M}H_{j}(\mu_{j}+\langle\bm{\beta}_{j},\bm{x}_{j,m}\rangle)-y_{j,m}(\mu_{j}+\langle\bm{\beta}_{j},\bm{x}_{j,m}\rangle)\Big{)}+\lambda_{T}(\|\bm{\mu}\|_{1}+\|B\|_{1})\Big{\}}. (10)
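A minimal solver for (10) can be written as proximal gradient descent with soft-thresholding; since our theory is algorithm-agnostic, this ISTA-style sketch (with our own function names and a fixed step size) is only one convenient choice among the scalable convex solvers mentioned in Section 1.

```python
import numpy as np

def soft_threshold(z, a):
    """Prox of a * |.|_1 (entrywise)."""
    return np.sign(z) * np.maximum(np.abs(z) - a, 0.0)

def fit_node(x, y, dt, H, h, lam, W=None, step=1e-2, n_iter=2000):
    """ISTA sketch for one row (mu_j, beta_j) of (10).

    x : (p, M) covariates x_{j,k,m};  y : (M,) counts y_{j,m};  dt = T / M
    H, h : primitive and link function (H' = h);  W : (M,) weights or None
    """
    p, M = x.shape
    W = np.ones(M) if W is None else W
    mu, beta = 0.0, np.zeros(p)
    for _ in range(n_iter):
        eta = mu + beta @ x                     # eta_m = mu + <beta, x_m>
        # gradient of (1/T) * sum_m W_m * (dt * H(eta_m) - y_m * eta_m)
        g = W * (dt * h(eta) - y) / (M * dt)    # note 1/T = 1/(M * dt)
        mu = soft_threshold(mu - step * g.sum(), step * lam)
        beta = soft_threshold(beta - step * (x @ g), step * lam)
    return mu, beta
```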

By carefully analyzing the approximation error, we can provide a similar convergence rate bound for (\tilde{\bm{\mu}},\tilde{B}).

Theorem 2

For the estimators in (10), under Assumptions 1-3, there exist constants c_{1},c_{2},c_{3}>0 such that, with \lambda_{T}=c_{1}\sqrt{\frac{\tau\log(p\vee T)}{T}}, we have

p^{-1/2}\|(\tilde{\bm{\mu}},\tilde{B})-(\bm{\mu}^{*},B^{*})\|_{F}\leq c_{2}\sqrt{s}\,\lambda_{T}, (11)
\|(\tilde{\bm{\mu}},\tilde{B})-(\bm{\mu}^{*},B^{*})\|_{\infty}\leq c_{2}s\lambda_{T}, (12)

which hold with probability greater than 1-\frac{c_{3}}{(p\vee T)^{\tau}}, where s=1+\max_{j\in[p]}\|\bm{\beta}_{j}^{*}\|_{0} and a star * as a superscript denotes the true value of a parameter.

When M satisfies Assumption 3, the approximation error is dominated by the statistical noise. There are other graphical models of point processes that divide the observation window into a number of bins and model the number of events in each bin, either with a generalized linear model (Zhang et al., 2016) or with a Gaussian graphical model (Vinci et al., 2018). Our method is fundamentally different from these methods because discretization is employed only in the implementation of the method; our final goal is to estimate the parameters of the generalized Hawkes process rather than of another model. A detailed proof is included in the supplementary material.

5 Simulation

In this section, we investigate the performance of our estimation procedure in a simulation study. We consider two estimators: the naively weighted estimator with W_{j}(t)=1,\ \forall j\in[p], and the iteratively reweighted maximum likelihood estimator, where W_{j}(t)=\frac{h_{j}^{\prime}(\hat{\mu}_{j}+\langle\hat{\bm{\beta}}_{j},\bm{x}_{j}(t)\rangle)}{h_{j}(\hat{\mu}_{j}+\langle\hat{\bm{\beta}}_{j},\bm{x}_{j}(t)\rangle)} is iteratively updated.

We set the number of nodes to p\in\{30,60\}. We consider two popular graphical structures: block and chain. For the block structure, B is a 60\times 60 (30\times 30) block-diagonal matrix of 12 (6) identical symmetric Toeplitz matrices whose first row is (0,0.3,-0.3,0.3,-0.3) for p=60 (p=30). For the chain structure, B is a symmetric Toeplitz matrix whose first row is (0,0.3,-0.3,0,0,\cdots,0). Their explicit forms are shown in Figures 1 and 2 for p=30 and p=60, where the excitatory connections (entries \beta_{j,k}=0.3) are shown in blue and the inhibitory connections (entries \beta_{j,k}=-0.3) are shown in red. The background intensity is \mu_{j}=0.5. We consider two settings of link function and transfer kernel. For setting 1, we use an \arctan link function and a restricted linear transfer kernel:

\begin{cases}h_{j}(x)=4+\frac{8}{\pi}\arctan(x),\ j\in[p],\\ \kappa_{j,k}(x)=(1-x)\mathbb{I}_{[0,1]}(x),\ j,k\in[p].\end{cases}

For setting 2, we use a sigmoid link function and an exponential transfer kernel:

\begin{cases}h_{j}(x)=1+\frac{4e^{x}}{1+e^{x}},\ j\in[p],\\ \kappa_{j,k}(x)=e^{-x}\mathbb{I}_{[0,5]}(x),\ j,k\in[p].\end{cases}
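Since Assumption 1 bounds every intensity by h_{\max}, such processes can be simulated exactly by thinning a homogeneous proposal process. The sketch below uses Ogata-style thinning with a shared kernel; it is our own illustration rather than the simulation code used in this section. With the setting-2 link above, one can take h_max = 5 since h_{j}(x)\in(1,5).

```python
import numpy as np

def simulate(T, mu, B, kernel, h, h_max, seed=None):
    """Thinning sketch: propose candidates at total rate p * h_max, assign each
    to a uniformly chosen node j, and accept with probability lambda_j(t) / h_max."""
    rng = np.random.default_rng(seed)
    p = len(mu)
    events = [[] for _ in range(p)]
    t = 0.0
    while True:
        t += rng.exponential(1.0 / (p * h_max))   # next candidate time
        if t > T:
            return [np.array(e) for e in events]
        j = rng.integers(p)                       # candidate node
        drive = mu[j]
        for k in range(p):                        # drive = mu_j + sum_k beta_{j,k} * sum kappa(t - s)
            if events[k]:
                lags = t - np.asarray(events[k])
                drive += B[j, k] * kernel(lags).sum()
        if rng.random() < h(drive) / h_max:       # thinning acceptance
            events[j].append(t)
```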
Figure 1: Connectivity matrices with block and chain structures for p=30. The excitatory connections (entries \beta_{j,k}=0.3) are shown in blue and the inhibitory connections (entries \beta_{j,k}=-0.3) are shown in red.
Figure 2: Connectivity matrices with block and chain structures for p=60. The excitatory connections (entries \beta_{j,k}=0.3) are shown in blue and the inhibitory connections (entries \beta_{j,k}=-0.3) are shown in red.

The length of observation is T\in\{200,400\} and the number of sub-intervals is M=10T. We use 5-fold cross-validation to select the penalty level. We evaluate the estimation accuracy by the relative \ell_{1} error \frac{\sum_{i,j}|\hat{\beta}_{i,j}-\beta_{i,j}|}{\sum_{i,j}|\beta_{i,j}|} and the relative mean square error of the estimated transfer coefficient matrix, \frac{\|\hat{B}-B^{*}\|_{F}^{2}}{\|B^{*}\|_{F}^{2}}. Three methods are compared: the \ell_{1}-penalized naively weighted estimator (Sparse-Naive), the \ell_{1}-penalized iteratively reweighted maximum likelihood estimator (Sparse-MLE), and the vanilla maximum likelihood estimator. The average relative \ell_{1} and \ell_{2} errors based on 50 replications, with standard errors in parentheses, are reported.
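As a sketch of the selection step, the following 5-fold cross-validation holds out a random fifth of the M sub-intervals, refits with the `fit_node` sketch from Section 4, and scores the held-out unpenalized loss. The helper names are ours, and randomly splitting bins ignores their temporal ordering, which is harmless here only because the discretized loss is additive over bins.

```python
import numpy as np

def cv_select(x, y, dt, H, h, lam_grid, n_folds=5, seed=0):
    """5-fold CV sketch over the M sub-intervals for one node.
    Requires fit_node from the Section 4 sketch."""
    rng = np.random.default_rng(seed)
    M = y.shape[0]
    folds = rng.permutation(M) % n_folds        # balanced random fold labels
    scores = []
    for lam in lam_grid:
        err = 0.0
        for f in range(n_folds):
            tr, te = folds != f, folds == f
            mu, beta = fit_node(x[:, tr], y[tr], dt, H, h, lam)
            eta = mu + beta @ x[:, te]
            err += np.sum(dt * H(eta) - y[te] * eta)  # held-out loss (9), W = 1
        scores.append(err)
    return lam_grid[int(np.argmin(scores))]
```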

Table 1: Estimation accuracy of B in the form of relative \ell_{1} error
Setting Structure p T Sparse-Naive Sparse-MLE vanilla-MLE
Setting 1 Block 30 200 0.718(0.049) 0.667(0.025) 1.510(0.057)
400 0.525(0.024) 0.547(0.032) 0.942(0.025)
60 200 1.010(0.077) 0.805(0.022) 4.460(0.134)
400 0.776(0.037) 0.656(0.011) 2.128(0.040)
Chain 30 200 0.877(0.043) 0.796(0.032) 1.793(0.135)
400 0.645(0.028) 0.656(0.034) 1.066(0.048)
60 200 1.240(0.086) 0.959(0.025) 5.277(0.170)
400 0.916(0.049) 0.787(0.022) 2.403(0.063)
Setting 2 Block 30 200 0.895(0.037) 0.884(0.033) 2.812(0.163)
400 0.716(0.023) 0.723(0.026) 1.612(0.052)
60 200 1.035(0.043) 1.048(0.032) 14.62(0.611)
400 0.862(0.025) 0.872(0.025) 3.906(0.083)
Chain 30 200 1.063(0.057) 1.054(0.033) 3.571(0.213)
400 0.883(0.037) 0.882(0.038) 1.826(0.086)
60 200 1.174(0.044) 1.176(0.026) 15.77(0.546)
400 1.031(0.029) 1.024(0.024) 4.451(0.137)
Table 2: Estimation accuracy of B in the form of relative \ell_{2} error
Setting Structure p T Sparse-Naive Sparse-MLE vanilla-MLE
Setting 1 Block 30 200 0.235(0.018) 0.241(0.018) 0.495(0.042)
400 0.149(0.012) 0.150(0.012) 0.191(0.011)
60 200 0.300(0.015) 0.307(0.014) 2.235(0.143)
400 0.174(0.005) 0.194(0.007) 0.483(0.019)
Chain 30 200 0.326(0.029) 0.318(0.026) 0.843(0.301)
400 0.200(0.017) 0.181(0.018) 0.250(0.034)
60 200 0.418(0.027) 0.399(0.021) 3.523(0.296)
400 0.224(0.011) 0.242(0.011) 0.687(0.077)
Setting 2 Block 30 200 0.485(0.030) 0.475(0.032) 1.772(0.268)
400 0.285(0.020) 0.281(0.020) 0.555(0.036)
60 200 0.575(0.029) 0.566(0.024) 24.53(1.682)
400 0.348(0.020) 0.341(0.019) 1.652(0.074)
Chain 30 200 0.672(0.083) 0.620(0.045) 3.575(0.729)
400 0.389(0.033) 0.371(0.027) 0.736(0.120)
60 200 0.769(0.064) 0.724(0.058) 27.85(1.561)
400 0.484(0.023) 0.462(0.022) 2.561(0.266)
Table 3: The mean area under the ROC curves
Setting Structure p T Sparse-Naive Sparse-MLE
Setting 1 Block 30 200 0.995(0.0026) 0.997(0.0015)
400 1.000(0.0001) 1.000(0.0001)
60 200 0.995(0.0016) 0.997(0.0014)
400 1.000(0.0001) 1.000(0.0001)
Chain 30 200 0.954(0.0074) 0.965(0.0063)
400 0.974(0.0025) 0.978(0.0027)
60 200 0.968(0.0041) 0.976(0.0032)
400 0.985(0.0012) 0.988(0.0012)
Setting 2 Block 30 200 0.914(0.0143) 0.916(0.0139)
400 0.979(0.0044) 0.981(0.0042)
60 200 0.915(0.0137) 0.918(0.0138)
400 0.983(0.0039) 0.985(0.0035)
Chain 30 200 0.837(0.0174) 0.841(0.0189)
400 0.924(0.0090) 0.925(0.0100)
60 200 0.840(0.0134) 0.844(0.0145)
400 0.944(0.0056) 0.948(0.0052)

Tables 1 and 2 summarize the results based on 50 data replications. Both of our proposed methods outperform the vanilla maximum likelihood estimator, which is known to be asymptotically efficient (Ogata, 1978) in estimating parameters when the transfer function is linear. Overall, the iteratively reweighted penalized maximum likelihood estimator achieves better performance. However, the loss surface of the \ell_{1}-penalized MLE is not convex, so there is no guarantee that the working algorithm will find a global solution. The naively weighted estimator provides a convex loss function, which guarantees that the algorithm returns a solution in the neighborhood of the global minimizer.

To demonstrate the accuracy of variable selection for our proposed methods, we calculated true-positive and false-positive rates, i.e., the proportions of edges of the network that were correctly and incorrectly identified, for each method and tuning parameter. Plotting the true positive rate versus the false positive rate over a fine grid of values of the tuning parameter produces an ROC curve, with curves near the top left corner indicating a method that performs well. Figure 3 shows the median of the ROC curves for our two proposed methods and Table 3 provides the area under the ROC curve (averaged over the 50 simulation runs). Both methods are successful in recovering the graphical structure in all the settings we considered.
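The ROC computation reduces to sweeping the penalty level and comparing the support of each estimate with the true support; a small sketch is given below, where `fit(lam)` is a hypothetical helper that returns the estimated transfer matrix at penalty level lam.

```python
import numpy as np

def roc_points(fit, B_true, lam_grid, tol=1e-8):
    """Trace (FPR, TPR) pairs for edge recovery over a penalty grid."""
    truth = np.abs(B_true) > tol                 # true edge pattern
    pts = []
    for lam in lam_grid:
        est = np.abs(fit(lam)) > tol             # selected edges at this lam
        tpr = (est & truth).sum() / truth.sum()
        fpr = (est & ~truth).sum() / (~truth).sum()
        pts.append((fpr, tpr))
    return sorted(pts)                           # ready for plotting / AUC
```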

Figure 3: ROC curves for p=60,\ T=200 in setting 1 (top row) and setting 2 (bottom row).

Finally, we report the computation time. The naively weighted estimator has a convex loss function, does not need to compute the reweighting parameters, and runs faster than the iteratively reweighted MLE. The simulation code is programmed in Python. For the simulation example in setting 1 with p=60, T=400 and M=4000, the average computing time was about 1.5 minutes for the Naive estimator and 1.7 minutes for the MLE. For the simulation example in setting 2 with p=60, T=400 and M=4000, the average computing time was 2.1 minutes for the Naive estimator and 2.5 minutes for the MLE. All computations were performed with an Intel(R) Core(TM) i9-10900K CPU @ 3.70GHz.

6 Application

We consider the task of recovering the connectivity network among neurons using the spike train data from the optogenetic study of Bolding and Franks (2018) discussed in Section 1. In this experiment, the spike trains are recorded at 30 kHz on the olfactory bulb (OB) and the piriform cortex (PCx) of mice, while a laser pulse is applied directly to the OB cells. For specially designed transgenic mice, light pulses elicit an increase in OB spiking that remains elevated for the duration of the stimulus. We consider the spike train data collected at three intensity levels, 0\ mW/mm^{2}, 5\ mW/mm^{2} and 10\ mW/mm^{2}, from 15 OB cells and 45 PCx cells. For each intensity level, the observation lasts about T=90 seconds. We choose the number of sub-intervals M=2000 by pre-experiments.

Figure 4: Number of firings of the OB cells and the PCx cells for laser intensities 0\ mW/mm^{2}, 5\ mW/mm^{2} and 10\ mW/mm^{2}. OB cells are numbered 1-15 and PCx cells are numbered 16-60.

Figure 4 shows the number of firings of the OB cells and the PCx cells. The firing numbers vary greatly among cells. We use an adaptive link function to better characterize the average intensity of different neurons:

h_{i}(t)=\hat{\lambda}_{i}\Big{(}1+\frac{2\arctan(t)}{\pi}\Big{)},\ i=1,\ldots,60,

where \hat{\lambda}_{i}=\frac{N_{i}([0,T])}{T}. Based on research on the duration of influence among neurons in Bolding and Franks (2018), we choose \kappa_{j,k}(x)=\mathbb{I}_{[0,0.25]}(x),\ j,k\in[p]. Since our goal is to provide interpretable visualizations and investigate the influence of the laser on the neural connectivity, we computed sparse graphs with approximately 5\% connected edges. We performed a bootstrap procedure by randomly selecting 2000 sub-intervals with replacement from the original data, finding a tuning parameter \gamma_{n} to achieve a 5\% sparsity level, fitting the point process graphical model, and repeating 50 times. The final graph was constructed from the excitatory and inhibitory edges that occurred in at least 50\% of the bootstrap replications.
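The bootstrap just described can be sketched as a stability-selection loop. In the sketch below, `fit_on` is a hypothetical helper that refits the model on a resampled set of sub-intervals at the 5% sparsity level and returns the sign pattern of the estimated transfer matrix, and the majority-sign rule is a simplification of the edge-frequency rule in the text.

```python
import numpy as np

def bootstrap_edges(fit_on, n_bins=2000, n_boot=50, keep_frac=0.5, seed=0):
    """Stability sketch: resample sub-intervals with replacement, refit,
    and keep an edge if one sign dominates in >= keep_frac of replications."""
    rng = np.random.default_rng(seed)
    votes = 0
    for _ in range(n_boot):
        idx = rng.integers(n_bins, size=n_bins)  # resample bins with replacement
        votes = votes + fit_on(idx)              # sign matrix in {-1, 0, +1}
    return np.where(np.abs(votes) >= keep_frac * n_boot, np.sign(votes), 0)
```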

Figure 5: Estimated networks among the OB cells and the PCx cells for laser intensities 0\ mW/mm^{2}, 5\ mW/mm^{2} and 10\ mW/mm^{2}, from left to right. OB cells are numbered 1-15 and PCx cells are numbered 16-60. A red pixel at (i,j) indicates an inhibitory connection from cell j to cell i and a blue pixel at (i,j) indicates an excitatory connection from cell j to cell i.

Figure 5 shows the networks estimated using the point process graphical model for the three groups. The bootstrapped graphical model estimated a sparser network, with sparsity level 3.28\% for the 0\ mW/mm^{2} group, 4.08\% for the 5\ mW/mm^{2} group and 3.67\% for the 10\ mW/mm^{2} group. We observe a few apparent patterns. First, PCx cells are densely connected in all three groups, but the 0\ mW/mm^{2} group has increased connectivity relative to the other groups, suggesting that there may be a multi-layer excitatory-inhibitory neuron network in the piriform cortex that stabilizes the odor signal. Second, when the laser was applied to the OB cells, the number of edges from other cells to the OB cells increased and the signs of several connections changed, indicating that the laser affected the connectivity structure of the OB cells and PCx cells and that this effect was heterogeneous. Finally, we observed that a few neurons fired less frequently and had little effect on other cells (such as OB-9, PCx-1, PCx-15, PCx-42, PCx-44). Their role in the odor-processing mechanism needs to be further explored.

7 Conclusion and Future Work

In this paper, we have proposed the temporal point process graphical model, a new class of graphical models for learning the temporal dependencies among different components of a multivariate point process. We have shown that the local independence structure of the process is fully encoded in the transfer matrix. We have provided a class of estimators with a non-asymptotic estimation error bound, which extends the classical estimators of temporal point processes. Our model naturally includes the case of sparse temporal point process regression, where the predictors and the responses are not the same process. The performance of our estimator has been tested on simulations and a spike train data set.

Several challenges remain in the analysis of temporal point process graphical models. Cross-validation for selecting tuning parameters has proved successful for classical Lasso estimators (Chetverikov et al., 2021); however, a detailed theoretical analysis is needed to demonstrate its performance in our case. Criteria for optimal tuning parameter selection for estimation and variable selection are left for future research. A second challenge is to assess the uncertainty of the estimated parameters, which is less addressed in the existing work.

References

  • Aït-Sahalia et al. [2015] Yacine Aït-Sahalia, Julio Cacho-Diaz, and Roger JA Laeven. Modeling financial contagion using mutually exciting jump processes. Journal of Financial Economics, 117(3):585–606, 2015.
  • Bacry et al. [2013] Emmanuel Bacry, Sylvain Delattre, Marc Hoffmann, and Jean-François Muzy. Some limit theorems for hawkes processes and application to financial statistics. Stochastic Processes and their Applications, 123(7):2475–2499, 2013.
  • Bacry et al. [2015] Emmanuel Bacry, Iacopo Mastromatteo, and Jean-François Muzy. Hawkes processes in finance. Market Microstructure and Liquidity, 1(01):1550005, 2015.
  • Bacry et al. [2020] Emmanuel Bacry, Martin Bompaire, Stéphane Gaïffas, and Jean-Francois Muzy. Sparse and low-rank multivariate hawkes processes. Journal of Machine Learning Research, 21(50):1–32, 2020.
  • Bolding and Franks [2018] K. A. Bolding and K. M. Franks. Recurrent cortical circuits implement concentration-invariant odor coding. Science, 361(6407):eaat6904–, 2018.
  • Brémaud and Massoulié [1996] Pierre Brémaud and Laurent Massoulié. Stability of nonlinear hawkes processes. The Annals of Probability, 24(3):1563 – 1588, 1996.
  • Brémaud [1981] Pierre Brémaud. Point Processes and Queues. Springer-Verlag, 1981.
  • Chen et al. [2017a] Shizhe Chen, Ali Shojaie, Eric Shea-Brown, and Daniela Witten. The multivariate hawkes process in high dimensions: Beyond mutual excitation. arXiv preprint arXiv:1707.04928, 2017a.
  • Chen et al. [2017b] Shizhe Chen, Daniela Witten, and Ali Shojaie. Nearly assumptionless screening for the mutually-exciting multivariate hawkes process. Electronic journal of statistics, 11(1):1207, 2017b.
  • Chetverikov et al. [2021] Denis Chetverikov, Zhipeng Liao, and Victor Chernozhukov. On cross-validated lasso in high dimensions. The Annals of Statistics, 49(3):1300–1317, 2021.
  • Costa et al. [2018] Manon Costa, Carl Graham, Laurence Marsalle, and Viet Chi Tran. Renewal in hawkes processes with self-excitation and inhibition. Advances in Applied Probability, 52, 01 2018.
  • Daley and Vere-Jones [2007] DJ Daley and David Vere-Jones. An Introduction to the Theory of Point Processes: Volume II: General Theory and Structure. Springer Science & Business Media, 2007.
  • Didelez [2008] Vanessa Didelez. Graphical models for marked point processes based on local independence. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(1):245–264, 2008.
  • Engelke and Hitz [2020] Sebastian Engelke and Adrien S Hitz. Graphical models for extremes. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 82(4):871–932, 2020.
  • Fox et al. [2016] Eric W Fox, Martin B Short, Frederic P Schoenberg, Kathryn D Coronges, and Andrea L Bertozzi. Modeling e-mail networks and inferring leadership using self-exciting point processes. Journal of the American Statistical Association, 111(514):564–584, 2016.
  • Hansen et al. [2015] Niels Hansen, Patricia Reynaud-Bouret, and Vincent Rivoirard. Lasso and probabilistic inequalities for multivariate point processes. Bernoulli, 21, 08 2015.
  • Hawkes [1971] Alan G Hawkes. Spectra of some self-exciting and mutually exciting point processes. Biometrika, 58(1):83–90, 1971.
  • Hawkes and Oakes [1974] Alan G. Hawkes and David Oakes. A cluster process representation of a self-exciting process. Journal of Applied Probability, 11(03):493–503, 1974.
  • Kalisch and Bühlman [2007] Markus Kalisch and Peter Bühlman. Estimating high-dimensional directed acyclic graphs with the pc-algorithm. Journal of Machine Learning Research, 8(3), 2007.
  • Liu [2013] Weidong Liu. Gaussian graphical model estimation with false discovery rate control. The Annals of Statistics, pages 2948–2978, 2013.
  • Maathuis et al. [2018] Marloes Maathuis, Mathias Drton, Steffen Lauritzen, and Martin Wainwright. Handbook of graphical models. CRC Press, 2018.
  • Negahban et al. [2012] Sahand N Negahban, Pradeep Ravikumar, Martin J Wainwright, and Bin Yu. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical science, 27(4):538–557, 2012.
  • Ogata [1978] Yosihiko Ogata. The asymptotic behaviour of maximum likelihood estimators for stationary point processes. Annals of the Institute of Statistical Mathematics, 30(2):243–261, 1978.
  • Ozaki [1979] Tohru Ozaki. Maximum likelihood estimation of Hawkes’ self-exciting point processes. Annals of the Institute of Statistical Mathematics, 31(1):145–155, 1979.
  • Paninski et al. [2007] Liam Paninski, Jonathan Pillow, and Jeremy Lewi. Statistical models for neural encoding, decoding, and optimal stimulus design. Progress in brain research, 165:493–507, 2007.
  • Perry and Wolfe [2013] Patrick O Perry and Patrick J Wolfe. Point process modelling for directed interaction networks. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(5):821–849, 2013.
  • Pillow et al. [2008] Jonathan W Pillow, Jonathon Shlens, Liam Paninski, Alexander Sher, Alan M Litke, EJ Chichilnisky, and Eero P Simoncelli. Spatio-temporal correlations and visual signalling in a complete neuronal population. Nature, 454(7207):995–999, 2008.
  • Qiao et al. [2019] Xinghao Qiao, Shaojun Guo, and Gareth M James. Functional graphical models. Journal of the American Statistical Association, 114(525):211–222, 2019.
  • Shorack and Wellner [2009] Galen R Shorack and Jon A Wellner. Empirical processes with applications to statistics. SIAM, 2009.
  • Tang and Li [2021] Xiwei Tang and Lexin Li. Multivariate temporal point process regression. Journal of the American Statistical Association, 0(0):1–16, 2021.
  • Tibshirani [1996] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.
  • Van de Geer et al. [2013] Sara Van de Geer, Peter Bühlmann, et al. \ell_{0}-penalized maximum likelihood for sparse directed acyclic graphs. Annals of Statistics, 41(2):536–567, 2013.
  • Vinci et al. [2018] Giuseppe Vinci, Valérie Ventura, Matthew A Smith, and Robert E Kass. Adjusted regularization in latent graphical models: Application to multiple-neuron spike count data. The annals of applied statistics, 12(2):1068, 2018.
  • Yuan and Lin [2007] Ming Yuan and Yi Lin. Model selection and estimation in the gaussian graphical model. Biometrika, 94(1):19–35, 2007.
  • Zhang et al. [2016] Chunming Zhang, Yi Chai, Xiao Guo, Muhong Gao, David Devilbiss, and Zhengjun Zhang. Statistical learning of neuronal functional connectivity. Technometrics, 58(3):350–359, 2016.
  • Zhao and Yu [2006] Peng Zhao and Bin Yu. On model selection consistency of lasso. The Journal of Machine Learning Research, 7:2541–2563, 2006.
  • Zheng et al. [2018] Xun Zheng, Bryon Aragam, Pradeep K Ravikumar, and Eric P Xing. Dags with no tears: Continuous optimization for structure learning. Advances in Neural Information Processing Systems, 31, 2018.
  • Zhou et al. [2013] Ke Zhou, Hongyuan Zha, and Le Song. Learning triggering kernels for multi-dimensional hawkes processes. In International Conference on Machine Learning, pages 1301–1309. PMLR, 2013.
  • Zou [2006] Hui Zou. The adaptive lasso and its oracle properties. Journal of the American statistical association, 101(476):1418–1429, 2006.

Appendix A

Here we present the proof of Theorem 1, which is divided into three steps. First, we show that the loss function satisfies the restricted strong convexity condition in Theorem 3. Second, we show that the estimator has a convergence rate of \sqrt{\frac{s\log(p\vee T)}{T}} as long as \lambda_{T} dominates the statistical noise (Theorem 4). In the last step, we show that \lambda_{T} dominates the statistical noise with high probability (Lemma 1).

Theorem 3

Let

L_{T}(\mu_{j},\bm{\beta}_{j})=\frac{1}{T}\int_{0}^{T}W_{j}(t)\Big{(}H_{j}(\mu_{j}+\langle\bm{\beta}_{j},\bm{x}_{j}(t)\rangle)\,\mathrm{d}t-(\mu_{j}+\langle\bm{\beta}_{j},\bm{x}_{j}(t)\rangle)\,\mathrm{d}N_{j}(t)\Big{)} (13)

denote the loss function. Under Assumptions 1 and 2, there exist constants \kappa_{1},\kappa_{2}>0 such that for every (\delta_{\mu},\bm{\Delta}^{\mathrm{T}})\in K(r,S),

L_{T}(\mu^{*}_{j}+\delta_{\mu},\bm{\beta}^{*}_{j}+\bm{\Delta})-L_{T}(\mu^{*}_{j},\bm{\beta}^{*}_{j})-\langle\nabla_{\bm{\beta}_{j}}L_{T}(\mu^{*}_{j},\bm{\beta}^{*}_{j}),\bm{\Delta}\rangle-\nabla_{\mu}L_{T}(\mu^{*}_{j},\bm{\beta}^{*}_{j})\delta_{\mu}
\geq\kappa_{1}\|(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})\|_{2}\Big{\{}\|(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})\|_{2}-\kappa_{2}(\tau+1)^{2}\varepsilon(T,p)\|(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})\|_{1}^{2}\Big{\}}

holds with probability greater than 1-\frac{c_{1}}{(p\vee T)^{\tau}}, where S is the support of \bm{\beta}^{*}_{j}, K(r,S)=\{(\delta_{\mu},\bm{\Delta})\in\mathbb{R}^{p+1}:\|(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})\|_{2}=r,\ \|(0,\bm{\Delta}_{S^{c}}^{\mathrm{T}})\|_{1}\leq 4\|(\delta_{\mu},\bm{\Delta}_{S}^{\mathrm{T}})\|_{1}\}, and \varepsilon(T,p)=\sqrt{\frac{\log(p\vee T)}{T}}.

Theorem 4

Under Assumptions 1 and 2, taking \lambda_{T}=c_{1}\sqrt{\frac{\tau\log(p\vee T)}{T}} and conditioning on the event \cap_{j=1}^{p}G_{j}, where

G_{j}=\Big{\{}\lambda_{T}\geq 2h_{\max}\Big{|}\frac{1}{T}\int_{0}^{T}W_{j}(t)\Big{(}h_{j}(\mu_{j}^{*}+\langle\bm{\beta}^{*}_{j},\bm{x}_{j}(t)\rangle)\,\mathrm{d}t-\mathrm{d}N_{j}(t)\Big{)}\Big{|}\ \text{and}\ \lambda_{T}\geq 2h_{\max}\Big{|}\frac{1}{T}\int_{0}^{T}W_{j}(t)\Big{(}h_{j}(\mu_{j}^{*}+\langle\bm{\beta}^{*}_{j},\bm{x}_{j}(t)\rangle)x_{j,k}(t)\,\mathrm{d}t-x_{j,k}(t)\,\mathrm{d}N_{j}(t)\Big{)}\Big{|},\ \forall k\in[p]\Big{\}},

we have

p^{-1/2}\|(\hat{\bm{\mu}},\hat{B})-(\bm{\mu}^{*},B^{*})\|_{F}\leq c_{2}\sqrt{s}\,\lambda_{T} (14)

and

\|(\hat{\bm{\mu}},\hat{B})-(\bm{\mu}^{*},B^{*})\|_{\infty}\leq c_{2}s\lambda_{T} (15)

hold with probability greater than 1-\frac{c_{3}}{(p\vee T)^{\tau}}, where s=1+\max_{j}\|\bm{\beta}^{*}_{j}\|_{0}.

Lemma 1

Under Assumptions 1 and 2, there exist positive constants c_{1},c_{2}>0 such that, with probability greater than 1-\frac{c_{1}}{(p\vee T)^{\tau}}, we have

\Big{|}\frac{1}{T}\int_{0}^{T}W_{j}(t)\Big{(}h_{j}(\mu_{j}^{*}+\langle\bm{\beta}^{*}_{j},\bm{x}_{j}(t)\rangle)x_{j,k}(t)\,\mathrm{d}t-x_{j,k}(t)\,\mathrm{d}N_{j}(t)\Big{)}\Big{|}\leq c_{2}\sqrt{\frac{\tau\log(p\vee T)}{T}},\ \ \forall k\in[p]

and

\Big{|}\frac{1}{T}\int_{0}^{T}W_{j}(t)\Big{(}h_{j}(\mu_{j}^{*}+\langle\bm{\beta}^{*}_{j},\bm{x}_{j}(t)\rangle)\,\mathrm{d}t-\mathrm{d}N_{j}(t)\Big{)}\Big{|}\leq c_{2}\sqrt{\frac{\tau\log(p\vee T)}{T}}.

A.1 Proof of Theorem 3

We will show the restricted strong convexity for the loss function.

Proof. With a slight abuse of notation, we let \bm{x}_{j}(t) denote the centered influence process \bm{x}_{j}(t)-\mathbb{E}[\bm{x}_{j}(t)] and \mu^{*}_{j} denote \mu^{*}_{j}+\sum_{k\in S}\beta_{j,k}^{*}\mathbb{E}[x_{j,k}(t)].

For \bm{\Delta}\in\mathbb{R}^{p}, define \delta L_{T}(\delta_{\mu},\mu^{*}_{j},\bm{\Delta},\bm{\beta}^{*}_{j})=L_{T}(\mu^{*}_{j}+\delta_{\mu},\bm{\beta}_{j}^{*}+\bm{\Delta})-L_{T}(\mu_{j}^{*},\bm{\beta}_{j}^{*})-\langle\nabla_{\bm{\beta}}L_{T}(\mu_{j}^{*},\bm{\beta}_{j}^{*}),\bm{\Delta}\rangle-\delta_{\mu}\frac{\partial}{\partial\mu}L_{T}(\mu,\bm{\beta}_{j}^{*})\Big{|}_{\mu=\mu^{*}_{j}}. By the Taylor expansion, there exists some v\in[0,1] such that

\delta L_{T}(\delta_{\mu},\mu^{*}_{j},\bm{\Delta},\bm{\beta}^{*}_{j})=\frac{1}{T}\int_{0}^{T}W_{j}(t)\,h_{j}^{\prime}\Big{(}\mu^{*}_{j}+\langle\bm{\beta}_{j}^{*},\bm{x}_{j}(t)\rangle+v\big{(}\delta_{\mu}+\langle\bm{\Delta},\bm{x}_{j}(t)\rangle\big{)}\Big{)}(\delta_{\mu}+\langle\bm{\Delta},\bm{x}_{j}(t)\rangle)^{2}\,\mathrm{d}t.

Let \gamma be a positive number to be determined later with 0<\gamma\leq R, where R is the constant in Assumption 2. Define the functions \phi_{\gamma}(u)=u^{2}\mathbb{I}(|u|\leq\gamma/2)+(\gamma-|u|)^{2}\mathbb{I}(\gamma/2\leq|u|\leq\gamma) and \zeta_{\gamma}(u)=u\mathbb{I}(|u|\leq\gamma).

Note that

(\mu_{j}^{*}+\langle\bm{\beta}^{*}_{j},\bm{x}_{j}(t)\rangle)\,\mathbb{I}(|\mu_{j}^{*}+\langle\bm{\beta}_{j}^{*},\bm{x}_{j}(t)\rangle|\leq R)=\zeta_{R}(\mu_{j}^{*}+\langle\bm{\beta}_{j}^{*},\bm{x}_{j}(t)\rangle),
(\delta_{\mu}+\langle\bm{\Delta},\bm{x}_{j}(t)\rangle)\,\mathbb{I}(|\delta_{\mu}+\langle\bm{\Delta},\bm{x}_{j}(t)\rangle|\leq\gamma)=\zeta_{\gamma}(\delta_{\mu}+\langle\bm{\Delta},\bm{x}_{j}(t)\rangle),
(\delta_{\mu}+\langle\bm{\Delta},\bm{x}_{j}(t)\rangle)^{2}\,\mathbb{I}(|\delta_{\mu}+\langle\bm{\Delta},\bm{x}_{j}(t)\rangle|\leq\gamma)\geq\phi_{\gamma}(\delta_{\mu}+\langle\bm{\Delta},\bm{x}_{j}(t)\rangle).

Then

\delta L_{T}(\delta_{\mu},\mu^{*}_{j},\bm{\Delta},\bm{\beta}^{*}_{j})\geq\frac{1}{T}\int_{0}^{T}W_{j}(t)h_{j}^{\prime}\Big{(}\mu^{*}_{j}+\langle\bm{\beta}_{j}^{*},\bm{x}_{j}(t)\rangle+v\big{(}\delta_{\mu}+\langle\bm{\Delta},\bm{x}_{j}(t)\rangle\big{)}\Big{)}\cdot\mathbb{I}(|\mu^{*}_{j}+\langle\bm{\beta}^{*}_{j},\bm{x}_{j}(t)\rangle|\leq R)\cdot(\delta_{\mu}+\langle\bm{\Delta},\bm{x}_{j}(t)\rangle)^{2}\cdot\mathbb{I}(|\delta_{\mu}+\langle\bm{\Delta},\bm{x}_{j}(t)\rangle|\leq\gamma)\,\text{d}t

\geq\frac{1}{T}\int_{0}^{T}W_{j}(t)h_{j}^{\prime}\Big{(}\mu^{*}_{j}+\langle\bm{\beta}_{j}^{*},\bm{x}_{j}(t)\rangle+v\big{(}\delta_{\mu}+\langle\bm{\Delta},\bm{x}_{j}(t)\rangle\big{)}\Big{)}\cdot\mathbb{I}(|\mu^{*}_{j}+\langle\bm{\beta}^{*}_{j},\bm{x}_{j}(t)\rangle|\leq R)\cdot\phi_{\gamma}(\delta_{\mu}+\langle\bm{\Delta},\bm{x}_{j}(t)\rangle)\,\text{d}t

\geq C_{h_{j}}(R)\frac{1}{T}\int_{0}^{T}\phi_{\gamma}(\delta_{\mu}+\langle\bm{\Delta},\bm{x}_{j}(t)\rangle)\cdot\mathbb{I}(|\mu_{j}^{*}+\langle\bm{\beta}_{j}^{*},\bm{x}_{j}(t)\rangle|\leq R)\,\text{d}t,

where $C_{h_{j}}(R)=\sigma_{3}\inf_{|x|\leq 2R}h_{j}^{\prime}(x)$ is a positive constant by Assumption 2. Hence it suffices to bound from below

\frac{1}{T}\int_{0}^{T}\phi_{\gamma}(\delta_{\mu}+\langle\bm{\Delta},\bm{x}_{j}(t)\rangle)\cdot\mathbb{I}(|\mu_{j}^{*}+\langle\bm{\beta}_{j}^{*},\bm{x}_{j}(t)\rangle|\leq R)\,\text{d}t.

Let $\gamma(r)=Rr$ and note that $\phi_{c}(cz)=c^{2}\phi_{1}(z)$ for any $c>0$ and $z\in\mathbb{R}$. Then for $\|(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})\|_{2}=r$, $\phi_{\gamma(r)}(\delta_{\mu}+\langle\bm{\Delta},\bm{x}_{j}(t)\rangle)=r^{2}\phi_{R}(\delta_{\mu}/r+\langle\bm{\Delta}/r,\bm{x}_{j}(t)\rangle)$, and

\frac{1}{T}\int_{0}^{T}\phi_{\gamma(r)}(\delta_{\mu}+\langle\bm{\Delta},\bm{x}_{j}(t)\rangle)\cdot\mathbb{I}(|\mu_{j}^{*}+\langle\bm{\beta}_{j}^{*},\bm{x}_{j}(t)\rangle|\leq R)\,\text{d}t=\frac{r^{2}}{T}\int_{0}^{T}\phi_{R}(\delta_{\mu}/r+\langle\bm{\Delta}/r,\bm{x}_{j}(t)\rangle)\cdot\mathbb{I}(|\mu_{j}^{*}+\langle\bm{\beta}_{j}^{*},\bm{x}_{j}(t)\rangle|\leq R)\,\text{d}t,

which implies that it suffices to show that there exist strictly positive constants $\kappa_{1}$ and $\kappa_{2}$, depending only on $\Sigma_{j}$ and $h_{j}(\cdot)$, such that

\frac{1}{T}\int_{0}^{T}\phi_{R}(\delta_{\mu}/r+\langle\bm{\Delta}/r,\bm{x}_{j}(t)\rangle)\cdot\mathbb{I}(|\mu_{j}^{*}+\langle\bm{\beta}_{j}^{*},\bm{x}_{j}(t)\rangle|\leq R)\,\text{d}t\geq\kappa_{1}\Big{\{}1-\kappa_{2}(\tau+1)^{2}\varepsilon(T,p)\cdot\frac{\|(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})\|_{1}^{2}}{\|(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})\|_{2}^{2}}\Big{\}}.

Hence we only need to prove the inequality when $\|(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})\|_{2}=1$.

Denote by $\mathbb{S}_{2}(1)$ the unit sphere and let $\mathbb{S}_{1}(\varpi)=\{(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})\in\mathbb{R}^{p+1}:\|(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})\|_{1}=\varpi\}$. Let $f_{(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})}(\bm{x}_{j})=\frac{1}{T}\int_{0}^{T}\phi_{R}(\delta_{\mu}+\langle\bm{\Delta},\bm{x}_{j}(t)\rangle)\cdot\mathbb{I}(|\mu_{j}^{*}+\langle\bm{\beta}_{j}^{*},\bm{x}_{j}(t)\rangle|\leq R)\,\text{d}t$. For each $\varpi>0$, it suffices to show that the event

\mathcal{E}_{\varpi}=\bigg{\{}f_{(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})}(\bm{x}_{j})<\kappa_{1}\Big{(}1-\kappa_{2}(\tau+1)^{2}\varepsilon(T,p)\varpi^{2}\Big{)},\ \text{for some}\ (\delta_{\mu},\bm{\Delta}^{\mathrm{T}})\in\mathbb{S}_{1}(\varpi)\cap\mathbb{S}_{2}(1)\bigg{\}}

occurs with small probability. Note that $\mathbb{S}_{1}(\varpi)\cap\mathbb{S}_{2}(1)\neq\emptyset$ only if $1\leq\varpi\leq\sqrt{p+1}$.

Recalling Assumption 2(2), we have

\mathbb{E}[f_{(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})}(\bm{x}_{j})]>\sigma_{1}>0

for some constant $\sigma_{1}$.

By Lemma 6, we have for any $\varpi^{2}\in\{1,\ldots,p+1\}$,

\mathbb{P}\Big{(}\sup_{(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})\in\mathbb{S}_{2}(1)\cap\mathbb{S}_{1}(\varpi)}\Big{|}\frac{1}{T}\int_{0}^{T}|\delta_{\mu}+\langle\bm{\Delta},\bm{x}_{j}(t)\rangle|^{2}\text{d}t-\mathbb{E}|\delta_{\mu}+\langle\bm{\Delta},\bm{x}_{j}(t)\rangle|^{2}\Big{|}\geq c_{4}\tau^{2}\varpi^{2}\varepsilon(T,p)\Big{)}\leq\frac{c_{5}}{(p\vee T)^{\tau}},

where

\varepsilon(T,p)=\sqrt{\frac{\log(p\vee T)}{T}}.

Since $\phi_{R}(\cdot)$ is an $R$-Lipschitz function, we have

\mathbb{P}\Big{(}\sup_{(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})\in\mathbb{S}_{2}(1)\cap\mathbb{S}_{1}(\varpi)}\Big{|}f_{(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})}(\bm{x}_{j})-\mathbb{E}[f_{(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})}(\bm{x}_{j})]\Big{|}\geq c_{4}\tau^{2}R\varpi^{2}\varepsilon(T,p)\Big{)}\leq\frac{c_{5}}{(p\vee T)^{\tau}}.

Then

f_{(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})}(\bm{x}_{j})\geq\mathbb{E}[f_{(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})}(\bm{x}_{j})]-\sup_{(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})\in\mathbb{S}_{2}(1)\cap\mathbb{S}_{1}(\varpi)}\Big{|}f_{(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})}(\bm{x}_{j})-\mathbb{E}[f_{(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})}(\bm{x}_{j})]\Big{|}\geq\sigma_{1}-c_{4}\tau^{2}R\varpi^{2}\varepsilon(T,p)=\sigma_{1}\Big{(}1-\frac{c_{4}\tau^{2}R\varpi^{2}}{\sigma_{1}}\varepsilon(T,p)\Big{)}

holds with probability greater than $1-\frac{c_{5}}{(p\vee T)^{\tau}}$.

The last step is a "peeling" argument. For any $\varpi^{2}\in\{1,2,\cdots,p+1\}$, we have

\mathbb{P}\Big{\{}f_{(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})}(\bm{x}_{j})<\sigma_{1}\Big{(}1-\frac{c_{4}\tau^{2}R\varpi^{2}}{\sigma_{1}}\varepsilon(T,p)\Big{)},\ \text{for some}\ (\delta_{\mu},\bm{\Delta}^{\mathrm{T}})\in\mathbb{S}_{1}(\varpi)\cap\mathbb{S}_{2}(1)\Big{\}}\leq\frac{c_{5}}{(p\vee T)^{\tau}}.

Summing over the $p+1$ shells and using $p+1\leq 2(p\vee T)$, we then have

\mathbb{P}\bigg{\{}f_{(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})}(\bm{x}_{j})<\frac{\sigma_{1}}{2}\Big{(}1-\frac{c_{4}\tau^{2}R}{\sigma_{1}}\varepsilon(T,p)\Big{)},\ \text{for some}\ (\delta_{\mu},\bm{\Delta}^{\mathrm{T}})\in\mathbb{S}_{2}(1)\bigg{\}}\leq\sum_{i=1}^{p+1}\mathbb{P}\bigg{\{}f_{(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})}(\bm{x}_{j})<\sigma_{1}\Big{(}1-\frac{c_{4}\tau^{2}R}{\sigma_{1}}\varepsilon(T,p)\cdot i\Big{)},\ \text{for some}\ (\delta_{\mu},\bm{\Delta}^{\mathrm{T}})\in\mathbb{S}_{1}(\sqrt{i})\cap\mathbb{S}_{2}(1)\bigg{\}}\leq\frac{c_{6}}{(p\vee T)^{\tau-1}}.

We conclude that, choosing $\kappa_{1}=\frac{\sigma_{1}}{2}$ and $\kappa_{2}=\frac{c_{4}R}{\sigma_{1}}$ with the constant $R$ from Assumption 2, the bound

\delta L_{T}(\delta_{\mu},\mu^{*}_{j},\bm{\Delta},\bm{\beta}^{*}_{j})\geq\kappa_{1}\|(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})\|_{2}\Big{\{}\|(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})\|_{2}-\kappa_{2}(\tau+1)^{2}\varepsilon(T,p)\|(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})\|_{1}^{2}\Big{\}}

holds uniformly for all $\|(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})\|_{2}\leq 1$ with probability greater than $1-\frac{c_{6}}{(p\vee T)^{\tau-1}}$. $\Box$

A.2 Proof of Theorem 4

Let $\mathcal{F}(\delta_{\mu},\bm{\Delta})=L_{T}(\mu_{j}^{*}+\delta_{\mu},\bm{\beta}_{j}^{*}+\bm{\Delta})-L_{T}(\mu_{j}^{*},\bm{\beta}_{j}^{*})+\lambda_{T}(\|(\mu_{j}^{*}+\delta_{\mu},(\bm{\beta}_{j}^{*}+\bm{\Delta})^{\mathrm{T}})\|_{1}-\|(\mu_{j}^{*},\bm{\beta}_{j}^{*\mathrm{T}})\|_{1})$ denote the change in the penalized loss. Let $(\hat{\mu}_{j},\hat{\bm{\beta}}_{j})$ be the optimal solution to the penalized loss minimization and define $\hat{\delta}_{\mu}=\hat{\mu}_{j}-\mu_{j}^{*}$, $\hat{\bm{\Delta}}=\hat{\bm{\beta}}_{j}-\bm{\beta}_{j}^{*}$. We first establish some properties of the global optimizer $(\hat{\mu}_{j},\hat{\bm{\beta}}_{j})$.

Notice that $L_{T}(\cdot,\cdot)$ is a convex function. Then

L_{T}(\mu_{j}^{*}+\hat{\delta}_{\mu},\bm{\beta}_{j}^{*}+\hat{\bm{\Delta}})-L_{T}(\mu_{j}^{*},\bm{\beta}_{j}^{*})\geq\langle\nabla_{\bm{\beta}}L_{T}(\mu_{j}^{*},\bm{\beta}_{j}^{*}),\hat{\bm{\Delta}}\rangle+\nabla_{\mu}L_{T}(\mu_{j}^{*},\bm{\beta}_{j}^{*})\cdot\hat{\delta}_{\mu}\geq-\frac{\lambda_{T}}{2}\|(\hat{\delta}_{\mu},\hat{\bm{\Delta}}^{\mathrm{T}})\|_{1}.

Combining this with $\mathcal{F}(\hat{\delta}_{\mu},\hat{\bm{\Delta}})\leq 0$ gives

-\frac{\lambda_{T}}{2}\|(\hat{\delta}_{\mu},\hat{\bm{\Delta}}^{\mathrm{T}})\|_{1}\leq\lambda_{T}\Big{(}\|(\mu_{j}^{*},\bm{\beta}_{j}^{*\mathrm{T}})\|_{1}-\|(\mu_{j}^{*}+\hat{\delta}_{\mu},(\bm{\beta}_{j}^{*}+\hat{\bm{\Delta}})^{\mathrm{T}})\|_{1}\Big{)}.

Owing to

\|(\mu_{j}^{*},\bm{\beta}_{j}^{*\mathrm{T}})\|_{1}=\|(\mu_{j}^{*},\bm{\beta}^{*\mathrm{T}}_{j,S})\|_{1},

\|(\hat{\delta}_{\mu},\hat{\bm{\Delta}}^{\mathrm{T}})\|_{1}=\|(\hat{\delta}_{\mu},\hat{\bm{\Delta}}^{\mathrm{T}}_{S})\|_{1}+\|(0,\hat{\bm{\Delta}}^{\mathrm{T}}_{S^{C}})\|_{1},

\|(\mu_{j}^{*}+\hat{\delta}_{\mu},(\bm{\beta}_{j}^{*}+\hat{\bm{\Delta}})^{\mathrm{T}})\|_{1}=\|(\mu_{j}^{*}+\hat{\delta}_{\mu},(\bm{\beta}_{j,S}^{*}+\hat{\bm{\Delta}}_{S})^{\mathrm{T}})\|_{1}+\|(0,(\bm{\beta}_{j,S^{C}}^{*}+\hat{\bm{\Delta}}_{S^{C}})^{\mathrm{T}})\|_{1},

\|(\hat{\delta}_{\mu},\hat{\bm{\Delta}}_{S}^{\mathrm{T}})\|_{1}\geq\|(\mu_{j}^{*},\bm{\beta}^{*\mathrm{T}}_{j,S})\|_{1}-\|(\mu_{j}^{*}+\hat{\delta}_{\mu},(\bm{\beta}_{j}^{*}+\hat{\bm{\Delta}})^{\mathrm{T}}_{S})\|_{1},

where the subscript $S$ denotes that the vector is restricted to the set $S$, we have

\|\hat{\bm{\Delta}}_{S^{C}}\|_{1}\leq 3\|(\hat{\delta}_{\mu},\hat{\bm{\Delta}}_{S}^{\mathrm{T}})\|_{1}\ \ \text{and}\ \ \|(\hat{\delta}_{\mu},\hat{\bm{\Delta}}^{\mathrm{T}})\|_{1}\leq 4\|(\hat{\delta}_{\mu},\hat{\bm{\Delta}}_{S}^{\mathrm{T}})\|_{1}\leq 4\sqrt{s}\|(\hat{\delta}_{\mu},\hat{\bm{\Delta}}_{S}^{\mathrm{T}})\|_{2}.

Consider the set $K(r,S)=\{(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})\in\mathbb{R}^{p+1}:\|(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})\|_{2}=r,\ \|(0,\bm{\Delta}^{\mathrm{T}}_{S^{C}})\|_{1}\leq 4\|(\delta_{\mu},\bm{\Delta}_{S}^{\mathrm{T}})\|_{1}\}$ for $0<r<1$. With the restricted strong convexity provided by Theorem 3, we have for any $(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})\in K(r,S)$,

L_{T}(\mu_{j}^{*}+\delta_{\mu},\bm{\beta}_{j}^{*}+\bm{\Delta})-L_{T}(\mu_{j}^{*},\bm{\beta}_{j}^{*})-\langle\nabla_{\bm{\beta}}L_{T}(\mu_{j}^{*},\bm{\beta}_{j}^{*}),\bm{\Delta}\rangle-\nabla_{\mu}L_{T}(\mu_{j}^{*},\bm{\beta}_{j}^{*})\delta_{\mu}\geq\kappa_{1}\|(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})\|_{2}\Big{\{}\|(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})\|_{2}-\kappa_{2}\tau^{2}\varepsilon(T,p)\|(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})\|_{1}^{2}\Big{\}}\geq\kappa_{1}\|(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})\|_{2}\Big{\{}\|(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})\|_{2}-s\kappa_{2}\tau^{2}\varepsilon(T,p)\|(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})\|_{2}^{2}\Big{\}}

holds with probability greater than $1-\frac{c_{6}}{(p\vee T)^{\tau}}$, where $\varepsilon(T,p)=\sqrt{\frac{\log(p\vee T)}{T}}$. Further, conditioning on the event

G_{j}=\Big{\{}\lambda_{T}\geq 2h_{\max}\Big{|}\frac{1}{T}\int_{0}^{T}W_{j}(t)\Big{(}h_{j}(\mu_{j}^{*}+\langle\bm{\beta}^{*}_{j},\bm{x}_{j}(t)\rangle)\mathrm{d}t-\mathrm{d}N_{j}(t)\Big{)}\Big{|}\ \text{and}\ \lambda_{T}\geq 2h_{\max}\Big{|}\frac{1}{T}\int_{0}^{T}W_{j}(t)\Big{(}h_{j}(\mu_{j}^{*}+\langle\bm{\beta}^{*}_{j},\bm{x}_{j}(t)\rangle)x_{j,k}(t)\mathrm{d}t-x_{j,k}(t)\mathrm{d}N_{j}(t)\Big{)}\Big{|},\ \forall k\in[p]\Big{\}},

and supposing that $\|(\hat{\delta}_{\mu},\hat{\bm{\Delta}}^{\mathrm{T}})\|_{2}=r<1$, we can derive that

\mathcal{F}(\hat{\delta}_{\mu},\hat{\bm{\Delta}})=L_{T}(\mu_{j}^{*}+\hat{\delta}_{\mu},\bm{\beta}_{j}^{*}+\hat{\bm{\Delta}})-L_{T}(\mu_{j}^{*},\bm{\beta}_{j}^{*})+\lambda_{T}\Big{(}\|(\mu_{j}^{*}+\hat{\delta}_{\mu},(\bm{\beta}_{j}^{*}+\hat{\bm{\Delta}})_{S}^{\mathrm{T}})\|_{1}+\|(0,\hat{\bm{\Delta}}^{\mathrm{T}}_{S^{C}})\|_{1}-\|(\mu_{j}^{*},\bm{\beta}_{j}^{*\mathrm{T}})\|_{1}\Big{)}

\geq-\frac{\lambda_{T}}{2}\|(\hat{\delta}_{\mu},\hat{\bm{\Delta}}^{\mathrm{T}})\|_{1}+\kappa_{1}\|(\hat{\delta}_{\mu},\hat{\bm{\Delta}}^{\mathrm{T}})\|_{2}\Big{\{}\|(\hat{\delta}_{\mu},\hat{\bm{\Delta}}^{\mathrm{T}})\|_{2}-s\kappa_{2}\tau^{2}\varepsilon(T,p)\|(\hat{\delta}_{\mu},\hat{\bm{\Delta}}^{\mathrm{T}})\|_{2}^{2}\Big{\}}+\lambda_{T}\big{(}\|(0,\hat{\bm{\Delta}}_{S^{C}}^{\mathrm{T}})\|_{1}-\|(\hat{\delta}_{\mu},\hat{\bm{\Delta}}_{S}^{\mathrm{T}})\|_{1}\big{)}

\geq\kappa_{1}\|(\hat{\delta}_{\mu},\hat{\bm{\Delta}}^{\mathrm{T}})\|_{2}\Big{\{}\|(\hat{\delta}_{\mu},\hat{\bm{\Delta}}^{\mathrm{T}})\|_{2}-s\kappa_{2}\tau^{2}\varepsilon(T,p)\|(\hat{\delta}_{\mu},\hat{\bm{\Delta}}^{\mathrm{T}})\|_{2}^{2}\Big{\}}-\frac{3\sqrt{s}\lambda_{T}}{2}\|(\hat{\delta}_{\mu},\hat{\bm{\Delta}}^{\mathrm{T}})\|_{2}

\geq\Big{(}\frac{\kappa_{1}}{2}\|(\hat{\delta}_{\mu},\hat{\bm{\Delta}}^{\mathrm{T}})\|_{2}-3\sqrt{s}\lambda_{T}\Big{)}\|(\hat{\delta}_{\mu},\hat{\bm{\Delta}}^{\mathrm{T}})\|_{2},

where the last inequality comes from $(\hat{\delta}_{\mu},\hat{\bm{\Delta}}^{\mathrm{T}})\in K(r,S)$, the fact that $s\kappa_{2}\tau^{2}\varepsilon(T,p)<\frac{1}{2}$ for $T$ large enough, and the choice of $\lambda_{T}$. To derive the final result, we need to show that $\|(\hat{\delta}_{\mu},\hat{\bm{\Delta}}^{\mathrm{T}})\|_{2}\leq r<1$. Consider the choice $r=\frac{6\sqrt{s}\lambda_{T}}{\kappa_{1}}<1$. In this case, $\mathcal{F}(\delta_{\mu},\bm{\Delta})>0$ for every $(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})\in K(r,S)$. If $\|(\hat{\delta}_{\mu},\hat{\bm{\Delta}}^{\mathrm{T}})\|_{2}>r$, then there exists a vector $(a\hat{\delta}_{\mu},a\hat{\bm{\Delta}}^{\mathrm{T}})$ with $0<a<1$ and $\|(a\hat{\delta}_{\mu},a\hat{\bm{\Delta}}^{\mathrm{T}})\|_{2}=r$. As $\mathcal{F}$ is convex,

\mathcal{F}(a\hat{\delta}_{\mu},a\hat{\bm{\Delta}})\leq a\mathcal{F}(\hat{\delta}_{\mu},\hat{\bm{\Delta}})+(1-a)\mathcal{F}(0,\bm{0})=a\mathcal{F}(\hat{\delta}_{\mu},\hat{\bm{\Delta}})\leq 0.

Note that $(a\hat{\delta}_{\mu},a\hat{\bm{\Delta}}^{\mathrm{T}})\in K(r,S)$, which contradicts $\mathcal{F}>0$ on $K(r,S)$. We conclude that $\|(\hat{\delta}_{\mu},\hat{\bm{\Delta}}^{\mathrm{T}})\|_{2}\leq\frac{6\sqrt{s}\lambda_{T}}{\kappa_{1}}$. $\Box$
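Although the argument above is deterministic given the events involved, it may help to see the estimator it analyzes in computational form. The following is a minimal proximal-gradient (ISTA) sketch for an $\ell_{1}$-penalized, discretized point-process likelihood; the softplus link, the time grid, the synthetic covariates, and all tuning constants are our own illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softplus(x):
    # h(x) = log(1 + e^x): a smooth, positive, log-concave link (assumed).
    return np.logaddexp(0.0, x)

def sigmoid(x):
    # h'(x) for the softplus link.
    return 1.0 / (1.0 + np.exp(-np.clip(x, -30.0, 30.0)))

def soft_threshold(v, t):
    # Proximal operator of t * ||.||_1.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

# Synthetic discretized data: n time bins of width dt, p covariate
# processes standing in for x_j(t), and event counts dN per bin.
n, p, dt = 2000, 20, 0.05
X = rng.poisson(0.3, size=(n, p)).astype(float)
theta_true = np.zeros(p + 1)
theta_true[0], theta_true[1:4] = -1.0, 0.6
dN = rng.poisson(softplus(theta_true[0] + X @ theta_true[1:]) * dt)

def grad(theta):
    # Gradient of the discretized negative log-likelihood
    # (1/T) * sum_t [ h(eta_t) * dt - log h(eta_t) * dN_t ],  T = n * dt.
    eta = theta[0] + X @ theta[1:]
    w = sigmoid(eta) * dt - sigmoid(eta) / softplus(eta) * dN
    return np.concatenate(([w.sum()], X.T @ w)) / (n * dt)

lam, step, theta = 0.05, 0.5, np.zeros(p + 1)
for _ in range(500):  # ISTA: gradient step followed by soft-thresholding
    theta = soft_threshold(theta - step * grad(theta), step * lam)

print("estimated support:", np.nonzero(theta[1:])[0])
```

Because the loss is convex, any of the scalable solvers mentioned in the introduction (ADMM, coordinate descent) could replace ISTA here; the sketch only illustrates the penalized minimization whose solution the theorem bounds.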

A.3 Proof of Lemma 1

Recall the well-known Bernstein-type inequality for point processes, which can be found in Shorack and Wellner (2009):

\mathbb{P}\Big{(}|M_{T}|\geq\sqrt{2vx}+\frac{Bx}{3}\ \text{and}\ \int_{0}^{T}g^{2}(t)\text{d}\Lambda(t)\leq v\ \text{and}\ \sup_{t\in[0,T]}|g(t)|\leq B\Big{)}\leq 2e^{-x}, (16)

where $g(t)$ is a predictable process, $\Lambda(t)$ is the compensator of the point process $N(t)$, and $M_{T}=\int_{0}^{T}g(t)(\text{d}N(t)-\text{d}\Lambda(t))$.
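As a quick numerical sanity check of (16) in the simplest case (our choice: $g\equiv 1$ and $N$ a homogeneous Poisson process, so that $\Lambda(t)=\lambda t$, $v=\lambda T$, and $B=1$), the following sketch estimates the two-sided tail by Monte Carlo and compares it with $2e^{-x}$.

```python
import numpy as np

rng = np.random.default_rng(1)
lam, T, x, reps = 2.0, 50.0, 3.0, 200_000

# With g(t) = 1 and N a homogeneous Poisson(lam) process, the martingale is
# M_T = N(T) - lam * T, and one may take v = lam * T and B = 1 in (16).
v, B = lam * T, 1.0
threshold = np.sqrt(2.0 * v * x) + B * x / 3.0

N_T = rng.poisson(lam * T, size=reps)
empirical = np.mean(np.abs(N_T - lam * T) >= threshold)
print(f"empirical tail {empirical:.5f} <= 2*exp(-x) = {2.0 * np.exp(-x):.5f}")
```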

Conditioning on the event $C_{\tau}$ defined in Lemma 3, define the predictable process $g_{j,k}(t)=W_{j}(t)x_{j,k}(t)$, let $\Lambda(t)=\Lambda_{j}(t)=\int_{0}^{t}h_{j}(\mu^{*}_{j}+\langle\bm{\beta}^{*}_{j},\bm{x}_{j}(s)\rangle)\text{d}s$, and set $M_{k,T}=\int_{0}^{T}g_{j,k}(t)(\text{d}N_{j}(t)-\text{d}\Lambda_{j}(t))$; then $\int_{0}^{T}g_{j,k}^{2}(t)\text{d}\Lambda_{j}(t)\leq\sigma^{2}_{4}\int_{0}^{T}x^{2}_{j,k}(t)\text{d}\Lambda_{j}(t)$. With Lemma 4, we have

\int_{0}^{T}g_{j,k}^{2}(t)\text{d}\Lambda_{j}(t)\leq h_{\max}\sigma^{2}_{4}\cdot\Big{(}\Big{|}\int_{0}^{T}x^{2}_{j,k}(t)\text{d}t-T\mathbb{E}[x_{j,k}^{2}(t)]\Big{|}+T\mathbb{E}[x_{j,k}^{2}(t)]\Big{)}\leq c_{2}T(\mathbb{E}[x^{2}_{j,k}(t)]+1)

holds simultaneously for all $k\in[p]$ with probability greater than $1-\frac{c_{1}}{(p\vee T)^{\tau}}$. On the event $C_{\tau}$ in Lemma 3, we have $\sup_{k,t\in[0,T]}|g_{j,k}(t)|\lesssim\sup_{k,t\in[0,T]}|x_{j,k}(t)|\leq(\tau+2)\kappa_{\max}\log(p\vee T)$.

Notice that

|M_{k,T}|=\Big{|}\int_{0}^{T}g_{j,k}(t)\big{(}\text{d}N_{j}(t)-\text{d}\Lambda_{j}(t)\big{)}\Big{|}.

Taking $v=c_{2}T(\mathbb{E}[x^{2}_{j,k}(t)]+1)$ and $B=(\tau+2)\kappa_{\max}\log(p\vee T)$ in (16), we have

\mathbb{P}\Big{(}|M_{k,T}|\geq\sqrt{c_{2}T(\mathbb{E}[x^{2}_{j,k}(t)]+1)x}+\frac{(\tau+2)\kappa_{\max}\log(p\vee T)}{3}x\Big{)}\leq 2e^{-x}

and

\mathbb{P}\Big{(}|M_{k,T}|\geq\sqrt{c_{2}T(\mathbb{E}[x^{2}_{j,k}(t)]+1)x}+\frac{(\tau+2)\kappa_{\max}\log(p\vee T)}{3}x,\ \text{for some}\ k\in[p]\Big{)}\leq 2pe^{-x}.

Taking $x=(\tau+2)\log(p\vee T)$ and noticing that $\sqrt{c_{2}T(\mathbb{E}[x^{2}_{j,k}(t)]+1)x}$ dominates $\frac{(\tau+2)\kappa_{\max}\log(p\vee T)}{3}x$, we obtain that

\Big{|}\frac{1}{T}\int_{0}^{T}W_{j}(t)\Big{(}h_{j}(\mu_{j}^{*}+\langle\bm{\beta}^{*}_{j},\bm{x}_{j}(t)\rangle)x_{j,k}(t)\text{d}t-x_{j,k}(t)\text{d}N_{j}(t)\Big{)}\Big{|}\leq c_{2}\sqrt{\frac{\tau\log(p\vee T)}{T}}

holds for all $k\in[p]$ with probability greater than $1-\frac{c_{2}}{(p\vee T)^{\tau}}$, where $c_{1},c_{2}$ are constants that do not depend on $p$ or $T$. The second inequality can be proved similarly. $\Box$

A.4 Ancillary Results

Lemma 2

Let $N_{j}(T)$ be the number of events on node $j$ in the time interval $[0,T]$. There exists a constant $C$ that depends only on $h_{\max}$ such that, with probability greater than $1-o(e^{-CT})$, we have

N_{j}(T)\leq Ceh_{\max}T,\ \forall j=1,\cdots,p.

proof: By Poisson thinning, we can couple $N_{j}(T)$ with a Poisson random variable $\tilde{N}_{j}(T)$ of mean $h_{\max}T$ such that

N_{j}(T)\leq\tilde{N}_{j}(T),

and for $C\geq 2$,

\mathbb{P}(\tilde{N}_{j}(T)>Ceh_{\max}T\ \text{for some}\ j\in[p])\leq p\Big{(}\sum_{k=[Ceh_{\max}T]+1}^{\infty}\frac{(Th_{\max})^{k}}{k!}e^{-Th_{\max}}\Big{)}\leq\frac{2p(Th_{\max})^{CeTh_{\max}}}{([Ceh_{\max}T]+1)!}e^{-Th_{\max}}\lesssim\frac{p(Th_{\max})^{CeTh_{\max}}}{(CeTh_{\max})^{CeTh_{\max}}e^{-CeTh_{\max}}\sqrt{CT}}e^{-Th_{\max}}=\frac{1}{\sqrt{CT}}\exp\Big{\{}-(Ce\log C-C+1)Th_{\max}+\log p\Big{\}},

where the third inequality uses Stirling's formula. Because $\log p=o(T)$, we can pick $C$ depending only on $h_{\max}$ such that, with probability greater than $1-o(e^{-CT})$, we have

N_{j}(T)\leq Ceh_{\max}T,\ \forall j=1,\cdots,p.

\Box

Define the event $B=\{N_{j}(T)\leq Ceh_{\max}T,\ \forall j=1,\cdots,p\}$; then $\mathbb{P}(B)=1-o(e^{-CT})$.
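A quick numerical check of this Poisson tail bound (ours; it assumes scipy is available and uses $C=2$, $h_{\max}=1.5$ as illustrative values):

```python
import numpy as np
from scipy.stats import poisson  # assumed available

h_max, C = 1.5, 2.0
for T in (10.0, 50.0, 200.0):
    mu = h_max * T                       # mean of the dominating Poisson variable
    k = np.floor(C * np.e * h_max * T)   # threshold C*e*h_max*T from Lemma 2
    print(f"T = {T:5.0f}:  P(N > Ce*h_max*T) = {poisson.sf(k, mu):.3e}")
```

The printed tails decay geometrically in $T$, consistent with the exponent $(Ce\log C-C+1)Th_{\max}$ in the proof.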

Lemma 3

Under Assumption 1, we have

\mathbb{P}\Big{(}\sup_{k\in[p],t\in[0,T]}|x_{j,k}(t)|>\kappa_{\max}(\tau+2)\log(p\vee T)\Big{)}\lesssim\frac{Tp}{\sqrt{\log(p\vee T)}}\exp\Big{(}-(\tau+2)\log(p\vee T)\Big{)}=o\Big{(}\frac{1}{(p\vee T)^{\tau}}\Big{)} (17)

holds for $\tau\geq 1$.

proof:

|x_{j,k}(t)|=\Big{|}\int_{(t-\kappa_{\operatorname{supp}})\vee 0}^{t}\kappa_{j,k}(t-s)\text{d}N_{k}(s)\Big{|}\leq\kappa_{\max}\int_{t-\kappa_{\operatorname{supp}}}^{t}\text{d}N_{k}(s)\leq\kappa_{\max}\Big{(}N_{k}\big{(}T\wedge(v+2)\kappa_{\operatorname{supp}}\big{)}-N_{k}(v\kappa_{\operatorname{supp}})\Big{)},

where $v$ is the integer with $v\kappa_{\operatorname{supp}}\leq t-\kappa_{\operatorname{supp}}<t\leq T\wedge(v+2)\kappa_{\operatorname{supp}}$. Thus

\mathbb{P}\Big{(}\sup_{k,t\in[0,T]}|x_{j,k}(t)|>\eta\Big{)}\leq\mathbb{P}\Big{(}\max_{k,v}[N_{k}(T\wedge(v+2)\kappa_{\operatorname{supp}})-N_{k}(v\kappa_{\operatorname{supp}})]>\frac{\eta}{\kappa_{\max}}\Big{)}\lesssim Tp\frac{(h_{\max}\kappa_{\operatorname{supp}})^{\frac{\eta}{\kappa_{\max}}}}{(\frac{\eta}{\kappa_{\max}})!}e^{-h_{\max}\kappa_{\operatorname{supp}}}\lesssim\frac{Tp}{\sqrt{\eta}}\exp\Big{(}\frac{\eta}{\kappa_{\max}}\big{(}1+\log(h_{\max}\kappa_{\operatorname{supp}})-\log\frac{\eta}{\kappa_{\max}}\big{)}\Big{)}.

The desired inequality follows by taking $\eta=\kappa_{\max}(\tau+2)\log(p\vee T)$ for $T,p$ both large enough:

\mathbb{P}\Big{(}\sup_{k,t\in[0,T]}|x_{j,k}(t)|>\kappa_{\max}(\tau+2)\log(p\vee T)\Big{)}\lesssim\frac{Tp}{\sqrt{\log(p\vee T)}}\exp\Big{(}-(\tau+2)\log(p\vee T)\Big{)}=o\Big{(}\frac{1}{(p\vee T)^{\tau}}\Big{)}.

A similar argument shows that

\mathbb{P}\Big{(}\sup_{k,t\in[0,T]\ \text{and}\ \text{d}N_{j,k}(t)=0}|x_{j,k}^{\prime}(t)|>\kappa_{\max}(\tau+2)\log(p\vee T)\Big{)}\lesssim\frac{Tp}{\sqrt{\log(p\vee T)}}\exp\Big{(}-(\tau+2)\log(p\vee T)\Big{)}=o\Big{(}\frac{1}{(p\vee T)^{\tau}}\Big{)}.

\Box

For simplicity of notation, we define the event

C_{\tau}=\Big{\{}\sup_{k,t\in[0,T]}|x_{j,k}(t)|\leq\kappa_{\max}(\tau+2)\log(p\vee T),\ \sup_{k,t\in[0,T]\ \text{and}\ \text{d}N_{j,k}(t)=0}|x_{j,k}^{\prime}(t)|\leq\kappa_{\max}(\tau+2)\log(p\vee T)\Big{\}}

so that $\mathbb{P}(C_{\tau})=1-o\big{(}\frac{1}{(p\vee T)^{\tau}}\big{)}$.
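To illustrate the quantity controlled by Lemma 3, the following sketch (ours; the homogeneous Poisson input and the boxcar kernel are illustrative assumptions) simulates $x_{j,k}(t)=\int\kappa_{j,k}(t-s)\text{d}N_{k}(s)$ and evaluates its supremum on a fine grid, comparing it with the logarithmic scale in (17).

```python
import numpy as np

rng = np.random.default_rng(2)
h_max, T = 1.0, 500.0
kappa_supp, kappa_max = 1.0, 0.8

# Events of N_k: a homogeneous Poisson(h_max) process on [0, T].
events = np.sort(rng.uniform(0.0, T, rng.poisson(h_max * T)))

def x_jk(t):
    # x_{j,k}(t) = sum_s kappa(t - s) over events s, with the boxcar kernel
    # kappa(u) = kappa_max * 1{0 <= u <= kappa_supp} (an illustrative choice).
    lo = np.searchsorted(events, t - kappa_supp)
    hi = np.searchsorted(events, t, side="right")
    return kappa_max * (hi - lo)

grid = np.linspace(0.0, T, 50_001)
sup_x = max(x_jk(t) for t in grid)
print(f"sup_t x(t) = {sup_x:.2f};  kappa_max * log scale = "
      f"{kappa_max * np.log(T):.2f}")
```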

Theorem 5

Suppose that $N$ is a temporal point process with an intensity satisfying the assumptions above. Suppose also that the functions $f_{i,k}(\cdot)$ have uniformly bounded support and

\sup_{i,k\in[p]}\|f_{i,k}\|_{\infty}\equiv\sup_{i,k\in[p]}\max_{x}|f_{i,k}(x)|\leq C_{f}.

Then, for $r>0$, we have

\mathbb{P}\bigg{(}\bigcup_{1\leq i\leq k\leq p}\Big{\{}|\bar{y}_{i,k}-\mathbb{E}\bar{y}_{i,k}|\geq\epsilon\Big{\}}\bigg{)}\leq c_{1}p^{2}T\exp\Big{(}-\frac{(\epsilon T)^{r/(3r+1)}}{c_{2}}\Big{)}+p^{2}\exp\Big{(}-c_{3}\epsilon^{2}T\Big{)}+p^{2}\exp\Big{[}-c_{4}T\exp\Big{(}\frac{(\epsilon T)^{r(2r+1)/(3r+1)^{2}}}{c_{5}[\log(\epsilon T)]^{r/(3r+1)}}\Big{)}\Big{]},

where

\bar{y}_{i,k}\equiv\frac{1}{T}\int_{0}^{T}\int_{0}^{T}f_{i,k}(t-t^{\prime})\text{d}N_{i}(t)\text{d}N_{k}(t^{\prime}),\ \ 1\leq i,k\leq p,

and $c_{1},c_{2},c_{3},c_{4},c_{5}$ are positive constants that do not depend on $p$ or $T$. Taking $\epsilon=\sqrt{\frac{(\tau+2)\log(p\vee T)}{c_{3}T}}$ and $r=1$, we have

\mathbb{P}\bigg{(}\bigcup_{1\leq i\leq k\leq p}\Big{\{}|\bar{y}_{i,k}-\mathbb{E}\bar{y}_{i,k}|\geq\sqrt{\frac{(\tau+2)\log(p\vee T)}{T}}\Big{\}}\bigg{)}\leq\frac{c_{6}}{(p\vee T)^{\tau}}.

The inequality above stems from the proof of Theorem 3 in Chen et al. (2017a); the exponent $r>0$ is introduced in an assumption on the tail of the transfer kernel (see Assumption 2 of Chen et al. (2017a) for more details). In our setting, the transfer kernels are compactly supported, so we can choose any $r>0$.
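Because $f_{i,k}$ has bounded support, $\bar{y}_{i,k}$ reduces in practice to a double sum over nearby event pairs. A minimal sketch (ours; the triangular $f$, its support width $w$, and the Poisson event times are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
T, w, lam = 200.0, 1.0, 2.0   # horizon, support width of f, event rate (assumed)

def f(u):
    # Triangular bump supported on [-w, w]; any bounded, compactly
    # supported f works here.
    return np.maximum(0.0, 1.0 - np.abs(u) / w)

t_i = np.sort(rng.uniform(0.0, T, rng.poisson(lam * T)))  # events of N_i
t_k = np.sort(rng.uniform(0.0, T, rng.poisson(lam * T)))  # events of N_k

# y_bar = (1/T) * sum_{a,b} f(t_i[a] - t_k[b]); only pairs within w contribute.
y_bar = 0.0
for s in t_i:
    lo, hi = np.searchsorted(t_k, (s - w, s + w))
    y_bar += f(s - t_k[lo:hi]).sum()
y_bar /= T

# For independent Poisson streams, E[y_bar] is about lam^2 * integral of f.
print(f"y_bar = {y_bar:.3f}, reference lam^2 * w = {lam ** 2 * w:.3f}")
```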

Lemma 4

For $T$ large enough, there exist constants $c_{1},c_{2}$ that do not depend on $p$ or $T$ such that

\mathbb{P}\bigg{(}\bigcup_{1\leq i\leq k\leq p}\bigg{[}\bigg{|}\frac{1}{T}\int_{0}^{T}x_{j,i}(t)x_{j,k}(t)\text{d}t-\mathbb{E}(x_{j,i}(t)x_{j,k}(t))\bigg{|}\geq c_{1}\sqrt{\frac{(\tau+2)\log(p\vee T)}{T}}\bigg{]}\bigg{)}\leq\frac{c_{2}}{(p\vee T)^{\tau}}

holds.

Proof.

\frac{1}{T}\int_{0}^{T}x_{j,i}(t)x_{j,k}(t)\text{d}t=\frac{1}{T}\int_{0}^{T}\Big{(}\int_{0}^{t}\kappa_{j,i}(t-s)\text{d}N_{i}(s)\Big{)}\Big{(}\int_{0}^{T}\kappa_{j,k}(t-r)\text{d}N_{k}(r)\Big{)}\text{d}t=\frac{1}{T}\int_{0}^{T}\int_{0}^{T}\Big{(}\int_{s\vee r}^{T}\kappa_{j,i}(t-s)\kappa_{j,k}(t-r)\text{d}t\Big{)}\text{d}N_{i}(s)\text{d}N_{k}(r).

In general, $\int_{s\vee r}^{T}\kappa_{j,i}(t-s)\kappa_{j,k}(t-r)\text{d}t$ is not a function of $s-r$ alone. However, for all $i,j\in[p]$, $\kappa_{j,i}(t)$ is supported on $[0,\kappa_{\operatorname{supp}}]$, and we observe that for $s\vee r<T-\kappa_{\operatorname{supp}}$,

\int_{s\vee r}^{T}\kappa_{j,i}(t-s)\kappa_{j,k}(t-r)\text{d}t=\begin{cases}\int_{0}^{\kappa_{\operatorname{supp}}}\kappa_{j,i}(t)\kappa_{j,k}(t+s-r)\text{d}t\ \ \ &s\geq r\\ \int_{0}^{\kappa_{\operatorname{supp}}}\kappa_{j,i}(t+r-s)\kappa_{j,k}(t)\text{d}t\ \ \ &s<r,\end{cases}

which depends only on $s-r$. Thus

\frac{1}{T}\int_{0}^{T}\int_{0}^{T}\Big{(}\int_{s\vee r}^{T}\kappa_{j,i}(t-s)\kappa_{j,k}(t-r)\text{d}t\Big{)}\text{d}N_{i}(s)\text{d}N_{k}(r)

=\frac{1}{T}\int_{0}^{T-\kappa_{\operatorname{supp}}}\int_{0}^{T-\kappa_{\operatorname{supp}}}\Big{(}\int_{s\vee r}^{T}\kappa_{j,i}(t-s)\kappa_{j,k}(t-r)\text{d}t\Big{)}\text{d}N_{i}(s)\text{d}N_{k}(r)

+\frac{1}{T}\int_{T-\kappa_{\operatorname{supp}}}^{T}\int_{0}^{T}\Big{(}\int_{s\vee r}^{T}\kappa_{j,i}(t-s)\kappa_{j,k}(t-r)\text{d}t\Big{)}\text{d}N_{i}(s)\text{d}N_{k}(r)

+\frac{1}{T}\int_{0}^{T-\kappa_{\operatorname{supp}}}\int_{T-\kappa_{\operatorname{supp}}}^{T}\Big{(}\int_{s\vee r}^{T}\kappa_{j,i}(t-s)\kappa_{j,k}(t-r)\text{d}t\Big{)}\text{d}N_{i}(s)\text{d}N_{k}(r)

=I^{(1)}_{ik}+I^{(2)}_{ik}+I^{(3)}_{ik}.

Let

f_{i,k}(s-r)=\begin{cases}\int_{0}^{\kappa_{\operatorname{supp}}}\kappa_{j,i}(t)\kappa_{j,k}(t+s-r)\text{d}t\ \ \ &s-r\geq 0\\ \int_{0}^{\kappa_{\operatorname{supp}}}\kappa_{j,i}(t+r-s)\kappa_{j,k}(t)\text{d}t\ \ \ &s-r<0.\end{cases}

With Theorem 5, we have

\mathbb{P}\Big{(}\bigcup_{1\leq i\leq k\leq p}|I^{(1)}_{ik}-\mathbb{E}[I^{(1)}_{ik}]|\geq c_{3}\sqrt{\frac{\tau\log(p\vee T)}{T}}\Big{)}\leq\frac{c_{4}}{(p\vee T)^{\tau}}.

As for $I^{(2)}_{ik}$ and $I^{(3)}_{ik}$, we have

|I^{(2)}_{ik}|=\bigg{|}\frac{1}{T}\int_{T-\kappa_{\operatorname{supp}}}^{T}\int_{0}^{T}\Big{(}\int_{s\vee r}^{T}\kappa_{j,i}(t-s)\kappa_{j,k}(t-r)\text{d}t\Big{)}\text{d}N_{i}(s)\text{d}N_{k}(r)\bigg{|}\leq\frac{1}{T}\int_{T-\kappa_{\operatorname{supp}}}^{T}\int_{T-2\kappa_{\operatorname{supp}}}^{T}\Big{(}\int_{s\vee r}^{T}\kappa^{2}_{\max}\text{d}t\Big{)}\text{d}N_{i}(s)\text{d}N_{k}(r)\leq\frac{8\kappa_{\operatorname{supp}}^{3}\kappa^{2}_{\max}}{T}

and

|I^{(3)}_{ik}|=\bigg{|}\frac{1}{T}\int_{0}^{T-\kappa_{\operatorname{supp}}}\int_{T-\kappa_{\operatorname{supp}}}^{T}\Big{(}\int_{s\vee r}^{T}\kappa_{j,i}(t-s)\kappa_{j,k}(t-r)\text{d}t\Big{)}\text{d}N_{i}(s)\text{d}N_{k}(r)\bigg{|}\leq\frac{1}{T}\int_{T-2\kappa_{\operatorname{supp}}}^{T-\kappa_{\operatorname{supp}}}\int_{T-\kappa_{\operatorname{supp}}}^{T}\Big{(}\int_{s\vee r}^{T}\kappa^{2}_{\max}\text{d}t\Big{)}\text{d}N_{i}(s)\text{d}N_{k}(r)\leq\frac{4\kappa_{\operatorname{supp}}^{3}\kappa^{2}_{\max}}{T}.

Similarly we have

\mathbb{E}[I^{(2)}_{ik}]\leq\frac{8\kappa_{\operatorname{supp}}^{3}\kappa^{2}_{\max}}{T},\ \ \mathbb{E}[I^{(3)}_{ik}]\leq\frac{4\kappa_{\operatorname{supp}}^{3}\kappa^{2}_{\max}}{T}.

Thus $|I^{(2)}_{ik}-\mathbb{E}[I^{(2)}_{ik}]|+|I^{(3)}_{ik}-\mathbb{E}[I^{(3)}_{ik}]|=O(T^{-1})$, and we have

\mathbb{P}\bigg{(}\bigcup_{1\leq i\leq k\leq p}\bigg{[}\bigg{|}\frac{1}{T}\int_{0}^{T}x_{j,i}(t)x_{j,k}(t)\text{d}t-\mathbb{E}(x_{j,i}(t)x_{j,k}(t))\bigg{|}\geq c_{3}\sqrt{\frac{\tau\log(p\vee T)}{T}}\bigg{]}\bigg{)}\leq\frac{c_{6}}{(p\vee T)^{\tau}}.

\Box

Lemma 5

Under Assumptions 1 and 2, we have

\mathbb{P}\Big{(}\max_{k}\Big{|}\frac{1}{T}\int_{0}^{T}x_{j,k}(t)\text{d}t-\mathbb{E}[x_{j,k}(t)]\Big{|}\geq c_{1}\sqrt{\frac{\tau\log(p\vee T)}{T}}\Big{)}\leq\frac{c_{2}}{(p\vee T)^{\tau}},

where $c_{1},c_{2}$ are positive constants that do not depend on $p$ or $T$.

proof: Recall the Bernstein-type inequality (16), with $M_{T}=\int_{0}^{T}g(t)(\text{d}N(t)-\text{d}\Lambda(t))$.

Notice that

\frac{1}{T}\int_{0}^{T}(x_{j,k}(t)-\mathbb{E}[x_{j,k}(t)])\text{d}t=\frac{1}{T}\int_{0}^{T}\int_{0}^{t}\kappa_{j,k}(t-s)\Big{(}\text{d}N_{k}(s)-\text{d}\Lambda_{k}(s)\Big{)}\text{d}t=\frac{1}{T}\int_{0}^{T}\Big{(}\int_{s}^{T}\kappa_{j,k}(t-s)\text{d}t\Big{)}(\text{d}N_{k}(s)-\text{d}\Lambda_{k}(s)),

where the last equality is by Fubini's theorem. Let $f_{1}(s)=\int_{s}^{T}\kappa_{j,k}(t-s)\text{d}t$. Since the transition kernel is bounded and supported on a compact set, $f_{1}(s)$ is bounded by a constant $B$ that does not depend on $p$ or $T$. Thus we have

\mathbb{P}\Big{(}\Big{|}\int_{0}^{T}f_{1}(t)(\text{d}N_{j}(t)-\text{d}\Lambda_{j}(t))\Big{|}\geq\sqrt{2B^{2}Tx}+\frac{Bx}{3}\ \text{and}\ \int_{0}^{T}f_{1}^{2}(t)\text{d}\Lambda_{j}(t)\leq B^{2}T\ \text{and}\ \sup_{t\in[0,T]}|f_{1}(t)|\leq B\Big{)}\leq 2e^{-x}.

Taking $x=(\tau+2)\log(p\vee T)$ and applying a union bound over $k\in[p]$, we have

\mathbb{P}\Big{(}\max_{k}\Big{|}\frac{1}{T}\int_{0}^{T}x_{j,k}(t)\text{d}t-\mathbb{E}[x_{j,k}(t)]\Big{|}\geq c_{1}\sqrt{\frac{\tau\log(p\vee T)}{T}}\Big{)}\leq\frac{c_{2}}{(p\vee T)^{\tau}}.

\Box
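The Fubini rearrangement in this proof says that the time average of $x_{j,k}$ equals $\frac{1}{T}\sum_{s}f_{1}(s)$ over the events $s$ of $N_{k}$. A quick numerical check (ours, with the same illustrative boxcar kernel and Poisson inputs as in the sketch after Lemma 3):

```python
import numpy as np

rng = np.random.default_rng(4)
T, lam = 300.0, 2.0
kappa_supp, kappa_max = 1.0, 0.8
events = np.sort(rng.uniform(0.0, T, rng.poisson(lam * T)))  # events of N_k

# Left side: (1/T) * int_0^T x_{j,k}(t) dt on a fine grid, where
# x_{j,k}(t) = kappa_max * #{events in [t - kappa_supp, t]} (boxcar kernel).
m = 300_000
grid = np.linspace(0.0, T, m + 1)
counts = (np.searchsorted(events, grid, side="right")
          - np.searchsorted(events, grid - kappa_supp))
lhs = kappa_max * counts.sum() * (T / m) / T

# Right side: (1/T) * sum_s f1(s), where f1(s) = int_s^T kappa(t - s) dt
# = kappa_max * min(kappa_supp, T - s) for the boxcar kernel.
rhs = kappa_max * np.minimum(kappa_supp, T - events).sum() / T
print(f"time average {lhs:.4f} vs Fubini form {rhs:.4f} "
      f"vs E[x] = {kappa_max * kappa_supp * lam:.4f}")
```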

Lemma 6

For fixed positive constants $r>0$ and $R>0$, let

Z(r)=\sup_{(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})\in\mathbb{S}_{2}(1)\cap\mathbb{S}_{1}(r)}\Big{|}\frac{1}{T}\int_{0}^{T}(\delta_{\mu}+\langle\bm{\Delta},\bm{x}_{j}(t)\rangle)^{2}\text{d}t-\mathbb{E}(\delta_{\mu}+\langle\bm{\Delta},\bm{x}_{j}(t)\rangle)^{2}\Big{|}.

Then for any constant $\tau\geq 1$,

\mathbb{P}\Big{(}Z(r)\geq c_{3}r^{2}(\tau+2)^{2}\sqrt{\frac{\log(p\vee T)}{T}}\Big{)}\leq\frac{c_{1}}{(p\vee T)^{\tau}}.

proof: By Lemmas 4 and 5 above, we have

\mathbb{P}\Big{(}\max_{i,k}\Big{|}\frac{1}{T}\int_{0}^{T}x_{j,i}(t)x_{j,k}(t)\text{d}t-\mathbb{E}[x_{j,i}(t)x_{j,k}(t)]\Big{|}>c_{1}(\tau+2)^{2}\sqrt{\frac{\log(p\vee T)}{T}}\Big{)}\leq\frac{c_{1}}{(p\vee T)^{\tau}},

\mathbb{P}\Big{(}\max_{k}\Big{|}\frac{1}{T}\int_{0}^{T}x_{j,k}(t)\text{d}t-\mathbb{E}[x_{j,k}(t)]\Big{|}>c_{1}(\tau+2)^{2}\sqrt{\frac{\log(p\vee T)}{T}}\Big{)}\leq\frac{c_{2}}{(p\vee T)^{\tau}},

where $c_{1}$ and $c_{2}$ are positive constants that do not depend on $p$ or $T$.

Note that for any $(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})\in\mathbb{S}_{2}(1)\cap\mathbb{S}_{1}(r)$,

\Big{|}\frac{1}{T}\int_{0}^{T}(\delta_{\mu}+\langle\bm{\Delta},\bm{x}_{j}(t)\rangle)^{2}\text{d}t-\mathbb{E}(\delta_{\mu}+\langle\bm{\Delta},\bm{x}_{j}(t)\rangle)^{2}\Big{|}\leq\|(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})\|_{1}^{2}\cdot\Big{(}\max_{i,k}\Big{|}\frac{1}{T}\int_{0}^{T}x_{j,i}(t)x_{j,k}(t)\text{d}t-\mathbb{E}[x_{j,i}(t)x_{j,k}(t)]\Big{|}\vee\max_{k}\Big{|}\frac{1}{T}\int_{0}^{T}x_{j,k}(t)\text{d}t-\mathbb{E}[x_{j,k}(t)]\Big{|}\Big{)}.

Then

\Big{\{}\max_{i,k}\Big{|}\frac{1}{T}\int_{0}^{T}x_{j,i}(t)x_{j,k}(t)\text{d}t-\mathbb{E}[x_{j,i}(t)x_{j,k}(t)]\Big{|}\vee\max_{k}\Big{|}\frac{1}{T}\int_{0}^{T}x_{j,k}(t)\text{d}t-\mathbb{E}[x_{j,k}(t)]\Big{|}\leq\epsilon\Big{\}}\subset\{Z(r)\leq r^{2}\epsilon\}.

Taking $\epsilon=c_{3}(\tau+2)^{2}\sqrt{\frac{\log(p\vee T)}{T}}$ completes the proof. $\Box$