
Temporal Point Process Graphical Models

Yalong Lyu, Huiyuan Wang, and Wei Lin ([email protected])
School of Mathematical Sciences, Peking University, China
The first and second authors contributed equally to this paper.
Abstract

Many real-world systems can be modeled as streams of events on the nodes of a graph. In this paper, we propose a class of graphical event models named temporal point process graphical models for representing the temporal dependencies among different components of a multivariate point process. In our model, the intensity of an event stream can depend on historical events in a nonlinear way. We provide a procedure that allows us to estimate the parameters in the model with a convex loss function in the high-dimensional setting. We also establish an error bound for our estimators that accounts for the approximation error introduced during implementation. We demonstrate the performance of our method with extensive simulations and a spike train data set.

keywords: graphical models; temporal point processes; high-dimensionality

1 Introduction

Graphical modeling of point processes is drawing increasing attention, as data in the form of point processes on a graph are emerging in scientific and business applications. Examples include biological neural networks with neural spike train data (see, e.g., Paninski et al., 2007; Pillow et al., 2008; Bolding and Franks, 2018); social media data recording timestamps of each user’s actions (Perry and Wolfe, 2013; Zhou et al., 2013; Fox et al., 2016); and high-frequency financial data recording times of market orders (Bacry et al., 2013; Aït-Sahalia et al., 2015; Bacry et al., 2015). Understanding the connectivity structure of these point processes, i.e., the edges of the graphs, is a challenging problem of growing interest. Our work is motivated by a recent spike train data set from Bolding and Franks (2018), which was collected from two brain regions important for olfaction in mice, namely the olfactory bulb and the piriform cortex. The neural activity in the olfactory bulb is highly concentration-dependent, while that in the piriform cortex is concentration-invariant. The study of Bolding and Franks (2018) suggests that there exists a complex excitation-inhibition mechanism between the olfactory bulb and the piriform cortex that allows the nervous system to produce a concentration-invariant representation of odors.

In this article, we propose a new temporal point process graphical model to address this problem. We model the spike train data of the neurons as a multivariate temporal point process, where each component of the point process represents the activity of one neuron. We allow future events to be excited or inhibited by past events in a non-additive manner. We organize the transfer coefficients into a matrix and impose a sparse structure, which aids interpretation in neuroscience and other applications (Zhao and Yu, 2006; Zou, 2006). Our loss function is convex, so many highly scalable algorithms, such as the alternating direction method of multipliers and coordinate descent, can be applied to parameter estimation. We establish a non-asymptotic error bound for the estimator and allow the number of nodes in the graph to diverge. We also analyze the approximation error introduced during the implementation phase.

A great number of works have focused on graphical models, including the Gaussian graphical model (Yuan and Lin, 2007; Liu, 2013) and directed acyclic graphs (Kalisch and Bühlman, 2007; Van de Geer et al., 2013; Zheng et al., 2018). In these models, each node of the graph corresponds to a random variable and the data are realizations of the random variables. The graphs of classical graphical models represent the structure of conditional independence among the random variables. In contrast, we consider the case where each node of the graph represents a point process, so the graph can be viewed as a multivariate point process. In our setting, the graph encodes the local independence relationships among the components of the multivariate point process (Didelez, 2008). The basic idea of local independence is that once we know about specific past events, the intensity of future events is independent of other past events. In our model, local independence is fully characterized by the parameters (see Proposition 1).

A family of models targets temporal point processes. A Hawkes process (Hawkes, 1971) assumes that historical events can trigger future events and has been widely used in modeling neuronal spike train data and social network data. The model we propose differs from existing research on the Hawkes process in many ways. Most prior art has focused on parameter estimation in the linear case, where the transfer function of the multivariate Hawkes process is the identity function. Hansen et al. (2015) considered estimating parameters in a linear Hawkes process with a fixed number of nodes using a least-square loss function and \ell_{1} regularization. Bacry et al. (2020) organized the transfer coefficients as a matrix and imposed a low-rank structure. A crucial weakness of the linear Hawkes process is that it does not allow inhibitory relations, i.e., past events can only trigger more future events, not silence them. Our model extends to a nonlinear setting where inhibitory influences are allowed and the influences of past events from different nodes can accumulate in a non-additive manner.

Tang and Li (2021) also considered parameter estimation of a nonlinear Hawkes process. Our work differs from theirs in several aspects. First, we focus on the estimation of sparse graphical models while they focus on the estimation of a low-rank tensor of regression parameters. Second, we provide a stronger theoretical guarantee by proving that the global minimum of our penalized loss function has a \sqrt{\frac{s\log p}{T}} convergence rate, while their theory only shows that there exists a local minimum of the penalized loss function in the neighborhood of the true parameter. Finally, our loss function is convex while theirs is not. A convex loss function with restricted strong convexity allows us to find the global minimum efficiently.

The paper is organized as follows. We finish this section with notation. We briefly introduce the temporal point process and our model in Section 2. In Section 3, we present the construction of our estimator and our main theorem. In Section 4, we analyze the approximation error introduced during the implementation phase. We show the results of simulations in Section 5 and an application to a spike train data set in Section 6. We conclude with a discussion in Section 7. Technical proofs are provided in the Appendix.

Some notation will be employed throughout this paper. Let [n],\ n\in\mathbb{N} denote the set \{1,\ldots,n\}. For a set S\subset[p] and a vector v\in\mathbb{R}^{p}, we define v_{S}\in\mathbb{R}^{p} by (v_{S})_{i}=v_{i}\mathbb{I}(i\in S). Let \|\cdot\|_{q},\ q\in[1,\infty) denote the \ell_{q}-norm of a vector and \|\cdot\|_{0} denote the number of non-zero entries of a vector. For a matrix A\in\mathbb{R}^{m\times n}, let \|A\|_{F}=(\sum_{i=1}^{m}\sum_{j=1}^{n}a_{ij}^{2})^{1/2} denote its Frobenius norm. We use S_{q}^{p}(r) for a \|\cdot\|_{q} ball with radius r in \mathbb{R}^{p}. We use c_{1},c_{2},\ldots to denote positive numbers that do not depend on p or T but may depend on quantities that we treat as constants in this paper; their values can change from line to line.

2 The Temporal Point Process Graphical Model

We will begin with a very brief review of temporal point processes. For a more systematic and comprehensive discussion of point processes, we refer to Brémaud (1981) and Daley and Vere-Jones (2007).

Let N(t),\ t\geq 0 be a simple point process. N(t) corresponds to a sequence of real-valued random variables \{t_{k}\}_{k\in\mathbb{Z}} with t_{k+1}>t_{k} and t_{1}\geq 0 almost surely; \{t_{k}\}_{k\in\mathbb{Z}} are called the event times of N(t). For a set A in the Borel \sigma-field of the real line, N(A)=\sum_{k}\mathbb{I}_{\{t_{k}\in A\}}. We write N([t,t+\mathrm{d}t)) as \mathrm{d}N(t), where \mathrm{d}t denotes an arbitrarily small increment of t. Let \mathcal{H}_{t} denote the \sigma-field generated by N(\tau),\ \tau\in[0,t]. The intensity of N(t) is defined as

\lambda(t)\mathrm{d}t=\mathbb{P}(\mathrm{d}N(t)=1\,|\,\mathcal{H}_{t}).

Now suppose we have p simple point processes N_{1},\ldots,N_{p}, with t_{j,1}<t_{j,2}<\cdots<t_{j,n}<\cdots denoting the event times of the process N_{j},\ j\in[p]. We employ \mathcal{N}=(N_{1},\ldots,N_{p}) to denote a p-variate point process. Let \mathcal{H}_{t} denote the \sigma-field generated by \mathcal{N}(\tau),\ \tau\in[0,t]. The intensity process \bm{\lambda}(t)=(\lambda_{1}(t),\ldots,\lambda_{p}(t))^{\mathrm{T}} is a p-variate \mathcal{H}_{t}-predictable process defined by

\lambda_{j}(t)\mathrm{d}t=\mathbb{P}\big{(}\mathrm{d}N_{j}(t)=1\,|\,\mathcal{H}_{t}\big{)},\ j\in[p].

In our model, we assume this intensity process takes the form

\lambda_{j}(t)=h_{j}\Big{(}\mu_{j}+\sum_{k=1}^{p}\int_{0}^{t}\omega_{j,k}(\tau)\,\mathrm{d}N_{k}(t-\tau)\Big{)},\ j\in[p], (1)

where h_{j}(\cdot),\ j\in[p] are link functions and can be nonlinear; \mu_{j},\ j\in[p] are the background intensities of the components of the multivariate point process; and \omega_{j,k}(\cdot):\mathbb{R}^{+}\rightarrow\mathbb{R},\ j,k\in[p] are the transfer functions. \bm{\lambda}(t)=\big{(}\lambda_{1}(t),\cdots,\lambda_{p}(t)\big{)} is called the intensity process. The convolution \int_{0}^{t}\omega_{j,k}(\tau)\,\mathrm{d}N_{k}(t-\tau) summarizes the historical information of the k-th component. In this paper, we consider parameterized transfer functions that can be written as

\omega_{j,k}(t)=\beta_{j,k}\kappa_{j,k}(t),\ j,k\in[p],

where \kappa_{j,k}(t),\ j,k\in[p] are known kernel functions.
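To make the model concrete, the following minimal sketch evaluates the intensity in (1) for a single node under the parameterized transfer functions \omega_{j,k}=\beta_{j,k}\kappa_{j,k}; the function names and the shared kernel are our own illustrative choices, not part of the model specification.

```python
import numpy as np

def intensity(t, events, mu, beta, kernel, h):
    """Sketch of lambda_j(t) in Eq. (1) for one node j with omega = beta * kappa.

    events : list of p arrays of event times, one array per node
    mu     : background intensity mu_j
    beta   : length-p array of transfer coefficients (beta_{j,1}, ..., beta_{j,p})
    kernel : shared kernel kappa(tau), vanishing outside [0, kappa_supp]
    h      : bounded link function h_j
    """
    drive = mu
    for k, times_k in enumerate(events):
        lags = t - times_k[times_k < t]        # tau = t - (event time of node k)
        drive += beta[k] * kernel(lags).sum()  # int_0^t omega_{j,k}(tau) dN_k(t - tau)
    return h(drive)

# toy usage with the arctan link and the triangular kernel of Section 5
events = [np.array([0.2, 0.9]), np.array([0.5])]
lam = intensity(
    1.0, events, mu=0.5, beta=np.array([0.3, -0.3]),
    kernel=lambda u: np.where((u >= 0) & (u <= 1), 1.0 - u, 0.0),
    h=lambda x: 4 + (8 / np.pi) * np.arctan(x),
)
```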

In graphical models, the edges encode conditional independence structures among the random variables (Maathuis et al., 2018; Qiao et al., 2019; Engelke and Hitz, 2020). For multivariate point processes, the question of interest is whether the intensity of one component depends on the history of another component, which can be represented by local independence for multivariate point processes (Didelez, 2008).

Definition 1 (local independence for multivariate point process)

Let \mathcal{N}=(N_{1},\ldots,N_{p}) be a multivariate point process, and let A,B and C be disjoint subsets of [p]. We say that a subprocess \mathcal{N}_{B} is locally independent of \mathcal{N}_{A} given \mathcal{N}_{C} over a time interval \mathcal{T} if \mathbb{E}[\lambda_{k}(t)\,|\,\mathcal{F}_{t}^{A\cup B\cup C}]=\mathbb{E}[\lambda_{k}(t)\,|\,\mathcal{F}_{t}^{B\cup C}] holds for all t\in\mathcal{T} and k\in B, where \mathcal{F}_{t}^{A} denotes the \sigma-field generated by \{N_{i}([0,t]),\ i\in A\}.

In temporal point process graphical models, the local independence structure is encoded in the transfer matrix B=(\beta_{j,k})\in\mathbb{R}^{p\times p}, because \lambda_{j}(t)=h_{j}\big{(}\mu_{j}+\sum_{k=1}^{p}\int_{0}^{t}\beta_{j,k}\kappa_{j,k}(\tau)\,\mathrm{d}N_{k}(t-\tau)\big{)} is \mathcal{F}_{t}^{A}-measurable for any index set A such that \{k:\beta_{j,k}\neq 0\}\subset A. We have the following proposition.

Proposition 1

For a temporal point process graphical model \mathcal{N}=(N_{1},\ldots,N_{p}), let A,B be disjoint subsets of [p] and \mathcal{N}_{A}=\{N_{i},\ i\in A\}. Then \mathcal{N}_{A} is locally independent of \mathcal{N}_{B} given \mathcal{N}_{(A\cup B)^{c}} if and only if \beta_{i,j}=0 for all i\in A,\ j\in B.

Thus, by estimating the transfer matrix B, we can recover the local independence relationships in \mathcal{N}(t).

3 Estimation

In this section, we present an approach for estimating temporal point process graphical models. We begin with the construction of our estimator in Section 3.1, then provide a detailed comparison of our method with the methods widely employed in prior art, namely least-square estimation and maximum likelihood estimation, in Section 3.2. In Section 3.3, we present a theoretical analysis that guarantees the performance of our method.

3.1 Our Method

In this paper, we focus on temporal point process graphical models with bounded link functions and compactly supported transfer functions, as stated in Assumption 1.

Assumption 1

There exists a constant h_{\max} such that h_{j}(\cdot)<h_{\max},\ j\in[p], and h_{j},\ j\in[p] have continuous derivatives. The kernels \kappa_{j,k}(t),\ j,k\in[p] are supported on [0,\kappa_{\operatorname{supp}}] for some constant \kappa_{\operatorname{supp}} and have bounded derivatives |\kappa_{j,k}^{\prime}(t)|\leq\kappa_{\max},\ j,k\in[p] (and naturally we can pick \kappa_{\max} such that |\kappa_{j,k}(t)|\leq\kappa_{\max},\ j,k\in[p] also holds).

Similar assumptions on the link function and transfer kernels are required in existing theory (Hansen et al., 2015; Chen et al., 2017b; Bacry et al., 2020). As we allow a nonlinear link function, the transfer kernels are no longer restricted to be non-negative. For simplicity of the theory, we assume that h_{j}(\cdot) is bounded; this assumption covers most cases in applications.

Under Assumption 1, the temporal point process is Markovian, i.e., given the event times in [t,t+\kappa_{\operatorname{supp}}], the intensity process \{\bm{\lambda}(s),\ s>t+\kappa_{\operatorname{supp}}\} is independent of \mathcal{H}_{t}. Together with the fact that the intensity process is bounded by h_{\max}, it is straightforward to show that the empty state \phi, in which there is no event on [t,t+\kappa_{\operatorname{supp}}], is positive recurrent and the temporal point process is stable in distribution. For more discussion of stability for temporal point processes, we refer interested readers to Brémaud and Massoulié (1996) and Costa et al. (2018).

Suppose that we observe the event times of a stationary process on [0,T]. Under Assumption 1, we can rewrite the intensity in a compact form:

\lambda_{j}(t)=h_{j}\big{(}\mu_{j}+\langle\bm{x}_{j}(t),\bm{\beta}_{j}\rangle\big{)},\ j\in[p],

where \bm{x}_{j}(t)=\big{(}x_{j,1}(t),\cdots,x_{j,p}(t)\big{)}^{\mathrm{T}}, x_{j,k}(t)=\int_{0}^{t}\kappa_{j,k}(t-\tau)\,\mathrm{d}N_{k}(\tau) and \bm{\beta}_{j}=(\beta_{j,1},\beta_{j,2},\ldots,\beta_{j,p}). Inspired by the log-linear models in Negahban et al. (2012), we consider the following loss function for estimating (\mu_{j},\bm{\beta}_{j}):

L_{T}(\bm{\mu},B)=\frac{1}{T}\sum_{j=1}^{p}\int_{0}^{T}W_{j}(t)\Big{(}H_{j}\big{(}\mu_{j}+\langle\bm{\beta}_{j},\bm{x}_{j}(t)\rangle\big{)}\,\mathrm{d}t-\big{(}\mu_{j}+\langle\bm{\beta}_{j},\bm{x}_{j}(t)\rangle\big{)}\,\mathrm{d}N_{j}(t)\Big{)}, (2)

where H_{j}(\cdot) is a primitive function of h_{j}(\cdot), i.e., H_{j}^{\prime}(\cdot)=h_{j}(\cdot), \bm{\mu}=(\mu_{1},\ldots,\mu_{p}), B=(\bm{\beta}_{1}^{\mathrm{T}},\ldots,\bm{\beta}_{p}^{\mathrm{T}}), and \bm{W}(t)=\big{(}W_{1}(t),\ldots,W_{p}(t)\big{)} is a weighting process. If \bm{\mu} and B are intrinsically sparse for practical reasons, we can adopt the following \ell_{1}-penalized loss minimization (Tibshirani, 1996):

(\hat{\bm{\mu}},\hat{B})\in\mathop{\arg\min}_{\bm{\mu},B}\Big{\{}\frac{1}{T}\sum_{j=1}^{p}\int_{0}^{T}W_{j}(t)\Big{(}H_{j}\big{(}\mu_{j}+\langle\bm{\beta}_{j},\bm{x}_{j}(t)\rangle\big{)}\,\mathrm{d}t-\big{(}\mu_{j}+\langle\bm{\beta}_{j},\bm{x}_{j}(t)\rangle\big{)}\,\mathrm{d}N_{j}(t)\Big{)}+\lambda_{T}(\|\bm{\mu}\|_{1}+\|B\|_{1})\Big{\}}. (3)

In generalized linear models, Negahban et al. (2012) employed a similar method to derive a convex loss function. The negative log-likelihood loss function for the generalized linear model is convex but not strongly convex when p>T, i.e., there are some directions with zero gradient and the Hessian matrix is not of full rank. Negahban et al. (2012) addressed this problem by introducing the concept of restricted strong convexity. For a decomposable penalty (such as \|\cdot\|_{1}) and a convex loss function, the solution of the optimization always falls in a "restricted set", on which the loss function is strongly convex. We show that our loss function also admits a restricted strong convexity condition. This implies that the obtained estimator, regardless of the algorithm used or the initial value, attains the convergence rate we prove. In contrast, the estimator provided by Tang and Li (2021) is a local estimator, which retains its statistical properties only when the initial value is close enough to the true parameters.

3.2 Examples

For estimating a temporal point process graphical model with linear link functions, there are two widely used estimation strategies: the least-square estimator (Bacry et al., 2020; Hansen et al., 2015),

L^{\mathrm{LS}}_{T}(\bm{\mu},B)=\frac{1}{T}\sum_{j=1}^{p}\int_{0}^{T}\lambda_{j}^{2}(t;\mu_{j},\bm{\beta}_{j})\,\mathrm{d}t-\frac{2}{T}\int_{0}^{T}\lambda_{j}(t;\mu_{j},\bm{\beta}_{j})\,\mathrm{d}N_{j}(t), (4)

and the maximum likelihood estimator (Ozaki, 1979; Tang and Li, 2021):

L^{\mathrm{MLE}}_{T}(\bm{\mu},B)=\frac{1}{T}\sum_{j=1}^{p}\int_{0}^{T}\lambda_{j}(t;\mu_{j},\bm{\beta}_{j})\,\mathrm{d}t-\frac{1}{T}\int_{0}^{T}\log\big{(}\lambda_{j}(t;\mu_{j},\bm{\beta}_{j})\big{)}\,\mathrm{d}N_{j}(t). (5)

The score functions for the least-square loss and the negative log-likelihood loss are

\nabla_{\bm{\beta}_{j}}L_{T}^{\mathrm{LS}}=\frac{2}{T}\int_{0}^{T}h^{\prime}_{j}(\mu_{j}+\langle\bm{\beta}_{j},\bm{x}_{j}(t)\rangle)\,\bm{x}_{j}(t)\Big{[}h_{j}(\mu_{j}+\langle\bm{\beta}_{j},\bm{x}_{j}(t)\rangle)\,\mathrm{d}t-\mathrm{d}N_{j}(t)\Big{]}

and

\nabla_{\bm{\beta}_{j}}L_{T}^{\mathrm{MLE}}=\frac{1}{T}\int_{0}^{T}\frac{h^{\prime}_{j}(\mu_{j}+\langle\bm{\beta}_{j},\bm{x}_{j}(t)\rangle)}{h_{j}(\mu_{j}+\langle\bm{\beta}_{j},\bm{x}_{j}(t)\rangle)}\,\bm{x}_{j}(t)\Big{[}h_{j}(\mu_{j}+\langle\bm{\beta}_{j},\bm{x}_{j}(t)\rangle)\,\mathrm{d}t-\mathrm{d}N_{j}(t)\Big{]}.

It can be seen that optimizing the least-square risk function is equivalent to optimizing (2) with the iterative reweighting \hat{W}^{\mathrm{LS}}_{j}(t)=h^{\prime}_{j}(\hat{\mu}_{j}+\langle\hat{\bm{\beta}}_{j},\bm{x}_{j}(t)\rangle), and optimizing the negative log-likelihood risk function is equivalent to optimizing (2) with the iterative reweighting \hat{W}^{\mathrm{MLE}}_{j}(t)=\frac{h^{\prime}_{j}(\hat{\mu}_{j}+\langle\hat{\bm{\beta}}_{j},\bm{x}_{j}(t)\rangle)}{h_{j}(\hat{\mu}_{j}+\langle\hat{\bm{\beta}}_{j},\bm{x}_{j}(t)\rangle)}. Directly optimizing the risk functions of the least-square estimator and the maximum likelihood estimator are non-convex problems, but our iterative reweighting procedure permits us to approximate each non-convex problem with a sequence of convex problems.
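The reweighting scheme can be sketched as a loop that alternates between refreshing the weights at the current fit and re-solving the resulting convex problem. In the sketch below, `solve_weighted` and `eta_of` are hypothetical helpers standing in for any convex solver of the weighted loss (2) plus an \ell_{1} penalty (e.g., the proximal-gradient sketch in Section 4) and for the evaluation of the linear predictor; they are not part of our method's specification.

```python
import numpy as np

def irls_mle(solve_weighted, eta_of, h, h_prime, n_iter=5):
    """Sketch: approximate the non-convex MLE by a sequence of convex problems.

    solve_weighted(W) -> (mu, beta): hypothetical convex solver of (2) + l1
        penalty for fixed weights W (W=None means W_j(t) = 1).
    eta_of(mu, beta) -> array of eta_j(t) = mu_j + <beta_j, x_j(t)> on a grid.
    """
    W = None                        # first pass: naive weights W_j(t) = 1
    for _ in range(n_iter):
        mu, beta = solve_weighted(W)
        eta = eta_of(mu, beta)
        W = h_prime(eta) / h(eta)   # refresh W^MLE at the current fit
    return mu, beta
```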

3.3 Theoretical Guarantees

We now return to the \ell_{1}-penalized loss minimization (3). Our theoretical results indicate that the estimator is consistent even in the high-dimensional case. Without restrictions, it is possible to construct extreme graphs where, for instance, some components of the influence process \bm{x}(t) are constantly pushed towards infinity, which makes the model hardly identifiable. To avoid such cases, we pose the following regularity conditions.

Assumption 2

(1) The mean influences \mu^{*}_{j}+\sum_{k\in S_{j}}\beta^{*}_{j,k}\mathbb{E}[x_{j,k}],\ j\in[p], are uniformly bounded:

\sup_{j\in[p]}\Big{(}\mu^{*}_{j}+\sum_{k\in S_{j}}\beta^{*}_{j,k}\mathbb{E}[x_{j,k}]\Big{)}<\sigma_{0}

for some constant \sigma_{0}>0.

(2) The covariance matrix of the influence process, \Sigma_{j}=\mathrm{Cov}\big{(}\bm{x}_{j}(t)\big{)}=\mathrm{Cov}\big{(}x_{j,1}(t),\cdots,x_{j,p}(t)\big{)}, is positive definite and has bounded eigenvalues:

0<\Sigma_{\min}\leq\Gamma_{\min}(\Sigma_{j})\leq\Gamma_{\max}(\Sigma_{j})\leq\Sigma_{\max}<\infty,\ \forall j\in[p],

where \Gamma_{\min}(\cdot) and \Gamma_{\max}(\cdot) denote the smallest and largest eigenvalues of a matrix. That is, for any (\delta_{\mu},\bm{\Delta}^{\mathrm{T}})\in\mathbb{S}^{p+1}_{2}(1) we have

\mathbb{E}\big{(}\delta_{\mu}+\langle\bm{\Delta},\bm{x}_{j}(t)\rangle\big{)}^{2}>\sigma_{1},

and there exists some R>0 such that

\mathbb{E}\Big{[}\big{(}\delta_{\mu}+\langle\bm{\Delta},\bm{x}_{j}(t)\rangle\big{)}^{2}\,\mathbb{I}\big{(}|\mu_{j}^{*}+\langle\bm{\beta}_{j}^{*},\bm{x}_{j}(t)\rangle|\leq R\big{)}\,\mathbb{I}\big{(}|\delta_{\mu}+\langle\bm{\Delta},\bm{x}_{j}(t)\rangle|\leq R\big{)}\Big{]}>\sigma_{1}. (6)

(3) The derivative is bounded away from 0 on a compact interval: there exists \sigma_{2}>0 such that

\inf_{-2R<t<2R}h_{j}^{\prime}(t)\geq\sigma_{2}>0,\ \forall j\in[p],

where R is introduced above.

(4) There exist 0<\sigma_{3}<\sigma_{4} such that \sigma_{3}<W_{j}(t)<\sigma_{4},\ \forall j\in[p], with probability 1 for T large enough.

We briefly discuss these assumptions. Assumption 2 (1) restricts the average influence of the process N_{j}(t) and is equivalent to having bounded intercept terms in vanilla Lasso problems (Tibshirani, 1996). Having non-degenerate covariates (Assumption 2 (2)) is a standard requirement for estimation problems (condition GLM1 in Negahban et al., 2012; regularity condition 4 in Tang and Li, 2021). Here (6) asks for a version of non-degeneracy that looks stronger, but it is not truly stronger: if we let R\rightarrow\infty, we have

\mathbb{E}\Big{[}\big{(}\delta_{\mu}+\langle\bm{\Delta},\bm{x}_{j}(t)\rangle\big{)}^{2}\,\mathbb{I}\big{(}|\mu_{j}^{*}+\langle\bm{\beta}_{j}^{*},\bm{x}_{j}(t)\rangle|\leq R\big{)}\,\mathbb{I}\big{(}|\delta_{\mu}+\langle\bm{\Delta},\bm{x}_{j}(t)\rangle|\leq R\big{)}\Big{]}\rightarrow\mathbb{E}\big{(}\delta_{\mu}+\langle\bm{\Delta},\bm{x}_{j}(t)\rangle\big{)}^{2}.

This involves a trade-off between the constants \sigma_{1} and \sigma_{2}. One can show that (6) holds if \bm{x}_{j}(t) has a sub-exponential tail. Assumption 2 (3) is employed to ensure identifiability. In summary, Assumptions 2 (2) and 2 (3) are required mainly for technical reasons, and we propose Assumption 2 (4) to make sure that the weights are well-behaved.

Now we are ready to introduce the main theorem.

Theorem 1

For the estimators in (3), under Assumptions 1 and 2, there exist constants c_{1},c_{2},c_{3}>0 such that, with \lambda_{T}=c_{1}\sqrt{\frac{\tau\log(p\vee T)}{T}}, we have

p^{-1/2}\|(\hat{\bm{\mu}},\hat{B})-(\bm{\mu}^{*},B^{*})\|_{F}\leq c_{2}\sqrt{s}\,\lambda_{T}, (7)
\|(\hat{\bm{\mu}},\hat{B})-(\bm{\mu}^{*},B^{*})\|_{\infty}\leq c_{2}s\lambda_{T}, (8)

which hold with probability greater than 1-\frac{c_{3}}{(p\vee T)^{\tau}}, where s=1+\max_{j\in[p]}\|\bm{\beta}_{j}^{*}\|_{0} and a star * as a superscript denotes the true value of a parameter.

Theorem 1 guarantees that, with high probability, the estimator converges at the rate \sqrt{\frac{s\log(p\vee T)}{T}}, which is similar to the rates established in prior works (Bacry et al., 2020; Hansen et al., 2015; Tang and Li, 2021). It is important to note that Theorem 1 considers an oracle procedure, where the tuning parameter depends on unknown parameters. When the link functions are linear, Hansen et al. (2015) explored selection guidelines in the case where the number of nodes p is fixed but the dictionary for sieving transfer kernels grows. A general theory for tuning parameter selection when p\rightarrow\infty remains an open problem. A detailed proof of Theorem 1 is provided in the Appendix.

4 Implementation

4.1 Theory for Approximation

Implementation lies between beautiful statistical theory and the successful application of statistical methods. The concrete details of implementation are important but seldom discussed in prior articles. In this section, we present a comprehensive analysis of the implemented version of our estimator.

For temporal point processes, the algorithm usually proceeds in a discrete manner (Tang and Li, 2021). The observation interval [0,T] is divided into M subintervals of equal length \frac{T}{M}, and the integral is replaced by a numerical integral. Let y_{j,m}=N_{j}(\frac{(m+1)T}{M})-N_{j}(\frac{mT}{M}) denote the number of events of node j in (\frac{mT}{M},\frac{(m+1)T}{M}] and let x_{j,k,m}=x_{j,k}(\frac{mT}{M}) for m=0,\ldots,M-1. Let \bm{x}_{j,m}=(x_{j,1,m},\cdots,x_{j,p,m}). Then our loss function in implementation becomes

\tilde{L}_{T}(\bm{\mu},B)=\frac{1}{T}\sum_{j=1}^{p}\sum_{m=0}^{M-1}W_{j}\Big{(}\frac{mT}{M}\Big{)}\Big{(}\frac{T}{M}H_{j}(\mu_{j}+\langle\bm{\beta}_{j},\bm{x}_{j,m}\rangle)-y_{j,m}(\mu_{j}+\langle\bm{\beta}_{j},\bm{x}_{j,m}\rangle)\Big{)}. (9)
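The following sketch builds the discretized quantities y_{j,m} and x_{j,k,m} from raw event times; it assumes, as in our simulations, that all pairs share one kernel \kappa, so the covariate contributed by node k is the same for every target node j. The function name is ours.

```python
import numpy as np

def discretize(events, T, M, kernel):
    """Sketch of the binning in Section 4.1 under a shared kernel kappa.

    events : list of p arrays of event times on [0, T]
    returns y, x of shape (p, M):
        y[k, m] = number of events of node k in (mT/M, (m+1)T/M]
        x[k, m] = x_{j,k}(mT/M) = sum of kappa(mT/M - event) over past events
    """
    p = len(events)
    grid = np.arange(M) * (T / M)              # left endpoints mT/M
    y = np.empty((p, M), dtype=int)
    x = np.empty((p, M))
    for k, tk in enumerate(events):
        y[k] = np.histogram(tk, bins=np.append(grid, T))[0]
        lags = grid[:, None] - tk[None, :]     # mT/M minus each event time
        x[k] = np.where(lags > 0, kernel(lags), 0.0).sum(axis=1)
    return y, x
```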

Generally, the discrete version (9) introduces additional approximation error relative to (2). To characterize this additional error, we pose the following error decomposition:

y_{j,m}=N_{j}\Big{(}\frac{(m+1)T}{M}\Big{)}-N_{j}\Big{(}\frac{mT}{M}\Big{)}=\mathbb{E}\Big{[}N_{j}\Big{(}\frac{(m+1)T}{M}\Big{)}-N_{j}\Big{(}\frac{mT}{M}\Big{)}\,\Big{|}\,\lambda_{j}|_{(0,\frac{(m+1)T}{M}]}\Big{]}+\epsilon_{1,j,m}

and

\mathbb{E}\Big{[}N_{j}\Big{(}\frac{(m+1)T}{M}\Big{)}-N_{j}\Big{(}\frac{mT}{M}\Big{)}\,\Big{|}\,\lambda_{j}|_{(0,\frac{(m+1)T}{M}]}\Big{]}=\int_{\frac{mT}{M}}^{\frac{(m+1)T}{M}}h_{j}(\mu_{j}^{*}+\langle\bm{x}_{j}(\tau),\bm{\beta}_{j}^{*}\rangle)\,\mathrm{d}\tau=\frac{T}{M}h_{j}(\mu_{j}^{*}+\langle\bm{x}_{j,m},\bm{\beta}^{*}_{j}\rangle)+\epsilon_{2,j,m},

where \bm{x}_{j,m}=(x_{j,1,m},x_{j,2,m},\ldots,x_{j,p,m}). Thus we have

y_{j,m}=\frac{T}{M}h_{j}\Big{(}\mu_{j}^{*}+\langle\bm{x}_{j,m},\bm{\beta}_{j}^{*}\rangle\Big{)}+\epsilon_{j,m},\ j\in[p],\ m=0,\ldots,M-1,

where \epsilon_{j,m}=\epsilon_{1,j,m}+\epsilon_{2,j,m}. The choice of M is of great importance because it induces a trade-off between computational efficiency and approximation accuracy. To ensure that the approximation error \epsilon_{2,j,m} is dominated by the statistical noise \epsilon_{1,j,m}, we assume the following.

Assumption 3

The number of subintervals M is large enough that \frac{\tau^{1/4}s^{1/2}T^{5/4}\log^{2}(p\vee T)}{M}=o(1).

The \ell_{1}-penalized loss minimization in implementation is

(\tilde{\bm{\mu}},\tilde{B})\in\mathop{\arg\min}_{\bm{\mu},B}\Big{\{}\frac{1}{T}\sum_{j=1}^{p}\sum_{m=0}^{M-1}W_{j}\Big{(}\frac{mT}{M}\Big{)}\Big{(}\frac{T}{M}H_{j}(\mu_{j}+\langle\bm{\beta}_{j},\bm{x}_{j,m}\rangle)-y_{j,m}(\mu_{j}+\langle\bm{\beta}_{j},\bm{x}_{j,m}\rangle)\Big{)}+\lambda_{T}(\|\bm{\mu}\|_{1}+\|B\|_{1})\Big{\}}. (10)
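A minimal solver for (10) can be written as proximal gradient descent with soft-thresholding; since our theory is algorithm-agnostic, this ISTA-style sketch (with our own function names and a fixed step size) is only one convenient choice among the scalable convex solvers mentioned in Section 1.

```python
import numpy as np

def soft_threshold(z, a):
    """Prox of a * |.|_1 (entrywise)."""
    return np.sign(z) * np.maximum(np.abs(z) - a, 0.0)

def fit_node(x, y, dt, H, h, lam, W=None, step=1e-2, n_iter=2000):
    """ISTA sketch for one row (mu_j, beta_j) of (10).

    x : (p, M) covariates x_{j,k,m};  y : (M,) counts y_{j,m};  dt = T / M
    H, h : primitive and link function (H' = h);  W : (M,) weights or None
    """
    p, M = x.shape
    W = np.ones(M) if W is None else W
    mu, beta = 0.0, np.zeros(p)
    for _ in range(n_iter):
        eta = mu + beta @ x                     # eta_m = mu + <beta, x_m>
        # gradient of (1/T) * sum_m W_m * (dt * H(eta_m) - y_m * eta_m)
        g = W * (dt * h(eta) - y) / (M * dt)    # note 1/T = 1/(M * dt)
        mu = soft_threshold(mu - step * g.sum(), step * lam)
        beta = soft_threshold(beta - step * (x @ g), step * lam)
    return mu, beta
```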

By carefully analyzing the approximation error, we can provide a similar convergence rate bound for (\tilde{\bm{\mu}},\tilde{B}).

Theorem 2

For the estimators in (10), under Assumptions 1-3, there exist constants c_{1},c_{2},c_{3}>0 such that, with \lambda_{T}=c_{1}\sqrt{\frac{\tau\log(p\vee T)}{T}}, we have

p^{-1/2}\|(\tilde{\bm{\mu}},\tilde{B})-(\bm{\mu}^{*},B^{*})\|_{F}\leq c_{2}\sqrt{s}\,\lambda_{T}, (11)
\|(\tilde{\bm{\mu}},\tilde{B})-(\bm{\mu}^{*},B^{*})\|_{\infty}\leq c_{2}s\lambda_{T}, (12)

which hold with probability greater than 1-\frac{c_{3}}{(p\vee T)^{\tau}}, where s=1+\max_{j\in[p]}\|\bm{\beta}_{j}^{*}\|_{0} and a star * as a superscript denotes the true value of a parameter.

When M satisfies Assumption 3, the approximation error is dominated by the statistical noise. There are other graphical models of point processes that divide the observation window into a number of bins and model the number of events in each bin, either with a generalized linear model (Zhang et al., 2016) or with a Gaussian graphical model (Vinci et al., 2018). Our method is fundamentally different from these methods because discretization is employed only in the implementation of the method; our final goal is to estimate the parameters of the generalized Hawkes process rather than of another model. A detailed proof is included in the supplementary material.

5 Simulation

In this section, we investigate the performance of our estimation procedure in a simulation study. We consider two estimators: the naively weighted estimator with W_{j}(t)=1,\ \forall j\in[p], and the iteratively reweighted maximum likelihood estimator, where W_{j}(t)=\frac{h_{j}^{\prime}(\hat{\mu}_{j}+\langle\hat{\bm{\beta}}_{j},\bm{x}_{j}(t)\rangle)}{h_{j}(\hat{\mu}_{j}+\langle\hat{\bm{\beta}}_{j},\bm{x}_{j}(t)\rangle)} is iteratively updated.

We set the number of nodes to p\in\{30,60\}. We consider two popular graphical structures: block and chain. For the block structure, B is a 60\times 60 (30\times 30) block-diagonal matrix of 12 (6) identical symmetric Toeplitz matrices whose first row is (0,0.3,-0.3,0.3,-0.3) for p=60 (p=30). For the chain structure, B is a symmetric Toeplitz matrix whose first row is (0,0.3,-0.3,0,0,\cdots,0). Their explicit forms are shown in Figures 1 and 2 for p=30 and p=60, where the excitatory connections (entries \beta_{j,k}=0.3) are shown in blue and the inhibitory connections (entries \beta_{j,k}=-0.3) are shown in red. The background intensity is \mu_{j}=0.5. We consider two settings of link function and transfer kernel. For setting 1, we use an \arctan link function and a restricted linear transfer kernel:

\begin{cases}h_{j}(x)=4+\frac{8}{\pi}\arctan(x),\ j\in[p],\\ \kappa_{j,k}(x)=(1-x)\mathbb{I}_{[0,1]}(x),\ j,k\in[p].\end{cases}

For setting 2, we use a sigmoid link function and an exponential transfer kernel:

\begin{cases}h_{j}(x)=1+\frac{4e^{x}}{1+e^{x}},\ j\in[p],\\ \kappa_{j,k}(x)=e^{-x}\mathbb{I}_{[0,5]}(x),\ j,k\in[p].\end{cases}
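Since Assumption 1 bounds every intensity by h_{\max}, such processes can be simulated exactly by thinning a homogeneous proposal process. The sketch below uses Ogata-style thinning with a shared kernel; it is our own illustration rather than the simulation code used in this section. With the setting-2 link above, one can take h_max = 5 since h_{j}(x)\in(1,5).

```python
import numpy as np

def simulate(T, mu, B, kernel, h, h_max, seed=None):
    """Thinning sketch: propose candidates at total rate p * h_max, assign each
    to a uniformly chosen node j, and accept with probability lambda_j(t) / h_max."""
    rng = np.random.default_rng(seed)
    p = len(mu)
    events = [[] for _ in range(p)]
    t = 0.0
    while True:
        t += rng.exponential(1.0 / (p * h_max))   # next candidate time
        if t > T:
            return [np.array(e) for e in events]
        j = rng.integers(p)                       # candidate node
        drive = mu[j]
        for k in range(p):                        # drive = mu_j + sum_k beta_{j,k} * sum kappa(t - s)
            if events[k]:
                lags = t - np.asarray(events[k])
                drive += B[j, k] * kernel(lags).sum()
        if rng.random() < h(drive) / h_max:       # thinning acceptance
            events[j].append(t)
```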
Figure 1: Connectivity matrices with block and chain structures for p=30. The excitatory connections (entries \beta_{j,k}=0.3) are shown in blue and the inhibitory connections (entries \beta_{j,k}=-0.3) are shown in red.
Figure 2: Connectivity matrices with block and chain structures for p=60. The excitatory connections (entries \beta_{j,k}=0.3) are shown in blue and the inhibitory connections (entries \beta_{j,k}=-0.3) are shown in red.

The length of observation is T\in\{200,400\} and the number of sub-intervals is M=10T. We use 5-fold cross-validation to select the penalty level. We evaluate the estimation accuracy by the relative \ell_{1} error \frac{\sum_{i,j}|\hat{\beta}_{i,j}-\beta_{i,j}|}{\sum_{i,j}|\beta_{i,j}|} and the relative mean square error of the estimated transfer coefficient matrix, \frac{\|\hat{B}-B^{*}\|_{F}^{2}}{\|B^{*}\|_{F}^{2}}. Three methods are compared: the \ell_{1}-penalized naively weighted estimator (Sparse-Naive), the \ell_{1}-penalized iteratively reweighted maximum likelihood estimator (Sparse-MLE), and the vanilla maximum likelihood estimator. The average relative \ell_{1} and \ell_{2} errors based on 50 replications, with standard errors in parentheses, are reported.
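As a sketch of the selection step, the following 5-fold cross-validation holds out a random fifth of the M sub-intervals, refits with the `fit_node` sketch from Section 4, and scores the held-out unpenalized loss. The helper names are ours, and randomly splitting bins ignores their temporal ordering, which is harmless here only because the discretized loss is additive over bins.

```python
import numpy as np

def cv_select(x, y, dt, H, h, lam_grid, n_folds=5, seed=0):
    """5-fold CV sketch over the M sub-intervals for one node.
    Requires fit_node from the Section 4 sketch."""
    rng = np.random.default_rng(seed)
    M = y.shape[0]
    folds = rng.permutation(M) % n_folds        # balanced random fold labels
    scores = []
    for lam in lam_grid:
        err = 0.0
        for f in range(n_folds):
            tr, te = folds != f, folds == f
            mu, beta = fit_node(x[:, tr], y[tr], dt, H, h, lam)
            eta = mu + beta @ x[:, te]
            err += np.sum(dt * H(eta) - y[te] * eta)  # held-out loss (9), W = 1
        scores.append(err)
    return lam_grid[int(np.argmin(scores))]
```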

Table 1: Estimation accuracy of B in the form of relative \ell_{1} error
Setting Structure p T Sparse-Naive Sparse-MLE vanilla-MLE
Setting 1 Block 30 200 0.718(0.049) 0.667(0.025) 1.510(0.057)
400 0.525(0.024) 0.547(0.032) 0.942(0.025)
60 200 1.010(0.077) 0.805(0.022) 4.460(0.134)
400 0.776(0.037) 0.656(0.011) 2.128(0.040)
Chain 30 200 0.877(0.043) 0.796(0.032) 1.793(0.135)
400 0.645(0.028) 0.656(0.034) 1.066(0.048)
60 200 1.240(0.086) 0.959(0.025) 5.277(0.170)
400 0.916(0.049) 0.787(0.022) 2.403(0.063)
Setting 2 Block 30 200 0.895(0.037) 0.884(0.033) 2.812(0.163)
400 0.716(0.023) 0.723(0.026) 1.612(0.052)
60 200 1.035(0.043) 1.048(0.032) 14.62(0.611)
400 0.862(0.025) 0.872(0.025) 3.906(0.083)
Chain 30 200 1.063(0.057) 1.054(0.033) 3.571(0.213)
400 0.883(0.037) 0.882(0.038) 1.826(0.086)
60 200 1.174(0.044) 1.176(0.026) 15.77(0.546)
400 1.031(0.029) 1.024(0.024) 4.451(0.137)
Table 2: Estimation accuracy of B in the form of relative \ell_{2} error
Setting Structure p T Sparse-Naive Sparse-MLE vanilla-MLE
Setting 1 Block 30 200 0.235(0.018) 0.241(0.018) 0.495(0.042)
400 0.149(0.012) 0.150(0.012) 0.191(0.011)
60 200 0.300(0.015) 0.307(0.014) 2.235(0.143)
400 0.174(0.005) 0.194(0.007) 0.483(0.019)
Chain 30 200 0.326(0.029) 0.318(0.026) 0.843(0.301)
400 0.200(0.017) 0.181(0.018) 0.250(0.034)
60 200 0.418(0.027) 0.399(0.021) 3.523(0.296)
400 0.224(0.011) 0.242(0.011) 0.687(0.077)
Setting 2 Block 30 200 0.485(0.030) 0.475(0.032) 1.772(0.268)
400 0.285(0.020) 0.281(0.020) 0.555(0.036)
60 200 0.575(0.029) 0.566(0.024) 24.53(1.682)
400 0.348(0.020) 0.341(0.019) 1.652(0.074)
Chain 30 200 0.672(0.083) 0.620(0.045) 3.575(0.729)
400 0.389(0.033) 0.371(0.027) 0.736(0.120)
60 200 0.769(0.064) 0.724(0.058) 27.85(1.561)
400 0.484(0.023) 0.462(0.022) 2.561(0.266)
Table 3: The mean area under the ROC curves
Setting Structure p T Sparse-Naive Sparse-MLE
Setting 1 Block 30 200 0.995(0.0026) 0.997(0.0015)
400 1.000(0.0001) 1.000(0.0001)
60 200 0.995(0.0016) 0.997(0.0014)
400 1.000(0.0001) 1.000(0.0001)
Chain 30 200 0.954(0.0074) 0.965(0.0063)
400 0.974(0.0025) 0.978(0.0027)
60 200 0.968(0.0041) 0.976(0.0032)
400 0.985(0.0012) 0.988(0.0012)
Setting 2 Block 30 200 0.914(0.0143) 0.916(0.0139)
400 0.979(0.0044) 0.981(0.0042)
60 200 0.915(0.0137) 0.918(0.0138)
400 0.983(0.0039) 0.985(0.0035)
Chain 30 200 0.837(0.0174) 0.841(0.0189)
400 0.924(0.0090) 0.925(0.0100)
60 200 0.840(0.0134) 0.844(0.0145)
400 0.944(0.0056) 0.948(0.0052)

Tables 1 and 2 summarize the results based on 50 data replications. Both of our proposed methods outperform the vanilla maximum likelihood estimator, which is known to be asymptotically efficient (Ogata, 1978) in estimating parameters when the transfer function is linear. Overall, the iteratively reweighted penalized maximum likelihood estimator achieves better performance. However, the loss surface of the \ell_{1}-penalized MLE is not convex, so there is no guarantee that the working algorithm will find a global solution. The naively weighted estimator provides a convex loss function, which guarantees that the algorithm returns a solution in the neighborhood of the global minimizer.

To demonstrate the accuracy of variable selection for our proposed methods, we calculated true-positive and false-positive rates, i.e., the proportions of edges of the network that were correctly and incorrectly identified, for each method and tuning parameter. Plotting the true positive rate versus the false positive rate over a fine grid of values of the tuning parameter produces an ROC curve, with curves near the top left corner indicating a method that performs well. Figure 3 shows the median of the ROC curves for our two proposed methods and Table 3 provides the area under the ROC curve (averaged over the 50 simulation runs). Both methods are successful in recovering the graphical structure in all the settings we considered.
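The ROC computation reduces to sweeping the penalty level and comparing the support of each estimate with the true support; a small sketch is given below, where `fit(lam)` is a hypothetical helper that returns the estimated transfer matrix at penalty level lam.

```python
import numpy as np

def roc_points(fit, B_true, lam_grid, tol=1e-8):
    """Trace (FPR, TPR) pairs for edge recovery over a penalty grid."""
    truth = np.abs(B_true) > tol                 # true edge pattern
    pts = []
    for lam in lam_grid:
        est = np.abs(fit(lam)) > tol             # selected edges at this lam
        tpr = (est & truth).sum() / truth.sum()
        fpr = (est & ~truth).sum() / (~truth).sum()
        pts.append((fpr, tpr))
    return sorted(pts)                           # ready for plotting / AUC
```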

Figure 3: ROC curves for p=60,\ T=200 in setting 1 (top row) and setting 2 (bottom row).

Finally, we report the computation time. The naively weighted estimator has a convex loss function, does not need to compute the reweighting parameters, and runs faster than the iteratively reweighted MLE. The simulation code is programmed in Python. For the simulation example in setting 1 with p=60, T=400 and M=4000, the average computing time was about 1.5 minutes for the Naive estimator and 1.7 minutes for the MLE. For the simulation example in setting 2 with p=60, T=400 and M=4000, the average computing time was 2.1 minutes for the Naive estimator and 2.5 minutes for the MLE. All computations were performed with an Intel(R) Core(TM) i9-10900K CPU @ 3.70GHz.

6 Application

We consider the task of recovering the connectivity network among neurons using the spike train data from the optogenetic study of Bolding and Franks (2018) discussed in Section 1. In this experiment, the spike trains are recorded at 30 kHz on the olfactory bulb (OB) and the piriform cortex (PCx) of mice, while a laser pulse is applied directly to the OB cells. For specially designed transgenic mice, light pulses elicit an increase in OB spiking that remains elevated for the duration of the stimulus. We consider the spike train data collected at three intensity levels, 0\ mW/mm^{2}, 5\ mW/mm^{2} and 10\ mW/mm^{2}, from 15 OB cells and 45 PCx cells. For each intensity level, the observation lasts about T=90 seconds. We choose the number of sub-intervals M=2000 by pre-experiments.

Figure 4: Number of firings of the OB cells and the PCx cells for laser intensities 0\ mW/mm^{2}, 5\ mW/mm^{2} and 10\ mW/mm^{2}. OB cells are numbered 1-15 and PCx cells are numbered 16-60.

Figure 4 shows the number of firings of the OB cells and the PCx cells. The firing numbers vary greatly among cells. We use an adaptive link function to better characterize the average intensity of different neurons:

h_{i}(t)=\hat{\lambda}_{i}\Big{(}1+\frac{2\arctan(t)}{\pi}\Big{)},\ i=1,\ldots,60,

where \hat{\lambda}_{i}=\frac{N_{i}([0,T])}{T}. Based on research on the duration of influence among neurons in Bolding and Franks (2018), we choose \kappa_{j,k}(x)=\mathbb{I}_{[0,0.25]}(x),\ j,k\in[p]. Since our goal is to provide interpretable visualizations and investigate the influence of the laser on the neural connectivity, we computed sparse graphs with approximately 5\% connected edges. We performed a bootstrap procedure by randomly selecting 2000 sub-intervals with replacement from the original data, finding a tuning parameter \gamma_{n} to achieve a 5\% sparsity level, fitting the point process graphical model, and repeating 50 times. The final graph was constructed from the excitatory and inhibitory edges that occurred in at least 50\% of the bootstrap replications.
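The bootstrap just described can be sketched as a stability-selection loop. In the sketch below, `fit_on` is a hypothetical helper that refits the model on a resampled set of sub-intervals at the 5% sparsity level and returns the sign pattern of the estimated transfer matrix, and the majority-sign rule is a simplification of the edge-frequency rule in the text.

```python
import numpy as np

def bootstrap_edges(fit_on, n_bins=2000, n_boot=50, keep_frac=0.5, seed=0):
    """Stability sketch: resample sub-intervals with replacement, refit,
    and keep an edge if one sign dominates in >= keep_frac of replications."""
    rng = np.random.default_rng(seed)
    votes = 0
    for _ in range(n_boot):
        idx = rng.integers(n_bins, size=n_bins)  # resample bins with replacement
        votes = votes + fit_on(idx)              # sign matrix in {-1, 0, +1}
    return np.where(np.abs(votes) >= keep_frac * n_boot, np.sign(votes), 0)
```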

Figure 5: Estimated networks among the OB cells and the PCx cells for laser intensities 0\ mW/mm^{2}, 5\ mW/mm^{2} and 10\ mW/mm^{2}, from left to right. OB cells are numbered 1-15 and PCx cells are numbered 16-60. A red pixel at (i,j) indicates an inhibitory connection from cell j to cell i and a blue pixel at (i,j) indicates an excitatory connection from cell j to cell i.

Figure 5 shows the networks estimated using the point process graphical model for the three groups. The bootstrapped graphical model estimated a sparser network, with sparsity level 3.28\% for the 0\ mW/mm^{2} group, 4.08\% for the 5\ mW/mm^{2} group and 3.67\% for the 10\ mW/mm^{2} group. We observe a few apparent patterns. First, PCx cells are densely connected in all three groups, but the 0\ mW/mm^{2} group has increased connectivity relative to the other groups, suggesting that there may be a multi-layer excitatory-inhibitory neuron network in the piriform cortex that stabilizes the odor signal. Second, when the laser was applied to the OB cells, the number of edges from other cells to the OB cells increased and the signs of several connections changed, indicating that the laser affected the connectivity structure of the OB cells and PCx cells and that this effect was heterogeneous. Finally, we observed that a few neurons fired less frequently and had little effect on other cells (such as OB-9, PCx-1, PCx-15, PCx-42, PCx-44). Their role in the odor-processing mechanism needs to be further explored.

7 Conclusion and Future Work

In this paper, we have proposed the temporal point process graphical model, a new class of graphical models for learning the temporal dependencies among different components of a multivariate point process. We have shown that the local independence structure of the process is fully encoded in the transfer matrix. We have provided a class of estimators with a non-asymptotic estimation error bound, which extends the classical estimators of temporal point processes. Our model naturally includes the case of sparse temporal point process regression, where the predictors and the responses are not the same process. The performance of our estimator has been tested on simulations and a spike train data set.

Several challenges remain in the analysis of temporal point process graphical models. Cross-validation for selecting tuning parameters has proved successful for classical Lasso estimators (Chetverikov et al., 2021); however, a detailed theoretical analysis is needed to demonstrate its performance in our case. Criteria for optimal tuning parameter selection for estimation and variable selection are left for future research. A second challenge is to assess the uncertainty of the estimated parameters, which is less addressed in the existing work.

References

  • Aït-Sahalia et al. [2015] Yacine Aït-Sahalia, Julio Cacho-Diaz, and Roger JA Laeven. Modeling financial contagion using mutually exciting jump processes. Journal of Financial Economics, 117(3):585–606, 2015.
  • Bacry et al. [2013] Emmanuel Bacry, Sylvain Delattre, Marc Hoffmann, and Jean-François Muzy. Some limit theorems for hawkes processes and application to financial statistics. Stochastic Processes and their Applications, 123(7):2475–2499, 2013.
  • Bacry et al. [2015] Emmanuel Bacry, Iacopo Mastromatteo, and Jean-François Muzy. Hawkes processes in finance. Market Microstructure and Liquidity, 1(01):1550005, 2015.
  • Bacry et al. [2020] Emmanuel Bacry, Martin Bompaire, Stéphane Gaïffas, and Jean-Francois Muzy. Sparse and low-rank multivariate hawkes processes. Journal of Machine Learning Research, 21(50):1–32, 2020.
  • Bolding and Franks [2018] K. A. Bolding and K. M. Franks. Recurrent cortical circuits implement concentration-invariant odor coding. Science, 361(6407):eaat6904–, 2018.
  • Brémaud and Massoulié [1996] Pierre Brémaud and Laurent Massoulié. Stability of nonlinear hawkes processes. The Annals of Probability, 24(3):1563 – 1588, 1996.
  • Brémaud [1981] Pierre Brémaud. Point Processes and Queues. Springer-Verlag, 1981.
  • Chen et al. [2017a] Shizhe Chen, Ali Shojaie, Eric Shea-Brown, and Daniela Witten. The multivariate hawkes process in high dimensions: Beyond mutual excitation. arXiv preprint arXiv:1707.04928, 2017a.
  • Chen et al. [2017b] Shizhe Chen, Daniela Witten, and Ali Shojaie. Nearly assumptionless screening for the mutually-exciting multivariate hawkes process. Electronic journal of statistics, 11(1):1207, 2017b.
  • Chetverikov et al. [2021] Denis Chetverikov, Zhipeng Liao, and Victor Chernozhukov. On cross-validated lasso in high dimensions. The Annals of Statistics, 49(3):1300–1317, 2021.
  • Costa et al. [2018] Manon Costa, Carl Graham, Laurence Marsalle, and Viet Chi Tran. Renewal in hawkes processes with self-excitation and inhibition. Advances in Applied Probability, 52, 01 2018.
  • Daley and Vere-Jones [2007] DJ Daley and David Vere-Jones. An Introduction to the Theory of Point Processes: Volume II: General Theory and Structure. Springer Science & Business Media, 2007.
  • Didelez [2008] Vanessa Didelez. Graphical models for marked point processes based on local independence. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(1):245–264, 2008.
  • Engelke and Hitz [2020] Sebastian Engelke and Adrien S Hitz. Graphical models for extremes. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 82(4):871–932, 2020.
  • Fox et al. [2016] Eric W Fox, Martin B Short, Frederic P Schoenberg, Kathryn D Coronges, and Andrea L Bertozzi. Modeling e-mail networks and inferring leadership using self-exciting point processes. Journal of the American Statistical Association, 111(514):564–584, 2016.
  • Hansen et al. [2015] Niels Hansen, Patricia Reynaud-Bouret, and Vincent Rivoirard. Lasso and probabilistic inequalities for multivariate point processes. Bernoulli, 21, 08 2015.
  • Hawkes [1971] Alan G Hawkes. Spectra of some self-exciting and mutually exciting point processes. Biometrika, 58(1):83–90, 1971.
  • Hawkes and Oakes [1974] Alan G. Hawkes and David Oakes. A cluster process representation of a self-exciting process. Journal of Applied Probability, 11(03):493–503, 1974.
  • Kalisch and Bühlman [2007] Markus Kalisch and Peter Bühlman. Estimating high-dimensional directed acyclic graphs with the pc-algorithm. Journal of Machine Learning Research, 8(3), 2007.
  • Liu [2013] Weidong Liu. Gaussian graphical model estimation with false discovery rate control. The Annals of Statistics, pages 2948–2978, 2013.
  • Maathuis et al. [2018] Marloes Maathuis, Mathias Drton, Steffen Lauritzen, and Martin Wainwright. Handbook of graphical models. CRC Press, 2018.
  • Negahban et al. [2012] Sahand N Negahban, Pradeep Ravikumar, Martin J Wainwright, and Bin Yu. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical science, 27(4):538–557, 2012.
  • Ogata [1978] Yosihiko Ogata. The asymptotic behaviour of maximum likelihood estimators for stationary point processes. Annals of the Institute of Statistical Mathematics, 30(2):243–261, 1978.
  • Ozaki [1979] Tohru Ozaki. Maximum likelihood estimation of Hawkes’ self-exciting point processes. Annals of the Institute of Statistical Mathematics, 31(1):145–155, 1979.
  • Paninski et al. [2007] Liam Paninski, Jonathan Pillow, and Jeremy Lewi. Statistical models for neural encoding, decoding, and optimal stimulus design. Progress in brain research, 165:493–507, 2007.
  • Perry and Wolfe [2013] Patrick O Perry and Patrick J Wolfe. Point process modelling for directed interaction networks. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(5):821–849, 2013.
  • Pillow et al. [2008] Jonathan W Pillow, Jonathon Shlens, Liam Paninski, Alexander Sher, Alan M Litke, EJ Chichilnisky, and Eero P Simoncelli. Spatio-temporal correlations and visual signalling in a complete neuronal population. Nature, 454(7207):995–999, 2008.
  • Qiao et al. [2019] Xinghao Qiao, Shaojun Guo, and Gareth M James. Functional graphical models. Journal of the American Statistical Association, 114(525):211–222, 2019.
  • Shorack and Wellner [2009] Galen R Shorack and Jon A Wellner. Empirical processes with applications to statistics. SIAM, 2009.
  • Tang and Li [2021] Xiwei Tang and Lexin Li. Multivariate temporal point process regression. Journal of the American Statistical Association, 0(0):1–16, 2021.
  • Tibshirani [1996] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.
  • Van de Geer et al. [2013] Sara Van de Geer, Peter Bühlmann, et al. \ell_{0}-penalized maximum likelihood for sparse directed acyclic graphs. Annals of Statistics, 41(2):536–567, 2013.
  • Vinci et al. [2018] Giuseppe Vinci, Valérie Ventura, Matthew A Smith, and Robert E Kass. Adjusted regularization in latent graphical models: Application to multiple-neuron spike count data. The annals of applied statistics, 12(2):1068, 2018.
  • Yuan and Lin [2007] Ming Yuan and Yi Lin. Model selection and estimation in the gaussian graphical model. Biometrika, 94(1):19–35, 2007.
  • Zhang et al. [2016] Chunming Zhang, Yi Chai, Xiao Guo, Muhong Gao, David Devilbiss, and Zhengjun Zhang. Statistical learning of neuronal functional connectivity. Technometrics, 58(3):350–359, 2016.
  • Zhao and Yu [2006] Peng Zhao and Bin Yu. On model selection consistency of lasso. The Journal of Machine Learning Research, 7:2541–2563, 2006.
  • Zheng et al. [2018] Xun Zheng, Bryon Aragam, Pradeep K Ravikumar, and Eric P Xing. Dags with no tears: Continuous optimization for structure learning. Advances in Neural Information Processing Systems, 31, 2018.
  • Zhou et al. [2013] Ke Zhou, Hongyuan Zha, and Le Song. Learning triggering kernels for multi-dimensional hawkes processes. In International Conference on Machine Learning, pages 1301–1309. PMLR, 2013.
  • Zou [2006] Hui Zou. The adaptive lasso and its oracle properties. Journal of the American statistical association, 101(476):1418–1429, 2006.

Appendix A

Here we present the proof of Theorem 1, which is divided into three steps. First, we show that the loss function satisfies the restricted strong convexity condition in Theorem 3. Second, we show that the estimator has a convergence rate of \sqrt{\frac{s\log(p\vee T)}{T}} as long as \lambda_{T} dominates the statistical noise (Theorem 4). In the last step, we show that \lambda_{T} dominates the statistical noise with high probability (Lemma 1).

Theorem 3

Let

L_{T}(\mu_{j},\bm{\beta}_{j})=\frac{1}{T}\int_{0}^{T}W_{j}(t)\Big{(}H_{j}(\mu_{j}+\langle\bm{\beta}_{j},\bm{x}_{j}(t)\rangle)\,\mathrm{d}t-(\mu_{j}+\langle\bm{\beta}_{j},\bm{x}_{j}(t)\rangle)\,\mathrm{d}N_{j}(t)\Big{)} (13)

denote the loss function. Under Assumptions 1 and 2, there exist constants \kappa_{1},\kappa_{2}>0 such that for every (\delta_{\mu},\bm{\Delta}^{\mathrm{T}})\in K(r,S),

L_{T}(\mu^{*}_{j}+\delta_{\mu},\bm{\beta}^{*}_{j}+\bm{\Delta})-L_{T}(\mu^{*}_{j},\bm{\beta}^{*}_{j})-\langle\nabla_{\bm{\beta}_{j}}L_{T}(\mu^{*}_{j},\bm{\beta}^{*}_{j}),\bm{\Delta}\rangle-\nabla_{\mu}L_{T}(\mu^{*}_{j},\bm{\beta}^{*}_{j})\delta_{\mu}
\geq\kappa_{1}\|(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})\|_{2}\Big{\{}\|(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})\|_{2}-\kappa_{2}(\tau+1)^{2}\varepsilon(T,p)\|(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})\|_{1}^{2}\Big{\}}

holds with probability greater than 1-\frac{c_{1}}{(p\vee T)^{\tau}}, where S is the support of \bm{\beta}^{*}_{j}, K(r,S)=\{(\delta_{\mu},\bm{\Delta})\in\mathbb{R}^{p+1}:\|(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})\|_{2}=r,\ \|(0,\bm{\Delta}_{S^{c}}^{\mathrm{T}})\|_{1}\leq 4\|(\delta_{\mu},\bm{\Delta}_{S}^{\mathrm{T}})\|_{1}\}, and \varepsilon(T,p)=\sqrt{\frac{\log(p\vee T)}{T}}.

Theorem 4

Under Assumptions 1 and 2, taking \lambda_{T}=c_{1}\sqrt{\frac{\tau\log(p\vee T)}{T}} and conditioning on the event \cap_{j=1}^{p}G_{j}, where

G_{j}=\Big{\{}\lambda_{T}\geq 2h_{\max}\Big{|}\frac{1}{T}\int_{0}^{T}W_{j}(t)\Big{(}h_{j}(\mu_{j}^{*}+\langle\bm{\beta}^{*}_{j},\bm{x}_{j}(t)\rangle)\,\mathrm{d}t-\mathrm{d}N_{j}(t)\Big{)}\Big{|}\ \text{and}\ \lambda_{T}\geq 2h_{\max}\Big{|}\frac{1}{T}\int_{0}^{T}W_{j}(t)\Big{(}h_{j}(\mu_{j}^{*}+\langle\bm{\beta}^{*}_{j},\bm{x}_{j}(t)\rangle)x_{j,k}(t)\,\mathrm{d}t-x_{j,k}(t)\,\mathrm{d}N_{j}(t)\Big{)}\Big{|},\ \forall k\in[p]\Big{\}},

we have

p^{-1/2}\|(\hat{\bm{\mu}},\hat{B})-(\bm{\mu}^{*},B^{*})\|_{F}\leq c_{2}\sqrt{s}\,\lambda_{T} (14)

and

\|(\hat{\bm{\mu}},\hat{B})-(\bm{\mu}^{*},B^{*})\|_{\infty}\leq c_{2}s\lambda_{T} (15)

hold with probability greater than 1-\frac{c_{3}}{(p\vee T)^{\tau}}, where s=1+\max_{j}\|\bm{\beta}^{*}_{j}\|_{0}.

Lemma 1

Under Assumptions 1 and 2, there exist positive constants c_{1},c_{2}>0 such that, with probability greater than 1-\frac{c_{1}}{(p\vee T)^{\tau}}, we have

\Big{|}\frac{1}{T}\int_{0}^{T}W_{j}(t)\Big{(}h_{j}(\mu_{j}^{*}+\langle\bm{\beta}^{*}_{j},\bm{x}_{j}(t)\rangle)x_{j,k}(t)\,\mathrm{d}t-x_{j,k}(t)\,\mathrm{d}N_{j}(t)\Big{)}\Big{|}\leq c_{2}\sqrt{\frac{\tau\log(p\vee T)}{T}},\ \ \forall k\in[p]

and

\Big{|}\frac{1}{T}\int_{0}^{T}W_{j}(t)\Big{(}h_{j}(\mu_{j}^{*}+\langle\bm{\beta}^{*}_{j},\bm{x}_{j}(t)\rangle)\,\mathrm{d}t-\mathrm{d}N_{j}(t)\Big{)}\Big{|}\leq c_{2}\sqrt{\frac{\tau\log(p\vee T)}{T}}.

A.1 Proof of Theorem 3

We will show the restricted strong convexity for the loss function.

Proof. With a slight abuse of notation, we let \bm{x}_{j}(t) denote the centered influence process \bm{x}_{j}(t)-\mathbb{E}[\bm{x}_{j}(t)] and \mu^{*}_{j} denote \mu^{*}_{j}+\sum_{k\in S}\beta_{j,k}^{*}\mathbb{E}[x_{j,k}(t)].

For \bm{\Delta}\in\mathbb{R}^{p}, define \delta L_{T}(\delta_{\mu},\mu^{*}_{j},\bm{\Delta},\bm{\beta}^{*}_{j})=L_{T}(\mu^{*}_{j}+\delta_{\mu},\bm{\beta}_{j}^{*}+\bm{\Delta})-L_{T}(\mu_{j}^{*},\bm{\beta}_{j}^{*})-\langle\nabla_{\bm{\beta}}L_{T}(\mu_{j}^{*},\bm{\beta}_{j}^{*}),\bm{\Delta}\rangle-\delta_{\mu}\frac{\partial}{\partial\mu}L_{T}(\mu,\bm{\beta}_{j}^{*})\Big{|}_{\mu=\mu^{*}_{j}}. By the Taylor expansion, there exists some v\in[0,1] such that

\delta L_{T}(\delta_{\mu},\mu^{*}_{j},\bm{\Delta},\bm{\beta}^{*}_{j})=\frac{1}{T}\int_{0}^{T}W_{j}(t)\,h_{j}^{\prime}\Big{(}\mu^{*}_{j}+\langle\bm{\beta}_{j}^{*},\bm{x}_{j}(t)\rangle+v\big{(}\delta_{\mu}+\langle\bm{\Delta},\bm{x}_{j}(t)\rangle\big{)}\Big{)}(\delta_{\mu}+\langle\bm{\Delta},\bm{x}_{j}(t)\rangle)^{2}\,\mathrm{d}t.

Let \gamma be a positive number to be determined later with 0<\gamma\leq R, where R is the constant in Assumption 2. Define the functions \phi_{\gamma}(u)=u^{2}\mathbb{I}(|u|\leq\gamma/2)+(\gamma-|u|)^{2}\mathbb{I}(\gamma/2\leq|u|\leq\gamma) and \zeta_{\gamma}(u)=u\mathbb{I}(|u|\leq\gamma).

Note that

(\mu_{j}^{*}+\langle\bm{\beta}^{*}_{j},\bm{x}_{j}(t)\rangle)\,\mathbb{I}(|\mu_{j}^{*}+\langle\bm{\beta}_{j}^{*},\bm{x}_{j}(t)\rangle|\leq R)=\zeta_{R}(\mu_{j}^{*}+\langle\bm{\beta}_{j}^{*},\bm{x}_{j}(t)\rangle),
(\delta_{\mu}+\langle\bm{\Delta},\bm{x}_{j}(t)\rangle)\,\mathbb{I}(|\delta_{\mu}+\langle\bm{\Delta},\bm{x}_{j}(t)\rangle|\leq\gamma)=\zeta_{\gamma}(\delta_{\mu}+\langle\bm{\Delta},\bm{x}_{j}(t)\rangle),
(\delta_{\mu}+\langle\bm{\Delta},\bm{x}_{j}(t)\rangle)^{2}\,\mathbb{I}(|\delta_{\mu}+\langle\bm{\Delta},\bm{x}_{j}(t)\rangle|\leq\gamma)\geq\phi_{\gamma}(\delta_{\mu}+\langle\bm{\Delta},\bm{x}_{j}(t)\rangle).

Then

\delta L_{T}(\delta_{\mu},\mu^{*}_{j},\bm{\Delta},\bm{\beta}^{*}_{j})\geq\frac{1}{T}\int_{0}^{T}W_{j}(t)h_{j}^{\prime}\Big{(}\mu^{*}_{j}+\langle\bm{\beta}_{j}^{*},\bm{x}_{j}(t)\rangle+v\big{(}\delta_{\mu}+\langle\bm{\Delta},\bm{x}_{j}(t)\rangle\big{)}\Big{)}\cdot\mathbb{I}(|\mu^{*}_{j}+\langle\bm{\beta}^{*}_{j},\bm{x}_{j}(t)\rangle|\leq R)\cdot(\delta_{\mu}+\langle\bm{\Delta},\bm{x}_{j}(t)\rangle)^{2}\cdot\mathbb{I}(|\delta_{\mu}+\langle\bm{\Delta},\bm{x}_{j}(t)\rangle|\leq\gamma)\,\text{d}t

\geq\frac{1}{T}\int_{0}^{T}W_{j}(t)h_{j}^{\prime}\Big{(}\mu^{*}_{j}+\langle\bm{\beta}_{j}^{*},\bm{x}_{j}(t)\rangle+v\big{(}\delta_{\mu}+\langle\bm{\Delta},\bm{x}_{j}(t)\rangle\big{)}\Big{)}\cdot\mathbb{I}(|\mu^{*}_{j}+\langle\bm{\beta}^{*}_{j},\bm{x}_{j}(t)\rangle|\leq R)\cdot\phi_{\gamma}(\delta_{\mu}+\langle\bm{\Delta},\bm{x}_{j}(t)\rangle)\,\text{d}t

\geq C_{h_{j}}(R)\frac{1}{T}\int_{0}^{T}\phi_{\gamma}(\delta_{\mu}+\langle\bm{\Delta},\bm{x}_{j}(t)\rangle)\cdot\mathbb{I}(|\mu_{j}^{*}+\langle\bm{\beta}_{j}^{*},\bm{x}_{j}(t)\rangle|\leq R)\,\text{d}t,

where $C_{h_{j}}(R)=\sigma_{3}\inf_{|x|\leq 2R}h_{j}^{\prime}(x)$ is a positive constant by Assumption 2. Hence it suffices to bound from below

\frac{1}{T}\int_{0}^{T}\phi_{\gamma}(\delta_{\mu}+\langle\bm{\Delta},\bm{x}_{j}(t)\rangle)\cdot\mathbb{I}(|\mu_{j}^{*}+\langle\bm{\beta}_{j}^{*},\bm{x}_{j}(t)\rangle|\leq R)\,\text{d}t.

Let $\gamma(r)=Rr$ and note that $\phi_{c}(cz)=c^{2}\phi_{1}(z)$ for any $c>0$ and $z\in\mathbb{R}$. Then for $\|(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})\|_{2}=r$, $\phi_{\gamma(r)}(\delta_{\mu}+\langle\bm{\Delta},\bm{x}_{j}(t)\rangle)=r^{2}\phi_{R}(\delta_{\mu}/r+\langle\bm{\Delta}/r,\bm{x}_{j}(t)\rangle)$, and

\frac{1}{T}\int_{0}^{T}\phi_{\gamma(r)}(\delta_{\mu}+\langle\bm{\Delta},\bm{x}_{j}(t)\rangle)\cdot\mathbb{I}(|\mu_{j}^{*}+\langle\bm{\beta}_{j}^{*},\bm{x}_{j}(t)\rangle|\leq R)\,\text{d}t=\frac{r^{2}}{T}\int_{0}^{T}\phi_{R}(\delta_{\mu}/r+\langle\bm{\Delta}/r,\bm{x}_{j}(t)\rangle)\cdot\mathbb{I}(|\mu_{j}^{*}+\langle\bm{\beta}_{j}^{*},\bm{x}_{j}(t)\rangle|\leq R)\,\text{d}t,

which implies that it suffices to show that there exist strictly positive constants $\kappa_{1}$ and $\kappa_{2}$, depending only on $\Sigma_{j}$ and $h_{j}(\cdot)$, such that

\frac{1}{T}\int_{0}^{T}\phi_{R}(\delta_{\mu}/r+\langle\bm{\Delta}/r,\bm{x}_{j}(t)\rangle)\cdot\mathbb{I}(|\mu_{j}^{*}+\langle\bm{\beta}_{j}^{*},\bm{x}_{j}(t)\rangle|\leq R)\,\text{d}t\geq\kappa_{1}\Big{\{}1-\kappa_{2}(\tau+1)^{2}\varepsilon(T,p)\cdot\frac{\|(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})\|_{1}^{2}}{\|(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})\|_{2}^{2}}\Big{\}}.

Hence we only need to prove the inequality when $\|(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})\|_{2}=1$.

Denote by $\mathbb{S}_{2}(1)$ the unit sphere and let $\mathbb{S}_{1}(\varpi)=\{(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})\in\mathbb{R}^{p+1}:\|(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})\|_{1}=\varpi\}$. Let $f_{(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})}(\bm{x}_{j})=\frac{1}{T}\int_{0}^{T}\phi_{R}(\delta_{\mu}+\langle\bm{\Delta},\bm{x}_{j}(t)\rangle)\cdot\mathbb{I}(|\mu_{j}^{*}+\langle\bm{\beta}_{j}^{*},\bm{x}_{j}(t)\rangle|\leq R)\,\text{d}t$. For each $\varpi>0$, it suffices to show that the event

\mathcal{E}_{\varpi}=\bigg{\{}f_{(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})}(\bm{x}_{j})<\kappa_{1}\Big{(}1-\kappa_{2}(\tau+1)^{2}\varepsilon(T,p)\varpi^{2}\Big{)},\ \text{for some}\ (\delta_{\mu},\bm{\Delta}^{\mathrm{T}})\in\mathbb{S}_{1}(\varpi)\cap\mathbb{S}_{2}(1)\bigg{\}}

occurs with small probability. Note that $\mathbb{S}_{1}(\varpi)\cap\mathbb{S}_{2}(1)\neq\emptyset$ only if $1\leq\varpi\leq\sqrt{p+1}$.

Recalling Assumption 2(2), we have

\mathbb{E}[f_{(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})}(\bm{x}_{j})]>\sigma_{1}>0

for some constant $\sigma_{1}$.

By Lemma 6, we have for any $\varpi^{2}\in\{1,\ldots,p+1\}$,

\mathbb{P}\Big{(}\sup_{(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})\in\mathbb{S}_{2}(1)\cap\mathbb{S}_{1}(\varpi)}\Big{|}\frac{1}{T}\int_{0}^{T}|\delta_{\mu}+\langle\bm{\Delta},\bm{x}_{j}(t)\rangle|^{2}\text{d}t-\mathbb{E}|\delta_{\mu}+\langle\bm{\Delta},\bm{x}_{j}(t)\rangle|^{2}\Big{|}\geq c_{4}\tau^{2}\varpi^{2}\varepsilon(T,p)\Big{)}\leq\frac{c_{5}}{(p\vee T)^{\tau}},

where

\varepsilon(T,p)=\sqrt{\frac{\log(p\vee T)}{T}}.

Since $\phi_{R}(\cdot)$ is an $R$-Lipschitz function, we have

\mathbb{P}\Big{(}\sup_{(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})\in\mathbb{S}_{2}(1)\cap\mathbb{S}_{1}(\varpi)}\Big{|}f_{(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})}(\bm{x}_{j})-\mathbb{E}[f_{(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})}(\bm{x}_{j})]\Big{|}\geq c_{4}\tau^{2}R\varpi^{2}\varepsilon(T,p)\Big{)}\leq\frac{c_{5}}{(p\vee T)^{\tau}}.

Then

f_{(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})}(\bm{x}_{j})\geq\mathbb{E}[f_{(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})}(\bm{x}_{j})]-\sup_{(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})\in\mathbb{S}_{2}(1)\cap\mathbb{S}_{1}(\varpi)}\Big{|}f_{(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})}(\bm{x}_{j})-\mathbb{E}[f_{(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})}(\bm{x}_{j})]\Big{|}\geq\sigma_{1}-c_{4}\tau^{2}R\varpi^{2}\varepsilon(T,p)=\sigma_{1}\Big{(}1-\frac{c_{4}\tau^{2}R\varpi^{2}}{\sigma_{1}}\varepsilon(T,p)\Big{)}

holds with probability greater than $1-\frac{c_{5}}{(p\vee T)^{\tau}}$.

The last step is a "peeling" argument. For any $\varpi^{2}\in\{1,2,\cdots,p+1\}$, we have

\mathbb{P}\Big{\{}f_{(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})}(\bm{x}_{j})<\sigma_{1}\Big{(}1-\frac{c_{4}\tau^{2}R\varpi^{2}}{\sigma_{1}}\varepsilon(T,p)\Big{)},\ \text{for some}\ (\delta_{\mu},\bm{\Delta}^{\mathrm{T}})\in\mathbb{S}_{1}(\varpi)\cap\mathbb{S}_{2}(1)\Big{\}}\leq\frac{c_{5}}{(p\vee T)^{\tau}}.

Summing over the $p+1$ shells and using $p+1\leq 2(p\vee T)$, we then have

\mathbb{P}\bigg{\{}f_{(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})}(\bm{x}_{j})<\frac{\sigma_{1}}{2}\Big{(}1-\frac{c_{4}\tau^{2}R}{\sigma_{1}}\varepsilon(T,p)\Big{)},\ \text{for some}\ (\delta_{\mu},\bm{\Delta}^{\mathrm{T}})\in\mathbb{S}_{2}(1)\bigg{\}}\leq\sum_{i=1}^{p+1}\mathbb{P}\bigg{\{}f_{(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})}(\bm{x}_{j})<\sigma_{1}\Big{(}1-\frac{c_{4}\tau^{2}R}{\sigma_{1}}\varepsilon(T,p)\cdot i\Big{)},\ \text{for some}\ (\delta_{\mu},\bm{\Delta}^{\mathrm{T}})\in\mathbb{S}_{1}(\sqrt{i})\cap\mathbb{S}_{2}(1)\bigg{\}}\leq\frac{c_{6}}{(p\vee T)^{\tau-1}}.

We conclude that, choosing $\kappa_{1}=\frac{\sigma_{1}}{2}$ and $\kappa_{2}=\frac{c_{4}R}{\sigma_{1}}$ with the constant $R$ from Assumption 2, the bound

\delta L_{T}(\delta_{\mu},\mu^{*}_{j},\bm{\Delta},\bm{\beta}^{*}_{j})\geq\kappa_{1}\|(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})\|_{2}\Big{\{}\|(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})\|_{2}-\kappa_{2}(\tau+1)^{2}\varepsilon(T,p)\|(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})\|_{1}^{2}\Big{\}}

holds uniformly for all $\|(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})\|_{2}\leq 1$ with probability greater than $1-\frac{c_{6}}{(p\vee T)^{\tau-1}}$. $\Box$

A.2 Proof of Theorem 4

Let $\mathcal{F}(\delta_{\mu},\bm{\Delta})=L_{T}(\mu_{j}^{*}+\delta_{\mu},\bm{\beta}_{j}^{*}+\bm{\Delta})-L_{T}(\mu_{j}^{*},\bm{\beta}_{j}^{*})+\lambda_{T}(\|(\mu_{j}^{*}+\delta_{\mu},(\bm{\beta}_{j}^{*}+\bm{\Delta})^{\mathrm{T}})\|_{1}-\|(\mu_{j}^{*},\bm{\beta}_{j}^{*\mathrm{T}})\|_{1})$ denote the change in the penalized loss. Let $(\hat{\mu}_{j},\hat{\bm{\beta}}_{j})$ be the optimal solution to the penalized loss minimization and define $\hat{\delta}_{\mu}=\hat{\mu}_{j}-\mu_{j}^{*}$, $\hat{\bm{\Delta}}=\hat{\bm{\beta}}_{j}-\bm{\beta}_{j}^{*}$. We first establish some properties of the global optimizer $(\hat{\mu}_{j},\hat{\bm{\beta}}_{j})$.

Notice that $L_{T}(\cdot,\cdot)$ is a convex function. Then

L_{T}(\mu_{j}^{*}+\hat{\delta}_{\mu},\bm{\beta}_{j}^{*}+\hat{\bm{\Delta}})-L_{T}(\mu_{j}^{*},\bm{\beta}_{j}^{*})\geq\langle\nabla_{\bm{\beta}}L_{T}(\mu_{j}^{*},\bm{\beta}_{j}^{*}),\hat{\bm{\Delta}}\rangle+\nabla_{\mu}L_{T}(\mu_{j}^{*},\bm{\beta}_{j}^{*})\cdot\hat{\delta}_{\mu}\geq-\frac{\lambda_{T}}{2}\|(\hat{\delta}_{\mu},\hat{\bm{\Delta}}^{\mathrm{T}})\|_{1}.

Combining this with $\mathcal{F}(\hat{\delta}_{\mu},\hat{\bm{\Delta}})\leq 0$ gives

-\frac{\lambda_{T}}{2}\|(\hat{\delta}_{\mu},\hat{\bm{\Delta}}^{\mathrm{T}})\|_{1}\leq\lambda_{T}\Big{(}\|(\mu_{j}^{*},\bm{\beta}_{j}^{*\mathrm{T}})\|_{1}-\|(\mu_{j}^{*}+\hat{\delta}_{\mu},(\bm{\beta}_{j}^{*}+\hat{\bm{\Delta}})^{\mathrm{T}})\|_{1}\Big{)}.

Owing to

\|(\mu_{j}^{*},\bm{\beta}_{j}^{*\mathrm{T}})\|_{1}=\|(\mu_{j}^{*},\bm{\beta}^{*\mathrm{T}}_{j,S})\|_{1},

\|(\hat{\delta}_{\mu},\hat{\bm{\Delta}}^{\mathrm{T}})\|_{1}=\|(\hat{\delta}_{\mu},\hat{\bm{\Delta}}^{\mathrm{T}}_{S})\|_{1}+\|(0,\hat{\bm{\Delta}}^{\mathrm{T}}_{S^{C}})\|_{1},

\|(\mu_{j}^{*}+\hat{\delta}_{\mu},(\bm{\beta}_{j}^{*}+\hat{\bm{\Delta}})^{\mathrm{T}})\|_{1}=\|(\mu_{j}^{*}+\hat{\delta}_{\mu},(\bm{\beta}_{j,S}^{*}+\hat{\bm{\Delta}}_{S})^{\mathrm{T}})\|_{1}+\|(0,(\bm{\beta}_{j,S^{C}}^{*}+\hat{\bm{\Delta}}_{S^{C}})^{\mathrm{T}})\|_{1},

\|(\hat{\delta}_{\mu},\hat{\bm{\Delta}}_{S}^{\mathrm{T}})\|_{1}\geq\|(\mu_{j}^{*},\bm{\beta}^{*\mathrm{T}}_{j,S})\|_{1}-\|(\mu_{j}^{*}+\hat{\delta}_{\mu},(\bm{\beta}_{j}^{*}+\hat{\bm{\Delta}})^{\mathrm{T}}_{S})\|_{1},

where the subscript $S$ denotes that the vector is restricted to the set $S$, we have

\|\hat{\bm{\Delta}}_{S^{C}}\|_{1}\leq 3\|(\hat{\delta}_{\mu},\hat{\bm{\Delta}}_{S}^{\mathrm{T}})\|_{1}\ \ \text{and}\ \ \|(\hat{\delta}_{\mu},\hat{\bm{\Delta}}^{\mathrm{T}})\|_{1}\leq 4\|(\hat{\delta}_{\mu},\hat{\bm{\Delta}}_{S}^{\mathrm{T}})\|_{1}\leq 4\sqrt{s}\|(\hat{\delta}_{\mu},\hat{\bm{\Delta}}_{S}^{\mathrm{T}})\|_{2}.

Consider the set $K(r,S)=\{(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})\in\mathbb{R}^{p+1}:\|(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})\|_{2}=r,\ \|(0,\bm{\Delta}^{\mathrm{T}}_{S^{C}})\|_{1}\leq 4\|(\delta_{\mu},\bm{\Delta}_{S}^{\mathrm{T}})\|_{1}\}$ for $0<r<1$. With the restricted strong convexity provided by Theorem 3, we have for any $(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})\in K(r,S)$,

L_{T}(\mu_{j}^{*}+\delta_{\mu},\bm{\beta}_{j}^{*}+\bm{\Delta})-L_{T}(\mu_{j}^{*},\bm{\beta}_{j}^{*})-\langle\nabla_{\bm{\beta}}L_{T}(\mu_{j}^{*},\bm{\beta}_{j}^{*}),\bm{\Delta}\rangle-\nabla_{\mu}L_{T}(\mu_{j}^{*},\bm{\beta}_{j}^{*})\delta_{\mu}\geq\kappa_{1}\|(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})\|_{2}\Big{\{}\|(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})\|_{2}-\kappa_{2}\tau^{2}\varepsilon(T,p)\|(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})\|_{1}^{2}\Big{\}}\geq\kappa_{1}\|(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})\|_{2}\Big{\{}\|(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})\|_{2}-s\kappa_{2}\tau^{2}\varepsilon(T,p)\|(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})\|_{2}^{2}\Big{\}}

holds with probability greater than $1-\frac{c_{6}}{(p\vee T)^{\tau}}$, where $\varepsilon(T,p)=\sqrt{\frac{\log(p\vee T)}{T}}$. Further, conditioning on the event

G_{j}=\Big{\{}\lambda_{T}\geq 2h_{\max}\Big{|}\frac{1}{T}\int_{0}^{T}W_{j}(t)\Big{(}h_{j}(\mu_{j}^{*}+\langle\bm{\beta}^{*}_{j},\bm{x}_{j}(t)\rangle)\mathrm{d}t-\mathrm{d}N_{j}(t)\Big{)}\Big{|}\ \text{and}\ \lambda_{T}\geq 2h_{\max}\Big{|}\frac{1}{T}\int_{0}^{T}W_{j}(t)\Big{(}h_{j}(\mu_{j}^{*}+\langle\bm{\beta}^{*}_{j},\bm{x}_{j}(t)\rangle)x_{j,k}(t)\mathrm{d}t-x_{j,k}(t)\mathrm{d}N_{j}(t)\Big{)}\Big{|},\ \forall k\in[p]\Big{\}},

and supposing that $\|(\hat{\delta}_{\mu},\hat{\bm{\Delta}}^{\mathrm{T}})\|_{2}=r<1$, we can derive that

\mathcal{F}(\hat{\delta}_{\mu},\hat{\bm{\Delta}})=L_{T}(\mu_{j}^{*}+\hat{\delta}_{\mu},\bm{\beta}_{j}^{*}+\hat{\bm{\Delta}})-L_{T}(\mu_{j}^{*},\bm{\beta}_{j}^{*})+\lambda_{T}\Big{(}\|(\mu_{j}^{*}+\hat{\delta}_{\mu},(\bm{\beta}_{j}^{*}+\hat{\bm{\Delta}})_{S}^{\mathrm{T}})\|_{1}+\|(0,\hat{\bm{\Delta}}^{\mathrm{T}}_{S^{C}})\|_{1}-\|(\mu_{j}^{*},\bm{\beta}_{j}^{*\mathrm{T}})\|_{1}\Big{)}

\geq-\frac{\lambda_{T}}{2}\|(\hat{\delta}_{\mu},\hat{\bm{\Delta}}^{\mathrm{T}})\|_{1}+\kappa_{1}\|(\hat{\delta}_{\mu},\hat{\bm{\Delta}}^{\mathrm{T}})\|_{2}\Big{\{}\|(\hat{\delta}_{\mu},\hat{\bm{\Delta}}^{\mathrm{T}})\|_{2}-s\kappa_{2}\tau^{2}\varepsilon(T,p)\|(\hat{\delta}_{\mu},\hat{\bm{\Delta}}^{\mathrm{T}})\|_{2}^{2}\Big{\}}+\lambda_{T}\big{(}\|(0,\hat{\bm{\Delta}}_{S^{C}}^{\mathrm{T}})\|_{1}-\|(\hat{\delta}_{\mu},\hat{\bm{\Delta}}_{S}^{\mathrm{T}})\|_{1}\big{)}

\geq\kappa_{1}\|(\hat{\delta}_{\mu},\hat{\bm{\Delta}}^{\mathrm{T}})\|_{2}\Big{\{}\|(\hat{\delta}_{\mu},\hat{\bm{\Delta}}^{\mathrm{T}})\|_{2}-s\kappa_{2}\tau^{2}\varepsilon(T,p)\|(\hat{\delta}_{\mu},\hat{\bm{\Delta}}^{\mathrm{T}})\|_{2}^{2}\Big{\}}-\frac{3\sqrt{s}\lambda_{T}}{2}\|(\hat{\delta}_{\mu},\hat{\bm{\Delta}}^{\mathrm{T}})\|_{2}

\geq\Big{(}\frac{\kappa_{1}}{2}\|(\hat{\delta}_{\mu},\hat{\bm{\Delta}}^{\mathrm{T}})\|_{2}-3\sqrt{s}\lambda_{T}\Big{)}\|(\hat{\delta}_{\mu},\hat{\bm{\Delta}}^{\mathrm{T}})\|_{2},

where the last inequality comes from $(\hat{\delta}_{\mu},\hat{\bm{\Delta}}^{\mathrm{T}})\in K(r,S)$, the fact that $s\kappa_{2}\tau^{2}\varepsilon(T,p)<\frac{1}{2}$ for $T$ large enough, and the choice of $\lambda_{T}$. To derive the final result, we need to show that $\|(\hat{\delta}_{\mu},\hat{\bm{\Delta}}^{\mathrm{T}})\|_{2}\leq r<1$. Consider the choice $r=\frac{6\sqrt{s}\lambda_{T}}{\kappa_{1}}<1$. In this case, $\mathcal{F}(\delta_{\mu},\bm{\Delta})>0$ for every $(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})\in K(r,S)$. If $\|(\hat{\delta}_{\mu},\hat{\bm{\Delta}}^{\mathrm{T}})\|_{2}>r$, then there exists a vector $(a\hat{\delta}_{\mu},a\hat{\bm{\Delta}}^{\mathrm{T}})$ with $0<a<1$ and $\|(a\hat{\delta}_{\mu},a\hat{\bm{\Delta}}^{\mathrm{T}})\|_{2}=r$. As $\mathcal{F}$ is convex,

\mathcal{F}(a\hat{\delta}_{\mu},a\hat{\bm{\Delta}})\leq a\mathcal{F}(\hat{\delta}_{\mu},\hat{\bm{\Delta}})+(1-a)\mathcal{F}(0,\bm{0})=a\mathcal{F}(\hat{\delta}_{\mu},\hat{\bm{\Delta}})\leq 0.

Note that $(a\hat{\delta}_{\mu},a\hat{\bm{\Delta}}^{\mathrm{T}})\in K(r,S)$, which contradicts $\mathcal{F}>0$ on $K(r,S)$. We conclude that $\|(\hat{\delta}_{\mu},\hat{\bm{\Delta}}^{\mathrm{T}})\|_{2}\leq\frac{6\sqrt{s}\lambda_{T}}{\kappa_{1}}$. $\Box$
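Although the argument above is deterministic given the events involved, it may help to see the estimator it analyzes in computational form. The following is a minimal proximal-gradient (ISTA) sketch for an $\ell_{1}$-penalized, discretized point-process likelihood; the softplus link, the time grid, the synthetic covariates, and all tuning constants are our own illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softplus(x):
    # h(x) = log(1 + e^x): a smooth, positive, log-concave link (assumed).
    return np.logaddexp(0.0, x)

def sigmoid(x):
    # h'(x) for the softplus link.
    return 1.0 / (1.0 + np.exp(-np.clip(x, -30.0, 30.0)))

def soft_threshold(v, t):
    # Proximal operator of t * ||.||_1.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

# Synthetic discretized data: n time bins of width dt, p covariate
# processes standing in for x_j(t), and event counts dN per bin.
n, p, dt = 2000, 20, 0.05
X = rng.poisson(0.3, size=(n, p)).astype(float)
theta_true = np.zeros(p + 1)
theta_true[0], theta_true[1:4] = -1.0, 0.6
dN = rng.poisson(softplus(theta_true[0] + X @ theta_true[1:]) * dt)

def grad(theta):
    # Gradient of the discretized negative log-likelihood
    # (1/T) * sum_t [ h(eta_t) * dt - log h(eta_t) * dN_t ],  T = n * dt.
    eta = theta[0] + X @ theta[1:]
    w = sigmoid(eta) * dt - sigmoid(eta) / softplus(eta) * dN
    return np.concatenate(([w.sum()], X.T @ w)) / (n * dt)

lam, step, theta = 0.05, 0.5, np.zeros(p + 1)
for _ in range(500):  # ISTA: gradient step followed by soft-thresholding
    theta = soft_threshold(theta - step * grad(theta), step * lam)

print("estimated support:", np.nonzero(theta[1:])[0])
```

Because the loss is convex, any of the scalable solvers mentioned in the introduction (ADMM, coordinate descent) could replace ISTA here; the sketch only illustrates the penalized minimization whose solution the theorem bounds.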

A.3 Proof of Lemma 1

Recall the well-known Bernstein-type inequality for point processes, which can be found in Shorack and Wellner (2009):

\mathbb{P}\Big{(}|M_{T}|\geq\sqrt{2vx}+\frac{Bx}{3}\ \text{and}\ \int_{0}^{T}g^{2}(t)\text{d}\Lambda(t)\leq v\ \text{and}\ \sup_{t\in[0,T]}|g(t)|\leq B\Big{)}\leq 2e^{-x}, (16)

where $g(t)$ is a predictable process, $\Lambda(t)$ is the compensator of the point process $N(t)$, and $M_{T}=\int_{0}^{T}g(t)(\text{d}N(t)-\text{d}\Lambda(t))$.
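As a quick numerical sanity check of (16) in the simplest case (our choice: $g\equiv 1$ and $N$ a homogeneous Poisson process, so that $\Lambda(t)=\lambda t$, $v=\lambda T$, and $B=1$), the following sketch estimates the two-sided tail by Monte Carlo and compares it with $2e^{-x}$.

```python
import numpy as np

rng = np.random.default_rng(1)
lam, T, x, reps = 2.0, 50.0, 3.0, 200_000

# With g(t) = 1 and N a homogeneous Poisson(lam) process, the martingale is
# M_T = N(T) - lam * T, and one may take v = lam * T and B = 1 in (16).
v, B = lam * T, 1.0
threshold = np.sqrt(2.0 * v * x) + B * x / 3.0

N_T = rng.poisson(lam * T, size=reps)
empirical = np.mean(np.abs(N_T - lam * T) >= threshold)
print(f"empirical tail {empirical:.5f} <= 2*exp(-x) = {2.0 * np.exp(-x):.5f}")
```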

Conditioning on the event $C_{\tau}$ defined in Lemma 3, define the predictable process $g_{j,k}(t)=W_{j}(t)x_{j,k}(t)$, let $\Lambda(t)=\Lambda_{j}(t)=\int_{0}^{t}h_{j}(\mu^{*}_{j}+\langle\bm{\beta}^{*}_{j},\bm{x}_{j}(s)\rangle)\text{d}s$, and set $M_{k,T}=\int_{0}^{T}g_{j,k}(t)(\text{d}N_{j}(t)-\text{d}\Lambda_{j}(t))$; then $\int_{0}^{T}g_{j,k}^{2}(t)\text{d}\Lambda_{j}(t)\leq\sigma^{2}_{4}\int_{0}^{T}x^{2}_{j,k}(t)\text{d}\Lambda_{j}(t)$. With Lemma 4, we have

\int_{0}^{T}g_{j,k}^{2}(t)\text{d}\Lambda_{j}(t)\leq h_{\max}\sigma^{2}_{4}\cdot\Big{(}\Big{|}\int_{0}^{T}x^{2}_{j,k}(t)\text{d}t-T\mathbb{E}[x_{j,k}^{2}(t)]\Big{|}+T\mathbb{E}[x_{j,k}^{2}(t)]\Big{)}\leq c_{2}T(\mathbb{E}[x^{2}_{j,k}(t)]+1)

holds simultaneously for all $k\in[p]$ with probability greater than $1-\frac{c_{1}}{(p\vee T)^{\tau}}$. On the event $C_{\tau}$ in Lemma 3, we have $\sup_{k,t\in[0,T]}|g_{j,k}(t)|\lesssim\sup_{k,t\in[0,T]}|x_{j,k}(t)|\leq(\tau+2)\kappa_{\max}\log(p\vee T)$.

Notice that

|M_{k,T}|=\Big{|}\int_{0}^{T}g_{j,k}(t)\big{(}\text{d}N_{j}(t)-\text{d}\Lambda_{j}(t)\big{)}\Big{|}.

Taking $v=c_{2}T(\mathbb{E}[x^{2}_{j,k}(t)]+1)$ and $B=(\tau+2)\kappa_{\max}\log(p\vee T)$ in (16), we have

\mathbb{P}\Big{(}|M_{k,T}|\geq\sqrt{c_{2}T(\mathbb{E}[x^{2}_{j,k}(t)]+1)x}+\frac{(\tau+2)\kappa_{\max}\log(p\vee T)}{3}x\Big{)}\leq 2e^{-x}

and

\mathbb{P}\Big{(}|M_{k,T}|\geq\sqrt{c_{2}T(\mathbb{E}[x^{2}_{j,k}(t)]+1)x}+\frac{(\tau+2)\kappa_{\max}\log(p\vee T)}{3}x,\ \text{for some}\ k\in[p]\Big{)}\leq 2pe^{-x}.

Taking $x=(\tau+2)\log(p\vee T)$ and noticing that $\sqrt{c_{2}T(\mathbb{E}[x^{2}_{j,k}(t)]+1)x}$ dominates $\frac{(\tau+2)\kappa_{\max}\log(p\vee T)}{3}x$, we obtain that

\Big{|}\frac{1}{T}\int_{0}^{T}W_{j}(t)\Big{(}h_{j}(\mu_{j}^{*}+\langle\bm{\beta}^{*}_{j},\bm{x}_{j}(t)\rangle)x_{j,k}(t)\text{d}t-x_{j,k}(t)\text{d}N_{j}(t)\Big{)}\Big{|}\leq c_{2}\sqrt{\frac{\tau\log(p\vee T)}{T}}

holds for all $k\in[p]$ with probability greater than $1-\frac{c_{2}}{(p\vee T)^{\tau}}$, where $c_{1},c_{2}$ are constants that do not depend on $p$ or $T$. The second inequality can be proved similarly. $\Box$

A.4 Ancillary Results

Lemma 2

Let $N_{j}(T)$ be the number of events on node $j$ in the time interval $[0,T]$. There exists a constant $C$ that depends only on $h_{\max}$ such that, with probability greater than $1-o(e^{-CT})$, we have

N_{j}(T)\leq Ceh_{\max}T,\ \forall j=1,\cdots,p.

proof: By Poisson thinning, we can couple $N_{j}(T)$ with a Poisson random variable $\tilde{N}_{j}(T)$ of mean $h_{\max}T$ such that

N_{j}(T)\leq\tilde{N}_{j}(T),

and for $C\geq 2$,

\mathbb{P}(\tilde{N}_{j}(T)>Ceh_{\max}T\ \text{for some}\ j\in[p])\leq p\Big{(}\sum_{k=[Ceh_{\max}T]+1}^{\infty}\frac{(Th_{\max})^{k}}{k!}e^{-Th_{\max}}\Big{)}\leq\frac{2p(Th_{\max})^{CeTh_{\max}}}{([Ceh_{\max}T]+1)!}e^{-Th_{\max}}\lesssim\frac{p(Th_{\max})^{CeTh_{\max}}}{(CeTh_{\max})^{CeTh_{\max}}e^{-CeTh_{\max}}\sqrt{CT}}e^{-Th_{\max}}=\frac{1}{\sqrt{CT}}\exp\Big{\{}-(Ce\log C-C+1)Th_{\max}+\log p\Big{\}},

where the third inequality uses Stirling's formula. Because $\log p=o(T)$, we can pick $C$ depending only on $h_{\max}$ such that, with probability greater than $1-o(e^{-CT})$, we have

N_{j}(T)\leq Ceh_{\max}T,\ \forall j=1,\cdots,p.

\Box

Define the event $B=\{N_{j}(T)\leq Ceh_{\max}T,\ \forall j=1,\cdots,p\}$; then $\mathbb{P}(B)=1-o(e^{-CT})$.
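A quick numerical check of this Poisson tail bound (ours; it assumes scipy is available and uses $C=2$, $h_{\max}=1.5$ as illustrative values):

```python
import numpy as np
from scipy.stats import poisson  # assumed available

h_max, C = 1.5, 2.0
for T in (10.0, 50.0, 200.0):
    mu = h_max * T                       # mean of the dominating Poisson variable
    k = np.floor(C * np.e * h_max * T)   # threshold C*e*h_max*T from Lemma 2
    print(f"T = {T:5.0f}:  P(N > Ce*h_max*T) = {poisson.sf(k, mu):.3e}")
```

The printed tails decay geometrically in $T$, consistent with the exponent $(Ce\log C-C+1)Th_{\max}$ in the proof.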

Lemma 3

Under Assumption 1, we have

\mathbb{P}\Big{(}\sup_{k\in[p],t\in[0,T]}|x_{j,k}(t)|>\kappa_{\max}(\tau+2)\log(p\vee T)\Big{)}\lesssim\frac{Tp}{\sqrt{\log(p\vee T)}}\exp\Big{(}-(\tau+2)\log(p\vee T)\Big{)}=o\Big{(}\frac{1}{(p\vee T)^{\tau}}\Big{)} (17)

holds for $\tau\geq 1$.

proof:

|x_{j,k}(t)|=\Big{|}\int_{(t-\kappa_{\operatorname{supp}})\vee 0}^{t}\kappa_{j,k}(t-s)\text{d}N_{k}(s)\Big{|}\leq\kappa_{\max}\int_{t-\kappa_{\operatorname{supp}}}^{t}\text{d}N_{k}(s)\leq\kappa_{\max}\Big{(}N_{k}\big{(}T\wedge(v+2)\kappa_{\operatorname{supp}}\big{)}-N_{k}(v\kappa_{\operatorname{supp}})\Big{)},

where $v$ is the integer with $v\kappa_{\operatorname{supp}}\leq t-\kappa_{\operatorname{supp}}<t\leq T\wedge(v+2)\kappa_{\operatorname{supp}}$. Thus

\mathbb{P}\Big{(}\sup_{k,t\in[0,T]}|x_{j,k}(t)|>\eta\Big{)}\leq\mathbb{P}\Big{(}\max_{k,v}[N_{k}(T\wedge(v+2)\kappa_{\operatorname{supp}})-N_{k}(v\kappa_{\operatorname{supp}})]>\frac{\eta}{\kappa_{\max}}\Big{)}\lesssim Tp\frac{(h_{\max}\kappa_{\operatorname{supp}})^{\frac{\eta}{\kappa_{\max}}}}{(\frac{\eta}{\kappa_{\max}})!}e^{-h_{\max}\kappa_{\operatorname{supp}}}\lesssim\frac{Tp}{\sqrt{\eta}}\exp\Big{(}\frac{\eta}{\kappa_{\max}}\big{(}1+\log(h_{\max}\kappa_{\operatorname{supp}})-\log\frac{\eta}{\kappa_{\max}}\big{)}\Big{)}.

The desired inequality follows by taking $\eta=\kappa_{\max}(\tau+2)\log(p\vee T)$ for $T,p$ both large enough:

\mathbb{P}\Big{(}\sup_{k,t\in[0,T]}|x_{j,k}(t)|>\kappa_{\max}(\tau+2)\log(p\vee T)\Big{)}\lesssim\frac{Tp}{\sqrt{\log(p\vee T)}}\exp\Big{(}-(\tau+2)\log(p\vee T)\Big{)}=o\Big{(}\frac{1}{(p\vee T)^{\tau}}\Big{)}.

A similar argument shows that

\mathbb{P}\Big{(}\sup_{k,t\in[0,T]\ \text{and}\ \text{d}N_{j,k}(t)=0}|x_{j,k}^{\prime}(t)|>\kappa_{\max}(\tau+2)\log(p\vee T)\Big{)}\lesssim\frac{Tp}{\sqrt{\log(p\vee T)}}\exp\Big{(}-(\tau+2)\log(p\vee T)\Big{)}=o\Big{(}\frac{1}{(p\vee T)^{\tau}}\Big{)}.

\Box

For simplicity of notation, we define the event

C_{\tau}=\Big{\{}\sup_{k,t\in[0,T]}|x_{j,k}(t)|\leq\kappa_{\max}(\tau+2)\log(p\vee T),\ \sup_{k,t\in[0,T]\ \text{and}\ \text{d}N_{j,k}(t)=0}|x_{j,k}^{\prime}(t)|\leq\kappa_{\max}(\tau+2)\log(p\vee T)\Big{\}}

so that $\mathbb{P}(C_{\tau})=1-o\big{(}\frac{1}{(p\vee T)^{\tau}}\big{)}$.
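To illustrate the quantity controlled by Lemma 3, the following sketch (ours; the homogeneous Poisson input and the boxcar kernel are illustrative assumptions) simulates $x_{j,k}(t)=\int\kappa_{j,k}(t-s)\text{d}N_{k}(s)$ and evaluates its supremum on a fine grid, comparing it with the logarithmic scale in (17).

```python
import numpy as np

rng = np.random.default_rng(2)
h_max, T = 1.0, 500.0
kappa_supp, kappa_max = 1.0, 0.8

# Events of N_k: a homogeneous Poisson(h_max) process on [0, T].
events = np.sort(rng.uniform(0.0, T, rng.poisson(h_max * T)))

def x_jk(t):
    # x_{j,k}(t) = sum_s kappa(t - s) over events s, with the boxcar kernel
    # kappa(u) = kappa_max * 1{0 <= u <= kappa_supp} (an illustrative choice).
    lo = np.searchsorted(events, t - kappa_supp)
    hi = np.searchsorted(events, t, side="right")
    return kappa_max * (hi - lo)

grid = np.linspace(0.0, T, 50_001)
sup_x = max(x_jk(t) for t in grid)
print(f"sup_t x(t) = {sup_x:.2f};  kappa_max * log scale = "
      f"{kappa_max * np.log(T):.2f}")
```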

Theorem 5

Suppose that $N$ is a temporal point process with an intensity satisfying the assumptions above. Suppose also that the functions $f_{i,k}(\cdot)$ have uniformly bounded support and

\sup_{i,k\in[p]}\|f_{i,k}\|_{\infty}\equiv\sup_{i,k\in[p]}\max_{x}|f_{i,k}(x)|\leq C_{f}.

Then, for $r>0$, we have

\mathbb{P}\bigg{(}\bigcup_{1\leq i\leq k\leq p}\Big{\{}|\bar{y}_{i,k}-\mathbb{E}\bar{y}_{i,k}|\geq\epsilon\Big{\}}\bigg{)}\leq c_{1}p^{2}T\exp\Big{(}-\frac{(\epsilon T)^{r/(3r+1)}}{c_{2}}\Big{)}+p^{2}\exp\Big{(}-c_{3}\epsilon^{2}T\Big{)}+p^{2}\exp\Big{[}-c_{4}T\exp\Big{(}\frac{(\epsilon T)^{r(2r+1)/(3r+1)^{2}}}{c_{5}[\log(\epsilon T)]^{r/(3r+1)}}\Big{)}\Big{]},

where

\bar{y}_{i,k}\equiv\frac{1}{T}\int_{0}^{T}\int_{0}^{T}f_{i,k}(t-t^{\prime})\text{d}N_{i}(t)\text{d}N_{k}(t^{\prime}),\ \ 1\leq i,k\leq p,

and $c_{1},c_{2},c_{3},c_{4},c_{5}$ are positive constants that do not depend on $p$ or $T$. Taking $\epsilon=\sqrt{\frac{(\tau+2)\log(p\vee T)}{c_{3}T}}$ and $r=1$, we have

\mathbb{P}\bigg{(}\bigcup_{1\leq i\leq k\leq p}\Big{\{}|\bar{y}_{i,k}-\mathbb{E}\bar{y}_{i,k}|\geq\sqrt{\frac{(\tau+2)\log(p\vee T)}{T}}\Big{\}}\bigg{)}\leq\frac{c_{6}}{(p\vee T)^{\tau}}.

The inequality above stems from the proof of Theorem 3 in Chen et al. (2017a); the exponent $r>0$ is introduced in an assumption on the tail of the transfer kernel (see Assumption 2 of Chen et al. (2017a) for more details). In our setting, the transfer kernels are compactly supported, so we can choose any $r>0$.
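Because $f_{i,k}$ has bounded support, $\bar{y}_{i,k}$ reduces in practice to a double sum over nearby event pairs. A minimal sketch (ours; the triangular $f$, its support width $w$, and the Poisson event times are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
T, w, lam = 200.0, 1.0, 2.0   # horizon, support width of f, event rate (assumed)

def f(u):
    # Triangular bump supported on [-w, w]; any bounded, compactly
    # supported f works here.
    return np.maximum(0.0, 1.0 - np.abs(u) / w)

t_i = np.sort(rng.uniform(0.0, T, rng.poisson(lam * T)))  # events of N_i
t_k = np.sort(rng.uniform(0.0, T, rng.poisson(lam * T)))  # events of N_k

# y_bar = (1/T) * sum_{a,b} f(t_i[a] - t_k[b]); only pairs within w contribute.
y_bar = 0.0
for s in t_i:
    lo, hi = np.searchsorted(t_k, (s - w, s + w))
    y_bar += f(s - t_k[lo:hi]).sum()
y_bar /= T

# For independent Poisson streams, E[y_bar] is about lam^2 * integral of f.
print(f"y_bar = {y_bar:.3f}, reference lam^2 * w = {lam ** 2 * w:.3f}")
```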

Lemma 4

For $T$ large enough, there exist constants $c_{1},c_{2}$ that do not depend on $p$ or $T$ such that

\mathbb{P}\bigg{(}\bigcup_{1\leq i\leq k\leq p}\bigg{[}\bigg{|}\frac{1}{T}\int_{0}^{T}x_{j,i}(t)x_{j,k}(t)\text{d}t-\mathbb{E}(x_{j,i}(t)x_{j,k}(t))\bigg{|}\geq c_{1}\sqrt{\frac{(\tau+2)\log(p\vee T)}{T}}\bigg{]}\bigg{)}\leq\frac{c_{2}}{(p\vee T)^{\tau}}

holds.

Proof.

\frac{1}{T}\int_{0}^{T}x_{j,i}(t)x_{j,k}(t)\text{d}t=\frac{1}{T}\int_{0}^{T}\Big{(}\int_{0}^{t}\kappa_{j,i}(t-s)\text{d}N_{i}(s)\Big{)}\Big{(}\int_{0}^{T}\kappa_{j,k}(t-r)\text{d}N_{k}(r)\Big{)}\text{d}t=\frac{1}{T}\int_{0}^{T}\int_{0}^{T}\Big{(}\int_{s\vee r}^{T}\kappa_{j,i}(t-s)\kappa_{j,k}(t-r)\text{d}t\Big{)}\text{d}N_{i}(s)\text{d}N_{k}(r).

In general, $\int_{s\vee r}^{T}\kappa_{j,i}(t-s)\kappa_{j,k}(t-r)\text{d}t$ is not a function of $s-r$ alone. However, for all $i,j\in[p]$, $\kappa_{j,i}(t)$ is supported on $[0,\kappa_{\operatorname{supp}}]$, and we observe that for $s\vee r<T-\kappa_{\operatorname{supp}}$,

\int_{s\vee r}^{T}\kappa_{j,i}(t-s)\kappa_{j,k}(t-r)\text{d}t=\begin{cases}\int_{0}^{\kappa_{\operatorname{supp}}}\kappa_{j,i}(t)\kappa_{j,k}(t+s-r)\text{d}t\ \ \ &s\geq r\\ \int_{0}^{\kappa_{\operatorname{supp}}}\kappa_{j,i}(t+r-s)\kappa_{j,k}(t)\text{d}t\ \ \ &s<r,\end{cases}

which depends only on $s-r$. Thus

\frac{1}{T}\int_{0}^{T}\int_{0}^{T}\Big{(}\int_{s\vee r}^{T}\kappa_{j,i}(t-s)\kappa_{j,k}(t-r)\text{d}t\Big{)}\text{d}N_{i}(s)\text{d}N_{k}(r)

=\frac{1}{T}\int_{0}^{T-\kappa_{\operatorname{supp}}}\int_{0}^{T-\kappa_{\operatorname{supp}}}\Big{(}\int_{s\vee r}^{T}\kappa_{j,i}(t-s)\kappa_{j,k}(t-r)\text{d}t\Big{)}\text{d}N_{i}(s)\text{d}N_{k}(r)

+\frac{1}{T}\int_{T-\kappa_{\operatorname{supp}}}^{T}\int_{0}^{T}\Big{(}\int_{s\vee r}^{T}\kappa_{j,i}(t-s)\kappa_{j,k}(t-r)\text{d}t\Big{)}\text{d}N_{i}(s)\text{d}N_{k}(r)

+\frac{1}{T}\int_{0}^{T-\kappa_{\operatorname{supp}}}\int_{T-\kappa_{\operatorname{supp}}}^{T}\Big{(}\int_{s\vee r}^{T}\kappa_{j,i}(t-s)\kappa_{j,k}(t-r)\text{d}t\Big{)}\text{d}N_{i}(s)\text{d}N_{k}(r)

=I^{(1)}_{ik}+I^{(2)}_{ik}+I^{(3)}_{ik}.

Let

f_{i,k}(s-r)=\begin{cases}\int_{0}^{\kappa_{\operatorname{supp}}}\kappa_{j,i}(t)\kappa_{j,k}(t+s-r)\text{d}t\ \ \ &s-r\geq 0\\ \int_{0}^{\kappa_{\operatorname{supp}}}\kappa_{j,i}(t+r-s)\kappa_{j,k}(t)\text{d}t\ \ \ &s-r<0.\end{cases}

With Theorem 5, we have

\mathbb{P}\Big{(}\bigcup_{1\leq i\leq k\leq p}|I^{(1)}_{ik}-\mathbb{E}[I^{(1)}_{ik}]|\geq c_{3}\sqrt{\frac{\tau\log(p\vee T)}{T}}\Big{)}\leq\frac{c_{4}}{(p\vee T)^{\tau}}.

As for $I^{(2)}_{ik}$ and $I^{(3)}_{ik}$, we have

|I^{(2)}_{ik}|=\bigg{|}\frac{1}{T}\int_{T-\kappa_{\operatorname{supp}}}^{T}\int_{0}^{T}\Big{(}\int_{s\vee r}^{T}\kappa_{j,i}(t-s)\kappa_{j,k}(t-r)\text{d}t\Big{)}\text{d}N_{i}(s)\text{d}N_{k}(r)\bigg{|}\leq\frac{1}{T}\int_{T-\kappa_{\operatorname{supp}}}^{T}\int_{T-2\kappa_{\operatorname{supp}}}^{T}\Big{(}\int_{s\vee r}^{T}\kappa^{2}_{\max}\text{d}t\Big{)}\text{d}N_{i}(s)\text{d}N_{k}(r)\leq\frac{8\kappa_{\operatorname{supp}}^{3}\kappa^{2}_{\max}}{T}

and

|I^{(3)}_{ik}|=\bigg{|}\frac{1}{T}\int_{0}^{T-\kappa_{\operatorname{supp}}}\int_{T-\kappa_{\operatorname{supp}}}^{T}\Big{(}\int_{s\vee r}^{T}\kappa_{j,i}(t-s)\kappa_{j,k}(t-r)\text{d}t\Big{)}\text{d}N_{i}(s)\text{d}N_{k}(r)\bigg{|}\leq\frac{1}{T}\int_{T-2\kappa_{\operatorname{supp}}}^{T-\kappa_{\operatorname{supp}}}\int_{T-\kappa_{\operatorname{supp}}}^{T}\Big{(}\int_{s\vee r}^{T}\kappa^{2}_{\max}\text{d}t\Big{)}\text{d}N_{i}(s)\text{d}N_{k}(r)\leq\frac{4\kappa_{\operatorname{supp}}^{3}\kappa^{2}_{\max}}{T}.

Similarly we have

\mathbb{E}[I^{(2)}_{ik}]\leq\frac{8\kappa_{\operatorname{supp}}^{3}\kappa^{2}_{\max}}{T},\ \ \mathbb{E}[I^{(3)}_{ik}]\leq\frac{4\kappa_{\operatorname{supp}}^{3}\kappa^{2}_{\max}}{T}.

Thus $|I^{(2)}_{ik}-\mathbb{E}[I^{(2)}_{ik}]|+|I^{(3)}_{ik}-\mathbb{E}[I^{(3)}_{ik}]|=O(T^{-1})$, and we have

\mathbb{P}\bigg{(}\bigcup_{1\leq i\leq k\leq p}\bigg{[}\bigg{|}\frac{1}{T}\int_{0}^{T}x_{j,i}(t)x_{j,k}(t)\text{d}t-\mathbb{E}(x_{j,i}(t)x_{j,k}(t))\bigg{|}\geq c_{3}\sqrt{\frac{\tau\log(p\vee T)}{T}}\bigg{]}\bigg{)}\leq\frac{c_{6}}{(p\vee T)^{\tau}}.

\Box

Lemma 5

Under Assumptions 1 and 2, we have

\mathbb{P}\Big{(}\max_{k}\Big{|}\frac{1}{T}\int_{0}^{T}x_{j,k}(t)\text{d}t-\mathbb{E}[x_{j,k}(t)]\Big{|}\geq c_{1}\sqrt{\frac{\tau\log(p\vee T)}{T}}\Big{)}\leq\frac{c_{2}}{(p\vee T)^{\tau}},

where $c_{1},c_{2}$ are positive constants that do not depend on $p$ or $T$.

proof: Recall the Bernstein-type inequality (16), with $M_{T}=\int_{0}^{T}g(t)(\text{d}N(t)-\text{d}\Lambda(t))$.

Notice that

\frac{1}{T}\int_{0}^{T}(x_{j,k}(t)-\mathbb{E}[x_{j,k}(t)])\text{d}t=\frac{1}{T}\int_{0}^{T}\int_{0}^{t}\kappa_{j,k}(t-s)\Big{(}\text{d}N_{k}(s)-\text{d}\Lambda_{k}(s)\Big{)}\text{d}t=\frac{1}{T}\int_{0}^{T}\Big{(}\int_{s}^{T}\kappa_{j,k}(t-s)\text{d}t\Big{)}(\text{d}N_{k}(s)-\text{d}\Lambda_{k}(s)),

where the last equality is by Fubini's theorem. Let $f_{1}(s)=\int_{s}^{T}\kappa_{j,k}(t-s)\text{d}t$. Since the transition kernel is bounded and supported on a compact set, $f_{1}(s)$ is bounded by a constant $B$ that does not depend on $p$ or $T$. Thus we have

\mathbb{P}\Big{(}\Big{|}\int_{0}^{T}f_{1}(t)(\text{d}N_{j}(t)-\text{d}\Lambda_{j}(t))\Big{|}\geq\sqrt{2B^{2}Tx}+\frac{Bx}{3}\ \text{and}\ \int_{0}^{T}f_{1}^{2}(t)\text{d}\Lambda_{j}(t)\leq B^{2}T\ \text{and}\ \sup_{t\in[0,T]}|f_{1}(t)|\leq B\Big{)}\leq 2e^{-x}.

Taking $x=(\tau+2)\log(p\vee T)$ and applying a union bound over $k\in[p]$, we have

\mathbb{P}\Big{(}\max_{k}\Big{|}\frac{1}{T}\int_{0}^{T}x_{j,k}(t)\text{d}t-\mathbb{E}[x_{j,k}(t)]\Big{|}\geq c_{1}\sqrt{\frac{\tau\log(p\vee T)}{T}}\Big{)}\leq\frac{c_{2}}{(p\vee T)^{\tau}}.

\Box
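The Fubini rearrangement in this proof says that the time average of $x_{j,k}$ equals $\frac{1}{T}\sum_{s}f_{1}(s)$ over the events $s$ of $N_{k}$. A quick numerical check (ours, with the same illustrative boxcar kernel and Poisson inputs as in the sketch after Lemma 3):

```python
import numpy as np

rng = np.random.default_rng(4)
T, lam = 300.0, 2.0
kappa_supp, kappa_max = 1.0, 0.8
events = np.sort(rng.uniform(0.0, T, rng.poisson(lam * T)))  # events of N_k

# Left side: (1/T) * int_0^T x_{j,k}(t) dt on a fine grid, where
# x_{j,k}(t) = kappa_max * #{events in [t - kappa_supp, t]} (boxcar kernel).
m = 300_000
grid = np.linspace(0.0, T, m + 1)
counts = (np.searchsorted(events, grid, side="right")
          - np.searchsorted(events, grid - kappa_supp))
lhs = kappa_max * counts.sum() * (T / m) / T

# Right side: (1/T) * sum_s f1(s), where f1(s) = int_s^T kappa(t - s) dt
# = kappa_max * min(kappa_supp, T - s) for the boxcar kernel.
rhs = kappa_max * np.minimum(kappa_supp, T - events).sum() / T
print(f"time average {lhs:.4f} vs Fubini form {rhs:.4f} "
      f"vs E[x] = {kappa_max * kappa_supp * lam:.4f}")
```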

Lemma 6

For fixed positive constants $r>0$ and $R>0$, let

Z(r)=\sup_{(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})\in\mathbb{S}_{2}(1)\cap\mathbb{S}_{1}(r)}\Big{|}\frac{1}{T}\int_{0}^{T}(\delta_{\mu}+\langle\bm{\Delta},\bm{x}_{j}(t)\rangle)^{2}\text{d}t-\mathbb{E}(\delta_{\mu}+\langle\bm{\Delta},\bm{x}_{j}(t)\rangle)^{2}\Big{|}.

Then for any constant $\tau\geq 1$,

\mathbb{P}\Big{(}Z(r)\geq c_{3}r^{2}(\tau+2)^{2}\sqrt{\frac{\log(p\vee T)}{T}}\Big{)}\leq\frac{c_{1}}{(p\vee T)^{\tau}}.

proof: By Lemmas 4 and 5 above, we have

\mathbb{P}\Big{(}\max_{i,k}\Big{|}\frac{1}{T}\int_{0}^{T}x_{j,i}(t)x_{j,k}(t)\text{d}t-\mathbb{E}[x_{j,i}(t)x_{j,k}(t)]\Big{|}>c_{1}(\tau+2)^{2}\sqrt{\frac{\log(p\vee T)}{T}}\Big{)}\leq\frac{c_{1}}{(p\vee T)^{\tau}},

\mathbb{P}\Big{(}\max_{k}\Big{|}\frac{1}{T}\int_{0}^{T}x_{j,k}(t)\text{d}t-\mathbb{E}[x_{j,k}(t)]\Big{|}>c_{1}(\tau+2)^{2}\sqrt{\frac{\log(p\vee T)}{T}}\Big{)}\leq\frac{c_{2}}{(p\vee T)^{\tau}},

where $c_{1}$ and $c_{2}$ are positive constants that do not depend on $p$ or $T$.

Note that for any $(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})\in\mathbb{S}_{2}(1)\cap\mathbb{S}_{1}(r)$,

\Big{|}\frac{1}{T}\int_{0}^{T}(\delta_{\mu}+\langle\bm{\Delta},\bm{x}_{j}(t)\rangle)^{2}\text{d}t-\mathbb{E}(\delta_{\mu}+\langle\bm{\Delta},\bm{x}_{j}(t)\rangle)^{2}\Big{|}\leq\|(\delta_{\mu},\bm{\Delta}^{\mathrm{T}})\|_{1}^{2}\cdot\Big{(}\max_{i,k}\Big{|}\frac{1}{T}\int_{0}^{T}x_{j,i}(t)x_{j,k}(t)\text{d}t-\mathbb{E}[x_{j,i}(t)x_{j,k}(t)]\Big{|}\vee\max_{k}\Big{|}\frac{1}{T}\int_{0}^{T}x_{j,k}(t)\text{d}t-\mathbb{E}[x_{j,k}(t)]\Big{|}\Big{)}.

Then

\Big{\{}\max_{i,k}\Big{|}\frac{1}{T}\int_{0}^{T}x_{j,i}(t)x_{j,k}(t)\text{d}t-\mathbb{E}[x_{j,i}(t)x_{j,k}(t)]\Big{|}\vee\max_{k}\Big{|}\frac{1}{T}\int_{0}^{T}x_{j,k}(t)\text{d}t-\mathbb{E}[x_{j,k}(t)]\Big{|}\leq\epsilon\Big{\}}\subset\{Z(r)\leq r^{2}\epsilon\}.

Taking $\epsilon=c_{3}(\tau+2)^{2}\sqrt{\frac{\log(p\vee T)}{T}}$ completes the proof. $\Box$