A two-way heterogeneity model for dynamic networks

Binyan Jiang
Department of Applied Mathematics, The Hong Kong Polytechnic University
Hong Kong, [email protected]

Chenlei Leng
Department of Statistics, University of Warwick
United Kingdom, [email protected]

Ting Yan
Department of Statistics, Central China Normal University
China, [email protected]

Qiwei Yao
Department of Statistics, London School of Economics
United Kingdom, [email protected]

Xinyang Yu
Department of Applied Mathematics, The Hong Kong Polytechnic University
Hong Kong, [email protected]

Abstract

Dynamic network data analysis requires jointly modelling individual snapshots and time dynamics. This paper proposes a new two-way heterogeneity model towards this goal. The new model equips each node of the network with two heterogeneity parameters, one to characterize the propensity of forming ties with other nodes and the other to differentiate the tendency of retaining existing ties over time. Though the negative log-likelihood function is non-convex, it is locally convex in a neighbourhood of the true value of the parameter vector. By using a novel method of moments estimator as the initial value, the consistent local maximum likelihood estimator (MLE) can be obtained by a gradient descent algorithm. To establish the upper bound for the estimation error of the MLE, we derive a new uniform deviation bound, which is of independent interest. The usefulness of the model and the associated theory are further supported by extensive simulation and the analysis of some real network data sets.

Keywords: Degree heterogeneity, Dynamic networks, Maximum likelihood estimation, Uniform deviation bound

1 Introduction

Network data featuring prominent interactions between subjects arise in various areas such as biology, economics, engineering, medicine, and social sciences [27, 20]. As a rapidly growing field of active research, statistical modelling of networks aims to capture and understand the linking patterns in these data. A large part of the literature has focused on examining these patterns for canonical, static networks that are observed at a single snapshot. Due to the increasing availability of networks that are observed multiple times, models for dynamic networks evolving in time are of increasing interest now. These models typically assume, among others, that networks observed at different time are independent [28, 1], independent conditionally on some latent processes [4, 23], or drawn sequentially from an exponential random graph model conditional on the previous networks [11, 10, 21].

One of the stylized facts of real-life networks is that their nodes often have different tendencies to form ties and may evolve differently over time. The former is manifested by the fact that the so-called hub nodes have many links while the peripheral nodes have small numbers of connections in, for example, a big social network. The latter becomes evident when some individuals are more active in seeking new ties/friends than the others. In this paper, we refer to these two kinds of heterogeneity as static heterogeneity and dynamic heterogeneity respectively. Also known as degree heterogeneity in the static network literature, static heterogeneity has featured prominently in several popular models widely used in practice including the stochastic block model and its degree-corrected generalization [17]. See also [15, 29, 16, 19], and the references therein. Another common and natural approach to capture the static heterogeneity is to introduce node-specific parameters, one for each node. For single networks, this is often conducted via modelling the logit of the link probability between each pair of nodes as the sum of their heterogeneity parameters. Termed as the $\beta$ -model [2], this model and its generalizations have been extensively studied when a single static network is observed [39, 18, 7, 35, 3, 30].

With $n$ observed networks each having $p$ nodes, the goal of this study is two-fold: (i) We propose a dynamic network model named the two-way heterogeneity model that captures both static heterogeneity and dynamic heterogeneity, and develop the associated inference methodology; (ii) We establish new generic asymptotic results that can be applied or extended to different models with a large number of parameters (in relation to $p$ ). We focus on the scenario that the number of nodes $p$ goes to infinity. Our asymptotic results hold when $np\rightarrow\infty$ , though $n$ may be fixed. The main contributions of our paper can be summarized as follows.

•

We introduce a reparameterization of the general autoregressive network model [14] to accommodate variations in both node degree and dynamic fluctuations. This novel approach can be regarded as an extension of the $\beta$ -model [2] to a dynamic framework. It encompasses two sets of parameters for heterogeneity: one governs static variations, akin to those in the standard $\beta$ -model, while the other addresses dynamic fluctuations. Unlike the general model in [14], which necessitates a large number of network observations (i.e. $n\rightarrow\infty$ ), we demonstrate the validity of our formulation even in scenarios where $n$ is small but $p$ is large.
•

The formulation of our model gives rise to a high-dimensional non-convex loss function based on likelihood. By establishing the local convexity of the loss function in a neighborhood of the true parameters, we compute the local MLE by a standard gradient descent algorithm using a newly proposed method of moments estimator (MME) as its initial value. To the best of our knowledge, this is the first result in network data analysis for solving such a non-convex optimization problem with algorithmic guarantees.
•

Furthermore, to characterize the local MLE, we have derived its estimation error bounds in the $\ell_{2}$ norm and the $\ell_{\infty}$ norm when $np\rightarrow\infty$ in which $n\geq 2$ can be finite. Due to the dynamic structure of the data, the Hessian matrix of the loss function exhibits a complex structure. As a result, existing analytical approaches, such as the interior point theorem [6, 37] developed for static networks, are no longer applicable; see Section 3.1 for further elaboration. We derive a novel locally uniform deviation bound in a neighborhood of the true parameters with a diverging radius. Based on this we first establish $\ell_{2}$ norm consistency of the MLE, which paves the way for the uniform consistency in $\ell_{\infty}$ norm.
•

In establishing the locally uniform deviation bound, we have provided a general result for functions of the form $L(\boldsymbol{\theta})=\frac{1}{p}\sum_{1\leq i\neq j\leq p}l_{i,j}\left(\theta_{i},\theta_{j}\right)Y_{i,j}$ as defined in (4.11) below. This result explores the sparsity structure of $L(\boldsymbol{\theta})$ in the sense that most of its higher order derivatives are zero – the condition which our model satisfies, and provides a new bound that substantially extends the scope of empirical processes for the M-estimators [32] for the models with a fixed number of parameters to those with a growing number of parameters. The result here is of independent interest as it can be applied to any model with an objective function taking the form of $L$ .

The rest of the paper is organized as follows. We introduce in Section 2 the new two-way heterogeneity model and present its properties. The estimation of its local MLE in a neighborhood of the truth and the associated theoretical properties are presented in Section 3. The development of these properties relies on new local deviation bounds which are presented in Section 4. Simulation studies and an analysis of ants interaction data are reported in Section 5. We conclude the paper in Section 6. All technical proofs are relegated to Appendix A. Additional numerical results showcasing the effectiveness of our method in aiding community detection within stochastic block structures, along with an application aimed at understanding dynamic protein-protein interaction networks, are provided in Appendix B.

2 Two-way Heterogeneity Model

Consider a dynamic network defined on $p$ nodes which are unchanged over time. Denote by a $p\times p$ matrix ${\mathbf{X}}^{t}=(X_{i,j}^{t})$ its adjacency matrix at time $t$ , i.e. $X_{i,j}^{t}=1$ indicates the existence of a connection between nodes $i$ and $j$ at time $t$ , and 0 otherwise. We focus on undirected networks without self-loops, i.e., $X_{i,j}^{t}=X_{j,i}^{t}$ for all $(i,j)\in{{\mathcal{J}}\equiv\{(i,j):1\leq i<j\leq p\}}$ , and $X_{i,i}^{t}=0$ for $1\leq i\leq p$ , though our approach can be readily extended to directed networks.

To capture the autoregressive pattern in dynamic networks, [14] proposed to model the network process via the following stationary AR(1) framework:

X^{t}_{i,j}\;=\;X^{t-1}_{i,j}\,I({\varepsilon}_{i,j}^{t}=0)\;+\;I({\varepsilon}_{i,j}^{t}=1),\quad t\geq 1,

where $I(\cdot)$ denotes the indicator function, and the ${\varepsilon}_{i,j}^{t}$ , $(i,j)\in{\mathcal{J}}$ are independent innovations satisfying

P({\varepsilon}_{i,j}^{t}=1)=\alpha_{i,j},\quad P({\varepsilon}_{i,j}^{t}=-1)=\beta_{i,j},\quad P({\varepsilon}_{i,j}^{t}=0)=1-\alpha_{i,j}-\beta_{i,j},

for some positive parameters $\alpha_{i,j}$ and $\beta_{i,j}$ . This general model opts to neglect the inherent nature of the networks and chooses to estimate each pair $\left(\alpha_{i,j},\beta_{i,j}\right)$ independently. As a result, there are $p(p-1)$ parameters and consistent model estimation requires $n\rightarrow\infty$ . Conversely, in numerous real-world scenarios, it is frequently noted that the number of network observations $n$ is modest, while the number of nodes $p$ can significantly exceed $n$ . Under such a scenario of small- $n$ -large- $p$ , the conventional model outlined in [14] may not be suitable. To address this and to effectively capture node heterogeneity in dynamic networks, as well as accommodate small- $n$ -large- $p$ networks, we propose the following reparameterization for the general AR(1) model mentioned above. This reparameterization not only accounts for inherent node heterogeneity but also reduces the parameter count from $p(p-1)$ to $2p$ .

Definition 1.

Two-way Heterogeneity Model (TWHM). The data generating process satisfies

(2.1)

X^{t}_{i,j}=I({\varepsilon}_{i,j}^{t}=0)+X^{t-1}_{i,j}I({\varepsilon}_{i,j}^{t}=1),\qquad(i,j)\in{\mathcal{J}},

where the ${\varepsilon}_{i,j}^{t}$ , for $(i,j)\in{\cal J}$ and $t\geq 1$ are independent innovations with their distributions satisfying

(2.2)

P({\varepsilon}_{i,j}^{t}=r)=\frac{e^{\beta_{i,r}+\beta_{j,r}}}{1+\sum_{k=0}^{1}e^{\beta_{i,k}+\beta_{j,k}}}~{}~{}{\rm for}~{}r=0,1,\quad P({\varepsilon}_{i,j}^{t}=-1)=\frac{1}{1+\sum_{k=0}^{1}e^{\beta_{i,k}+\beta_{j,k}}}.

TWHM defined above is a reparametrization of the AR(1) network model [14] as it reduces the total number of parameters from $p(p-1)$ therein to $2p$ . By Proposition 1 of [14], the matrix process $\{{\mathbf{X}}^{t},t\geq 1\}$ is strictly stationary with

(2.3)

P(X_{i,j}^{t}=1)=\frac{e^{\beta_{i,0}+\beta_{j,0}}}{1+e^{\beta_{i,0}+\beta_{j,0}}}=1-P(X_{i,j}^{t}=0),

provided that we activate the process with ${\mathbf{X}}^{0}=(X_{i,j}^{0})$ also following this stationary marginal distribution.

Furthermore,

{\rm E}(X_{i,j}^{t})=\frac{e^{\beta_{i,0}+\beta_{j,0}}}{1+e^{\beta_{i,0}+\beta_{j,0}}},\qquad{\rm Var}(X_{i,j}^{t})=\frac{e^{\beta_{i,0}+\beta_{j,0}}}{(1+e^{\beta_{i,0}+\beta_{j,0}})^{2}},

(2.4)

\rho_{i,j}(|t-s|)\equiv{\rm Corr}(X_{i,j}^{t},X_{i,j}^{s})=\left(\frac{e^{\beta_{i,1}+\beta_{j,1}}}{1+\sum_{r=0}^{1}e^{\beta_{i,r}+\beta_{j,r}}}\right)^{|t-s|}.

Note that the connection probabilities in (2.3) depend on $\boldsymbol{\beta}_{0}=(\beta_{1,0},\cdots,\beta_{p,0})^{\top}$ only, and are of the same form as the (static) $\beta$ -model [2]. Hence we call $\boldsymbol{\beta}_{0}$ the static heterogeneity parameter. Proposition 1 below confirms that means and variances of node degrees in TWHM also depend on $\boldsymbol{\beta}_{0}$ only, and that different values of $\beta_{i,0}$ reflect the heterogeneity in the degrees of nodes.

Under TWHM, it holds that

(2.5)

P(X^{t}_{i,j}=1|X^{t-1}_{i,j}=0)=\frac{e^{\beta_{i,0}+\beta_{j,0}}}{1+\sum_{k=0}^{1}e^{\beta_{i,k}+\beta_{j,k}}},P(X^{t}_{i,j}=0|X^{t-1}_{i,j}=1)=\frac{1}{1+\sum_{k=0}^{1}e^{\beta_{i,k}+\beta_{j,k}}}.

Hence the dynamic changes (over time) of network ${\mathbf{X}}^{t}$ depend on, in addition to $\boldsymbol{\beta}_{0}$ , $\boldsymbol{\beta}_{1}\equiv(\beta_{1,1},\cdots,\beta_{p,1})^{\top}$ : the larger $\beta_{i,1}$ is, the more likely $X_{i,j}^{t}$ will retain the value of $X_{i,j}^{t-1}$ for all $j$ . Thus we call $\boldsymbol{\beta}_{1}$ the dynamic heterogeneity parameter, as its components reflect the different dynamic behaviours of the $p$ nodes. A schematic description of the model can be seen from Figure 1 where three snapshots of a dynamic network with four nodes are depicted.

Figure 1: A schematic depiction of TWHM: $\beta_{i,0},i=1,...,4$ , are parameters to characterize the static heterogeneity of nodes, while $\beta_{i,1}$ characterize their dynamic heterogeneity.
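The data generating process in (2.1) and (2.2) can be sketched in a few lines of code. The following is a minimal illustration, not the authors' implementation: the function name `simulate_twhm`, the list-of-lists representation and the `seed` argument are assumptions made here for concreteness. Each pair starts from the stationary marginal (2.3) and then evolves through the three-valued innovation.

```python
import math
import random


def simulate_twhm(beta0, beta1, n, seed=0):
    """Simulate X^0, ..., X^n from the stationary TWHM in (2.1)-(2.3).

    Returns a list of n + 1 symmetric 0/1 adjacency matrices (lists of lists).
    The name and signature are illustrative, not taken from the paper.
    """
    rng = random.Random(seed)
    p = len(beta0)
    snapshots = [[[0] * p for _ in range(p)] for _ in range(n + 1)]
    for i in range(p):
        for j in range(i + 1, p):
            e0 = math.exp(beta0[i] + beta0[j])
            e1 = math.exp(beta1[i] + beta1[j])
            denom = 1.0 + e0 + e1
            # Stationary start, as in (2.3).
            x = 1 if rng.random() < e0 / (1.0 + e0) else 0
            snapshots[0][i][j] = snapshots[0][j][i] = x
            for t in range(1, n + 1):
                draw = rng.random()
                if draw < e0 / denom:
                    x = 1  # innovation 0: the edge is present at time t
                elif draw >= (e0 + e1) / denom:
                    x = 0  # innovation -1: the edge is absent at time t
                # otherwise innovation 1: keep the previous state
                snapshots[t][i][j] = snapshots[t][j][i] = x
    return snapshots
```

For example, `simulate_twhm([0.0] * 30, [0.0] * 30, 50)` produces 51 snapshots on 30 nodes whose edge density should be close to 1/2. Raising the entries of `beta1` makes the snapshots change more slowly.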

From now on, let $\{{\mathbf{X}}^{t}\}\sim P_{\boldsymbol{\theta}}$ denote the stationary TWHM with parameters ${\boldsymbol{\theta}}=(\boldsymbol{\beta}_{0}^{\top},\boldsymbol{\beta}_{1}^{\top})^{\top}$ , and $d_{i}^{t}=\sum_{j=1}^{p}X_{i,j}^{t}$ be the degree of node $i$ at time $t$ . The proposition below lists some properties of the node degrees.

Proposition 1.

Let $\{{\mathbf{X}}^{t}\}\sim P_{\boldsymbol{\theta}}$ . Then $\{(d_{1}^{t},\ldots,d_{p}^{t}),t=0,1,2,\cdots\}$ is a strictly stationary process. Furthermore for any $1\leq i,j\leq p$ and $t,s\geq 0$ ,

{\rm E}(d_{i}^{t})=\sum_{k=1,\>k\neq i}^{p}\frac{e^{\beta_{i,0}+\beta_{k,0}}}{1+e^{\beta_{i,0}+\beta_{k,0}}},\qquad{\rm Var}(d_{i}^{t})=\sum_{k=1,\>k\neq i}^{p}\frac{e^{\beta_{i,0}+\beta_{k,0}}}{(1+e^{\beta_{i,0}+\beta_{k,0}})^{2}},

\rho^{d}_{i,j}(|t-s|)\equiv{\rm Corr}(d_{i}^{t},d_{j}^{s})=\begin{cases}C_{i,\rho}\sum_{k=1,\>k\neq i}^{p}\left(\frac{e^{\beta_{i,1}+\beta_{k,1}}}{1+\sum_{r=0}^{1}e^{\beta_{i,r}+\beta_{k,r}}}\right)^{|t-s|}\frac{e^{\beta_{i,0}+\beta_{k,0}}}{(1+e^{\beta_{i,0}+\beta_{k,0}})^{2}}\quad&{\rm if}\;i=j,\\ 0&{\rm if}\;i\neq j,\end{cases}

where $C_{i,\rho}=\left(\sum_{k=1,\>k\neq i}^{p}\frac{e^{\beta_{i,0}+\beta_{k,0}}}{(1+e^{\beta_{i,0}+\beta_{k,0}})^{2}}\right)^{-1}$ .

Proposition 1 implies that when there exist constants $\beta_{0}$ and $\beta_{1}$ such that $\beta_{i,0}\approx\beta_{0}$ and $\beta_{i,1}\approx\beta_{1}$ for all $i$ , the degree sequence $\{d_{i}^{t},t=1,\ldots,n\}$ is approximately AR(1).
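The degree moments in Proposition 1 are simple sums over the other nodes, so they are easy to evaluate numerically. The sketch below is illustrative only; the helper names `edge_prob`, `retention` and `degree_moments` are assumptions introduced here and do not come from the paper.

```python
import math


def edge_prob(beta0, i, k):
    """Stationary link probability between nodes i and k, as in (2.3)."""
    e0 = math.exp(beta0[i] + beta0[k])
    return e0 / (1.0 + e0)


def retention(beta0, beta1, i, k):
    """Lag-one autocorrelation of X_{i,k}^t, the base of the power in (2.4)."""
    e0 = math.exp(beta0[i] + beta0[k])
    e1 = math.exp(beta1[i] + beta1[k])
    return e1 / (1.0 + e0 + e1)


def degree_moments(beta0, beta1, i, lag):
    """Mean, variance and lag-`lag` autocorrelation of the degree of node i."""
    others = [k for k in range(len(beta0)) if k != i]
    probs = [edge_prob(beta0, i, k) for k in others]
    mean = sum(probs)
    # e^x / (1 + e^x)^2 equals q (1 - q) for q = e^x / (1 + e^x).
    var = sum(q * (1.0 - q) for q in probs)
    acf = sum(retention(beta0, beta1, i, k) ** lag * q * (1.0 - q)
              for k, q in zip(others, probs)) / var
    return mean, var, acf
```

In the homogeneous case, where all entries of `beta0` and of `beta1` coincide, the autocorrelation reduces to a single geometric sequence in the lag, which matches the AR(1) remark above.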

3 Parameter Estimation

We introduce some notation first. Denote by ${{\mathbf{I}}}_{p}$ the $p\times p$ identity matrix. For any $s\in\mathbb{R}$ , $\textbf{s}_{p}$ denotes the $p\times 1$ vector with all its elements equal to $s$ . For ${\mathbf{a}}=(a_{1},\ldots,a_{p})^{\top}\in\mathbb{R}^{p}$ and ${\mathbf{A}}=(A_{i,j})\in\mathbb{R}^{p\times p}$ , let $\|{\mathbf{a}}\|_{q}=\left(\sum_{i=1}^{p}|a_{i}|^{q}\right)^{1/q}$ for any $q\geq 1$ , $\|{\mathbf{a}}\|_{\infty}=\max_{i}|a_{i}|$ , and $\|{\mathbf{A}}\|_{\infty}=\max_{i}\sum_{j=1}^{p}|A_{i,j}|$ . Furthermore, let $\|{\mathbf{A}}\|_{2}$ denote the spectral norm of ${\mathbf{A}}$ which equals its largest singular value. For a random matrix ${\mathbf{W}}\in\mathbb{R}^{p\times p}$ with ${\rm E}\left({\mathbf{W}}\right)=\textbf{0}$ , define its matrix variance as ${\rm Var}({\mathbf{W}})=\max\left\{\|{\rm E}\left({\mathbf{W}}{\mathbf{W}}^{\top}\right)\|_{2},\|{\rm E}\left({\mathbf{W}}^{\top}{\mathbf{W}}\right)\|_{2}\right\}$ . The notation $x\lesssim y$ means that there exists a constant $c_{1}>0$ such that $|x|\leq c_{1}|y|$ , while notation $x\gtrsim y$ means there exists a constant $c_{2}>0$ such that $|x|\geq c_{2}|y|$ . Denote by ${\mathbf{B}}_{\infty}\left({\mathbf{x}},r\right)=\left\{{\mathbf{y}}:\|{\mathbf{y}}-{\mathbf{x}}\|_{\infty}\leq r\right\}$ the ball centred at ${\mathbf{x}}$ with $\ell_{\infty}$ radius $r$ . Let $c,c_{0},c_{1},\ldots,C,C_{0},C_{1},\ldots$ denote some generic constants that may be different in different places. Let ${\boldsymbol{\theta}}^{*}=(\boldsymbol{\beta}_{0}^{*\top},\boldsymbol{\beta}_{1}^{*\top})^{\top}=(\beta^{*}_{1,0},\cdots,\beta^{*}_{p,0},\beta^{*}_{1,1},\cdots,\beta^{*}_{p,1})^{\top}$ be the true unknown parameters. We assume:

(A1)

There exists a constant $K$ such that for any $i=1,2,\cdots,p$ , the true parameters satisfy $\beta^{*}_{i,1}-\max\big{(}\beta^{*}_{i,0},0\big{)}<K$ .

Condition (A1) ensures that the autocorrelation functions (ACFs) in (2.4) are bounded away from 1 for any $(i,j)\in{\mathcal{J}}$ . It is worth noting that both $\beta^{*}_{i,1}$ and $\beta^{*}_{i,0}$ are allowed to vary with $p$ , thus accommodating sparse networks in our analysis. In practical terms, $\beta_{i,0}^{*}$ , which reflects the sparsity of the stationary network, tends to be very small for large networks. Consequently, condition (A1) holds when $\beta_{i,1}^{*}$ is bounded from above.

3.1 Maximum likelihood estimation

With the available observations ${\mathbf{X}}^{0},\cdots,{\mathbf{X}}^{n}$ , the log-likelihood function conditionally on ${\mathbf{X}}^{0}$ is of the form $L(\boldsymbol{\theta};{\mathbf{X}}^{n},\cdots,{\mathbf{X}}^{1}|{\mathbf{X}}^{0})=\sum_{t=1}^{n}L(\boldsymbol{\theta};{\mathbf{X}}^{t}|{\mathbf{X}}^{t-1})$ . Note that $\{X^{t}_{i,j}\}$ for different $(i,j)\in{\mathcal{J}}$ are independent of each other. By (2.5), the (normalized) negative log-likelihood admits the following form:

(3.6)

\begin{aligned}l(\boldsymbol{\theta})&=-\frac{1}{np}L(\boldsymbol{\theta};{\mathbf{X}}^{n},{\mathbf{X}}^{n-1},\cdots,{\mathbf{X}}^{1}|{\mathbf{X}}^{0})\\ &=\frac{1}{p}\sum_{1\leq i<j\leq p}\log\Big(1+e^{\beta_{i,0}+\beta_{j,0}}+e^{\beta_{i,1}+\beta_{j,1}}\Big)-\frac{1}{np}\sum_{1\leq i<j\leq p}\Bigg\{\left(\beta_{i,0}+\beta_{j,0}\right)\sum_{t=1}^{n}X_{i,j}^{t}\\ &\qquad+\log\left(1+e^{\beta_{i,1}+\beta_{j,1}}\right)\sum_{t=1}^{n}\left(1-X_{i,j}^{t}\right)\left(1-X_{i,j}^{t-1}\right)\\ &\qquad+\log\big(1+e^{\beta_{i,1}+\beta_{j,1}-\beta_{i,0}-\beta_{j,0}}\big)\sum_{t=1}^{n}X_{i,j}^{t}X_{i,j}^{t-1}\Bigg\}.\end{aligned}

For brevity, write

(3.7)

a_{i,j}=\sum_{t=1}^{n}X_{i,j}^{t},\quad b_{i,j}=\sum_{t=1}^{n}X_{i,j}^{t}X_{i,j}^{t-1},\quad d_{i,j}=\sum_{t=1}^{n}\Big{(}1-X_{i,j}^{t}\Big{)}\Big{(}1-X_{i,j}^{t-1}\Big{)}.
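As a rough illustration, the statistics in (3.7) and the normalized negative conditional log-likelihood can be computed as below. The loss is derived here directly from the transition probabilities in (2.5), so that minimizing it corresponds to maximizing the conditional likelihood. The function names and the list-of-lists layout are assumptions made for this sketch.

```python
import math


def sufficient_stats(snapshots):
    """Return (a, b, d) of (3.7) as p x p lists, filled for i != j."""
    n = len(snapshots) - 1
    p = len(snapshots[0])
    a = [[0] * p for _ in range(p)]
    b = [[0] * p for _ in range(p)]
    d = [[0] * p for _ in range(p)]
    for t in range(1, n + 1):
        cur, prev = snapshots[t], snapshots[t - 1]
        for i in range(p):
            for j in range(p):
                if i != j:
                    a[i][j] += cur[i][j]
                    b[i][j] += cur[i][j] * prev[i][j]
                    d[i][j] += (1 - cur[i][j]) * (1 - prev[i][j])
    return a, b, d


def neg_loglik(beta0, beta1, a, b, d, n):
    """Normalized negative conditional log-likelihood, summed over i < j."""
    p = len(beta0)
    total = 0.0
    for i in range(p):
        for j in range(i + 1, p):
            u = beta0[i] + beta0[j]
            v = beta1[i] + beta1[j]
            total += n * math.log(1.0 + math.exp(u) + math.exp(v))
            total -= u * a[i][j]
            total -= math.log1p(math.exp(v)) * d[i][j]
            total -= math.log1p(math.exp(v - u)) * b[i][j]
    return total / (n * p)
```

Only the three count matrices enter the loss, so the raw snapshots can be discarded once `sufficient_stats` has been run.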

Then the Hessian matrix of $l(\boldsymbol{\theta})$ is of the form

{\mathbf{V}}(\boldsymbol{\theta})=\frac{\partial^{2}{l(\boldsymbol{\theta})}}{\partial{\boldsymbol{\theta}}\partial{\boldsymbol{\theta}^{\top}}}=\begin{bmatrix}\frac{\partial^{2}{l(\boldsymbol{\theta})}}{\partial{\boldsymbol{\beta}_{0}}\partial{\boldsymbol{\beta}_{0}^{\top}}}&\frac{\partial^{2}{l(\boldsymbol{\theta})}}{\partial{\boldsymbol{\beta}_{0}}\partial{\boldsymbol{\beta}_{1}^{\top}}}\\ \frac{\partial^{2}{l(\boldsymbol{\theta})}}{\partial{\boldsymbol{\beta}_{1}}\partial{\boldsymbol{\beta}_{0}^{\top}}}&\frac{\partial^{2}{l(\boldsymbol{\theta})}}{\partial{\boldsymbol{\beta}_{1}}\partial{\boldsymbol{\beta}_{1}^{\top}}}\end{bmatrix}:=\begin{bmatrix}{\mathbf{V}}_{1}(\boldsymbol{\theta})&{\mathbf{V}}_{2}(\boldsymbol{\theta})\\ {\mathbf{V}}_{2}(\boldsymbol{\theta})&{\mathbf{V}}_{3}(\boldsymbol{\theta})\end{bmatrix},

where for $i\not=j$ ,

\begin{aligned}\frac{\partial^{2}l(\boldsymbol{\theta})}{\partial\beta_{i,0}\partial\beta_{j,0}}&=\frac{1}{p}\frac{e^{\beta_{i,0}+\beta_{j,0}}(1+e^{\beta_{i,1}+\beta_{j,1}})}{(1+e^{\beta_{i,0}+\beta_{j,0}}+e^{\beta_{i,1}+\beta_{j,1}})^{2}}-\frac{1}{np}b_{i,j}\frac{e^{\beta_{i,0}+\beta_{j,0}+\beta_{i,1}+\beta_{j,1}}}{(e^{\beta_{i,0}+\beta_{j,0}}+e^{\beta_{i,1}+\beta_{j,1}})^{2}},\\ \frac{\partial^{2}l(\boldsymbol{\theta})}{\partial\beta_{i,0}\partial\beta_{j,1}}&=-\frac{1}{p}\frac{e^{\beta_{i,0}+\beta_{j,0}+\beta_{i,1}+\beta_{j,1}}}{(1+e^{\beta_{i,0}+\beta_{j,0}}+e^{\beta_{i,1}+\beta_{j,1}})^{2}}+\frac{1}{np}b_{i,j}\frac{e^{\beta_{i,0}+\beta_{j,0}+\beta_{i,1}+\beta_{j,1}}}{(e^{\beta_{i,0}+\beta_{j,0}}+e^{\beta_{i,1}+\beta_{j,1}})^{2}},\\ \frac{\partial^{2}l(\boldsymbol{\theta})}{\partial\beta_{i,1}\partial\beta_{j,1}}&=\frac{1}{p}\frac{e^{\beta_{i,1}+\beta_{j,1}}(1+e^{\beta_{i,0}+\beta_{j,0}})}{(1+e^{\beta_{i,0}+\beta_{j,0}}+e^{\beta_{i,1}+\beta_{j,1}})^{2}}-\frac{1}{np}d_{i,j}\frac{e^{\beta_{i,1}+\beta_{j,1}}}{(1+e^{\beta_{i,1}+\beta_{j,1}})^{2}}\\ &\qquad-\frac{1}{np}b_{i,j}\frac{e^{\beta_{i,0}+\beta_{j,0}+\beta_{i,1}+\beta_{j,1}}}{(e^{\beta_{i,0}+\beta_{j,0}}+e^{\beta_{i,1}+\beta_{j,1}})^{2}}.\end{aligned}

Note that matrix ${\mathbf{V}}_{2}(\boldsymbol{\theta})$ is symmetric. Furthermore, the three matrices ${\mathbf{V}}_{1}(\boldsymbol{\theta}),{\mathbf{V}}_{2}(\boldsymbol{\theta})$ and ${\mathbf{V}}_{3}(\boldsymbol{\theta})$ are diagonally balanced [12] in the sense that their diagonal elements are the sums of their respective rows, namely,

({\mathbf{V}}_{k}(\boldsymbol{\theta}))_{i,i}=\sum_{j=1,\>j\neq i}^{p}({\mathbf{V}}_{k}(\boldsymbol{\theta}))_{i,j},\quad k=1,2,3.

Unfortunately the Hessian matrix ${\mathbf{V}}(\boldsymbol{\theta})$ is not uniformly positive-definite. Hence $l(\boldsymbol{\theta})$ is not convex; see Section 5.1 for an example. Therefore, finding the global MLE by minimizing $l(\boldsymbol{\theta})$ would be infeasible, especially given the large dimensionality of $\boldsymbol{\theta}$ . To overcome the obstacle, we propose the following roadmap to search for the local MLE over a neighbourhood of the true parameter values $\boldsymbol{\theta}^{*}$ .

(1)

First we show that $l(\boldsymbol{\theta})$ is locally convex in a neighbourhood of $\boldsymbol{\theta}^{*}$ (see Theorem 1 below). Towards this end, we first prove that ${\rm E}({\mathbf{V}}(\boldsymbol{\theta}))$ is positive definite in a neighborhood of $\boldsymbol{\theta}^{*}$ . Leveraging on some newly proved concentration results, we show that ${\mathbf{V}}(\boldsymbol{\theta})$ converges to ${\rm E}({\mathbf{V}}(\boldsymbol{\theta}))$ uniformly over the neighborhood.
(2)

Denote by $\widehat{\boldsymbol{\theta}}$ the local MLE in the neighbourhood identified above. We derive the bounds for $\widehat{\boldsymbol{\theta}}-\boldsymbol{\theta}^{*}$ respectively in both $\ell_{2}$ and $\ell_{\infty}$ norms (see Theorems 2 and 3 below). The $\ell_{2}$ convergence is established by providing a uniform upper bound for the local deviation between $l(\boldsymbol{\theta})-{\rm E}(l(\boldsymbol{\theta}))$ and $l(\boldsymbol{\theta}^{*})-{\rm E}(l(\boldsymbol{\theta}^{*}))$ (see Corollary 4 in Section 4). The $\ell_{\infty}$ convergence of $\widehat{\boldsymbol{\theta}}$ is established by further exploiting the special structure of the objective function.
(3)

We propose a new method of moments estimator (MME) which is proved to lie asymptotically in the neighbourhood specified in (1) above. With this MME as the initial value, the local MLE $\widehat{\boldsymbol{\theta}}$ can be simply obtained via a gradient descent algorithm.
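Step (3) of the roadmap can be sketched as a plain fixed-step gradient descent on the normalized negative conditional log-likelihood. This is a minimal illustration under stated assumptions rather than the authors' algorithm: the gradient is derived here from the transition probabilities in (2.5), the step size and iteration count are arbitrary choices, and the starting point is supplied by the caller because the MME is not reproduced in this sketch.

```python
import math


def gradient(beta0, beta1, a, b, d, n):
    """Gradient of the normalized negative conditional log-likelihood.

    a, b, d are the p x p count matrices of (3.7); n is the number of
    transitions. Returns the partial derivatives w.r.t. beta0 and beta1.
    """
    p = len(beta0)
    g0, g1 = [0.0] * p, [0.0] * p
    for i in range(p):
        for j in range(i + 1, p):
            eu = math.exp(beta0[i] + beta0[j])
            ev = math.exp(beta1[i] + beta1[j])
            denom = 1.0 + eu + ev
            keep = ev / (eu + ev)
            du = eu / denom - (a[i][j] - b[i][j] * keep) / n
            dv = ev / denom - (d[i][j] * ev / (1.0 + ev) + b[i][j] * keep) / n
            # The pair (i, j) contributes equally to both of its endpoints.
            for k in (i, j):
                g0[k] += du / p
                g1[k] += dv / p
    return g0, g1


def gradient_descent(init0, init1, a, b, d, n, step=0.5, iters=500):
    """Fixed-step gradient descent started at the hypothetical (init0, init1)."""
    beta0, beta1 = list(init0), list(init1)
    for _ in range(iters):
        g0, g1 = gradient(beta0, beta1, a, b, d, n)
        beta0 = [x - step * g for x, g in zip(beta0, g0)]
        beta1 = [x - step * g for x, g in zip(beta1, g1)]
    return beta0, beta1
```

Because the loss is only locally convex, the quality of the starting point matters. A start far from the true parameters carries no guarantee of reaching the local MLE, which is the role the MME plays in the roadmap.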

The main technical challenges in the roadmap above can be summarized as follows.

Firstly, to establish the upper bounds as stated in (2) above, we need to evaluate the uniform local deviations of the loss function. While the theoretical framework for deriving similar deviations of M-estimators has been well established in, for example, [32, 31], classical techniques in empirical process for establishing uniform laws [33] are not applicable because the number of parameters in TWHM diverges.

Secondly, for the classical $\beta$ -model, proving the existence and convergence of its MLE relies strongly on the interior point theorem [6]. In particular, this theorem is applicable only because the Hessian matrix of the $\beta$ -model admits a nice structure, i.e. it is diagonally dominant and all its elements are positive depending on the parameters only [2, 39, 37, 8]. However the Hessian matrix of $l(\boldsymbol{\theta})$ for TWHM depends on random variables $X_{i,j}^{t}$ ’s in addition to the parameters, making it impossible to verify if the score function is uniformly Fréchet differentiable or not, a key assumption required by the interior point theorem.

Lastly, the higher order derivatives of $l(\boldsymbol{\theta})$ may diverge as the order increases. To see this, notice that for any integer $k\geq 2$ , the $k$ -th order derivatives of $l(\boldsymbol{\theta})$ are closely related to the $(k-1)$ -th order derivative of the sigmoid function $S(x)=\frac{1}{1+e^{-x}}$ , which satisfies $\frac{\partial^{k-1}S\left(x\right)}{\partial x^{k-1}}=\frac{\sum_{m=0}^{k-2}-A\left(k-1,m\right)\left(-e^{x}\right)^{m+1}}{\left(1+e^{x}\right)^{k}}$ , where $A\left(k-1,m\right)$ is the Eulerian number [26]. Some of the coefficients $A\left(k-1,m\right)$ can diverge very quickly as $k$ increases. Thus, loosely speaking, $l(\boldsymbol{\theta})$ is not smooth. This non-smoothness and the need to deal with a growing number of parameters make various local approximations based on Taylor expansions highly non-trivial; note that the consistency of MLEs in many finite-dimensional models is often established via these approximations.
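The growth of the Eulerian coefficients is easy to see numerically. The sketch below tabulates Eulerian numbers with the standard recurrence $A(n,m)=(n-m)A(n-1,m-1)+(m+1)A(n-1,m)$ and evaluates the $n$-th derivative of the sigmoid through the corresponding expansion. The function names are illustrative assumptions.

```python
import math


def eulerian_row(n):
    """Eulerian numbers A(n, 0), ..., A(n, n - 1) for n >= 1."""
    row = [1]  # A(1, 0) = 1
    for size in range(2, n + 1):
        prev = row
        row = []
        for m in range(size):
            left = prev[m - 1] if m >= 1 else 0
            right = prev[m] if m < size - 1 else 0
            # A(size, m) = (size - m) A(size-1, m-1) + (m + 1) A(size-1, m)
            row.append((size - m) * left + (m + 1) * right)
    return row


def sigmoid_derivative(n, x):
    """n-th derivative of S(x) = 1 / (1 + exp(-x)) for n >= 1."""
    ex = math.exp(x)
    num = sum((-1) ** m * coef * ex ** (m + 1)
              for m, coef in enumerate(eulerian_row(n)))
    return num / (1.0 + ex) ** (n + 1)
```

Each row of Eulerian numbers sums to $n!$, so the largest coefficient in row $n$ is at least $n!/n$, which gives a concrete sense of how fast the numerators can grow with the order.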

In our proofs, we have made great use of the special sparse structure of the loss function in the form (4.11) below. This sparsity structure stems from the fact that most of its higher order derivatives are zero. Based on the uniform local deviation bound obtained in Section 4, we have established an upper bound for the error of the local MLE under the $\ell_{2}$ norm. Utilizing the structure of the marginalized functions of the loss we have further established an upper bound for the estimation error under the $\ell_{\infty}$ norm thanks to an iterative procedure stated in Section 3.3.

3.2 Existence of the local MLE

To establish the convexity of $l(\boldsymbol{\theta})$ in a neighborhood of $\boldsymbol{\theta}^{*}$ , we first show that such a local convexity holds for ${\rm E}(l(\boldsymbol{\theta}))$ , i.e. that its Hessian ${\rm E}({\mathbf{V}}(\boldsymbol{\theta}))$ is positive definite there.

Proposition 2.

Let ${\mathbf{A}}$ be a $2p\times 2p$ matrix defined as ${\mathbf{A}}=\left[\begin{array}[]{ccc}{\mathbf{A}}_{1}&{\mathbf{A}}_{2}\\ {\mathbf{A}}_{2}&{\mathbf{A}}_{3}\end{array}\right],$ where ${\mathbf{A}}_{1}$ , ${\mathbf{A}}_{2}$ , ${\mathbf{A}}_{3}$ are $p\times p$ symmetric matrices. Then ${\mathbf{A}}$ is positive (negative) definite if $-{\mathbf{A}}_{2},{\mathbf{A}}_{2}+{\mathbf{A}}_{3},{\mathbf{A}}_{2}+{\mathbf{A}}_{1}$ are all positive (negative) definite.

Proof.

Consider any nonzero ${\mathbf{x}}=({\mathbf{x}}_{1}^{\top},{\mathbf{x}}_{2}^{\top})^{\top}\in\mathbb{R}^{2p}$ where ${\mathbf{x}}_{1},{\mathbf{x}}_{2}\in\mathbb{R}^{p}$ , we have:

\begin{aligned}{\mathbf{x}}^{\top}{\mathbf{A}}{\mathbf{x}}&={\mathbf{x}}_{1}^{\top}{\mathbf{A}}_{1}{\mathbf{x}}_{1}+{\mathbf{x}}_{2}^{\top}{\mathbf{A}}_{3}{\mathbf{x}}_{2}+2{\mathbf{x}}_{1}^{\top}{\mathbf{A}}_{2}{\mathbf{x}}_{2}\\ &={\mathbf{x}}_{1}^{\top}({\mathbf{A}}_{1}+{\mathbf{A}}_{2}){\mathbf{x}}_{1}+{\mathbf{x}}_{2}^{\top}({\mathbf{A}}_{3}+{\mathbf{A}}_{2}){\mathbf{x}}_{2}-({\mathbf{x}}_{1}-{\mathbf{x}}_{2})^{\top}{\mathbf{A}}_{2}({\mathbf{x}}_{1}-{\mathbf{x}}_{2}).\end{aligned}

This proves the proposition. ∎
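The decomposition used in the proof can be checked numerically on random symmetric blocks. The following sketch does this with plain lists; all helper names are hypothetical and chosen only for this illustration.

```python
import random


def quad(x, m, y):
    """Bilinear form x^T M y for lists x, y and a square list-of-lists M."""
    return sum(x[i] * m[i][j] * y[j]
               for i in range(len(x)) for j in range(len(y)))


def mat_add(m, k):
    """Entrywise sum of two square matrices of the same size."""
    size = len(m)
    return [[m[i][j] + k[i][j] for j in range(size)] for i in range(size)]


def full_form(a1, a2, a3, x1, x2):
    """x^T A x for the block matrix A = [[A1, A2], [A2, A3]], A2 symmetric."""
    return quad(x1, a1, x1) + quad(x2, a3, x2) + 2.0 * quad(x1, a2, x2)


def split_form(a1, a2, a3, x1, x2):
    """Right-hand side of the decomposition used in the proof."""
    diff = [s - t for s, t in zip(x1, x2)]
    return (quad(x1, mat_add(a1, a2), x1) + quad(x2, mat_add(a3, a2), x2)
            - quad(diff, a2, diff))


def identity_gap(size=4, seed=0):
    """Absolute gap between both sides for random symmetric blocks."""
    rng = random.Random(seed)

    def sym():
        raw = [[rng.uniform(-1, 1) for _ in range(size)] for _ in range(size)]
        return [[raw[i][j] + raw[j][i] for j in range(size)]
                for i in range(size)]

    a1, a2, a3 = sym(), sym(), sym()
    x1 = [rng.uniform(-1, 1) for _ in range(size)]
    x2 = [rng.uniform(-1, 1) for _ in range(size)]
    return abs(full_form(a1, a2, a3, x1, x2) - split_form(a1, a2, a3, x1, x2))
```

Calling `identity_gap()` with different seeds should return values at the level of floating-point rounding. Each of the three terms on the right-hand side is nonnegative when the corresponding matrix is positive definite, and they cannot all vanish for a nonzero vector, which is the content of the proposition.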

Noting that $-{\mathbf{V}}_{2}(\boldsymbol{\theta}),{\mathbf{V}}_{2}(\boldsymbol{\theta})+{\mathbf{V}}_{3}(\boldsymbol{\theta})$ and ${\mathbf{V}}_{2}(\boldsymbol{\theta})+{\mathbf{V}}_{1}(\boldsymbol{\theta})$ are all diagonally balanced matrices, with some routine calculations it can be shown that $-{\rm E}{\mathbf{V}}_{2}(\boldsymbol{\theta}^{*}),{\rm E}({\mathbf{V}}_{2}(\boldsymbol{\theta}^{*})+{\mathbf{V}}_{3}(\boldsymbol{\theta}^{*}))$ and ${\rm E}({\mathbf{V}}_{2}(\boldsymbol{\theta}^{*})+{\mathbf{V}}_{1}(\boldsymbol{\theta}^{*}))$ have only positive elements, and thus are all positive definite. Therefore, ${\rm E}{\mathbf{V}}(\boldsymbol{\theta}^{*})$ is positive definite by Proposition 2. By continuity, when $\boldsymbol{\theta}$ is close enough to $\boldsymbol{\theta}^{*}$ , ${\rm E}{\mathbf{V}}(\boldsymbol{\theta})$ is also positive definite, and hence ${\rm E}l(\boldsymbol{\theta})$ is strongly convex in a neighborhood of $\boldsymbol{\theta}^{*}$ . Next we want to show the local convexity of $l(\boldsymbol{\theta})$ whose second order derivatives depend on the sufficient statistics $b_{i,j}=\sum_{t=1}^{n}X_{i,j}^{t}X_{i,j}^{t-1}$ , and $d_{i,j}=\sum_{t=1}^{n}\Big{(}1-X_{i,j}^{t}\Big{)}\Big{(}1-X_{i,j}^{t-1}\Big{)}$ . By noticing that the network process is $\alpha$ -mixing with an exponentially decaying mixing coefficient, we first obtain the following concentration results for $b_{i,j}$ and $d_{i,j}$ , which ensure element-wise convergence of ${\mathbf{V}}(\boldsymbol{\theta})$ to ${\rm E}{\mathbf{V}}(\boldsymbol{\theta})$ for a given $\boldsymbol{\theta}$ when $np\rightarrow\infty$ .

Lemma 1.

Suppose $\{{\mathbf{X}}^{t}\}\sim P_{\boldsymbol{\theta}}$ for some ${\boldsymbol{\theta}}=(\beta_{1,0},\cdots,\beta_{p,0},\beta_{1,1},\cdots,\beta_{p,1})^{\top}$ satisfying condition (A1). Then for any $(i,j)\in{\cal J}$ , $\{X^{t}_{i,j},t\geq 1\}$ is $\alpha$ -mixing with exponentially decaying rates. Moreover, for any constant $c>0$ , by choosing $c_{1}>0$ to be large enough, it holds with probability greater than $1-(np)^{-c}$ that

\max_{1\leq i<j\leq p}\left\{n^{-1}\left|\sum_{t=1}^{n}\left\{X_{i,j}^{t}-{\rm E}\left(X_{i,j}^{t}\right)\right\}\right|,n^{-1}\left|b_{i,j}-{\rm E}(b_{i,j})\right|,n^{-1}\left|d_{i,j}-{\rm E}(d_{i,j})\right|\right\}\leq c_{1}r_{n,p},

where $r_{n,p}=\sqrt{n^{-1}\log(np)}+n^{-1}\log\left(n\right)\log\log\left(n\right)\log\left(np\right)$ .

The following lemma provides a lower bound for the smallest eigenvalue of ${\rm E}({\mathbf{V}}(\boldsymbol{\theta}))$ .

Lemma 2.

Let $\{{\mathbf{X}}^{t}\}\sim P_{\boldsymbol{\theta}^{*}}$ , ${\mathbf{B}}_{\infty}\left(\boldsymbol{\theta}^{*},r\right):=\left\{\boldsymbol{\theta}:\|\boldsymbol{\theta}-\boldsymbol{\theta}^{*}\|_{\infty}\leq r\right\}$ and ${\mathbf{B}}\left(\kappa_{0},\kappa_{1}\right):=$ $\Big{\{}\left(\boldsymbol{\beta}_{0},\boldsymbol{\beta}_{1}\right):\|\boldsymbol{\beta}_{0}\|_{\infty}\leq\kappa_{0},\|\boldsymbol{\beta}_{1}\|_{\infty}\leq\kappa_{1}\Big{\}}$ . Under condition (A1), for any $\kappa_{0},\kappa_{1}$ and $r=c_{r}e^{-4\kappa_{0}-4\kappa_{1}}$ with $c_{r}>0$ being a small enough constant, there exists a constant $C>0$ such that

\inf_{\boldsymbol{\theta}\in{\mathbf{B}}_{\infty}\left(\boldsymbol{\theta}^{*},r\right)\cap{\mathbf{B}}\left(\kappa_{0},\kappa_{1}\right);\|{\mathbf{a}}\|_{2}=1}{\mathbf{a}}^{\top}{\rm E}\left({\mathbf{V}}(\boldsymbol{\theta})\right){\mathbf{a}}\geq Ce^{-4\kappa_{0}-4\kappa_{1}}.

Examining the proof indicates that the lower bound in Lemma 2 is attained when $\boldsymbol{\beta}_{0}=(\kappa_{0},\ldots,\kappa_{0})^{\top}$ and $\boldsymbol{\beta}_{1}=(-\kappa_{1},\ldots,-\kappa_{1})^{\top}$ . Hence the smallest eigenvalue of ${\rm E}\left({\mathbf{V}}(\boldsymbol{\theta})\right)$ can decay exponentially in $\kappa_{0}$ and $\kappa_{1}$ . Consequently, an upper bound for the radius $\kappa_{0}$ and $\kappa_{1}$ must be imposed so as to ensure the positive definiteness of the sample analog ${\mathbf{V}}(\boldsymbol{\theta})$ . Moreover, Lemma 2 also indicates that the positive definiteness of ${\rm E}\left({\mathbf{V}}(\boldsymbol{\theta})\right)$ can be guaranteed when $\boldsymbol{\theta}$ is within the $\ell_{\infty}$ ball ${\mathbf{B}}_{\infty}\left(\boldsymbol{\theta}^{*},r\right)$ . To establish the existence of the local MLE in the neighborhood, we need to evaluate the closeness of ${\rm E}\left({\mathbf{V}}(\boldsymbol{\theta})\right)$ and ${\mathbf{V}}(\boldsymbol{\theta})$ in terms of the operator norm. Intuitively, for some appropriately chosen $\kappa_{0},\kappa_{1}$ , if $\|{\rm E}\left({\mathbf{V}}(\boldsymbol{\theta})\right)-{\mathbf{V}}(\boldsymbol{\theta})\|_{2}$ has a smaller order than $e^{-4\kappa_{0}-4\kappa_{1}}$ uniformly over the parameter space $\{\boldsymbol{\theta}:\|\boldsymbol{\beta}_{0}\|_{\infty}\leq\kappa_{0},\|\boldsymbol{\beta}_{1}\|_{\infty}\leq\kappa_{1}$ and $\boldsymbol{\theta}\in{\mathbf{B}}_{\infty}\left(\boldsymbol{\theta}^{*},r\right)\}$ , the positive definiteness of ${\mathbf{V}}(\boldsymbol{\theta})$ can be concluded.

Note that ${\mathbf{V}}_{2}(\boldsymbol{\theta})-{\rm E}{\mathbf{V}}_{2}(\boldsymbol{\theta}),{\mathbf{V}}_{2}(\boldsymbol{\theta})+{\mathbf{V}}_{3}(\boldsymbol{\theta})-{\rm E}\left({\mathbf{V}}_{2}(\boldsymbol{\theta})+{\mathbf{V}}_{3}(\boldsymbol{\theta})\right)$ and ${\mathbf{V}}_{2}(\boldsymbol{\theta})+{\mathbf{V}}_{1}(\boldsymbol{\theta})-{\rm E}\left({\mathbf{V}}_{2}(\boldsymbol{\theta})+{\mathbf{V}}_{1}(\boldsymbol{\theta})\right)$ are all centered and diagonally balanced matrices which can be decomposed into sums of independent random matrices. The following lemma provides a bound for evaluating the moderate deviations of these centered matrices.

Lemma 3.

Let ${\mathbf{Z}}=(Z_{i,j})_{1\leq i,j\leq p}$ be a symmetric $p\times p$ random matrix such that the off-diagonal elements $Z_{i,j},1\leq i<j\leq p$ are independent of each other and satisfy

Z_{i,i}=\sum_{j=1,\>j\neq i}^{p}Z_{i,j},\quad{\rm E}\left(Z_{i,j}\right)=0,\quad{\rm Var}\left(Z_{i,j}\right)\leq\sigma^{2},\quad{\rm and}\quad Z_{i,j}\leq b\quad{\rm almost~surely}.

Then it holds that

P\left(\left\|{\mathbf{Z}}\right\|_{2}>\epsilon\right)\leq 2p\ \exp\left(-\frac{\epsilon^{2}}{2\sigma^{2}(p-1)+4b\epsilon}\right).
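As a rough numerical illustration of Lemma 3 (a sketch, not part of the proof), the snippet below draws one diagonally balanced matrix and compares its spectral norm with the threshold $\epsilon$ at which the stated tail bound equals a chosen level $\delta$. The choice of independent $\pm 1$ off-diagonal entries (so that $\sigma^{2}=1$ and $b=1$), the helper names `balanced_sign_matrix` and `lemma3_threshold`, and the values $p=50$, $\delta=0.05$ are all assumptions made only for this example.

```python
import numpy as np


def balanced_sign_matrix(p, rng):
    """Symmetric p x p matrix with independent +/-1 entries above the diagonal
    (mean 0, variance 1, bounded by 1) and Z_ii equal to the sum of the
    off-diagonal entries in row i."""
    signs = rng.choice([-1.0, 1.0], size=(p, p))
    upper = np.triu(signs, k=1)
    Z = upper + upper.T                  # diagonal is still zero here
    np.fill_diagonal(Z, Z.sum(axis=1))   # diagonally balanced
    return Z


def lemma3_threshold(p, sigma2, b, delta):
    """Positive root eps of 2p * exp(-eps^2 / (2 sigma^2 (p-1) + 4 b eps)) = delta."""
    log_term = np.log(2.0 * p / delta)
    return 2.0 * b * log_term + np.sqrt(
        4.0 * b**2 * log_term**2 + 2.0 * sigma2 * (p - 1) * log_term
    )


rng = np.random.default_rng(0)
p = 50
Z = balanced_sign_matrix(p, rng)
eps = lemma3_threshold(p, sigma2=1.0, b=1.0, delta=0.05)
spectral_norm = np.linalg.norm(Z, 2)
print(f"||Z||_2 = {spectral_norm:.2f}, threshold at delta = 0.05: {eps:.2f}")
```

The threshold solves a quadratic in $\epsilon$, which is why it has the closed form used in `lemma3_threshold`; the bound is typically conservative, so the observed norm should sit well below it.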

Proposition 2, Lemma 2 and Lemma 3 imply the theorem below.

Theorem 1.

Let condition (A1) hold, assume $\{{\mathbf{X}}^{t}\}\sim P_{\boldsymbol{\theta}^{*}}$ , and $\kappa_{r}:=\|\boldsymbol{\beta}_{r}^{*}\|_{\infty}$ where $r=0,1$ with $\kappa_{0}+\kappa_{1}\leq c\log(np)$ for some small enough constant $c>0$ . Then as $np\rightarrow\infty$ with $n\geq 2$ , we have that, with probability tending to one, there exists a unique MLE in the $\ell_{\infty}$ ball ${\mathbf{B}}_{\infty}\left(\boldsymbol{\theta}^{*},r\right)=\left\{\boldsymbol{\theta}:\|\boldsymbol{\theta}-\boldsymbol{\theta}^{*}\|_{\infty}\leq r\right\}$ for some $r=c_{r}e^{-4\kappa_{0}-4\kappa_{1}}$ , where $c_{r}>0$ is a constant.

In the proof of Theorem 1, we have shown that with probability tending to 1, $l(\boldsymbol{\theta})$ is convex in the convex and closed set ${\mathbf{B}}_{\infty}\left(\boldsymbol{\theta}^{*},r\right)$ . Consequently, we conclude that there exists a unique local MLE in ${\mathbf{B}}_{\infty}\left(\boldsymbol{\theta}^{*},r\right)$ . From Theorem 1 we can also see that the radius $r$ shrinks as $\kappa_{0}+\kappa_{1}$ grows, and that $r$ is of constant order when $\kappa_{0}+\kappa_{1}$ is bounded. From the proof we can also see that the constant $c_{r}$ can be larger if the smallest eigenvalue of the expected Hessian matrix ${\rm E}({\mathbf{V}}(\boldsymbol{\theta}))$ is larger. Further, by allowing the upper bound of $\|\boldsymbol{\beta}^{*}_{0}\|_{\infty}$ to grow to infinity, our theoretical analysis covers the case where networks are sparse. Specifically, under the condition that $\|\boldsymbol{\beta}_{0}^{*}\|_{\infty}\leq\kappa_{0}$ , from (2) we can obtain the following lower bound (which is achievable when $\beta_{1,0}^{*}=\ldots=\beta_{p,0}^{*}=-\kappa_{0}$ ) for the density of the stationary network:

\rho:=\frac{2}{p(p-1)}\sum_{1\leq i<j\leq p}{\mathbf{P}}\left(X_{i,j}^{t}=1\right)\geq\frac{e^{-2\kappa_{0}}}{1+e^{-2\kappa_{0}}}=O\left(e^{-2\kappa_{0}}\right).

In particular, when $\kappa_{0}\leq c\log(np)$ for some constant $c>0$ , we have $\rho\geq\frac{1}{[1+(np)^{2c}]}$ . Thus, compared to fully dense network processes, where the total number of edges in each network is of the order $p^{2}$ , TWHM allows networks with far fewer edges.
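The density bound can be checked numerically with a minimal sketch. It assumes that the stationary edge probability is ${\mathbf{P}}(X_{i,j}^{t}=1)=e^{\beta_{i,0}+\beta_{j,0}}/(1+e^{\beta_{i,0}+\beta_{j,0}})$ , the same expression that enters the moment equations later in this section; the function names and the values $p=50$, $\kappa_{0}=2$ are hypothetical choices for illustration.

```python
import numpy as np


def stationary_density(beta0):
    """Average of P(X_ij = 1) = sigmoid(beta_i0 + beta_j0) over pairs i < j."""
    beta0 = np.asarray(beta0, dtype=float)
    p = beta0.shape[0]
    pair_sums = beta0[:, None] + beta0[None, :]
    probs = 1.0 / (1.0 + np.exp(-pair_sums))
    upper = np.triu_indices(p, k=1)      # each unordered pair once
    return probs[upper].mean()


def density_lower_bound(kappa0):
    """Lower bound e^{-2 kappa0} / (1 + e^{-2 kappa0}) on the density."""
    return np.exp(-2.0 * kappa0) / (1.0 + np.exp(-2.0 * kappa0))


rng = np.random.default_rng(0)
kappa0, p = 2.0, 50
beta0 = rng.uniform(-kappa0, kappa0, size=p)
print("density:", stationary_density(beta0))
print("lower bound:", density_lower_bound(kappa0))
print("sparsest case:", stationary_density(np.full(p, -kappa0)))
```

The last line evaluates the configuration $\beta_{1,0}=\ldots=\beta_{p,0}=-\kappa_{0}$ , at which the bound is attained.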

3.3 Consistency of the local MLE

In the previous subsection, we have proved that with probability tending to one, $l(\boldsymbol{\theta})$ is convex in ${\mathbf{B}}_{\infty}\left(\boldsymbol{\theta}^{*},r\right)$ , where $r=c_{r}e^{-4\kappa_{0}-4\kappa_{1}}$ is defined in Theorem 1. Denote by $\widehat{\boldsymbol{\theta}}$ the (local) MLE in ${\mathbf{B}}_{\infty}\left(\boldsymbol{\theta}^{*},r\right)$ . We now evaluate the $\ell_{2}$ and $\ell_{\infty}$ distances between $\widehat{\boldsymbol{\theta}}$ and the true value $\boldsymbol{\theta}^{*}$ .

Based on Theorem 5 we obtain a local deviation bound for $l(\boldsymbol{\theta})$ as in Corollary 4 in Section 4, from which we establish the following upper bound for the estimation error of $\widehat{\boldsymbol{\theta}}$ under the $\ell_{2}$ norm:

Theorem 2.

\frac{1}{\sqrt{p}}\left\|\widehat{\boldsymbol{\theta}}-\boldsymbol{\theta}^{*}\right\|_{2}\lesssim e^{4\kappa_{0}+4\kappa_{1}}\sqrt{\frac{\log(np)}{np}}\left(1+\frac{\log(np)}{\sqrt{p}}\right).

We discuss the implication of this theorem. When $n\to\infty$ and $p$ is finite, that is, when we have a fixed number of nodes but a growing number of network snapshots, Theorem 2 indicates that $\left\|\widehat{\boldsymbol{\theta}}-\boldsymbol{\theta}^{*}\right\|_{2}=O_{p}\left(\sqrt{\frac{\log^{3}n}{n}}e^{4\kappa_{0}+4\kappa_{1}}\right)=o_{p}(1)$ when $c$ is small enough. On the other hand, when $n$ , $\kappa_{0}$ and $\kappa_{1}$ are finite, Theorem 2 indicates that as the number of parameters $p$ increases, the $\ell_{2}$ error bound of $\widehat{\boldsymbol{\theta}}$ increases at a much slower rate $O\left(\sqrt{\log p}\right)$ .

Although Theorem 2 indicates that $\frac{1}{\sqrt{p}}\left\|\widehat{\boldsymbol{\theta}}-\boldsymbol{\theta}^{*}\right\|_{2}=o_{p}(1)$ as $np\rightarrow\infty$ , it does not guarantee the uniform convergence of all the elements in $\widehat{\boldsymbol{\theta}}$ . To prove the uniform convergence in the $\ell_{\infty}$ norm, we exploit a special structure of the loss function and the $\ell_{2}$ norm bound obtained in Theorem 2. Specifically, denote $l(\boldsymbol{\theta})$ in (3.6) as $l(\boldsymbol{\theta})=l(\boldsymbol{\theta}_{(i)},\boldsymbol{\theta}_{(-i)})$ where $\boldsymbol{\theta}_{(i)}:=(\beta_{i,0},\beta_{i,1})^{\top}$ , and $\boldsymbol{\theta}_{(-i)}$ contains the remaining elements of $\boldsymbol{\theta}$ except $\boldsymbol{\theta}_{(i)}$ . Using this notation, we can analogously define $\boldsymbol{\theta}^{*}_{(i)}$ and $\boldsymbol{\theta}^{*}_{(-i)}$ for the true parameter $\boldsymbol{\theta}^{*}$ , and $\widehat{\boldsymbol{\theta}}_{(i)}$ and $\widehat{\boldsymbol{\theta}}_{(-i)}$ for the local MLE $\widehat{\boldsymbol{\theta}}$ . We then have that $\boldsymbol{\theta}_{(i)}^{*}$ is the minimizer of ${\rm E}l\left(\cdot,\boldsymbol{\theta}_{(-i)}^{*}\right)$ while $\widehat{\boldsymbol{\theta}}_{(i)}$ is the minimizer of $l\left(\cdot,\widehat{\boldsymbol{\theta}}_{(-i)}\right)$ . The error of $\widehat{\boldsymbol{\theta}}_{(i)}$ in estimating $\boldsymbol{\theta}_{(i)}^{*}$ then relies on the distance between ${\rm E}l\left(\cdot,\boldsymbol{\theta}_{(-i)}^{*}\right)$ and $l\left(\cdot,\widehat{\boldsymbol{\theta}}_{(-i)}\right)$ , which in turn depends on both the $\ell_{2}$ bound of $\|\widehat{\boldsymbol{\theta}}-\boldsymbol{\theta}^{*}\|_{2}$ and the uniform local deviation bound of $l\left(\boldsymbol{\theta}_{(i)},\boldsymbol{\theta}_{(-i)}\right)$ .
Based on Theorem 2, Corollary 4 in Section 4, and a sequential approach (see equations (A.30) and (A.31) in the appendix), we obtain the following bound for the estimation error under the $\ell_{\infty}$ norm.

Theorem 3.

\left\|\widehat{\boldsymbol{\theta}}-\boldsymbol{\theta}^{*}\right\|_{\infty}\lesssim e^{8\kappa_{0}+8\kappa_{1}}\log\log(np)\sqrt{\frac{\log(np)}{np}}\left(1+\frac{\log(np)}{\sqrt{p}}\right).

Theorem 3 indicates that $\left\|\widehat{\boldsymbol{\theta}}-\boldsymbol{\theta}^{*}\right\|_{\infty}=o_{p}(1)$ as $np\rightarrow\infty$ . Thus all the components of $\widehat{\boldsymbol{\theta}}$ converge uniformly. On the other hand, when $\kappa=c\log(np)$ for some small enough positive constant $c$ , we have $e^{8\kappa_{0}+8\kappa_{1}}\log\log(np)\sqrt{\frac{\log(np)}{np}}\left(1+\frac{\log(np)}{\sqrt{p}}\right)=o(c_{r}e^{-4\kappa_{0}-4\kappa_{1}})$ . Compared with Theorem 1, we observe that although the radius $r$ in Theorem 1 already tends to zero when $\|\boldsymbol{\beta}_{0}^{*}\|_{\infty}\leq\kappa_{0},\|\boldsymbol{\beta}_{1}^{*}\|_{\infty}\leq\kappa_{1}$ and $\kappa_{0}+\kappa_{1}\leq c\log(np)$ for some small enough constant $c>0$ , the $\ell_{\infty}$ error bound of $\widehat{\boldsymbol{\theta}}$ is of a smaller order asymptotically and thus gives a tighter convergence rate.
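To get a feel for how the rates in Theorems 2 and 3 compare with the radius $r$ of Theorem 1, the following sketch evaluates the three expressions with every unspecified constant set to one. That normalisation (including `c_r=1.0`) is an assumption made only for illustration, so the printed numbers indicate orders of magnitude rather than actual error bounds; the function names and the choice $(n,p)=(100,1000)$ are likewise hypothetical.

```python
import math


def common_factor(n, p):
    """sqrt(log(np) / (np)) * (1 + log(np) / sqrt(p)), shared by both rates."""
    log_np = math.log(n * p)
    return math.sqrt(log_np / (n * p)) * (1.0 + log_np / math.sqrt(p))


def l2_rate(n, p, kappa0, kappa1):
    """Right-hand side of Theorem 2 with the implicit constant set to 1."""
    return math.exp(4 * kappa0 + 4 * kappa1) * common_factor(n, p)


def linf_rate(n, p, kappa0, kappa1):
    """Right-hand side of Theorem 3 with the implicit constant set to 1.
    Requires n * p > e so that log(log(np)) is positive."""
    return (math.exp(8 * kappa0 + 8 * kappa1)
            * math.log(math.log(n * p)) * common_factor(n, p))


def radius(kappa0, kappa1, c_r=1.0):
    """Radius r = c_r * exp(-4 kappa0 - 4 kappa1) from Theorem 1."""
    return c_r * math.exp(-4 * kappa0 - 4 * kappa1)


n, p = 100, 1000
for kappa in (0.0, 0.1, 0.5, 1.0):
    print(f"kappa0 = kappa1 = {kappa}: "
          f"l2 rate = {l2_rate(n, p, kappa, kappa):.3g}, "
          f"l_inf rate = {linf_rate(n, p, kappa, kappa):.3g}, "
          f"radius = {radius(kappa, kappa):.3g}")
```

Because the rate grows like $e^{8\kappa_{0}+8\kappa_{1}}$ while the radius shrinks like $e^{-4\kappa_{0}-4\kappa_{1}}$ , the rate stays below the radius only for moderate $\kappa_{0}+\kappa_{1}$ at a given $np$ , which is the role played by the small constant $c$ .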

We remark that in the MLE, $\boldsymbol{\beta}_{0}^{*}$ and $\boldsymbol{\beta}_{1}^{*}$ are estimated jointly. As we can see from the log-likelihood function, the information related to $\beta_{i,0}$ is captured by $X_{i,j}^{t}$ and $X_{i,j}^{t}X_{i,j}^{t-1}$ , $t=1,\ldots,n,j\neq i$ , while that related to $\beta_{i,1}$ is captured by $(1-X_{i,j}^{t})(1-X_{i,j}^{t-1})$ and $X_{i,j}^{t}X_{i,j}^{t-1}$ , $t=1,\ldots,n,j\neq i$ . This indicates that the effective “sample sizes” for estimating $\beta_{i,0}$ and $\beta_{i,1}$ are both of the order $O(np)$ . While the theorems established in this section are for $\widehat{\boldsymbol{\theta}}=({\widehat{\boldsymbol{\beta}}_{0}}^{\top},{\widehat{\boldsymbol{\beta}}_{1}}^{\top})^{\top}$ jointly, we would expect ${\widehat{\boldsymbol{\beta}}_{0}}$ and ${\widehat{\boldsymbol{\beta}}_{1}}$ to have the same rate of convergence.

3.4 A method of moments estimator

Having established the existence of a unique local MLE in ${\mathbf{B}}_{\infty}\left(\boldsymbol{\theta}^{*},r\right)$ and proved its convergence, we still need to specify how to find this local MLE. To this end, we propose an initial estimator lying in this neighborhood. Thanks to the convexity of the loss function in this neighborhood, we can then adopt any convex optimization method, such as the coordinate descent algorithm, to locate the local MLE. Based on (2.3), an initial estimator of $\boldsymbol{\beta}_{0}$ , denoted as $\tilde{\boldsymbol{\beta}}_{0}$ , can be found by solving the following method of moments equations

(3.8)

\frac{\sum_{t=1}^{n}\sum_{j=1,\>j\neq i}^{p}X_{i,j}^{t}}{n}-\sum_{j=1,\>j\neq i}^{p}\frac{e^{\beta_{i,0}+\beta_{j,0}}}{1+e^{\beta_{i,0}+\beta_{j,0}}}=0,\quad i=1,\cdots,p.

These equations can be viewed as the score functions of the pseudo loss function $f(\boldsymbol{\beta}_{0}):=\sum_{1\leq i<j\leq p}\log\{1+e^{\beta_{i,0}+\beta_{j,0}}\}-n^{-1}\sum_{i=1}^{p}\{\beta_{i,0}{\sum_{t=1}^{n}\sum_{j=1,\>j\neq i}^{p}X_{i,j}^{t}}\}$ . Since the Hessian matrix of $f(\boldsymbol{\beta}_{0})$ is diagonally balanced with positive elements, it is positive definite, and hence $f(\boldsymbol{\beta}_{0})$ is strongly convex. With the strong convexity, the solution of (3.8) is the minimizer of $f(\cdot)$ , which can be easily obtained using any standard algorithm such as gradient descent. On the other hand, note that

{\rm E}(X_{i,j}^{t}X_{i,j}^{t-1})=\frac{e^{{\beta}_{i,0}+{\beta}_{j,0}}}{1+e^{{\beta}_{i,0}+{\beta}_{j,0}}}\left(1-\frac{1}{1+e^{{\beta}_{i,0}+{\beta}_{j,0}}+e^{\beta_{i,1}+\beta_{j,1}}}\right),

which motivates the use of the following estimating equations to obtain $\tilde{\boldsymbol{\beta}}_{1}$ , the initial estimator of $\boldsymbol{\beta}_{1}$ ,

(3.9)

\sum_{t=1}^{n}\sum_{j=1,\>j\neq i}^{p}\left\{X_{i,j}^{t}X_{i,j}^{t-1}-\frac{e^{\tilde{\beta}_{i,0}+\tilde{\beta}_{j,0}}}{1+e^{\tilde{\beta}_{i,0}+\tilde{\beta}_{j,0}}}\left(1-\frac{1}{1+e^{\tilde{\beta}_{i,0}+\tilde{\beta}_{j,0}}+e^{\beta_{i,1}+\beta_{j,1}}}\right)\right\}=0,

with $i=1,\cdots,p$ . Similar to (3.8), we can formulate a pseudo loss function such that, given $\tilde{\boldsymbol{\beta}}_{0}$ , its Hessian matrix corresponding to the score equations (3.9) is positive definite, and hence (3.9) can also be solved via the standard gradient descent algorithm. Since $\tilde{\boldsymbol{\theta}}=(\tilde{\boldsymbol{\beta}}_{0}^{\top},\tilde{\boldsymbol{\beta}}_{1}^{\top})^{\top}$ is obtained by solving two sets of moment equations, we call it the method of moments estimator (MME). An interesting aspect of our construction of these moment equations is that the equations corresponding to the estimation of $\boldsymbol{\beta}_{0}$ and $\boldsymbol{\beta}_{1}$ are decoupled. Although the error in estimating $\boldsymbol{\beta}_{0}$ clearly propagates into that of estimating $\boldsymbol{\beta}_{1}$ , we can still establish the existence and uniqueness of $\tilde{\boldsymbol{\theta}}$ together with a uniform upper bound for its estimation error. Our results build on a novel application of the classical interior mapping theorem [6, 38, 37].

Theorem 4.

Let condition (A1) hold, and $\{{\mathbf{X}}^{t}\}\sim P_{\boldsymbol{\theta}^{*}}$ . The MME $\tilde{\boldsymbol{\theta}}$ defined by equations (3.8) and (3.9) exists and is unique in probability. Further, assume that $\kappa_{r}:=\|\boldsymbol{\beta}_{r}^{*}\|_{\infty}$ where $r=0,1$ with $\kappa_{0}+\kappa_{1}\leq c\log(np)$ for some small enough constant $c>0$ . Then as $np\to\infty$ and $n\geq 2$ , it holds that

\left\|\tilde{\boldsymbol{\theta}}-\boldsymbol{\theta}^{*}\right\|_{\infty}\leq O_{p}\left(e^{14\kappa_{0}+6\kappa_{1}}\sqrt{\frac{\log(n)\log(p)}{np}}\right).

When $np\to\infty$ and $\kappa_{0},\kappa_{1}$ are finite, Theorem 4 gives $\left\|\tilde{\boldsymbol{\theta}}-\boldsymbol{\theta}^{*}\right\|_{\infty}=O_{p}\left(\sqrt{\frac{\log(n)\log(p)}{np}}\right)$ . When $\kappa_{0}+\kappa_{1}\asymp\log(np)$ , we see that the upper bound for the local MLE in Theorem 3 is dominated by the upper bound of the MME in Theorem 4. Moreover, when $\kappa_{0}+\kappa_{1}\leq c\log(np)$ for some small enough constant $c>0$ , we have $\tilde{\boldsymbol{\theta}}\in{\mathbf{B}}_{\infty}\left(\boldsymbol{\theta}^{*},r\right)$ , where $r$ is defined in Theorem 1. Thus, $\tilde{\boldsymbol{\theta}}$ is in the small neighborhood of $\boldsymbol{\theta}^{*}$ as required.
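The two-stage construction of the MME in (3.8) and (3.9) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the inputs are the per-node averages $n^{-1}\sum_{t}\sum_{j\neq i}X_{i,j}^{t}$ and $n^{-1}\sum_{t}\sum_{j\neq i}X_{i,j}^{t}X_{i,j}^{t-1}$ , both equations are solved by plain gradient descent from a zero start, and the fixed step size $2/(p-1)$ is our own choice, based on the fact that each pairwise derivative is at most $1/4$ . All function names are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def pair_sums(v):
    """Matrix with (i, j) entry v_i + v_j."""
    return v[:, None] + v[None, :]


def row_sums_offdiag(M):
    """Row sums over j != i."""
    return M.sum(axis=1) - np.diag(M)


def lag_product_mean(beta0, beta1):
    """E(X^t X^{t-1}) for every pair, as in the display preceding (3.9)."""
    e0 = np.exp(pair_sums(beta0))
    e1 = np.exp(pair_sums(beta1))
    return e0 / (1.0 + e0) * (1.0 - 1.0 / (1.0 + e0 + e1))


def solve_scores(target, fitted, p, n_iter=20000, tol=1e-12):
    """Gradient descent on a pseudo loss whose score for node i is
    sum_{j != i} fitted(b)[i, j] - target[i]."""
    b = np.zeros(p)
    step = 2.0 / (p - 1)   # assumed step size; the Hessian norm is at most (p - 1) / 2
    for _ in range(n_iter):
        grad = row_sums_offdiag(fitted(b)) - target
        if np.max(np.abs(grad)) < tol:
            break
        b -= step * grad
    return b


def mme(mean_edges, mean_lag_products):
    """Stage 1 solves (3.8) for beta0; stage 2 plugs it into (3.9) for beta1."""
    p = mean_edges.shape[0]
    beta0 = solve_scores(mean_edges, lambda b: sigmoid(pair_sums(b)), p)
    beta1 = solve_scores(mean_lag_products,
                         lambda b: lag_product_mean(beta0, b), p)
    return beta0, beta1


# Noise-free check: feed in the exact expectations and recover the parameters.
p = 15
beta0_true = np.linspace(-0.5, 0.5, p)
beta1_true = np.linspace(0.5, -0.5, p)
mean_edges = row_sums_offdiag(sigmoid(pair_sums(beta0_true)))
mean_lag = row_sums_offdiag(lag_product_mean(beta0_true, beta1_true))
beta0_hat, beta1_hat = mme(mean_edges, mean_lag)
print(np.max(np.abs(beta0_hat - beta0_true)), np.max(np.abs(beta1_hat - beta1_true)))
```

With observed data, one would replace the exact expectations by the corresponding sample averages; for snapshots stored in a hypothetical array `X` of shape `(n + 1, p, p)` with zero diagonals, these are `X[1:].sum(axis=(0, 2)) / n` and `(X[1:] * X[:-1]).sum(axis=(0, 2)) / n`.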

3.5 The sparse case

In the previous results, the estimation error bounds depend on $\kappa_{0}$ and $\kappa_{1}$ , i.e., the upper bounds on $\|\boldsymbol{\beta}_{0}^{*}\|_{\infty}$ and $\|\boldsymbol{\beta}_{1}^{*}\|_{\infty}$ . Clearly, the larger $\kappa_{0}$ is, the sparser the networks can be, and the larger $\kappa_{1}$ is, the closer the lag-one correlations (cf. equation (2.4)) can be to one, indicating fewer fluctuations in the network process. To further characterize the effect of network sparsity, in this section we derive further properties under a relatively sparse scenario where $-\kappa_{0}\leq\beta_{i,0}^{*}\leq C_{\kappa}$ and $-\kappa_{1}\leq\beta_{i,1}^{*}\leq\kappa_{1}$ for all $i=1,\ldots,p$ , where $C_{\kappa}>0$ is a constant. In this case, there exist constants $C>0$ and $C_{1}>0$ such that $Ce^{-2\kappa_{0}}\leq{\rm E}\left(X_{i,j}^{t}\right)\leq C_{1}<1.$ In the sparsest case where $\beta_{i,0}=-\kappa_{0},i=1,\ldots,p$ , the density of the stationary network is of the order $O(e^{-2\kappa_{0}})$ . Similar to Lemma 2 and Theorem 1, the following corollary provides a lower bound for the smallest eigenvalue of ${\rm E}({\mathbf{V}}(\boldsymbol{\theta}))$ and the existence of the MLE.

Corollary 1.

Let $\{{\mathbf{X}}^{t}\}\sim P_{\boldsymbol{\theta}^{*}}$ , ${\mathbf{B}}_{\infty}\left(\boldsymbol{\theta}^{*},r\right)=\left\{\boldsymbol{\theta}:\|\boldsymbol{\theta}-\boldsymbol{\theta}^{*}\|_{\infty}\leq r\right\}$ for some $r=c_{r}e^{-2\kappa_{0}-4\kappa_{1}}$ where $c_{r}>0$ is a small enough constant. Denote ${\mathbf{B}}^{\prime}\left(\kappa_{0},\kappa_{1}\right):=\Big{\{}\left(\boldsymbol{\beta}_{0},\boldsymbol{\beta}_{1}\right):-\kappa_{0}\leq\boldsymbol{\beta}_{i,0}\leq C_{\kappa},i=1,\ldots,p,$ $\|\boldsymbol{\beta}_{1}\|_{\infty}\leq\kappa_{1}\Big{\}}$ for some constant $C_{\kappa}>0$ . Then, under condition (A1), there exists a constant $C>0$ such that

\inf_{\boldsymbol{\theta}\in{\mathbf{B}}_{\infty}\left(\boldsymbol{\theta}^{*},r\right)\cap{\mathbf{B}}^{\prime}\left(\kappa_{0},\kappa_{1}\right);\|{\mathbf{a}}\|_{2}=1}{\mathbf{a}}^{\top}{\rm E}\left({\mathbf{V}}(\boldsymbol{\theta})\right){\mathbf{a}}\geq Ce^{-2\kappa_{0}-4\kappa_{1}}.

Further, assume that $\boldsymbol{\theta}^{*}\in{\mathbf{B}}^{\prime}\left(\kappa_{0},\kappa_{1}\right)$ and $\kappa_{0}+2\kappa_{1}<c\log(np)$ for some positive constant $c<1/6$ . Then, as $np\rightarrow\infty$ with $n\geq 2$ , with probability tending to 1, there exists a unique MLE in ${\mathbf{B}}_{\infty}\left(\boldsymbol{\theta}^{*},r\right)$ .

Following Theorems 2-4, we also establish the estimation errors for the MLE and MME in the corollaries below.

Corollary 2.

\frac{1}{\sqrt{p}}\left\|\widehat{\boldsymbol{\theta}}-\boldsymbol{\theta}^{*}\right\|_{2}\leq Ce^{2\kappa_{0}+4\kappa_{1}}\sqrt{\frac{\log(np)}{np}}\left(1+\frac{\log(np)}{\sqrt{p}}\right),

{\rm and}~{}~{}~{}\left\|\widehat{\boldsymbol{\theta}}-\boldsymbol{\theta}^{*}\right\|_{\infty}\leq Ce^{4\kappa_{0}+8\kappa_{1}}\log\log(np)\sqrt{\frac{\log(np)}{np}}\left(1+\frac{\log(np)}{\sqrt{p}}\right).

Corollary 3.

Let condition (A1) hold. Assume $\{{\mathbf{X}}^{t}\}\sim P_{\boldsymbol{\theta}^{*}}$ , $\|\boldsymbol{\beta}_{1}^{*}\|_{\infty}\leq\kappa_{1}$ , and $-\kappa_{0}\leq\beta_{i,0}^{*}\leq C_{\kappa}$ for $i=1,\ldots,p$ and some constant $C_{\kappa}>0$ . Then as $np\rightarrow\infty$ with $n\geq 2$ , it holds with probability tending to one that the MME $\tilde{\boldsymbol{\theta}}$ defined in equations (3.8) and (3.9) exists uniquely, and when $\kappa_{0}+2\kappa_{1}<c\log(np)$ for some constant $c<1/12$ , it holds that

\left\|\tilde{\boldsymbol{\theta}}-\boldsymbol{\theta}^{*}\right\|_{\infty}\leq O_{p}\left(e^{4\kappa_{0}+6\kappa_{1}}\sqrt{\frac{\log(n)\log(p)}{np}}\right).

From Corollary 2, we can see that when $\kappa_{1}\asymp O(1)$ , the MLE is consistent when $\kappa_{0}\leq c\log(np)$ for some positive constant $c<1/8$ , with the corresponding lower bound in the density as $O(e^{-2c\log(np)})\succ O((np)^{-1/4})$ . Similarly, from Corollary 3 we can see that when $\kappa_{1}\asymp O(1)$ , the density of the networks can be as small as $O(e^{-2c\log(np)})$ for some constant $c<1/12$ , i.e., the density has a larger order than $(np)^{-1/6}$ for the estimation of the MME. Further, when $6\kappa_{0}+10\kappa_{1}\leq c_{1}\log(np)$ for some constant $c_{1}<1/2$ , we have $\tilde{\boldsymbol{\theta}}\in{\mathbf{B}}_{\infty}\left(\boldsymbol{\theta}^{*},r\right)$ , where $r$ is defined in Corollary 1. This implies the validity of using $\tilde{\boldsymbol{\theta}}$ as an initial estimator for computing the local MLE.

4 A uniform local deviation bound under high dimensionality

As we have discussed, a key to establishing the consistency of the local MLE is to evaluate the magnitude of $\big{|}[l(\boldsymbol{\theta})-{\rm E}l(\boldsymbol{\theta})]-[l(\boldsymbol{\theta}^{*})-{\rm E}l(\boldsymbol{\theta}^{*})]\big{|}$ for all $\boldsymbol{\theta}\in{\mathbf{B}}_{\infty}\left(\boldsymbol{\theta}^{*},r\right)$ with $r$ specified in Theorem 1. Such local deviation bounds are important for establishing error bounds for general M-estimators in empirical process theory [32]. Note that

(4.10)

l(\boldsymbol{\theta})-{\rm E}l(\boldsymbol{\theta})=-\frac{1}{p}\sum_{1\leq i<j\leq p}\Bigg\{\left(\beta_{i,0}+\beta_{j,0}\right)\Big(\frac{a_{i,j}-{\rm E}(a_{i,j})}{n}\Big)+\log\left(1+e^{\left(\beta_{i,1}+\beta_{j,1}\right)}\right)\Big(\frac{d_{i,j}-{\rm E}(d_{i,j})}{n}\Big)+\log\left(1+e^{\left(\beta_{i,1}-\beta_{i,0}\right)+\left(\beta_{j,1}-\beta_{j,0}\right)}\right)\Big(\frac{b_{i,j}-{\rm E}(b_{i,j})}{n}\Big)\Bigg\}

where $a_{i,j},b_{i,j}$ and $d_{i,j}$ are defined in (3.7). The three terms on the right-hand side all admit the following form

(4.11)

{\mathbf{L}}\left(\boldsymbol{\theta}\right)=\frac{1}{p}\sum_{1\leq i\neq j\leq p}l_{i,j}\left(\theta_{i},\theta_{j}\right)Y_{i,j},

for some functions ${\mathbf{L}}:\mathbb{R}^{p}\to\mathbb{R}$ , $l_{i,j}:\mathbb{R}^{2}\to\mathbb{R}$ , and centered random variables $Y_{i,j}$ $(1\leq i,j\leq p)$ . Instead of establishing the uniform bound for each term in (4.10) separately, below we will establish a unified result for bounding $|{\mathbf{L}}\left(\boldsymbol{\theta}\right)-{\mathbf{L}}\left(\boldsymbol{\theta}^{\prime}\right)|$ over a local $\ell_{\infty}$ ball defined as $\boldsymbol{\theta}\in{\mathbf{B}}_{\infty}(\boldsymbol{\theta}^{\prime},\cdot)$ for a general ${\mathbf{L}}$ function as in (4.11). We remark that in general without further assumptions on ${\mathbf{L}}$ , establishing uniform deviation bounds is impossible when the dimension of the problem diverges. For our TWHM however, the decomposition (4.10) is of a particularly appealing structure in the sense that only two-way interactions between parameters $\theta_{i}$ exist. Based on this “sparsity” structure, we develop a novel reformulation (c.f. equation (A.15)) for the main components of the Taylor series of ${\mathbf{L}}(\boldsymbol{\theta})$ satisfying the following two conditions.

(L-A1)

There exists a constant $\alpha>0$ , such that for any $1\leq i\neq j\leq p$ , any positive integer $k$ , and any non-negative integer $s\leq k$ , we have:

\left|\frac{\partial^{k}l_{i,j}\left(\theta_{i},\theta_{j}\right)}{\partial\theta_{i}^{s}\partial\theta_{j}^{k-s}}\right|\leq\frac{\left(k-1\right)!}{\alpha^{k}}.

(L-A2)

Random variables $Y_{i,j},1\leq i\neq j\leq p$ are independent satisfying ${\rm E}\left(Y_{i,j}\right)=0$ , $|Y_{i,j}|\leq b_{(p)}$ and ${\rm Var}\left(Y_{i,j}\right)\leq\sigma_{(p)}^{2}$ for any $i$ and $j$ , where $b_{(p)}$ and $\sigma_{(p)}^{2}$ are constants depending on $n$ and $p$ but independent of $i$ and $j$ .

Loosely speaking, Condition (L-A1) can be seen as a smoothness assumption on the higher order derivatives of $l_{i,j}\left(\theta_{i},\theta_{j}\right)$ so that we can properly bound these derivatives when Taylor expansion is applied. On the other hand, the upper bound for these derivatives is mild as it can diverge very quickly as $k$ increases. For our TWHM, it can be verified that (L-A1) holds for $l_{i,j}(\theta_{i},\theta_{j})=\theta_{i}+\theta_{j}$ and $l_{i,j}(\theta_{i},\theta_{j})=\log(1+e^{\theta_{i}+\theta_{j}})$ ; see (3.6). For the latter, note that the first derivative of the function $l(x)=\log(1+e^{x})$ is the sigmoid function:

S\left(x\right)=\frac{e^{x}}{1+e^{x}}=\frac{1}{1+e^{-x}}.

By the expression of the higher order derivatives of the Sigmoid function [26], the $k$ -th order derivative of $l$ is

\frac{\partial^{k}l\left(x\right)}{\partial x^{k}}=\frac{\sum_{m=0}^{k-2}-A\left(k-1,m\right)\left(-e^{x}\right)^{m+1}}{\left(1+e^{x}\right)^{k}},

where $k\geq 2$ and $A\left(k-1,m\right)$ is the Eulerian number. Now for any $x$ , we have

\left|\frac{\sum_{m=0}^{k-2}-A\left(k-1,m\right)\left(-e^{x}\right)^{m+1}}{\left(1+e^{x}\right)^{k}}\right|\leq\sum_{m=0}^{k-2}A\left(k-1,m\right)=\left(k-1\right)!.

Therefore,

\left|\frac{\partial^{k}l\left(x\right)}{\partial x^{k}}\right|\leq{\left(k-1\right)!}

holds for all $x\in\mathbb{R}$ and $k\geq 2$ . With extra arguments using the chain rule, this in turn implies that (L-A1) is satisfied with $\alpha=1$ when $l_{i,j}(\boldsymbol{\theta})=\log\left(1+e^{\theta_{i}+\theta_{j}}\right)$ .
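The derivative formula and the bound $(k-1)!$ can be verified numerically with a short sketch. It computes the Eulerian numbers through their standard recurrence $A(n,m)=(n-m)A(n-1,m-1)+(m+1)A(n-1,m)$ , evaluates the displayed expression for the $k$ -th derivative of $l(x)=\log(1+e^{x})$ , and compares it with the closed forms of the second and third derivatives. The function names and the evaluation grid are our own choices for illustration.

```python
import math
import numpy as np


def eulerian_row(n):
    """Return [A(n, 0), ..., A(n, n - 1)] for n >= 1 via the standard recurrence."""
    row = [1]
    for size in range(2, n + 1):
        prev = row
        row = [
            (size - m) * (prev[m - 1] if m >= 1 else 0)
            + (m + 1) * (prev[m] if m < len(prev) else 0)
            for m in range(size)
        ]
    return row


def softplus_derivative(k, x):
    """k-th derivative (k >= 2) of log(1 + e^x) from the Eulerian-number formula.
    Note that -A * (-e^x)^(m + 1) equals (-1)^m * A * e^((m + 1) x)."""
    x = np.asarray(x, dtype=float)
    coeffs = eulerian_row(k - 1)
    numerator = sum((-1) ** m * a * np.exp((m + 1) * x)
                    for m, a in enumerate(coeffs))
    return numerator / (1.0 + np.exp(x)) ** k


x = np.linspace(-8.0, 8.0, 201)
s = 1.0 / (1.0 + np.exp(-x))
for k in range(2, 10):
    largest = np.max(np.abs(softplus_derivative(k, x)))
    print(f"k = {k}: max |l^(k)| on grid = {largest:.4f}, (k-1)! = {math.factorial(k - 1)}")
```

On this grid the observed maxima are far below $(k-1)!$ , in line with the remark that the bound required by (L-A1) is mild.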

Condition (L-A2) is a regularity assumption for the random variables $Y_{i,j},1\leq i,j\leq p$ , and the bounds on their moments are imposed to ensure point-wise concentration. For our TWHM, from Lemma 1 and Lemma 5, we have that there exist large enough constants $C>0$ and $c>0$ such that with probability greater than $1-(np)^{-c}$ , the random variables $\frac{a_{i,j}-{\rm E}(a_{i,j})}{n}$ , $\frac{b_{i,j}-{\rm E}(b_{i,j})}{n}$ and $\frac{d_{i,j}-{\rm E}(d_{i,j})}{n}$ all satisfy condition (L-A2) with $b_{(p)}=C\sqrt{n^{-1}\log(np)}+Cn^{-1}\log\left(n\right)\log\log\left(n\right)\log\left(np\right)$ and $\sigma_{(p)}^{2}=Cn^{-1}$ .

We present the uniform upper bound on the deviation of ${\mathbf{L}}(\boldsymbol{\theta})$ below.

Theorem 5.

Assume conditions (L-A1) and (L-A2). For any given $\boldsymbol{\theta}^{\prime}\in\mathbb{R}^{p}$ and $\alpha_{0}\in(0,\alpha/2)$ , there exist large enough constants $C>0$ and $c>0$ which are independent of $\boldsymbol{\theta}^{\prime}$ , such that, as $np\to\infty$ , with probability greater than $1-(np)^{-c}$ ,

\left|{\mathbf{L}}\left(\boldsymbol{\theta}\right)-{\mathbf{L}}\left(\boldsymbol{\theta}^{\prime}\right)\right|\leq C\frac{b_{(p)}\log(np)+\sigma_{(p)}\sqrt{p\log(np)}}{p}\left\|\boldsymbol{\theta}-\boldsymbol{\theta}^{\prime}\right\|_{1}

holds uniformly for all $\boldsymbol{\theta}\in{\mathbf{B}}_{\infty}\left(\boldsymbol{\theta}^{\prime},\alpha_{0}\right)$ .

One of the main difficulties in analyzing ${\mathbf{L}}(\boldsymbol{\theta})$ defined in (4.11) is that $l_{i,j}(\theta_{i},\theta_{j})$ and $Y_{i,j}$ are coupled, giving rise to complex terms involving both in the Taylor expansion of ${\mathbf{L}}(\boldsymbol{\theta})$ . When Taylor expansion with order $K$ is used, condition (L-A1) can reduce the number of higher order terms from $O(p^{K})$ to $O(p^{2}2^{K})$ . On the other hand, by formulating the main terms in the Taylor series into a matrix form in (A.15), the uniform convergence of the sum of these terms is equivalent to that of the spectral norm of a centered random matrix, which is independent of the parameters. Further details can be found in the proof of Theorem 5.
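The quantities in Theorem 5 can be explored with a small simulation. The sketch below takes $l_{i,j}(\theta_{i},\theta_{j})=\log(1+e^{\theta_{i}+\theta_{j}})$ , draws independent $\pm 1$ variables for $Y_{i,j}$ (so that $b_{(p)}=\sigma_{(p)}=1$ ), and reports the largest observed ratio of $|{\mathbf{L}}(\boldsymbol{\theta})-{\mathbf{L}}(\boldsymbol{\theta}^{\prime})|$ to the right-hand side of the bound with $C=1$ . These distributional choices, the reference point $\boldsymbol{\theta}^{\prime}=\mathbf{0}$ , the values $(n,p,\alpha_{0})=(2,100,0.25)$ and the set of perturbations tried are all assumptions for illustration; a finite search of this kind cannot certify a uniform bound.

```python
import numpy as np


def L_value(theta, Y):
    """L(theta) = (1/p) * sum_{i != j} log(1 + exp(theta_i + theta_j)) * Y_ij."""
    p = theta.shape[0]
    l_vals = np.logaddexp(0.0, theta[:, None] + theta[None, :])
    off_diag = ~np.eye(p, dtype=bool)
    return np.sum(l_vals[off_diag] * Y[off_diag]) / p


def rate_factor(n, p, b, sigma):
    """(b * log(np) + sigma * sqrt(p * log(np))) / p, i.e. the bound with C = 1."""
    log_np = np.log(n * p)
    return (b * log_np + sigma * np.sqrt(p * log_np)) / p


rng = np.random.default_rng(1)
n, p, alpha0 = 2, 100, 0.25
Y = rng.choice([-1.0, 1.0], size=(p, p))   # centred, bounded by 1, variance 1
theta_ref = np.zeros(p)

# Random perturbations inside the l_infinity ball of radius alpha0, plus one
# direction aligned with the signs of the first-order term of the expansion.
row_signal = (Y + Y.T).sum(axis=1) - 2.0 * np.diag(Y)
deltas = [rng.uniform(-alpha0, alpha0, size=p) for _ in range(200)]
deltas.append(alpha0 * np.sign(row_signal))

base = L_value(theta_ref, Y)
scale = rate_factor(n, p, b=1.0, sigma=1.0)
ratios = [abs(L_value(theta_ref + d, Y) - base) / (scale * np.abs(d).sum())
          for d in deltas]
max_ratio = max(ratios)
print(f"largest observed ratio: {max_ratio:.3f}")
```

The sign-aligned perturbation is included because random directions benefit from cancellation across nodes and therefore tend to give much smaller ratios.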

Define the marginal functions of ${\mathbf{L}}\left(\boldsymbol{\theta}\right)$ as

{\mathbf{L}}_{i}\left(\boldsymbol{\theta}\right)=\frac{1}{p}\sum_{j=1,\>j\neq i}^{p}l_{i,j}\left(\theta_{i},\theta_{j}\right)Y_{i,j},\quad i=1,\ldots,p,

by retaining only those terms related to $\theta_{i}$ . Similar to Theorem 5, we state the following upper bound for these marginal functions. With some abuse of notation, let $\boldsymbol{\theta}_{-i}:=\left(\theta_{1},\cdots,\theta_{i-1},\theta_{i+1},\cdots,\theta_{p}\right)^{\top}$ be the vector containing all the elements in $\boldsymbol{\theta}$ except $\theta_{i}$ .

Theorem 6.

If conditions (L-A1) and (L-A2) hold, then for any given $\boldsymbol{\theta}^{\prime}\in\mathbb{R}^{p}$ and $\alpha_{0}\in(0,\alpha/2)$ , there exist large enough constants $C>0$ and $c>0$ which are independent of $\boldsymbol{\theta}^{\prime}$ , such that, as $np\to\infty$ , with probability greater than $1-(np)^{-c}$ ,

\left|{\mathbf{L}}_{i}\left(\boldsymbol{\theta}\right)-{\mathbf{L}}_{i}\left(\boldsymbol{\theta}^{\prime}\right)\right|\\ \leq C\frac{b_{(p)}}{p}\left\|\boldsymbol{\theta}_{-i}-\boldsymbol{\theta}_{-i}^{\prime}\right\|_{1}+C\left(\left\|\boldsymbol{\theta}_{-i}-\boldsymbol{\theta}_{-i}^{\prime}\right\|_{1}+1\right)\left|\theta_{i}-\theta^{\prime}_{i}\right|\frac{b_{(p)}\log(np)+\sigma_{(p)}\sqrt{p\log(np)}}{p}

holds uniformly for all $\boldsymbol{\theta}\in{\mathbf{B}}_{\infty}\left(\boldsymbol{\theta}^{\prime},\alpha_{0}\right)$ , and $i=1,\cdots,p$ .

Similar to (4.10), we can also decompose $l\left(\boldsymbol{\theta}_{(i)},\boldsymbol{\theta}_{(-i)}\right)-{\rm E}l\left(\boldsymbol{\theta}_{(i)},\boldsymbol{\theta}_{(-i)}\right)$ into the sum of three components taking the form (4.11). Consequently, by setting $\boldsymbol{\theta}^{\prime}$ in Theorems 5 and 6 to be the true parameter $\boldsymbol{\theta}^{*}$ , we can obtain the following upper bounds.

Corollary 4.

For any given $0<\alpha_{0}<1/4$ , there exist large enough positive constants $c_{1},c_{2},C_{1}$ and $C_{2}$ such that

(i)

with probability greater than $1-(np)^{-c_{1}}$ ,

(4.12)

\left|\big{(}l(\boldsymbol{\theta})-l(\boldsymbol{\theta}^{*})\big{)}-\big{(}{\rm E}l(\boldsymbol{\theta})-{\rm E}l(\boldsymbol{\theta}^{*})\big{)}\right|\leq C_{1}\left(1+\frac{\log(np)}{\sqrt{p}}\right)\sqrt{\frac{\log(np)}{n}}\left\|\boldsymbol{\theta}-\boldsymbol{\theta}^{*}\right\|_{2}

holds uniformly for all $\boldsymbol{\theta}\in{\mathbf{B}}_{\infty}\left(\boldsymbol{\theta}^{*},\alpha_{0}\right)$ with some constant $\alpha_{0}<1/2$ ;

(ii)

with probability greater than $1-(np)^{-c_{2}}$ ,

(4.13)

\left|l\left(\boldsymbol{\theta}_{(i)},\boldsymbol{\theta}_{(-i)}^{*}\right)-l\left(\boldsymbol{\theta}_{(i)}^{*},\boldsymbol{\theta}_{(-i)}^{*}\right)-\left[{\rm E}l\left(\boldsymbol{\theta}_{(i)},\boldsymbol{\theta}_{(-i)}^{*}\right)-{\rm E}l\left(\boldsymbol{\theta}_{(i)}^{*},\boldsymbol{\theta}_{(-i)}^{*}\right)\right]\right|\\ \leq C_{2}\left(1+\frac{\log(np)}{\sqrt{p}}\right)\sqrt{\frac{\log(np)}{n}}\left\|\boldsymbol{\theta}_{(i)}-\boldsymbol{\theta}^{*}_{(i)}\right\|_{2}

holds uniformly for all $\boldsymbol{\theta}_{(i)}\in{\mathbf{B}}_{\infty}\left(\boldsymbol{\theta}^{*}_{(i)},\alpha_{0}\right)$ with some constant $\alpha_{0}<1/2$ .

In (4.12) and (4.13) we have replaced the $\ell_{1}$ norm based upper bounds in Theorems 5 and 6 with $\ell_{2}$ norm based upper bounds using the fact that for all ${\mathbf{x}}\in\mathbb{R}^{p}$ , $\|{\mathbf{x}}\|_{1}\leq\sqrt{p}\|{\mathbf{x}}\|_{2}$ .

It is recognized that networks often exhibit diverse characteristics, including dynamic changes, node heterogeneity, homophily, and transitivity, among others. In this paper, our primary emphasis is on addressing node heterogeneity within dynamic networks. When integrated with other stylized features, the objective function may adopt a similar structure to the ${\mathbf{L}}(\boldsymbol{\theta})$ function defined in (4.11). Moreover, many other models that incorporate node heterogeneity have log-likelihood functions of a form analogous to (4.11). One example is the general category of network models with edge formation probabilities represented as $f(\alpha_{i},\beta_{j})$ , where $f(\cdot)$ is a density or probability mass function and $(\alpha_{i},\beta_{i})$ denotes the node-specific parameters of node $i$ . This category encompasses models such as the $p_{1}$ model [13], the directed $\beta$ -model [36], and the bivariate gamma model [8]. Additionally, in the domain of ranking data analysis, it is common to introduce individual-specific parameters or scores for ranking, as in the classical Bradley-Terry model and its variants [9]. Our results have the potential to be applied in the theoretical study of these models, or of their modifications, when additional stylized features are considered alongside node heterogeneity.

5 Numerical study

In this section, we assess the performance of the local MLE. For comparison, we have also computed a regularized MME that is numerically more stable than the vanilla MME in (3.9). Specifically, for the former, we solve

(5.14)

\frac{-1}{np}\sum_{t=1}^{n}\sum_{j=1,\>j\neq i}^{p}\left\{X_{i,j}^{t}X_{i,j}^{t-1}-\frac{e^{\tilde{\beta}_{i,0}+\tilde{\beta}_{j,0}}}{1+e^{\tilde{\beta}_{i,0}+\tilde{\beta}_{j,0}}}\left(1-\frac{1}{1+e^{\tilde{\beta}_{i,0}+\tilde{\beta}_{j,0}}+e^{\beta_{i,1}+\beta_{j,1}}}\right)\right\}+\lambda\beta_{i,1}=0,

with $i=1,\cdots,p$ , where $\lambda\beta_{i,1}$ can be seen as a ridge penalty with $\lambda>0$ as the regularization parameter. Denote the regularized MME as $\tilde{\boldsymbol{\theta}}_{\lambda}$ . Similar to Theorem 4, by choosing $\lambda=C_{\lambda}e^{2\kappa}\sqrt{\frac{\log(np)}{np}}$ for some constant $C_{\lambda}$ , we can show that $\left\|\tilde{\boldsymbol{\theta}}_{\lambda}-\boldsymbol{\theta}^{*}\right\|_{\infty}\leq O_{p}\left(e^{26\kappa}\sqrt{\frac{\log(n)\log(p)}{np}}\right).$ In our implementation we take $\lambda=\sqrt{\frac{\log(np)}{np}}$ . The MLE of TWHM is obtained via gradient descent using $\tilde{\boldsymbol{\theta}}_{\lambda}$ as the initial value.

5.1 Non-convexity of $l(\boldsymbol{\theta})$ and ${\rm E}l(\boldsymbol{\theta})$

Given the form of $l(\boldsymbol{\theta})$ , it is intuitively true that it may not be convex everywhere. We confirm this via a simple example. Take $(n,p)=(2,1000)$ and set $\boldsymbol{\beta}^{*}_{0},\boldsymbol{\beta}^{*}_{1}$ to be $\textbf{0.2}_{p}$ , $\textbf{0.5}_{p}$ or $\textbf{1}_{p}$ . We evaluate the smallest eigenvalue of the Hessian matrix of $l(\boldsymbol{\theta})$ and its expectation ${\rm E}l(\boldsymbol{\theta})$ at the true parameter value $\boldsymbol{\theta}^{*}=(\boldsymbol{\beta}^{*\top}_{0},\boldsymbol{\beta}^{*\top}_{1})^{\top}$ , or at $\boldsymbol{\theta}=\textbf{0}_{2p}$ in one experiment.

Table 1: Signs of the smallest eigenvalues of the Hessian matrices of $l(\boldsymbol{\theta})$ and ${\rm E}l(\boldsymbol{\theta})$ evaluated at $\boldsymbol{\theta}=\boldsymbol{\theta}^{*}$ or $\textbf{0}_{2p}$ when different values of $\boldsymbol{\theta}^{*}=(\boldsymbol{\beta}^{*\top}_{0},\boldsymbol{\beta}^{*\top}_{1})^{\top}$ are used to generate data.

Sign of the smallest eigenvalue of $l(\boldsymbol{\theta}^{*})$				Sign of the smallest eigenvalue of ${\rm E}l(\boldsymbol{\theta}^{*})$
	$\boldsymbol{\beta}^{*}_{0}$ = $\textbf{0.2}_{p}$	$\boldsymbol{\beta}^{*}_{0}$ = $\textbf{0.5}_{p}$	$\boldsymbol{\beta}^{*}_{0}$ = $\textbf{1}_{p}$		$\boldsymbol{\beta}^{*}_{0}$ = $\textbf{0.2}_{p}$	$\boldsymbol{\beta}^{*}_{0}$ = $\textbf{0.5}_{p}$	$\boldsymbol{\beta}^{*}_{0}$ = $\textbf{1}_{p}$
$\boldsymbol{\beta}^{*}_{1}$ = $\textbf{0.2}_{p}$	$+$	$+$	$+$	$\boldsymbol{\beta}^{*}_{1}$ = $\textbf{0.2}_{p}$	$+$	$+$	$+$
$\boldsymbol{\beta}^{*}_{1}$ = $\textbf{0.5}_{p}$	$+$	$+$	$+$	$\boldsymbol{\beta}^{*}_{1}$ = $\textbf{0.5}_{p}$	$+$	$+$	$+$
$\boldsymbol{\beta}^{*}_{1}$ = $\textbf{1}_{p}$	$+$	$+$	$+$	$\boldsymbol{\beta}^{*}_{1}$ = $\textbf{1}_{p}$	$+$	$+$	$+$
Sign of the smallest eigenvalue of $l(\textbf{0}_{2p})$				Sign of the smallest eigenvalue of ${\rm E}l(\textbf{0}_{2p})$
	$\boldsymbol{\beta}^{*}_{0}$ = $\textbf{0.2}_{p}$	$\boldsymbol{\beta}^{*}_{0}$ = $\textbf{0.5}_{p}$	$\boldsymbol{\beta}^{*}_{0}$ = $\textbf{1}_{p}$		$\boldsymbol{\beta}^{*}_{0}$ = $\textbf{0.2}_{p}$	$\boldsymbol{\beta}^{*}_{0}$ = $\textbf{0.5}_{p}$	$\boldsymbol{\beta}^{*}_{0}$ = $\textbf{1}_{p}$
$\boldsymbol{\beta}^{*}_{1}$ = $\textbf{0.2}_{p}$	$+$	$+$	$-$	$\boldsymbol{\beta}^{*}_{1}$ = $\textbf{0.2}_{p}$	$+$	$+$	$-$
$\boldsymbol{\beta}^{*}_{1}$ = $\textbf{0.5}_{p}$	$-$	$-$	$-$	$\boldsymbol{\beta}^{*}_{1}$ = $\textbf{0.5}_{p}$	$+$	$+$	$-$
$\boldsymbol{\beta}^{*}_{1}$ = $\textbf{1}_{p}$	$-$	$-$	$-$	$\boldsymbol{\beta}^{*}_{1}$ = $\textbf{1}_{p}$	$-$	$-$	$-$

From the top half of Table 1 we can see that, when evaluated at $\boldsymbol{\theta}^{*}$, the Hessian matrices are all positive definite. However, when evaluated at $\boldsymbol{\theta}=\textbf{0}_{2p}$, from the bottom half of the table we can see that the Hessian matrices are no longer positive definite when $\boldsymbol{\theta}^{*}$ is far away from $\textbf{0}_{2p}$. Even when the Hessian matrix of ${\rm E}l(\boldsymbol{\theta})$ is positive definite at $\boldsymbol{\theta}=\textbf{0}_{2p}$ with $\boldsymbol{\theta}^{*}=\textbf{0.5}_{2p}$, the corresponding Hessian matrix of $l(\boldsymbol{\theta})$ at this point has a negative eigenvalue. Thus, ${\rm E}l(\boldsymbol{\theta})$ and $l(\boldsymbol{\theta})$ are not globally convex.
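The sign pattern for ${\rm E}l(\boldsymbol{\theta})$ can be explored with a short Python sketch. It rests on several assumptions that are not spelled out in this section: that $l(\boldsymbol{\theta})$ is the negative log-likelihood up to a positive normalizing constant, that the data follow the stationary process implied by the transition probabilities in (5.15), and that all nodes share the same parameter values. Under these assumptions every pair contributes the same $2\times 2$ Hessian block $h$, and the full Hessian equals the Kronecker product of $h$ with the matrix $(p-2){\mathbf{I}}+\mathbf{1}\mathbf{1}^{\top}$, so the sign of its smallest eigenvalue does not depend on $p\geq 3$. The function names and the default $p=10$ are hypothetical.

```python
import numpy as np

def pair_hessian(a, b, a_star, b_star):
    """2x2 Hessian, in (a, b) = (beta_i0 + beta_j0, beta_i1 + beta_j1), of the
    expected negative log-likelihood of one pair under true sums (a_star, b_star)."""
    A0, A1 = np.exp(a_star), np.exp(b_star)
    S_star = 1.0 + A0 + A1
    pi = A0 / (1.0 + A0)                       # stationary edge probability
    w11 = pi * (A0 + A1) / S_star              # E[X^t X^{t-1}]
    w00 = (1.0 - pi) * (1.0 + A1) / S_star     # E[(1 - X^t)(1 - X^{t-1})]

    e0, e1 = np.exp(a), np.exp(b)
    S = 1.0 + e0 + e1
    pa, pb = e0 / S, e1 / S
    # Hessian of log(1 + e^a + e^b)
    H1 = np.array([[pa * (1 - pa), -pa * pb], [-pa * pb, pb * (1 - pb)]])
    # Hessian of log(1 + e^b)
    s = e1 / (1.0 + e1)
    H2 = np.array([[0.0, 0.0], [0.0, s * (1 - s)]])
    # Hessian of log(e^a + e^b)
    q = e0 / (e0 + e1)
    H3 = q * (1 - q) * np.array([[1.0, -1.0], [-1.0, 1.0]])
    return H1 - w00 * H2 - w11 * H3

def min_eig_sign(beta0_star, beta1_star, beta0=0.0, beta1=0.0, p=10):
    """Sign of the smallest Hessian eigenvalue of the expected loss at the
    homogeneous point (beta0, beta1), for homogeneous true parameters."""
    h = pair_hessian(2 * beta0, 2 * beta1, 2 * beta0_star, 2 * beta1_star)
    A = (p - 2) * np.eye(p) + np.ones((p, p))  # node-pair incidence structure
    H = np.kron(h, A)                          # full 2p x 2p Hessian
    return "+" if np.linalg.eigvalsh(H).min() > 0 else "-"

# Rows: beta1* in (0.2, 0.5, 1); columns: beta0* in (0.2, 0.5, 1); evaluated at 0.
grid = [[min_eig_sign(b0, b1) for b0 in (0.2, 0.5, 1.0)] for b1 in (0.2, 0.5, 1.0)]
```

The `grid` computed this way can be compared with the ${\rm E}l(\textbf{0}_{2p})$ block of Table 1; the Hessian of the sample version $l(\boldsymbol{\theta})$ additionally depends on the realized data and is not covered by this sketch.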

5.2 Parameter estimation

We first evaluate the error rates of the MLE and MME under different combinations of $n$ and $p$. We set $n=2,5,10,$ or $20$ and $p\in\lceil 200\times 1.2^{0:6}\rceil=\{200,240,288,346,415,498,598\}$, which results in a total of 28 different combinations of $(n,p)$. For each $(n,p)$, the data are generated such that $\{{\mathbf{X}}^{t}\}\sim P_{\boldsymbol{\theta}^{*}}$ where the parameters $\beta_{i,0}^{*}$ and $\beta_{i,1}^{*}$ ($1\leq i\leq p$) are drawn independently from the uniform distribution on $(-1,1)$. Each experiment is repeated 100 times under each setting. Denote the estimator (which is either the MLE or the MME) as $\widehat{\boldsymbol{\theta}}$, and the true parameter value as $\boldsymbol{\theta}^{*}$. We report the average $\ell_{2}$ error in terms of $\frac{\|\widehat{\boldsymbol{\theta}}-\boldsymbol{\theta}^{*}\|_{2}}{\sqrt{p}}$ and the average $\ell_{\infty}$ error $\|\widehat{\boldsymbol{\theta}}-\boldsymbol{\theta}^{*}\|_{\infty}$ in Figure 2.
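For readers who wish to replicate this kind of experiment, the data-generating step and the two error measures can be sketched in Python as follows. This is only a minimal sketch: it assumes that ${\mathbf{X}}^{0}$ is drawn from the stationary distribution with edge probabilities $e^{\beta_{i,0}+\beta_{j,0}}/(1+e^{\beta_{i,0}+\beta_{j,0}})$ and that subsequent networks follow the transition probabilities in (5.15); the function names are hypothetical and the estimation step itself is not included.

```python
import numpy as np

def simulate_twhm(beta0, beta1, n, rng):
    """Draw X^0, ..., X^n as symmetric 0/1 matrices with zero diagonal."""
    p = beta0.shape[0]
    a0 = np.exp(beta0[:, None] + beta0[None, :])
    a1 = np.exp(beta1[:, None] + beta1[None, :])
    iu = np.triu_indices(p, k=1)           # one Bernoulli draw per pair i < j
    m = iu[0].size

    def draw(prob_matrix):
        X = np.zeros((p, p), dtype=int)
        X[iu] = rng.random(m) < prob_matrix[iu]
        return X + X.T

    nets = [draw(a0 / (1.0 + a0))]         # assumed stationary start
    for _ in range(n):
        # P(X^t = 1 | X^{t-1}) = (a0 + a1 * X^{t-1}) / (1 + a0 + a1)
        nets.append(draw((a0 + a1 * nets[-1]) / (1.0 + a0 + a1)))
    return np.stack(nets)

def estimation_errors(theta_hat, theta_star):
    """Return (||theta_hat - theta*||_2 / sqrt(p), ||theta_hat - theta*||_inf)."""
    p = theta_star.shape[0] // 2           # theta stacks beta_0 and beta_1
    diff = theta_hat - theta_star
    return np.linalg.norm(diff) / np.sqrt(p), np.abs(diff).max()

rng = np.random.default_rng(0)
p = 30
beta0_star = rng.uniform(-1, 1, p)
beta1_star = rng.uniform(-1, 1, p)
X = simulate_twhm(beta0_star, beta1_star, n=5, rng=rng)
```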

From this figure, we can see that the errors in terms of the $\ell_{\infty}$ norm and the $\ell_{2}$ norm decrease for MME and MLE as $n$ or $p$ increases, while the errors of MLE are smaller across all settings. These observations are consistent with our findings in the main theory.

Next, we provide further numerical simulations to evaluate the performance of MLE and MME by imposing different structures on $\boldsymbol{\beta}_{0}^{*}$ and $\boldsymbol{\beta}_{1}^{*}$. In particular, we want to evaluate how the estimation accuracy changes by varying the sparsity of the networks as well as varying the correlations of the network sequence. Note that the expected density of the stationary distribution of the network process is simply

\frac{1}{p(p-1)}{\rm E}\left(\sum_{1\leq i\neq j\leq p}X_{i,j}^{t}\right)=\frac{1}{p(p-1)}\left(\sum_{1\leq i\neq j\leq p}\frac{e^{\beta_{i,0}^{*}+\beta_{j,0}^{*}}}{1+e^{\beta_{i,0}^{*}+\beta_{j,0}^{*}}}\right).

In the sequel, we will use two parameters $a$ and $b$ to generate $\boldsymbol{\beta}_{r}^{*}$ , $r=0,1$ , according to the following four settings:

Setting 1. $\{a\}$: all the elements in $\boldsymbol{\beta}_{r}^{*}$ are set to be equal to $a$.
Setting 2. ${\{a,b\}}$: the first 10$\%$ of the elements of $\boldsymbol{\beta}_{r}^{*}$ are set to be equal to $a$, while the other elements are set to be equal to $b$.
Setting 3. $\mathcal{L}_{(a,b)}$: the parameters take values in a linear form as $\beta_{i,r}^{*}=a+(b-a)(i-1)/(p-1)$, $i=1,\cdots,p$, so that they change linearly from $a$ to $b$.
Setting 4. $U_{(a,b)}$: the $p$ elements in $\boldsymbol{\beta}_{r}^{*}$ are generated independently from the uniform distribution on $(a,b)$.
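The four settings and the expected density formula above can be sketched in Python as follows. The sketch assumes that the linear setting $\mathcal{L}_{(a,b)}$ runs from $a$ at $i=1$ to $b$ at $i=p$, and that "the first 10%" means the first $\lfloor p/10\rfloor$ elements; the function names are hypothetical.

```python
import numpy as np

def make_beta(setting, p, a, b=None, rng=None):
    """Generate beta_r^* under Settings 1-4 (hypothetical helper)."""
    if setting == 1:                       # {a}: all elements equal to a
        return np.full(p, float(a))
    if setting == 2:                       # {a, b}: first 10% equal a, rest equal b
        beta = np.full(p, float(b))
        beta[: p // 10] = a
        return beta
    if setting == 3:                       # L_(a,b): linear from a to b (assumed)
        return np.linspace(a, b, p)
    if setting == 4:                       # U_(a,b): i.i.d. uniform on (a, b)
        return rng.uniform(a, b, size=p)
    raise ValueError("setting must be 1, 2, 3 or 4")

def expected_density(beta0):
    """Average over i != j of exp(b_i + b_j) / (1 + exp(b_i + b_j))."""
    p = beta0.shape[0]
    s = beta0[:, None] + beta0[None, :]
    prob = 1.0 / (1.0 + np.exp(-s))
    off_diag = ~np.eye(p, dtype=bool)
    return prob[off_diag].mean()

d_sparse = expected_density(make_beta(1, 200, -1.47))    # close to 0.05
d_dense = expected_density(make_beta(3, 200, -2.0, 2.0))  # 0.5 by symmetry
```

For a constant vector the density equals $e^{2a}/(1+e^{2a})$, and for a linear design that is symmetric about zero the pairwise probabilities average to $0.5$, which is in line with the sparse and dense regimes targeted in the simulations.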

In Table 2, we generate $\boldsymbol{\beta}_{1}^{*}$ using Setting 1 with $a=0$ , and generate $\boldsymbol{\beta}_{0}^{*}$ using Setting 2 with different choices for $a$ and $b$ to obtain networks with different expected density. In Table 3, we generate $\boldsymbol{\beta}_{0}^{*}$ and $\boldsymbol{\beta}_{1}^{*}$ using combinations of these four settings with different parameters such that the resulting networks have expected density either around 0.05 (sparse) or 0.5 (dense). The number of networks in each process and the number of nodes in each network are set as $(n,p)=(20,200),(20,500),(50,200)$ or $(50,500)$ . The errors for estimating $\boldsymbol{\theta}^{*}$ in terms of the $\ell_{\infty}$ and $\ell_{2}$ norms are reported via 100 replications. To further compare the accuracy for estimating $\boldsymbol{\beta}_{0}^{*}$ and $\boldsymbol{\beta}_{1}^{*}$ , in Table 4, we have conducted experiments under Settings 3 and 4, and reported the estimation errors for $\boldsymbol{\beta}_{0}^{*}$ and $\boldsymbol{\beta}_{1}^{*}$ separately. We summarize the simulation results below:

Table 2: The estimation errors of MME and MLE under Setting 1 and Setting 2 for $\boldsymbol{\beta}_{0}^{*}$ by setting $\boldsymbol{\beta}^{*}_{1}=\boldsymbol{0}_{p}$

$n$	$p$	$\boldsymbol{\beta}_{0}^{*}$	MME, $\ell_{2}$	MME, $\ell_{\infty}$	MLE, $\ell_{2}$	MLE, $\ell_{\infty}$
20	200	{0}	0.074	0.219	0.071	0.212
50	200	{0}	0.046	0.138	0.045	0.136
20	500	{0}	0.046	0.150	0.045	0.146
50	500	{0}	0.029	0.093	0.028	0.092
20	200	$\{0.5,-0.5\}$	0.092	0.222	0.091	0.217
50	200	$\{0.5,-0.5\}$	0.058	0.140	0.058	0.139
20	500	$\{0.5,-0.5\}$	0.058	0.154	0.057	0.148
50	500	$\{0.5,-0.5\}$	0.036	0.095	0.036	0.093
20	200	$\{1,-1\}$	0.120	0.305	0.117	0.284
50	200	$\{1,-1\}$	0.074	0.186	0.074	0.177
20	500	$\{1,-1\}$	0.075	0.200	0.073	0.190
50	500	$\{1,-1\}$	0.038	0.125	0.036	0.119
20	200	$\{1.5,-1.5\}$	0.164	0.436	0.156	0.397
50	200	$\{1.5,-1.5\}$	0.102	0.255	0.097	0.236
20	500	$\{1.5,-1.5\}$	0.103	0.287	0.097	0.262
50	500	$\{1.5,-1.5\}$	0.065	0.178	0.061	0.164

Table 3: The average estimation errors of MME and MLE under combinations of different settings

Density = 0.05
$n$	$p$	$\boldsymbol{\beta}_{0}^{*}$	$\boldsymbol{\beta}_{1}^{*}$	MME, $\ell_{2}$	MME, $\ell_{\infty}$	MLE, $\ell_{2}$	MLE, $\ell_{\infty}$
20	200	$\mathcal{L}_{(-4,0)}$	$U_{(-1,1)}$	0.419	1.833	0.392	1.8
50	200	$\mathcal{L}_{(-4,0)}$	$U_{(-1,1)}$	0.253	0.913	0.227	0.82
20	500	$\mathcal{L}_{(-4,0)}$	$U_{(-1,1)}$	0.246	1.119	0.218	0.9
50	500	$\mathcal{L}_{(-4,0)}$	$U_{(-1,1)}$	0.170	0.626	0.148	0.621
20	200	$\mathcal{L}_{(-4,0)}$	{0}	0.275	1.452	0.280	1.516
50	200	$\mathcal{L}_{(-4,0)}$	{0}	0.161	0.771	0.162	0.774
20	500	$\mathcal{L}_{(-4,0)}$	{0}	0.160	0.892	0.162	0.904
50	500	$\mathcal{L}_{(-4,0)}$	{0}	0.098	0.506	0.099	0.507
20	200	$\{-1.47\}$	$U_{(-1,1)}$	0.187	0.588	0.161	0.514
50	200	$\{-1.47\}$	$U_{(-1,1)}$	0.116	0.351	0.099	0.305
20	500	$\{-1.47\}$	$U_{(-1,1)}$	0.114	0.387	0.099	0.339
50	500	$\{-1.47\}$	$U_{(-1,1)}$	0.073	0.246	0.062	0.208
20	200	$\{-1.47\}$	{0}	0.150	0.482	0.151	0.484
50	200	$\{-1.47\}$	{0}	0.093	0.289	0.093	0.29
20	500	$\{-1.47\}$	{0}	0.093	0.309	0.093	0.311
50	500	$\{-1.47\}$	{0}	0.058	0.195	0.058	0.195
Density = 0.5
20	200	$\mathcal{L}_{(-2,2)}$	$U_{(-0.1,0.1)}$	0.132	0.415	0.012	0.318
50	200	$\mathcal{L}_{(-2,2)}$	$U_{(-0.1,0.1)}$	0.080	0.238	0.069	0.194
20	500	$\mathcal{L}_{(-2,2)}$	$U_{(-0.1,0.1)}$	0.080	0.272	0.068	0.217
50	500	$\mathcal{L}_{(-2,2)}$	$U_{(-0.1,0.1)}$	0.050	0.168	0.043	0.135
20	200	$\mathcal{L}_{(-1,1)}$	$U_{(-1,1)}$	0.107	0.324	0.095	0.264
50	200	$\mathcal{L}_{(-1,1)}$	$U_{(-1,1)}$	0.067	0.194	0.060	0.163
20	500	$\mathcal{L}_{(-1,1)}$	$U_{(-1,1)}$	0.071	0.267	0.061	0.205
50	500	$\mathcal{L}_{(-1,1)}$	$U_{(-1,1)}$	0.044	0.156	0.039	0.130
20	200	$\mathcal{L}_{(-2,2)}$	$U_{(-1,1)}$	0.137	0.478	0.112	0.329
50	200	$\mathcal{L}_{(-2,2)}$	$U_{(-1,1)}$	0.084	0.274	0.070	0.205
20	500	$\mathcal{L}_{(-2,2)}$	$U_{(-1,1)}$	0.087	0.352	0.071	0.250
50	500	$\mathcal{L}_{(-2,2)}$	$U_{(-1,1)}$	0.054	0.211	0.044	0.150

Table 4: The means and standard deviations of the errors of MME and MLE for estimating $\boldsymbol{\beta}_{0}^{*}$ and $\boldsymbol{\beta}_{1}^{*}$

$\boldsymbol{\beta}_{0}^{*}\sim\mathcal{L}_{(-1,1)}$ and $\boldsymbol{\beta}_{1}^{*}\sim U_{(0,2)}$
$n$		20	100	20	100
$p$		200	200	500	500
MME, $\ell_{2}$	$\boldsymbol{\beta}_{0}^{*}$	0.163(0.010)	0.096(0.006)	0.099(0.004)	0.057(0.002)
MME, $\ell_{2}$	$\boldsymbol{\beta}_{1}^{*}$	0.177(0.010)	0.084(0.005)	0.104(0.004)	0.050(0.002)
MME, $\ell_{\infty}$	$\boldsymbol{\beta}_{0}^{*}$	0.570(0.103)	0.367(0.085)	0.395(0.070)	0.241(0.042)
MME, $\ell_{\infty}$	$\boldsymbol{\beta}_{1}^{*}$	0.658(0.137)	0.421(0.079)	0.438(0.076)	0.214(0.037)
MLE, $\ell_{2}$	$\boldsymbol{\beta}_{0}^{*}$	0.211(0.013)	0.091(0.006)	0.121(0.005)	0.054(0.002)
MLE, $\ell_{2}$	$\boldsymbol{\beta}_{1}^{*}$	0.166(0.011)	0.072(0.005)	0.096(0.004)	0.043(0.002)
MLE, $\ell_{\infty}$	$\boldsymbol{\beta}_{0}^{*}$	0.809(0.180)	0.354(0.076)	0.532(0.098)	0.232(0.041)
MLE, $\ell_{\infty}$	$\boldsymbol{\beta}_{1}^{*}$	0.617(0.116)	0.265(0.052)	0.399(0.065)	0.172(0.028)
$\boldsymbol{\beta}_{0}^{*}\sim\mathcal{L}_{(-2,0)}$ and $\boldsymbol{\beta}_{1}^{*}\sim U_{(0,2)}$
MME, $\ell_{2}$	$\boldsymbol{\beta}_{0}^{*}$	0.133(0.012)	0.080(0.007)	0.081(0.004)	0.047(0.002)
MME, $\ell_{2}$	$\boldsymbol{\beta}_{1}^{*}$	0.093(0.006)	0.053(0.004)	0.056(0.003)	0.032(0.002)
MME, $\ell_{\infty}$	$\boldsymbol{\beta}_{0}^{*}$	0.568(0.104)	0.365(0.087)	0.394(0.071)	0.241(0.042)
MME, $\ell_{\infty}$	$\boldsymbol{\beta}_{1}^{*}$	0.387(0.069)	0.236(0.043)	0.258(0.037)	0.162(0.025)
MLE, $\ell_{2}$	$\boldsymbol{\beta}_{0}^{*}$	0.176(0.016)	0.076(0.007)	0.100(0.006)	0.044(0.002)
MLE, $\ell_{2}$	$\boldsymbol{\beta}_{1}^{*}$	0.116(0.009)	0.051(0.004)	0.068(0.003)	0.031(0.002)
MLE, $\ell_{\infty}$	$\boldsymbol{\beta}_{0}^{*}$	0.809(0.181)	0.351(0.078)	0.531(0.099)	0.232(0.041)
MLE, $\ell_{\infty}$	$\boldsymbol{\beta}_{1}^{*}$	0.513(0.088)	0.227(0.047)	0.348(0.058)	0.158(0.024)

• The effect of $(n,p)$. Similar to what we have observed in Figure 2, the estimation errors become smaller when $n$ or $p$ becomes larger. Interestingly, from Tables 2–3 we can observe that, under the same setting, the errors in $\ell_{2}$ norm when $(n,p)=(50,200)$ are very close to those when $(n,p)=(20,500)$. This is to some degree consistent with our finding in Theorem 2 where the upper bound depends on $(n,p)$ through their product $np$.

• The effect of sparsity. From Table 2 we can see that, as the expected density decreases, the estimation errors increase in almost all the cases. On the other hand, even though the parameters take different values in Table 3, the errors in the sparse cases are in general larger than those in the dense cases.

• The impact of $\kappa_{0}:=\|\boldsymbol{\beta}_{0}^{*}\|_{\infty}$. Typically, estimation errors tend to increase with larger values of $\kappa_{0}$, as evidenced in Table 2. Additionally, when maintaining the same overall sparsity level, larger $\kappa_{0}$ values are correlated with greater estimation errors, as illustrated in Table 3.

• MLE vs MME. In general, the estimation errors of the MLE are smaller than those of the MME in most cases, as can be seen in Tables 2 and 3. In Table 4, where the estimation errors for $\boldsymbol{\beta}_{0}^{*}$ and $\boldsymbol{\beta}_{1}^{*}$ are reported separately, we can see that the estimation errors of the MME of $\boldsymbol{\beta}_{1}^{*}$ are generally larger than those of the MLE of $\boldsymbol{\beta}_{1}^{*}$, especially when $n$ is large.

5.3 Real data

In this section, we apply our TWHM to a real dataset to examine an insect interaction network process [25]. We focus on a subset of the data named insecta-ant-colony4 that contains the social interactions of 102 ants over 41 days. In this dataset, the position and orientation of all the ants were recorded twice per second to infer their movements and interactions, based on which 41 daily networks were constructed. More specifically, $X_{i,j}^{t}$ is 1 if there is an interaction between ants $i$ and $j$ during day $t$, and 0 otherwise. In the ACF and PACF plots of the degree sequences of selected ants (cf. Figure 1 in Appendix B.1), we can observe patterns similar to those of a first-order autoregressive model with long memory. This motivates the use of TWHM for the analysis of this dataset.

In [25], the 41 daily networks were split into four periods with 11, 10, 10, and 10 days respectively, because the corresponding days separating these periods were identified as change-points. By excluding ants that did not interact with others, we are left with $p=102$ nodes in period one, $p=73$ nodes in period two, $p=55$ nodes in period three and $p=35$ nodes in period four. Thus we take the networks on day 1, day 12, day 22 and day 32 as the initial networks and fit four different TWHMs, one for each of the four periods.

To appreciate how TWHM captures static heterogeneity, we present a subgraph of 10 nodes during the fourth period ($t=32$–$41$), 5 of which have the largest and 5 of which have the smallest fitted $\beta_{i,0}$ values. The edges of this subgraph are drawn to represent aggregated static connections defined as $({\mathbf{X}}^{32}+\cdots+{\mathbf{X}}^{41})/10$ between these ants. We can see from the left panel of Figure 3 that the magnitudes of the fitted static heterogeneity parameters agree in principle with the activeness of each ant making connections.

On the other hand, we examine how TWHM can capture dynamic heterogeneity. Towards this, we plot a subgraph of the 10 nodes having the smallest fitted $\beta_{i,0}$ values in Figure 3(b), where edges represent the magnitude of $\sum_{t=33}^{41}I\left(X_{i,j}^{t}=X_{i,j}^{t-1}\right)/9$ which is a measure of the extent that an edge is preserved across the whole period and hence dynamic heterogeneity. Again, we can see an agreement between the fitted $\boldsymbol{\beta}^{*}_{1}$ and how likely these nodes will preserve their ties.

To evaluate how TWHM performs when it comes to making predictions, we further carry out the following experiments:

(i) From (1), given the MLE $\{\widehat{\beta}_{i,r},i=1,\ldots,p,r=0,1\}$ and the network at time $t-1$, we can estimate the conditional expectation of node $i$’s degree as

\tilde{d}_{i}^{t}:=\sum_{j=1,\>j\neq i}^{p}{\rm E}\left(X_{i,j}^{t}\Big{|}X_{i,j}^{t-1},\widehat{\boldsymbol{\theta}}\right)=\sum_{j=1,\>j\neq i}^{p}\left(\frac{e^{\widehat{\beta}_{i,0}+\widehat{\beta}_{j,0}}}{1+e^{\widehat{\beta}_{i,0}+\widehat{\beta}_{j,0}}+e^{\widehat{\beta}_{i,1}+\widehat{\beta}_{j,1}}}+\frac{e^{\widehat{\beta}_{i,1}+\widehat{\beta}_{j,1}}}{1+e^{\widehat{\beta}_{i,0}+\widehat{\beta}_{j,0}}+e^{\widehat{\beta}_{i,1}+\widehat{\beta}_{j,1}}}X_{i,j}^{t-1}\right).

We can then compare the density of the estimated degree sequence $\{\tilde{d}_{i}^{t},i=1,\ldots,p\}$ with that of the observed degree sequence $\{d_{i}^{t},i=1,\ldots,p\}$ at time $t$ . To provide a comparison, we treat networks in each period as i.i.d. observations and utilize the classical $\beta$ -model to derive the degree sequence estimator $\{\check{{\mathbf{d}}}^{t}\}$ for the four periods. The fitted degree distributions are depicted in Figure 4, revealing a close resemblance between the estimated and observed densities. This observation suggests that the TWHM demonstrates strong performance in one-step-ahead prediction.

To further assess the similarity between the estimated degree sequences $\{\tilde{{\mathbf{d}}}^{t}\}$ , $\{\check{{\mathbf{d}}}^{t}\}$ , and the true degree sequence $\{{\mathbf{d}}^{t}\}$ , we compute the Kolmogorov-Smirnov (KS) distance and conduct the KS test for $t=2,\ldots,41$ . The mean and standard deviation of the KS distances, the p-values of the KS test, and the rejection rate are summarized in Table 5. Notably, at a significance level of 0.05, out of the 40 KS tests, we fail to reject the null hypothesis that $\{\tilde{{\mathbf{d}}}^{t}\}$ and $\{{\mathbf{d}}^{t}\}$ originate from the same distribution in 38 instances, resulting in a rejection rate consistent with the significance level. Conversely, for the degree sequence estimators based on the $\beta$ -model $\{\check{{\mathbf{d}}}^{t}\}$ , 8 out of the 40 tests were rejected. These findings indicate that our model exhibits highly promising performance in recovering the degree sequences.
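The degree estimator $\tilde{d}_{i}^{t}$ and the KS distance used in this comparison can be sketched in Python as follows. The function names are hypothetical, and the KS distance is computed directly from the two empirical distribution functions rather than through any particular statistical package, so the sketch does not reproduce the p-values of the KS test.

```python
import numpy as np

def predicted_degrees(beta0_hat, beta1_hat, X_prev):
    """Estimated conditional expected degrees given the previous network."""
    a0 = np.exp(beta0_hat[:, None] + beta0_hat[None, :])
    a1 = np.exp(beta1_hat[:, None] + beta1_hat[None, :])
    # E(X_ij^t | X_ij^{t-1}) = (a0 + a1 * X_ij^{t-1}) / (1 + a0 + a1)
    prob = (a0 + a1 * X_prev) / (1.0 + a0 + a1)
    np.fill_diagonal(prob, 0.0)            # the sum runs over j != i
    return prob.sum(axis=1)

def ks_distance(x, y):
    """Two-sample Kolmogorov-Smirnov distance sup_u |F_x(u) - F_y(u)|."""
    x, y = np.sort(x), np.sort(y)
    grid = np.concatenate([x, y])          # the supremum is attained at a data point
    Fx = np.searchsorted(x, grid, side="right") / x.size
    Fy = np.searchsorted(y, grid, side="right") / y.size
    return np.abs(Fx - Fy).max()
```

A typical use would be `ks_distance(predicted_degrees(b0_hat, b1_hat, X[t - 1]), X[t].sum(axis=1))` for each $t$, where `b0_hat`, `b1_hat` and `X` stand for fitted parameters and observed networks supplied by the user.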

Table 5: The mean and standard deviation of the KS distances, the p-values of the KS test, and the rejection rate between the true degree sequence $\{{\mathbf{d}}^{t}\}$ and the TWHM based estimator $\tilde{{\mathbf{d}}}^{t}$ and the $\beta$-model based estimator $\check{{\mathbf{d}}}^{t}$ over the 40 networks ($t=2,\ldots,41$) in the ant dataset.

	KS distance	KS test p-value	Rejection rate
$\tilde{{\mathbf{d}}}^{t}$ vs ${\mathbf{d}}^{t}$	0.179(0.058)	0.361(0.267)	0.05
$\check{{\mathbf{d}}}^{t}$ vs ${\mathbf{d}}^{t}$	0.192(0.061)	0.298(0.246)	0.20

(ii) By incorporating network dynamics, TWHM naturally enables one-step-ahead link prediction via

(5.15)

{\mathbf{P}}\left(\widehat{X}_{i,j}^{t}=1\Big{|}X_{i,j}^{t-1}\right)=\frac{e^{\widehat{\beta}_{i,0}+\widehat{\beta}_{j,0}}}{1+e^{\widehat{\beta}_{i,0}+\widehat{\beta}_{j,0}}+e^{\widehat{\beta}_{i,1}+\widehat{\beta}_{j,1}}}+\frac{e^{\widehat{\beta}_{i,1}+\widehat{\beta}_{j,1}}}{1+e^{\widehat{\beta}_{i,0}+\widehat{\beta}_{j,0}}+e^{\widehat{\beta}_{i,1}+\widehat{\beta}_{j,1}}}X_{i,j}^{t-1}.

To transform these probabilities into links, we threshold them by setting $\widehat{X}_{i,j}^{t}=1$ when ${\mathbf{P}}\left(\widehat{X}_{i,j}^{t}=1\right)$ $\geq c_{i,j}$ and $\widehat{X}_{i,j}^{t}=0$ when ${\mathbf{P}}\left(\widehat{X}_{i,j}^{t}=1\right)<c_{i,j}$ for some cut-off constants $c_{i,j}$ . As an illustration, we first consider simply setting $c_{i,j}=0.5$ for all $1\leq i<j\leq p$ for predicting links. We shall denote this approach as TWHM_0.5.

As an alternative, owing to the fact that networks may change slowly, for a given parameter $\omega$ , we also consider the following adaptive approach for choosing $c_{i,j}$ :

(5.16)

\tilde{X}_{i,j}^{t}:=I\{\omega{\mathbf{P}}\left(\widehat{X}_{i,j}^{t}=1\right)+\left(1-\omega\right)X_{i,j}^{t-1}>0.5\}.

It can be shown that the above estimator is equivalent to the prediction rule $I\left\{{\mathbf{P}}\left(\widehat{X}_{i,j}^{t}=1\right)>c_{i,j}\right\}$ with cut-off values specified as

c_{i,j}=\frac{0.5e^{\widehat{\beta}_{i,1}+\widehat{\beta}_{j,1}}+\left(1-\omega\right)e^{\widehat{\beta}_{i,0}+\widehat{\beta}_{j,0}}}{\left(1-\omega\right)+e^{\widehat{\beta}_{i,1}+\widehat{\beta}_{j,1}}+\left(1-\omega\right)e^{\widehat{\beta}_{i,0}+\widehat{\beta}_{j,0}}},\quad 1\leq i<j\leq p.

This method is denoted as TWHM_adaptive. Lastly, as a benchmark, we have also considered a naive approach that simply predicts ${\mathbf{X}}^{t}$ as ${\mathbf{X}}^{t-1}$ .
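The link prediction rule (5.15), the adaptive rule (5.16), and the stated cut-off values can be written compactly in Python. The sketch below uses hypothetical function names and takes the weight in the cut-off formula to be the same $\omega$ as in (5.16), with $0<\omega\leq 1$; it can be used to check numerically that the two formulations of TWHM_adaptive give identical predictions.

```python
import numpy as np

def link_prob(beta0_hat, beta1_hat, X_prev):
    """One-step-ahead link probabilities as in (5.15)."""
    a0 = np.exp(beta0_hat[:, None] + beta0_hat[None, :])
    a1 = np.exp(beta1_hat[:, None] + beta1_hat[None, :])
    return (a0 + a1 * X_prev) / (1.0 + a0 + a1)

def predict_adaptive(prob, X_prev, omega):
    """Adaptive rule (5.16): mix the model probability with the previous link."""
    return (omega * prob + (1.0 - omega) * X_prev > 0.5).astype(int)

def adaptive_cutoff(beta0_hat, beta1_hat, omega):
    """Cut-off values c_ij under which I{prob > c_ij} matches (5.16)."""
    a0 = np.exp(beta0_hat[:, None] + beta0_hat[None, :])
    a1 = np.exp(beta1_hat[:, None] + beta1_hat[None, :])
    num = 0.5 * a1 + (1.0 - omega) * a0
    den = (1.0 - omega) + a1 + (1.0 - omega) * a0
    return num / den

# Small numerical check on arbitrary (made-up) parameter values.
rng = np.random.default_rng(0)
p = 8
b0, b1 = rng.uniform(-1, 1, p), rng.uniform(-1, 1, p)
U = np.triu(rng.integers(0, 2, size=(p, p)), k=1)
X_prev = U + U.T
prob = link_prob(b0, b1, X_prev)
same = all(
    np.array_equal(predict_adaptive(prob, X_prev, w),
                   (prob > adaptive_cutoff(b0, b1, w)).astype(int))
    for w in (0.6, 0.9, 1.0)
)
```

With $\omega=1$ the cut-off reduces to $0.5$ for every pair, so the adaptive rule then coincides with thresholding the model probability at $0.5$, up to the treatment of exact ties.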

In this experiment, we set the number of training samples to be $n_{train}=2,5$ or $8$. For a given training sample size $n_{train}$ and a period with $n$ networks, we predict the graph ${\bf X}^{n_{train}+i}$ based on the previous $n_{train}$ networks $\{{\bf X}^{t},t=i,\ldots,n_{train}+i-1\}$ for $i=1,\ldots,n-n_{train}$. That is, over the four periods in the data, we have predicted 33, 21 and 9 networks for $n_{train}=2,5$ and $8$ respectively, with 5151 potential edges in each network in the first period, 2628 in the second period, 1485 in the third period, and 595 in the fourth period. The $\omega$ parameter employed in TWHM_adaptive is selected as follows. For each prediction of ${\mathbf{X}}^{n_{train}+i}$, we choose, from a sequence of candidate $\omega$ values, the one that produces the highest accuracy in predicting the most recently observed network ${\mathbf{X}}^{n_{train}+i-1}$. For example, in the first period with $n=11$ networks, when $n_{train}=8$, we used $\{{\mathbf{X}}^{t},t=i,\cdots,i+7\}$ to predict ${\mathbf{X}}^{i+8}$ for $i=1,2,3$. For each $i$, let $\tilde{\bf X}^{i+7}$ be defined as in (5.16). A set of candidate values for $\omega$ was used to compute $\tilde{\bf X}^{i+7}$, and the one that returns the smallest misclassification rate (in predicting ${\bf X}^{i+7}$) was used in TWHM_adaptive for predicting ${\mathbf{X}}^{i+8}$. The mean of the chosen $\omega$ is $0.936$ when $n_{train}=2$, $0.895$ when $n_{train}=5$, and $0.905$ when $n_{train}=8$. The prediction accuracy of the above-mentioned methods, defined as the percentage of correctly predicted links, is reported in Table 6. We can see that TWHM_0.5 and TWHM_adaptive both perform better than the naive approach in all the cases. On the other hand, TWHM coupled with adaptive cut-off points can improve the prediction accuracy of TWHM with a cut-off value 0.5 in most periods.
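The rolling-window design and the selection of $\omega$ can be sketched as follows. The period lengths and node counts are taken from the text; the function names, the candidate grid, and the convention of scoring accuracy over all supplied entries are assumptions for illustration (in practice one would pass only the entries with $i<j$).

```python
import numpy as np

def rolling_windows(n, n_train):
    """(training indices, target index) pairs within one period, 1-based."""
    return [(list(range(i, n_train + i)), n_train + i)
            for i in range(1, n - n_train + 1)]

period_lengths = [11, 10, 10, 10]          # days per period
nodes = [102, 73, 55, 35]                  # active ants per period

# Number of predicted networks for each training size, summed over periods.
counts = {m: sum(len(rolling_windows(n, m)) for n in period_lengths)
          for m in (2, 5, 8)}
# Number of node pairs (potential edges) per network in each period.
pairs = [p * (p - 1) // 2 for p in nodes]

def select_omega(prob, X_prev, X_true, candidates):
    """Pick the candidate omega whose rule (5.16) best predicts X_true."""
    acc = [np.mean((w * prob + (1.0 - w) * X_prev > 0.5) == (X_true == 1))
           for w in candidates]
    return candidates[int(np.argmax(acc))]
```

Here `counts` evaluates to 33, 21 and 9 predicted networks and `pairs` to 5151, 2628, 1485 and 595, matching the figures quoted above.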

Table 6: The prediction accuracy of TWHM with $0.5$ as a cut-off point, TWHM with adaptive cut-off points, and the naive estimator ${\mathbf{X}}^{t-1}$

$n_{train}$	Period	TWHM_0.5	TWHM_adaptive	Naive
2	One	0.773	0.800	0.749
	Two	0.817	0.817	0.780
	Three	0.837	0.837	0.806
	Four	0.824	0.831	0.807
	Overall	0.811	0.822	0.784
5	One	0.789	0.807	0.759
	Two	0.826	0.823	0.779
	Three	0.846	0.849	0.805
	Four	0.833	0.842	0.805
	Overall	0.822	0.829	0.786
8	One	0.795	0.800	0.759
	Two	0.832	0.832	0.778
	Three	0.855	0.845	0.823
	Four	0.831	0.863	0.779
	Overall	0.825	0.831	0.782

6 Summary and Discussion

We have proposed a novel two-way heterogeneity model that utilizes two sets of parameters to explicitly capture static heterogeneity and dynamic heterogeneity. In a high-dimensional setup, we have established the existence and the rate of convergence of its local MLE, and proposed a novel method of moments estimator as an initial value to find this local MLE. To the best of our knowledge, this is the first model in the network literature for which the local MLE is obtained for a non-convex loss function. The theory of our model is established by developing new uniform upper bounds for the deviation of the loss function.

While we have focused on the estimation of the parameters in this paper, how to conduct statistical inference for the local MLE is a natural next step for research. In our setup, we assume that the parameters are time invariant but this need not be the case. A future direction is to allow the static heterogeneity parameter $\boldsymbol{\beta}_{0}$ and/or the dynamic heterogeneity parameter $\boldsymbol{\beta}_{1}$ to depend on time, giving rise to non-stationary network processes. In case when these parameters change smoothly over time, we may consider estimating the parameters $\beta_{i,0}^{\tau},\beta_{i,1}^{\tau}$ at time $\tau$ by kernel smoothing, that is, by maximizing the following smoothed log-likelihood:

\tilde{L}(\tau,{\mathbf{X}}^{n},{\mathbf{X}}^{n-1},\cdots,{\mathbf{X}}^{1}|{\mathbf{X}}^{0})\\ ={\sum_{t=1}^{n}w_{t}\sum_{1\leq i<j\leq p}\Bigg{\{}-\log\Big{(}{1+e^{\beta_{i,0}+\beta_{j,0}}+e^{\beta_{i,1}+\beta_{j,1}}}\Big{)}+\left(\beta_{i,0}+\beta_{j,0}\right)X_{i,j}^{t}\left(1-X_{i,j}^{t-1}\right)}\\ +\left(1-X_{i,j}^{t}\right)\left(1-X_{i,j}^{t-1}\right)\log\left(1+e^{\beta_{i,1}+\beta_{j,1}}\right)+X_{i,j}^{t}X_{i,j}^{t-1}\log\big{(}e^{\beta_{i,0}+\beta_{j,0}}+e^{\beta_{i,1}+\beta_{j,1}}\big{)}\Bigg{\}},

with $w_{t}=\frac{K(h^{-1}|t-\tau|)}{\sum_{s=1}^{n}K(h^{-1}|s-\tau|)}$, where $K(\cdot)$ is a kernel function and $h$ is the bandwidth parameter. As another line of research, note that TWHM is formulated as an AR(1) process. We can extend it by including more time lags. For example, we can extend TWHM to include lag-$k$ dependence by writing

X^{t}_{i,j}=I({\varepsilon}_{i,j}^{t}=0)+\sum_{r=1}^{k}X^{t-r}_{i,j}I({\varepsilon}_{i,j}^{t}=r),

where the innovations ${\varepsilon}_{i,j}^{t}$ are independent such that

P({\varepsilon}_{i,j}^{t}=r)=\frac{e^{\beta_{i,r}+\beta_{j,r}}}{1+\sum_{s=0}^{k}e^{\beta_{i,s}+\beta_{j,s}}}~{}~{}{\rm for}~{}r=0,\cdots,k;\quad P({\varepsilon}_{i,j}^{t}=-1)=\frac{1}{1+\sum_{s=0}^{k}e^{\beta_{i,s}+\beta_{j,s}}},

with parameter $\boldsymbol{\beta}_{0}=(\beta_{1,0},\ldots,\beta_{p,0})^{\top}$ denoting node-specific static heterogeneity and $\boldsymbol{\beta}=\left(\beta_{i,r}\right)_{1\leq i\leq p;1\leq r\leq k}$ $\in\mathbb{R}^{p\times k}$ denoting lag- $k$ dynamic fluctuation. Other future lines of research include adding covariates to model the tendency of nodes making connections [35] and exploring additional structures [3].
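Both extensions discussed above are easy to prototype. The Python sketch below computes the kernel weights $w_{t}$ and the innovation probabilities of the lag-$k$ model, and draws one transition for a single pair. The Gaussian kernel, the storage of $\beta_{i,0},\ldots,\beta_{i,k}$ as the columns of one array, and all function names are assumptions made purely for illustration.

```python
import numpy as np

def kernel_weights(n, tau, h, kernel=None):
    """Normalized weights w_t, t = 1, ..., n, for smoothing around time tau."""
    if kernel is None:
        kernel = lambda u: np.exp(-0.5 * u ** 2)   # Gaussian kernel (assumed choice)
    t = np.arange(1, n + 1)
    k = kernel(np.abs(t - tau) / h)
    return k / k.sum()

def innovation_probs(beta, i, j):
    """Probabilities of eps_ij^t = -1, 0, 1, ..., k in the lag-k model.

    beta : array of shape (p, k + 1); column r holds beta_{., r} (assumed layout).
    """
    e = np.exp(beta[i] + beta[j])
    denom = 1.0 + e.sum()
    return np.concatenate([[1.0 / denom], e / denom])

def step_lag_k(history, probs, rng):
    """Draw X_ij^t given history = [X_ij^{t-1}, ..., X_ij^{t-k}]."""
    eps = rng.choice(len(probs), p=probs) - 1      # values -1, 0, ..., k
    if eps == -1:
        return 0                                   # no edge
    if eps == 0:
        return 1                                   # edge formed afresh
    return int(history[eps - 1])                   # copy the lag-eps value
```

With $k=1$ the three innovation probabilities reduce to those implied by (5.15), so the sketch is consistent with the original TWHM in that special case.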

References

Bhattacharjee et al., [2020] Bhattacharjee, M., Banerjee, M., and Michailidis, G. (2020). Change point estimation in a dynamic stochastic block model. Journal of Machine Learning Research, 21(107):1–59.
Chatterjee et al., [2011] Chatterjee, S., Diaconis, P., and Sly, A. (2011). Random graphs with a given degree sequence. The Annals of Applied Probability, 21(4):1400–1435.
Chen et al., [2021] Chen, M., Kato, K., and Leng, C. (2021). Analysis of networks via the sparse $\beta$ -model. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 83(5).
Durante et al., [2016] Durante, D., Dunson, D. B., et al. (2016). Locally adaptive dynamic networks. The Annals of Applied Statistics, 10(4):2203–2232.
Fu and He, [2022] Fu, D. and He, J. (2022). Dppin: A biological repository of dynamic protein-protein interaction network data. In 2022 IEEE International Conference on Big Data (Big Data), pages 5269–5277. IEEE.
Gragg and Tapia, [1974] Gragg, W. and Tapia, R. (1974). Optimal error bounds for the newton–kantorovich theorem. SIAM Journal on Numerical Analysis, 11(1):10–13.
Graham, [2017] Graham, B. S. (2017). An econometric model of network formation with degree heterogeneity. Econometrica, 85(4):1033–1063.
Han et al., [2020] Han, R., Chen, K., and Tan, C. (2020). Bivariate gamma model. Journal of Multivariate Analysis, 180:104666.
Han et al., [2023] Han, R., Xu, Y., and Chen, K. (2023). A general pairwise comparison model for extremely sparse networks. Journal of the American Statistical Association, 118(544):2422–2432.
Hanneke et al., [2010] Hanneke, S., Fu, W., and Xing, E. P. (2010). Discrete temporal models of social networks. Electronic journal of statistics, 4:585–605.
Hanneke and Xing, [2007] Hanneke, S. and Xing, E. P. (2007). Discrete temporal models of social networks. In Statistical network analysis: models, issues, and new directions: ICML 2006 workshop on statistical network analysis, Pittsburgh, PA, USA, June 29, 2006, Revised Selected Papers, pages 115–125. Springer.
Hillar et al., [2012] Hillar, C. J., Lin, S., and Wibisono, A. (2012). Inverses of symmetric, diagonally dominant positive matrices and applications. arXiv preprint arXiv:1203.6812.
Holland and Leinhardt, [1981] Holland, P. W. and Leinhardt, S. (1981). An exponential family of probability distributions for directed graphs. Journal of the American Statistical Association, 76(373):33–50.
Jiang et al., [2023] Jiang, B., Li, J., and Yao, Q. (2023). Autoregressive networks. Journal of Machine Learning Research, 24(227):1–69.
Jin, [2015] Jin, J. (2015). Fast community detection by score. The Annals of Statistics, 43(1):57–89.
Jin et al., [2022] Jin, J., Ke, Z. T., Luo, S., and Wang, M. (2022). Optimal estimation of the number of network communities. Journal of the American Statistical Association, pages 1–16.
Karrer and Newman, [2011] Karrer, B. and Newman, M. E. (2011). Stochastic blockmodels and community structure in networks. Physical review E, 83(1):016107.
Karwa et al., [2016] Karwa, V., Slavković, A., et al. (2016). Inference using noisy degrees: Differentially private $\beta$-model and synthetic graphs. The Annals of Statistics, 44(1):87–112.
Ke and Jin, [2022] Ke, Z. T. and Jin, J. (2022). The score normalization, especially for highly heterogeneous network and text data. arXiv preprint arXiv:2204.11097.
Kolaczyk and Csárdi, [2020] Kolaczyk, E. D. and Csárdi, G. (2020). Statistical analysis of network data with R, volume 65. Springer, 2 edition.
Krivitsky and Handcock, [2014] Krivitsky, P. N. and Handcock, M. S. (2014). A separable model for dynamic networks. Journal of the Royal Statistical Society. Series B, Statistical Methodology, 76(1):29.
Lin and Bai, [2011] Lin, Z. and Bai, Z. (2011). Probability inequalities. Springer Science & Business Media.
Matias and Miele, [2017] Matias, C. and Miele, V. (2017). Statistical clustering of temporal networks through a dynamic stochastic block model. Journal of the Royal Statistical Society, B, 79(4):1119–1141.
Merlevède et al., [2009] Merlevède, F., Peligrad, M., Rio, E., et al. (2009). Bernstein inequality and moderate deviations under strong mixing conditions. In High dimensional probability V: the Luminy volume, pages 273–292. Institute of Mathematical Statistics.
Mersch et al., [2013] Mersch, D. P., Crespi, A., and Keller, L. (2013). Tracking individuals shows spatial fidelity is a key regulator of ant social organization. Science, 340(6136):1090–1093.
Minai and Williams, [1993] Minai, A. A. and Williams, R. D. (1993). On the derivatives of the sigmoid. Neural Networks, 6(6):845–853.
Newman, [2018] Newman, M. (2018). Networks. Oxford university press.
Pensky, [2019] Pensky, M. (2019). Dynamic network models and graphon estimation. Annals of Statistics, 47(4):2378–2403.
Sengupta and Chen, [2018] Sengupta, S. and Chen, Y. (2018). A block model for node popularity in networks with community structure. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(2):365–386.
Stein and Leng, [2020] Stein, S. and Leng, C. (2020). A sparse $\beta$ -model with covariates for networks. arXiv preprint arXiv:2010.13604.
Van der Vaart, [2000] Van der Vaart, A. W. (2000). Asymptotic statistics, volume 3. Cambridge university press.
Van Der Vaart and Wellner, [1996] Van Der Vaart, A. W. and Wellner, J. A. (1996). Weak convergence. In Weak convergence and empirical processes, pages 16–28. Springer.
Wainwright, [2019] Wainwright, M. J. (2019). High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge University Press.
Wu et al., [2006] Wu, X., Zhu, L., Guo, J., Zhang, D.-Y., and Lin, K. (2006). Prediction of yeast protein–protein interaction network: insights from the gene ontology and annotations. Nucleic acids research, 34(7):2137–2150.
Yan et al., [2019] Yan, T., Jiang, B., Fienberg, S. E., and Leng, C. (2019). Statistical inference in a directed network model with covariates. Journal of the American Statistical Association, 114(526):857–868.
Yan et al., [2015] Yan, T., Leng, C., and Zhu, J. (2015). Supplement to “asymptotics in directed exponential random graph models with an increasing bi-degree sequence”. Annals of Statistics.
Yan et al., [2016] Yan, T., Leng, C., Zhu, J., et al. (2016). Asymptotics in directed exponential random graph models with an increasing bi-degree sequence. Annals of Statistics, 44(1):31–57.
Yan and Xu, [2012] Yan, T. and Xu, J. (2012). Approximating the inverse of a balanced symmetric matrix with positive elements. arXiv preprint arXiv:1202.1058.
Yan and Xu, [2013] Yan, T. and Xu, J. (2013). A central limit theorem in the $\beta$ -model for undirected random graphs with a diverging number of vertices. Biometrika, 100(2):519–524.

Appendix A Technical proofs

For brevity, we denote $\alpha_{0,i,j}:=e^{\beta_{i,0}+\beta_{j,0}}$ and $\alpha_{1,i,j}:=e^{\beta_{i,1}+\beta_{j,1}}$ , and define $\alpha^{*}_{0,i,j},\alpha^{*}_{1,i,j}$ and $\widehat{\alpha}_{0,i,j},\widehat{\alpha}_{1,i,j}$ similarly based on the true parameter $\boldsymbol{\theta}^{*}$ and the MLE $\widehat{\boldsymbol{\theta}}$ .

A.1 Some technical lemmas

Before presenting the proofs for our main results, we first provide some technical lemmas which will be used from time to time in our proofs. Lemmas 4 and 5 below provide further properties about the process $\{{\mathbf{X}}^{t}\}$ .

Lemma 4.

Let $\{{\mathbf{X}}^{t}\}\sim P_{\boldsymbol{\theta}}$. We have:

(i) $\{{\mathbf{X}}^{t}\circ{\mathbf{X}}^{t-1},t=0,1,2,\cdots\}$, where $\circ$ is the Hadamard product operator, is strictly stationary. Furthermore, for any $1\leq i<j\leq p$, $1\leq l<m\leq p$ and $|t-s|\geq 1$, we have

$$\begin{aligned}
{\rm E}\left(X_{i,j}^{t}X_{i,j}^{t-1}\right)&=\frac{\alpha_{0,i,j}(\alpha_{0,i,j}+\alpha_{1,i,j})}{(1+\alpha_{0,i,j})(1+\alpha_{0,i,j}+\alpha_{1,i,j})},\\
{\rm Var}\left(X_{i,j}^{t}X_{i,j}^{t-1}\right)&=\frac{\alpha_{0,i,j}(\alpha_{0,i,j}+\alpha_{1,i,j})(2\alpha_{0,i,j}+\alpha_{1,i,j}+1)}{(1+\alpha_{0,i,j})^{2}(1+\alpha_{0,i,j}+\alpha_{1,i,j})^{2}},\\
{\rm Cov}\left(X_{i,j}^{t}X_{i,j}^{t-1},X_{l,m}^{s}X_{l,m}^{s-1}\right)&=\begin{cases}\left(\frac{\alpha_{1,i,j}}{1+\alpha_{0,i,j}+\alpha_{1,i,j}}\right)^{|t-s|-1}\frac{\alpha_{0,i,j}(\alpha_{0,i,j}+\alpha_{1,i,j})^{2}}{(1+\alpha_{0,i,j})^{2}(1+\alpha_{0,i,j}+\alpha_{1,i,j})^{2}},&(i,j)=(l,m),\\ 0,&(i,j)\neq(l,m).\end{cases}
\end{aligned}$$

(ii) $\{\left(1-{\mathbf{X}}^{t}\right)\circ\left(1-{\mathbf{X}}^{t-1}\right),t=0,1,2,\cdots\}$ is strictly stationary. Furthermore, for any $1\leq i<j\leq p$, $1\leq l<m\leq p$ and $|t-s|\geq 1$, we have

$$\begin{aligned}
{\rm E}\left(\left(1-X_{i,j}^{t}\right)\left(1-X_{i,j}^{t-1}\right)\right)&=\frac{1+\alpha_{1,i,j}}{(1+\alpha_{0,i,j})(1+\alpha_{0,i,j}+\alpha_{1,i,j})},\\
{\rm Var}\left(\left(1-X_{i,j}^{t}\right)\left(1-X_{i,j}^{t-1}\right)\right)&=\frac{\alpha_{0,i,j}(1+\alpha_{1,i,j})(\alpha_{0,i,j}+\alpha_{1,i,j}+2)}{(1+\alpha_{0,i,j})^{2}(1+\alpha_{0,i,j}+\alpha_{1,i,j})^{2}},\\
{\rm Cov}\left(\left(1-X_{i,j}^{t}\right)\left(1-X_{i,j}^{t-1}\right),\left(1-X_{l,m}^{s}\right)\left(1-X_{l,m}^{s-1}\right)\right)&=\begin{cases}\left(\frac{\alpha_{1,i,j}}{1+\alpha_{0,i,j}+\alpha_{1,i,j}}\right)^{|t-s|-1}\frac{\alpha_{0,i,j}(1+\alpha_{1,i,j})^{2}}{(1+\alpha_{0,i,j})^{2}(1+\alpha_{0,i,j}+\alpha_{1,i,j})^{2}},&(i,j)=(l,m),\\ 0,&(i,j)\neq(l,m).\end{cases}
\end{aligned}$$

Proof.

(i) Denote $\mu_{i,j}={\rm E}\left(X_{i,j}^{t}X_{i,j}^{t-1}\right)$, $\gamma_{i,j}(k)={\rm Cov}\left(X_{i,j}^{t}X_{i,j}^{t-1},X_{i,j}^{t-k}X_{i,j}^{t-k-1}\right)$ and $\rho_{i,j}(k)=\gamma_{i,j}(k)/\gamma_{i,j}(1)$ ($k\geq 1$). For every $i<j$, we have

$$\begin{aligned}
{\rm E}\left(X_{i,j}^{t}X_{i,j}^{t-1}\right)&=P\left(X_{i,j}^{t}=1\,\Big|\,X_{i,j}^{t-1}=1\right)P\left(X_{i,j}^{t-1}=1\right)=\frac{\alpha_{0,i,j}(\alpha_{0,i,j}+\alpha_{1,i,j})}{(1+\alpha_{0,i,j})(1+\alpha_{0,i,j}+\alpha_{1,i,j})},\\
{\rm Var}\left(X_{i,j}^{t}X_{i,j}^{t-1}\right)&={\rm E}\left(\left(X_{i,j}^{t}X_{i,j}^{t-1}\right)^{2}\right)-{\rm E}\left(X_{i,j}^{t}X_{i,j}^{t-1}\right)^{2}\\
&=\left(1-{\rm E}\left(X_{i,j}^{t}X_{i,j}^{t-1}\right)\right){\rm E}\left(X_{i,j}^{t}X_{i,j}^{t-1}\right)=\frac{\alpha_{0,i,j}(\alpha_{0,i,j}+\alpha_{1,i,j})(2\alpha_{0,i,j}+\alpha_{1,i,j}+1)}{(1+\alpha_{0,i,j})^{2}(1+\alpha_{0,i,j}+\alpha_{1,i,j})^{2}},
\end{aligned}$$

and

$$\begin{aligned}
\gamma_{i,j}(1)&={\rm E}\left(X_{i,j}^{t}(X_{i,j}^{t-1})^{2}X_{i,j}^{t-2}\right)-\mu^{2}_{i,j}=P\left(X_{i,j}^{t}X_{i,j}^{t-1}X_{i,j}^{t-2}=1\right)-\mu^{2}_{i,j}\\
&=P\left(X_{i,j}^{t}=1\,\Big|\,X_{i,j}^{t-1}X_{i,j}^{t-2}=1\right)P\left(X_{i,j}^{t-1}X_{i,j}^{t-2}=1\right)-\mu^{2}_{i,j}\\
&={\rm E}\left(X_{i,j}^{t}\,\Big|\,X_{i,j}^{t-1}=1\right)\mu_{i,j}-\mu^{2}_{i,j}\\
&=\frac{\alpha_{0,i,j}(\alpha_{0,i,j}+\alpha_{1,i,j})^{2}}{(1+\alpha_{0,i,j})^{2}(1+\alpha_{0,i,j}+\alpha_{1,i,j})^{2}}.
\end{aligned}$$

For $k\geq 2$ , by Proposition 1, we have,

$$\begin{aligned}
\gamma_{i,j}(k)&={\rm E}\left(X_{i,j}^{t}X_{i,j}^{t-1}X_{i,j}^{t-k}X_{i,j}^{t-k-1}\right)-\mu^{2}_{i,j}\\
&=P\left(X_{i,j}^{t}X_{i,j}^{t-1}=1\,\Big|\,X_{i,j}^{t-k}X_{i,j}^{t-k-1}=1\right)P\left(X_{i,j}^{t-k}X_{i,j}^{t-k-1}=1\right)-\mu^{2}_{i,j}\\
&=P\left(X_{i,j}^{t}=1\,\Big|\,X_{i,j}^{t-1}=1\right)P\left(X_{i,j}^{t-1}=1\,\Big|\,X_{i,j}^{t-k}=1\right)\mu_{i,j}-\mu_{i,j}^{2}\\
&=P\left(X_{i,j}^{t}=1\,\Big|\,X_{i,j}^{t-1}=1\right)P\left(X_{i,j}^{t-1}X_{i,j}^{t-k}=1\right)P\left(X_{i,j}^{t-k}=1\right)^{-1}\mu_{i,j}-\mu^{2}_{i,j}\\
&=P\left(X_{i,j}^{t}=1\,\Big|\,X_{i,j}^{t-1}=1\right)P\left(X_{i,j}^{t-1}X_{i,j}^{t-k}=1\right)P\left(X_{i,j}^{t-k+1}=1\,\Big|\,X_{i,j}^{t-k}=1\right)-\mu^{2}_{i,j}\\
&=P\left(X_{i,j}^{t-1}X_{i,j}^{t-k}=1\right)P\left(X_{i,j}^{t}=1\,\Big|\,X_{i,j}^{t-1}=1\right)^{2}-\mu^{2}_{i,j}\\
&=\left(P\left(X_{i,j}^{t-1}X_{i,j}^{t-k}=1\right)-{\rm E}\left(X_{i,j}^{t}\right)^{2}\right)P\left(X_{i,j}^{t}=1\,\Big|\,X_{i,j}^{t-1}=1\right)^{2}\\
&={\rm Cov}\left(X_{i,j}^{t-1},X_{i,j}^{t-k}\right)\frac{(\alpha_{0,i,j}+\alpha_{1,i,j})^{2}}{(1+\alpha_{0,i,j}+\alpha_{1,i,j})^{2}}\\
&=\left(\frac{\alpha_{1,i,j}}{1+\alpha_{0,i,j}+\alpha_{1,i,j}}\right)^{k-1}\frac{\alpha_{0,i,j}}{(1+\alpha_{0,i,j})^{2}}\frac{(\alpha_{0,i,j}+\alpha_{1,i,j})^{2}}{(1+\alpha_{0,i,j}+\alpha_{1,i,j})^{2}}\\
&=\left(\frac{\alpha_{1,i,j}}{1+\alpha_{0,i,j}+\alpha_{1,i,j}}\right)^{k-1}\gamma_{i,j}(1).
\end{aligned}$$

This proves (i).

(ii) Let $\mu_{i,j}^{\prime}={\rm E}\left(\left(1-X_{i,j}^{t}\right)\left(1-X_{i,j}^{t-1}\right)\right)$, $\gamma_{i,j}^{\prime}(k)={\rm Cov}\left(\left(1-X_{i,j}^{t}\right)\left(1-X_{i,j}^{t-1}\right),\left(1-X_{i,j}^{t-k}\right)\left(1-X_{i,j}^{t-k-1}\right)\right)$ and $\rho_{i,j}^{\prime}(k)=\gamma_{i,j}^{\prime}(k)/\gamma_{i,j}^{\prime}(1)$ ($k\geq 1$). Similarly, for every $i<j$, we have

$$\begin{aligned}
{\rm E}\left(\left(1-X_{i,j}^{t}\right)\left(1-X_{i,j}^{t-1}\right)\right)&=P\left(X_{i,j}^{t}=0\,\Big|\,X_{i,j}^{t-1}=0\right)P\left(X_{i,j}^{t-1}=0\right)\\
&=\frac{1+\alpha_{1,i,j}}{(1+\alpha_{0,i,j})(1+\alpha_{0,i,j}+\alpha_{1,i,j})},\\
{\rm Var}\left(\left(1-X_{i,j}^{t}\right)\left(1-X_{i,j}^{t-1}\right)\right)&=\left(1-\mu^{\prime}_{i,j}\right)\mu^{\prime}_{i,j}=\frac{\alpha_{0,i,j}(1+\alpha_{1,i,j})(\alpha_{0,i,j}+\alpha_{1,i,j}+2)}{(1+\alpha_{0,i,j})^{2}(1+\alpha_{0,i,j}+\alpha_{1,i,j})^{2}},
\end{aligned}$$

and

$$\begin{aligned}
\gamma_{i,j}^{\prime}(1)&={\rm Cov}\left(\left(1-X_{i,j}^{t}\right)\left(1-X_{i,j}^{t-1}\right),\left(1-X_{i,j}^{t-1}\right)\left(1-X_{i,j}^{t-2}\right)\right)\\
&={\rm E}\left(\left(1-X_{i,j}^{t}\right)\left(1-X_{i,j}^{t-1}\right)^{2}\left(1-X_{i,j}^{t-2}\right)\right)-{\rm E}\left(\left(1-X_{i,j}^{t}\right)\left(1-X_{i,j}^{t-1}\right)\right)^{2}\\
&=P\left(\left(1-X_{i,j}^{t}\right)\left(1-X_{i,j}^{t-1}\right)^{2}\left(1-X_{i,j}^{t-2}\right)=1\right)-{\rm E}\left(\left(1-X_{i,j}^{t}\right)\left(1-X_{i,j}^{t-1}\right)\right)^{2}\\
&=P\left((1-X_{i,j}^{t-1})(1-X_{i,j}^{t-2})=1\right)P\left(X_{i,j}^{t}=0\,\Big|\,X_{i,j}^{t-1}=0\right)-\left(\mu^{\prime}_{i,j}\right)^{2}\\
&=(1+\alpha_{0,i,j})\left(\mu^{\prime}_{i,j}\right)^{2}-\left(\mu_{i,j}^{\prime}\right)^{2}\\
&=\alpha_{0,i,j}\left(\mu^{\prime}_{i,j}\right)^{2}\\
&=\frac{\alpha_{0,i,j}(1+\alpha_{1,i,j})^{2}}{(1+\alpha_{0,i,j})^{2}(1+\alpha_{0,i,j}+\alpha_{1,i,j})^{2}}.
\end{aligned}$$

For $k\geq 2$ we have,

$$\begin{aligned}
\gamma_{i,j}^{\prime}(k)&={\rm E}\left(\left(1-X_{i,j}^{t}\right)\left(1-X_{i,j}^{t-1}\right)\left(1-X_{i,j}^{t-k}\right)\left(1-X_{i,j}^{t-k-1}\right)\right)-\left(\mu^{\prime}_{i,j}\right)^{2}\\
&=P\left(\left(1-X_{i,j}^{t}\right)\left(1-X_{i,j}^{t-1}\right)=1\,\Big|\,\left(1-X_{i,j}^{t-k}\right)\left(1-X_{i,j}^{t-k-1}\right)=1\right)\mu^{\prime}_{i,j}-\left(\mu^{\prime}_{i,j}\right)^{2}\\
&=P\left(X_{i,j}^{t}=0\,\Big|\,X_{i,j}^{t-1}=0\right)P\left(X_{i,j}^{t-1}=0\,\Big|\,X_{i,j}^{t-k}=0\right)\mu^{\prime}_{i,j}-\left(\mu^{\prime}_{i,j}\right)^{2}\\
&=P\left(X_{i,j}^{t}=0\,\Big|\,X_{i,j}^{t-1}=0\right)P\left(X_{i,j}^{t-1}=0,X_{i,j}^{t-k}=0\right)P\left(X_{i,j}^{t-k}=0\right)^{-1}\mu^{\prime}_{i,j}-\left(\mu^{\prime}_{i,j}\right)^{2},
\end{aligned}$$

with

$\displaystyle P\left(X_{i,j}^{t-1}=0,X_{i,j}^{t-k}=0\right)$	$\displaystyle=$	$\displaystyle{\rm Cov}\left(1-X_{i,j}^{t-1},1-X_{i,j}^{t-k}\right)+{\rm E}\left(1-X_{i,j}^{t}\right)^{2}$
	$\displaystyle=$	$\displaystyle{\rm Cov}\left(X_{i,j}^{t-1},X_{i,j}^{t-k}\right)+\frac{1}{(1+\alpha_{0,i,j})^{2}}$
	$\displaystyle=$	$\displaystyle\left(\frac{\alpha_{1,i,j}}{1+\alpha_{0,i,j}+\alpha_{1,i,j}}\right)^{k-1}\frac{\alpha_{0,i,j}}{(1+\alpha_{0,i,j})^{2}}+\frac{1}{(1+\alpha_{0,i,j})^{2}}$
	$\displaystyle=$	$\displaystyle\frac{1}{(1+\alpha_{0,i,j})^{2}}\left(\left(\frac{\alpha_{1,i,j}}{1+\alpha_{0,i,j}+\alpha_{1,i,j}}\right)^{k-1}\alpha_{0,i,j}+1\right),$

and

$$P\left(X_{i,j}^{t}=0\,\Big|\,X_{i,j}^{t-1}=0\right)\frac{1}{(1+\alpha_{0,i,j})^{2}}P\left(X_{i,j}^{t-k}=0\right)^{-1}=\frac{1+\alpha_{1,i,j}}{(1+\alpha_{0,i,j})(1+\alpha_{0,i,j}+\alpha_{1,i,j})}=\mu^{\prime}_{i,j}.$$

Thus,

$$\begin{aligned}
\gamma_{i,j}^{\prime}(k)&=P\left(X_{i,j}^{t}=0\,\Big|\,X_{i,j}^{t-1}=0\right)P\left(X_{i,j}^{t-1}=0,X_{i,j}^{t-k}=0\right)P\left(X_{i,j}^{t-k}=0\right)^{-1}\mu^{\prime}_{i,j}-\left(\mu^{\prime}_{i,j}\right)^{2}\\
&=\left(P\left(X_{i,j}^{t}=0\,\Big|\,X_{i,j}^{t-1}=0\right)P\left(X_{i,j}^{t-1}=0,X_{i,j}^{t-k}=0\right)P\left(X_{i,j}^{t-k}=0\right)^{-1}-\mu^{\prime}_{i,j}\right)\mu^{\prime}_{i,j}\\
&=P\left(X_{i,j}^{t}=0\,\Big|\,X_{i,j}^{t-1}=0\right)\left(\frac{\alpha_{1,i,j}}{1+\alpha_{0,i,j}+\alpha_{1,i,j}}\right)^{k-1}\frac{\alpha_{0,i,j}}{(1+\alpha_{0,i,j})^{2}}P\left(X_{i,j}^{t-k}=0\right)^{-1}\mu^{\prime}_{i,j}\\
&=\left(\frac{\alpha_{1,i,j}}{1+\alpha_{0,i,j}+\alpha_{1,i,j}}\right)^{k-1}\frac{\alpha_{0,i,j}(1+\alpha_{1,i,j})^{2}}{(1+\alpha_{0,i,j})^{2}(1+\alpha_{0,i,j}+\alpha_{1,i,j})^{2}}\\
&=\left(\frac{\alpha_{1,i,j}}{1+\alpha_{0,i,j}+\alpha_{1,i,j}}\right)^{k-1}\gamma_{i,j}^{\prime}(1).
\end{aligned}$$

This proves (ii). ∎
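As a numerical sanity check on Lemma 4, the sketch below computes the lag-$k$ autocovariances exactly from the two-state Markov chain of a single edge and compares them with the closed forms in (i) and (ii). It assumes the transition probabilities $P(X^{t}=1\mid X^{t-1}=1)=(\alpha_{0}+\alpha_{1})/(1+\alpha_{0}+\alpha_{1})$ and $P(X^{t}=0\mid X^{t-1}=0)=(1+\alpha_{1})/(1+\alpha_{0}+\alpha_{1})$ read off from the proof above; the function names are illustrative and not part of the paper.

```python
import numpy as np

def transition_matrix(a0, a1):
    """Transition matrix T[x_prev, x_next] of one edge process (assumed form)."""
    denom = 1.0 + a0 + a1
    p11 = (a0 + a1) / denom   # P(X^t = 1 | X^{t-1} = 1)
    p00 = (1.0 + a1) / denom  # P(X^t = 0 | X^{t-1} = 0)
    return np.array([[p00, 1.0 - p00], [1.0 - p11, p11]])

def gamma_exact(a0, a1, k, state):
    """Exact Cov of 1{X^t = X^{t-1} = state} and 1{X^{t-k} = X^{t-k-1} = state}, k >= 1."""
    T = transition_matrix(a0, a1)
    pi = np.array([1.0, a0]) / (1.0 + a0)      # stationary law of X^t
    mu = pi[state] * T[state, state]           # mean of the product indicator
    Tk = np.linalg.matrix_power(T, k - 1)      # k - 1 steps from time t-k to t-1
    joint = pi[state] * T[state, state] * Tk[state, state] * T[state, state]
    return joint - mu ** 2

def gamma_lemma4(a0, a1, k, state):
    """Closed form from Lemma 4: state=1 is part (i), state=0 is part (ii)."""
    rho = a1 / (1.0 + a0 + a1)
    num = a0 * ((a0 + a1) ** 2 if state == 1 else (1.0 + a1) ** 2)
    return rho ** (k - 1) * num / ((1.0 + a0) ** 2 * (1.0 + a0 + a1) ** 2)

if __name__ == "__main__":
    for k in range(1, 5):
        print(k, gamma_exact(0.5, 2.0, k, 1), gamma_lemma4(0.5, 2.0, k, 1))
```

The exact and closed-form values should agree up to floating-point error for any $\alpha_{0},\alpha_{1}>0$ and $k\geq 1$.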

Lemma 5.

Let $\{{\mathbf{X}}^{t}\}\sim P_{\boldsymbol{\theta}}$. Under condition (A1), we have

$$\sup_{1\leq i<j\leq p}\left\{{\rm Var}\left(\sum_{t=1}^{n}X_{i,j}^{t}X_{i,j}^{t-1}\right),\quad{\rm Var}\left(\sum_{t=1}^{n}X_{i,j}^{t}\right)\right\}=O(n).$$

Proof.

Let $Y_{1},Y_{2},\ldots$ be a sequence of Bernoulli random variables with ${\rm E}Y_{i}=\mu$ , ${\rm Var}(Y_{i})=\sigma^{2}$ for all $i=1,2\dots$ , and assume that ${\rm Cov}\left(Y_{i},Y_{j}\right)\leq\sigma^{2}\rho^{|i-j|}$ for some $0\leq\rho<1$ . We have

$\displaystyle{\rm Var}\left(\sum_{i=1}^{n}Y_{i}\right)$	$\displaystyle=$	$\displaystyle\left(\sum_{i=1}^{n}{\rm Var}\left(Y_{i}\right)+\sum_{1\leq i\neq j\leq n}{\rm Cov}\left(Y_{i},Y_{j}\right)\right)$
	$\displaystyle\leq$	$\displaystyle{\sigma^{2}}\left(n+2\rho(n-1)+2\rho^{2}(n-2)+\cdots+2\rho^{n-1}\right)$
	$\displaystyle\leq$	$\displaystyle{2\sigma^{2}}\left(n+\rho(n-1)+\rho^{2}(n-2)+\cdots+\rho^{n-1}\right)$
	$\displaystyle\leq$	$\displaystyle\frac{2n\sigma^{2}}{1-\rho}.$

Lemma 5 then follows directly from Proposition 1, Lemma 4, condition (A1), and the above inequality. ∎
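The inequality in the proof of Lemma 5 can be checked numerically. The sketch below takes the extreme case in which the covariance bound holds with equality, ${\rm Cov}(Y_{i},Y_{j})=\sigma^{2}\rho^{|i-j|}$, and compares the resulting variance of the sum with $2n\sigma^{2}/(1-\rho)$. The function names are hypothetical.

```python
import numpy as np

def variance_of_sum(n, sigma2, rho):
    """Var(Y_1 + ... + Y_n) when Cov(Y_i, Y_j) = sigma2 * rho**|i-j| exactly."""
    idx = np.arange(n)
    cov = sigma2 * rho ** np.abs(idx[:, None] - idx[None, :])
    return cov.sum()  # sum of all entries of the covariance matrix

def lemma5_bound(n, sigma2, rho):
    """Upper bound 2 n sigma^2 / (1 - rho) from the proof of Lemma 5."""
    return 2.0 * n * sigma2 / (1.0 - rho)

if __name__ == "__main__":
    for rho in (0.0, 0.5, 0.95):
        print(rho, variance_of_sum(200, 0.25, rho), lemma5_bound(200, 0.25, rho))
```

Both quantities grow linearly in $n$ for fixed $\rho<1$, which is the $O(n)$ rate claimed in the lemma.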

Lemma 6.

Suppose $Z_{i}$, $i=1,\cdots,p$, are independent random variables with

$${\rm E}\left(Z_{i}\right)=0,\quad{\rm Var}\left(Z_{i}\right)\leq\sigma^{2},$$

and $|Z_{i}|\leq b$ almost surely. Then, for any constant $c>0$, there exists a large enough constant $C>0$ such that, with probability greater than $1-p^{-c}$,

$$\left|\sum_{i=1}^{p}Z_{i}\right|\leq C[\sqrt{p\log(p)}\sigma+b\log(p)].$$

Proof.

This is a direct result of Bernstein’s inequality [22]. ∎

A.2 Proof of Proposition 1

Proof.

$(i)$ follows directly from Proposition 1 of [14]. For (ii), we have,

$$\begin{aligned}
{\rm E}\left(d_{i}^{t}\right)&=\sum_{k=1,\>k\neq i}^{p}{\rm E}\left(X_{i,k}^{t}\right)=\sum_{k=1,\>k\neq i}^{p}\frac{e^{\beta_{i,0}+\beta_{k,0}}}{1+e^{\beta_{i,0}+\beta_{k,0}}},\\
{\rm Var}\left(d_{i}^{t}\right)&=\sum_{k=1,\>k\neq i}^{p}{\rm Var}\left(X_{i,k}^{t}\right)=\sum_{k=1,\>k\neq i}^{p}\frac{e^{\beta_{i,0}+\beta_{k,0}}}{(1+e^{\beta_{i,0}+\beta_{k,0}})^{2}},\\
{\rm Cov}\left(d_{i}^{t},d_{i}^{s}\right)&=\sum_{k=1,\>k\neq i}^{p}{\rm Cov}\left(X_{i,k}^{t},X_{i,k}^{s}\right)\\
&=\sum_{k=1,\>k\neq i}^{p}\left(\frac{e^{\beta_{i,1}+\beta_{k,1}}}{1+\sum_{r=0}^{1}e^{\beta_{i,r}+\beta_{k,r}}}\right)^{|t-s|}\frac{e^{\beta_{i,0}+\beta_{k,0}}}{(1+e^{\beta_{i,0}+\beta_{k,0}})^{2}}.
\end{aligned}$$

Thus,

$$\begin{aligned}
\rho^{d}_{i,j}(|t-s|)&\equiv{\rm Corr}(d_{i}^{t},d_{j}^{s})\\
&=\begin{cases}C_{i,\rho}\sum_{k=1,\>k\neq i}^{p}\left(\frac{e^{\beta_{i,1}+\beta_{k,1}}}{1+\sum_{r=0}^{1}e^{\beta_{i,r}+\beta_{k,r}}}\right)^{|t-s|}\frac{e^{\beta_{i,0}+\beta_{k,0}}}{(1+e^{\beta_{i,0}+\beta_{k,0}})^{2}}\quad&{\rm if}\;i=j,\\ 0&{\rm if}\;i\neq j,\end{cases}
\end{aligned}$$

where $C_{i,\rho}=\left(\sum_{k=1,\>k\neq i}^{p}\frac{e^{\beta_{i,0}+\beta_{k,0}}}{(1+e^{\beta_{i,0}+\beta_{k,0}})^{2}}\right)^{-1}$ .

∎
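The degree autocorrelation derived in this proof is a variance-weighted average of the edge-level autocorrelations. A minimal sketch, assuming the parameters are supplied as two length-$p$ arrays (`beta0` for $\beta_{i,0}$ and `beta1` for $\beta_{i,1}$, both illustrative names), evaluates $\rho^{d}_{i,i}(k)$ for a given node and lag:

```python
import numpy as np

def degree_autocorr(beta0, beta1, i, lag):
    """Evaluate rho^d_{i,i}(lag) from the closed form in the proof of Proposition 1."""
    beta0 = np.asarray(beta0, dtype=float)
    beta1 = np.asarray(beta1, dtype=float)
    others = np.arange(len(beta0)) != i             # all nodes k != i
    a0 = np.exp(beta0[i] + beta0[others])           # e^{beta_{i,0} + beta_{k,0}}
    a1 = np.exp(beta1[i] + beta1[others])           # e^{beta_{i,1} + beta_{k,1}}
    edge_var = a0 / (1.0 + a0) ** 2                 # Var(X_{i,k}^t)
    edge_rho = a1 / (1.0 + a0 + a1)                 # lag-one autocorrelation of edge (i, k)
    return np.sum(edge_rho ** lag * edge_var) / np.sum(edge_var)

if __name__ == "__main__":
    b0 = np.array([-0.5, 0.0, 0.3, 1.0])
    b1 = np.array([0.2, 0.8, -0.1, 0.5])
    print([degree_autocorr(b0, b1, 0, k) for k in range(4)])
```

When all nodes share the same parameters, every edge has the same autocorrelation and the weighted average reduces to a single geometric decay in the lag.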

A.3 Proof of Lemma 1

Proof.

By Proposition 3 of [14] and Lemma 4, the process $\{{\mathbf{X}}^{t},t=1,2,\ldots\}$ is $\alpha$-mixing with an exponentially decaying rate. Since the mixing property is hereditary, the processes $\{\left(1-{\mathbf{X}}^{t}\right)\circ\left(1-{\mathbf{X}}^{t-1}\right),t=1,2,\ldots\}$ and $\{{\mathbf{X}}^{t}\circ{\mathbf{X}}^{t-1},t=1,2,\ldots\}$ are also $\alpha$-mixing with an exponentially decaying rate. From Theorem 1 in [24], we obtain the following concentration inequalities: there exists a positive constant $C$ such that

$$\begin{aligned}
{\mathbf{P}}\left(\left|\sum_{t=1}^{n}\left\{X_{i,j}^{t}-{\rm E}\left(X_{i,j}^{t}\right)\right\}\right|>\epsilon\right)&\leq\exp\left(\frac{-C\epsilon^{2}}{n+\epsilon\log\left(n\right)\log\log\left(n\right)}\right),\\
{\mathbf{P}}\left(\left|\sum_{t=1}^{n}\left\{X_{i,j}^{t}X_{i,j}^{t-1}-{\rm E}\left(X_{i,j}^{t}X_{i,j}^{t-1}\right)\right\}\right|>\epsilon\right)&\leq\exp\left(\frac{-C\epsilon^{2}}{n+\epsilon\log\left(n\right)\log\log\left(n\right)}\right),\\
{\mathbf{P}}\left(\left|\sum_{t=1}^{n}\left\{(1-X_{i,j}^{t})(1-X_{i,j}^{t-1})-{\rm E}\left((1-X_{i,j}^{t})(1-X_{i,j}^{t-1})\right)\right\}\right|>\epsilon\right)&\leq\exp\left(\frac{-C\epsilon^{2}}{n+\epsilon\log\left(n\right)\log\log\left(n\right)}\right),
\end{aligned}$$

hold for all $1\leq i<j\leq p$ . For any positive constant $c>0$ , by setting $\epsilon=c_{1}\sqrt{n\log(np)}+c_{1}\log\left(n\right)\log\log\left(n\right)\log\left(np\right)$ with a big enough constant $c_{1}>0$ , we have, with probability greater than $1-(np)^{-c}$ ,

$$\begin{aligned}
\left|\sum_{t=1}^{n}\left\{X_{i,j}^{t}-{\rm E}\left(X_{i,j}^{t}\right)\right\}\right|&\leq\epsilon,\\
\left|\sum_{t=1}^{n}\left\{X_{i,j}^{t}X_{i,j}^{t-1}-{\rm E}\left(X_{i,j}^{t}X_{i,j}^{t-1}\right)\right\}\right|&\leq\epsilon,\\
\left|\sum_{t=1}^{n}\left\{(1-X_{i,j}^{t})(1-X_{i,j}^{t-1})-{\rm E}\left((1-X_{i,j}^{t})(1-X_{i,j}^{t-1})\right)\right\}\right|&\leq\epsilon,
\end{aligned}$$

hold for all $1\leq i<j\leq p$ . ∎
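To see why this choice of $\epsilon$ yields a polynomial tail in $np$, write $A=\sqrt{n\log(np)}$ and $B=\log(n)\log\log(n)\log(np)$, so that $\epsilon=c_{1}(A+B)$. For $c_{1}\geq 1$ and $n\geq 3$ one has $(n+\epsilon\log(n)\log\log(n))\log(np)=A^{2}+c_{1}(A+B)B\leq c_{1}(A+B)^{2}$, hence $\epsilon^{2}/(n+\epsilon\log(n)\log\log(n))\geq c_{1}\log(np)$ and each tail probability is at most $(np)^{-Cc_{1}}$. This is our reading of the step rather than a statement taken from the paper; a union bound over the edge pairs then presumably absorbs a factor of order $p^{2}$ once $c_{1}$ is large enough. The sketch below, with hypothetical function names, checks the inequality numerically:

```python
import math

def log_factor(n):
    """log(n) * log(log(n)); nonnegative once n >= 3."""
    return math.log(n) * math.log(math.log(n))

def epsilon_threshold(n, p, c1):
    """epsilon = c1*sqrt(n log(np)) + c1*log(n) loglog(n) log(np), as in the proof."""
    lnp = math.log(n * p)
    return c1 * math.sqrt(n * lnp) + c1 * log_factor(n) * lnp

def bernstein_exponent(n, eps):
    """eps^2 / (n + eps * log(n) loglog(n)), the exponent up to the constant C."""
    return eps ** 2 / (n + eps * log_factor(n))

if __name__ == "__main__":
    for n, p in [(10, 50), (100, 100), (1000, 20)]:
        eps = epsilon_threshold(n, p, c1=2.0)
        print(n, p, bernstein_exponent(n, eps), 2.0 * math.log(n * p))
```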

A.4 Proof of Lemma 2

Proof.

Note that for $a,b,c,d\in[0,1]$ , we have

$$\left|ab-cd\right|\leq\left|ab-cb\right|+\left|cb-cd\right|=\left|a-c\right|b+\left|b-d\right|c\leq\left|a-c\right|+\left|b-d\right|.$$

With $-\kappa_{0}\leq\theta_{i,0}\leq\kappa_{0}$ and $-\kappa_{1}\leq\theta_{i,1}\leq\kappa_{1}$ , for all $1\leq i\leq p$ and $\kappa_{r}=\max\left(\kappa_{r},\kappa_{r}\right)$ for $r=0,1$ , we then have, there exist positive constants $C_{1},C_{2}$ such that, for all $1\leq i\neq j\leq p$ and $\boldsymbol{\theta}\in{\mathbf{B}}_{\infty}\left(\boldsymbol{\theta}^{*},c_{r}e^{-4\kappa_{0}-4\kappa_{1}}\right)$ where $c_{r}>0$ is a small enough constant,

$${\rm E}({\mathbf{V}}_{2}(\boldsymbol{\theta})+{\mathbf{V}}_{1}(\boldsymbol{\theta}))_{i,j}=\frac{1}{p}\frac{\alpha_{0,i,j}}{(1+\alpha_{0,i,j}+\alpha_{1,i,j})^{2}}\geq C_{1}\frac{e^{-4\kappa_{1}-2\kappa_{0}}}{p},$$

$$\begin{aligned}
&{\rm E}(-{\mathbf{V}}_{2}(\boldsymbol{\theta}))_{i,j}\\
&=\frac{1}{p}\left(\frac{\alpha_{0,i,j}\alpha_{1,i,j}}{(1+\alpha_{0,i,j}+\alpha_{1,i,j})^{2}}-{\rm E}(b_{i,j})\frac{\alpha_{0,i,j}\alpha_{1,i,j}}{(\alpha_{0,i,j}+\alpha_{1,i,j})^{2}}\right)\\
&=\frac{1}{p}\frac{\alpha_{0,i,j}\alpha_{1,i,j}}{(\alpha_{0,i,j}+\alpha_{1,i,j})^{2}}\left(\frac{(\alpha_{0,i,j}+\alpha_{1,i,j})^{2}}{(1+\alpha_{0,i,j}+\alpha_{1,i,j})^{2}}-{\rm E}(b_{i,j})\right)\\
&=\frac{1}{p}\frac{\alpha_{0,i,j}\alpha_{1,i,j}}{(\alpha_{0,i,j}+\alpha_{1,i,j})^{2}}\Bigg\{\left(\frac{(\alpha_{0,i,j}+\alpha_{1,i,j})^{2}}{(1+\alpha_{0,i,j}+\alpha_{1,i,j})^{2}}-\frac{\alpha_{0,i,j}(\alpha_{0,i,j}+\alpha_{1,i,j})}{(1+\alpha_{0,i,j})(1+\alpha_{0,i,j}+\alpha_{1,i,j})}\right)\\
&\qquad-\left(\frac{\alpha^{*}_{0,i,j}(\alpha^{*}_{0,i,j}+\alpha^{*}_{1,i,j})}{(1+\alpha^{*}_{0,i,j})(1+\alpha^{*}_{0,i,j}+\alpha^{*}_{1,i,j})}-\frac{\alpha_{0,i,j}(\alpha_{0,i,j}+\alpha_{1,i,j})}{(1+\alpha_{0,i,j})(1+\alpha_{0,i,j}+\alpha_{1,i,j})}\right)\Bigg\}\\
&\geq\frac{1}{p}\frac{\alpha_{0,i,j}\alpha_{1,i,j}}{(\alpha_{0,i,j}+\alpha_{1,i,j})^{2}}\Bigg\{\frac{\alpha_{1,i,j}(\alpha_{0,i,j}+\alpha_{1,i,j})}{(1+\alpha_{0,i,j})(1+\alpha_{0,i,j}+\alpha_{1,i,j})^{2}}\\
&\qquad-\left(\left|\frac{\alpha^{*}_{0,i,j}}{1+\alpha^{*}_{0,i,j}}-\frac{\alpha_{0,i,j}}{1+\alpha_{0,i,j}}\right|+\left|\frac{\alpha^{*}_{0,i,j}+\alpha^{*}_{1,i,j}}{1+\alpha^{*}_{0,i,j}+\alpha^{*}_{1,i,j}}-\frac{\alpha_{0,i,j}+\alpha_{1,i,j}}{1+\alpha_{0,i,j}+\alpha_{1,i,j}}\right|\right)\Bigg\}\\
&\geq\frac{1}{p}\frac{\alpha_{0,i,j}\alpha_{1,i,j}}{(\alpha_{0,i,j}+\alpha_{1,i,j})^{2}}\Bigg\{\frac{\alpha_{1,i,j}(\alpha_{0,i,j}+\alpha_{1,i,j})}{(1+\alpha_{0,i,j})(1+\alpha_{0,i,j}+\alpha_{1,i,j})^{2}}\\
&\qquad-\left(\left|\frac{\alpha_{0,i,j}\left(\frac{\alpha^{*}_{0,i,j}}{\alpha_{0,i,j}}-1\right)}{(1+\alpha_{0,i,j})(1+\alpha^{*}_{0,i,j})}\right|+\left|\frac{\alpha_{0,i,j}\left(\frac{\alpha^{*}_{0,i,j}}{\alpha_{0,i,j}}-1\right)+\alpha_{1,i,j}\left(\frac{\alpha^{*}_{1,i,j}}{\alpha_{1,i,j}}-1\right)}{(1+\alpha^{*}_{0,i,j}+\alpha^{*}_{1,i,j})(1+\alpha_{0,i,j}+\alpha_{1,i,j})}\right|\right)\Bigg\}\\
&\geq\frac{1}{p}\frac{\alpha_{0,i,j}\alpha_{1,i,j}}{(\alpha_{0,i,j}+\alpha_{1,i,j})^{2}}\Bigg\{\frac{\alpha_{1,i,j}(\alpha_{0,i,j}+\alpha_{1,i,j})}{(1+\alpha_{0,i,j})(1+\alpha_{0,i,j}+\alpha_{1,i,j})^{2}}-\frac{2\left|\frac{\alpha^{*}_{0,i,j}}{\alpha_{0,i,j}}-1\right|+\left|\frac{\alpha^{*}_{1,i,j}}{\alpha_{1,i,j}}-1\right|}{1+\alpha^{*}_{0,i,j}+\alpha^{*}_{1,i,j}}\Bigg\}\\
&\geq C_{1}\frac{e^{-6\kappa_{0}-4\kappa_{1}}}{p},
\end{aligned}$$

and

$$\begin{aligned}
&{\rm E}({\mathbf{V}}_{2}(\boldsymbol{\theta})+{\mathbf{V}}_{3}(\boldsymbol{\theta}))_{i,j}\\
&=\frac{1}{p}\left(\frac{\alpha_{1,i,j}}{(1+\alpha_{0,i,j}+\alpha_{1,i,j})^{2}}-{\rm E}(d_{i,j})\frac{\alpha_{1,i,j}}{(1+\alpha_{1,i,j})^{2}}\right)\\
&=\frac{1}{p}\frac{\alpha_{1,i,j}}{(1+\alpha_{1,i,j})^{2}}\left(\frac{(1+\alpha_{1,i,j})^{2}}{(1+\alpha_{0,i,j}+\alpha_{1,i,j})^{2}}-{\rm E}(d_{i,j})\right)\\
&=\frac{1}{p}\frac{\alpha_{1,i,j}}{(1+\alpha_{1,i,j})^{2}}\Bigg\{\frac{(1+\alpha_{1,i,j})^{2}}{(1+\alpha_{0,i,j}+\alpha_{1,i,j})^{2}}-\frac{1+\alpha_{1,i,j}}{(1+\alpha_{0,i,j})(1+\alpha_{0,i,j}+\alpha_{1,i,j})}\\
&\qquad-\left(\frac{1+\alpha^{*}_{1,i,j}}{(1+\alpha^{*}_{0,i,j})(1+\alpha^{*}_{0,i,j}+\alpha^{*}_{1,i,j})}-\frac{1+\alpha_{1,i,j}}{(1+\alpha_{0,i,j})(1+\alpha_{0,i,j}+\alpha_{1,i,j})}\right)\Bigg\}\\
&\geq\frac{1}{p}\frac{\alpha_{1,i,j}}{(1+\alpha_{1,i,j})^{2}}\Bigg\{\frac{\alpha_{0,i,j}\alpha_{1,i,j}(1+\alpha_{1,i,j})}{(1+\alpha_{0,i,j})(1+\alpha_{0,i,j}+\alpha_{1,i,j})^{2}}\\
&\qquad-\left(\left|\frac{1}{1+\alpha^{*}_{0,i,j}}-\frac{1}{1+\alpha_{0,i,j}}\right|+\left|\frac{1+\alpha^{*}_{1,i,j}}{1+\alpha^{*}_{0,i,j}+\alpha^{*}_{1,i,j}}-\frac{1+\alpha_{1,i,j}}{1+\alpha_{0,i,j}+\alpha_{1,i,j}}\right|\right)\Bigg\}\\
&=\frac{1}{p}\frac{\alpha_{1,i,j}}{(1+\alpha_{1,i,j})^{2}}\Bigg\{\frac{\alpha_{0,i,j}\alpha_{1,i,j}(1+\alpha_{1,i,j})}{(1+\alpha_{0,i,j})(1+\alpha_{0,i,j}+\alpha_{1,i,j})^{2}}\\
&\qquad-\left(\left|\frac{\alpha_{0,i,j}-\alpha^{*}_{0,i,j}}{(1+\alpha_{0,i,j})(1+\alpha^{*}_{0,i,j})}\right|+\left|\frac{\alpha_{0,i,j}-\alpha^{*}_{0,i,j}+\alpha_{0,i,j}\alpha^{*}_{1,i,j}-\alpha^{*}_{0,i,j}\alpha_{1,i,j}}{(1+\alpha^{*}_{0,i,j}+\alpha^{*}_{1,i,j})(1+\alpha_{0,i,j}+\alpha_{1,i,j})}\right|\right)\Bigg\}\\
&=\frac{1}{p}\frac{\alpha_{1,i,j}}{(1+\alpha_{1,i,j})^{2}}\Bigg\{\frac{\alpha_{0,i,j}\alpha_{1,i,j}(1+\alpha_{1,i,j})}{(1+\alpha_{0,i,j})(1+\alpha_{0,i,j}+\alpha_{1,i,j})^{2}}-\left|\frac{\alpha_{0,i,j}-\alpha^{*}_{0,i,j}}{(1+\alpha_{0,i,j})(1+\alpha^{*}_{0,i,j})}\right|\\
&\qquad-\left|\frac{\alpha_{0,i,j}-\alpha^{*}_{0,i,j}+\alpha_{0,i,j}\alpha^{*}_{1,i,j}-\alpha^{*}_{0,i,j}\alpha^{*}_{1,i,j}+\alpha^{*}_{0,i,j}\alpha^{*}_{1,i,j}-\alpha^{*}_{0,i,j}\alpha_{1,i,j}}{(1+\alpha^{*}_{0,i,j}+\alpha^{*}_{1,i,j})(1+\alpha_{0,i,j}+\alpha_{1,i,j})}\right|\Bigg\}\\
&=\frac{1}{p}\frac{\alpha_{1,i,j}}{(1+\alpha_{1,i,j})^{2}}\Bigg\{\frac{\alpha_{0,i,j}\alpha_{1,i,j}(1+\alpha_{1,i,j})}{(1+\alpha_{0,i,j})(1+\alpha_{0,i,j}+\alpha_{1,i,j})^{2}}-\left|\frac{\alpha_{0,i,j}\left(\frac{\alpha^{*}_{0,i,j}}{\alpha_{0,i,j}}-1\right)}{(1+\alpha_{0,i,j})(1+\alpha^{*}_{0,i,j})}\right|\\
&\qquad-\left|\frac{-\alpha_{0,i,j}\left(\frac{\alpha^{*}_{0,i,j}}{\alpha_{0,i,j}}-1\right)-\alpha_{0,i,j}\alpha^{*}_{1,i,j}\left(\frac{\alpha^{*}_{0,i,j}}{\alpha_{0,i,j}}-1\right)+\alpha^{*}_{0,i,j}\alpha_{1,i,j}\left(\frac{\alpha^{*}_{1,i,j}}{\alpha_{1,i,j}}-1\right)}{(1+\alpha^{*}_{0,i,j}+\alpha^{*}_{1,i,j})(1+\alpha_{0,i,j}+\alpha_{1,i,j})}\right|\Bigg\}\\
&\geq C_{1}\frac{1}{p}\frac{\alpha_{1,i,j}}{(1+\alpha_{1,i,j})^{2}}\Bigg\{\frac{\alpha_{0,i,j}\alpha_{1,i,j}(1+\alpha_{1,i,j})}{(1+\alpha_{0,i,j})(1+\alpha_{0,i,j}+\alpha_{1,i,j})^{2}}\\
&\qquad-\left(2\left|\frac{\alpha^{*}_{0,i,j}}{\alpha_{0,i,j}}-1\right|+\left|\frac{\alpha^{*}_{1,i,j}}{\alpha_{1,i,j}}-1\right|\right)\Bigg\}\\
&\geq C_{1}\frac{e^{-4\kappa_{0}-4\kappa_{1}}}{p}.
\end{aligned}$$

Notice that the elements in $-{\rm E}{\mathbf{V}}_{2}(\boldsymbol{\theta}),{\rm E}({\mathbf{V}}_{2}(\boldsymbol{\theta})+{\mathbf{V}}_{3}(\boldsymbol{\theta}))$ and ${\rm E}({\mathbf{V}}_{2}(\boldsymbol{\theta})+{\mathbf{V}}_{1}(\boldsymbol{\theta}))$ are all positive. Denote ${\mathbf{z}}=({\mathbf{z}}_{1}^{\top},{\mathbf{z}}_{2}^{\top})^{\top}$ with ${\mathbf{z}}_{1}=(z_{1,1},\ldots,z_{1,p})^{\top}\in\mathbb{R}^{p}$ and ${\mathbf{z}}_{2}=(z_{2,1},\ldots,z_{2,p})^{\top}\in\mathbb{R}^{p}$ . Then there exists a constant $C>0$ such that,

$$\begin{aligned}
&\left\|{\rm E}({\mathbf{V}}(\boldsymbol{\theta}))\right\|_{2}\\
&\geq\inf_{\|{\mathbf{z}}\|_{2}=1}\Bigg(\sum_{1\leq i<j\leq p}\Big({\rm E}\left({\mathbf{V}}_{1}(\boldsymbol{\theta})+{\mathbf{V}}_{2}(\boldsymbol{\theta})\right)_{i,j}(z_{1,i}+z_{1,j})^{2}\\
&\qquad+{\rm E}\left({\mathbf{V}}_{3}(\boldsymbol{\theta})+{\mathbf{V}}_{2}(\boldsymbol{\theta})\right)_{i,j}(z_{2,i}+z_{2,j})^{2}-{\rm E}\left({\mathbf{V}}_{2}(\boldsymbol{\theta})\right)_{i,j}\left(z_{1,i}+z_{1,j}-z_{2,i}-z_{2,j}\right)^{2}\Big)\Bigg)\\
&\geq\inf_{\|{\mathbf{z}}\|_{2}=1}\Bigg(\sum_{1\leq i<j\leq p}{\rm E}\left({\mathbf{V}}_{1}(\boldsymbol{\theta})+{\mathbf{V}}_{2}(\boldsymbol{\theta})\right)_{i,j}(z_{1,i}+z_{1,j})^{2}\\
&\qquad+{\rm E}\left({\mathbf{V}}_{3}(\boldsymbol{\theta})+{\mathbf{V}}_{2}(\boldsymbol{\theta})\right)_{i,j}(z_{2,i}+z_{2,j})^{2}\Bigg)\\
&\geq C_{2}\frac{e^{-4\kappa_{0}-4\kappa_{1}}}{p}\inf_{\|{\mathbf{z}}\|_{2}=1}\sum_{1\leq i<j\leq p}\left((z_{1,i}+z_{1,j})^{2}+(z_{2,i}+z_{2,j})^{2}\right)\\
&\geq Ce^{-4\kappa_{0}-4\kappa_{1}}.
\end{aligned}$$

Here, in the last step, we have used the fact that for any ${\mathbf{a}}=(a_{1},\ldots,a_{p})^{\top}\in\mathbb{R}^{p}$, $\sum_{1\leq i<j\leq p}(a_{i}+a_{j})^{2}={\mathbf{a}}^{\top}{\mathbf{C}}{\mathbf{a}}$ where ${\mathbf{C}}=(p-2){{\mathbf{I}}}_{p}+{\bf 1}_{p}{\bf 1}_{p}^{\top}$, and the fact that all eigenvalues of ${\mathbf{C}}$ are greater than or equal to $p-2$. ∎
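The identity used in this last step is easy to verify numerically. The sketch below (function names are illustrative) compares the pairwise sum with the quadratic form ${\mathbf{a}}^{\top}{\mathbf{C}}{\mathbf{a}}$ and inspects the spectrum of ${\mathbf{C}}$, whose eigenvalues are $p-2$ with multiplicity $p-1$ and $2p-2$ along the direction ${\bf 1}_{p}$:

```python
import numpy as np

def pair_sum_quadratic(a):
    """Direct evaluation of sum_{i<j} (a_i + a_j)^2."""
    a = np.asarray(a, dtype=float)
    i, j = np.triu_indices(len(a), k=1)  # all index pairs with i < j
    return np.sum((a[i] + a[j]) ** 2)

def pair_sum_matrix(p):
    """C = (p - 2) I_p + 1_p 1_p^T, so that a^T C a equals the pairwise sum."""
    return (p - 2) * np.eye(p) + np.ones((p, p))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.normal(size=6)
    C = pair_sum_matrix(6)
    print(pair_sum_quadratic(a), a @ C @ a)
    print(np.linalg.eigvalsh(C))
```

The same matrix also gives the upper bound $\sup_{\|{\mathbf{a}}\|_{2}=1}\sum_{i<j}(a_{i}+a_{j})^{2}=2(p-1)$ that appears in the proofs of Lemma 3 and Theorem 1.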

A.5 Proof of Lemma 3

Proof.

Define a series of matrices $\{{\mathbf{Y}}_{i,j}\}$ $(1\leq i<j\leq p)$ . For ${\mathbf{Y}}_{i,j}$ , the $(i,j)$ , $(j,i)$ , $(i,i)$ , $(j,j)$ elements are $Z_{i,j}$ while other elements are set to be zero. Then all the ${\mathbf{Y}}_{i,j}$ matrices are independent and

$$\sum_{1\leq i<j\leq p}{\mathbf{Y}}_{i,j}={\mathbf{Z}}.$$

Since ${\mathbf{Y}}_{i,j}$ are symmetric and centered random matrices, we have ${\rm Var}\left({\mathbf{Y}}_{i,j}\right)={\rm E}\left({\mathbf{Y}}_{i,j}{\mathbf{Y}}_{i,j}\right)$ . Further, by the definition of ${\mathbf{Y}}_{i,j}$ , we know that the $(i,j)$ th, $(j,i)$ th, $(i,i)$ th and $(j,j)$ th elements of ${\rm Var}({\mathbf{Y}}_{i,j})$ are all equal to $2{\rm Var}\left(Z_{i,j}\right)$ , while all other elements are zero. Consequently,

$$\begin{aligned}
\|{\mathbf{Y}}_{i,j}\|_{2}&\leq b\sup_{\|a\|_{2}=1}\left((a_{i}+a_{j})^{2}\right)\leq 2b,\\
\left\|\sum_{1\leq i<j\leq p}{\rm Var}({\mathbf{Y}}_{i,j})\right\|_{2}&=\sup_{\|a\|_{2}=1}\left(\sum_{1\leq i<j\leq p}2{\rm Var}\left(Z_{i,j}\right)(a_{i}+a_{j})^{2}\right)\\
&\leq\max_{i,j}\left(2{\rm Var}\left(Z_{i,j}\right)\right)\sup_{\|a\|_{2}=1}\left(\sum_{1\leq i<j\leq p}(a_{i}+a_{j})^{2}\right)\\
&\leq 4\sigma^{2}(p-1).
\end{aligned}$$

Using the matrix Bernstein inequality (cf. Theorem 6.17 of [33]), we have

$$P\left(\left\|{\mathbf{Z}}\right\|_{2}>\epsilon\right)=P\left(\left\|\sum_{1\leq i<j\leq p}{\mathbf{Y}}_{i,j}\right\|_{2}>\epsilon\right)\leq 2p\ \exp\left(-\frac{\epsilon^{2}}{4\sigma^{2}(p-1)+4b\epsilon}\right).$$

∎
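The decomposition used in this proof can be made concrete. In the sketch below, each ${\mathbf{Y}}_{i,j}$ is written as $Z_{i,j}{\mathbf{v}}{\mathbf{v}}^{\top}$ with ${\mathbf{v}}={\mathbf{e}}_{i}+{\mathbf{e}}_{j}$, which gives $\|{\mathbf{Y}}_{i,j}\|_{2}=2|Z_{i,j}|$ and ${\mathbf{Y}}_{i,j}^{2}=2Z_{i,j}^{2}{\mathbf{v}}{\mathbf{v}}^{\top}$. The helper names and the choice of storing the $Z_{i,j}$ in the strict upper triangle of an array are assumptions made for illustration only.

```python
import numpy as np

def build_Y(p, i, j, z):
    """Y_{i,j}: entries (i,j), (j,i), (i,i), (j,j) equal z, all others zero."""
    v = np.zeros(p)
    v[i] = v[j] = 1.0
    return z * np.outer(v, v)

def assemble_Z(upper):
    """Sum of Y_{i,j} over i < j, with Z_{i,j} read from the strict upper triangle."""
    p = upper.shape[0]
    Z = np.zeros((p, p))
    for i in range(p):
        for j in range(i + 1, p):
            Z += build_Y(p, i, j, upper[i, j])
    return Z

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    U = np.triu(rng.uniform(-1.0, 1.0, size=(5, 5)), k=1)
    Z = assemble_Z(U)
    print(np.linalg.norm(Z, 2))  # spectral norm of the assembled matrix
```

The resulting ${\mathbf{Z}}$ has off-diagonal entries $Z_{i,j}$ and diagonal entries equal to the corresponding row sums, matching the structure of the matrices ${\mathbf{V}}_{k}(\boldsymbol{\theta})$ in the proof of Theorem 1.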

A.6 Proof of Theorem 1

Proof.

Note that for any $\boldsymbol{\theta}$ and $i\neq j$ , we have

	$\displaystyle\left({\mathbf{V}}_{2}(\boldsymbol{\theta})+{\mathbf{V}}_{1}(\boldsymbol{\theta})\right)_{i,j}$	$\displaystyle=$	$\displaystyle\frac{1}{p}\frac{\alpha_{0,i,j}}{(1+\alpha_{0,i,j}+\alpha_{1,i,j})^{2}}>0,$
	$\displaystyle\left({\mathbf{V}}_{2}(\boldsymbol{\theta})+{\mathbf{V}}_{1}(\boldsymbol{\theta})\right)_{i,i}$	$\displaystyle=$	$\displaystyle\sum_{k=1,\>k\neq i}^{p}\left({\mathbf{V}}_{2}(\boldsymbol{\theta})+{\mathbf{V}}_{1}(\boldsymbol{\theta})\right)_{i,k}.$

Therefore, ${\mathbf{V}}_{2}(\boldsymbol{\theta})+{\mathbf{V}}_{1}(\boldsymbol{\theta})$ is always positive definite. Next we prove that all ${\mathbf{U}}(\boldsymbol{\theta})\in\{-{\mathbf{V}}_{2}(\boldsymbol{\theta}),{\mathbf{V}}_{2}(\boldsymbol{\theta})+{\mathbf{V}}_{3}(\boldsymbol{\theta})\}$ are positive definite (in probability) by showing that with probability tending to 1,

$$\inf_{\|{\mathbf{a}}\|_{2}=1}\left({\mathbf{a}}^{\top}({\rm E}({\mathbf{U}}(\boldsymbol{\theta}))){\mathbf{a}}\right)>\sup_{\|{\mathbf{a}}\|_{2}=1}\left({\mathbf{a}}^{\top}({\mathbf{U}}(\boldsymbol{\theta})-{\rm E}({\mathbf{U}}(\boldsymbol{\theta}))){\mathbf{a}}\right),$$

holds uniformly for all $\boldsymbol{\theta}\in{\mathbf{B}}_{\infty}\left(\boldsymbol{\theta}^{*},r\right)$ . Note that

$$\begin{aligned}
&\left\|{\mathbf{U}}(\boldsymbol{\theta})-{\rm E}({\mathbf{U}}(\boldsymbol{\theta}))\right\|_{2}\\
&\leq\left\|{\mathbf{U}}(\boldsymbol{\theta}^{*})-{\rm E}({\mathbf{U}}(\boldsymbol{\theta}^{*}))\right\|_{2}+\left\|{\mathbf{U}}(\boldsymbol{\theta})-{\mathbf{U}}(\boldsymbol{\theta}^{*})+{\rm E}({\mathbf{U}}(\boldsymbol{\theta}^{*}))-{\rm E}({\mathbf{U}}(\boldsymbol{\theta}))\right\|_{2}.
\end{aligned}$$

We consider $\left\|{\mathbf{U}}(\boldsymbol{\theta}^{*})-{\rm E}({\mathbf{U}}(\boldsymbol{\theta}^{*}))\right\|_{2}$ first. By setting $\epsilon=c_{1}\big{(}\sqrt{2\sigma^{2}(p-1)\log(np)}+b\log(np)\big{)}$ with some big enough constant $c_{1}>0$ in Lemma 3, we have

$$\begin{aligned}
P\left(\left\|{\mathbf{Z}}\right\|_{2}>\epsilon\right)&\leq 2p\exp\left(-c_{1}^{2}\frac{2\sigma^{2}(p-1)\log(np)+2b\sqrt{2\sigma^{2}(p-1)}\log^{3/2}(np)+b^{2}\log^{2}(np)}{2\sigma^{2}(p-1)+4c_{1}b\sqrt{2\sigma^{2}(p-1)\log(np)}+4c_{1}b^{2}\log(np)}\right)\\
&\leq 2p\exp\left(-c_{1}\log(np)/4\right)\\
&=2p\left(np\right)^{-c_{1}/4}.
\end{aligned}$$

As $np\to\infty$ , we have with probability greater than $1-2p\left(np\right)^{-c_{1}/4}$ ,

$$\left\|{\mathbf{Z}}\right\|_{2}\leq c_{1}\left(\sigma\sqrt{2p\log(np)}+b\log(np)\right).\tag{A.21}$$

By Lemma 1 and Lemma 5, there exist positive constants $C_{1}$ and $c_{2}$ such that, uniformly for all $\boldsymbol{\theta}$ and $1\leq i\neq j\leq p$, with probability greater than $1-(np)^{-c_{2}}$,

$$\begin{aligned}
\left|{\mathbf{U}}(\boldsymbol{\theta})_{i,j}-{\rm E}({\mathbf{U}}(\boldsymbol{\theta}))_{i,j}\right|&\leq C_{1}\left(\sqrt{\frac{\log(np)}{np^{2}}}+\frac{\log\left(n\right)\log\log\left(n\right)\log\left(np\right)}{np}\right);\\
{\rm Var}\left({\mathbf{U}}(\boldsymbol{\theta})_{i,j}\right)&\leq\frac{\sup_{i\neq j}\left\{{\rm Var}\left(\sum_{t=1}^{n}X_{i,j}^{t}X_{i,j}^{t-1}\right),{\rm Var}\left(\sum_{t=1}^{n}X_{i,j}^{t}\right)\right\}}{n^{2}p^{2}}\leq\frac{C_{1}}{np^{2}}.
\end{aligned}$$

Consequently, from (A.21) we have

$$\left\|{\mathbf{U}}(\boldsymbol{\theta}^{*})-{\rm E}({\mathbf{U}}(\boldsymbol{\theta}^{*}))\right\|_{2}=O_{p}\left(\sqrt{\frac{\log(np)}{np}}+\sqrt{\frac{\log^{3}(np)}{np^{2}}}+\frac{\log\left(n\right)\log\log\left(n\right)\log^{2}\left(np\right)}{np}\right).$$

Next we derive uniform upper bounds for $\left\|{\mathbf{U}}(\boldsymbol{\theta})-{\mathbf{U}}(\boldsymbol{\theta}^{*})+{\rm E}({\mathbf{U}}(\boldsymbol{\theta}^{*}))-{\rm E}({\mathbf{U}}(\boldsymbol{\theta}))\right\|_{2}$ . When ${\mathbf{U}}(\boldsymbol{\theta})=-{\mathbf{V}}_{2}(\boldsymbol{\theta})$ , Lemma 1 implies that there exist positive constants $C_{2}$ and $c_{3}$ such that, with probability greater than $1-(np)^{-c_{3}}$ ,

\begin{aligned}
&\left\|{\mathbf{U}}(\boldsymbol{\theta})-{\mathbf{U}}(\boldsymbol{\theta}^{*})+{\rm E}({\mathbf{U}}(\boldsymbol{\theta}^{*}))-{\rm E}({\mathbf{U}}(\boldsymbol{\theta}))\right\|_{2}\\
&=\sup_{\|{\mathbf{a}}\|_{2}=1}\sum_{1\leq i,j\leq p}\left[{\mathbf{U}}(\boldsymbol{\theta})-{\mathbf{U}}(\boldsymbol{\theta}^{*})+{\rm E}({\mathbf{U}}(\boldsymbol{\theta}^{*}))-{\rm E}({\mathbf{U}}(\boldsymbol{\theta}))\right]_{i,j}a_{i}a_{j}\\
&=\sup_{\|{\mathbf{a}}\|_{2}=1}\sum_{1\leq i<j\leq p}\left[{\mathbf{U}}(\boldsymbol{\theta})-{\mathbf{U}}(\boldsymbol{\theta}^{*})+{\rm E}({\mathbf{U}}(\boldsymbol{\theta}^{*}))-{\rm E}({\mathbf{U}}(\boldsymbol{\theta}))\right]_{i,j}\left(a_{i}+a_{j}\right)^{2}\\
&\leq\max_{1\leq i<j\leq p}\left\|\left[{\mathbf{U}}(\boldsymbol{\theta})-{\mathbf{U}}(\boldsymbol{\theta}^{*})+{\rm E}({\mathbf{U}}(\boldsymbol{\theta}^{*}))-{\rm E}({\mathbf{U}}(\boldsymbol{\theta}))\right]_{i,j}\right\|\sup_{\|{\mathbf{a}}\|_{2}=1}\sum_{1\leq i<j\leq p}\left(a_{i}+a_{j}\right)^{2}\\
&\leq\max_{1\leq i<j\leq p}\Bigg\|{\mathbf{V}}_{2}(\boldsymbol{\theta})_{i,j}-{\mathbf{V}}_{2}(\boldsymbol{\theta}^{*})_{i,j}+{\rm E}\left({\mathbf{V}}_{2}(\boldsymbol{\theta}^{*})_{i,j}\right)-{\rm E}\left({\mathbf{V}}_{2}(\boldsymbol{\theta})_{i,j}\right)\Bigg\|2\left(p-1\right)\\
&=\frac{2\left(p-1\right)}{np}\max_{1\leq i<j\leq p}\Bigg\|\left(b_{i,j}-{\rm E}(b_{i,j})\right)\left(\frac{\alpha_{0,i,j}\alpha_{1,i,j}}{(\alpha_{0,i,j}+\alpha_{1,i,j})^{2}}-\frac{\alpha_{0,i,j}^{*}\alpha_{1,i,j}^{*}}{(\alpha_{0,i,j}^{*}+\alpha_{1,i,j}^{*})^{2}}\right)\Bigg\|\\
&\leq\max_{1\leq i<j\leq p}\frac{2\left\|b_{i,j}-{\rm E}(b_{i,j})\right\|}{n}\max_{1\leq i<j\leq p}\Bigg\|\frac{\left(\alpha_{0,i,j}^{*}\alpha_{1,i,j}-\alpha_{0,i,j}\alpha_{1,i,j}^{*}\right)\left(\alpha_{0,i,j}^{*}\alpha_{0,i,j}-\alpha_{1,i,j}\alpha_{1,i,j}^{*}\right)}{(\alpha_{0,i,j}+\alpha_{1,i,j})^{2}(\alpha_{0,i,j}^{*}+\alpha_{1,i,j}^{*})^{2}}\Bigg\|\\
&=\max_{1\leq i<j\leq p}\frac{2\left\|b_{i,j}-{\rm E}(b_{i,j})\right\|}{n}\max_{i\neq j}\Bigg\|\left(\frac{\alpha_{0,i,j}^{*}}{\alpha_{0,i,j}}-\frac{\alpha_{1,i,j}^{*}}{\alpha_{1,i,j}}\right)\frac{\alpha_{0,i,j}\alpha_{1,i,j}\left(\alpha_{0,i,j}^{*}\alpha_{0,i,j}-\alpha_{1,i,j}\alpha_{1,i,j}^{*}\right)}{(\alpha_{0,i,j}+\alpha_{1,i,j})^{2}(\alpha_{0,i,j}^{*}+\alpha_{1,i,j}^{*})^{2}}\Bigg\|\\
&\leq C_{2}\min\left\{1,\sqrt{\frac{\log(np)}{n}}+\frac{\log\left(n\right)\log\log\left(n\right)\log\left(np\right)}{n}\right\}\max_{1\leq i<j\leq p}\left\|\frac{\alpha_{0,i,j}^{*}}{\alpha_{0,i,j}}-\frac{\alpha_{1,i,j}^{*}}{\alpha_{1,i,j}}\right\|,
\end{aligned}

holds uniformly for all $\boldsymbol{\theta}$ . Similarly, when ${\mathbf{U}}(\boldsymbol{\theta})={\mathbf{V}}_{2}(\boldsymbol{\theta})+{\mathbf{V}}_{3}(\boldsymbol{\theta})$ , Lemma 1 implies that there exist positive constants $C_{3}$ and $c_{4}$ such that, with probability greater than $1-(np)^{-c_{4}}$ ,

\begin{aligned}
&\left\|{\mathbf{U}}(\boldsymbol{\theta})-{\mathbf{U}}(\boldsymbol{\theta}^{*})+{\rm E}({\mathbf{U}}(\boldsymbol{\theta}^{*}))-{\rm E}({\mathbf{U}}(\boldsymbol{\theta}))\right\|_{2}\\
&\leq\max_{1\leq i<j\leq p}\left\|\left[{\mathbf{U}}(\boldsymbol{\theta})-{\mathbf{U}}(\boldsymbol{\theta}^{*})+{\rm E}({\mathbf{U}}(\boldsymbol{\theta}^{*}))-{\rm E}({\mathbf{U}}(\boldsymbol{\theta}))\right]_{i,j}\right\|\sup_{\|{\mathbf{a}}\|_{2}=1}\sum_{1\leq i<j\leq p}\left(a_{i}+a_{j}\right)^{2}\\
&\leq\max_{1\leq i<j\leq p}\Bigg\|\Bigg\{{\mathbf{V}}_{2}(\boldsymbol{\theta})_{i,j}+{\mathbf{V}}_{3}(\boldsymbol{\theta})_{i,j}-{\mathbf{V}}_{2}(\boldsymbol{\theta}^{*})_{i,j}-{\mathbf{V}}_{3}(\boldsymbol{\theta}^{*})_{i,j}+{\rm E}\left({\mathbf{V}}_{2}(\boldsymbol{\theta}^{*})_{i,j}\right)+{\rm E}\left({\mathbf{V}}_{3}(\boldsymbol{\theta}^{*})_{i,j}\right)\\
&\qquad-{\rm E}\left({\mathbf{V}}_{2}(\boldsymbol{\theta})_{i,j}\right)-{\rm E}\left({\mathbf{V}}_{3}(\boldsymbol{\theta})_{i,j}\right)\Bigg\}\Bigg\|2\left(p-1\right)\\
&\leq\frac{2\left(p-1\right)}{np}\max_{1\leq i<j\leq p}\Bigg\|\left(b_{i,j}-{\rm E}(b_{i,j})\right)\left(\frac{\alpha_{0,i,j}\alpha_{1,i,j}}{(\alpha_{0,i,j}+\alpha_{1,i,j})^{2}}-\frac{\alpha_{0,i,j}^{*}\alpha_{1,i,j}^{*}}{(\alpha_{0,i,j}^{*}+\alpha_{1,i,j}^{*})^{2}}\right)\Bigg\|\\
&\qquad+\frac{2\left(p-1\right)}{np}\max_{1\leq i<j\leq p}\Bigg\|\left(d_{i,j}-{\rm E}(d_{i,j})\right)\left(\frac{\alpha_{1,i,j}}{(1+\alpha_{1,i,j})^{2}}-\frac{\alpha_{1,i,j}^{*}}{(1+\alpha_{1,i,j}^{*})^{2}}\right)\Bigg\|\\
&\leq 2\max_{1\leq i<j\leq p}\frac{\left\|b_{i,j}-{\rm E}(b_{i,j})\right\|}{n}\max_{1\leq i<j\leq p}\left\|\left(\frac{\alpha_{1,i,j}^{*}}{\alpha_{1,i,j}}-\frac{\alpha_{0,i,j}^{*}}{\alpha_{0,i,j}}\right)\frac{\alpha_{0,i,j}\alpha_{1,i,j}\left(\alpha_{0,i,j}\alpha_{0,i,j}^{*}-\alpha_{1,i,j}\alpha_{1,i,j}^{*}\right)}{(\alpha_{0,i,j}+\alpha_{1,i,j})^{2}(\alpha_{0,i,j}^{*}+\alpha_{1,i,j}^{*})^{2}}\right\|\\
&\qquad+2\max_{1\leq i<j\leq p}\frac{\left\|d_{i,j}-{\rm E}(d_{i,j})\right\|}{n}\max_{1\leq i<j\leq p}\Bigg\|\left(1-\frac{\alpha_{1,i,j}^{*}}{\alpha_{1,i,j}}\right)\frac{\alpha_{1,i,j}\left(1-\alpha_{1,i,j}\alpha_{1,i,j}^{*}\right)}{(1+\alpha_{1,i,j})^{2}(1+\alpha_{1,i,j}^{*})^{2}}\Bigg\|\\
&\leq C_{3}\min\left\{1,\sqrt{\frac{\log(np)}{n}}+\frac{\log\left(n\right)\log\log\left(n\right)\log\left(np\right)}{n}\right\}\\
&\qquad\times\left(\max_{1\leq i<j\leq p}\left\|\frac{\alpha_{0,i,j}^{*}}{\alpha_{0,i,j}}-\frac{\alpha_{1,i,j}^{*}}{\alpha_{1,i,j}}\right\|+\max_{1\leq i<j\leq p}\left\|1-\frac{\alpha_{1,i,j}^{*}}{\alpha_{1,i,j}}\right\|\right),
\end{aligned}

holds uniformly for all $\boldsymbol{\theta}$ . Consequently, there exist positive constants $C_{4}$ and $c_{5}$ such that, with probability greater than $1-(np)^{-c_{5}}$ ,

\begin{aligned}
&\sup_{\boldsymbol{\theta}\in{\mathbf{B}}_{\infty}\left(\boldsymbol{\theta}^{*},r\right)}\left\|{\mathbf{U}}(\boldsymbol{\theta})-{\mathbf{U}}(\boldsymbol{\theta}^{*})+{\rm E}({\mathbf{U}}(\boldsymbol{\theta}^{*}))-{\rm E}({\mathbf{U}}(\boldsymbol{\theta}))\right\|_{2}\\
&\leq\max\left\{C_{3},C_{2}\right\}\min\left\{1,\sqrt{\frac{\log(np)}{n}}+\frac{\log\left(n\right)\log\log\left(n\right)\log\left(np\right)}{n}\right\}\\
&\qquad\times\sup_{\boldsymbol{\theta}\in{\mathbf{B}}_{\infty}\left(\boldsymbol{\theta}^{*},r\right)}\left(\max_{1\leq i<j\leq p}\left\|\frac{\alpha_{0,i,j}^{*}}{\alpha_{0,i,j}}-\frac{\alpha_{1,i,j}^{*}}{\alpha_{1,i,j}}\right\|+\max_{1\leq i<j\leq p}\left\|1-\frac{\alpha_{1,i,j}^{*}}{\alpha_{1,i,j}}\right\|\right)\\
&\leq\max\left\{C_{3},C_{2}\right\}\min\left\{1,\sqrt{\frac{\log(np)}{n}}+\frac{\log\left(n\right)\log\log\left(n\right)\log\left(np\right)}{n}\right\}\\
&\qquad\times\sup_{\boldsymbol{\theta}\in{\mathbf{B}}_{\infty}\left(\boldsymbol{\theta}^{*},r\right)}\left(\max_{1\leq i<j\leq p}\left\|\frac{\alpha_{0,i,j}^{*}}{\alpha_{0,i,j}}-1\right\|+2\max_{1\leq i<j\leq p}\left\|\frac{\alpha_{1,i,j}^{*}}{\alpha_{1,i,j}}-1\right\|\right)\\
&\leq C_{4}\min\left\{1,\sqrt{\frac{\log(np)}{n}}+\frac{\log\left(n\right)\log\log\left(n\right)\log\left(np\right)}{n}\right\}r.
\end{aligned}

On the other hand, by inequalities (A.4) and (A.4), there exists a positive constant $C_{5}$ such that

\begin{aligned}
\inf_{\|{\mathbf{a}}\|_{2}=1}\left({\mathbf{a}}^{\top}{\rm E}({\mathbf{U}}(\boldsymbol{\theta})){\mathbf{a}}\right)
&=\inf_{\|{\mathbf{a}}\|_{2}=1}\sum_{1\leq i<j\leq p}{\rm E}({\mathbf{U}}(\boldsymbol{\theta}))_{i,j}\left(a_{i}+a_{j}\right)^{2}\\
&\geq C_{5}e^{-4\kappa_{0}-4\kappa_{1}},
\end{aligned}

holds uniformly for any $\boldsymbol{\theta}\in{\mathbf{B}}_{\infty}\left(\boldsymbol{\theta}^{*},c_{r}e^{-4\kappa_{0}-4\kappa_{1}}\right)$ where $c_{r}>0$ is a small enough constant. Thus, if $\kappa_{0}+\kappa_{1}\leq c\log\frac{np}{\log(np)}$ for some small enough constant $c>0$ and $r=c_{r}e^{-4\kappa_{0}-4\kappa_{1}}$ for some small enough constant $0<c_{r}<C_{5}/(2C_{4})$ , then as $np\to\infty$ there exist large enough positive constants $C_{6},c_{6}$ such that, with probability greater than $1-(np)^{-c_{6}}$ ,

\begin{aligned}
&\left\|{\mathbf{U}}(\boldsymbol{\theta})-{\rm E}({\mathbf{U}}(\boldsymbol{\theta}))\right\|_{2}\\
&\leq\left\|{\mathbf{U}}(\boldsymbol{\theta}^{*})-{\rm E}({\mathbf{U}}(\boldsymbol{\theta}^{*}))\right\|_{2}+\left\|{\mathbf{U}}(\boldsymbol{\theta})-{\mathbf{U}}(\boldsymbol{\theta}^{*})+{\rm E}({\mathbf{U}}(\boldsymbol{\theta}^{*}))-{\rm E}({\mathbf{U}}(\boldsymbol{\theta}))\right\|_{2}\\
&\leq C_{6}\left(\sqrt{\frac{\log(np)}{np}}+\sqrt{\frac{\log^{3}(np)}{np^{2}}}+\frac{\log\left(n\right)\log\log\left(n\right)\log^{2}\left(np\right)}{np}\right)+c_{r}C_{4}e^{-4\kappa_{0}-4\kappa_{1}}\\
&\leq 2c_{r}C_{4}e^{-4\kappa_{0}-4\kappa_{1}}\\
&<C_{5}e^{-4\kappa_{0}-4\kappa_{1}}\\
&\leq\inf_{\|{\mathbf{a}}\|_{2}=1}\left({\mathbf{a}}^{\top}({\rm E}({\mathbf{U}}(\boldsymbol{\theta}))){\mathbf{a}}\right),
\end{aligned}

holds uniformly for all $\boldsymbol{\theta}\in{\mathbf{B}}_{\infty}(\boldsymbol{\theta}^{*},c_{r}e^{-4\kappa_{0}-4\kappa_{1}})$ . The statements in Theorem 1 can then be concluded. ∎

A.7 Proof of Theorem 2

Proof.

Recall that

l(\boldsymbol{\theta}):=\frac{1}{p}\sum_{1\leq i<j\leq p}\log\Big{(}{1+e^{\beta_{i,0}+\beta_{j,0}}+e^{\beta_{i,1}+\beta_{j,1}}}\Big{)}-\frac{1}{np}{\sum_{1\leq i<j\leq p}\Bigg{\{}\left(\beta_{i,0}+\beta_{j,0}\right)\sum_{t=1}^{n}X_{i,j}^{t}}+\\ \log\left(1+e^{\beta_{i,1}+\beta_{j,1}}\right)\sum_{t=1}^{n}\left(1-X_{i,j}^{t}\right)\left(1-X_{i,j}^{t-1}\right)+\log\big{(}1+e^{\beta_{i,1}+\beta_{j,1}-\beta_{i,0}-\beta_{j,0}}\big{)}\sum_{t=1}^{n}X_{i,j}^{t}X_{i,j}^{t-1}\Bigg{\}},

and write $l_{E}(\boldsymbol{\theta})={\rm E}l(\boldsymbol{\theta})$ ; that is,

\begin{aligned}
l_{E}(\boldsymbol{\theta}):={}&\frac{1}{p}\sum_{1\leq i<j\leq p}\log\Big(1+e^{\beta_{i,0}+\beta_{j,0}}+e^{\beta_{i,1}+\beta_{j,1}}\Big)-\frac{1}{np}\sum_{1\leq i<j\leq p}\Bigg\{\left(\beta_{i,0}+\beta_{j,0}\right)\sum_{t=1}^{n}{\rm E}\left(X_{i,j}^{t}\right)\\
&+\log\left(1+e^{\beta_{i,1}+\beta_{j,1}}\right)\sum_{t=1}^{n}{\rm E}\left(1-X_{i,j}^{t}\right)\left(1-X_{i,j}^{t-1}\right)\\
&+\log\big(1+e^{\beta_{i,1}+\beta_{j,1}-\beta_{i,0}-\beta_{j,0}}\big)\sum_{t=1}^{n}{\rm E}\left(X_{i,j}^{t}X_{i,j}^{t-1}\right)\Bigg\}.
\end{aligned}

Define ${\mathbf{D}}_{n,p}:=\left\{\check{\boldsymbol{\theta}}:l\left(\check{\boldsymbol{\theta}}\right)-l(\boldsymbol{\theta}^{*})\leq 0\ \text{and}\ \|\check{\boldsymbol{\theta}}-\boldsymbol{\theta}^{*}\|_{\infty}\leq c_{r}e^{-4\kappa_{0}-4\kappa_{1}}\right\}$ . Note that when $c_{r}$ is small enough, $c_{r}e^{-4\kappa_{0}-4\kappa_{1}}<\alpha_{0}$ . By Corollary 4, there exist big enough positive constants $C_{1},c_{1}$ such that, with probability greater than $1-(np)^{-c_{1}}$ ,

\displaystyle\left|\big{(}l(\boldsymbol{\theta}^{*})-l\left(\check{\boldsymbol{\theta}}\right)\big{)}-\big{(}l_{E}(\boldsymbol{\theta}^{*})-l_{E}\left(\check{\boldsymbol{\theta}}\right)\big{)}\right|\leq C_{1}\left(1+\frac{\log(np)}{\sqrt{p}}\right)\sqrt{\frac{\log(np)}{n}}\left\|\check{\boldsymbol{\theta}}-\boldsymbol{\theta}^{*}\right\|_{2},

holds uniformly for all random variables $\check{\boldsymbol{\theta}}\in{\mathbf{D}}_{n,p}$ .
Note that $\boldsymbol{\theta}^{*}$ is the minimizer of $l_{E}(\cdot)$ . By Lemma 2, there exists a constant $C_{2}>0$ , such that for all $\check{\boldsymbol{\theta}}\in{\mathbf{D}}_{n,p}$ , we have,

\begin{aligned}
&\big(l(\boldsymbol{\theta}^{*})-l\left(\check{\boldsymbol{\theta}}\right)\big)-\big(l_{E}(\boldsymbol{\theta}^{*})-l_{E}\left(\check{\boldsymbol{\theta}}\right)\big)\\
&\geq l_{E}\left(\check{\boldsymbol{\theta}}\right)-l_{E}(\boldsymbol{\theta}^{*})\\
&=\frac{1}{2}\left(\check{\boldsymbol{\theta}}-\boldsymbol{\theta}^{*}\right)^{\top}\nabla^{2}l_{E}(c_{\boldsymbol{\theta}}\boldsymbol{\theta}^{*}+(1-c_{\boldsymbol{\theta}})\check{\boldsymbol{\theta}})\left(\check{\boldsymbol{\theta}}-\boldsymbol{\theta}^{*}\right)\\
&\geq\frac{1}{2}\left\|\check{\boldsymbol{\theta}}-\boldsymbol{\theta}^{*}\right\|_{2}^{2}\inf_{\|{\mathbf{a}}\|_{2}=1}\left({\mathbf{a}}^{\top}\nabla^{2}l_{E}(c_{\boldsymbol{\theta}}\boldsymbol{\theta}^{*}+(1-c_{\boldsymbol{\theta}})\check{\boldsymbol{\theta}}){\mathbf{a}}\right)\\
&\geq\frac{C_{2}}{e^{4\kappa_{0}+4\kappa_{1}}}\left\|\check{\boldsymbol{\theta}}-\boldsymbol{\theta}^{*}\right\|_{2}^{2}.
\end{aligned}

Here $0\leq c_{\boldsymbol{\theta}}\leq 1$ is a random scalar that may depend on $\check{\boldsymbol{\theta}}$ . Consequently, as $np\to\infty$ , with probability greater than $1-(np)^{-c_{1}}$ , there exists a constant $C_{3}>0$ such that

(A.22)

\sup_{\check{\boldsymbol{\theta}}\in{\mathbf{D}}_{n,p}}\left\|\check{\boldsymbol{\theta}}-\boldsymbol{\theta}^{*}\right\|_{2}\leq C_{3}e^{4\kappa_{0}+4\kappa_{1}}\sqrt{\frac{\log(np)}{n}}\left(1+\frac{\log(np)}{\sqrt{p}}\right).
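For readers tracing how (A.22) arises, the step can be sketched as follows. The shorthand $\Delta=\|\check{\boldsymbol{\theta}}-\boldsymbol{\theta}^{*}\|_{2}$ and $\kappa=\kappa_{0}+\kappa_{1}$ is introduced here only for this sketch. The upper bound from Corollary 4 and the lower bound from Lemma 2 sandwich the same quantity, so that:

```latex
\frac{C_{2}}{e^{4\kappa}}\,\Delta^{2}
\;\leq\;
\big(l(\boldsymbol{\theta}^{*})-l(\check{\boldsymbol{\theta}})\big)-\big(l_{E}(\boldsymbol{\theta}^{*})-l_{E}(\check{\boldsymbol{\theta}})\big)
\;\leq\;
C_{1}\left(1+\frac{\log(np)}{\sqrt{p}}\right)\sqrt{\frac{\log(np)}{n}}\,\Delta
\quad\Longrightarrow\quad
\Delta\;\leq\;\frac{C_{1}}{C_{2}}\,e^{4\kappa}\sqrt{\frac{\log(np)}{n}}\left(1+\frac{\log(np)}{\sqrt{p}}\right).
```

The implication follows by dividing through by $\Delta$ (the case $\Delta=0$ is trivial), so under this reading one admissible choice is $C_{3}=C_{1}/C_{2}$.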

Note that the MLE $\widehat{\boldsymbol{\theta}}$ satisfies that $\widehat{\boldsymbol{\theta}}\in{\mathbf{B}}_{\infty}\left(\boldsymbol{\theta}^{*},c_{r}e^{-4\kappa_{0}-4\kappa_{1}}\right)$ and $l(\widehat{\boldsymbol{\theta}})\leq l(\boldsymbol{\theta}^{*})$ . Therefore we have $\widehat{\boldsymbol{\theta}}\in{\mathbf{D}}_{n,p}$ , and from (A.22) we can conclude that as $np\to\infty$ , with probability tending to 1,

\frac{1}{\sqrt{p}}\left\|\widehat{\boldsymbol{\theta}}-\boldsymbol{\theta}^{*}\right\|_{2}\leq C_{3}e^{4\kappa_{0}+4\kappa_{1}}\sqrt{\frac{\log(np)}{np}}\left(1+\frac{\log(np)}{\sqrt{p}}\right).

∎

A.8 Proof of Theorem 3

Before presenting the proof, we first introduce a technical lemma.

Lemma 7.

For all ${\mathbf{a}},{\mathbf{b}},{\mathbf{c}},{\mathbf{d}}\in\mathbb{R}^{K}$ s.t. $\max\{\|{\mathbf{a}}-{\mathbf{d}}\|_{\infty},\|{\mathbf{b}}-{\mathbf{c}}\|_{\infty}\}\leq 1$ and any positive $z_{1},z_{2}$ s.t. $z_{1}z_{2}\geq 1/4$ , we have:

\frac{\left(\sum_{k=1}^{K}e^{a_{k}+b_{k}}\right)\left(\sum_{k=1}^{K}e^{c_{k}+d_{k}}\right)}{\left(\sum_{k=1}^{K}e^{a_{k}+c_{k}}\right)\left(\sum_{k=1}^{K}e^{b_{k}+d_{k}}\right)}\leq(e-1)^{2}K^{2}\prod_{1\leq j,s\leq K}e^{z_{1}\left|a_{j}-d_{j}\right|^{2}+z_{2}\left|b_{s}-c_{s}\right|^{2}}.

Here $a_{k},b_{k},c_{k}$ and $d_{k}$ are the $k$ th element of ${\mathbf{a}},{\mathbf{b}},{\mathbf{c}}$ and ${\mathbf{d}}$ , respectively.

Proof.

Note that for all $0<x\leq 1$ , we have $1\leq\left(e^{x}-1\right)/x\leq e-1$ and for all $y$ , we have $|e^{y}-1|\leq e^{|y|}-1$ . Consequently, for all ${\mathbf{a}},{\mathbf{b}},{\mathbf{c}},{\mathbf{d}}\in\mathbb{R}^{K}$ s.t. $\max\{\|{\mathbf{a}}-{\mathbf{d}}\|_{\infty},\|{\mathbf{b}}-{\mathbf{c}}\|_{\infty}\}\leq 1$ , we have,

\begin{aligned}
&\frac{\left(\sum_{k=1}^{K}e^{a_{k}+b_{k}}\right)\left(\sum_{k=1}^{K}e^{c_{k}+d_{k}}\right)}{\left(\sum_{k=1}^{K}e^{a_{k}+c_{k}}\right)\left(\sum_{k=1}^{K}e^{b_{k}+d_{k}}\right)}\\
&=1+\sum_{1\leq j<s\leq K}\frac{e^{a_{j}+b_{j}+c_{s}+d_{s}}-e^{a_{j}+b_{s}+c_{j}+d_{s}}+e^{a_{s}+b_{s}+c_{j}+d_{j}}-e^{a_{s}+b_{j}+c_{s}+d_{j}}}{\left(\sum_{k=1}^{K}e^{a_{k}+c_{k}}\right)\left(\sum_{k=1}^{K}e^{b_{k}+d_{k}}\right)}\\
&=1+\sum_{1\leq j<s\leq K}\frac{e^{a_{s}+b_{s}+c_{j}+d_{j}}}{\left(\sum_{k=1}^{K}e^{a_{k}+c_{k}}\right)\left(\sum_{k=1}^{K}e^{b_{k}+d_{k}}\right)}\left(e^{a_{j}+d_{s}-a_{s}-d_{j}}-1\right)\left(e^{b_{j}+c_{s}-b_{s}-c_{j}}-1\right)\\
&\leq 1+\sum_{1\leq j<s\leq K}\frac{e^{a_{s}+b_{s}+c_{j}+d_{j}}}{\left(\sum_{k=1}^{K}e^{a_{k}+c_{k}}\right)\left(\sum_{k=1}^{K}e^{b_{k}+d_{k}}\right)}\left(e^{\left|a_{j}-d_{j}\right|+\left|a_{s}-d_{s}\right|}-1\right)\left(e^{\left|b_{j}-c_{j}\right|+\left|b_{s}-c_{s}\right|}-1\right)\\
&\leq 1+(e-1)^{2}\sum_{1\leq j<s\leq K}\frac{e^{a_{s}+b_{s}+c_{j}+d_{j}}\left(\left|a_{j}-d_{j}\right|+\left|a_{s}-d_{s}\right|\right)\left(\left|b_{j}-c_{j}\right|+\left|b_{s}-c_{s}\right|\right)}{\left(\sum_{k=1}^{K}e^{a_{k}+c_{k}}\right)\left(\sum_{k=1}^{K}e^{b_{k}+d_{k}}\right)}\\
&\leq 1+(e-1)^{2}\sum_{1\leq j<s\leq K}\frac{e^{a_{s}+b_{s}+c_{j}+d_{j}}\left(e^{\left(\left|a_{j}-d_{j}\right|+\left|a_{s}-d_{s}\right|\right)\left(\left|b_{j}-c_{j}\right|+\left|b_{s}-c_{s}\right|\right)}-1\right)}{\left(\sum_{k=1}^{K}e^{a_{k}+c_{k}}\right)\left(\sum_{k=1}^{K}e^{b_{k}+d_{k}}\right)}\\
&\leq(e-1)^{2}\sum_{1\leq j,s\leq K}e^{\left|a_{j}-d_{j}\right|\left|b_{s}-c_{s}\right|}\\
&\leq(e-1)^{2}\sum_{1\leq j,s\leq K}e^{z_{1}\left|a_{j}-d_{j}\right|^{2}+z_{2}\left|b_{s}-c_{s}\right|^{2}}\\
&\leq(e-1)^{2}K^{2}\prod_{1\leq j,s\leq K}e^{z_{1}\left|a_{j}-d_{j}\right|^{2}+z_{2}\left|b_{s}-c_{s}\right|^{2}}
\end{aligned}

holds for any positive $z_{1},z_{2}$ s.t. $z_{1}z_{2}\geq 1/4$ . ∎
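As a sanity check on the statement of Lemma 7 (not a substitute for the proof), the two sides of the inequality can be compared numerically on random inputs. The sketch below is illustrative only: the function names, the sampling range $[-2,2]$, the dimensions $K\leq 6$ and the choice $z_{1}=z_{2}=1/2$ are all assumptions made for this example and do not come from the paper.

```python
import math
import random


def lemma7_sides(a, b, c, d, z1=0.5, z2=0.5):
    """Return (lhs, rhs) of the Lemma 7 inequality for equal-length lists a, b, c, d."""
    K = len(a)
    num = (sum(math.exp(a[k] + b[k]) for k in range(K))
           * sum(math.exp(c[k] + d[k]) for k in range(K)))
    den = (sum(math.exp(a[k] + c[k]) for k in range(K))
           * sum(math.exp(b[k] + d[k]) for k in range(K)))
    # The product over all pairs (j, s) of exponentials is the exponential
    # of the double sum of the exponents.
    expo = sum(z1 * (a[j] - d[j]) ** 2 + z2 * (b[s] - c[s]) ** 2
               for j in range(K) for s in range(K))
    rhs = (math.e - 1) ** 2 * K ** 2 * math.exp(expo)
    return num / den, rhs


def random_instance(K, rng):
    """Draw a, b freely and d, c within sup-norm distance 1 of a, b (hypothetical sampler)."""
    a = [rng.uniform(-2, 2) for _ in range(K)]
    b = [rng.uniform(-2, 2) for _ in range(K)]
    d = [x + rng.uniform(-1, 1) for x in a]
    c = [x + rng.uniform(-1, 1) for x in b]
    return a, b, c, d


def check(trials=200, seed=0):
    """Return True if lhs <= rhs on every sampled instance."""
    rng = random.Random(seed)
    for _ in range(trials):
        K = rng.randint(1, 6)
        lhs, rhs = lemma7_sides(*random_instance(K, rng))
        if lhs > rhs:
            return False
    return True
```

Calling `check()` runs 200 random instances with $z_{1}z_{2}=1/4$, the boundary case allowed by the lemma. A passing check only indicates that no counterexample was found among the sampled inputs.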

Next we proceed to the proof of Theorem 3.

Proof of Theorem 3

Proof.

Recall the functions $l_{E}\left(\boldsymbol{\theta}_{(i)},\boldsymbol{\theta}_{(-i)}\right)$ and $l\left(\boldsymbol{\theta}_{(i)},\boldsymbol{\theta}_{(-i)}\right)$ defined in the proof of Theorem 2. We denote the MLE studied in Theorem 2 as $\widehat{\boldsymbol{\theta}}$ $\big{(}$ the local minimizer of $l(\boldsymbol{\theta})$ in ${\mathbf{B}}_{\infty}\left(\boldsymbol{\theta}^{*},c_{r}e^{-4\kappa_{0}-4\kappa_{1}}\right)\big{)}$ . Also, let $\widehat{\boldsymbol{\theta}}^{(s)}=(\widehat{\beta}_{1,0}^{(s)},\ldots,\widehat{\beta}_{p,0}^{(s)},\widehat{\beta}_{1,1}^{(s)},\ldots,\widehat{\beta}_{p,1}^{(s)})^{\top}$ be the local minimizer of $l(\boldsymbol{\theta})$ in the $\ell_{\infty}$ ball ${\mathbf{B}}_{\infty}\left(\boldsymbol{\theta}^{*},r_{s}\right)$ where

(A.23)

\displaystyle r_{s}

\displaystyle=

\displaystyle e^{8\kappa_{0}+8\kappa_{1}}\log\log(np)\frac{\sqrt{\log(np)}}{\left(np\right)^{s}}\left(1+\frac{\log(np)}{\sqrt{p}}\right),

for a given constant $s\geq 0$ . Let

(A.24)

\displaystyle s_{0}

\displaystyle=

\displaystyle\frac{12\left(\kappa_{0}+\kappa_{1}\right)+\log\log(np)/2+\log\log\log(np)+\log\left(1+\frac{\log(np)}{\sqrt{p}}\right)-\log(c_{r})}{\log(np)}.

We then have $\widehat{\boldsymbol{\theta}}^{(s_{0})}=\widehat{\boldsymbol{\theta}}$ . Under the condition that $\kappa_{0}+\kappa_{1}\leq c\log(np)$ for some positive constant $c$ , we have $s_{0}<1/2$ when $np$ is large enough.
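The identity $\widehat{\boldsymbol{\theta}}^{(s_{0})}=\widehat{\boldsymbol{\theta}}$ reflects the fact that $s_{0}$ is chosen so that the two $\ell_{\infty}$ balls coincide. A short check, using only (A.23) and (A.24), can be written as:

```latex
\begin{aligned}
\log r_{s_{0}}
&=8(\kappa_{0}+\kappa_{1})+\log\log\log(np)+\tfrac{1}{2}\log\log(np)+\log\left(1+\frac{\log(np)}{\sqrt{p}}\right)-s_{0}\log(np)\\
&=8(\kappa_{0}+\kappa_{1})-12(\kappa_{0}+\kappa_{1})+\log(c_{r})\\
&=\log\left(c_{r}e^{-4\kappa_{0}-4\kappa_{1}}\right).
\end{aligned}
```

Hence $r_{s_{0}}=c_{r}e^{-4\kappa_{0}-4\kappa_{1}}$, which is the radius of the ball used in Theorem 2.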

Next we will show that with probability tending to 1, uniformly for all $s\in[s_{0},1/2]$ , $\widehat{\boldsymbol{\theta}}^{(s)}$ , the local MLE for ${\mathbf{B}}_{\infty}\left(\boldsymbol{\theta}^{*},r_{s}\right)$ , is also the local MLE in a smaller ball ${\mathbf{B}}_{\infty}\left(\boldsymbol{\theta}^{*},r_{s^{\prime}}\right)$ where

r_{s^{\prime}}=(np)^{\frac{2s-1}{4}}{r_{s}}=e^{8\kappa_{0}+8\kappa_{1}}\log\log(np)\frac{\sqrt{\log(np)}}{\left(np\right)^{s^{\prime}}}\left(1+\frac{\log(np)}{\sqrt{p}}\right),

and $s^{\prime}=\frac{2s+1}{4}<1/2$ . Note that $\boldsymbol{\theta}_{(i)}^{*}$ is the minimizer of $l_{E}\left(\cdot,\boldsymbol{\theta}_{(-i)}^{*}\right)$ . Given $\boldsymbol{\theta}_{(-i)}^{*}$ , the Hessian matrix of $l_{E}\left(\cdot,\boldsymbol{\theta}_{(-i)}^{*}\right)$ evaluated at $\boldsymbol{\theta}_{(i)}$ is a $2\times 2$ matrix given as:

{\rm E}{\mathbf{V}}^{(i)}\left(\boldsymbol{\theta}_{(i)}\right):=\left[\begin{array}{cc}{\rm E}{\mathbf{V}}^{(i)}_{1}\left(\boldsymbol{\theta}_{(i)}\right)&{\rm E}{\mathbf{V}}^{(i)}_{2}\left(\boldsymbol{\theta}_{(i)}\right)\\ {\rm E}{\mathbf{V}}^{(i)}_{2}\left(\boldsymbol{\theta}_{(i)}\right)&{\rm E}{\mathbf{V}}^{(i)}_{3}\left(\boldsymbol{\theta}_{(i)}\right)\end{array}\right].

Following the calculations in Lemma 2, there exists a constant $C_{1}>0$ which is independent of $\boldsymbol{\theta}_{(i)}$ , such that, for all $\boldsymbol{\theta}_{(i)}\in{\mathbf{B}}_{\infty}\left(\boldsymbol{\theta}^{*}_{(i)},c_{r}e^{-4\kappa_{0}-4\kappa_{1}}\right)$ , we have

{\rm E}{\mathbf{V}}^{(i)}_{2}\left(\boldsymbol{\theta}_{(i)}\right)+{\rm E}{\mathbf{V}}^{(i)}_{1}\left(\boldsymbol{\theta}_{(i)}\right)=\sum_{j=1,\>j\neq i}^{p}\frac{1}{p}\frac{e^{\beta_{i,0}+\beta^{*}_{j,0}}}{(1+e^{\beta_{i,0}+\beta^{*}_{j,0}}+e^{\beta_{i,1}+\beta^{*}_{j,1}})^{2}}\geq C_{1}e^{-2\kappa_{0}-4\kappa_{1}},

and

-{\rm E}{\mathbf{V}}^{(i)}_{2}\left(\boldsymbol{\theta}_{(i)}\right)\geq C_{1}e^{-4\kappa_{0}-4\kappa_{1}};\quad\quad{\rm E}{\mathbf{V}}^{(i)}_{2}\left(\boldsymbol{\theta}_{(i)}\right)+{\rm E}{\mathbf{V}}^{(i)}_{3}\left(\boldsymbol{\theta}_{(i)}\right)\geq C_{1}e^{-4\kappa_{0}-4\kappa_{1}}.

Then we have for all $\boldsymbol{\theta}_{(i)}\in{\mathbf{B}}_{\infty}\left(\boldsymbol{\theta}^{*}_{(i)},c_{r}e^{-4\kappa_{0}-4\kappa_{1}}\right)$ ,

\begin{aligned}
&\left\|{\rm E}{\mathbf{V}}^{(i)}\left(\boldsymbol{\theta}_{(i)}\right)\right\|_{2}\\
&=\inf_{\|{\mathbf{z}}\|_{2}=1}\Bigg(\left({\rm E}{\mathbf{V}}^{(i)}_{2}\left(\boldsymbol{\theta}_{(i)}\right)+{\rm E}{\mathbf{V}}^{(i)}_{1}\left(\boldsymbol{\theta}_{(i)}\right)\right)z_{1}^{2}+\left({\rm E}{\mathbf{V}}^{(i)}_{2}\left(\boldsymbol{\theta}_{(i)}\right)+{\rm E}{\mathbf{V}}^{(i)}_{3}\left(\boldsymbol{\theta}_{(i)}\right)\right)z_{2}^{2}\\
&\qquad-{\rm E}{\mathbf{V}}^{(i)}_{2}\left(\boldsymbol{\theta}_{(i)}\right)\left(z_{1}-z_{2}\right)^{2}\Bigg)\\
&\geq\inf_{\|{\mathbf{z}}\|_{2}=1}\Bigg\{\left({\rm E}{\mathbf{V}}^{(i)}_{2}\left(\boldsymbol{\theta}_{(i)}\right)+{\rm E}{\mathbf{V}}^{(i)}_{1}\left(\boldsymbol{\theta}_{(i)}\right)\right)z_{1}^{2}+\left({\rm E}{\mathbf{V}}^{(i)}_{2}\left(\boldsymbol{\theta}_{(i)}\right)+{\rm E}{\mathbf{V}}^{(i)}_{3}\left(\boldsymbol{\theta}_{(i)}\right)\right)z_{2}^{2}\Bigg\}\\
&\geq\frac{C_{1}}{e^{4\kappa_{0}+4\kappa_{1}}}\inf_{\|{\mathbf{z}}\|_{2}=1}\Bigg\{z_{1}^{2}+z_{2}^{2}\Bigg\}\\
&=\frac{C_{1}}{e^{4\kappa_{0}+4\kappa_{1}}}.
\end{aligned}
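The first equality in this display uses a rearrangement of the $2\times 2$ quadratic form that may be worth spelling out. Writing $v_{k}={\rm E}{\mathbf{V}}^{(i)}_{k}\left(\boldsymbol{\theta}_{(i)}\right)$ for $k=1,2,3$ (shorthand introduced only for this sketch), one has:

```latex
\begin{aligned}
{\mathbf{z}}^{\top}\begin{bmatrix}v_{1}&v_{2}\\ v_{2}&v_{3}\end{bmatrix}{\mathbf{z}}
&=v_{1}z_{1}^{2}+2v_{2}z_{1}z_{2}+v_{3}z_{2}^{2}\\
&=(v_{1}+v_{2})z_{1}^{2}+(v_{2}+v_{3})z_{2}^{2}-v_{2}\left(z_{1}^{2}-2z_{1}z_{2}+z_{2}^{2}\right)\\
&=(v_{1}+v_{2})z_{1}^{2}+(v_{2}+v_{3})z_{2}^{2}-v_{2}\left(z_{1}-z_{2}\right)^{2}.
\end{aligned}
```

Since $-v_{2}\geq C_{1}e^{-4\kappa_{0}-4\kappa_{1}}>0$, the last term is nonnegative and can be dropped, which gives the subsequent inequality.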

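Consequently, for any $i$ and $s\in[s_{0},1/2]$ , there exists a random vector $\boldsymbol{\theta}_{(i)}^{(\xi)}$ on the line segment between $\boldsymbol{\theta}_{(i)}^{*}$ and $\widehat{\boldsymbol{\theta}}^{(s)}_{(i)}$ such that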

\begin{aligned}
&l\left(\boldsymbol{\theta}_{(i)}^{*},\widehat{\boldsymbol{\theta}}^{(s)}_{(-i)}\right)-l\left(\widehat{\boldsymbol{\theta}}^{(s)}_{(i)},\widehat{\boldsymbol{\theta}}^{(s)}_{(-i)}\right)-\left[l_{E}\left(\boldsymbol{\theta}_{(i)}^{*},\boldsymbol{\theta}_{(-i)}^{*}\right)-l_{E}\left(\widehat{\boldsymbol{\theta}}^{(s)}_{(i)},\boldsymbol{\theta}_{(-i)}^{*}\right)\right]\\
&\geq l_{E}\left(\widehat{\boldsymbol{\theta}}^{(s)}_{(i)},\boldsymbol{\theta}_{(-i)}^{*}\right)-l_{E}\left(\boldsymbol{\theta}_{(i)}^{*},\boldsymbol{\theta}_{(-i)}^{*}\right)\\
&=\frac{1}{2}\left(\boldsymbol{\theta}_{(i)}^{*}-\widehat{\boldsymbol{\theta}}^{(s)}_{(i)}\right)^{\top}l_{E}^{\prime\prime}\left(\boldsymbol{\theta}_{(i)}^{(\xi)},\boldsymbol{\theta}_{(-i)}^{*}\right)\left(\boldsymbol{\theta}_{(i)}^{*}-\widehat{\boldsymbol{\theta}}^{(s)}_{(i)}\right)\\
&\geq\|\boldsymbol{\theta}_{(i)}^{*}-\widehat{\boldsymbol{\theta}}^{(s)}_{(i)}\|_{2}^{2}\left(\inf_{\|\boldsymbol{\theta}_{(i)}\|_{\infty}<\kappa}\left\|{\rm E}{\mathbf{V}}^{(i)}\left(\boldsymbol{\theta}_{(i)}^{(\xi)},\boldsymbol{\theta}_{(-i)}^{*}\right)\right\|_{2}\right)\\
&\geq\frac{C_{1}\|\boldsymbol{\theta}_{(i)}^{*}-\widehat{\boldsymbol{\theta}}^{(s)}_{(i)}\|_{\infty}^{2}}{e^{4\kappa_{0}+4\kappa_{1}}}.
\end{aligned}

On the other hand, notice that

\begin{aligned}
&l\left(\boldsymbol{\theta}_{(i)}^{*},\widehat{\boldsymbol{\theta}}^{(s)}_{(-i)}\right)-l\left(\boldsymbol{\theta}_{(i)},\widehat{\boldsymbol{\theta}}^{(s)}_{(-i)}\right)-\left[l_{E}\left(\boldsymbol{\theta}_{(i)}^{*},\boldsymbol{\theta}_{(-i)}^{*}\right)-l_{E}\left(\boldsymbol{\theta}_{(i)},\boldsymbol{\theta}_{(-i)}^{*}\right)\right]\\
&\leq\left\|l\left(\boldsymbol{\theta}_{(i)}^{*},\widehat{\boldsymbol{\theta}}^{(s)}_{(-i)}\right)-l\left(\boldsymbol{\theta}_{(i)},\widehat{\boldsymbol{\theta}}^{(s)}_{(-i)}\right)-\left[l\left(\boldsymbol{\theta}_{(i)}^{*},\boldsymbol{\theta}_{(-i)}^{*}\right)-l\left(\boldsymbol{\theta}_{(i)},\boldsymbol{\theta}_{(-i)}^{*}\right)\right]\right\|\\
&\qquad+\left\|l\left(\boldsymbol{\theta}_{(i)}^{*},\boldsymbol{\theta}_{(-i)}^{*}\right)-l\left(\boldsymbol{\theta}_{(i)},\boldsymbol{\theta}_{(-i)}^{*}\right)-\left[l_{E}\left(\boldsymbol{\theta}_{(i)}^{*},\boldsymbol{\theta}_{(-i)}^{*}\right)-l_{E}\left(\boldsymbol{\theta}_{(i)},\boldsymbol{\theta}_{(-i)}^{*}\right)\right]\right\|.
\end{aligned}

By Corollary 4, there exist large enough positive constants $C_{2}$ and $c_{1}$ which are independent of $\boldsymbol{\theta}_{(i)}$ such that, with probability greater than $1-(np)^{-c_{1}}$ ,

\begin{aligned}
&\left\|l\left(\boldsymbol{\theta}_{(i)}^{*},\boldsymbol{\theta}_{(-i)}^{*}\right)-l\left(\boldsymbol{\theta}_{(i)},\boldsymbol{\theta}_{(-i)}^{*}\right)-\left[l_{E}\left(\boldsymbol{\theta}_{(i)}^{*},\boldsymbol{\theta}_{(-i)}^{*}\right)-l_{E}\left(\boldsymbol{\theta}_{(i)},\boldsymbol{\theta}_{(-i)}^{*}\right)\right]\right\|\\
&\leq C_{2}\left(1+\frac{\log(np)}{\sqrt{p}}\right)\sqrt{\frac{\log(np)}{np}}\left\|\boldsymbol{\theta}_{(i)}-\boldsymbol{\theta}^{*}_{(i)}\right\|_{2}\\
&\leq\sqrt{2}C_{2}\left(1+\frac{\log(np)}{\sqrt{p}}\right)\sqrt{\frac{\log(np)}{np}}\left\|\boldsymbol{\theta}_{(i)}-\boldsymbol{\theta}^{*}_{(i)}\right\|_{\infty},
\end{aligned}

holds uniformly for any $\boldsymbol{\theta}_{(i)}\in{\mathbf{B}}_{\infty}\left(\boldsymbol{\theta}^{*}_{(i)},\alpha_{0}\right)$ where $\alpha_{0}<1/4$ . By Lemma 7, as $np\to\infty$ , there exist big enough positive constants $C_{3}$ and $c_{2}$ which are independent of $\boldsymbol{\theta}_{(i)}$ such that, with probability greater than $1-(np)^{-c_{2}}$ ,

\begin{aligned}
&\left\|l\left(\boldsymbol{\theta}_{(i)}^{*},\widehat{\boldsymbol{\theta}}^{(s)}_{(-i)}\right)-l\left(\boldsymbol{\theta}_{(i)},\widehat{\boldsymbol{\theta}}^{(s)}_{(-i)}\right)-\left[l\left(\boldsymbol{\theta}_{(i)}^{*},\boldsymbol{\theta}_{(-i)}^{*}\right)-l\left(\boldsymbol{\theta}_{(i)},\boldsymbol{\theta}_{(-i)}^{*}\right)\right]\right\|\\
&=\Bigg\|\frac{1}{np}\sum_{j=1,\>j\neq i}^{p}\Bigg[-n\log\left(\frac{1+e^{\beta^{*}_{i,0}+\beta^{*}_{j,0}}+e^{\beta^{*}_{i,1}+\beta^{*}_{j,1}}}{1+e^{\beta^{*}_{i,0}+\widehat{\beta}^{(s)}_{j,0}}+e^{\beta^{*}_{i,1}+\widehat{\beta}^{(s)}_{j,1}}}\times\frac{1+e^{\beta_{i,0}+\widehat{\beta}^{(s)}_{j,0}}+e^{\beta_{i,1}+\widehat{\beta}^{(s)}_{j,1}}}{1+e^{\beta_{i,0}+\beta^{*}_{j,0}}+e^{\beta_{i,1}+\beta^{*}_{j,1}}}\right)\\
&\qquad+\log\left(\frac{1+e^{\beta^{*}_{i,1}+\beta^{*}_{j,1}}}{1+e^{\beta^{*}_{i,1}+\widehat{\beta}^{(s)}_{j,1}}}\times\frac{1+e^{\beta_{i,1}+\widehat{\beta}^{(s)}_{j,1}}}{1+e^{\beta_{i,1}+\beta^{*}_{j,1}}}\right)d_{i,j}\\
&\qquad+\log\left(\frac{e^{\beta^{*}_{i,0}+\beta^{*}_{j,0}}+e^{\beta^{*}_{i,1}+\beta^{*}_{j,1}}}{e^{\beta^{*}_{i,0}+\widehat{\beta}^{(s)}_{j,0}}+e^{\beta^{*}_{i,1}+\widehat{\beta}^{(s)}_{j,1}}}\times\frac{e^{\beta_{i,0}+\widehat{\beta}^{(s)}_{j,0}}+e^{\beta_{i,1}+\widehat{\beta}^{(s)}_{j,1}}}{e^{\beta_{i,0}+\beta^{*}_{j,0}}+e^{\beta_{i,1}+\beta^{*}_{j,1}}}\right)b_{i,j}\Bigg]\Bigg\|\\
&\leq\frac{C_{3}}{p}\sum_{j=1,\>j\neq i}^{p}\Bigg(z_{1}\left\|\beta^{*}_{i,0}-\beta_{i,0}\right\|^{2}+z_{1}\left\|\beta^{*}_{i,1}-\beta_{i,1}\right\|^{2}+z_{2}\left\|\beta^{*}_{j,0}-\widehat{\beta}^{(s)}_{j,0}\right\|^{2}+z_{2}\left\|\beta^{*}_{j,1}-\widehat{\beta}^{(s)}_{j,1}\right\|^{2}\Bigg)\\
&\leq 2C_{3}z_{1}\|\boldsymbol{\theta}_{(i)}-\boldsymbol{\theta}^{*}_{(i)}\|^{2}_{\infty}+\frac{C_{3}z_{2}}{p}\|\widehat{\boldsymbol{\theta}}^{(s)}_{(-i)}-\boldsymbol{\theta}^{*}_{(-i)}\|^{2}_{2}\\
&\leq 2C_{3}z_{1}\frac{e^{16\kappa_{0}+16\kappa_{1}}[\log\log(np)]^{2}\log(np)}{\left(np\right)^{2s}}\left(1+\frac{\log(np)}{\sqrt{p}}\right)^{2}\\
&\qquad+\frac{C_{3}z_{2}}{p}\frac{e^{8\kappa_{0}+8\kappa_{1}}\log(np)}{n}\left(1+\frac{\log(np)}{\sqrt{p}}\right)^{2},
\end{aligned}

holds uniformly for all positive $z_{1},z_{2}$ such that $z_{1}z_{2}\geq 1/4$ , all $s\in[s_{0},1/2]$ and all $\boldsymbol{\theta}_{(i)}\in{\mathbf{B}}_{\infty}\left(\boldsymbol{\theta}^{*}_{(i)},r_{s}\right)$ . Here in the last step we have used inequality (A.22) and the fact that $\widehat{\boldsymbol{\theta}}^{(s)}\in{\mathbf{D}}_{n,p}$ for all $s$ . Taking $z_{1}=0.5e^{-4\kappa_{0}-4\kappa_{1}}[\log\log(np)]^{-1}(np)^{s-1/2}$ and $z_{2}=0.5e^{4\kappa_{0}+4\kappa_{1}}\log\log(np)(np)^{1/2-s}$ , so that $z_{1}z_{2}=1/4$ , there exists a large enough constant $C_{4}>0$ such that, with probability greater than $1-(np)^{-c_{2}}$ ,

\begin{aligned}
&\left\|l\left(\boldsymbol{\theta}_{(i)}^{*},\widehat{\boldsymbol{\theta}}^{(s)}_{(-i)}\right)-l\left(\boldsymbol{\theta}_{(i)},\widehat{\boldsymbol{\theta}}^{(s)}_{(-i)}\right)-\left[l\left(\boldsymbol{\theta}_{(i)}^{*},\boldsymbol{\theta}_{(-i)}^{*}\right)-l\left(\boldsymbol{\theta}_{(i)},\boldsymbol{\theta}_{(-i)}^{*}\right)\right]\right\|\\
&\leq C_{4}\frac{e^{12\kappa_{0}+12\kappa_{1}}\log\log(np)\log(np)}{(np)^{s+1/2}}\left(1+\frac{\log(np)}{\sqrt{p}}\right)^{2},
\end{aligned}

holds uniformly for all $s\in[s_{0},1/2]$ and $\boldsymbol{\theta}_{(i)}\in{\mathbf{B}}_{\infty}\left(\boldsymbol{\theta}^{*}_{(i)},r_{s}\right)$ . Combining the inequalities (A.8) and (A.8), we have, with probability greater than $1-(np)^{-c_{3}}$ for a large enough constant $c_{3}>0$ ,

(A.28)

\begin{aligned}
&\left\|l\left(\boldsymbol{\theta}_{(i)}^{*},\widehat{\boldsymbol{\theta}}^{(s)}_{(-i)}\right)-l\left(\boldsymbol{\theta}_{(i)},\widehat{\boldsymbol{\theta}}^{(s)}_{(-i)}\right)-\left[l_{E}\left(\boldsymbol{\theta}_{(i)}^{*},\boldsymbol{\theta}_{(-i)}^{*}\right)-l_{E}\left(\boldsymbol{\theta}_{(i)},\boldsymbol{\theta}_{(-i)}^{*}\right)\right]\right\|\\
&\leq\sqrt{2}C_{2}\left(1+\frac{\log(np)}{\sqrt{p}}\right)\sqrt{\frac{\log(np)}{np}}\left\|\boldsymbol{\theta}_{(i)}-\boldsymbol{\theta}^{*}_{(i)}\right\|_{\infty}\\
&\qquad+C_{4}\frac{e^{12\kappa_{0}+12\kappa_{1}}\log\log(np)\log(np)}{(np)^{s+1/2}}\left(1+\frac{\log(np)}{\sqrt{p}}\right)^{2},
\end{aligned}

holds uniformly for all $s\in[s_{0},1/2]$ , all $i=1,\ldots,p$ and $\boldsymbol{\theta}_{(i)}\in{\mathbf{B}}_{\infty}\left(\boldsymbol{\theta}^{*}_{(i)},r_{s}\right)$ .
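The choice of $z_{1}$ and $z_{2}$ made above balances the two terms of the preceding bound. As a sketch of the arithmetic, write $\kappa=\kappa_{0}+\kappa_{1}$, $L=\log(np)$, $\ell=\log\log(np)$ and $A=\left(1+\log(np)/\sqrt{p}\right)^{2}$; this shorthand is introduced here only for the illustration.

```latex
\begin{aligned}
2C_{3}z_{1}\,\frac{e^{16\kappa}\ell^{2}L}{(np)^{2s}}\,A
&=C_{3}\,e^{12\kappa}\,\ell L\,(np)^{s-\frac{1}{2}-2s}\,A
=C_{3}\,\frac{e^{12\kappa}\,\ell L}{(np)^{s+1/2}}\,A,\\
\frac{C_{3}z_{2}}{p}\,\frac{e^{8\kappa}L}{n}\,A
&=\frac{C_{3}}{2}\,e^{12\kappa}\,\ell L\,\frac{(np)^{\frac{1}{2}-s}}{np}\,A
=\frac{C_{3}}{2}\,\frac{e^{12\kappa}\,\ell L}{(np)^{s+1/2}}\,A.
\end{aligned}
```

Both terms therefore have the same order, and under this reading any $C_{4}\geq 3C_{3}/2$ would suffice in the bound with rate $(np)^{-(s+1/2)}$.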

Combining the inequalities (A.8) and (A.28), we have, with probability greater than $1-(np)^{-c_{3}}$ ,

\frac{C_{1}}{e^{4\kappa_{0}+4\kappa_{1}}}\|\widehat{\boldsymbol{\theta}}^{(s)}_{(i)}-\boldsymbol{\theta}_{(i)}^{*}\|_{\infty}^{2}\leq\sqrt{2}C_{2}\left(1+\frac{\log(np)}{\sqrt{p}}\right)\sqrt{\frac{\log(np)}{np}}\left\|\widehat{\boldsymbol{\theta}}^{(s)}_{(i)}-\boldsymbol{\theta}^{*}_{(i)}\right\|_{\infty}\\ +C_{4}\frac{e^{12\kappa_{0}+12\kappa_{1}}\log\log(np)\log(np)}{(np)^{s+1/2}}\left(1+\frac{\log(np)}{\sqrt{p}}\right)^{2},

holds uniformly for all $s\in[s_{0},1/2]$ and all $i=1,\ldots,p$ . Notice that the constants $C_{1}$ , $C_{2}$ , $C_{3}$ , $C_{4}$ , $c_{2}$ and $c_{3}$ are all independent of $r_{s}$ and $s$ . This indicates that there exists a big enough constant $C_{5}>0$ which is independent of $r_{s}$ and $s$ s.t.

(A.29)			$\displaystyle\|\widehat{\boldsymbol{\theta}}^{(s)}-\boldsymbol{\theta}^{*}\|_{\infty}$
	$\displaystyle\leq$	$\displaystyle C_{5}e^{4\kappa_{0}+4\kappa_{1}}\sqrt{\frac{\log(np)}{np}}\left(1+\frac{\log(np)}{\sqrt{p}}\right)+C_{5}\frac{e^{8\kappa_{0}+8\kappa_{1}}\sqrt{\log(np)\log\log(np)}}{(np)^{\frac{2s+1}{4}}}\left(1+\frac{\log(np)}{\sqrt{p}}\right)$
	$\displaystyle\leq$	$\displaystyle 2C_{5}\frac{e^{8\kappa_{0}+8\kappa_{1}}\sqrt{\log(np)\log\log(np)}}{(np)^{\frac{2s+1}{4}}}\left(1+\frac{\log(np)}{\sqrt{p}}\right)$
	$\displaystyle\leq$	$\displaystyle e^{8\kappa_{0}+8\kappa_{1}}\log\log(np)\frac{\sqrt{\log(np)}}{(np)^{\frac{2s+1}{4}}}\left(1+\frac{\log(np)}{\sqrt{p}}\right).$

Recall that $\widehat{\boldsymbol{\theta}}^{(s)}$ is the local maximizer of $l(\boldsymbol{\theta})$ in ${\mathbf{B}}_{\infty}\left(\boldsymbol{\theta}^{*},r_{s}\right)$ with

r_{s}=e^{8\kappa_{0}+8\kappa_{1}}\log\log(np)\frac{\sqrt{\log(np)}}{\left(np\right)^{s}}\left(1+\frac{\log(np)}{\sqrt{p}}\right).

Thus far we have proved that: with probability greater than $1-(np)^{-c_{3}}$ for some large enough constant $c_{3}>0$ , uniformly for all $s\in[s_{0},1/2]$ , $\widehat{\boldsymbol{\theta}}^{(s)}$ is also within the ball ${\mathbf{B}}_{\infty}\left(\boldsymbol{\theta}^{*},r_{s^{\prime}}\right)$ with

(A.30)

{r_{s^{\prime}}}=(np)^{\frac{2s-1}{4}}{r_{s}}=e^{8\kappa_{0}+8\kappa_{1}}\log\log(np)\frac{\sqrt{\log(np)}}{(np)^{\frac{2s+1}{4}}}\left(1+\frac{\log(np)}{\sqrt{p}}\right).

Now define a sequence $\{s_{k};k=0,1,\cdots\}$ such that $s_{0}$ is defined as in equation (A.24) and $s_{k}=s_{k-1}/2+1/4$ for $k\geq 1$ . We have:

(A.31)

s_{k}-\frac{1}{2}=\frac{1}{2}\left(s_{k-1}-\frac{1}{2}\right)=\frac{1}{2^{k}}\left(s_{0}-\frac{1}{2}\right).

Then, whenever $s_{0}<1/2$ , we have $s_{k-1}<s_{k}<1/2$ for all $k\geq 1$ and $\lim_{k\to\infty}s_{k}=1/2$ .
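The recursion for $s_{k}$ and its closed form in (A.31) can be checked numerically. The following is a minimal sketch; the function names and the starting value $s_{0}=0.1$ are illustrative assumptions and are not taken from the paper.

```python
def s_sequence(s0, num_steps):
    """Iterate s_k = s_{k-1} / 2 + 1 / 4 starting from s_0."""
    seq = [s0]
    for _ in range(num_steps):
        seq.append(seq[-1] / 2 + 0.25)
    return seq


def s_closed_form(s0, k):
    """Closed form implied by (A.31): s_k = 1/2 + (s_0 - 1/2) / 2**k."""
    return 0.5 + (s0 - 0.5) / 2**k


s0 = 0.1  # illustrative starting value (assumption), any s0 < 1/2 works
seq = s_sequence(s0, 10)
for k, s_k in enumerate(seq):
    # The iterated and closed-form values should agree up to rounding.
    print(k, s_k, s_closed_form(s0, k))
```

Each step halves the gap between $s_{k}$ and $1/2$, which is why on the order of $\log_{2}(\log(np))$ refinement steps suffice in the argument that follows.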

Let $K=\lfloor\log_{2}\left(\log(np)\right)\rfloor+1$ where $\lfloor\cdot\rfloor$ denotes the floor function (the largest integer not exceeding its argument). Beginning with $\widehat{\boldsymbol{\theta}}^{(s_{0})}=\widehat{\boldsymbol{\theta}}$ , and applying the result in (A.29) repeatedly $K$ times, we can, with probability greater than $1-(np)^{-c_{3}}$ , sequentially reduce the upper bound from $\|\widehat{\boldsymbol{\theta}}-\boldsymbol{\theta}^{*}\|_{\infty}\leq e^{8\kappa_{0}+8\kappa_{1}}\log\log(np)\frac{\sqrt{\log(np)}}{(np)^{\frac{1}{2}+\frac{2s_{0}-1}{2}}}\left(1+\frac{\log(np)}{\sqrt{p}}\right)$ to:

$\displaystyle\|\widehat{\boldsymbol{\theta}}-\boldsymbol{\theta}^{*}\|_{\infty}$	$\displaystyle\leq$	$\displaystyle e^{8\kappa_{0}+8\kappa_{1}}\log\log(np)\frac{\sqrt{\log(np)}}{(np)^{\frac{1}{2}+\frac{2s_{0}-1}{2^{K}}}}\left(1+\frac{\log(np)}{\sqrt{p}}\right)$
	$\displaystyle=$	$\displaystyle(np)^{\frac{1-2s_{0}}{2^{K}}}e^{8\kappa_{0}+8\kappa_{1}}\log\log(np)\sqrt{\frac{\log(np)}{np}}\left(1+\frac{\log(np)}{\sqrt{p}}\right)$
	$\displaystyle\leq$	$\displaystyle e^{1-2s_{0}}e^{8\kappa_{0}+8\kappa_{1}}\log\log(np)\sqrt{\frac{\log(np)}{np}}\left(1+\frac{\log(np)}{\sqrt{p}}\right).$

In the last step of the above inequality, we have used the fact that, since $K=\lfloor\log_{2}\left(\log(np)\right)\rfloor+1>\log_{2}\left(\log(np)\right)$ ,

(np)^{\frac{1-2s_{0}}{2^{K}}}\leq\left((np)^{\frac{\log(2)}{\log(np)}}\right)^{\frac{1-2s_{0}}{\log(2)}}=2^{\frac{1-2s_{0}}{\log(2)}}=e^{1-2s_{0}}.
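As a sanity check on this step, the factor $(np)^{(1-2s_{0})/2^{K}}$ can be evaluated directly and compared with $e^{1-2s_{0}}$. The sketch below is illustrative only; the function name and the grid of values for $np$ and $s_{0}$ are assumptions.

```python
import math


def refinement_factor(np_value, s0):
    """Return K = floor(log2(log(np))) + 1 and (np)^((1 - 2 s0) / 2^K)."""
    K = math.floor(math.log2(math.log(np_value))) + 1
    factor = np_value ** ((1.0 - 2.0 * s0) / 2.0**K)
    return K, factor


for np_value in (10.0, 1e3, 1e6, 1e12):  # illustrative values of np
    for s0 in (0.0, 0.1, 0.3, 0.49):  # illustrative values of s_0 < 1/2
        K, factor = refinement_factor(np_value, s0)
        # Since 2^K > log(np), the factor cannot exceed e^(1 - 2 s0).
        print(np_value, s0, K, factor, math.exp(1.0 - 2.0 * s0))
```

The residual factor therefore stays bounded by a constant no matter how large $np$ is, so it can be absorbed into the $\lesssim$ in the final bound.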

Thus, assuming that $np\rightarrow\infty,n\geq 2$ , $\kappa_{0}+\kappa_{1}\leq c\log(np)$ for some positive constant $c$ , we have, with probability tending to 1,

\left\|\widehat{\boldsymbol{\theta}}-\boldsymbol{\theta}^{*}\right\|_{\infty}\lesssim e^{8\kappa_{0}+8\kappa_{1}}\log\log(np)\sqrt{\frac{\log(np)}{np}}\left(1+\frac{\log(np)}{\sqrt{p}}\right).

∎

A.9 Proof of Theorem 4

For brevity, we denote $\tilde{\alpha}_{r,i,j}:=e^{\tilde{\beta}_{i,r}+\tilde{\beta}_{j,r}}$ , with $i,j=1,\cdots,p$ and $r=0,1$ . We use the interior mapping theorem [6] to establish the existence and uniform consistency of the moment estimator. The interior mapping theorem is presented in Lemma 8.

Lemma 8.

(Interior mapping theorem). Let ${\mathbf{F}}\left({\mathbf{x}}\right)=\left({\mathbf{F}}_{1}\left({\mathbf{x}}\right),\cdots,{\mathbf{F}}_{p}\left({\mathbf{x}}\right)\right)^{\top}$ be a vector function on an open convex subset ${\mathcal{D}}$ of $\mathbb{R}^{p}$ and $\left\|{\mathbf{F}}^{\prime}\left({\mathbf{x}}\right)-{\mathbf{F}}^{\prime}\left({\mathbf{y}}\right)\right\|_{\infty}\lesssim\gamma\left\|{\mathbf{x}}-{\mathbf{y}}\right\|$ . Assume that ${\mathbf{x}}_{0}\in{\mathcal{D}}$ satisfies

	$\displaystyle\left\|{\mathbf{F}}^{\prime}\left({\mathbf{x}}_{0}\right)^{-1}\right\|_{\infty}\leq N,\quad\left\|{\mathbf{F}}^{\prime}\left({\mathbf{x}}_{0}\right)^{-1}{\mathbf{F}}\left({\mathbf{x}}_{0}\right)\right\|_{\infty}\leq\delta,\quad h=2N\gamma\delta\leq 1,$
	$\displaystyle t^{*}\equiv\frac{2}{h}\left(1-\sqrt{1-h}\right)\delta,\quad{\mathbf{B}}_{\infty}\left({\mathbf{x}}_{0},t^{*}\right)\subset{\mathcal{D}},$

where $N$ and $\delta$ are positive constants that may depend on ${\mathbf{x}}_{0}$ and $p$ . Then the Newton iterates ${\mathbf{x}}_{n+1}\equiv{\mathbf{x}}_{n}-{\mathbf{F}}^{\prime}\left({\mathbf{x}}_{n}\right)^{-1}{\mathbf{F}}\left({\mathbf{x}}_{n}\right)$ exist and ${\mathbf{x}}_{n}\in{\mathbf{B}}_{\infty}\left({\mathbf{x}}_{0},t^{*}\right)\subset{\mathcal{D}}$ for all $n\geq 0$ . Moreover, ${\mathbf{x}}^{*}=\lim_{n\to\infty}{\mathbf{x}}_{n}$ exists, ${\mathbf{x}}^{*}\in\overline{{\mathbf{B}}_{\infty}\left({\mathbf{x}}_{0},t^{*}\right)}\subset{\mathcal{D}}$ , where $\overline{A}$ denotes the closure of the set $A$ , and ${\mathbf{F}}\left({\mathbf{x}}^{*}\right)=0$ .

Proof of Theorem 4.

Proof.

Recall that $\tilde{\boldsymbol{\theta}}_{(0)}$ is defined as the solution of

\sum_{t=1}^{n}\sum_{j=1,\>j\neq i}^{p}X_{i,j}^{t}=n\sum_{j=1,\>j\neq i}^{p}\frac{e^{\beta_{i,0}+\beta_{j,0}}}{1+e^{\beta_{i,0}+\beta_{j,0}}},\quad i=1,\cdots,p.

For any ${\mathbf{x}}\in\mathbb{R}^{p}$ , define a system of random functions ${\mathbf{G}}\left({\mathbf{x}}\right)$ :

	$\displaystyle{\mathbf{G}}_{i}\left({\mathbf{x}}\right):=-\frac{1}{np}\sum_{t=1}^{n}\sum_{j=1,\>j\neq i}^{p}\left(X_{i,j}^{t}-\frac{e^{x_{i}+x_{j}}}{1+e^{x_{i}+x_{j}}}\right),\quad i=1,\cdots,p;$
	$\displaystyle{\mathbf{G}}\left({\mathbf{x}}\right):=\left({\mathbf{G}}_{1}\left({\mathbf{x}}\right),\cdots,{\mathbf{G}}_{p}\left({\mathbf{x}}\right)\right)^{\top}.$

As $np\to\infty$ and $\kappa_{0}=c\log(np)$ with a small enough constant $c>0$ , there exist sufficiently large constants $C_{1}>0,c_{1}>0$ such that with probability greater than $1-(np)^{-c_{1}}$ ,

(A.32)		$\displaystyle\left\|{\mathbf{G}}^{\prime}\left({\mathbf{x}}\right)-{\mathbf{G}}^{\prime}\left({\mathbf{y}}\right)\right\|_{\infty}$	$\displaystyle\leq$	$\displaystyle C_{1}\|{\mathbf{x}}-{\mathbf{y}}\|_{\infty},$
	$\displaystyle\left\|{\mathbf{G}}^{\prime}\left(\boldsymbol{\theta}_{(0)}^{*}\right)^{-1}\right\|_{\infty}$	$\displaystyle\leq$	$\displaystyle C_{1}e^{2\kappa_{0}},$
	$\displaystyle\left\|{\mathbf{G}}^{\prime}\left(\boldsymbol{\theta}_{(0)}^{*}\right)^{-1}{\mathbf{G}}\left(\boldsymbol{\theta}_{(0)}^{*}\right)\right\|_{\infty}$	$\displaystyle\leq$	$\displaystyle C_{1}e^{2\kappa_{0}}\sqrt{\frac{\log(n)\log(p)}{np}},$

hold for all ${\mathbf{x}},{\mathbf{y}}\in{\mathbf{B}}_{\infty}\left(0,\kappa_{0}\right)$ . For brevity, the proof of inequalities in (A.32) is provided independently in Section A.10. Let $x_{0}=\boldsymbol{\theta}_{(0)}^{*}$ and ${\mathcal{D}}=\text{Int }({\mathbf{B}}_{\infty}\left(0,\kappa_{0}\right))$ . Here the notation $\text{Int}(A)$ denotes the interior of a given set $A$ . We then have:

	$\displaystyle N=C_{1}e^{2\kappa_{0}},\quad\gamma=C_{1},\quad\delta=C_{1}e^{2\kappa_{0}}\sqrt{\frac{\log(n)\log(p)}{np}},$
	$\displaystyle h=2N\gamma\delta=2C_{1}^{3}e^{4\kappa_{0}}\sqrt{\frac{\log(n)\log(p)}{np}}=o\left(1\right),$
	$\displaystyle t^{*}\equiv\frac{2C_{1}}{h}\left(1-\sqrt{1-h}\right)e^{2\kappa_{0}}\sqrt{\frac{\log(n)\log(p)}{np}},$
	$\displaystyle{\mathbf{B}}_{\infty}\left(\boldsymbol{\theta}_{(0)}^{*},t^{*}\right)\subset{\mathcal{D}}.$

Since $1-\sqrt{1-h}<1-(1-h)=h$ for all $h\in(0,1)$ , we have

t^{*}\equiv\frac{2C_{1}}{h}\left(1-\sqrt{1-h}\right)e^{2\kappa_{0}}\sqrt{\frac{\log(n)\log(p)}{np}}<4C_{1}e^{2\kappa_{0}}\sqrt{\frac{\log(n)\log(p)}{np}}.
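The radius in Lemma 8 has the form $t^{*}=\frac{2}{h}\left(1-\sqrt{1-h}\right)\delta$, and the identity $1-\sqrt{1-h}=h/(1+\sqrt{1-h})$ shows that the multiplier of $\delta$ equals $2/(1+\sqrt{1-h})$, which lies between 1 and 2 for $h\in(0,1)$. A small numerical sketch of this fact follows; the function name and the grid of $h$ values are illustrative assumptions.

```python
import math


def newton_radius_factor(h):
    """Multiplier of delta in t* = (2 / h) * (1 - sqrt(1 - h)) * delta."""
    return (2.0 / h) * (1.0 - math.sqrt(1.0 - h))


for h in (1e-4, 0.01, 0.25, 0.5, 0.99):  # illustrative values in (0, 1)
    factor = newton_radius_factor(h)
    equivalent = 2.0 / (1.0 + math.sqrt(1.0 - h))  # algebraically identical
    print(h, factor, equivalent)
```

In particular $\delta\leq t^{*}<2\delta$, so the solution guaranteed by Lemma 8 lies within a constant multiple of $\delta$ of the starting point.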

Consequently, by Lemma 8, we have that, with probability tending to 1,

(A.33)

\left\|\tilde{\boldsymbol{\theta}}_{(0)}-\boldsymbol{\theta}_{(0)}^{*}\right\|_{\infty}\lesssim\sqrt{\frac{\log(n)\log(p)e^{4\kappa_{0}}}{np}}.
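To make the role of the Newton iterates in Lemma 8 concrete, the following sketch solves the moment equations ${\mathbf{G}}({\mathbf{x}})=0$ numerically. It is a simplified illustration under stated assumptions: the time-averaged degrees are replaced by their noise-free expectations (so the exact root is the true parameter), the parameter values are drawn arbitrarily, and all function names are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def moment_map(x, avg_degree):
    """G_i(x) = -(1/p) * (avg_degree_i - sum_{j != i} sigmoid(x_i + x_j))."""
    p = x.size
    probs = sigmoid(x[:, None] + x[None, :])
    np.fill_diagonal(probs, 0.0)
    return -(avg_degree - probs.sum(axis=1)) / p


def jacobian(x):
    """Balanced symmetric matrix G'(x): the diagonal equals the off-diagonal row sums."""
    p = x.size
    probs = sigmoid(x[:, None] + x[None, :])
    w = probs * (1.0 - probs) / p
    np.fill_diagonal(w, 0.0)
    return w + np.diag(w.sum(axis=1))


def newton_solve(avg_degree, x0, num_iter=25):
    """Newton iterates x_{n+1} = x_n - G'(x_n)^{-1} G(x_n)."""
    x = x0.copy()
    for _ in range(num_iter):
        x = x - np.linalg.solve(jacobian(x), moment_map(x, avg_degree))
    return x


rng = np.random.default_rng(0)
p = 20
beta_star = rng.uniform(-0.5, 0.5, size=p)  # illustrative true parameters (assumption)
probs = sigmoid(beta_star[:, None] + beta_star[None, :])
np.fill_diagonal(probs, 0.0)
expected_degree = probs.sum(axis=1)  # noise-free stand-in for the averaged degrees
beta_hat = newton_solve(expected_degree, np.zeros(p))
print(np.max(np.abs(beta_hat - beta_star)))
```

With observed data, `expected_degree` would be replaced by the time-averaged degrees $n^{-1}\sum_{t}\sum_{j\neq i}X_{i,j}^{t}$, and the root would differ from the true parameter by the sampling error quantified in (A.33).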

Next, we derive the error bound of $\tilde{\boldsymbol{\theta}}_{(1)}$ based on (A.33). For ${\mathbf{x}}\in\mathbb{R}^{p}$ , define a system of random functions ${\mathbf{F}}\left({\mathbf{x}};\tilde{\boldsymbol{\theta}}_{(0)}\right)$ :

	$\displaystyle{\mathbf{F}}_{i}\left({\mathbf{x}};\tilde{\boldsymbol{\theta}}_{(0)}\right):=-\frac{1}{np}\sum_{t=1}^{n}\sum_{j=1,\>j\neq i}^{p}\left\{X_{i,j}^{t}X_{i,j}^{t-1}-\frac{\tilde{\alpha}_{0,i,j}}{1+\tilde{\alpha}_{0,i,j}}\left(1-\frac{1}{1+\tilde{\alpha}_{0,i,j}+e^{x_{i}+x_{j}}}\right)\right\},$
	$\displaystyle{\mathbf{F}}\left({\mathbf{x}};\tilde{\boldsymbol{\theta}}_{(0)}\right):=\left({\mathbf{F}}_{1}\left({\mathbf{x}};\tilde{\boldsymbol{\theta}}_{(0)}\right),\cdots,{\mathbf{F}}_{p}\left({\mathbf{x}};\tilde{\boldsymbol{\theta}}_{(0)}\right)\right)^{\top}.$

As $np\to\infty$ and $\kappa_{0}=c\log(np)$ with a small enough constant $c>0$ , there exist sufficiently large constants $C_{2}>0,c_{2}>0$ such that with probability greater than $1-(np)^{-c_{2}}$ ,

(A.34)		$\displaystyle\left\|{\mathbf{F}}^{\prime}\left({\mathbf{x}};\tilde{\boldsymbol{\theta}}_{(0)}\right)-{\mathbf{F}}^{\prime}\left({\mathbf{y}};\tilde{\boldsymbol{\theta}}_{(0)}\right)\right\|_{\infty}\leq C_{2}\left\|{\mathbf{x}}-{\mathbf{y}}\right\|_{\infty},$
	$\displaystyle\left\|{\mathbf{F}}^{\prime}\left(\boldsymbol{\theta}_{(1)}^{*};\tilde{\boldsymbol{\theta}}_{(0)}\right)^{-1}\right\|_{\infty}\leq C_{2}e^{12\kappa_{0}+6\kappa_{1}},$
	$\displaystyle\left\|{\mathbf{F}}^{\prime}\left(\boldsymbol{\theta}_{(1)}^{*};\tilde{\boldsymbol{\theta}}_{(0)}\right)^{-1}{\mathbf{F}}\left(\boldsymbol{\theta}_{(1)}^{*};\tilde{\boldsymbol{\theta}}_{(0)}\right)\right\|_{\infty}\leq C_{2}e^{14\kappa_{0}+6\kappa_{1}}\sqrt{\frac{\log(n)\log(p)}{np}},$

hold for all ${\mathbf{x}},{\mathbf{y}}\in{\mathbf{B}}_{\infty}\left(0,\kappa_{1}\right)$ . For brevity, the proof of the inequalities in (A.34) is given separately in Section A.11. Let $x_{0}=\boldsymbol{\theta}_{(1)}^{*}$ and ${\mathcal{D}}=\text{Int }({\mathbf{B}}_{\infty}\left(0,\kappa_{1}\right))$ . We then have:

	$\displaystyle N=C_{2}e^{12\kappa_{0}+6\kappa_{1}},\quad\gamma=C_{2},\quad\delta=C_{2}e^{14\kappa_{0}+6\kappa_{1}}\sqrt{\frac{\log(n)\log(p)}{np}},$
	$\displaystyle h=2N\gamma\delta=2C_{2}^{3}e^{26\kappa_{0}+12\kappa_{1}}\sqrt{\frac{\log(n)\log(p)}{np}}=o\left(1\right),$
	$\displaystyle t^{*}\equiv\frac{2C_{2}}{h}\left(1-\sqrt{1-h}\right)e^{14\kappa_{0}+6\kappa_{1}}\sqrt{\frac{\log(n)\log(p)}{np}},$
	$\displaystyle{\mathbf{B}}_{\infty}\left(\boldsymbol{\theta}_{(1)}^{*},t^{*}\right)\subset{\mathcal{D}}.$

Since $1-\sqrt{1-h}<1-(1-h)=h$ for all $h\in(0,1)$ , we have

t^{*}\equiv\frac{2C_{2}}{h}\left(1-\sqrt{1-h}\right)e^{14\kappa_{0}+6\kappa_{1}}\sqrt{\frac{\log(n)\log(p)}{np}}<4C_{2}e^{14\kappa_{0}+6\kappa_{1}}\sqrt{\frac{\log(n)\log(p)}{np}}.

Consequently, by Lemma 8, we have that, with probability tending to 1,

\left\|\tilde{\boldsymbol{\theta}}_{(1)}-\boldsymbol{\theta}_{(1)}^{*}\right\|_{\infty}\leq t^{*}\leq 4C_{2}e^{14\kappa_{0}+6\kappa_{1}}\sqrt{\frac{\log(n)\log(p)}{np}}.

Combining with (A.33), the theorem is proved. ∎

A.10 Proof of (A.32)

Proof.

Note that ${\mathbf{G}}^{\prime}\left({\mathbf{x}}\right)$ is a balanced symmetric matrix s.t.

{\mathbf{G}}^{\prime}\left({\mathbf{x}}\right)_{i,j}=\frac{1}{p}\frac{\alpha_{0,i,j}}{\left(1+\alpha_{0,i,j}\right)^{2}},\quad\quad\quad{\mathbf{G}}^{\prime}\left({\mathbf{x}}\right)_{i,i}=\sum_{j=1,\>j\neq i}^{p}{\mathbf{G}}^{\prime}\left({\mathbf{x}}\right)_{i,j},

for $i=1,\cdots,p$ and $i\neq j$ . Following [38], we construct a matrix ${\mathbf{S}}=(S_{i,j})$ to approximate the inverse of ${\mathbf{G}}^{\prime}\left({\mathbf{x}}\right)$ . Specifically, for any $i\neq j$ , we set

S_{i,j}=-\frac{1}{{\mathcal{T}}},\quad S_{i,i}=\left(\sum_{j=1,\>j\neq i}^{p}\frac{1}{p}\frac{\alpha_{0,i,j}}{\left(1+\alpha_{0,i,j}\right)^{2}}\right)^{-1}-\frac{1}{{\mathcal{T}}},\quad{\mathcal{T}}=\sum_{1\leq i\neq j\leq p}\frac{1}{p}\frac{\alpha_{0,i,j}}{\left(1+\alpha_{0,i,j}\right)^{2}}.

Note that

\frac{e^{-2\kappa_{0}}}{4p}\leq\frac{1}{p}\frac{\alpha_{0,i,j}}{\left(1+\alpha_{0,i,j}\right)^{2}}\leq\frac{1}{4p},

we have

{\mathcal{T}}\in\left(\frac{(p-1)e^{-2\kappa_{0}}}{4},\frac{(p-1)}{4}\right).

By Theorem 1 in [38], with $m=e^{-2\kappa_{0}}/\left(4p\right)$ and $M=1/\left(4p\right)$ , there exists a sufficiently large constant $C_{1}>0$ such that

$\displaystyle\left\|{\mathbf{G}}^{\prime}\left({\mathbf{x}}\right)^{-1}-{\mathbf{S}}\right\|_{\max}$	$\displaystyle\leq$	$\displaystyle\frac{M}{m^{2}}\frac{pM+(p-2)m}{2(p-2)m}\frac{1}{(p-1)^{2}}+\frac{1}{2m(p-1)^{2}}$
	$\displaystyle=$	$\displaystyle\frac{1}{2m(p-1)^{2}}\left(1+\frac{M}{m}+\frac{M^{2}}{m^{2}}\frac{p}{p-2}\right)$
	$\displaystyle\leq$	$\displaystyle C_{1}\frac{e^{6\kappa_{0}}}{p^{2}},$

where $\|{\mathbf{A}}\|_{\max}=\max_{i,j}|A_{i,j}|$ . Then, there exists a sufficiently large constant $C_{2}>0$ such that

$\displaystyle\left\|{\mathbf{G}}^{\prime}\left({\mathbf{x}}\right)^{-1}\right\|_{\infty}$	$\displaystyle\leq$	$\displaystyle\left\|{\mathbf{G}}^{\prime}\left({\mathbf{x}}\right)^{-1}-{\mathbf{S}}\right\|_{\infty}+\left\|{\mathbf{S}}\right\|_{\infty}$
	$\displaystyle\leq$	$\displaystyle p\left\|{\mathbf{G}}^{\prime}\left({\mathbf{x}}\right)^{-1}-{\mathbf{S}}\right\|_{\max}+\frac{p}{{\mathcal{T}}}+\max_{i}\left(\sum_{j=1,\>j\neq i}^{p}\frac{1}{p}\frac{\alpha_{0,i,j}}{\left(1+\alpha_{0,i,j}\right)^{2}}\right)^{-1}$
	$\displaystyle\leq$	$\displaystyle pC_{1}\frac{e^{6\kappa_{0}}}{p^{2}}+8e^{2\kappa_{0}}$
	$\displaystyle<$	$\displaystyle C_{2}e^{2\kappa_{0}}.$
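The quality of the approximate inverse ${\mathbf{S}}$ can be inspected numerically for a given ${\mathbf{x}}$. The sketch below builds ${\mathbf{G}}^{\prime}({\mathbf{x}})$ and ${\mathbf{S}}$ as defined above, with $\alpha_{0,i,j}$ taken to be $e^{x_{i}+x_{j}}$, and compares the entrywise error with the explicit bound stated before the constant $C_{1}$ is introduced. The values $p=50$, $\kappa_{0}=1$, the random draw of ${\mathbf{x}}$ and all function names are illustrative assumptions.

```python
import numpy as np


def balanced_matrix(x):
    """G'(x): off-diagonal (1/p) e^z / (1 + e^z)^2 with z = x_i + x_j, diagonal = row sums."""
    p = x.size
    z = x[:, None] + x[None, :]
    w = np.exp(z) / (1.0 + np.exp(z)) ** 2 / p
    np.fill_diagonal(w, 0.0)
    return w + np.diag(w.sum(axis=1))


def approximate_inverse(v):
    """S with S_ij = -1/T for i != j and S_ii = 1/v_ii - 1/T, T = sum of off-diagonal entries."""
    off = v - np.diag(np.diag(v))
    total = off.sum()
    s = np.full(v.shape, -1.0 / total)
    s[np.diag_indices_from(s)] = 1.0 / off.sum(axis=1) - 1.0 / total
    return s


def max_norm_bound(p, kappa0):
    """Explicit bound with m = e^{-2 kappa0} / (4p) and M = 1 / (4p)."""
    m = np.exp(-2.0 * kappa0) / (4.0 * p)
    M = 1.0 / (4.0 * p)
    return (1.0 + M / m + (M / m) ** 2 * p / (p - 2)) / (2.0 * m * (p - 1) ** 2)


rng = np.random.default_rng(1)
p, kappa0 = 50, 1.0  # illustrative sizes (assumption)
x = rng.uniform(-kappa0, kappa0, size=p)
v = balanced_matrix(x)
err = np.max(np.abs(np.linalg.inv(v) - approximate_inverse(v)))
print(err, max_norm_bound(p, kappa0))
```

The point of ${\mathbf{S}}$ is that it is available in closed form, so the row-sum norm of the exact inverse can be controlled without inverting ${\mathbf{G}}^{\prime}({\mathbf{x}})$ explicitly.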

By Lemma 1 and Lemma 6, there exist big positive constants $C_{3},c_{1}$ such that, with probability greater than $1-(np)^{-c_{1}}$ ,

(A.35)			$\displaystyle\left\|{\mathbf{G}}\left(\boldsymbol{\theta}_{(0)}^{*}\right)\right\|_{\infty}$
	$\displaystyle=$	$\displaystyle\max_{i}\left\|\frac{1}{np}\sum_{t=1}^{n}\sum_{j=1,\>j\neq i}^{p}\left(X_{i,j}^{t}-\frac{e^{\beta^{*}_{i,0}+\beta^{*}_{j,0}}}{1+e^{\beta^{*}_{i,0}+\beta^{*}_{j,0}}}\right)\right\|$
	$\displaystyle=$	$\displaystyle\max_{i}\left\|\frac{1}{np}\sum_{t=1}^{n}\sum_{j=1,\>j\neq i}^{p}\left(X_{i,j}^{t}-{\rm E}\left(X_{i,j}^{t}\right)\right)\right\|$
	$\displaystyle\leq$	$\displaystyle\frac{1}{np}\max_{i}\left\|\sum_{j=1,\>j\neq i}^{p}\left(\sum_{t=1}^{n}\left(X_{i,j}^{t}-{\rm E}\left(X_{i,j}^{t}\right)\right)\right)\right\|$
	$\displaystyle\leq$	$\displaystyle\frac{C_{3}}{np}\left(\sqrt{np\log(p)}+\sqrt{n\log(np)}+\log\left(n\right)\log\log\left(n\right)\log\left(np\right)\right)$
	$\displaystyle<$	$\displaystyle 3C_{3}\sqrt{\frac{\log(n)\log(p)}{np}}.$

Then, for any ${\mathbf{y}}\in\mathbb{R}^{p}$ with $\|{\mathbf{y}}\|_{\infty}<\kappa_{0}$ , we have that

(A.36)

\left\|{\mathbf{G}}^{\prime}\left(\boldsymbol{\theta}_{(0)}^{*}\right)^{-1}{\mathbf{G}}\left(\boldsymbol{\theta}_{(0)}^{*}\right)\right\|_{\infty}\leq\left\|{\mathbf{G}}^{\prime}\left(\boldsymbol{\theta}_{(0)}^{*}\right)^{-1}\right\|_{\infty}\left\|{\mathbf{G}}\left(\boldsymbol{\theta}_{(0)}^{*}\right)\right\|_{\infty}\leq 3C_{2}C_{3}e^{2\kappa_{0}}\sqrt{\frac{\log(n)\log(p)}{np}}.

Moreover, there exists a sufficiently large constant $C_{4}>0$ such that, for every ${\mathbf{x}},{\mathbf{y}}\in{\mathbf{B}}_{\infty}\left(0,\kappa_{0}\right)$ ,

			$\displaystyle\left\|{\mathbf{G}}^{\prime}\left({\mathbf{x}}\right)-{\mathbf{G}}^{\prime}\left({\mathbf{y}}\right)\right\|_{\infty}$
		$\displaystyle\leq$	$\displaystyle\max_{i}\left\|\frac{1}{np}\sum_{t=1}^{n}\sum_{j=1,\>j\neq i}^{p}\left(X_{i,j}^{t}-\frac{e^{x_{i}+x_{j}}}{1+e^{x_{i}+x_{j}}}\right)-\frac{1}{np}\sum_{t=1}^{n}\sum_{j=1,\>j\neq i}^{p}\left(X_{i,j}^{t}-\frac{e^{y_{i}+y_{j}}}{1+e^{y_{i}+y_{j}}}\right)\right\|$
		$\displaystyle\leq$	$\displaystyle\frac{1}{p}\max_{i}\left\|\sum_{j=1,\>j\neq i}^{p}\frac{e^{x_{i}+x_{j}}}{1+e^{x_{i}+x_{j}}}-\frac{e^{y_{i}+y_{j}}}{1+e^{y_{i}+y_{j}}}\right\|$
		$\displaystyle\leq$	$\displaystyle\frac{1}{p}\max_{i}\left\|\sum_{j=1,\>j\neq i}^{p}\frac{e^{z_{i,j}}}{\left(1+e^{z_{i,j}}\right)^{2}}\left(x_{i}+x_{j}-y_{i}-y_{j}\right)\right\|$
		$\displaystyle\leq$	$\displaystyle C_{4}\left\|{\mathbf{x}}-{\mathbf{y}}\right\|_{\infty}$

where $z_{i,j}:=\left(1-c_{i,j}\right)\left(x_{i}+x_{j}\right)+c_{i,j}\left(y_{i}+y_{j}\right)$ with a series of constants $c_{i,j}\in(0,1)$ . Combining the inequalities (A.35), (A.36) and (A.10), we finish the proof of (A.32). ∎

A.11 Proof of (A.34)

Proof.

For brevity, ${\mathbf{F}}\left({\mathbf{x}};\tilde{\boldsymbol{\theta}}_{(0)}\right)$ is denoted by ${\mathbf{F}}\left({\mathbf{x}}\right)$ . Moreover, as all the conclusions hold uniformly for all $\tilde{\boldsymbol{\theta}}_{(0)}$ , the qualifier “uniformly for all $\tilde{\boldsymbol{\theta}}_{(0)}$ ” is also omitted in what follows.

Note that ${\mathbf{F}}^{\prime}\left({\mathbf{x}}\right)$ is a balanced symmetric matrix s.t.

{\mathbf{F}}^{\prime}\left({\mathbf{x}}\right)_{i,j}=\frac{1}{p}\frac{\tilde{\alpha}_{0,i,j}}{1+\tilde{\alpha}_{0,i,j}}\frac{e^{x_{i}+x_{j}}}{\left(1+\tilde{\alpha}_{0,i,j}+e^{x_{i}+x_{j}}\right)^{2}},\quad{\mathbf{F}}^{\prime}\left({\mathbf{x}}\right)_{i,i}=\sum_{j=1,\>j\neq i}^{p}{\mathbf{F}}^{\prime}\left({\mathbf{x}}\right)_{i,j},

for $i=1,\cdots,p$ and $i\neq j$ . Following [38], we construct a matrix ${\mathbf{S}}=(S_{i,j})$ to approximate the inverse of ${\mathbf{F}}^{\prime}\left({\mathbf{x}}\right)$ . Specifically, for all $i\neq j$ , we set

	$\displaystyle S_{i,j}$	$\displaystyle=-\frac{1}{{\mathcal{T}}},\quad S_{i,i}=\left(\sum_{j=1,\>j\neq i}^{p}\frac{1}{p}\frac{\tilde{\alpha}_{0,i,j}}{1+\tilde{\alpha}_{0,i,j}}\frac{e^{x_{i}+x_{j}}}{\left(1+\tilde{\alpha}_{0,i,j}+e^{x_{i}+x_{j}}\right)^{2}}\right)^{-1}-\frac{1}{{\mathcal{T}}},$
	$\displaystyle{\mathcal{T}}$	$\displaystyle=\sum_{1\leq i\neq j\leq p}\frac{1}{p}\frac{\tilde{\alpha}_{0,i,j}}{1+\tilde{\alpha}_{0,i,j}}\frac{e^{x_{i}+x_{j}}}{\left(1+\tilde{\alpha}_{0,i,j}+e^{x_{i}+x_{j}}\right)^{2}}.$

Note that there exists a constant $C_{1}>0$ such that, for any $i\neq j$ ,

\frac{C_{1}}{pe^{4\kappa_{0}+2\kappa_{1}}}\leq\frac{1}{p}\frac{\tilde{\alpha}_{0,i,j}}{1+\tilde{\alpha}_{0,i,j}}\frac{e^{x_{i}+x_{j}}}{\left(1+\tilde{\alpha}_{0,i,j}+e^{x_{i}+x_{j}}\right)^{2}}<\frac{1}{4p}.

Hence we have

(A.38)

{\mathcal{T}}=\sum_{1\leq i\neq j\leq p}\frac{1}{p}\frac{\tilde{\alpha}_{0,i,j}}{1+\tilde{\alpha}_{0,i,j}}\frac{e^{x_{i}+x_{j}}}{\left(1+\tilde{\alpha}_{0,i,j}+e^{x_{i}+x_{j}}\right)^{2}}\quad\in\left(C_{1}pe^{-4\kappa_{0}-2\kappa_{1}},\frac{p}{4}\right).

By Theorem 1 in [38], with $m=C_{1}/\left(pe^{4\kappa_{0}+2\kappa_{1}}\right)$ and $M=1/\left(4p\right)$ , there exists a sufficiently large constant $C_{2}>0$ such that

$\displaystyle\left\|{\mathbf{F}}^{\prime}\left({\mathbf{x}}\right)^{-1}-{\mathbf{S}}\right\|_{\max}$	$\displaystyle\leq$	$\displaystyle\frac{M}{m^{2}}\frac{pM+(p-2)m}{2(p-2)m}\frac{1}{(p-1)^{2}}+\frac{1}{2m(p-1)^{2}}$
	$\displaystyle=$	$\displaystyle\frac{1}{2m(p-1)^{2}}\left(1+\frac{M}{m}+\frac{M^{2}}{m^{2}}\frac{p}{p-2}\right)$
	$\displaystyle\leq$	$\displaystyle C_{2}\frac{e^{12\kappa_{0}+6\kappa_{1}}}{p^{2}},$

where $\|{\mathbf{A}}\|_{\max}=\max_{i,j}|A_{i,j}|$ . Then we have that

			$\displaystyle\left\|{\mathbf{F}}^{\prime}\left({\mathbf{x}}\right)^{-1}\right\|_{\infty}$
		$\displaystyle\leq$	$\displaystyle\left\|{\mathbf{S}}\right\|_{\infty}+\left\|{\mathbf{F}}^{\prime}\left({\mathbf{x}}\right)^{-1}-{\mathbf{S}}\right\|_{\infty}$
		$\displaystyle\leq$	$\displaystyle\max_{i}\left(\sum_{j=1,\>j\neq i}^{p}\frac{1}{p}\frac{\tilde{\alpha}_{0,i,j}}{1+\tilde{\alpha}_{0,i,j}}\frac{e^{x_{i}+x_{j}}}{\left(1+\tilde{\alpha}_{0,i,j}+e^{x_{i}+x_{j}}\right)^{2}}\right)^{-1}+\frac{p}{{\mathcal{T}}}+p\left\|{\mathbf{F}}^{\prime}\left({\mathbf{x}}\right)^{-1}-{\mathbf{S}}\right\|_{\max}$
		$\displaystyle<$	$\displaystyle 2C_{2}e^{12\kappa_{0}+6\kappa_{1}}.$

Moreover, we have that

			$\displaystyle\left\|{\mathbf{F}}\left(\boldsymbol{\theta}_{(1)}^{*}\right)\right\|_{\infty}$
		$\displaystyle=$	$\displaystyle\max_{i}\left\|-\frac{1}{np}\sum_{t=1}^{n}\sum_{j=1,\>j\neq i}^{p}\left\{X_{i,j}^{t}X_{i,j}^{t-1}-\frac{\tilde{\alpha}_{0,i,j}}{1+\tilde{\alpha}_{0,i,j}}\left(1-\frac{1}{1+\tilde{\alpha}_{0,i,j}+\alpha_{1,i,j}^{*}}\right)\right\}\right\|$
		$\displaystyle\leq$	$\displaystyle\max_{i}\left\|\frac{1}{np}\sum_{t=1}^{n}\sum_{j=1,\>j\neq i}^{p}\left\{X_{i,j}^{t}X_{i,j}^{t-1}-\frac{\alpha_{0,i,j}^{*}}{1+\alpha_{0,i,j}^{*}}\left(1-\frac{1}{1+\alpha_{0,i,j}^{*}+\alpha_{1,i,j}^{*}}\right)\right\}\right\|$
			$\displaystyle+\max_{i}\left\|\frac{1}{np}\sum_{t=1}^{n}\sum_{j=1,\>j\neq i}^{p}\left\{\frac{\alpha_{0,i,j}^{*}}{1+\alpha_{0,i,j}^{*}}\frac{\alpha_{0,i,j}^{*}+\alpha_{1,i,j}^{*}}{1+\alpha_{0,i,j}^{*}+\alpha_{1,i,j}^{*}}-\frac{\tilde{\alpha}_{0,i,j}}{1+\tilde{\alpha}_{0,i,j}}\frac{\tilde{\alpha}_{0,i,j}+\alpha_{1,i,j}^{*}}{1+\tilde{\alpha}_{0,i,j}+\alpha_{1,i,j}^{*}}\right\}\right\|$
		$\displaystyle=$	$\displaystyle L_{1}+L_{2}.$

By Lemma 1 and Lemma 6, there exist big positive constants $C_{3},c_{2}$ s.t., with probability greater than $1-(np)^{-c_{2}}$ ,

$\displaystyle L_{1}$	$\displaystyle=$	$\displaystyle\frac{1}{np}\max_{i}\left\|\sum_{j=1,\>j\neq i}^{p}\sum_{t=1}^{n}X_{i,j}^{t}X_{i,j}^{t-1}-{\rm E}\left(X_{i,j}^{t}X_{i,j}^{t-1}\right)\right\|$
	$\displaystyle\leq$	$\displaystyle\frac{1}{np}\max_{i}\left\|\sum_{j=1,\>j\neq i}^{p}\left\{\sum_{t=1}^{n}X_{i,j}^{t}X_{i,j}^{t-1}-{\rm E}\left(X_{i,j}^{t}X_{i,j}^{t-1}\right)\right\}\right\|$
	$\displaystyle\leq$	$\displaystyle\frac{C_{3}}{np}\left(\sqrt{np\log(p)}+\sqrt{n\log(np)}+\log\left(n\right)\log\log\left(n\right)\log\left(np\right)\right)$
	$\displaystyle<$	$\displaystyle 3C_{3}\sqrt{\frac{\log(n)\log(p)}{np}}.$

Moreover, we have

			$\displaystyle L_{2}$
		$\displaystyle=$	$\displaystyle\frac{1}{np}\max_{i}\Bigg{\|}\sum_{t=1}^{n}\sum_{j=1,\>j\neq i}^{p}\Bigg{\{}\frac{\alpha_{0,i,j}^{*}}{1+\alpha_{0,i,j}^{*}}\left(1-\frac{1}{1+\alpha_{0,i,j}^{*}+\alpha_{1,i,j}^{*}}\right)$
			$\displaystyle-\frac{\tilde{\alpha}_{0,i,j}}{1+\tilde{\alpha}_{0,i,j}}\left(1-\frac{1}{1+\tilde{\alpha}_{0,i,j}+\alpha_{1,i,j}^{*}}\right)\Bigg{\}}\Bigg{\|}$
		$\displaystyle\leq$	$\displaystyle\max_{i,j,\beta_{\xi}}\Bigg{\|}\left(\beta_{i,0}^{*}+\beta_{j,0}^{*}-\tilde{\beta}_{i,0}-\tilde{\beta}_{j,0}\right)\frac{e^{\beta_{\xi,i,j}}}{\left(1+e^{\beta_{\xi,i,j}}\right)^{2}}$
			$\displaystyle+\frac{e^{\beta_{\xi,i,j}}\left(1+e^{\beta_{\xi,i,j}}\right)\left(1+e^{\beta_{\xi,i,j}}+\alpha_{1,i,j}^{*}\right)-e^{\beta_{\xi,i,j}}\left(\left(2+\alpha_{1,i,j}^{*}\right)e^{\beta_{\xi,i,j}}+2e^{2\beta_{\xi,i,j}}\right)}{\left(1+e^{\beta_{\xi,i,j}}\right)^{2}\left(1+e^{\beta_{\xi,i,j}}+\alpha_{1,i,j}^{*}\right)^{2}}\Bigg{\|}$
		$\displaystyle\leq$	$\displaystyle\max_{i,j,\beta_{\xi}}\Bigg{\|}\left(\beta_{i,0}^{*}+\beta_{j,0}^{*}-\tilde{\beta}_{i,0}-\tilde{\beta}_{j,0}\right)\left(\frac{e^{\beta_{\xi,i,j}}}{\left(1+e^{\beta_{\xi,i,j}}\right)^{2}}+\frac{e^{\beta_{\xi,i,j}}\left(1+\alpha_{1,i,j}^{*}-e^{2\beta_{\xi,i,j}}\right)}{\left(1+e^{\beta_{\xi,i,j}}\right)^{2}\left(1+e^{\beta_{\xi,i,j}}+\alpha_{1,i,j}^{*}\right)^{2}}\right)\Bigg{\|}$
		$\displaystyle\leq$	$\displaystyle 2\left\|\tilde{\boldsymbol{\theta}}_{(0)}-\boldsymbol{\theta}_{(0)}^{*}\right\|_{\infty},$

where $\beta_{\xi,i,j}:=\left(1-c_{i,j}\right)\left(\beta_{i,0}^{*}+\beta_{j,0}^{*}\right)+c_{i,j}\left(\tilde{\beta}_{i,0}+\tilde{\beta}_{j,0}\right)$ with a series of constants $c_{i,j}\in(0,1)$ . Then, by equation (A.33), there exists a sufficiently large constant $C_{4}>0$ such that, with probability tending to 1,

L_{2}\leq 2\left\|\tilde{\boldsymbol{\theta}}_{(0)}-\boldsymbol{\theta}_{(0)}^{*}\right\|_{\infty}\leq C_{4}\sqrt{\frac{\log(n)\log(p)e^{4\kappa_{0}}}{np}}.

We can conclude that, with probability tending to 1,

$\displaystyle\left\|{\mathbf{F}}\left(\boldsymbol{\theta}_{(1)}^{*}\right)\right\|_{\infty}$	$\displaystyle\leq$	$\displaystyle L_{1}+L_{2}$
	$\displaystyle\leq$	$\displaystyle 3C_{3}\sqrt{\frac{\log(n)\log(p)}{np}}+C_{4}\sqrt{\frac{\log(n)\log(p)e^{4\kappa_{0}}}{np}}$
	$\displaystyle\leq$	$\displaystyle C_{5}\sqrt{\frac{\log(n)\log(p)e^{4\kappa_{0}}}{np}},$

holds uniformly over all $i$ . Consequently, we have

	$\displaystyle\left\|{\mathbf{F}}^{\prime}\left(\boldsymbol{\theta}_{(1)}^{*}\right)^{-1}{\mathbf{F}}\left(\boldsymbol{\theta}_{(1)}^{*}\right)\right\|_{\infty}$	$\displaystyle\leq$	$\displaystyle\left\|{\mathbf{F}}^{\prime}\left(\boldsymbol{\theta}_{(1)}^{*}\right)^{-1}\right\|_{\infty}\left\|{\mathbf{F}}\left(\boldsymbol{\theta}_{(1)}^{*}\right)\right\|_{\infty}$
		$\displaystyle\leq$	$\displaystyle 2C_{2}C_{5}e^{12\kappa_{0}+6\kappa_{1}}\sqrt{\frac{\log(n)\log(p)e^{4\kappa_{0}}}{np}}.$

Moreover, there exists a sufficiently large constant $C_{6}>0$ such that, for every ${\mathbf{x}},{\mathbf{y}}\in{\mathbf{B}}_{\infty}\left(0,\kappa_{1}\right)$ ,

(A.41)			$\displaystyle\left\|{\mathbf{F}}^{\prime}\left({\mathbf{x}}\right)-{\mathbf{F}}^{\prime}\left({\mathbf{y}}\right)\right\|_{\infty}$
	$\displaystyle\leq$	$\displaystyle\max_{i}\left\|2\sum_{j=1,\>j\neq i}^{p}\left(\frac{1}{p}\frac{\tilde{\alpha}_{0,i,j}}{1+\tilde{\alpha}_{0,i,j}}\frac{e^{x_{i}+x_{j}}}{\left(1+\tilde{\alpha}_{0,i,j}+e^{x_{i}+x_{j}}\right)^{2}}-\frac{1}{p}\frac{\tilde{\alpha}_{0,i,j}}{1+\tilde{\alpha}_{0,i,j}}\frac{e^{y_{i}+y_{j}}}{\left(1+\tilde{\alpha}_{0,i,j}+e^{y_{i}+y_{j}}\right)^{2}}\right)\right\|$
	$\displaystyle=$	$\displaystyle\frac{2}{p}\max_{i}\left\|\sum_{j=1,\>j\neq i}^{p}\frac{\tilde{\alpha}_{0,i,j}}{1+\tilde{\alpha}_{0,i,j}}\left(\frac{e^{x_{i}+x_{j}}}{\left(1+\tilde{\alpha}_{0,i,j}+e^{x_{i}+x_{j}}\right)^{2}}-\frac{e^{y_{i}+y_{j}}}{\left(1+\tilde{\alpha}_{0,i,j}+e^{y_{i}+y_{j}}\right)^{2}}\right)\right\|$
	$\displaystyle=$	$\displaystyle\frac{1}{p}\max_{i}\left\|\sum_{j=1,\>j\neq i}^{p}\frac{\tilde{\alpha}_{0,i,j}}{1+\tilde{\alpha}_{0,i,j}}\frac{e^{z_{i,j}}}{\left(1+\tilde{\alpha}_{0,i,j}+e^{z_{i,j}}\right)^{3}}\left\|x_{i}+x_{j}-y_{i}-y_{j}\right\|\right\|$
	$\displaystyle\leq$	$\displaystyle C_{6}\left\|{\mathbf{x}}-{\mathbf{y}}\right\|_{\infty},$

where $z_{i,j}:=\left(1-d_{i,j}\right)\left(x_{i}+x_{j}\right)+d_{i,j}\left(y_{i}+y_{j}\right)$ with a series of constants $d_{i,j}\in(0,1)$ . Combining the inequalities (A.11), (A.11) and (A.41), we finish the proof of (A.34). ∎

A.12 Proof of Corollary 1

Proof.

When the conditions $-\kappa_{0}\leq\beta_{i,0}\leq C_{\kappa}$ and $\|\boldsymbol{\beta}_{1}\|_{\infty}\leq\kappa_{1}$ , where $C_{\kappa}=O(1)$ , are satisfied, there exists a constant $C_{1}>0$ such that, for all $1\leq i\neq j\leq p$ and $\boldsymbol{\theta}\in{\mathbf{B}}_{\infty}\left(\boldsymbol{\theta}^{*},r\right)$ with $r=c_{r}e^{-2\kappa_{0}-4\kappa_{1}}$ for a small enough constant $c_{r}>0$ , it holds that

\displaystyle{\rm E}({\mathbf{V}}_{2}(\boldsymbol{\theta})+{\mathbf{V}}_{1}(\boldsymbol{\theta}))_{i,j}\geq C_{1}\frac{e^{-4\kappa_{1}}}{p},\quad{\rm E}({\mathbf{V}}_{2}(\boldsymbol{\theta})+{\mathbf{V}}_{3}(\boldsymbol{\theta}))_{i,j}\geq C_{1}\frac{e^{-4\kappa_{1}-2\kappa_{0}}}{p},

and

\displaystyle{\rm E}({\mathbf{V}}_{2}(\boldsymbol{\theta}))_{i,j}\geq C_{1}\frac{e^{-4\kappa_{1}-2\kappa_{0}}}{p}.

Then there exists a constant $C_{2}>0$ such that, for all $\boldsymbol{\theta}\in{\mathbf{B}}_{\infty}\left(\boldsymbol{\theta}^{*},r\right)$ , we have

\displaystyle\left\|{\rm E}({\mathbf{V}}(\boldsymbol{\theta}))\right\|_{2}\geq C_{2}e^{-4\kappa_{1}-2\kappa_{0}}.

With similar arguments as in the proofs of Lemma 2 and Theorems 1 and 2, we can prove Corollary 1. ∎

A.13 Proof of Corollary 2

Proof.

The first inequality can be proved using the results in Corollary 1, along with analogous reasoning employed in the proof of Theorem 1. Replace equations (A.23) and (A.24) by

	$\displaystyle r_{s}$	$\displaystyle=$	$\displaystyle e^{4\kappa_{0}+8\kappa_{1}}\log\log(np)\frac{\sqrt{\log(np)}}{\left(np\right)^{s}}\left(1+\frac{\log(np)}{\sqrt{p}}\right);$
	$\displaystyle s_{0}$	$\displaystyle=$	$\displaystyle\frac{6\kappa_{0}+12\kappa_{1}+\log\log(np)/2+\log\log\log(np)+\log\left(1+\frac{\log(np)}{\sqrt{p}}\right)-\log(c_{r})}{\log(np)}.$

Similar to the proof of Theorem 3, let $z_{1}=0.5e^{-2\kappa_{0}-4\kappa_{1}}[\log\log(np)]^{-1}(np)^{s-1/2}$ and $z_{2}=0.5e^{2\kappa_{0}+4\kappa_{1}}\log\log(np)(np)^{1/2-s}$ . We can assert the existence of a sufficiently large constant $C_{1}>0$ such that with probability greater than $1-(np)^{-c_{2}}$ ,

			$\displaystyle\left\|l\left(\boldsymbol{\theta}_{(i)}^{*},\widehat{\boldsymbol{\theta}}^{(s)}_{(-i)}\right)-l\left(\boldsymbol{\theta}_{(i)},\widehat{\boldsymbol{\theta}}^{(s)}_{(-i)}\right)-\left[l\left(\boldsymbol{\theta}_{(i)}^{*},\boldsymbol{\theta}_{(-i)}^{*}\right)-l\left(\boldsymbol{\theta}_{(i)},\boldsymbol{\theta}_{(-i)}^{*}\right)\right]\right\|$
		$\displaystyle\leq$	$\displaystyle C_{1}\frac{e^{6\kappa_{0}+12\kappa_{1}}\log\log(np)\log(np)}{(np)^{s+1/2}}\left(1+\frac{\log(np)}{\sqrt{p}}\right)^{2}$

holds uniformly for all $s\in[s_{0},1/2]$ and $\boldsymbol{\theta}_{(i)}\in{\mathbf{B}}_{\infty}\left(\boldsymbol{\theta}^{*}_{(i)},r_{s}\right)$ . Following similar steps, we can ascertain that as $np\rightarrow\infty$ with $n\geq 2$ , with probability converging to one it holds that

\left\|\widehat{\boldsymbol{\theta}}-\boldsymbol{\theta}^{*}\right\|_{\infty}\leq Ce^{4\kappa_{0}+8\kappa_{1}}\log\log(np)\sqrt{\frac{\log(np)}{np}}\left(1+\frac{\log(np)}{\sqrt{p}}\right).

∎

A.14 Proof of Corollary 3

Proof.

There exist positive constants $C_{1}$ and $C_{2}$ such that

\sum_{1\leq i\neq j\leq p}\frac{1}{p}\frac{\tilde{\alpha}_{0,i,j}}{1+\tilde{\alpha}_{0,i,j}}\frac{e^{x_{i}+x_{j}}}{\left(1+\tilde{\alpha}_{0,i,j}+e^{x_{i}+x_{j}}\right)^{2}}\quad\in\left(C_{1}pe^{-2\kappa_{0}-2\kappa_{1}},\frac{p}{4}\right),

for all $\|{\mathbf{x}}\|_{\infty}\leq\kappa_{1}$ . Then the inequalities in (A.34) can be updated to

	$\displaystyle\left\|{\mathbf{F}}^{\prime}\left({\mathbf{x}};\tilde{\boldsymbol{\theta}}_{(0)}\right)-{\mathbf{F}}^{\prime}\left({\mathbf{y}};\tilde{\boldsymbol{\theta}}_{(0)}\right)\right\|_{\infty}\leq C_{2}\left\|{\mathbf{x}}-{\mathbf{y}}\right\|_{\infty},$
	$\displaystyle\left\|{\mathbf{F}}^{\prime}\left(\boldsymbol{\theta}_{(1)}^{*};\tilde{\boldsymbol{\theta}}_{(0)}\right)^{-1}\right\|_{\infty}\leq C_{2}e^{2\kappa_{0}+6\kappa_{1}},$
	$\displaystyle\left\|{\mathbf{F}}^{\prime}\left(\boldsymbol{\theta}_{(1)}^{*};\tilde{\boldsymbol{\theta}}_{(0)}\right)^{-1}{\mathbf{F}}\left(\boldsymbol{\theta}_{(1)}^{*};\tilde{\boldsymbol{\theta}}_{(0)}\right)\right\|_{\infty}\leq C_{2}e^{4\kappa_{0}+6\kappa_{1}}\sqrt{\frac{\log(n)\log(p)}{np}},$

for all $\|{\mathbf{x}}\|_{\infty}\leq\kappa_{1}$ , $\|{\mathbf{y}}\|_{\infty}\leq\kappa_{1}$ . Consequently, with Lemma 8, similar to the proof of Theorem 4, we can prove Corollary 3. ∎

A.15 Proof of Theorem 5

Proof.

To simplify notation, in what follows we write $l_{i,j}(\boldsymbol{\theta})$ for $l_{i,j}(\theta_{i},\theta_{j})$ . Let $K=\lfloor c_{0}\log(np)\rfloor+1$ for some sufficiently large constant $c_{0}>0$ , where $\lfloor\cdot\rfloor$ denotes the floor function (the largest integer not exceeding its argument). By the Taylor expansion with Lagrange remainder, for any $\boldsymbol{\theta}\in{\mathbf{B}}_{\infty}(\boldsymbol{\theta}^{\prime},\alpha_{0})$ , there exists a $\boldsymbol{\theta}^{\xi}\in{\mathbf{B}}_{\infty}(\boldsymbol{\theta}^{\prime},\alpha_{0})$ depending on $\boldsymbol{\theta}$ such that

			$\displaystyle{\mathbf{L}}\left(\boldsymbol{\theta}\right)-{\mathbf{L}}\left(\boldsymbol{\theta}^{\prime}\right)$
		$\displaystyle=$	$\displaystyle\sum_{I_{1}=1}^{p}\left(\theta_{I_{1}}-\theta_{I_{1}}^{\prime}\right)\frac{\partial{\mathbf{L}}\left(\boldsymbol{\theta}^{\prime}\right)}{\partial\theta_{I_{1}}}$
			$\displaystyle+\frac{1}{2!}\sum_{I_{1}=1}^{p}\sum_{I_{2}=1}^{p}\left(\theta_{I_{1}}-\theta_{I_{1}}^{\prime}\right)\left(\theta_{I_{2}}-\theta_{I_{2}}^{\prime}\right)\frac{\partial^{2}{\mathbf{L}}\left(\boldsymbol{\theta}^{\prime}\right)}{\partial\theta_{I_{1}}\partial\theta_{I_{2}}}$
			$\displaystyle+\cdots$
			$\displaystyle+\frac{1}{K!}\sum_{I_{1}=1}^{p}\sum_{I_{2}=1}^{p}\cdots\sum_{I_{K}=1}^{p}\left(\prod_{\ell=1}^{K}\left(\theta_{I_{\ell}}-\theta_{I_{\ell}}^{\prime}\right)\right)\frac{\partial^{K}{\mathbf{L}}\left(\boldsymbol{\theta}^{\prime}\right)}{\partial\theta_{I_{1}}\partial\theta_{I_{2}}\cdots\partial\theta_{I_{K}}}$
			$\displaystyle+\frac{1}{\left(K+1\right)!}\sum_{I_{1}=1}^{p}\sum_{I_{2}=1}^{p}\cdots\sum_{I_{K+1}=1}^{p}\left(\prod_{\ell=1}^{K+1}\left(\theta_{I_{\ell}}-\theta_{I_{\ell}}^{\prime}\right)\right)\frac{\partial^{K+1}{\mathbf{L}}\left(\boldsymbol{\theta}^{\xi}\right)}{\partial\theta_{I_{1}}\partial\theta_{I_{2}}\cdots\partial\theta_{I_{K+1}}}$
		$\displaystyle=$	$\displaystyle\frac{1}{p}\sum_{k=1}^{K}\left\{\frac{1}{k!}\sum_{1\leq i\neq j\leq p}\left(Y_{i,j}\sum_{s=0}^{k}\tbinom{k}{s}\left(\theta_{i}-\theta_{i}^{\prime}\right)^{s}\left(\theta_{j}-\theta_{j}^{\prime}\right)^{k-s}\frac{\partial^{k}l_{i,j}\left(\boldsymbol{\theta}^{\prime}\right)}{\partial\theta_{i}^{s}\partial\theta_{j}^{k-s}}\right)\right\}$
			$\displaystyle+\frac{1}{p}\frac{1}{\left(K+1\right)!}\sum_{1\leq i\neq j\leq p}\left(Y_{i,j}\sum_{s=0}^{K+1}\tbinom{K+1}{s}\left(\theta_{i}-\theta_{i}^{\prime}\right)^{s}\left(\theta_{j}-\theta_{j}^{\prime}\right)^{K+1-s}\frac{\partial^{K+1}l_{i,j}\left(\boldsymbol{\theta}^{\xi}\right)}{\partial\theta_{i}^{s}\partial\theta_{j}^{K+1-s}}\right)$
		$\displaystyle\leq$	$\displaystyle\left\|\sum_{k=1}^{K}\frac{1}{p}\frac{1}{k!}\sum_{1\leq i\neq j\leq p}Y_{i,j}\left(\left(\theta_{i}-\theta_{i}^{\prime}\right)^{k}\frac{\partial^{k}l_{i,j}\left(\boldsymbol{\theta}^{\prime}\right)}{\partial\theta_{i}^{k}}+\left(\theta_{j}-\theta_{j}^{\prime}\right)^{k}\frac{\partial^{k}l_{i,j}\left(\boldsymbol{\theta}^{\prime}\right)}{\partial\theta_{j}^{k}}\right)\right\|$
			$\displaystyle+\left\|\sum_{k=2}^{K}\frac{1}{p}\frac{1}{k!}\sum_{1\leq i\neq j\leq p}\left(Y_{i,j}\sum_{s=1}^{k-1}\tbinom{k}{s}\left(\theta_{i}-\theta_{i}^{\prime}\right)^{s}\left(\theta_{j}-\theta_{j}^{\prime}\right)^{k-s}\frac{\partial^{k}l_{i,j}\left(\boldsymbol{\theta}^{\prime}\right)}{\partial\theta_{i}^{s}\partial\theta_{j}^{k-s}}\right)\right\|$
			$\displaystyle+\frac{1}{p}\left\|\frac{1}{\left(K+1\right)!}\sum_{1\leq i\neq j\leq p}\left(Y_{i,j}\sum_{s=0}^{K+1}\tbinom{K+1}{s}\left(\theta_{i}-\theta_{i}^{\prime}\right)^{s}\left(\theta_{j}-\theta_{j}^{\prime}\right)^{K+1-s}\frac{\partial^{K+1}l_{i,j}\left(\boldsymbol{\theta}^{\xi}\right)}{\partial\theta_{i}^{s}\partial\theta_{j}^{K+1-s}}\right)\right\|$
		$\displaystyle=$	$\displaystyle S^{(1)}+S^{(2)}+S^{(3)}.$

We first consider $S^{(1)}$ . By Lemma 6, there exist big enough constants $C_{1}>0,c_{1}>0$ such that, uniformly for all $\boldsymbol{\theta}\in{\mathbf{B}}_{\infty}\left(\boldsymbol{\theta}^{\prime},\alpha_{0}\right)$ and all $k=1,\cdots,K$ , we have, with probability greater than $1-(np)^{-c_{1}}$ ,

			$\displaystyle\frac{1}{p}\frac{1}{k!}\left\|\sum_{1\leq i\neq j\leq p}Y_{i,j}\left(\left(\theta_{i}-\theta_{i}^{\prime}\right)^{k}\frac{\partial^{k}l_{i,j}\left(\boldsymbol{\theta}^{\prime}\right)}{\partial\theta_{i}^{k}}+\left(\theta_{j}-\theta_{j}^{\prime}\right)^{k}\frac{\partial^{k}l_{i,j}\left(\boldsymbol{\theta}^{\prime}\right)}{\partial\theta_{j}^{k}}\right)\right\|$
		$\displaystyle\leq$	$\displaystyle\frac{2}{p}\frac{1}{k!}\sum_{i=1}^{p}\left\|\frac{\theta_{i}-\theta_{i}^{\prime}}{\alpha}\right\|^{k}\left\|\sum_{j\neq i,j=1}^{p}\alpha^{k}Y_{i,j}\frac{\partial^{k}l_{i,j}\left(\boldsymbol{\theta}^{\prime}\right)}{\partial\theta_{j}^{k}}\right\|$
		$\displaystyle\leq$	$\displaystyle\frac{2}{p}\frac{1}{k!}\max_{i}\left\|\sum_{j\neq i,j=1}^{p}\alpha^{k}Y_{i,j}\frac{\partial^{k}l_{i,j}\left(\boldsymbol{\theta}^{\prime}\right)}{\partial\theta_{j}^{k}}\right\|\sum_{i=1}^{p}\left\|\frac{\theta_{i}-\theta_{i}^{\prime}}{\alpha}\right\|^{k}$
		$\displaystyle\leq$	$\displaystyle\frac{C_{1}}{p}\frac{1}{k!}\left(b_{(p)}\log(np)+\sigma_{(p)}\sqrt{p\log(np)}\right)\left(k-1\right)!\left\\|\frac{\boldsymbol{\theta}-\boldsymbol{\theta}^{\prime}}{\alpha}\right\\|_{k}^{k}$
		$\displaystyle=$	$\displaystyle C_{1}\left\\|\frac{\boldsymbol{\theta}-\boldsymbol{\theta}^{\prime}}{\alpha}\right\\|_{k}^{k}\frac{b_{(p)}\log(np)+\sigma_{(p)}\sqrt{p\log(np)}}{kp}.$

Consequently, with probability greater than $1-(np)^{-c_{1}}$ we have:

$\displaystyle S^{(1)}$	$\displaystyle\leq$	$\displaystyle\sum_{k=1}^{K}\frac{1}{p}\frac{1}{k!}\left\|\sum_{1\leq i\neq j\leq p}Y_{i,j}\left(\left(\theta_{i}-\theta_{i}^{\prime}\right)^{k}\frac{\partial^{k}l_{i,j}\left(\boldsymbol{\theta}^{\prime}\right)}{\partial\theta_{i}^{k}}+\left(\theta_{j}-\theta_{j}^{\prime}\right)^{k}\frac{\partial^{k}l_{i,j}\left(\boldsymbol{\theta}^{\prime}\right)}{\partial\theta_{j}^{k}}\right)\right\|$
	$\displaystyle\leq$	$\displaystyle C_{1}\sum_{k=1}^{K}\left\\|\frac{\boldsymbol{\theta}-\boldsymbol{\theta}^{\prime}}{\alpha}\right\\|_{k}^{k}\frac{b_{(p)}\log(np)+\sigma_{(p)}\sqrt{p\log(np)}}{kp}$
	$\displaystyle\leq$	$\displaystyle C_{1}\frac{b_{(p)}\log(np)+\sigma_{(p)}\sqrt{p\log(np)}}{p}\sum_{k=1}^{K}\frac{1}{k}\left\\|\frac{\boldsymbol{\theta}-\boldsymbol{\theta}^{\prime}}{\alpha}\right\\|_{k}^{k},$

holds uniformly for all $\boldsymbol{\theta}\in{\mathbf{B}}_{\infty}\left(\boldsymbol{\theta}^{\prime},\alpha_{0}\right)$ . Note that for any ${\mathbf{x}}=\left(x_{1},\cdots,x_{p}\right)^{\top}$ s.t. $\|{\mathbf{x}}\|_{\infty}\leq a<1$ for some constant $a>0$ , we have

(A.42)

\sum_{k=1}^{K}\frac{1}{k}\left\|{\mathbf{x}}\right\|_{k}^{k}=\sum_{k=1}^{K}\sum_{i=1}^{p}\frac{1}{k}x_{i}^{k}=\sum_{i=1}^{p}x_{i}\sum_{k=1}^{K}\frac{x_{i}^{k-1}}{k}\leq\|{\mathbf{x}}\|_{1}\sum_{k=1}^{K}\frac{a^{k-1}}{k}\leq\left\|{\mathbf{x}}\right\|_{1}\left(-\frac{\log(1-a)}{a}\right).

Here in the last step we have used $-\log(1-a)=\sum_{k=1}^{\infty}a^{k}/k$. Since $\left\|\frac{\boldsymbol{\theta}-\boldsymbol{\theta}^{\prime}}{\alpha}\right\|_{\infty}<\alpha_{0}/\alpha<1/2$ holds for any $\boldsymbol{\theta}\in{\mathbf{B}}_{\infty}\left(\boldsymbol{\theta}^{\prime},\alpha_{0}\right)$, we have, with probability greater than $1-(np)^{-c_{1}}$,

	$\displaystyle S^{(1)}$	$\displaystyle\leq$	$\displaystyle-\frac{\log(1/2)}{1/2}C_{1}\frac{b_{(p)}\log(np)+\sigma_{(p)}\sqrt{p\log(np)}}{p}\left\|\boldsymbol{\theta}-\boldsymbol{\theta}^{\prime}\right\|_{1}$
		$\displaystyle\leq$	$\displaystyle 2C_{1}\frac{b_{(p)}\log(np)+\sigma_{(p)}\sqrt{p\log(np)}}{p}\left\|\boldsymbol{\theta}-\boldsymbol{\theta}^{\prime}\right\|_{1}.$
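
As a quick numerical sanity check of inequality (A.42) and of the constant $-\log(1/2)/(1/2)=2\log 2<2$ used in the last step, the following minimal sketch draws random vectors with $\|{\mathbf{x}}\|_{\infty}\leq a$ and compares both sides. The function names, the sampling scheme and the choice $a=0.5$ are illustrative assumptions, not part of the proof, and absolute values $|x_{i}|$ are used throughout.

```python
import math
import random

def lhs_a42(x, K):
    """Sum_{k=1}^{K} (1/k) * ||x||_k^k, where ||x||_k^k = sum_i |x_i|^k."""
    return sum(sum(abs(xi) ** k for xi in x) / k for k in range(1, K + 1))

def rhs_a42(x, a):
    """||x||_1 * (-log(1 - a) / a), valid for 0 < a < 1."""
    return sum(abs(xi) for xi in x) * (-math.log(1.0 - a) / a)

random.seed(0)
a = 0.5  # illustrative sup-norm bound (assumption)
checks = []
for _ in range(200):
    p = random.randint(1, 50)   # vector length
    K = random.randint(1, 40)   # truncation level
    x = [random.uniform(-a, a) for _ in range(p)]
    checks.append(lhs_a42(x, K) <= rhs_a42(x, a) + 1e-12)

all_hold = all(checks)
const_half = -math.log(0.5) / 0.5   # constant appearing in the bound on S^(1)
```

Random trials of this kind only illustrate the inequality; the argument in the text is what establishes it for every admissible ${\mathbf{x}}$ and $K$.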

Next we derive an upper bound for $S^{(2)}$ . Define a series of random $p\times p$ matrices $\{{\mathbf{Y}}^{s}_{k}:k=2,\ldots,K,s=1,\cdots,k-1\}$ with the $(i,j)$ -th elements of ${\mathbf{Y}}^{s}_{k}$ given as

\left({\mathbf{Y}}^{s}_{k}\right)_{i,j}=Y_{i,j}\alpha^{k}\frac{\partial^{k}l_{i,j}\left(\boldsymbol{\theta}^{\prime}\right)}{\partial\theta_{i}^{s}\partial\theta_{j}^{k-s}},\quad 1\leq i\neq j\leq p.

Further for any $k=2,\ldots,K$ , define a random $(k-1)p\times(k-1)p$ matrix ${\mathbf{W}}_{k}$ as:

{\mathbf{W}}_{k}=\left[\begin{array}[]{ccccc}0&&&&{\mathbf{Y}}^{1}_{k}\\ &&&{\mathbf{Y}}^{2}_{k}&\\ &&\iddots&&\\ &{\mathbf{Y}}^{k-2}_{k}&&&\\ {\mathbf{Y}}^{k-1}_{k}&&&&0\\ \end{array}\right].

Also we define a series of $p\times 1$ vectors, $\{{\mathbf{z}}^{(s)}_{k}:k=2,\ldots,K,s=1,\cdots,k-1\}$ with $i$ -th element of ${\mathbf{z}}^{(s)}_{k}$ given as:

\left({\mathbf{z}}^{(s)}_{k}\right)_{i}=\left(0.5\alpha\right)^{-s}\left(\theta_{i}-\theta_{i}^{\prime}\right)^{s}\sqrt{\tbinom{k}{s}}.

For any $k=2,\ldots,K$ , by denoting $\tilde{{\mathbf{z}}}_{k}$ as:

\tilde{{\mathbf{z}}}_{k}=\left[\begin{array}[]{c}{\mathbf{z}}_{k}^{(1)}\\ \vdots\\ {\mathbf{z}}_{k}^{(k-1)}\\ \end{array}\right],

we have:

			$\displaystyle\frac{1}{p}\frac{1}{k!}\left\|\sum_{1\leq i\neq j\leq p}\left(Y_{i,j}\sum_{s=1}^{k-1}\tbinom{k}{s}\left(\theta_{i}-\theta_{i}^{\prime}\right)^{s}\left(\theta_{j}-\theta_{j}^{\prime}\right)^{k-s}\frac{\partial^{k}l_{i,j}\left(\boldsymbol{\theta}^{\prime}\right)}{\partial\theta_{i}^{s}\partial\theta_{j}^{k-s}}\right)\right\|$
		$\displaystyle=$	$\displaystyle\frac{1}{p}\frac{1}{2^{k}k!}\left\|\sum_{s=1}^{k-1}\left(\tilde{{\mathbf{z}}}_{k}^{(k-s)}\right)^{\top}{\mathbf{Y}}_{k}^{s}\tilde{{\mathbf{z}}}_{k}^{(s)}\right\|$
		$\displaystyle=$	$\displaystyle\frac{1}{p}\frac{1}{2^{k}k!}\left\|\tilde{{\mathbf{z}}}_{k}^{\top}{\mathbf{W}}_{k}\tilde{{\mathbf{z}}}_{k}\right\|$
		$\displaystyle\leq$	$\displaystyle\frac{1}{p}\frac{1}{2^{k}k!}\|\tilde{{\mathbf{z}}}_{k}\|_{2}^{2}\|{\mathbf{W}}_{k}\|_{2}.$
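
The reformulation of the cross terms as a quadratic form can be checked numerically on a small example. The sketch below is only an illustration under stated assumptions: `d` stands in for $\theta_{i}-\theta_{i}^{\prime}$, the arrays `T[s]` are random placeholders for $Y_{i,j}\,\partial^{k}l_{i,j}(\boldsymbol{\theta}^{\prime})/\partial\theta_{i}^{s}\partial\theta_{j}^{k-s}$, and the sizes $p=4$, $k=3$, $\alpha=1$ are arbitrary. Block row $s$ of ${\mathbf{W}}_{k}$ is taken to hold ${\mathbf{Y}}_{k}^{s}$ in block column $k-s$, matching the anti-diagonal layout displayed above.

```python
import random
from math import comb, sqrt

random.seed(1)
p, k, alpha = 4, 3, 1.0   # small hypothetical sizes

# Hypothetical stand-ins: d[i] plays the role of theta_i - theta'_i, and
# T[s][i][j] the role of Y_ij * d^k l_ij / (d theta_i^s d theta_j^(k-s)).
d = [random.uniform(-0.2, 0.2) for _ in range(p)]
T = {s: [[0.0 if i == j else random.gauss(0.0, 1.0) for j in range(p)]
         for i in range(p)] for s in range(1, k)}

# Direct evaluation of the cross-term sum over i != j and s = 1..k-1.
direct = sum(comb(k, s) * d[i] ** s * d[j] ** (k - s) * T[s][i][j]
             for s in range(1, k) for i in range(p) for j in range(p) if i != j)

# Blocks Y^s_k = alpha^k * T[s] and vectors z^(s)_k as defined in the text.
Y = {s: [[alpha ** k * T[s][i][j] for j in range(p)] for i in range(p)]
     for s in range(1, k)}
z = {s: [(0.5 * alpha) ** (-s) * d[i] ** s * sqrt(comb(k, s)) for i in range(p)]
     for s in range(1, k)}

# W_k: block row s holds Y^s_k in block column k - s (anti-diagonal layout).
m = (k - 1) * p
W = [[0.0] * m for _ in range(m)]
for s in range(1, k):
    for i in range(p):
        for j in range(p):
            W[(s - 1) * p + i][(k - s - 1) * p + j] = Y[s][i][j]

z_tilde = [v for s in range(1, k) for v in z[s]]
quad = sum(z_tilde[r] * W[r][c] * z_tilde[c] for r in range(m) for c in range(m))

# The Frobenius norm dominates the spectral norm, so this is a weaker bound.
frob = sqrt(sum(w * w for row in W for w in row))
norm_sq = sum(v * v for v in z_tilde)
```

Here `direct` should agree with `quad / 2**k` up to rounding, and `abs(quad)` should not exceed `norm_sq * frob`. The Frobenius norm is used only because it is easy to compute without a linear algebra library; the proof itself works with the spectral norm $\|{\mathbf{W}}_{k}\|_{2}$.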

We remark that by formulating the confounding terms in $S^{(2)}$ via $\{{\mathbf{z}}^{(s)}_{k},{\mathbf{W}}_{k}\}$, the chain of inequalities above gives an upper bound that depends on the parameters through $\{{\mathbf{z}}^{(s)}_{k},k=2,\ldots,K\}$ and on the random matrices $\{{\mathbf{W}}_{k},k=2,\ldots,K\}$ separately. Using the fact that $\sum_{l=0}^{\infty}\tbinom{l+s}{s}0.5^{l+s+1}=1$, we have

\sum_{k=s+1}^{K}\tbinom{k}{s}0.5^{k}<\sum_{l=0}^{\infty}\tbinom{l+s}{s}0.5^{l+s}=2.
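
The identity used here is the statement that the negative binomial probabilities with success probability $1/2$ sum to one. A minimal numerical sketch follows; the function names and the series truncation at 400 terms are illustrative assumptions, not part of the proof.

```python
from math import comb

def neg_binomial_mass(s, terms=400):
    """Partial sum of sum_{l>=0} C(l+s, s) * 0.5^(l+s+1); approaches 1."""
    return sum(comb(l + s, s) * 0.5 ** (l + s + 1) for l in range(terms))

def truncated_sum(s, K):
    """sum_{k=s+1}^{K} C(k, s) * 0.5^k, the quantity bounded by 2 in the text."""
    return sum(comb(k, s) * 0.5 ** k for k in range(s + 1, K + 1))

mass_ok = all(abs(neg_binomial_mass(s) - 1.0) < 1e-9 for s in range(1, 21))
bound_ok = all(truncated_sum(s, K) < 2.0
               for s in range(1, 21) for K in (s + 1, 30, 200))
```

The truncated sum starts at $k=s+1$ and therefore omits the term $0.5^{s}$ of the full series, which is why the inequality is strict.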

Consequently, there exists a big enough constant $C_{2}>0$ such that, for all $\boldsymbol{\theta}\in{\mathbf{B}}_{\infty}\left(\boldsymbol{\theta}^{\prime},\alpha_{0}\right)$ , we have

$\displaystyle\sum_{k=2}^{K}\frac{0.5^{k}}{k}\\|\tilde{{\mathbf{z}}}_{k}\\|_{2}^{2}$	$\displaystyle=$	$\displaystyle\sum_{k=2}^{K}\frac{0.5^{k}}{k}\sum_{i=1}^{p}\sum_{s=1}^{k-1}\left(0.5\alpha\right)^{-2s}\left(\theta_{i}-\theta_{i}^{\prime}\right)^{2s}\tbinom{k}{s}$
	$\displaystyle=$	$\displaystyle\sum_{k=2}^{K}\frac{0.5^{k}}{k}\sum_{s=1}^{k-1}\tbinom{k}{s}\left(0.5\alpha\right)^{-2s}\\|\boldsymbol{\theta}-\boldsymbol{\theta}^{\prime}\\|_{2s}^{2s}$
	$\displaystyle\leq$	$\displaystyle\sum_{s=1}^{K-1}\frac{1}{s}\left\\|\frac{\boldsymbol{\theta}-\boldsymbol{\theta}^{\prime}}{\alpha/2}\right\\|_{2s}^{2s}\left(\sum_{k=s+1}^{K}\tbinom{k}{s}0.5^{k}\right)$
	$\displaystyle<$	$\displaystyle 2\sum_{s=1}^{K-1}\frac{1}{s}\left\\|\frac{\boldsymbol{\theta}-\boldsymbol{\theta}^{\prime}}{\alpha/2}\right\\|_{2s}^{2s}$
	$\displaystyle<$	$\displaystyle 2\sum_{s=1}^{K-1}\frac{1}{s}\left\\|\frac{\boldsymbol{\theta}-\boldsymbol{\theta}^{\prime}}{\alpha/2}\right\\|_{s}^{s}$
	$\displaystyle\leq$	$\displaystyle C_{2}\left\\|\boldsymbol{\theta}-\boldsymbol{\theta}^{\prime}\right\\|_{1}.$

Here the last step follows from (A.42). With similar arguments as in the proof of the matrix Bernstein inequality (cf. Lemma 3), we can show that there exist big enough constants $C_{3}>0$ and $c_{2}>0$ such that, uniformly for all $k\leq K=\lfloor c_{0}\log(np)\rfloor+1$, with probability greater than $1-(np)^{-c_{2}}$,

(A.45)

\|{\mathbf{W}}_{k}\|_{2}\leq C_{3}\left(k-1\right)!\left(b_{(p)}\log(np)+\sigma_{(p)}\sqrt{p\log(np)}\right).

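For brevity, the proof of inequality (A.45) is given separately in Section A.16. Consequently, combining the bound on each summand of $S^{(2)}$ in terms of $\|\tilde{{\mathbf{z}}}_{k}\|_{2}^{2}\|{\mathbf{W}}_{k}\|_{2}$, the bound on $\sum_{k=2}^{K}\frac{0.5^{k}}{k}\|\tilde{{\mathbf{z}}}_{k}\|_{2}^{2}$, and (A.45) with $K=\lfloor c_{0}\log(np)\rfloor+1$, we conclude that with probability greater than $1-(np)^{-c_{2}}$,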

$\displaystyle S^{(2)}$	$\displaystyle\leq$	$\displaystyle\sum_{k=2}^{K}\frac{1}{p}\frac{1}{k!}\left\|\sum_{1\leq i\neq j\leq p}\left(Y_{i,j}\sum_{s=1}^{k-1}\tbinom{k}{s}\left(\theta_{i}-\theta_{i}^{\prime}\right)^{s}\left(\theta_{j}-\theta_{j}^{\prime}\right)^{k-s}\frac{\partial^{k}l_{i,j}\left(\boldsymbol{\theta}^{\prime}\right)}{\partial\theta_{i}^{s}\partial\theta_{j}^{k-s}}\right)\right\|$
	$\displaystyle\leq$	$\displaystyle\sum_{k=2}^{K}\frac{1}{p}\frac{1}{2^{k}k!}\|\tilde{{\mathbf{z}}}_{k}\|_{2}^{2}\|{\mathbf{W}}_{k}\|_{2}$
	$\displaystyle\leq$	$\displaystyle C_{3}\frac{b_{(p)}\log(np)+\sigma_{(p)}\sqrt{p\log(np)}}{p}\sum_{k=2}^{K}\frac{1}{k2^{k}}\|\tilde{{\mathbf{z}}}_{k}\|_{2}^{2}$
	$\displaystyle\leq$	$\displaystyle C_{2}C_{3}\frac{b_{(p)}\log(np)+\sigma_{(p)}\sqrt{p\log(np)}}{p}\left\|\boldsymbol{\theta}-\boldsymbol{\theta}^{\prime}\right\|_{1},$

uniformly for all $\boldsymbol{\theta}\in{\mathbf{B}}_{\infty}\left(\boldsymbol{\theta}^{\prime},\alpha_{0}\right)$ . Finally, we derive an upper bound for $S^{(3)}$ . By condition (L-A1), when $c_{0}$ is chosen to be big enough, there exists a big enough constant $c_{3}>1$ , such that, uniformly for all $\boldsymbol{\theta}\in{\mathbf{B}}_{\infty}(\boldsymbol{\theta}^{\prime},\alpha_{0})$ and $\boldsymbol{\theta}^{\xi}$ , we have

			$\displaystyle S^{(3)}$
		$\displaystyle=$	$\displaystyle\frac{1}{p}\left\|\frac{1}{\left(K+1\right)!}\sum_{1\leq i\neq j\leq p}\left(Y_{i,j}\sum_{s=0}^{K+1}\tbinom{K+1}{s}\left(\theta_{i}-\theta_{i}^{\prime}\right)^{s}\left(\theta_{j}-\theta_{j}^{\prime}\right)^{K+1-s}\frac{\partial^{K+1}l_{i,j}\left(\boldsymbol{\theta}^{\xi}\right)}{\partial\theta_{i}^{s}\partial\theta_{j}^{K+1-s}}\right)\right\|$
		$\displaystyle\leq$	$\displaystyle\frac{1}{p}\frac{1}{K+1}\sum_{1\leq i\neq j\leq p}\left\|Y_{i,j}\right\|\left(\frac{\left\|\theta_{i}-\theta_{i}^{\prime}\right\|}{\alpha}+\frac{\left\|\theta_{j}-\theta_{j}^{\prime}\right\|}{\alpha}\right)^{K+1}$
		$\displaystyle\leq$	$\displaystyle\frac{pb_{(p)}}{K+1}\left(\frac{2\left\\|\boldsymbol{\theta}-\boldsymbol{\theta}^{\prime}\right\\|_{\infty}}{\alpha}\right)^{K+1}$
		$\displaystyle\leq$	$\displaystyle\frac{pb_{(p)}}{K+1}\left(\frac{2\alpha_{0}}{\alpha}\right)^{c_{0}\log\left(np\right)}$
		$\displaystyle\leq$	$\displaystyle b_{(p)}\left(np\right)^{-c_{3}}.$

Here the last step holds if we choose $c_{0}$ large enough that $(2\alpha_{0}/\alpha)^{c_{0}/(c_{3}+1)}<e^{-1}$. When $np\rightarrow\infty$ and $c_{0},c_{3}$ are chosen large enough, this bound is dominated by the upper bounds for $S^{(1)}$ and $S^{(2)}$.
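
To see how the choice of $c_{0}$ controls the remainder, the following sketch evaluates the factor $\frac{p}{K+1}(2\alpha_{0}/\alpha)^{K+1}$ with $K=\lfloor c_{0}\log(np)\rfloor+1$ and compares it with $(np)^{-c_{3}}$; the common factor $b_{(p)}$ is omitted. The function names and the numerical values of $n$, $p$, the ratio $2\alpha_{0}/\alpha$ and $c_{3}$ are hypothetical and chosen purely for illustration.

```python
import math

def smallest_c0(ratio, c3):
    """Threshold for c0 implied by ratio**(c0 / (c3 + 1)) <= exp(-1)."""
    return (c3 + 1) / (-math.log(ratio))

def remainder_factor(n, p, ratio, c0):
    """Return (p / (K + 1)) * ratio**(K + 1) with K = floor(c0 * log(np)) + 1."""
    K = math.floor(c0 * math.log(n * p)) + 1
    return p / (K + 1) * ratio ** (K + 1)

# (n, p, ratio = 2 * alpha_0 / alpha, c3): illustrative values only.
cases = [(20, 50, 0.8, 1.5), (100, 200, 0.5, 2.0), (10, 1000, 0.9, 3.0)]
ok = [remainder_factor(n, p, r, smallest_c0(r, c3)) <= (n * p) ** (-c3)
      for n, p, r, c3 in cases]
```

The extra $+1$ in the exponent $c_{0}/(c_{3}+1)$ is what absorbs the factor $p$ coming from the double sum over $1\leq i\neq j\leq p$.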

Combining the upper bounds on $S^{(1)}$, $S^{(2)}$ and $S^{(3)}$, we conclude that, for any given $\boldsymbol{\theta}^{\prime}$, there exist large enough constants $C>0$ and $c>0$, independent of $\boldsymbol{\theta}^{\prime}$, such that, with probability greater than $1-(np)^{-c}$, simultaneously for all $\boldsymbol{\theta}\in{\mathbf{B}}_{\infty}\left(\boldsymbol{\theta}^{\prime},\alpha_{0}\right)$ with $\alpha_{0}\in(0,\alpha/2)$,

\left|{\mathbf{L}}\left(\boldsymbol{\theta}\right)-{\mathbf{L}}\left(\boldsymbol{\theta}^{\prime}\right)\right|\leq C\frac{b_{(p)}\log(np)+\sigma_{(p)}\sqrt{p\log(np)}}{p}\left\|\boldsymbol{\theta}-\boldsymbol{\theta}^{\prime}\right\|_{1}.

∎

A.16 Proof of inequality (A.45)

Proof.

For any $k=2,\ldots,K$ and $1\leq i\neq j\leq p$ , let ${\mathbf{W}}_{k,i,j}$ be defined by keeping the $(i,j)$ th element of all the ${\mathbf{Y}}^{s}_{k},s=1,\cdots,k-1$ in ${\mathbf{W}}_{k}$ unchanged, and setting all other elements to be zero. Then the random matrices ${\mathbf{W}}_{k,i,j}$ , $1\leq i\neq j\leq p$ are independent, and

\sum_{1\leq i\neq j\leq p}{\mathbf{W}}_{k,i,j}={\mathbf{W}}_{k}.

For any ${\mathbf{a}}\in\mathbb{R}^{(k-1)p}$ , we can write it as

{\mathbf{a}}=\left[\begin{array}[]{c}{\mathbf{a}}^{(1)}\\ \vdots\\ {\mathbf{a}}^{(k-1)}\\ \end{array}\right],

with ${\mathbf{a}}^{(s)}=(a^{(s)}_{1},\ldots,a^{(s)}_{p})^{\top}\in\mathbb{R}^{p}$, $s=1,2,\cdots,k-1$. Then for any $k=2,\ldots,K$ and $1\leq i\neq j\leq p$, we have

$\displaystyle\\|{\mathbf{W}}_{k,i,j}\\|_{2}$	$\displaystyle=$	$\displaystyle\sup_{\\|{\mathbf{a}}\\|_{2}=1}\\|{\mathbf{W}}_{k,i,j}{\mathbf{a}}\\|_{2}$
	$\displaystyle=$	$\displaystyle\sup_{\\|{\mathbf{a}}\\|_{2}=1}\left(\sum_{s=1}^{k-1}\left(\alpha^{k}Y_{i,j}\frac{\partial^{k}l_{i,j}\left(\boldsymbol{\theta}^{\prime}\right)}{\partial\theta_{i}^{s}\partial\theta_{j}^{k-s}}a^{(k-s)}_{j}\right)^{2}\right)^{1/2}$
	$\displaystyle\leq$	$\displaystyle\max_{1\leq s\leq k}\left\|\alpha^{k}Y_{i,j}\frac{\partial^{k}l_{i,j}\left(\boldsymbol{\theta}^{\prime}\right)}{\partial\theta_{i}^{s}\partial\theta_{j}^{k-s}}\right\|\sup_{\\|{\mathbf{a}}\\|_{2}=1}\left(\sum_{s=1}^{k-1}\left(a^{(k-s)}_{j}\right)^{2}\right)^{1/2}$
	$\displaystyle\leq$	$\displaystyle\left(k-1\right)!b_{(p)}.$

On the other hand,

			$\displaystyle\max\left\{\left\\|\sum_{1\leq i\neq j\leq p}{\rm E}\left({\mathbf{W}}_{k,i,j}{\mathbf{W}}_{k,i,j}^{\top}\right)\right\\|_{2},\left\\|\sum_{1\leq i\neq j\leq p}{\rm E}\left({\mathbf{W}}_{k,i,j}^{\top}{\mathbf{W}}_{k,i,j}\right)\right\\|_{2}\right\}$
		$\displaystyle=$	$\displaystyle\max\Bigg{\{}\sup_{\\|{\mathbf{a}}\\|_{2}=1}\left\|{\mathbf{a}}^{\top}\left(\sum_{1\leq i\neq j\leq p}{\rm E}\left({\mathbf{W}}_{k,i,j}{\mathbf{W}}_{k,i,j}^{\top}\right)\right){\mathbf{a}}\right\|,$
			$\displaystyle\sup_{\\|{\mathbf{a}}\\|_{2}=1}\left\|{\mathbf{a}}^{\top}\left(\sum_{1\leq i\neq j\leq p}{\rm E}\left({\mathbf{W}}_{k,i,j}^{\top}{\mathbf{W}}_{k,i,j}\right)\right){\mathbf{a}}\right\|\Bigg{\}}$
		$\displaystyle=$	$\displaystyle\max\Bigg{\{}\sup_{\\|{\mathbf{a}}\\|_{2}=1}\left\|\sum_{1\leq i\neq j\leq p}\sum_{s=1}^{k-1}\left({\rm Var}\left(Y_{i,j}\right)\alpha^{2k}\left\|\frac{\partial^{k}l_{i,j}\left(\boldsymbol{\theta}^{\prime}\right)}{\partial\theta_{i}^{s}\partial\theta_{j}^{k-s}}\right\|^{2}\left(a^{(k-s)}_{j}\right)^{2}\right)\right\|,$
			$\displaystyle\sup_{\\|{\mathbf{a}}\\|_{2}=1}\left\|\sum_{1\leq i\neq j\leq p}\sum_{s=1}^{k-1}\left({\rm Var}\left(Y_{i,j}\right)\alpha^{2k}\left\|\frac{\partial^{k}l_{i,j}\left(\boldsymbol{\theta}^{\prime}\right)}{\partial\theta_{i}^{s}\partial\theta_{j}^{k-s}}\right\|^{2}\left(a^{(s)}_{i}\right)^{2}\right)\right\|\Bigg{\}}$
		$\displaystyle\leq$	$\displaystyle\max_{i,j,s}\left({\rm Var}\left(Y_{i,j}\right)\alpha^{2k}\left\|\frac{\partial^{k}l_{i,j}\left(\boldsymbol{\theta}^{\prime}\right)}{\partial\theta_{i}^{s}\partial\theta_{j}^{k-s}}\right\|^{2}\right)\sup_{\\|{\mathbf{a}}\\|_{2}=1}\left\|\sum_{1\leq i\neq j\leq p}\sum_{s=1}^{k-1}\left(a^{(k-s)}_{i}\right)^{2}+\left(a^{(s)}_{j}\right)^{2}\right\|$
		$\displaystyle\leq$	$\displaystyle 2p\left(\left(k-1\right)!\right)^{2}\sigma_{(p)}^{2}.$

Using the general matrix Bernstein inequality (cf. Theorem 6.17 and equation (6.43) of [33]), we have

	$\displaystyle P\left(\left\\|{\mathbf{W}}_{k}\right\\|_{2}>\epsilon\right)$	$\displaystyle=P\left(\left\\|\sum_{1\leq i<j\leq p}{\mathbf{W}}_{k,i,j}\right\\|_{2}>\epsilon\right)$
		$\displaystyle\leq 2p\ \exp\left(-\frac{\epsilon^{2}}{\left(p-1\right)\left(\left(k-1\right)!\right)^{2}\sigma_{(p)}^{2}+2\left(k-1\right)!b_{(p)}\epsilon}\right).$

Consequently, there exist big enough constants $C_{2}>0$ and $c_{2}>0$ s.t. by choosing

\displaystyle\epsilon=C_{2}\left(k-1\right)!\left(b_{(p)}\log(np)+\sigma_{(p)}\sqrt{p\log(np)}\right),

we have, with probability greater than $1-(np)^{-c_{2}}$,

\|{\mathbf{W}}_{k}\|_{2}\leq C_{2}\left(k-1\right)!\left(b_{(p)}\log(np)+\sigma_{(p)}\sqrt{p\log(np)}\right),

holds for all $k\leq K=\lfloor c_{0}\log(np)\rfloor+1$, where $c_{0}>0$ is a big enough constant. ∎

A.17 Proof of Theorem 6

Proof.

For any vector ${\mathbf{x}}\in\mathbb{R}^{p}$ , we define ${\mathbf{x}}_{-i}:=(x_{1},\cdots,x_{i-1},x_{i+1},\cdots,x_{p})^{\top}$ . With similar arguments as in the proof of Theorem 5, we have that there exists a $\boldsymbol{\theta}^{\xi}$ such that

			$\displaystyle{\mathbf{L}}_{i}\left(\boldsymbol{\theta}\right)-{\mathbf{L}}_{i}\left(\boldsymbol{\theta}^{\prime}\right)$
		$\displaystyle=$	$\displaystyle\frac{1}{p}\sum_{k=1}^{\infty}\left(\frac{1}{k!}\sum_{j=1,\>j\neq i}^{p}\left(Y_{i,j}\sum_{s=0}^{k}\tbinom{k}{s}\left(\theta_{i}-\theta_{i}^{\prime}\right)^{s}\left(\theta_{j}-\theta_{j}^{\prime}\right)^{k-s}\frac{\partial^{k}l_{i,j}\left(\boldsymbol{\theta}^{\prime}\right)}{\partial\theta_{i}^{s}\partial\theta_{j}^{k-s}}\right)\right)$
		$\displaystyle\leq$	$\displaystyle\left\|\sum_{k=1}^{K}\frac{1}{p}\frac{1}{k!}\sum_{j=1,\>j\neq i}^{p}Y_{i,j}\left(\left(\theta_{i}-\theta_{i}^{\prime}\right)^{k}+\left(\theta_{j}-\theta_{j}^{\prime}\right)^{k}\right)\frac{\partial^{k}l_{i,j}\left(\boldsymbol{\theta}^{\prime}\right)}{\partial\theta_{i}^{k}}\right\|$
			$\displaystyle+\left\|\sum_{k=2}^{K}\frac{1}{p}\frac{1}{k!}\sum_{j=1,\>j\neq i}^{p}\left(Y_{i,j}\sum_{s=1}^{k-1}\tbinom{k}{s}\left(\theta_{i}-\theta_{i}^{\prime}\right)^{s}\left(\theta_{j}-\theta_{j}^{\prime}\right)^{k-s}\frac{\partial^{k}l_{i,j}\left(\boldsymbol{\theta}^{\prime}\right)}{\partial\theta_{i}^{s}\partial\theta_{j}^{k-s}}\right)\right\|$
			$\displaystyle+\frac{1}{p}\left\|\frac{1}{\left(K+1\right)!}\sum_{j=1,\>j\neq i}^{p}\left(Y_{i,j}\sum_{s=0}^{K+1}\tbinom{K+1}{s}\left(\theta_{i}-\theta_{i}^{\prime}\right)^{s}\left(\theta_{j}-\theta_{j}^{\prime}\right)^{K+1-s}\frac{\partial^{K+1}l_{i,j}\left(\boldsymbol{\theta}^{\xi}\right)}{\partial\theta_{i}^{s}\partial\theta_{j}^{K+1-s}}\right)\right\|$
		$\displaystyle=$	$\displaystyle S^{(1)}_{i}+S^{(2)}_{i}+S^{(3)}_{i},$

where $K=\lfloor c_{0}\log(np)\rfloor+1$ for some large enough constant $c_{0}$ . First consider $S^{(1)}_{i}$ . There exist big enough constants $C_{1}>0$ , $c_{1}>0$ such that, uniformly for all $i=1,\cdots,p$ , $k=1,2,\cdots,K$ and all $\boldsymbol{\theta}\in{\mathbf{B}}_{\infty}\left(\boldsymbol{\theta}^{\prime},\alpha_{0}\right)$ , we have, with probability greater than $1-(np)^{-c_{1}}$ ,

			$\displaystyle\frac{1}{p}\frac{1}{k!}\left\|\sum_{j=1,\>j\neq i}^{p}Y_{i,j}\left(\left(\theta_{i}-\theta_{i}^{\prime}\right)^{k}+\left(\theta_{j}-\theta_{j}^{\prime}\right)^{k}\right)\frac{\partial^{k}l_{i,j}\left(\boldsymbol{\theta}^{\prime}\right)}{\partial\theta_{i}^{k}}\right\|$
		$\displaystyle\leq$	$\displaystyle\frac{1}{p}\frac{1}{k!}\left\|\frac{\theta_{i}}{\alpha}-\frac{\theta_{i}^{\prime}}{\alpha}\right\|^{k}\left\|\sum_{j\neq i,j=1}^{p}Y_{i,j}\frac{\partial^{k}l_{i,j}\left(\boldsymbol{\theta}^{\prime}\right)}{\partial\theta_{i}^{k}}\alpha^{k}\right\|$
			$\displaystyle+\frac{1}{p}\frac{1}{k!}\sum_{j\neq i,j=1}^{p}\left\|\frac{\theta_{j}}{\alpha}-\frac{\theta_{j}^{\prime}}{\alpha}\right\|^{k}\left\|Y_{i,j}\frac{\partial^{k}l_{i,j}\left(\boldsymbol{\theta}^{\prime}\right)}{\partial\theta_{i}^{k}}\alpha^{k}\right\|$
		$\displaystyle\leq$	$\displaystyle\frac{1}{p}\frac{1}{k!}\left\|\frac{\theta_{i}-\theta^{\prime}_{i}}{\alpha}\right\|^{k}\max_{i}\left\|\sum_{j\neq i,j=1}^{p}Y_{i,j}\frac{\partial^{k}l_{i,j}\left(\boldsymbol{\theta}^{\prime}\right)}{\partial\theta_{i}^{k}}\alpha^{k}\right\|$
			$\displaystyle+\frac{1}{p}\frac{1}{k!}\left\\|\frac{\boldsymbol{\theta}_{-i}-\boldsymbol{\theta}^{\prime}_{-i}}{\alpha}\right\\|_{k}^{k}\max_{i,j}\left\|Y_{i,j}\frac{\partial^{k}l_{i,j}\left(\boldsymbol{\theta}^{\prime}\right)}{\partial\theta_{i}^{k}}\alpha^{k}\right\|$
		$\displaystyle\leq$	$\displaystyle C_{1}\left(\left\|\frac{\theta_{i}-\theta^{\prime}_{i}}{\alpha}\right\|^{k}\frac{b_{(p)}\log(np)+\sigma_{(p)}\sqrt{p\log(np)}}{kp}+\left\\|\frac{\boldsymbol{\theta}_{-i}-\boldsymbol{\theta}^{\prime}_{-i}}{\alpha}\right\\|_{k}^{k}\frac{b_{(p)}}{kp}\right).$

Then, from inequality (A.42) we have, there exist big enough constants $C_{2}>0$ and $c_{2}>0$ , such that with probability greater than $1-(np)^{-c_{2}}$ ,

$\displaystyle S^{(1)}_{i}$	$\displaystyle\leq$	$\displaystyle\sum_{k=1}^{K}\frac{1}{p}\frac{1}{k!}\left\|\sum_{j=1,\>j\neq i}^{p}Y_{i,j}\left(\left(\theta_{i}-\theta_{i}^{\prime}\right)^{k}+\left(\theta_{j}-\theta_{j}^{\prime}\right)^{k}\right)\frac{\partial^{k}l_{i,j}\left(\boldsymbol{\theta}^{\prime}\right)}{\partial\theta_{i}^{k}}\right\|$
	$\displaystyle\leq$	$\displaystyle C_{1}\sum_{k=1}^{K}\left(\left\|\frac{\theta_{i}-\theta^{\prime}_{i}}{\alpha}\right\|^{k}\frac{b_{(p)}\log(np)+\sigma_{(p)}\sqrt{p\log(np)}}{kp}+\left\\|\frac{\boldsymbol{\theta}_{-i}-\boldsymbol{\theta}^{\prime}_{-i}}{\alpha}\right\\|_{k}^{k}\frac{b_{(p)}}{kp}\right)$
	$\displaystyle\leq$	$\displaystyle C_{1}\frac{b_{(p)}\log(np)+\sigma_{(p)}\sqrt{p\log(np)}}{p}\left(\sum_{k=1}^{K}\frac{1}{k}\left\|\frac{\theta_{i}-\theta^{\prime}_{i}}{\alpha}\right\|^{k}\right)+C_{1}\frac{b_{(p)}}{p}\sum_{k=1}^{K}\frac{1}{k}\left\\|\frac{\boldsymbol{\theta}_{-i}-\boldsymbol{\theta}^{\prime}_{-i}}{\alpha}\right\\|_{k}^{k}$
	$\displaystyle\leq$	$\displaystyle C_{2}\frac{b_{(p)}\log(np)+\sigma_{(p)}\sqrt{p\log(np)}}{p}\left\|\theta_{i}-\theta^{\prime}_{i}\right\|+C_{2}\frac{b_{(p)}}{p}\left\\|\boldsymbol{\theta}_{-i}-\boldsymbol{\theta}^{\prime}_{-i}\right\\|_{1},$

holds uniformly for $i=1,\cdots,p$ and $\boldsymbol{\theta}\in{\mathbf{B}}_{\infty}\left(\boldsymbol{\theta}^{\prime},\alpha_{0}\right)$ .

Next we derive an upper bound for $S^{(2)}_{i}$ . With the fact that $Y_{i,j}^{2}\leq b_{(p)}^{2}$ and

{\rm Var}\left(Y_{i,j}^{2}\right)\leq{\rm E}\left(Y_{i,j}^{4}\right)\leq b_{(p)}^{2}{\rm E}\left(Y_{i,j}^{2}\right)=b_{(p)}^{2}\sigma^{2}_{(p)},

by Lemma 6 we have, there exist big enough constants $C_{3},c_{3}>0$ such that with probability greater than $1-(np)^{-c_{3}}$ ,

	$\displaystyle\max_{i}\left(\sum_{j=1,\>j\neq i}^{p}Y_{i,j}^{2}\right)^{1/2}$	$\displaystyle\leq C_{3}\left(b_{(p)}^{2}\log(np)+\sigma_{(p)}b_{(p)}\sqrt{p\log(np)}\right)^{1/2}$
		$\displaystyle\leq C_{3}\left(\left(p\log(np)\right)^{1/4}\sqrt{\sigma_{(p)}b_{(p)}}+b_{(p)}\sqrt{\log(np)}\right).$

Consequently, there exists a big enough constant $C_{4}>0$ such that, uniformly for all $i=1,\cdots,p$ and $\boldsymbol{\theta}\in{\mathbf{B}}_{\infty}\left(\boldsymbol{\theta}^{\prime},\alpha_{0}\right)$ , we have, with probability greater than $1-(np)^{-c_{3}}$ ,

$\displaystyle S^{(2)}_{i}$	$\displaystyle\leq$	$\displaystyle\sum_{k=2}^{K}\frac{1}{p}\frac{1}{k!}\sum_{j=1,\>j\neq i}^{p}\left\|\left(Y_{i,j}\sum_{s=1}^{k-1}\tbinom{k}{s}\left(\theta_{i}-\theta_{i}^{\prime}\right)^{s}\left(\theta_{j}-\theta_{j}^{\prime}\right)^{k-s}\frac{\partial^{k}l_{i,j}\left(\boldsymbol{\theta}^{\prime}\right)}{\partial\theta_{i}^{s}\partial\theta_{j}^{k-s}}\right)\right\|$
	$\displaystyle\leq$	$\displaystyle\sum_{k=2}^{K}\frac{1}{p}\frac{1}{k!}0.5^{k}\left(\sum_{j=1,\>j\neq i}^{p}Y_{i,j}^{2}\sum_{s=1}^{k-1}\left(\frac{\theta_{i}-\theta_{i}^{\prime}}{\alpha/2}\right)^{2s}\right)^{1/2}$
		$\displaystyle\quad\left(\sum_{j=1,\>j\neq i}^{p}\sum_{s=1}^{k-1}\left(\tbinom{k}{s}\left(\frac{\theta_{j}-\theta_{j}^{\prime}}{\alpha/2}\right)^{k-s}\alpha^{k}\frac{\partial^{k}l_{i,j}\left(\boldsymbol{\theta}^{\prime}\right)}{\partial\theta_{i}^{s}\partial\theta_{j}^{k-s}}\right)^{2}\right)^{1/2}$
	$\displaystyle\leq$	$\displaystyle\sum_{k=2}^{K}\frac{1}{p}\frac{1}{k!}0.5^{k}\left(\frac{\left\|\theta_{i}-\theta^{\prime}_{i}\right\|^{2}}{\alpha^{2}/4-\left\|\theta_{i}-\theta^{\prime}_{i}\right\|^{2}}\right)^{1/2}\max_{i}\left(\sum_{j=1,\>j\neq i}^{p}Y_{i,j}^{2}\right)^{1/2}$
		$\displaystyle\quad\left(\sum_{s=1}^{k-1}\tbinom{k}{s}^{2}\sum_{j=1,\>j\neq i}^{p}\left(\frac{\theta_{j}-\theta_{j}^{\prime}}{\alpha/2}\right)^{2(k-s)}\right)^{1/2}\max_{j,s,k}\left\|\alpha^{k}\frac{\partial^{k}l_{i,j}\left(\boldsymbol{\theta}^{\prime}\right)}{\partial\theta_{i}^{s}\partial\theta_{j}^{k-s}}\right\|$
	$\displaystyle\leq$	$\displaystyle C_{4}\left\|\theta_{i}-\theta^{\prime}_{i}\right\|\frac{\left(p\log(np)\right)^{1/4}\sqrt{\sigma_{(p)}b_{(p)}}+b_{(p)}\sqrt{\log(np)}}{p}$
		$\displaystyle\quad\sum_{k=2}^{K}\frac{0.5^{k}}{k}\left(\sum_{s=1}^{k-1}\tbinom{k}{s}^{2}\sum_{j=1,\>j\neq i}^{p}\left(\frac{\theta_{j}-\theta_{j}^{\prime}}{\alpha/2}\right)^{2(k-s)}\right)^{1/2}$
	$\displaystyle\leq$	$\displaystyle C_{4}\left\|\theta_{i}-\theta^{\prime}_{i}\right\|\frac{\left(p\log(np)\right)^{1/4}\sqrt{\sigma_{(p)}b_{(p)}}+b_{(p)}\sqrt{\log(np)}}{p}$
		$\displaystyle\times\sum_{k=2}^{K}\sum_{s=1}^{k-1}\sum_{j=1,\>j\neq i}^{p}\frac{0.5^{k}}{s}\tbinom{k}{s}\left\|\frac{\theta_{j}-\theta_{j}^{\prime}}{\alpha/2}\right\|^{k-s}$
	$\displaystyle\leq$	$\displaystyle C_{4}\left\|\theta_{i}-\theta^{\prime}_{i}\right\|\frac{\left(p\log(np)\right)^{1/4}\sqrt{\sigma_{(p)}b_{(p)}}+b_{(p)}\sqrt{\log(np)}}{p}$
		$\displaystyle\times\sum_{s=1}^{K-1}\left(\frac{1}{s}\left\\|\frac{\boldsymbol{\theta}_{-i}-\boldsymbol{\theta}_{-i}^{\prime}}{\alpha/2}\right\\|_{s}^{s}\sum_{k=s+1}^{K}\tbinom{k}{s}0.5^{k}\right)$
	$\displaystyle\leq$	$\displaystyle 2C_{4}\left\|\theta_{i}-\theta^{\prime}_{i}\right\|\frac{\left(p\log(np)\right)^{1/4}\sqrt{\sigma_{(p)}b_{(p)}}+b_{(p)}\sqrt{\log(np)}}{p}\sum_{s=1}^{K-1}\frac{1}{s}\left\\|\frac{\boldsymbol{\theta}_{-i}-\boldsymbol{\theta}_{-i}^{\prime}}{\alpha/2}\right\\|_{s}^{s}$
	$\displaystyle\leq$	$\displaystyle 4C_{4}\left\\|\boldsymbol{\theta}_{-i}-\boldsymbol{\theta}_{-i}^{\prime}\right\\|_{1}\left\|\theta_{i}-\theta^{\prime}_{i}\right\|\frac{\left(p\log(np)\right)^{1/4}\sqrt{\sigma_{(p)}b_{(p)}}+b_{(p)}\sqrt{\log(np)}}{p}.$

Here in the above inequalities we have used the fact that $\sum_{k=s+1}^{K}\tbinom{k}{s}0.5^{k}<2$, and the last step follows from inequality (A.42).

Finally, we derive an upper bound for $S^{(3)}_{i}$. By condition (L-A1) and the choice $K=\lfloor c_{0}\log(np)\rfloor+1$ with $c_{0}$ a large enough constant, there exists a big enough constant $c_{4}>0$ such that, uniformly for all $i=1,\cdots,p$, $\boldsymbol{\theta}^{\xi}$ and $\boldsymbol{\theta}\in{\mathbf{B}}_{\infty}\left(\boldsymbol{\theta}^{\prime},\alpha_{0}\right)$, we have

$\displaystyle S^{(3)}_{i}$	$\displaystyle=$	$\displaystyle\frac{1}{p}\left\|\frac{1}{\left(K+1\right)!}\sum_{j=1,\>j\neq i}^{p}\left(Y_{i,j}\sum_{s=0}^{K+1}\tbinom{K+1}{s}\left(\theta_{i}-\theta_{i}^{\prime}\right)^{s}\left(\theta_{j}-\theta_{j}^{\prime}\right)^{K+1-s}\frac{\partial^{K+1}l_{i,j}\left(\boldsymbol{\theta}^{\xi}\right)}{\partial\theta_{i}^{s}\partial\theta_{j}^{K+1-s}}\right)\right\|$
	$\displaystyle\leq$	$\displaystyle\frac{1}{p}\frac{1}{K+1}\sum_{j=1,\>j\neq i}^{p}\left\|Y_{i,j}\right\|\left(\frac{\left\|\theta_{i}-\theta_{i}^{\prime}\right\|}{\alpha}+\frac{\left\|\theta_{j}-\theta_{j}^{\prime}\right\|}{\alpha}\right)^{K+1}$
	$\displaystyle\leq$	$\displaystyle\frac{b_{(p)}}{K+1}\left(\frac{2\left\\|\boldsymbol{\theta}-\boldsymbol{\theta}^{\prime}\right\\|_{\infty}}{\alpha}\right)^{K+1}$
	$\displaystyle\leq$	$\displaystyle\frac{b_{(p)}}{K+1}\left(\frac{2\alpha_{0}}{\alpha}\right)^{K+1}$
	$\displaystyle\leq$	$\displaystyle b_{(p)}\left(np\right)^{-c_{4}}.$

Here the last step holds if we choose $c_{0}$ large enough that $(2\alpha_{0}/\alpha)^{c_{0}/c_{4}}<e^{-1}$. When $np\rightarrow\infty$ and $c_{0},c_{4}$ are chosen large enough, this bound is dominated by the upper bounds for $S^{(1)}_{i}$ and $S^{(2)}_{i}$.

Consequently, by choosing $K=\lfloor c_{0}\log(np)\rfloor+1$ with $c_{0}$ a large enough constant, we conclude that, for any given $\boldsymbol{\theta}^{\prime}$, there exist large enough constants $C,c>0$ such that, uniformly for all $\boldsymbol{\theta}\in{\mathbf{B}}_{\infty}\left(\boldsymbol{\theta}^{\prime},\alpha_{0}\right)$ and $1\leq i\leq p$, with probability greater than $1-(np)^{-c}$,

			$\displaystyle\left\|{\mathbf{L}}_{i}\left(\boldsymbol{\theta}\right)-{\mathbf{L}}_{i}\left(\boldsymbol{\theta}^{\prime}\right)\right\|$
		$\displaystyle\leq$	$\displaystyle C_{2}\frac{b_{(p)}\log(np)+\sigma_{(p)}\sqrt{p\log(np)}}{p}\left\|\theta_{i}-\theta^{\prime}_{i}\right\|+C_{2}\frac{b_{(p)}}{p}\left\\|\boldsymbol{\theta}_{-i}-\boldsymbol{\theta}^{\prime}_{-i}\right\\|_{1}$
			$\displaystyle+C_{4}\left\\|\boldsymbol{\theta}_{-i}-\boldsymbol{\theta}_{-i}^{\prime}\right\\|_{1}\left\|\theta_{i}-\theta^{\prime}_{i}\right\|\frac{\left(p\log(np)\right)^{1/4}\sqrt{\sigma_{(p)}b_{(p)}}+b_{(p)}\sqrt{\log(np)}}{p}$
		$\displaystyle\leq$	$\displaystyle C\frac{b_{(p)}}{p}\left\\|\boldsymbol{\theta}_{-i}-\boldsymbol{\theta}_{-i}^{\prime}\right\\|_{1}+C\left(\left\\|\boldsymbol{\theta}_{-i}-\boldsymbol{\theta}_{-i}^{\prime}\right\\|_{1}+1\right)\left\|\theta_{i}-\theta^{\prime}_{i}\right\|\frac{b_{(p)}\log(np)+\sigma_{(p)}\sqrt{p\log(np)}}{p}.$

∎

Appendix B Additional numerical results

B.1 Informal justification of the use of TWHM in analyzing the insecta-ant-colony4 dataset

To motivate the use of the proposed TWHM for the analysis of the insecta-ant-colony4 dataset, we plot the autocorrelation function (ACF) and the partial autocorrelation function (PACF) of the degree sequences of two ants. These ants are selected based on their respective highest and lowest values at time $t=1$ according to

$\displaystyle\sum_{t=1}^{n}\left|\frac{p-1}{2}-\sum_{j=1,\>j\neq i}^{p}X_{i,j}^{t}\right|.$

Figure 5 shows that both degree sequences exhibit patterns reminiscent of a first-order autoregressive model with long memory. This observation motivates the use of TWHM for this dataset.
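
The selection statistic above can be computed directly from the adjacency snapshots. The following is a minimal sketch, assuming the snapshots are stored as a NumPy array `X` of shape `(n, p, p)` with symmetric 0/1 entries and a zero diagonal; the function name `degree_deviation` and the toy data are illustrative and do not come from the paper.

```python
import numpy as np

def degree_deviation(X):
    """Return, for each node i, sum_t |(p-1)/2 - sum_{j != i} X[t, i, j]|.

    X is assumed to have shape (n, p, p), with 0/1 entries and a zero diagonal.
    """
    n, p, _ = X.shape
    degrees = X.sum(axis=2)              # shape (n, p): degree of node i at time t
    return np.abs((p - 1) / 2 - degrees).sum(axis=0)

# Toy example (hypothetical data): n = 2 snapshots on p = 3 nodes.
X = np.zeros((2, 3, 3), dtype=int)
X[0, 0, 1] = X[0, 1, 0] = 1              # first snapshot: edge 0-1 only
X[1, 0, 1] = X[1, 1, 0] = 1              # second snapshot: edges 0-1 and 0-2
X[1, 0, 2] = X[1, 2, 0] = 1

score = degree_deviation(X)
most_extreme = int(np.argmax(score))     # node with the largest statistic
least_extreme = int(np.argmin(score))    # node with the smallest statistic
```

The degree sequences of the two selected nodes, `X[:, most_extreme, :].sum(axis=1)` and `X[:, least_extreme, :].sum(axis=1)`, are then the series whose ACF and PACF would be plotted.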

B.2 Community detection under stochastic block structures

We conduct additional numerical studies to assess the efficacy of community detection using the estimated $\beta$-parameters. To implement stochastic block structures within the TWHM framework, we partition the $p$ nodes into $k$ communities of equal size and assign identical parameters $(\beta_{i,0},\beta_{i,1})$ to all nodes $i$ within the same community. We also consider scenarios where the networks are independently generated from a Stochastic Block Model (SBM). Specifically, we consider the following settings:

Setting 1: The networks are generated under TWHM with $\beta_{i,r}=-0.2$, $0$ and $0.2$ $(r=0,1)$ for all nodes $i$ in communities $1$, $2$ and $3$, respectively.
Setting 2: The networks are generated under TWHM with $\beta_{i,r}=-0.4$, $0$ and $0.4$ $(r=0,1)$ for all nodes $i$ in communities $1$, $2$ and $3$, respectively.
Setting 3: The networks are independently generated under SBM, with the probability matrix among different communities specified as

$\displaystyle\left[\begin{array}{ccc}0.26&0.1&0.1\\ 0.1&0.2&0.1\\ 0.1&0.1&0.14\end{array}\right].$

Setting 4: The networks are independently generated under SBM, with the probability matrix among different communities given by

$\displaystyle\left[\begin{array}{ccc}0.4&0.3&0.2\\ 0.3&0.225&0.15\\ 0.2&0.15&0.1\end{array}\right].$

Setting 5: The networks are independently generated under SBM, with the probability matrix among different communities given by

$\displaystyle\left[\begin{array}{ccc}0.9&0.5&0.3\\ 0.5&0.3&0.2\\ 0.3&0.2&0.15\end{array}\right].$

Networks in Settings 1-2 are generated with autoregressive dependence using our TWHM, while networks in Setting 3 are independent samples following the classical SBM structure. In Settings 4-5, networks are also generated with SBM structures, but the connection probability matrices are not full-rank.
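
The rank claims for Settings 3-5 can be checked numerically. The sketch below copies the three probability matrices from the settings above and computes their ranks with NumPy; the tolerance `1e-8` is an assumption of this sketch, chosen to ignore floating-point noise in the singular values.

```python
import numpy as np

# Connection probability matrices of Settings 3-5, copied from the text.
settings = {
    3: np.array([[0.26, 0.1, 0.1], [0.1, 0.2, 0.1], [0.1, 0.1, 0.14]]),
    4: np.array([[0.4, 0.3, 0.2], [0.3, 0.225, 0.15], [0.2, 0.15, 0.1]]),
    5: np.array([[0.9, 0.5, 0.3], [0.5, 0.3, 0.2], [0.3, 0.2, 0.15]]),
}

# Numerical rank of each matrix; singular values below the (assumed)
# tolerance are treated as zero.
ranks = {s: int(np.linalg.matrix_rank(B, tol=1e-8)) for s, B in settings.items()}
print(ranks)
```

Run as written, this should report rank 3 for Setting 3, rank 1 for Setting 4 and rank 2 for Setting 5, consistent with the statement that only the matrices in Settings 4-5 are rank-deficient.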
Once the $\beta$-parameters are estimated, we apply k-means clustering to these parameters to cluster the $p$ nodes, denoting this method as "TWHM-Cluster". For comparison, we apply spectral clustering, widely used for SBM, to the averaged networks, denoting this method as "SBM-Spectral". All experiments are repeated 100 times, and the clustering accuracy is reported in Table 7 below.
From Table 7, we observe that community detection using the $\beta$-parameters performs significantly better under Settings 1 and 2, where the data are generated from our TWHM. This improvement is attributed to the fact that the parameter estimation accounts for the autoregressive structure of the networks.
When networks are independently generated from SBM under Setting 3, the performance of TWHM-Cluster is comparable to that of SBM-Spectral. However, when the probability matrix of SBM is not full-rank (Settings 4 and 5), TWHM-Cluster still demonstrates promising performance, while classical spectral clustering can be much less satisfactory.
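
The TWHM-Cluster step and the reported accuracy can be sketched as follows. This is only an illustration under stated assumptions: the estimates `beta_hat` are synthetic (the true values of Setting 2 plus Gaussian noise) rather than the output of the actual estimator, the hand-written `kmeans` with farthest-point initialisation stands in for whichever k-means implementation was used, and the accuracy is assumed to be maximised over relabellings of the communities, which the text does not state explicitly.

```python
import itertools
import numpy as np

def kmeans(points, k, n_iter=50):
    """Plain Lloyd's algorithm with a deterministic farthest-point initialisation."""
    centers = [points[0]]
    for _ in range(k - 1):
        # Next centre: the point farthest from all centres chosen so far.
        dist = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(dist)])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        # Assign each point to its nearest centre, then recompute the centres.
        dist = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = points[labels == c].mean(axis=0)
    return labels

def clustering_accuracy(true_labels, est_labels, k):
    """Fraction of correctly clustered nodes, maximised over label permutations."""
    best = 0.0
    for perm in itertools.permutations(range(k)):
        mapped = np.array(perm)[est_labels]
        best = max(best, float(np.mean(mapped == true_labels)))
    return best

# Hypothetical estimates: true values -0.4, 0, 0.4 (as in Setting 2) plus small noise.
rng = np.random.default_rng(0)
k, p = 3, 300
true_labels = np.repeat(np.arange(k), p // k)
beta_true = np.array([-0.4, 0.0, 0.4])[true_labels]
beta_hat = np.column_stack([beta_true, beta_true]) + rng.normal(scale=0.05, size=(p, 2))

est_labels = kmeans(beta_hat, k)          # cluster nodes on (beta_{i,0}, beta_{i,1})
accuracy = clustering_accuracy(true_labels, est_labels, k)
```

Enumerating all permutations is only practical for a small number of communities such as $k=3$; for larger $k$ a matching algorithm would be the usual replacement.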

Table 7: Mean clustering accuracy of TWHM-Cluster and SBM-Spectral over 100 replications. Here $k$, $n$ and $p$ denote the number of communities, the number of network observations, and the number of nodes, respectively.

	$(k,n,p)$	TWHM-Cluster	SBM-Spectral
Setting 1	(3,2,300)	68.6%	37.1%
	(3,10,300)	95.1%	37.0%
	(3,50,300)	99.5%	38.7%
Setting 2	(3,2,300)	92.2%	39.7%
	(3,10,300)	95.6%	48.6%
	(3,50,300)	100.0%	63.0%
Setting 3	(3,10,300)	92.1%	99.8%
	(3,30,300)	99.4%	100.0%
	(3,50,300)	99.3%	100.0%
Setting 4	(3,10,500)	97.0%	37.1%
	(3,30,500)	93.9%	37.0%
	(3,50,500)	100.0%	37.5%
Setting 5	(3,10,500)	80.0%	71.2%
	(3,30,500)	80.2%	70.7%
	(3,50,500)	83.8%	72.8%

B.3 Dynamic protein-protein interaction networks

In this section, we apply the proposed TWHM to 12 dynamic protein-protein interaction networks (PPINs) of yeast cells examined in [5]. Each dynamic network comprises 36 network observations. The objective of investigating protein-protein interactions is to gain insights into the cellular function and machinery of a proteome [34]. Table 8 presents selected statistics of these datasets.

Table 8: Some statistics of the 12 protein-protein interaction network datasets.

Dataset	# of Nodes	Mean degree	Density
DPPIN-Uetz	922	4.68	0.22%
DPPIN-Ito	2856	6.05	0.07%
DPPIN-Ho	1548	54.55	0.13%
DPPIN-Gavin	2541	110.22	0.08%
DPPIN-Krogan (LCMS)	2211	77.01	0.09%
DPPIN-Krogan (MALDI)	2099	74.60	0.10%
DPPIN-Yu	1163	6.19	0.17%
DPPIN-Breitkreutz	869	90.33	0.23%
DPPIN-Babu	5003	44.56	0.04%
DPPIN-Lambert	697	19.09	0.29%
DPPIN-Tarassov	1053	9.17	0.19%
DPPIN-Hazbun	143	27.40	1.40%

We use our method for link prediction on these PPINs. As in the main paper, we use TWHM with either a fixed cutoff point of $0.5$ (TWHM_0.5) or the adaptive rule (TWHM_adaptive). For comparison, we use the naive estimator ${\mathbf{X}}^{t-1}$ to predict ${\mathbf{X}}^{t}$ (Naive). The training window size is set to $n=10$, and the results are presented in Table 9 below. On all 12 datasets, both TWHM variants attain a lower misclassification rate than the naive predictor.
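
The comparison criterion can be illustrated on toy data. The sketch below assumes that the misclassification rate is the fraction of node pairs $i<j$ whose link status is predicted incorrectly, and that the fixed-cutoff rule predicts a link whenever a fitted link probability exceeds $0.5$; the function name, the snapshots and the matrix `P_hat` are made-up placeholders, not the output of a fitted TWHM.

```python
import numpy as np

def misclassification_rate(X_pred, X_true):
    """Fraction of node pairs i < j whose link status is predicted incorrectly."""
    iu = np.triu_indices(X_true.shape[0], k=1)   # upper triangle, no diagonal
    return float(np.mean(X_pred[iu] != X_true[iu]))

# Toy snapshots (hypothetical data) on p = 4 nodes.
X_prev = np.array([[0, 1, 0, 0],
                   [1, 0, 1, 0],
                   [0, 1, 0, 0],
                   [0, 0, 0, 0]])
X_next = np.array([[0, 1, 0, 0],
                   [1, 0, 0, 0],
                   [0, 0, 0, 1],
                   [0, 0, 1, 0]])

# Naive predictor: carry the previous snapshot forward.
naive_rate = misclassification_rate(X_prev, X_next)

# Fixed-cutoff rule: predict a link when the (placeholder) probability exceeds 0.5.
P_hat = np.array([[0.0, 0.9, 0.1, 0.1],
                  [0.9, 0.0, 0.4, 0.1],
                  [0.1, 0.4, 0.0, 0.6],
                  [0.1, 0.1, 0.6, 0.0]])
cutoff_rate = misclassification_rate((P_hat > 0.5).astype(int), X_next)
```

In this toy case the naive predictor misclassifies two of the six node pairs, while the thresholded placeholder probabilities happen to match the next snapshot exactly.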

Table 9: Comparison of TWHM_0.5, TWHM_adaptive, and Naive in terms of misclassification rate of links ($\times 10^{-5}$).

Dataset	TWHM_0.5	TWHM_adaptive	Naive
DPPIN-Uetz	55	55	88
DPPIN-Ito	28	29	50
DPPIN-Ho	108	111	194
DPPIN-Gavin	139	142	250
DPPIN-Krogan (LCMS)	133	141	222
DPPIN-Krogan (MALDI)	124	127	215
DPPIN-Yu	54	55	86
DPPIN-Breitkreutz	297	305	493
DPPIN-Babu	27	28	45
DPPIN-Lambert	293	298	489
DPPIN-Tarassov	72	75	133
DPPIN-Hazbun	759	772	1198

