
Asymptotic distribution-free change-point detection for data with repeated observations

Hoseung Song ([email protected]) and Hao Chen ([email protected]), University of California, Davis
Abstract

In change-point detection, a nonparametric framework based on scan statistics over similarity graphs on the observations has been gaining attention due to its flexibility and good performance for the high-dimensional and non-Euclidean data sequences that are ubiquitous in this big data era. However, this graph-based framework runs into problems when the sequence contains repeated observations, which is common for discrete data, such as network data. In this work, we extend the graph-based framework to handle repeated observations by averaging or taking the union of all possible optimal graphs resulting from the repeated observations. We consider both the single change-point alternative and the changed-interval alternative, and derive analytic formulas to control the type I error of the new methods, making them readily applicable to large datasets. The extended methods are illustrated on an application in detecting changes in a sequence of dynamic networks over time. All proposed methods are implemented in an R package gSeg available on CRAN.

Keywords: discrete data, categorical data, nonparametrics, scan statistics, tail probability, high-dimensional data, non-Euclidean data.


1 Introduction

1.1 Background

Change-point analysis plays a significant role in various fields where a sequence of observations is collected. In general, the problem concerns testing whether a change, or possibly several changes, occurred, and identifying the locations of any such changes. In this paper, we consider the offline change-point detection problem where a sequence of independent observations {yi}i=1,…,n\{y_{i}\}_{i=1,\ldots,n} is completely observed at the time when data analysis is conducted. Here, nn is the length of the sequence, and ii is the time index or another meaningful index depending on the specific application. We consider testing the null hypothesis

H_{0}:y_{i}\sim F_{0},\ i=1,\ldots,n, (1)

against the single change-point alternative

H_{1}:\exists\ 1\leq\tau\leq n,\ y_{i}\sim\begin{cases}F_{1}, & i>\tau\\ F_{0}, & \text{otherwise,}\end{cases} (2)

or the changed-interval alternative

H_{2}:\exists\ 1\leq\tau_{1}\leq\tau_{2}\leq n,\ y_{i}\sim\begin{cases}F_{1}, & i=\tau_{1}+1,\ldots,\tau_{2}\\ F_{0}, & \text{otherwise,}\end{cases} (3)

where F0F_{0} and F1F_{1} are two different distributions.

This problem has been extensively studied for univariate data; see Chen & Gupta (2011) for a survey. However, in many modern applications, yiy_{i} is high-dimensional or even non-Euclidean. For high-dimensional data, most methods are based on parametric models; see for example Zhang et al. (2010); Wang et al. (2017); Wang & Samworth (2018). To apply these methods, the data sequence needs to follow specific parametric models. In the nonparametric domain, Harchaoui et al. (2009) made use of kernel methods, Lung-Yut-Fong et al. (2011) utilized marginal rankings, and Matteson & James (2014) made use of distances among observations. These methods can be applied to a wider range of problems than parametric methods. However, theoretical analysis of nonparametric methods is in general difficult, and none of these nonparametric methods provides analytic formulas for false discovery control, making them difficult to apply to large datasets.

1.2 Graph-based change-point methods

Recently, Chen & Zhang (2015) and Chu & Chen (2019) developed a graph-based framework to detect change-points in high-dimensional and non-Euclidean data sequences. This framework is based on a similarity graph GG constructed on the sample space, such as a minimum spanning tree (MST), a spanning tree connecting all observations that minimizes the sum of the distances across its edges. Scanning over tt, the test statistics rely on three basic quantities, R0,G(t)R_{0,G}(t), R1,G(t)R_{1,G}(t), and R2,G(t)R_{2,G}(t), where for each tt, R0,G(t)R_{0,G}(t) is the number of edges connecting observations before and after tt, R1,G(t)R_{1,G}(t) is the number of edges whose endpoints are both prior to tt, and R2,G(t)R_{2,G}(t) is the number of edges whose endpoints are both after tt. Four scan statistics were then studied: the original edge-count scan statistic Z(t)Z(t), the weighted edge-count scan statistic Zw(t)Z_{w}(t), the generalized edge-count scan statistic S(t)S(t), and the max-type edge-count scan statistic M(t)M(t), which target various alternatives. For detailed comparisons, see Chu & Chen (2019).

While the methods proposed by Chen & Zhang (2015) and Chu & Chen (2019) work well for continuous data, they are problematic for data with repeated observations, which is common for discrete data, such as network data. The reason is that these methods depend on the similarity graph constructed on the observations; when there are repeated observations, the similarity graph is in general not uniquely defined, and the methods become unreliable. For example, Chen & Zhang (2015) analyzed a phone-call network dataset, where the task was to test whether there is any change in the dynamics of the networks over time; they used the MST as the similarity graph. In this dataset, there are in total 330 networks {y1,…,y330}\{y_{1},\ldots,y_{330}\} in the sequence, of which 290 are distinct; for example, y1y_{1}, y6y_{6}, and y16y_{16} are exactly the same. All repeated observations are listed in Supplement A. As a result, there are numerous ways of assigning edges in the MST, so the MST is not uniquely defined, and the existing graph-based methods are unreliable since they are formulated on a uniquely defined similarity graph on the pooled observations.

Table 1: The pp-values and corresponding test statistics (in parentheses) for four testing procedures proposed in Chen & Zhang (2015) and Chu & Chen (2019): an original edge-count scan statistic (maxn0tn1Z0(t))(\max_{n_{0}\leq t\leq n_{1}}Z_{0}(t)), a generalized edge-count scan statistic (maxn0tn1S(t))(\max_{n_{0}\leq t\leq n_{1}}S(t)), a weighted edge-count scan statistic (maxn0tn1Zw(t))(\max_{n_{0}\leq t\leq n_{1}}Z_{w}(t)), and a max-type edge-count scan statistic (maxn0tn1M(t))(\max_{n_{0}\leq t\leq n_{1}}M(t)). Here, n0n_{0} is set to be 0.05n\lceil 0.05n\rceil = 17 and n1=330n0n_{1}=330-n_{0}, where x\lceil x\rceil denotes the smallest integer greater than or equal to xx.
MST #1 MST #2 MST #3
maxn0tn1Z0(t)\max_{n_{0}\leq t\leq n_{1}}Z_{0}(t) 0.09 (2.32) 0.91 (0.92) 0.51 (1.57)
maxn0tn1S(t)\max_{n_{0}\leq t\leq n_{1}}S(t) 0.04 (13.61) 0.08 (12.31) 0.01 (16.36)
maxn0tn1Zw(t)\max_{n_{0}\leq t\leq n_{1}}Z_{w}(t) 0.44 (2.11) 0.02 (3.49) 0.88 (1.54)
maxn0tn1M(t)\max_{n_{0}\leq t\leq n_{1}}M(t) 0.09 (3.05) 0.02 (3.49) 0.05 (3.27)

Table 1 lists the test statistics and their corresponding pp-values for the four testing procedures proposed in Chen & Zhang (2015) and Chu & Chen (2019) on three randomly chosen MSTs. We see that the pp-value depends heavily on the choice of the MST: it can be very small under one MST but very large under another, leading to completely different conclusions about whether the sequence is homogeneous. Moreover, since the number of possible MSTs is huge, it is impractical to compute the test statistics on all possible MSTs directly to obtain a summary.

1.3 Our contribution

We extend the methods in Chen & Zhang (2013) and Zhang & Chen (2017) to the change-point setting and propose new graph-based testing procedures that deal with repeated observations properly. This work fills the gap in the graph-based framework for dealing with discrete data. We show that the new tests are asymptotically distribution-free under the null hypothesis of no change, and we reveal that the limiting distributions of the two approaches in Chen & Zhang (2013) are the same, even for continuous data. We also derive analytic formulas to approximate the permutation pp-values of the modified test statistics, making them readily applicable to large datasets. To improve the analytical pp-value approximations for finite sample sizes, skewness correction is also performed. We show that the proposed tests work well in detecting changes when the data has repeated observations, and we illustrate the new testing procedures through an analysis of the phone-call network dataset. The new methods are implemented in an R package gSeg available on CRAN.

2 Notations and related existing works

For data with repeated observations, we represent the data using a contingency table for each tt. Suppose that there are in total nn observations and KK distinct values, which we also refer to as categories in the following. Each tt divides the sequence into two groups: before or at time tt (Group 1) and after time tt (Group 2). Let nik(t)n_{ik}(t) be the number of observations in group i(i=1,2)i\ (i=1,2) and category k(k=1,…,K)k\ (k=1,\ldots,K), and let mk(k=1,…,K)m_{k}\ (k=1,\ldots,K) be the number of observations in category kk. Notice that mk=n1k(t)+n2k(t)(k=1,…,K)m_{k}=n_{1k}(t)+n_{2k}(t)\ (k=1,\ldots,K), k=1Kmk=n\sum_{k=1}^{K}m_{k}=n, k=1Kn1k(t)=t\sum_{k=1}^{K}n_{1k}(t)=t, and k=1Kn2k(t)=nt\sum_{k=1}^{K}n_{2k}(t)=n-t.

Table 2: Notations at time tt
Index of distinct values 1 2 \cdots KK Total
Group 1 n11(t)n_{11}(t) n12(t)n_{12}(t) \cdots n1K(t)n_{1K}(t) tt
Group 2 n21(t)n_{21}(t) n22(t)n_{22}(t) \cdots n2K(t)n_{2K}(t) ntn-t
Total m1m_{1} m2m_{2} \cdots mKm_{K} nn
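The counts in Table 2 can be assembled directly from the sequence. The sketch below is illustrative Python (the paper's own implementation is the R package gSeg); the function name `counts_at_t` and the 1-indexed split convention are ours.

```python
import numpy as np

def counts_at_t(y, t):
    """Contingency-table counts at a candidate split t (1-indexed).

    y : sequence of hashable observations (repeated values allowed).
    Returns (categories, n1, n2, m): the K distinct values, the counts
    n_{1k}(t) (observations up to and including t), n_{2k}(t)
    (observations after t), and the totals m_k = n_{1k}(t) + n_{2k}(t).
    """
    categories = sorted(set(y))                  # the K distinct values
    idx = {c: k for k, c in enumerate(categories)}
    n, K = len(y), len(categories)
    n1 = np.zeros(K, dtype=int)
    n2 = np.zeros(K, dtype=int)
    for i, v in enumerate(y):                    # i is 0-indexed, so
        (n1 if i < t else n2)[idx[v]] += 1       # group 1 is i+1 <= t
    m = n1 + n2
    # the identities stated in the text
    assert n1.sum() == t and n2.sum() == n - t and m.sum() == n
    return categories, n1, n2, m
```

For example, `counts_at_t([1, 2, 2, 3, 1, 3], 2)` gives n1 = (1, 1, 0) and n2 = (1, 1, 2) over the categories (1, 2, 3).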

In Chen & Zhang (2013) and Zhang & Chen (2017), the authors studied ways to extend the underlying graph-based two-sample tests to accommodate data with repeated observations under the two-sample testing framework. When the data has repeated observations, the similarity graph is not uniquely defined under common optimization criteria, such as the MST, leading to multiple optimal graphs. The authors considered two ways to incorporate information from these graphs: averaging and union. To be more specific, they first construct the similarity graph on the distinct values, denoted by C0C_{0}. Here, C0C_{0} could be the MST on all distinct values, the nearest neighbor link, the union of all possible MSTs on the distinct values when the MST on the distinct values is not unique, or some other meaningful graph. Then, an optimal graph initiated from C0C_{0} is defined in the following way: for each edge (k1,k2)C0(k_{1},k_{2})\in C_{0}, randomly choose an observation with the value indexed by k1k_{1} and an observation with the value indexed by k2k_{2}, and connect these two observations; then, for each ki(i=1,2)k_{i}\ (i=1,2), if there is more than one observation (repeated observations) with the value indexed by kik_{i}, connect them by a spanning tree (each edge in this spanning tree has distance 0). More detailed explanations of C0C_{0} are provided in Supplement B. Based on these optimal graphs, the averaging statistic is defined by averaging the test statistic over all optimal graphs, and the union statistic is defined by calculating the test statistic on the union of all optimal graphs.

3 Proposed tests

3.1 Extended test statistics for data with repeated observations

Here, we focus on extending the weighted, generalized, and max-type test statistics to data with repeated observations; these turn out to be asymptotically distribution-free. Details on extending the original edge-count test are in Supplement E. Based on the two-sample test statistics in Chen & Zhang (2013) and Zhang & Chen (2017), we define the extended basic quantities at time tt under the averaging approach as follows:

R1,(a)(t)\displaystyle R_{1,(a)}(t) =k=1Kn1k(t)(n1k(t)1)mk+(u,v)C0n1u(t)n1v(t)mumv,\displaystyle=\sum_{k=1}^{K}\frac{n_{1k}(t)\left(n_{1k}(t)-1\right)}{m_{k}}+\sum_{(u,v)\in C_{0}}\frac{n_{1u}(t)n_{1v}(t)}{m_{u}m_{v}}, (4)
R2,(a)(t)\displaystyle R_{2,(a)}(t) =k=1Kn2k(t)(n2k(t)1)mk+(u,v)C0n2u(t)n2v(t)mumv,\displaystyle=\sum_{k=1}^{K}\frac{n_{2k}(t)\left(n_{2k}(t)-1\right)}{m_{k}}+\sum_{(u,v)\in C_{0}}\frac{n_{2u}(t)n_{2v}(t)}{m_{u}m_{v}}, (5)

and under the union approach as follows:

R1,(u)(t)\displaystyle R_{1,(u)}(t) =k=1Kn1k(t)(n1k(t)1)2+(u,v)C0n1u(t)n1v(t),\displaystyle=\sum_{k=1}^{K}\frac{n_{1k}(t)\left(n_{1k}(t)-1\right)}{2}+\sum_{(u,v)\in C_{0}}n_{1u}(t)n_{1v}(t), (6)
R2,(u)(t)\displaystyle R_{2,(u)}(t) =k=1Kn2k(t)(n2k(t)1)2+(u,v)C0n2u(t)n2v(t).\displaystyle=\sum_{k=1}^{K}\frac{n_{2k}(t)\left(n_{2k}(t)-1\right)}{2}+\sum_{(u,v)\in C_{0}}n_{2u}(t)n_{2v}(t). (7)

These are the discrete-data versions of R1,G(t)R_{1,G}(t) and R2,G(t)R_{2,G}(t), which address the infeasibility of computing the original test statistics when the similarity graph is not unique due to repeated observations. A relatively large value of R1,(a)(t)R_{1,(a)}(t) and R2,(a)(t)R_{2,(a)}(t) together, or of R1,(u)(t)R_{1,(u)}(t) and R2,(u)(t)R_{2,(u)}(t) together, is evidence against the null hypothesis.
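Equations (4)-(7) translate directly into code. A minimal Python sketch, assuming `C0` is supplied as a list of edges over the K category indices (the data layout and function name are our own, not gSeg's):

```python
import numpy as np

def extended_basic_quantities(n1, n2, m, C0):
    """Extended basic quantities at one t, equations (4)-(7).

    n1, n2, m : per-category counts (arrays of length K).
    C0 : list of edges (u, v) on the distinct values (0-indexed),
         e.g. an MST or nearest neighbor link on the K categories.
    Returns (R1_a, R2_a, R1_u, R2_u).
    """
    # within-category terms
    R1_a = np.sum(n1 * (n1 - 1) / m)
    R2_a = np.sum(n2 * (n2 - 1) / m)
    R1_u = np.sum(n1 * (n1 - 1) / 2)
    R2_u = np.sum(n2 * (n2 - 1) / 2)
    # between-category terms over the edges of C0
    for u, v in C0:
        R1_a += n1[u] * n1[v] / (m[u] * m[v])
        R2_a += n2[u] * n2[v] / (m[u] * m[v])
        R1_u += n1[u] * n1[v]
        R2_u += n2[u] * n2[v]
    return R1_a, R2_a, R1_u, R2_u
```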

Under the null hypothesis H0H_{0} (1), the null distribution is defined to be the permutation distribution, which places probability 1/n!1/n! on each of the n!n! permutations of {yi}i=1,…,n\{y_{i}\}_{i=1,\ldots,n}. With no further specification, pr, EE, var, and cov denote the probability, expectation, variance, and covariance, respectively, under the permutation null distribution. Analytic formulas for the expectations and variances of the extended basic quantities can be calculated through combinatorial analysis; detailed expressions and proofs can be found in Supplement C.

For any candidate value tt of τ\tau, the extended test statistics can be defined as

Zw,(a)(t)=Rw,(a)(t)E(Rw,(a)(t))var(Rw,(a)(t)),\displaystyle\ \ Z_{w,(a)}(t)=\frac{R_{w,(a)}(t)-E\left(R_{w,(a)}(t)\right)}{\surd{\text{var}\left(R_{w,(a)}(t)\right)}},\quad Zw,(u)(t)=Rw,(u)(t)E(Rw,(u)(t))var(Rw,(u)(t)),\displaystyle Z_{w,(u)}(t)=\frac{R_{w,(u)}(t)-E\left(R_{w,(u)}(t)\right)}{\surd{\text{var}\left(R_{w,(u)}(t)\right)}},
S(a)(t)=Zw,(a)2(t)+Zd,(a)2(t),\displaystyle\ \ S_{(a)}(t)=Z_{w,(a)}^{2}(t)+Z_{d,(a)}^{2}(t),\quad S(u)(t)=Zw,(u)2(t)+Zd,(u)2(t),\displaystyle S_{(u)}(t)=Z_{w,(u)}^{2}(t)+Z_{d,(u)}^{2}(t),
M(a)(t)=max(Zw,(a)(t),|Zd,(a)(t)|),\displaystyle\ \ M_{(a)}(t)=\max\left(Z_{w,(a)}(t),|Z_{d,(a)}(t)|\right),\quad M(u)(t)=max(Zw,(u)(t),|Zd,(u)(t)|),\displaystyle M_{(u)}(t)=\max\left(Z_{w,(u)}(t),|Z_{d,(u)}(t)|\right),

where

Rw,(a)(t)=(1w(t))R1,(a)(t)+w(t)R2,(a)(t),\displaystyle R_{w,(a)}(t)=(1-w(t))R_{1,(a)}(t)+w(t)R_{2,(a)}(t), Rw,(u)(t)=(1w(t))R1,(u)(t)+w(t)R2,(u)(t),\displaystyle\ \ R_{w,(u)}(t)=(1-w(t))R_{1,(u)}(t)+w(t)R_{2,(u)}(t),
Zd,(a)(t)=Rd,(a)(t)E(Rd,(a)(t))var(Rd,(a)(t)),\displaystyle Z_{d,(a)}(t)=\frac{R_{d,(a)}(t)-E\left(R_{d,(a)}(t)\right)}{\surd{\text{var}\left(R_{d,(a)}(t)\right)}}, Rd,(a)(t)=R1,(a)(t)R2,(a)(t),\displaystyle\ \ R_{d,(a)}(t)=R_{1,(a)}(t)-R_{2,(a)}(t),
Zd,(u)(t)=Rd,(u)(t)E(Rd,(u)(t))var(Rd,(u)(t)),\displaystyle Z_{d,(u)}(t)=\frac{R_{d,(u)}(t)-E\left(R_{d,(u)}(t)\right)}{\surd{\text{var}\left(R_{d,(u)}(t)\right)}}, Rd,(u)(t)=R1,(u)(t)R2,(u)(t),\displaystyle\ \ R_{d,(u)}(t)=R_{1,(u)}(t)-R_{2,(u)}(t),

with w(t)=(t1)/(n2)w(t)=(t-1)/(n-2). Relatively large values of the test statistics constitute evidence against the null.
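The construction above can be sketched as follows. The exact permutation moments are derived in Supplement C and not reproduced here, so this illustrative Python version estimates them from the basic quantities computed on sampled permutations (arguments `R1_perm`, `R2_perm`, our own convention):

```python
import numpy as np

def edge_count_statistics(R1, R2, R1_perm, R2_perm, t, n):
    """Z_w, Z_d, S, and M at one candidate t (averaging or union version).

    R1, R2           : observed basic quantities at t.
    R1_perm, R2_perm : arrays of the same quantities under many random
                       permutations, used here to estimate the exact
                       permutation moments derived in Supplement C.
    """
    w = (t - 1) / (n - 2)                        # weight w(t) from the text
    Rw, Rd = (1 - w) * R1 + w * R2, R1 - R2
    Rw_perm = (1 - w) * R1_perm + w * R2_perm
    Rd_perm = R1_perm - R2_perm
    Zw = (Rw - Rw_perm.mean()) / Rw_perm.std()   # standardize
    Zd = (Rd - Rd_perm.mean()) / Rd_perm.std()
    return Zw, Zd, Zw**2 + Zd**2, max(Zw, abs(Zd))   # Z_w, Z_d, S, M
```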

3.2 New scan statistics

Based on the extended statistics, we could define the scan statistics for the single change-point alternative to handle data with repeated observations as follows:

  1. Extended weighted edge-count scan statistic: maxn0tn1Zw,(a)(t)\max_{n_{0}\leq t\leq n_{1}}Z_{w,(a)}(t)  &  maxn0tn1Zw,(u)(t)\max_{n_{0}\leq t\leq n_{1}}Z_{w,(u)}(t),

  2. Extended generalized edge-count scan statistic: maxn0tn1S(a)(t)\max_{n_{0}\leq t\leq n_{1}}S_{(a)}(t)  &  maxn0tn1S(u)(t)\max_{n_{0}\leq t\leq n_{1}}S_{(u)}(t),

  3. Extended max-type edge-count scan statistic: maxn0tn1M(a)(t)\max_{n_{0}\leq t\leq n_{1}}M_{(a)}(t)  &  maxn0tn1M(u)(t)\max_{n_{0}\leq t\leq n_{1}}M_{(u)}(t),

where n0n_{0} and n1n_{1} are pre-specified values. For example, we can set n0=0.05nn_{0}=\lceil 0.05n\rceil and n1=nn0n_{1}=n-n_{0} so that both segments contain enough observations to represent their distributions.

Each scan statistic has its own characteristics aimed at different situations (see Section 5 for a comparison). For example, the extended weighted edge-count test is useful when a change-point occurs away from the middle of the sequence; the extended generalized edge-count test is effective under both location and scale alternatives; and the extended max-type edge-count test targets similar alternatives but admits more accurate pp-value approximations. The null hypothesis is rejected if the maximum is greater than a certain threshold. How to choose the threshold to control the type I error rate is described in Section 4.
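Given any per-t statistic, the scan itself is a maximization over the window [n0, n1]. A short illustrative sketch with the choice n0 = ⌈0.05n⌉ and n1 = n − n0 from above (the function name is ours):

```python
import math

def scan_max(stat_at_t, n, frac=0.05):
    """Maximize a per-t statistic over n0 <= t <= n1 and return the
    maximum together with its argmax (the estimated change-point).
    Uses n0 = ceil(frac * n) and n1 = n - n0, as in the text."""
    n0 = math.ceil(frac * n)
    n1 = n - n0
    best_t = max(range(n0, n1 + 1), key=stat_at_t)
    return stat_at_t(best_t), best_t
```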

Figure 1: Plots of S(a)(t)S_{(a)}(t) and S(u)(t)S_{(u)}(t) against tt when the first 50 observations are from Multinomial(10,prob1)(10,prob_{1}) and the second 50 observations are from Multinomial(10,prob2)(10,prob_{2}), where prob1=(0.2,0.3,0.3,0.2)Tprob_{1}=(0.2,0.3,0.3,0.2)^{T} and prob2=(0.4,0.3,0.2,0.1)Tprob_{2}=(0.4,0.3,0.2,0.1)^{T} (left panel), and when all 100 observations are from Multinomial(10,prob1)(10,prob_{1}) (right panel). Here, C0C_{0} is the nearest neighbor link on the Euclidean distance.

For illustration, Figure 1 plots the processes S(a)(t)S_{(a)}(t) and S(u)(t)S_{(u)}(t) for the first 50 observations generated from Multinomial(10,(0.2,0.3,0.3,0.2)T)\left(10,(0.2,0.3,0.3,0.2)^{T}\right) and the second 50 observations generated from Multinomial(10,(0.4,0.3,0.2,0.1)T)\left(10,(0.4,0.3,0.2,0.1)^{T}\right), with C0C_{0} the nearest neighbor link constructed on the Euclidean distance. We see that both S(a)(t)S_{(a)}(t) and S(u)(t)S_{(u)}(t) peak at the true change-point τ=50\tau=50 (left panel). On the other hand, when there is no change-point, S(a)(t)S_{(a)}(t) and S(u)(t)S_{(u)}(t) exhibit random fluctuations with smaller maximum values (right panel). Illustrations of the other test statistics are provided in Supplement D.

For testing the null H0H_{0} (1) against the changed-interval alternative H2H_{2} (3), test statistics can be derived in a similar way to the single change-point case. For example, the extended generalized edge-count scan statistics are

max1<t1<t2nn0t2t1n1S(a)(t1,t2) and max1<t1<t2nn0t2t1n1S(u)(t1,t2)\max_{\begin{subarray}{c}1<t_{1}<t_{2}\leq n\\ n_{0}\leq t_{2}-t_{1}\leq n_{1}\end{subarray}}S_{(a)}(t_{1},t_{2})\ \ \ \text{ and }\max_{\begin{subarray}{c}1<t_{1}<t_{2}\leq n\\ n_{0}\leq t_{2}-t_{1}\leq n_{1}\end{subarray}}S_{(u)}(t_{1},t_{2})

for the averaging and union approaches, respectively, where S(a)(t1,t2)S_{(a)}(t_{1},t_{2}) and S(u)(t1,t2)S_{(u)}(t_{1},t_{2}) are the extended generalized edge-count statistics on the two samples: observations within [t1,t2)[t_{1},t_{2}) and observations outside [t1,t2)[t_{1},t_{2}). The details of all statistics for the changed-interval alternative are in Supplement F.

4 Analytical pp-value approximation

4.1 Asymptotic distributions of the stochastic processes

We are concerned with the tail probabilities of the scan statistics under H0H_{0}. Taking the extended generalized edge-count scan statistics as an example, we want to approximate

\text{pr}\left(\max_{n_{0}\leq t\leq n_{1}}S_{(a)}(t)>b\right),\ \ \text{pr}\left(\max_{n_{0}\leq t\leq n_{1}}S_{(u)}(t)>b\right) (8)

for the single change-point alternative, and

\text{pr}\left(\max_{\begin{subarray}{c}1<t_{1}<t_{2}\leq n\\ n_{0}\leq t_{2}-t_{1}\leq n_{1}\end{subarray}}S_{(a)}(t_{1},t_{2})>b\right),\ \ \text{pr}\left(\max_{\begin{subarray}{c}1<t_{1}<t_{2}\leq n\\ n_{0}\leq t_{2}-t_{1}\leq n_{1}\end{subarray}}S_{(u)}(t_{1},t_{2})>b\right) (9)

for the changed-interval alternative.

Under the null hypothesis, the distribution of each scan statistic is its permutation distribution. For a small sample size nn, we can sample from the permutation distribution directly to compute the permutation pp-value. However, when nn is large, one needs to draw a large number of random permutations to obtain a good estimate of the pp-value, which is very time consuming. Hence, we seek analytic approximations to these tail probabilities.
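Direct sampling from the permutation distribution can be sketched as follows, with `scan_stat` standing for any of the scan statistics above (the add-one p-value estimator is a common convention, not a prescription of the paper):

```python
import random

def permutation_pvalue(scan_stat, y, n_perm=1000, seed=0):
    """Estimate the permutation p-value of a scan statistic by sampling
    random permutations of the sequence. Practical for moderate n; for
    long sequences the analytic approximations replace this sampling."""
    rng = random.Random(seed)
    observed = scan_stat(y)
    exceed = 0
    for _ in range(n_perm):
        perm = rng.sample(y, len(y))          # a uniform random permutation
        exceed += scan_stat(perm) >= observed
    return (1 + exceed) / (1 + n_perm)        # add-one estimate, avoids p = 0
```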

By the definitions of Zw,(a)(t)Z_{w,(a)}(t), S(a)(t)S_{(a)}(t), and M(a)(t)M_{(a)}(t), the stochastic processes {Zw,(a)(t)}\{Z_{w,(a)}(t)\}, {S(a)(t)}\{S_{(a)}(t)\}, and {M(a)(t)}\{M_{(a)}(t)\} boil down to two pairs of basic processes: {Zw,(a)(t)}\{Z_{w,(a)}(t)\} and {Zd,(a)(t)}\{Z_{d,(a)}(t)\} for the single change-point case, and analogously {Zw,(a)(t1,t2)}\{Z_{w,(a)}(t_{1},t_{2})\} and {Zd,(a)(t1,t2)}\{Z_{d,(a)}(t_{1},t_{2})\} for the changed-interval case. Therefore, we first study the properties of these basic stochastic processes. Let 𝒱uC0\mathcal{V}_{u}^{C_{0}} be the set of nodes adjacent to node uu in C0C_{0}, uC0\mathcal{E}_{u}^{C_{0}} be the subgraph of C0C_{0} containing all edges that connect to node uu, u,2C0\mathcal{E}_{u,2}^{C_{0}} be the set of edges in C0C_{0} that contain at least one node in uC0\mathcal{E}_{u}^{C_{0}}, and |uC0||\mathcal{E}_{u}^{C_{0}}| and |u,2C0||\mathcal{E}_{u,2}^{C_{0}}| be the numbers of edges in these sets. To derive the asymptotic behavior of the stochastic processes under the averaging approach, we work under the following conditions:

Condition 4.1.

|C_{0}|=O(n),\ \ \sum_{(u,v)\in C_{0}}(m_{u}m_{v})^{-1}=O(n).

Condition 4.2.

u=1Kmu(mu+|uC0|)(v𝒱uC0mv+|u,2C0|)=o(n3/2)\sum_{u=1}^{K}m_{u}\left(m_{u}+|\mathcal{E}_{u}^{C_{0}}|\right)\left(\sum_{v\in\mathcal{V}_{u}^{C_{0}}}m_{v}+|\mathcal{E}_{u,2}^{C_{0}}|\right)=o(n^{3/2}).

Condition 4.3.

(u,v)C0(mu+mv+|uC0|+|vC0|)(w𝒱uC0𝒱vC0mw+|u,2C0|+|v,2C0|)\sum_{(u,v)\in C_{0}}\left(m_{u}+m_{v}+|\mathcal{E}_{u}^{C_{0}}|+|\mathcal{E}_{v}^{C_{0}}|\right)\big{(}\sum_{w\in\mathcal{V}_{u}^{C_{0}}\cup\mathcal{V}_{v}^{C_{0}}}m_{w}+|\mathcal{E}_{u,2}^{C_{0}}|+|\mathcal{E}_{v,2}^{C_{0}}|\big{)}
=o(n3/2)=o(n^{3/2}).

Condition 4.4.

u=1K(|uC0|2)2/(4mu)(|C0|K)2/n=O(n)\sum_{u=1}^{K}(|\mathcal{E}_{u}^{C_{0}}|-2)^{2}/(4m_{u})-(|C_{0}|-K)^{2}/n=O(n).

Let [x][x] denote the largest integer that is no larger than xx.

Theorem 4.5.

Under Conditions 4.14.4, as nn\rightarrow\infty,

  1. {Zw,(a)([nw]):0<w<1}\{Z_{w,(a)}([nw]):0<w<1\} and {Zd,(a)([nw]):0<w<1}\{Z_{d,(a)}([nw]):0<w<1\} converge to independent Gaussian processes in finite-dimensional distributions, which we denote as {Zw,(a)(w):0<w<1}\{Z_{w,(a)}^{*}(w):0<w<1\} and {Zd,(a)(w):0<w<1}\{Z_{d,(a)}^{*}(w):0<w<1\}, respectively.

  2. {Zw,(a)([nw1],[nw2]):0<w1<w2<1}\{Z_{w,(a)}([nw_{1}],[nw_{2}]):0<w_{1}<w_{2}<1\} and {Zd,(a)([nw1],[nw2]):0<w1<w2<1}\{Z_{d,(a)}([nw_{1}],[nw_{2}]):0<w_{1}<w_{2}<1\} converge to independent Gaussian random fields in finite-dimensional distributions, which we denote as {Zw,(a)(w1,w2):0<w1<w2<1}\{Z_{w,(a)}^{*}(w_{1},w_{2}):0<w_{1}<w_{2}<1\} and {Zd,(a)(w1,w2):0<w1<w2<1}\{Z_{d,(a)}^{*}(w_{1},w_{2}):0<w_{1}<w_{2}<1\}, respectively.

The proof of this theorem uses the technique developed in Chen & Zhang (2015), which utilizes Stein's method (Chen & Shao, 2005). The details of the proof are in Supplement G.

Let ρw,(a)(u,v)=cov(Zw,(a)(u),Zw,(a)(v))\rho_{w,(a)}^{*}(u,v)=\text{cov}(Z_{w,(a)}^{*}(u),Z_{w,(a)}^{*}(v)) and ρd,(a)(u,v)=cov(Zd,(a)(u),Zd,(a)(v))\rho_{d,(a)}^{*}(u,v)=\text{cov}(Z_{d,(a)}^{*}(u),Z_{d,(a)}^{*}(v)). The next theorem gives explicit expressions for the covariance functions of the limiting Gaussian processes {Zw,(a)(w), 0<w<1}\{Z_{w,(a)}^{*}(w),\ 0<w<1\} and {Zd,(a)(w), 0<w<1}\{Z_{d,(a)}^{*}(w),\ 0<w<1\}. It is proved through combinatorial analysis; details are given in Supplement H.

Theorem 4.6.

The covariance functions of the Gaussian processes Zw,(a)(w)Z_{w,(a)}^{*}(w), and Zd,(a)(w)Z_{d,(a)}^{*}(w) have the following expressions:

ρw,(a)(u,v)=(uv){1(uv)}(uv){1(uv)},ρd,(a)(u,v)=[(uv){1(uv)}(uv){1(uv)}]1/2,\displaystyle\rho_{w,(a)}^{*}(u,v)=\frac{(u\wedge v)\left\{1-(u\vee v)\right\}}{(u\vee v)\left\{1-(u\wedge v)\right\}},\ \ \rho_{d,(a)}^{*}(u,v)=\left[\frac{(u\wedge v)\left\{1-(u\vee v)\right\}}{(u\vee v)\left\{1-(u\wedge v)\right\}}\right]^{1/2},

where uv=min(u,v)u\wedge v=\min(u,v) and uv=max(u,v)u\vee v=\max(u,v).

For the union approach, let G¯\bar{G} be the union of all possible optimal graphs on the observations {yi}\{y_{i}\}, iG¯\mathcal{E}_{i}^{\bar{G}} be the subgraph of G¯\bar{G} containing all edges that connect to node yiy_{i}, and |iG¯||\mathcal{E}_{i}^{\bar{G}}| be the number of edges in this set. We work under

Condition 4.7.

|G¯|=O(n)|\bar{G}|=O(n).

Condition 4.8.

u=1Kmu3v𝒱uC0mvv𝒱uC0mv(mv+w𝒱vC0\{v}mw)=o(n3/2)\sum_{u=1}^{K}m_{u}^{3}\sum_{v\in\mathcal{V}_{u}^{C_{0}}}m_{v}\sum_{v\in\mathcal{V}_{u}^{C_{0}}}m_{v}\left(m_{v}+\sum_{w\in\mathcal{V}_{v}^{C_{0}}\backslash\{v\}}m_{w}\right)=o(n^{3/2}).

Condition 4.9.

(u,v)C0mumv(muw𝒱uC0mw+mvw𝒱vC0mw)\sum_{(u,v)\in C_{0}}m_{u}m_{v}\left(m_{u}\sum_{w\in\mathcal{V}_{u}^{C_{0}}}m_{w}+m_{v}\sum_{w\in\mathcal{V}_{v}^{C_{0}}}m_{w}\right)
×{w𝒱uC0𝒱vC0,y𝒱wC0\{w}mw(mw+my)}=o(n3/2).\times\left\{\sum_{w\in\mathcal{V}_{u}^{C_{0}}\cup\mathcal{V}_{v}^{C_{0}},\ y\in\mathcal{V}_{w}^{C_{0}}\backslash\{w\}}m_{w}\left(m_{w}+m_{y}\right)\right\}=o(n^{3/2}).

Condition 4.10.

i=1n|iG¯|24|G¯|2/n=O(n)\sum_{i=1}^{n}|\mathcal{E}_{i}^{\bar{G}}|^{2}-4|\bar{G}|^{2}/n=O(n).

Theorem 4.11.

Under Conditions 4.74.10, as nn\rightarrow\infty,

  1. {Zw,(u)([nw]):0<w<1}\{Z_{w,(u)}([nw]):0<w<1\} and {Zd,(u)([nw]):0<w<1}\{Z_{d,(u)}([nw]):0<w<1\} converge to independent Gaussian processes in finite-dimensional distributions, which we denote as {Zw,(u)(w):0<w<1}\{Z_{w,(u)}^{*}(w):0<w<1\} and {Zd,(u)(w):0<w<1}\{Z_{d,(u)}^{*}(w):0<w<1\}, respectively.

  2. {Zw,(u)([nw1],[nw2]):0<w1<w2<1}\{Z_{w,(u)}([nw_{1}],[nw_{2}]):0<w_{1}<w_{2}<1\} and {Zd,(u)([nw1],[nw2]):0<w1<w2<1}\{Z_{d,(u)}([nw_{1}],[nw_{2}]):0<w_{1}<w_{2}<1\} converge to independent Gaussian random fields in finite-dimensional distributions, which we denote as {Zw,(u)(w1,w2):0<w1<w2<1}\{Z_{w,(u)}^{*}(w_{1},w_{2}):0<w_{1}<w_{2}<1\} and {Zd,(u)(w1,w2):0<w1<w2<1}\{Z_{d,(u)}^{*}(w_{1},w_{2}):0<w_{1}<w_{2}<1\}, respectively.

The details of the proof are in Supplement I.

Let ρw,(u)(u,v)=cov(Zw,(u)(u),Zw,(u)(v))\rho_{w,(u)}^{*}(u,v)=\text{cov}(Z_{w,(u)}^{*}(u),Z_{w,(u)}^{*}(v)) and ρd,(u)(u,v)=cov(Zd,(u)(u),Zd,(u)(v))\rho_{d,(u)}^{*}(u,v)=\text{cov}(Z_{d,(u)}^{*}(u),Z_{d,(u)}^{*}(v)). The next theorem states explicit expressions for the covariance functions of the limiting Gaussian processes, {Zw,(u)(w), 0<w<1}\{Z_{w,(u)}^{*}(w),\ 0<w<1\}, and {Zd,(u)(w), 0<w<1}\{Z_{d,(u)}^{*}(w),\ 0<w<1\}. Its proof is in Supplement J.

Theorem 4.12.

The covariance functions of the Gaussian processes Zw,(u)(w)Z_{w,(u)}^{*}(w) and Zd,(u)(w)Z_{d,(u)}^{*}(w) have the following expressions:

ρw,(u)(u,v)=(uv){1(uv)}(uv){1(uv)},ρd,(u)(u,v)=[(uv){1(uv)}(uv){1(uv)}]1/2.\displaystyle\rho_{w,(u)}^{*}(u,v)=\frac{(u\wedge v)\left\{1-(u\vee v)\right\}}{(u\vee v)\left\{1-(u\wedge v)\right\}},\ \ \rho_{d,(u)}^{*}(u,v)=\left[\frac{(u\wedge v)\left\{1-(u\vee v)\right\}}{(u\vee v)\left\{1-(u\wedge v)\right\}}\right]^{1/2}.
Remark 1.

By Theorems 4.6 and 4.12, we see that the limiting distributions of {Zw,(a)([nw]):0<w<1}\{Z_{w,(a)}([nw]):0<w<1\} and {Zw,(u)([nw]):0<w<1}\{Z_{w,(u)}([nw]):0<w<1\} are the same and do not depend on the graph at all; the same holds for the ZdZ_{d} processes. In addition, the covariance functions in Theorems 4.6 and 4.12 are the same as those in Theorem 4.3 of Chu & Chen (2019). Hence, the limiting distributions of the extended graph-based tests based on ZwZ_{w}, SS, and MM are exactly the same as those of their corresponding versions for continuous data. On the other hand, the limiting distributions of the extended original edge-count scan statistics differ from their corresponding versions under the continuous setting (see details in Supplement E).

Remark 2.

Conditions 4.14.10 all constrain the number of repeated observations. Conditions 4.1 and 4.7 can usually be satisfied with an appropriate choice of C0C_{0}. Conditions 4.2, 4.3, 4.8 and 4.9 constrain the degrees of nodes in the graph C0C_{0} so that they cannot be too large. Condition 4.4 ensures that (R1,(a)(t),R2,(a)(t))T(R_{1,(a)}(t),R_{2,(a)}(t))^{T} does not degenerate asymptotically, so that S(a)(t)S_{(a)}(t) is well defined; Condition 4.10 plays the same role for the union approach.

We check these conditions through simulation studies (Supplement L) and find that the pp-value approximation can still work well even when some of them are violated. Zhu & Chen (2021) recently studied graph-based two-sample tests for continuous data and proposed much more relaxed conditions than those in Chu & Chen (2019); they verified their conditions under both sparse and dense graphs. We believe that the conditions for data with repeated observations under the change-point setting can also be relaxed. This requires substantial work, and we leave it to future research.

4.2 Asymptotic pp-value approximation

We now examine the asymptotic behavior of the tail probabilities. Following arguments similar to those in the proof of Proposition 3.4 in Chen & Zhang (2015), we can obtain the foundation for analytical approximations to these probabilities. Assume that n0,n1,n,bn_{0},n_{1},n,b\rightarrow\infty in such a way that, for some 0<x0<x1<10<x_{0}<x_{1}<1 and b1>0b_{1}>0, n0/nx0,n1/nx1,b/nb1n_{0}/n\rightarrow x_{0},\ n_{1}/n\rightarrow x_{1},\ b/\surd{n}\rightarrow b_{1}.

Based on Theorem 4.5 and 4.11, as nn\rightarrow\infty, for both averaging and union approaches, we have

pr(maxn0tn1Zw(t/n)>b)\displaystyle\text{pr}\left(\max_{n_{0}\leq t\leq n_{1}}Z_{w}^{*}(t/n)>b\right) bϕ(b)x0x1hw(x)ν[b1{2hw(x)}1/2]𝑑x,\displaystyle\sim b\phi(b)\int_{x_{0}}^{x_{1}}h_{w}^{*}(x)\nu\left[b_{1}\left\{2h_{w}^{*}(x)\right\}^{1/2}\right]dx,
pr(maxn0tn1|Zd(t/n)|>b)\displaystyle\text{pr}\left(\max_{n_{0}\leq t\leq n_{1}}|Z_{d}^{*}(t/n)|>b\right) 2bϕ(b)x0x1hd(x)ν[b1{2hd(x)}1/2]𝑑x,\displaystyle\sim 2b\phi(b)\int_{x_{0}}^{x_{1}}h_{d}^{*}(x)\nu\left[b_{1}\left\{2h_{d}^{*}(x)\right\}^{1/2}\right]dx,

where ν(s)(2/s){Φ(s/2)0.5}/{(s/2)Φ(s/2)+ϕ(s/2)}\nu(s)\approx(2/s)\left\{\Phi(s/2)-0.5\right\}/\left\{(s/2)\Phi(s/2)+\phi(s/2)\right\} (Siegmund & Yakir, 2007), with Φ()\Phi(\cdot) and ϕ()\phi(\cdot) the standard normal cumulative distribution function and probability density function, respectively, and

hw(x)=limsxρw,(a)(s,x)s\displaystyle h_{w}^{*}(x)=\lim_{s\nearrow x}\frac{\partial\rho_{w,(a)}^{*}(s,x)}{\partial s} =limsxρw,(a)(s,x)s=limsxρw,(u)(s,x)s=limsxρw,(u)(s,x)s,\displaystyle=-\lim_{s\searrow x}\frac{\partial\rho_{w,(a)}^{*}(s,x)}{\partial s}=\lim_{s\nearrow x}\frac{\partial\rho_{w,(u)}^{*}(s,x)}{\partial s}=-\lim_{s\searrow x}\frac{\partial\rho_{w,(u)}^{*}(s,x)}{\partial s},
hd(x)=limsxρd,(a)(s,x)s\displaystyle h_{d}^{*}(x)=\lim_{s\nearrow x}\frac{\partial\rho_{d,(a)}^{*}(s,x)}{\partial s} =limsxρd,(a)(s,x)s=limsxρd,(u)(s,x)s=limsxρd,(u)(s,x)s.\displaystyle=-\lim_{s\searrow x}\frac{\partial\rho_{d,(a)}^{*}(s,x)}{\partial s}=\lim_{s\nearrow x}\frac{\partial\rho_{d,(u)}^{*}(s,x)}{\partial s}=-\lim_{s\searrow x}\frac{\partial\rho_{d,(u)}^{*}(s,x)}{\partial s}.

It can be shown that hw(x)={x(1x)}1h_{w}^{*}(x)=\left\{x(1-x)\right\}^{-1} and hd(x)={2x(1x)}1h_{d}^{*}(x)=\left\{2x(1-x)\right\}^{-1}.
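These approximations can be evaluated numerically. A stdlib-only Python sketch for the weighted statistic, using the trapezoid rule (the grid size and quadrature scheme are our own choices, not the gSeg implementation):

```python
import math

def Phi(x):   # standard normal cumulative distribution function
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def phi(x):   # standard normal density
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def nu(s):    # Siegmund & Yakir (2007) approximation used in the text
    if s <= 0:
        return 1.0
    return (2 / s) * (Phi(s / 2) - 0.5) / ((s / 2) * Phi(s / 2) + phi(s / 2))

def h_w(x):   # h_w*(x) = 1 / {x (1 - x)}
    return 1.0 / (x * (1 - x))

def pval_Zw(b, n, x0, x1, grid=2000):
    """Analytic approximation to pr(max Z_w* > b), trapezoid rule."""
    b1 = b / math.sqrt(n)
    xs = [x0 + (x1 - x0) * i / grid for i in range(grid + 1)]
    fs = [h_w(x) * nu(b1 * math.sqrt(2 * h_w(x))) for x in xs]
    integral = sum((fs[i] + fs[i + 1]) / 2 for i in range(grid)) * (x1 - x0) / grid
    return b * phi(b) * integral
```

The approximation decreases in the threshold b, as a tail probability should.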

Since Zw,(a)(t)Z_{w,(a)}^{*}(t) and Zd,(a)(t)Z_{d,(a)}^{*}(t) are independent and Zw,(u)(t)Z_{w,(u)}^{*}(t) and Zd,(u)(t)Z_{d,(u)}^{*}(t) are independent, for both averaging and union approaches, we have

\displaystyle\text{pr}\left(\max_{n_{0}\leq t\leq n_{1}}M^{*}(t/n)>b\right)=1-\text{pr}\left(\max_{n_{0}\leq t\leq n_{1}}|Z_{d}^{*}(t/n)|<b\right)\text{pr}\left(\max_{n_{0}\leq t\leq n_{1}}Z_{w}^{*}(t/n)<b\right).

In addition, following arguments similar to those in the proof of Proposition 4.4 in Chu & Chen (2019), we obtain the analytical p-value approximations for the extended generalized edge-count test. Assume that n_{0},n_{1},n,b_{S}\rightarrow\infty in such a way that, for some 0<x_{0}<x_{1}<1 and b_{2}>0, n_{0}/n\rightarrow x_{0},\ n_{1}/n\rightarrow x_{1},\ b_{S}/n\rightarrow b_{2}. Then, as n\rightarrow\infty, for both the averaging and union approaches, we have

\displaystyle\text{pr}\left(\max_{n_{0}\leq t\leq n_{1}}S^{*}(t/n)>b_{S}\right)\sim\frac{b_{S}e^{-b_{S}/2}}{2\pi}\int_{0}^{2\pi}\int_{x_{0}}^{x_{1}}h_{s}^{*}(x,w)\,\nu\left[\left\{2b_{2}h_{s}^{*}(x,w)\right\}^{1/2}\right]dx\,dw,

where h_{s}^{*}(x,w)=h_{d}^{*}(x)\cos^{2}(w)+h_{w}^{*}(x)\sin^{2}(w). The analytical p-value approximations for the changed-interval alternative are given in Supplement M.
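The double integral above can likewise be evaluated by a simple quadrature; the sketch below (a pure-Python midpoint rule, substituting b_{S}/n for b_{2} at finite n; the threshold b_{S}=14 is only an illustrative value) is not a definitive implementation:

```python
import math

def Phi(x):  # standard normal CDF
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def phi(x):  # standard normal PDF
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def nu(s):
    # Siegmund & Yakir (2007) approximation to nu(s)
    return (2.0 / s) * (Phi(s / 2.0) - 0.5) / ((s / 2.0) * Phi(s / 2.0) + phi(s / 2.0))

def h_s(x, w):
    # h_s*(x, w) = h_d*(x) cos^2(w) + h_w*(x) sin^2(w)
    h_d = 1.0 / (2.0 * x * (1.0 - x))
    h_w = 1.0 / (x * (1.0 - x))
    return h_d * math.cos(w) ** 2 + h_w * math.sin(w) ** 2

def tail_prob_S(b_s, n, n0, n1, grid=200):
    # midpoint rule over x in (n0/n, n1/n) and w in (0, 2*pi)
    x0, x1 = n0 / n, n1 / n
    b2 = b_s / n
    dx, dw = (x1 - x0) / grid, 2.0 * math.pi / grid
    total = 0.0
    for i in range(grid):
        x = x0 + (i + 0.5) * dx
        for j in range(grid):
            w = (j + 0.5) * dw
            h = h_s(x, w)
            total += h * nu(math.sqrt(2.0 * b2 * h))
    return b_s * math.exp(-b_s / 2.0) / (2.0 * math.pi) * total * dx * dw

p_s = tail_prob_S(14.0, 1000, 100, 900)
```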

Remark 3.

In practice, we use h_{w}(n,x) in place of h_{w}^{*}(x), where h_{w}(n,x) is the finite-sample equivalent of h_{w}^{*}(x). That is,

h_{w}(n,x)=n\lim_{s\nearrow nx}\frac{\partial\rho_{w}(s,nx)}{\partial s},

with \rho_{w}(s,t)=\text{cov}\left(Z_{w}(s),Z_{w}(t)\right). The explicit expression for h_{w}(n,x) is

h_{w}(n,x)=\frac{(n-1)(2nx^{2}-2nx+1)}{2x(1-x)(n^{2}x^{2}-n^{2}x+n-1)}.

It is clear from the above expression that h_{w}(n,x) does not depend on the graph C_{0} either, and it is easy to show that \lim_{n\rightarrow\infty}h_{w}(n,x)=h_{w}^{*}(x). The finite-sample equivalent of h_{d}^{*}(x) is exactly the same as h_{d}^{*}(x), that is,

h_{d}(n,x)=n\lim_{s\nearrow nx}\frac{\partial\rho_{d}(s,nx)}{\partial s}=\frac{1}{2x(1-x)},

where \rho_{d}(s,t)=\text{cov}\left(Z_{d}(s),Z_{d}(t)\right).
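The convergence of h_{w}(n,x) to \{x(1-x)\}^{-1} can be checked numerically from the explicit expression; a minimal sketch (at the illustrative point x=0.3) is:

```python
def h_w_finite(n, x):
    # explicit finite-sample expression for h_w(n, x)
    num = (n - 1) * (2 * n * x**2 - 2 * n * x + 1)
    den = 2 * x * (1 - x) * (n**2 * x**2 - n**2 * x + n - 1)
    return num / den

def h_w_limit(x):
    # asymptotic limit h_w*(x) = {x(1-x)}^{-1}
    return 1.0 / (x * (1.0 - x))

# the finite-sample curve approaches the limit as n grows
for n in (100, 1000, 10000):
    print(n, round(h_w_finite(n, 0.3), 4), round(h_w_limit(0.3), 4))
```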

4.3 Skewness correction

The analytic p-value approximations based on asymptotic results give ballpark estimates of the p-values. However, they are in general not accurate enough when n_{0} and n_{1} are set close to the two ends of the sequence or when the dimension is high (see Table 3 in Section 4.4). The inaccuracy is largely attributed to the slow convergence of Z_{w,(a)}(t),Z_{w,(u)}(t),Z_{d,(a)}(t),Z_{d,(u)}(t) to the Gaussian distribution when t/n is close to 0 or 1.

To improve the analytical p-value approximations, we add extra terms to the analytic formulas to correct for skewness. In our problem, the extent of the skewness depends on the value of t. Hence, we adopt the skewness correction approach discussed in Chen & Zhang (2015), where a different amount of correction is applied for each t: in particular, this approach uses a better approximation of the marginal probability based on the third moment, \gamma.

After skewness correction, the analytical p-value approximations for the averaging approach are

\displaystyle\text{pr}\left(\max_{n_{0}\leq t\leq n_{1}}Z_{w,(a)}(t)>b\right)\sim b\phi(b)\int_{n_{0}/n}^{n_{1}/n}H_{w,(a)}(nx)h_{w}(n,x)\,\nu\left[b\left\{2h_{w}(n,x)/n\right\}^{1/2}\right]dx, (10)

where

H_{w,(a)}(t)=\frac{\exp\big\{\frac{1}{2}(b-\hat{\theta}_{b,w,(a)}(t))^{2}+\frac{1}{6}\gamma_{w,(a)}(t)\hat{\theta}_{b,w,(a)}^{3}(t)\big\}}{\left\{1+\gamma_{w,(a)}(t)\hat{\theta}_{b,w,(a)}(t)\right\}^{1/2}},\ \ \hat{\theta}_{b,w,(a)}(t)=\frac{\left\{1+2\gamma_{w,(a)}(t)b\right\}^{1/2}-1}{\gamma_{w,(a)}(t)},
\displaystyle\text{pr}\left(\max_{n_{0}\leq t\leq n_{1}}Z_{d,(a)}(t)>b\right)\sim b\phi(b)\int_{n_{0}/n}^{n_{1}/n}H_{d,(a)}(nx)h_{d}(n,x)\,\nu\left[b\left\{2h_{d}(n,x)/n\right\}^{1/2}\right]dx, (11)

where

H_{d,(a)}(t)=\frac{\exp\big\{\frac{1}{2}(b-\hat{\theta}_{b,d,(a)}(t))^{2}+\frac{1}{6}\gamma_{d,(a)}(t)\hat{\theta}_{b,d,(a)}^{3}(t)\big\}}{\left\{1+\gamma_{d,(a)}(t)\hat{\theta}_{b,d,(a)}(t)\right\}^{1/2}},\ \ \hat{\theta}_{b,d,(a)}(t)=\frac{\left\{1+2\gamma_{d,(a)}(t)b\right\}^{1/2}-1}{\gamma_{d,(a)}(t)},

and \gamma_{w,(a)}(t)=E\left(Z^{3}_{w}(t)\right),\ \gamma_{d,(a)}(t)=E\left(Z^{3}_{d}(t)\right), whose analytic expressions are provided in Supplement K. The skewness-corrected analytical p-value approximations for the union approach and for the changed-interval alternative can be derived in a similar manner; details are provided in Supplement M.
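Given a skewness value \gamma, the correction factor H and the tilting point \hat{\theta}_{b} are straightforward to compute; a minimal sketch (the value \gamma=0.1 is purely illustrative, not from the paper) is:

```python
import math

def theta_hat(b, gamma):
    # tilting point; requires 1 + 2*gamma*b > 0
    return (math.sqrt(1.0 + 2.0 * gamma * b) - 1.0) / gamma

def H_factor(b, gamma):
    # extra multiplicative factor in the skewness-corrected approximation
    th = theta_hat(b, gamma)
    return math.exp(0.5 * (b - th) ** 2 + gamma * th**3 / 6.0) / math.sqrt(1.0 + gamma * th)

# as the skewness vanishes, theta_hat -> b and H -> 1 (no correction);
# a positive skewness inflates the correction factor above 1
print(H_factor(3.0, 1e-8), H_factor(3.0, 0.1))
```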

Remark 4.

By jointly correcting the marginal probabilities of Z_{w}(t) and Z_{d}(t), we can derive skewness-corrected p-value approximations for \max_{n_{0}\leq t\leq n_{1}}S(t)=\max_{0\leq w\leq 2\pi}\max_{n_{0}\leq t\leq n_{1}}\left\{Z_{w}(t)\sin(w)+Z_{d}(t)\cos(w)\right\} (Chu & Chen, 2019). However, the integrand can easily be non-finite, so the method relies heavily on extrapolation. We thus do not perform skewness correction on S_{(a)}(t) and S_{(u)}(t).

4.4 Checking analytical p-value approximations under finite n

We check the performance of the analytical p-value approximations obtained in Sections 4.2 and 4.3. In particular, we compare the critical values at the 0.05 p-value threshold obtained from the analytical approximations, based on the asymptotic results and on the skewness correction, with those obtained from 10,000 permutations under various simulation settings, to see how well the analytical approximations work for finite samples. Here, we focus on the extended max-type scan statistic for the single change-point alternative. Results for the other scan statistics and for the changed-interval alternative are provided in Supplement N.

We consider three distributions with different dimensions (multinomial with d=10 and equal probabilities (C1), Gaussian with repeated observations and d=100 (C2), multinomial with d=1000 and equal probabilities (C3)) and let C_{0} be the nearest neighbor link constructed from the Euclidean distance. The analytic approximations depend on the constraints, n_{0} and n_{1}, on the region where the change-point is searched. For simplicity, we set n_{1}=n-n_{0}.

Since the analytical p-value approximations without skewness correction do not depend on C_{0} for the extended weighted, generalized, and max-type tests, the critical value is determined by n, n_{0}, and n_{1} only. Notice that the analytical p-value approximations without skewness correction give the same result for the averaging and union approaches. On the other hand, the skewness-corrected approximated p-values depend on certain characteristics of the graph structure. The structure of the nearest neighbor link depends on the underlying dataset, and thus the critical values vary across simulation runs.

Table 3 shows results for the extended max-type scan statistics. The first sub-table, labeled 'A1', presents the analytical critical values without skewness correction. 'A2 (a)' and 'A2 (u)' denote the skewness-corrected analytical critical values for the averaging and union approaches, respectively, and 'Per (a)' and 'Per (u)' the corresponding permutation critical values. We show results for two randomly simulated sequences in each setting. We see that the asymptotic p-value approximation does reasonably well. As the window size decreases, the analytical critical values become less precise; however, the skewness-corrected approximation performs much better than the approximation without skewness correction. When the dimension is not too high, as in (C1), the skewness-corrected analytical approximation does reasonably well for n_{0} as low as 25. When the dimension is high, as in (C2) and (C3), the approximation performs well when n_{0}\geq 50.

Table 3: Critical values for the single change-point scan statistics \max_{n_{0}\leq t\leq n_{1}}M_{(a)}(t) and \max_{n_{0}\leq t\leq n_{1}}M_{(u)}(t) based on the nearest neighbor link at the 0.05 significance level; n=1000.

Analytical critical values without skewness correction (identical for both approaches):
          n0=100   n0=75   n0=50   n0=25
  A1        3.24    3.28    3.32    3.38

Critical values, averaging approach (a):
          n0=100            n0=75             n0=50             n0=25
          A2 (a)  Per (a)   A2 (a)  Per (a)   A2 (a)  Per (a)   A2 (a)  Per (a)
  (C1)     3.30    3.30      3.36    3.37      3.43    3.43      3.54    3.58
           3.30    3.30      3.35    3.36      3.43    3.46      3.55    3.62
  (C2)     3.36    3.34      3.44    3.45      3.56    3.59      3.72    3.98
           3.34    3.36      3.42    3.47      3.53    3.64      3.76    4.03
  (C3)     3.30    3.30      3.38    3.41      3.48    3.57      3.67    3.93
           3.30    3.28      3.38    3.39      3.48    3.56      3.67    3.87

Critical values, union approach (u):
          n0=100            n0=75             n0=50             n0=25
          A2 (u)  Per (u)   A2 (u)  Per (u)   A2 (u)  Per (u)   A2 (u)  Per (u)
  (C1)     3.32    3.30      3.37    3.40      3.44    3.43      3.54    3.59
           3.31    3.32      3.36    3.35      3.43    3.46      3.55    3.63
  (C2)     3.35    3.36      3.42    3.43      3.51    3.52      3.62    3.80
           3.34    3.39      3.40    3.46      3.48    3.55      3.67    3.84
  (C3)     3.31    3.30      3.39    3.41      3.50    3.57      3.69    3.93
           3.31    3.28      3.39    3.39      3.50    3.56      3.69    3.87

5 Performance of the new tests

We study the performance of the new tests in two respects: (1) whether the test can reject the null hypothesis of homogeneity when there is a change, and (2) if the test rejects H_{0}, whether it can estimate the location of the change-point accurately. We use the configuration model random graph G(v,\overrightarrow{k}) to generate networks. Here, v is the number of vertices and \overrightarrow{k}=(k_{1},\ldots,k_{v}) is a degree sequence on the v vertices, with k_{i} the degree of vertex i. To generate a configuration model random graph from a given degree sequence, we choose a uniformly random matching on the degree stubs (half-edges).

We generate a sequence of n=200 networks from the following model:

y_{i}\sim\begin{cases}G(v,\overrightarrow{k}_{1}),&i=1,\ldots,\tau;\\ G(v,\overrightarrow{k}_{2}),&i=\tau+1,\ldots,200.\end{cases}

We explore two locations of the change-point, in the middle (\tau=100) and close to one end (\tau=170), with v=6 vertices in this simulation. The resulting datasets contain repeated networks. We also consider two types of changes:

  1. An equal degree change in the network (all elements of \overrightarrow{k}_{1} are 2; 2 elements of \overrightarrow{k}_{2} are 4 and the rest are 2);

  2. A random degree change in the network (all elements of \overrightarrow{k}_{1} are 2; 2 elements of \overrightarrow{k}_{2} are randomly selected from 3 to 5 and the rest are 2).

That is, we present results for the four combinations: an equal degree change at \tau=100 (S1), a random degree change at \tau=100 (S2), an equal degree change at \tau=170 (S3), and a random degree change at \tau=170 (S4).
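The stub-matching construction described above can be sketched as follows (a minimal version that, like the basic configuration model, may produce self-loops and multi-edges; `degree_seq` is an illustrative input, not from the paper's code):

```python
import random

def configuration_model(degree_seq, rng=random):
    # one stub (half-edge) per unit of degree; a uniformly random
    # matching on the stubs gives the edge list
    stubs = [v for v, k in enumerate(degree_seq) for _ in range(k)]
    if len(stubs) % 2 != 0:
        raise ValueError("total degree must be even")
    rng.shuffle(stubs)
    return [(stubs[i], stubs[i + 1]) for i in range(0, len(stubs), 2)]

# k_1 from the simulation: 6 vertices, all of degree 2
edges = configuration_model([2] * 6)
```

By construction, every vertex appears in the edge list exactly as many times as its degree (self-loops counting twice).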

For the network at time t, we use an adjacency matrix M_{t} to represent the network, with element (i,j) equal to 1 if vertices i and j are connected, and 0 otherwise. We consider the dissimilarity defined as the number of different entries normalized by the geometric mean of the total edges of the two networks, \|M_{i}-M_{j}\|_{F}/\left(\|M_{i}\|_{F}\cdot\|M_{j}\|_{F}\right)^{1/2}, where \|\cdot\|_{F} is the Frobenius norm of a matrix. We set C_{0} to be the nearest neighbor link as the similarity graph for our new methods.
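This dissimilarity is simple to compute from the adjacency matrices; a minimal sketch (pure Python, with two small illustrative matrices not taken from the simulation) is:

```python
import math

def frobenius(M):
    # Frobenius norm of a matrix given as a list of rows
    return math.sqrt(sum(x * x for row in M for x in row))

def dissimilarity(Mi, Mj):
    # ||Mi - Mj||_F / (||Mi||_F * ||Mj||_F)^{1/2}
    diff = [[a - b for a, b in zip(ri, rj)] for ri, rj in zip(Mi, Mj)]
    return frobenius(diff) / math.sqrt(frobenius(Mi) * frobenius(Mj))

# two 3-node adjacency matrices differing in one undirected edge
A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
B = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
d = dissimilarity(A, B)
```

Note the measure is symmetric in its arguments and zero only for identical networks.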

Table 4: Estimated power of the new tests
          Z_{w,(a)}(t)   Z_{w,(u)}(t)   S_{(a)}(t)     S_{(u)}(t)     M_{(a)}(t)     M_{(u)}(t)
  (S1)    0.98 (0.96)    0.96 (0.89)    0.96 (0.94)    0.95 (0.89)    0.96 (0.95)    0.95 (0.88)
  (S2)    0.88 (0.83)    0.89 (0.85)    0.90 (0.84)    0.91 (0.85)    0.89 (0.83)    0.90 (0.87)
  (S3)    0.86 (0.83)    0.65 (0.59)    0.85 (0.83)    0.85 (0.82)    0.81 (0.80)    0.70 (0.64)
  (S4)    0.81 (0.81)    0.73 (0.70)    0.86 (0.84)    0.93 (0.91)    0.84 (0.81)    0.86 (0.82)

Table 4 shows the proportion of null rejections, out of 100 trials, at the 0.05 significance level for each method. To assess the accuracy of estimating the location of the change-point, the proportion of trials in which the null hypothesis is rejected and the estimated change-point is within 20 of the true change-point is given in parentheses. We see that all tests work well in the balanced equal degree change case, while the extended generalized edge-count test outperforms the others in the random degree change cases. In this simulation, an equal degree change can be viewed as a mean change and a random degree change as a change in both location and scale; hence the extended generalized and max-type edge-count tests perform well in this general scenario. When the change-point is not at the center of the sequence, the extended weighted edge-count test outperforms, which complies with what we would expect. We see that the extended generalized and max-type edge-count tests work well when the change is in both mean and variance in the unbalanced sample-size case.

6 Phone-call network data analysis

Here, we apply the new tests to the phone-call network dataset mentioned in Section 1 in detail. The MIT Media Laboratory conducted a study following 87 subjects who used mobile phones with a pre-installed device that records call logs. The study lasted 330 days, from July 2004 to June 2005 (Eagle et al., 2009). Given the richness of this dataset, one question of interest is whether there is any change in the phone-call pattern among the subjects over time. This can be viewed as a change of friendship over time.

We bin the phone calls by day and construct 330 networks in total, with the 87 subjects as nodes. We encode each network by the adjacency matrix B_{t}, with element (i,j) equal to 1 if subject i called subject j on day t and 0 otherwise. We define the distance measure as the number of different entries, i.e., d(B_{i},B_{j})=\|B_{i}-B_{j}\|_{F}^{2}, where \|\cdot\|_{F} is the Frobenius norm of a matrix. Due to the repeated observations, many equal distances exist among the distinct values. We set C_{0} to be the nearest neighbor link in this example.
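For 0-1 adjacency matrices, \|B_{i}-B_{j}\|_{F}^{2} indeed equals the number of differing entries, since each entry of B_{i}-B_{j} lies in \{-1,0,1\}; a quick check (with two tiny illustrative matrices) is:

```python
def d_count(Bi, Bj):
    # squared Frobenius norm of the difference of two 0/1 matrices:
    # each squared difference is 0 or 1, so the sum counts differing entries
    return sum((a - b) ** 2 for ri, rj in zip(Bi, Bj) for a, b in zip(ri, rj))

A = [[0, 1], [0, 0]]
B = [[1, 1], [0, 1]]  # differs from A in exactly two entries
```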

We apply the single change-point detection method based on the extended generalized scan statistic to the phone-call network dataset recursively so as to detect all possible change-points. As this dataset contains a lot of noise, we focus on estimated change-points with p-values less than 0.001. Figure 2 shows the change-points estimated by the averaging approach and by the union approach. The two approaches produce quite a number of similar change-points. We define a change-point \hat{\tau} to be detected by both approaches if each of them finds a change-point within the set [\hat{\tau}-2,\hat{\tau}+2]; we then take the location of the shared change-point to be the floor of the average of the two change-points detected by the two approaches.

Since we do not know the underlying distribution of the dataset, we perform a further sanity check using the distance matrix of the whole period (Figure 3). It is evident that there are signals in this dataset, and they match well with the results from the new tests.

(a): 53 68 97 140 164 247 289
(u): 28 52 66 79 140 164 247 293 323
Shared: 52 67 140 164 247 291
Figure 2: Estimated change-points, in the order detected, for the averaging and union approaches.
Figure 3: Heatmap of the L_{1} norm distance matrix corresponding to the 330 networks. Red, blue, and purple triangles indicate change-points estimated by the union approach, by the averaging approach, and their shared change-points, respectively.

7 Discussion

In general, the two approaches work similarly, though they inevitably produce different results at times. A brief comparison of the two approaches is provided in Supplement O. The proposed methods detect the most significant single change-point or changed-interval in the sequence. If the sequence has more than one change-point, the proposed methods can be applied recursively using techniques such as binary segmentation, circular binary segmentation, or wild binary segmentation (Vostrikova, 1981; Olshen et al., 2004; Fryzlewicz, 2014).
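The recursive use via plain binary segmentation can be sketched as follows; here `detect_single` is a hypothetical stand-in for one of the proposed scan tests (returning a p-value and an estimated change-point within the segment), and the toy detector is for illustration only:

```python
def binary_segmentation(seq, detect_single, alpha=0.05, min_len=20):
    # recursively apply a single change-point test to each segment,
    # splitting at every significant estimated change-point
    found = []

    def recurse(lo, hi):
        if hi - lo < min_len:
            return
        p_value, tau = detect_single(seq[lo:hi])  # tau relative to the segment
        if p_value < alpha:
            found.append(lo + tau)
            recurse(lo, lo + tau)
            recurse(lo + tau, hi)

    recurse(0, len(seq))
    return sorted(found)

# toy stand-in for a scan test: flags the first 0 -> 1 switch
def toy_detect(s):
    if 0 in s and 1 in s:
        return 0.0, s.index(1)
    return 1.0, 0

cps = binary_segmentation([0] * 50 + [1] * 50, toy_detect, min_len=10)
```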

Acknowledgement

Hoseung Song and Hao Chen are supported in part by NSF awards DMS-1513653 and DMS-1848579.

References

  • Chen, H. & Zhang, N. (2015). Graph-based change-point detection. The Annals of Statistics 43, 139–176.
  • Chen, H. & Zhang, N. R. (2013). Graph-based tests for two-sample comparisons of categorical data. Statistica Sinica, 1479–1503.
  • Chen, J. & Gupta, A. K. (2011). Parametric Statistical Change Point Analysis: With Applications to Genetics, Medicine, and Finance. Springer Science & Business Media.
  • Chen, L. H. & Shao, Q.-M. (2005). Stein's method for normal approximation. An Introduction to Stein's Method 4, 1–59.
  • Chu, L. & Chen, H. (2019). Asymptotic distribution-free change-point detection for multivariate and non-Euclidean data. The Annals of Statistics 47, 382–414.
  • Eagle, N., Pentland, A. S. & Lazer, D. (2009). Inferring friendship network structure by using mobile phone data. Proceedings of the National Academy of Sciences 106, 15274–15278.
  • Fryzlewicz, P. (2014). Wild binary segmentation for multiple change-point detection. The Annals of Statistics 42, 2243–2281.
  • Harchaoui, Z., Moulines, E. & Bach, F. R. (2009). Kernel change-point analysis. In Advances in Neural Information Processing Systems.
  • Lung-Yut-Fong, A., Lévy-Leduc, C. & Cappé, O. (2011). Homogeneity and change-point detection tests for multivariate data using rank statistics. arXiv preprint arXiv:1107.1971.
  • Matteson, D. S. & James, N. A. (2014). A nonparametric approach for multiple change point analysis of multivariate data. Journal of the American Statistical Association 109, 334–345.
  • Olshen, A. B., Venkatraman, E., Lucito, R. & Wigler, M. (2004). Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics 5, 557–572.
  • Siegmund, D. & Yakir, B. (2007). The Statistics of Gene Mapping. Springer Science & Business Media.
  • Vostrikova, L. Y. (1981). Detecting "disorder" in multidimensional random processes. In Doklady Akademii Nauk, vol. 259. Russian Academy of Sciences.
  • Wang, G., Zou, C. & Yin, G. (2017). Change-point detection in multinomial data with a large number of categories. The Annals of Statistics.
  • Wang, T. & Samworth, R. J. (2018). High dimensional change point estimation via sparse projection. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 80, 57–83.
  • Zhang, J. & Chen, H. (2017). Graph-based two-sample tests for discrete data. arXiv preprint arXiv:1711.04349.
  • Zhang, N. R., Siegmund, D. O., Ji, H. & Li, J. Z. (2010). Detecting simultaneous changepoints in multiple sequences. Biometrika.
  • Zhu, Y. & Chen, H. (2021). Limiting distributions of graph-based test statistics. arXiv preprint arXiv:2108.07446.