
Valid and Exact Statistical Inference for Multi-dimensional Multiple Change-Points by Selective Inference

Ryota Sugiyama (Nagoya Institute of Technology), Hiroki Toda (Nagoya Institute of Technology), Vo Nguyen Le Duy (Nagoya Institute of Technology, RIKEN), Yu Inatsu (Nagoya Institute of Technology), Ichiro Takeuchi (Nagoya Institute of Technology, RIKEN; corresponding author: [email protected])
Abstract

In this paper, we study statistical inference of change-points (CPs) in a multi-dimensional sequence. In CP detection from a multi-dimensional sequence, it is often desirable not only to detect the location of each change, but also to identify the subset of the components in which the change occurs. Several algorithms have been proposed for such problems, but no valid exact inference method has been established to evaluate the statistical reliability of the detected locations and components. In this study, we propose a method that can guarantee the statistical reliability of both the locations and the components of the detected changes. We demonstrate the effectiveness of the proposed method by applying it to the problems of genomic abnormality identification and human behavior analysis.

1 Introduction

In this paper, we consider the multi-dimensional multiple change-point (CP) detection problem (Truong et al., 2018; Sharma et al., 2016; Enikeeva et al., 2019; Cho, 2016; Horváth and Hušková, 2012; Jirak, 2015; Cho and Fryzlewicz, 2015; Zhang et al., 2010; Vert and Bleakley, 2010; Lévy-Leduc and Roueff, 2009; Bleakley and Vert, 2011). We denote a multi-dimensional sequence dataset by a $D\times N$ matrix $X\in\mathbb{R}^{D\times N}$, where each column is called a location and each row is called a component. Each location is represented by a $D$-dimensional vector, whereas each component is represented by a one-dimensional sequence of length $N$. We consider the case where changes exist at multiple locations and in multiple components. Multi-dimensional multiple CP detection is an important problem in various fields such as computational biology (Takeuchi et al., 2009b) and signal processing (Hammerla et al., 2016).

We focus on the problem of detecting both location and component when there are changes in multiple components at each of the multiple locations. For example, in the field of computational biology, the identification of genome copy-number abnormalities is formulated as a CP detection problem, where it is important to find changes that are commonly observed in multiple patients. This problem can be interpreted as the problem of detecting both the location (genome position of the genome abnormality) and the component (the patients with the genomic abnormality) of multiple CPs.

Our main contribution in this paper is NOT to propose an algorithm for multi-dimensional multiple CP detection, but to develop a method for evaluating the reliability of the detected CPs. Figure 1 shows an example of the problem we consider in this paper. We develop a novel method to quantify the statistical reliability of multiple CP locations and components in the form of $p$-values when they are detected by a certain CP detection algorithm. The importance of evaluating the reliability of AI and machine learning (ML) results has been actively discussed in the context of reliable AI/ML, and this study can be interpreted as a contribution toward reliable AI/ML.

Refer to caption
Figure 1: An example of the problem considered in this paper. The left and the right plots indicate the results of CP detection and CP inference, respectively. In this example, six CPs (A-F) are detected by a CP detection method, but the CP inference method that we propose in this paper suggests that the changes are statistically significant ($p$-value $< 0.05$) at A, B, C, and D but not at E and F ($p$-value $\geq 0.05$).

Reliability evaluation of the detected CPs is inherently difficult. This is because, in most CP detection problems, only a single dataset is given, and the CP detection and evaluation must be done on the same dataset. Therefore, locations and components of the CPs are selected to fit the observed dataset, which suggests that the selection bias must be properly considered. Unlike supervised learning such as regression and classification, it is not possible to use data splitting or cross-validation.

In traditional statistics, reliability evaluation of CPs has been based on asymptotic theory, i.e., the case where $N\to\infty$ (Enikeeva et al., 2019; Cho, 2016; Horváth and Hušková, 2012; Zhang et al., 2010; Jirak, 2015; Harchaoui et al., 2009; Korostelev and Lepski, 2008). The asymptotic theory of CPs is built on various regularity assumptions, such as sufficiently weak sequential correlations, and it has been pointed out that unless these regularity assumptions are met, the evaluation becomes inaccurate. Furthermore, the asymptotic theory of CPs has been developed for relatively simple problem settings, such as a single CP or a one-dimensional sequence, and it cannot easily be extended to the multi-dimensional multiple CP detection problem.

In this paper, we propose a method for valid and exact (non-asymptotic) statistical inference on CPs by utilizing the framework of conditional selective inference (SI), which has recently received much attention in the context of statistical inference on feature selection (Lee et al., 2016; Taylor and Tibshirani, 2015; Fithian et al., 2014; Tibshirani et al., 2016; Loftus and Taylor, 2014, 2015; Suzumura et al., 2017; Terada and Shimodaira, 2019; Yamada et al., 2018b, a; Le Duy and Takeuchi, 2021; Sugiyama et al., 2021). In the past few years, there have been attempts to extend the conditional SI framework to various problems (Lee et al., 2015; Chen and Bien, 2019; Shimodaira and Terada, 2019; Tanizaki et al., 2020; Duy et al., 2020a; Tsukurimichi et al., 2021; Duy and Takeuchi, 2021), and several researchers have started to utilize the conditional SI framework to evaluate the reliability of CPs (Umezu and Takeuchi, 2017; Hyun et al., 2018; Jewell, 2020; Duy et al., 2020b). However, to the best of our knowledge, there is no such study for multi-dimensional multiple CPs. Furthermore, if we simply extend the methods of these existing studies to the multi-dimensional multiple CP detection problem, the power of the inference turns out to be low. Therefore, we propose a new approach based on conditional SI, which is more powerful than naive extensions of existing approaches.

2 Problem Statement

Dataset and statistical model

Consider a dataset in the form of a multi-dimensional sequence with length $N$ and dimension $D$. We denote the dataset as $X\in\mathbb{R}^{D\times N}$, where each row is called a component and each column is called a location. The $i^{\rm th}$ component, denoted as $X_{i,:}\in\mathbb{R}^{N}$ for $i\in[D]$, is a one-dimensional sequence of length $N$. The $j^{\rm th}$ location, denoted as $X_{:,j}\in\mathbb{R}^{D}$ for $j\in[N]$, is a vector of length $D$. Furthermore, we denote the $ND$-dimensional vector obtained by concatenating the rows of $X$ as ${\rm vec}(X)=[X_{1,:}^{\top},\ldots,X_{D,:}^{\top}]^{\top}\in\mathbb{R}^{ND}$, where ${\rm vec}(\cdot)$ is the operator that transforms a matrix into a vector by concatenating its rows.

We regard the dataset $X$ as a random sample from the statistical model

{\rm vec}(X)\sim N({\rm vec}(M),\Xi\otimes\Sigma), \qquad (1)

where $M\in\mathbb{R}^{D\times N}$ is the matrix containing the mean values of the elements of $X$, $\Xi\in\mathbb{R}^{D\times D}$ is the covariance matrix representing correlations between components, and $\Sigma\in\mathbb{R}^{N\times N}$ is the covariance matrix representing correlations between locations. We do not assume any specific structure for the mean matrix $M$ and treat it as unknown. On the other hand, the covariance matrices $\Xi$ and $\Sigma$ are assumed to be estimable from an independent dataset in which it is known that no change points exist.
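To make the model concrete, the following minimal sketch (our own, NumPy only; the AR(1) choice for $\Sigma$ and the identity choice for $\Xi$ are illustrative assumptions, not the paper's setting) draws one dataset from model (1) using the row-concatenated definition of ${\rm vec}(X)$ above.

import numpy as np

rng = np.random.default_rng(0)
D, N = 3, 20
M = np.zeros((D, N))                      # unknown mean matrix (zero here for illustration)
Xi = np.eye(D)                            # D x D component covariance
rho = 0.5
Sigma = rho ** np.abs(np.subtract.outer(np.arange(N), np.arange(N)))  # N x N AR(1) covariance

cov = np.kron(Xi, Sigma)                  # covariance of vec(X) in row-concatenated order
vecX = rng.multivariate_normal(M.reshape(-1), cov)
X = vecX.reshape(D, N)                    # back to the D x N data matrix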

Detection of locations and components of multiple CPs

We consider the problem of identifying both the locations and the components of multiple CPs in the dataset $X$ when changes exist in a subset of the $D$ components at each of the multiple CP locations. Let $K$ be the number of detected CPs. The locations of the $K$ detected CPs are denoted as $1\leq\tau_{1},\ldots,\tau_{K}<N$, and the subsets of the components in which changes are present are denoted as $\bm{\theta}^{1},\ldots,\bm{\theta}^{K}$, where $\bm{\theta}^{k}$, $k\in[K]$, is the set of the components in which changes exist at the location $\tau_{k}$. The elements of $\bm{\theta}^{k}$ are denoted as $\theta^{k}_{1},\ldots,\theta^{k}_{|\bm{\theta}^{k}|}$. For notational simplicity, we also define $\tau_{0}=0$, $\tau_{K+1}=N$ and $\bm{\theta}^{0}=\emptyset$, $\bm{\theta}^{K+1}=\emptyset$. Given a multi-dimensional sequence dataset $X$, the task is to identify both the locations ${\mathcal{T}}=(\tau_{1},\ldots,\tau_{K})$ and the sets of the components $\Theta=(\bm{\theta}^{1},\ldots,\bm{\theta}^{K})$ of the detected multiple CPs (see Fig. 1).

As a CP detection method for both locations and components, we employ a method combining a dynamic-programming (DP)-based approach (Bellman, 1966; Guthery, 1974; Bai and Perron, 2003; Lavielle, 2005; Rigaill, 2010) and a scan-statistic (SS)-based approach (Enikeeva et al., 2019). This CP detection method is formulated as the following optimization problem:

(\hat{{\mathcal{T}}},\hat{\Theta})=\mathop{\rm argmax}_{\begin{subarray}{c}{\mathcal{T}}=(\tau_{1},\ldots,\tau_{K}),\\ \Theta=(\bm{\theta}^{1},\ldots,\bm{\theta}^{K})\end{subarray}}\sum_{k\in[K]}C_{\tau_{k}-L+1:\tau_{k}+L}^{(\bm{\theta}^{k})}(\tau_{k},W) \qquad (2)

where

C_{s:e}^{(\bm{\theta})}(j,W) =\frac{\|{\rm diag}(\bm{1}_{D,\bm{\theta}})Z_{s:e}^{\prime}(j,W)\|_{2}^{2}-|\bm{\theta}|}{\sqrt{2|\bm{\theta}|}}, \qquad (3a)
Z_{s:e}^{\prime}(j,W) =\sum_{\Delta=-W}^{W}Z_{s:e}(j+\Delta), \qquad (3b)
Z_{s:e}(j) =V_{s,j,e}\left(\frac{\sum_{j^{\prime}=s}^{j}X_{:,j^{\prime}}}{j-s+1}-\frac{\sum_{j^{\prime}=j+1}^{e}X_{:,j^{\prime}}}{e-j}\right), \qquad (3c)
V_{s,j,e} =\sqrt{\frac{(j-s+1)(e-j)}{e-s+1}}. \qquad (3d)

Here, $L$ and $W$ are hyperparameters, which are explained below, $\bm{1}_{D,\bm{\theta}}$ is the $D$-dimensional binary vector in which the $i^{\rm th}$ element is $1$ if $i\in\bm{\theta}$ and $0$ otherwise, and ${\rm diag}(\bm{1}_{D,\bm{\theta}})\in\mathbb{R}^{D\times D}$ is the matrix with $\bm{1}_{D,\bm{\theta}}$ on its diagonal. The CP detection method in (2) is motivated as follows. As a building block, we employ the scan statistic proposed in Enikeeva et al. (2019), which was introduced for the problem of detecting a single CP in a multi-dimensional sequence111 Double CUSUM (Cho, 2016) is another CP detection method that can select not only the location but also the subset of the components of a change. We focus on the scan statistic in this study, but conjecture that our SI framework can also be applied to double CUSUM. . As a criterion for a single CP at the location $\tau\in[N]$ and the components $\bm{\theta}\subseteq[D]$, the scan statistic is defined as

\frac{\|{\rm diag}(\bm{1}_{D,\bm{\theta}})Z_{s:e}(\tau)\|^{2}-|\bm{\theta}|}{\sqrt{2|\bm{\theta}|}}.

We extend this scan statistic in several directions in (2). First, we extend it to handle multiple CPs, i.e., multiple locations $\tau_{1},\ldots,\tau_{K}$ and the corresponding sets of components $\bm{\theta}^{1},\ldots,\bm{\theta}^{K}$. Next, for each component $\theta^{k}_{h}\in\bm{\theta}^{k}$, $h\in[|\bm{\theta}^{k}|]$, we consider the mean signal difference between the $L$ points before and after the CP location $\tau_{k}$, i.e., the mean difference between $X_{h,\tau_{k}-L+1:\tau_{k}}$ and $X_{h,\tau_{k}+1:\tau_{k}+L}$. This is to avoid the influence of neighboring CPs when evaluating the statistical reliability of each location $\tau_{k}$ and component $\theta^{k}_{h}$. Here, $L$ is treated as a hyperparameter. Furthermore, the reason why we use $Z_{s:e}^{\prime}(\tau)$ instead of $Z_{s:e}(\tau)$ in (3a) is that we consider changes to be common among components if they occur within a range of $\pm W$ locations, rather than requiring them to occur at exactly the same location. Here, $W$ is also treated as a hyperparameter.
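As an illustration of (3a)-(3d), the following rough sketch (our own; the 1-based location indices of the paper are converted to NumPy slices internally, and `theta` holds 0-based component indices) computes the per-CP score that is summed in (2).

import numpy as np

def Z(X, s, j, e):
    """Z_{s:e}(j): weighted mean-difference vector over components, eqs. (3c)-(3d)."""
    V = np.sqrt((j - s + 1) * (e - j) / (e - s + 1))
    left = X[:, s - 1:j].mean(axis=1)       # columns s..j (1-based, inclusive)
    right = X[:, j:e].mean(axis=1)          # columns j+1..e
    return V * (left - right)

def Z_prime(X, s, e, j, W):
    """Z'_{s:e}(j, W): sum of Z over a +/- W window around j, eq. (3b)."""
    return sum(Z(X, s, j + d, e) for d in range(-W, W + 1))

def score(X, s, e, j, W, theta):
    """C_{s:e}^{(theta)}(j, W): scan-statistic score of the component set theta, eq. (3a)."""
    z = Z_prime(X, s, e, j, W)
    return (np.sum(z[list(theta)] ** 2) - len(theta)) / np.sqrt(2 * len(theta))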

The optimal solution of (2) can be obtained efficiently by combining DP with the efficient scan-statistic computation described in Enikeeva et al. (2019). The pseudo-code for solving the optimization problem (2) is presented in Algorithm 1 (the SortComp and UpdateDP functions are presented in Appendix A).

Algorithm 1 CP detection algorithm for (2)
0:  dataset $X\in\mathbb{R}^{D\times N}$ and parameters $K,L,W$
  Set $F_{(0,j)}=0,\mathcal{T}_{(0,j)}=(0),\Theta_{(0,j)}=\phi$ for $j\in[N-LK]$
  Set $\mathcal{S}_{:,m}=$SortComp($m$) for $m\in[L,N-L]$
  for $k=1$ to $K-1$ do
     for $j=L(k+1)$ to $N-L(K-k)$ do
        $F_{(k,j)},\hat{\tau}_{k}^{(k,j)},\hat{\bm{\theta}}_{k}^{(k,j)}=\text{UpdateDP}(j,k,F_{(k-1,Lk:j-L)},\mathcal{S})$
        $\mathcal{T}_{(k,j)}={\rm concat}(\mathcal{T}_{(k-1,\hat{\tau}_{k}^{(k,j)})},\hat{\tau}_{k}^{(k,j)})$
        $\Theta_{(k,j)}={\rm concat}(\Theta_{(k-1,\hat{\tau}_{k}^{(k,j)})},\hat{\bm{\theta}}_{k}^{(k,j)})$
     end for
  end for
  $F_{(K,N)},\hat{\tau}_{K}^{(K,N)},\hat{\bm{\theta}}_{K}^{(K,N)}=\text{UpdateDP}(N,K,F_{(K-1,LK:N-L)},\mathcal{S})$
  $\mathcal{T}_{(K,N)}=(\mathcal{T}_{(K-1,\hat{\tau}_{K}^{(K,N)})},\hat{\tau}_{K}^{(K,N)},N)$
  $\Theta_{(K,N)}={\rm concat}(\Theta_{(K-1,\hat{\tau}_{K}^{(K,N)})},\hat{\bm{\theta}}_{K}^{(K,N)},\phi)$
  $\{\tau_{k}\}_{k\in[K]}=\mathcal{T}_{(K,N)},\{\bm{\theta}^{k}\}_{k\in[K]}=\Theta_{(K,N)}$
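The following compact sketch (unoptimized; our own reading of Algorithm 1, reusing the `Z_prime` and `score` helpers sketched after (3a)-(3d); bookkeeping details may differ from the authors' implementation) illustrates how the DP combines the per-location scores, with the component sets restricted to the top-$d$ components in the SortComp ordering.

import numpy as np

def detect_cps(X, K, L, W):
    """Return (locations, component sets) maximizing the objective in (2); a sketch."""
    D, N = X.shape

    def sorted_comps(m):
        # SortComp: order components by the squared entries of Z'(m, W), descending.
        z = Z_prime(X, m - L + 1, m + L, m, W)
        return np.argsort(-z ** 2)

    best = {}                                      # best[(k, j)] = (value, taus, thetas)
    for j in range(L, N + 1):
        best[(0, j)] = (0.0, (), ())
    for k in range(1, K + 1):
        j_range = [N] if k == K else range(L * (k + 1), N - L * (K - k) + 1)
        for j in j_range:
            cand = []
            for m in range(L * k, j - L + 1):      # candidate k-th CP location
                order = sorted_comps(m)
                for d in range(1, D + 1):          # top-d components at location m
                    theta = tuple(int(i) for i in order[:d])
                    val = best[(k - 1, m)][0] + score(X, m - L + 1, m + L, m, W, theta)
                    cand.append((val, m, theta))
            val, m, theta = max(cand)              # UpdateDP: best predecessor and components
            prev = best[(k - 1, m)]
            best[(k, j)] = (val, prev[1] + (m,), prev[2] + (theta,))
    _, taus, thetas = best[(K, N)]
    return taus, thetas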

Statistical Inference on Detected Locations and Components

The main goal of this paper is to develop a valid and exact (non-asymptotic) statistical inference method for the detected locations and components of multiple CPs. For each detected location $\tau_{k}$ and component $\theta^{k}_{h}$, $k\in[K],h\in[|\bm{\theta}^{k}|]$, we consider the statistical test with the following null and alternative hypotheses:

{\rm H}_{0}^{(k,h)}:\frac{1}{L-W}\sum_{j=\tau_{k}-L+1}^{\tau_{k}-W}M_{\theta_{h}^{k},j}=\frac{1}{L-W}\sum_{j=\tau_{k}+W+1}^{\tau_{k}+L}M_{\theta_{h}^{k},j} \qquad (4a)
{\rm H}_{1}^{(k,h)}:\frac{1}{L-W}\sum_{j=\tau_{k}-L+1}^{\tau_{k}-W}M_{\theta_{h}^{k},j}\neq\frac{1}{L-W}\sum_{j=\tau_{k}+W+1}^{\tau_{k}+L}M_{\theta_{h}^{k},j} \qquad (4b)

This test simply asks whether the averages of the $L-W$ points immediately before and after the window $\tau_{k}\pm W$ are equal or not in the component $\theta_{h}^{k}$. Note that, throughout the paper, we do NOT assume that the mean of each component $M_{i,:},i\in[D]$ is a piecewise-constant function; the proposed statistical inference method in §3 is valid in the sense that the type I error is properly controlled at the specified significance level $\alpha$ (e.g., $\alpha=0.05$). While existing statistical inference methods for CPs often require assumptions on the true structure (such as piecewise constantness, piecewise linearity, etc.), the conditional SI framework in this paper allows us to be free from assumptions about the true structure of $M_{i,j},i\in[D],j\in[N]$, although the power of the method depends on this unknown structure.

For the statistical test in (4), a reasonable test statistic is the observed mean signal difference

\psi_{k,h}=\frac{1}{L-W}\left(\sum_{j=\tau_{k}-L+1}^{\tau_{k}-W}X_{\theta^{k}_{h},j}-\sum_{j=\tau_{k}+W+1}^{\tau_{k}+L}X_{\theta^{k}_{h},j}\right).

This test statistic can be written as

\psi_{k,h}=\bm{\eta}_{k,h}^{\top}{\rm vec}(X),

by defining an NDND-dimensional vector

\bm{\eta}_{k,h}=\frac{1}{L-W}\bm{1}_{\theta^{k}_{h}}\otimes\left(\bm{1}_{\tau_{k}-L+1:\tau_{k}-W}-\bm{1}_{\tau_{k}+W+1:\tau_{k}+L}\right),

where $\bm{1}_{\theta^{k}_{h}}$ is the $D$-dimensional binary vector whose $\theta_{h}^{k}$-th element is 1 and whose other elements are 0, and $\bm{1}_{s:e}$ is the $N$-dimensional binary vector whose $j^{\rm th}$ element is 1 if $j\in[s:e]$ and 0 otherwise. If both the CP location $\tau_{k}$ and the CP component $\theta^{k}_{h}$ were known before looking at the data, then, since ${\rm vec}(X)$ follows the multivariate normal distribution in (1), the test statistic $\bm{\eta}_{k,h}^{\top}{\rm vec}(X)$ would follow a normal distribution under the null hypothesis (4a). Unfortunately, however, because the CP location $\tau_{k}$ and the CP component $\theta^{k}_{h}$ are actually selected by looking at the observed dataset $X^{\text{obs}}$, it is highly difficult to derive the sampling distribution of the test statistic. In the literature on traditional statistical CP detection, in order to mitigate this difficulty, only simple problem settings, e.g., with a single location or/and a single component, have been studied by relying on asymptotic approximation with $N\to\infty$ or/and $D\to\infty$. However, these asymptotic approaches are not available for complex problem settings such as (2), and are only applicable when certain regularity assumptions on the sequential correlation are satisfied.
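For concreteness, the following small sketch (our own helper; a 1-based $\tau_k$ and a 0-based component index are illustrative conventions) builds the contrast vector $\bm{\eta}_{k,h}$ so that $\psi_{k,h}=\bm{\eta}_{k,h}^{\top}{\rm vec}(X)$ reproduces the mean-signal difference above.

import numpy as np

def make_eta(D, N, tau_k, comp, L, W):
    """eta_{k,h} for component index `comp` (0-based) and CP location tau_k (1-based)."""
    e_comp = np.zeros(D)
    e_comp[comp] = 1.0                           # the vector 1_{theta_h^k}
    w = np.zeros(N)
    w[tau_k - L:tau_k - W] = 1.0                 # locations tau_k-L+1 .. tau_k-W
    w[tau_k + W:tau_k + L] = -1.0                # locations tau_k+W+1 .. tau_k+L
    return np.kron(e_comp, w) / (L - W)          # length D*N, row-concatenated order

# psi = make_eta(D, N, tau_k, comp, L, W) @ X.reshape(-1)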

Conditional SI

To conduct exact (non-asymptotic) statistical inference on the locations and the components of the detected CPs, we employ conditional SI. The basic idea of conditional SI is to make inference based on the sampling distribution conditional on the selected hypothesis (Lee et al., 2016; Fithian et al., 2014). Let us denote the set of pairs of locations and components of the detected CPs obtained by applying Algorithm 1 to the dataset $X$ as

{\mathcal{M}}(X)=\{(\tau_{k},\theta_{h}^{k})\}_{k\in[K],h\in[|\bm{\theta}^{k}|]}.

To apply the conditional SI framework to our problem, we consider the sampling distribution of the test statistic $\bm{\eta}_{k,h}^{\top}{\rm vec}(X)$ conditional on the selection event that a change is detected at location $\tau_{k}$ in component $\theta^{k}_{h}$, i.e.,

\bm{\eta}_{k,h}^{\top}{\rm vec}(X)\mid(\tau_{k},\theta^{k}_{h})\in{\mathcal{M}}(X).

Then, for testing the statistical significance of the change, we introduce the so-called selective $p$-value $p_{k,h}$, which satisfies the following sampling property:

\mathbb{P}_{{\rm H_{0}}^{(k,h)}}\left(p_{k,h}\leq\alpha\mid(\tau_{k},\theta^{k}_{h})\in{\mathcal{M}}(X)\right)=\alpha,~{}\forall\alpha\in(0,1), \qquad (5)

where $\alpha\in(0,1)$ is the significance level of the statistical test (e.g., $\alpha=0.05$)222 Note that a valid $p$-value in a statistical test should satisfy $\mathbb{P}(p\leq\alpha)=\alpha~{}\forall\alpha\in(0,1)$ under the null hypothesis so that the probability of a type I error (false positive) coincides with the significance level $\alpha$. . To define the selective $p$-value, with a slight abuse of notation, we distinguish between the random variable $X$ and the observed dataset $X^{\rm obs}$. The selective $p$-value is defined as

p_{k,h}=\mathbb{P}_{{\rm H}_{0}^{(k,h)}}\left(|\bm{\eta}_{k,h}^{\top}{\rm vec}(X)|\geq|\bm{\eta}_{k,h}^{\top}{\rm vec}(X^{\rm obs})|~{}\Bigg{|}~{}\begin{array}[]{l}(\tau_{k},\theta_{h}^{k})\in{\mathcal{M}}(X),\\ \bm{q}_{k,h}(X)=\bm{q}_{k,h}(X^{\rm obs})\end{array}\right), \qquad (8)

where $\bm{q}_{k,h}(X)$ is the nuisance component defined as

\bm{q}_{k,h}(X)=(I_{ND}-\bm{c}\bm{\eta}_{k,h}^{\top}){\rm vec}(X)~{}\text{ with }~{}\bm{c}=\frac{(\Xi\otimes\Sigma)\bm{\eta}_{k,h}}{\bm{\eta}_{k,h}^{\top}(\Xi\otimes\Sigma)\bm{\eta}_{k,h}}.

Note that the selective $p$-value depends on the nuisance component $\bm{q}_{k,h}(X)$, but the sampling property in (5) remains satisfied without conditioning on the nuisance event $\bm{q}_{k,h}(X)=\bm{q}_{k,h}(X^{\rm obs})$ because $\bm{q}_{k,h}(X)$ is independent of the test statistic $\bm{\eta}_{k,h}^{\top}{\rm vec}(X)$. The selective $(1-\alpha)$ confidence interval (CI), which satisfies the $(1-\alpha)$ coverage property conditional on the selection event $(\tau_{k},\theta_{h}^{k})\in{\mathcal{M}}(X)$ for any $\alpha\in(0,1)$, can be defined similarly. See Lee et al. (2016) and Fithian et al. (2014) for a more detailed discussion, for the proof that the selective $p$-value (8) satisfies the sampling property (5), and for how to define selective CIs.

The discussion so far indicates that, if we can compute the conditional probability in (8), we can conduct a valid exact test for each detected change at location $\tau_{k}$ in component $\theta^{k}_{h}$ for $k\in[K],h\in[|\bm{\theta}^{k}|]$. Most existing conditional SI studies consider the case where the hypothesis selection event is represented as an intersection of a set of linear or quadratic inequalities. For example, in the seminal conditional SI work of Lee et al. (2016), the selective $p$-value and CI computations are conducted by exploiting the fact that the event that some features (and their signs) are selected by the Lasso is represented as an intersection of a set of linear inequalities, i.e., the selection event is represented as a polyhedron in the sample (data) space. Unfortunately, however, the selection event in (8) is highly complicated and cannot simply be represented as an intersection of linear or quadratic inequalities. We resolve this computational challenge in the next section. Our basic idea is to consider a family of over-conditioning problems such that the conditional probability of each over-conditioning problem is tractable, and then to combine them to obtain the selective $p$-value in (8). This idea is motivated by Liu et al. (2018) and Le Duy and Takeuchi (2021), in which the authors proposed methods to increase the power of the conditional SI for the Lasso in Lee et al. (2016).

3 Proposed Method

In this section, we propose a method to compute the selective $p$-value in (8). Because we already know that the test statistic $\bm{\eta}_{k,h}^{\top}{\rm vec}(X)$ follows a normal distribution with mean $\bm{\eta}_{k,h}^{\top}{\rm vec}(M)$ and variance $\bm{\eta}_{k,h}^{\top}(\Xi\otimes\Sigma)\bm{\eta}_{k,h}$ without conditioning, the remaining task is to identify the subspace defined as

{\mathcal{X}}=\left\{{\rm vec}(X)\in\mathbb{R}^{DN}~{}\Bigg{|}~{}\begin{array}[]{l}(\tau_{k},\theta^{k}_{h})\in{\mathcal{M}}(X),\\ \bm{q}_{k,h}(X)=\bm{q}_{k,h}(X^{\rm obs})\end{array}\right\}. \qquad (11)

Unfortunately, there is no simple way to identify the subspace ${\mathcal{X}}$ because doing so amounts to solving an inverse problem of the optimization problem in (2). Therefore, we adopt the following strategy:

  1. Reformulate the inverse problem (11) as an inverse problem on a line in $\mathbb{R}^{D\times N}$;

  2. Consider a family of over-conditioning problems by adding extra conditions;

  3. Show that the inverse problem for each of the over-conditioning problems considered in step 2 can be solved analytically;

  4. Solve a sequence of inverse problems for a sequence of over-conditioning problems along the line considered in step 1.

3.1 Conditional data space on a line

It is easy to see that the condition $\bm{q}_{k,h}(X)=\bm{q}_{k,h}(X^{\rm obs})$ implies that $\{X\in\mathbb{R}^{D\times N}\mid\bm{q}_{k,h}(X)=\bm{q}_{k,h}(X^{\rm obs})\}$ is a line in $\mathbb{R}^{D\times N}$, as formally stated in the following lemma.

Lemma 1.

Let

\bm{a} =\bm{q}_{k,h}(X^{\rm obs}),
\bm{b} =\frac{(\Xi\otimes\Sigma)\bm{\eta}_{k,h}}{\bm{\eta}_{k,h}^{\top}(\Xi\otimes\Sigma)\bm{\eta}_{k,h}}.

Then,

\left\{{\rm vec}(X)\in\mathbb{R}^{DN}~{}\Bigg{|}~{}\begin{array}[]{l}(\tau_{k},\theta^{k}_{h})\in{\mathcal{M}}(X),\\ \bm{q}_{k,h}(X)=\bm{q}_{k,h}(X^{\rm obs})\end{array}\right\}
=\left\{{\rm vec}(X)=\bm{a}+\bm{b}z~{}\Bigg{|}~{}\begin{array}[]{l}(\tau_{k},\theta^{k}_{h})\in{\mathcal{M}}(X),\\ z\in\mathbb{R}\end{array}\right\}

This lemma indicates that we do not have to consider the inverse problem in the $DN$-dimensional data space, but only an inverse problem in the one-dimensional projected data space

\{z\in\mathbb{R}\mid(\tau_{k},\theta^{k}_{h})\in{\mathcal{M}}(\bm{a}+\bm{b}z)\}.

This fact was already implicitly exploited in the seminal conditional SI work of Lee et al. (2016), but was first discussed explicitly in Liu et al. (2018). Here, with a slight abuse of notation, ${\mathcal{M}}(X)$ is used interchangeably with ${\mathcal{M}}({\rm vec}(X))$.
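A direct sketch of the line parameterization in Lemma 1 (assuming a dense $ND\times ND$ covariance array `cov` for $\Xi\otimes\Sigma$ and the `eta` vector from the earlier sketch; not the authors' code) is as follows.

import numpy as np

def line_params(vecX_obs, eta, cov):
    """Return (a, b) such that vec(X) = a + b*z parameterizes the conditional data space."""
    b = cov @ eta / (eta @ cov @ eta)        # direction along which the test statistic varies
    a = vecX_obs - b * (eta @ vecX_obs)      # nuisance component q_{k,h}(X_obs)
    return a, b

# The observed dataset corresponds to z = eta @ vecX_obs; varying z sweeps all
# datasets that share the same nuisance component as X_obs.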

3.2 Over-conditioning (OC)

Consider all possible sets of pairs of locations and components that can be obtained by applying Algorithm 1 to any $D\times N$ multivariate sequence dataset $X$. Since the number of such sets is finite, we can index them as $\{{\mathcal{M}}_{u}\}_{u=1,2,\cdots}$. Let us consider

{\mathcal{U}}^{(k,h)}=\left\{u\mid(\tau_{k},\theta^{k}_{h})\in{\mathcal{M}}_{u}\right\}.

Then, the conditioning event {(τk,θhk)(X)}\{(\tau_{k},\theta^{k}_{h})\in{\mathcal{M}}(X)\} in (8) is written as

\{(\tau_{k},\theta^{k}_{h})\in{\mathcal{M}}(X)\}=\bigcup_{u\in{\mathcal{U}}^{(k,h)}}\{{\mathcal{M}}_{u}={\mathcal{M}}(X)\}.

Next, let us delve into Algorithm 1 and interpret it as consisting of the following two steps:

  1. Given $X\in\mathbb{R}^{D\times N}$, compute ${\mathcal{A}}(X)=({\mathcal{T}},\Theta,{\mathcal{S}})$;

  2. Given ${\mathcal{A}}(X)$, compute ${\mathcal{M}}(X)$,

where the notation ${\mathcal{A}}(X)=({\mathcal{T}},\Theta,{\mathcal{S}})$ indicates that ${\mathcal{T}}$, $\Theta$ and ${\mathcal{S}}$ in Algorithm 1 depend on the dataset $X$. Consider all possible triplets of tables ${\mathcal{A}}=({\mathcal{T}},\Theta,{\mathcal{S}})$ that can be obtained by applying the first step of Algorithm 1 to any $D\times N$ multivariate sequence dataset $X$. Since the number of such triplets is finite, we can index them as $\{{\mathcal{A}}_{v}\}_{v=1,2,\cdots}$. Let us now write the second step of Algorithm 1 as ${\mathcal{M}}=\phi({\mathcal{A}})$ and consider

{\mathcal{V}}_{u}^{(k,h)}=\{v\mid{\mathcal{M}}_{u}=\phi({\mathcal{A}}_{v})\},\quad u\in{\mathcal{U}}^{(k,h)}.

Then, the conditioning event {(τk,θhk)(X)}\{(\tau_{k},\theta^{k}_{h})\in{\mathcal{M}}(X)\} in (8) is written as

\{(\tau_{k},\theta^{k}_{h})\in{\mathcal{M}}(X)\}=\bigcup_{u\in{\mathcal{U}}^{(k,h)}}\left\{{\mathcal{M}}_{u}={\mathcal{M}}(X)\right\}=\bigcup_{u\in{\mathcal{U}}^{(k,h)}}\bigcup_{v\in{\mathcal{V}}_{u}^{(k,h)}}\left\{{\mathcal{A}}_{v}={\mathcal{A}}(X)\right\}.

3.3 Inverse problem for over-conditioning problem

Based on §3.1 and §3.2, we consider, for each over-conditioning problem, the following subset of the one-dimensional projected data space on the line:

{\mathcal{Z}}_{v}=\left\{z\in\mathbb{R}~{}|~{}{\mathcal{A}}_{v}={\mathcal{A}}(\bm{a}+\bm{b}z)\right\},\quad v=1,2,\cdots.
Lemma 2.

Let us define the following two sets represented as intersections of quadratic inequalities of zz:

{\mathcal{E}}_{\text{sort}}^{(m)} =\bigcap_{d=1}^{D-1}\{z~{}|~{}(\bm{b}^{\top}A_{(m,d)}\bm{b})z^{2}+2(\bm{b}^{\top}A_{(m,d)}\bm{a})z+\bm{a}^{\top}A_{(m,d)}\bm{a}\leq 0\}
{\mathcal{E}}_{\text{table}}^{(k,j)} =\bigcap_{m=Lk}^{j-L}\bigcap_{d=1}^{D}\{z~{}|~{}(\bm{b}^{\top}B_{(k,j,m,d)}\bm{b})z^{2}+2(\bm{b}^{\top}B_{(k,j,m,d)}\bm{a})z+\bm{a}^{\top}B_{(k,j,m,d)}\bm{a}+c_{(k,j,m,d)}\leq 0\}

for $A_{(m,d)},B_{(k,j,m,d)}\in\mathbb{R}^{ND\times ND}$ and $c_{(k,j,m,d)}\in\mathbb{R}$. Then, the subset $\mathcal{Z}_{v}$ is characterized as

\mathcal{Z}_{v}=\bigcap_{m=L}^{N-L}\{z~{}|~{}z\in{\mathcal{E}}_{\text{sort}}^{(m)}\}\cap\bigcap_{k=1}^{K-1}\bigcap_{j=L(k+1)}^{N-L(K-k)}\{z~{}|~{}z\in{\mathcal{E}}_{\text{table}}^{(k,j)}\}\cap\{z~{}|~{}z\in{\mathcal{E}}_{\text{table}}^{(K,N)}\}

Here, the matrices $A,B$ and the scalar $c$ require rather complex and lengthy definitions (see Appendix B for their complete definitions). Briefly speaking, $A_{(m,d)}$ is the matrix for the sorting operation required to select $\tilde{\bm{\theta}}^{(m)}$, while $B_{(k,j,m,d)}$ and $c_{(k,j,m,d)}$ are the matrix and the scalar for the DP-table update required to select $(\hat{\tau}_{k}^{(k,j)},\hat{\bm{\theta}}_{k}^{(k,j)})$ at entry $(k,j)$. The proof of Lemma 2 is presented in Appendix B. Lemma 2 indicates that the subset ${\mathcal{Z}}_{v}$ for each $v$ is characterized by an intersection of quadratic inequalities with respect to $z$ and can therefore be obtained analytically.
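Each quadratic inequality in Lemma 2 restricts $z$ to at most two intervals. The following sketch (pure NumPy; degenerate cases handled only roughly, and not the authors' implementation) converts one inequality $\alpha z^{2}+\beta z+\gamma\leq 0$ into intervals on the line; intersecting the results over all inequalities yields ${\mathcal{Z}}_{v}$.

import numpy as np

def quad_interval(alpha, beta, gamma, tol=1e-12):
    """Solve alpha*z^2 + beta*z + gamma <= 0; return a list of (lo, hi) intervals."""
    if abs(alpha) < tol:                               # effectively a linear inequality
        if abs(beta) < tol:
            return [(-np.inf, np.inf)] if gamma <= 0 else []
        r = -gamma / beta
        return [(-np.inf, r)] if beta > 0 else [(r, np.inf)]
    disc = beta ** 2 - 4 * alpha * gamma
    if disc < 0:                                       # no real roots: always one sign
        return [(-np.inf, np.inf)] if alpha < 0 else []
    r1, r2 = sorted(((-beta - np.sqrt(disc)) / (2 * alpha),
                     (-beta + np.sqrt(disc)) / (2 * alpha)))
    # alpha > 0: the parabola is below zero between the roots;
    # alpha < 0: it is below zero outside the roots (two intervals).
    return [(r1, r2)] if alpha > 0 else [(-np.inf, r1), (r2, np.inf)]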

3.4 Line search

The results in §3.1, §3.2 and §3.3 are summarized as

{\mathcal{X}} =\left\{{\rm vec}(X)\in\mathbb{R}^{DN}~{}\Bigg{|}~{}\begin{array}[]{l}(\tau_{k},\theta^{k}_{h})\in{\mathcal{M}}(X),\\ \bm{q}_{k,h}(X)=\bm{q}_{k,h}(X^{\rm obs})\end{array}\right\}
=\left\{{\rm vec}(X)=\bm{a}+\bm{b}z~{}\Bigg{|}~{}z\in\bigcup_{u\in{\mathcal{U}}^{(k,h)}}\bigcup_{v\in{\mathcal{V}}_{u}^{(k,h)}}{\mathcal{Z}}_{v}\right\}.

Our strategy is to identify ${\mathcal{Z}}=\bigcup_{u\in{\mathcal{U}}^{(k,h)}}\bigcup_{v\in{\mathcal{V}}_{u}^{(k,h)}}{\mathcal{Z}}_{v}$ by repeatedly applying Algorithm 1 to a sequence of datasets ${\rm vec}(X)=\bm{a}+\bm{b}z$ within a sufficiently wide range $z\in[z_{\rm min},z_{\rm max}]$333 In §4, we set $z_{\rm min}=-10^{6}$ and $z_{\rm max}=10^{6}$ because the probability mass outside this range is negligibly small. . Because ${\mathcal{Z}}_{v}\subseteq\mathbb{R}$ is characterized by an intersection of quadratic inequalities, it is represented either as an empty set, a single interval, or a union of two intervals in $\mathbb{R}$. For simplicity444 The case where ${\mathcal{Z}}_{v}$ is an empty set does not appear in the following discussion, and the extension to the case where ${\mathcal{Z}}_{v}$ is represented as two intervals is straightforward. , we consider the case where ${\mathcal{Z}}_{v}$ is represented by a single interval, denoted as ${\mathcal{Z}}_{v}=[L_{v},U_{v}]$.

Our method works as follows. Let $z^{(1)}=z_{\rm min}$. By applying Algorithm 1 to the dataset ${\rm vec}(X)=\bm{a}+\bm{b}z^{(1)}$, we obtain $v^{(1)}$ such that ${\mathcal{A}}_{v^{(1)}}={\mathcal{A}}(\bm{a}+\bm{b}z^{(1)})$, which implies that ${\mathcal{A}}_{v^{(1)}}={\mathcal{A}}(\bm{a}+\bm{b}z)$ for all $z\in[L_{v^{(1)}},U_{v^{(1)}}]$. Therefore, we next consider $z^{(2)}=U_{v^{(1)}}+\delta$, where $\delta$ is a very small value (e.g., $\delta=10^{-6}$). We then apply Algorithm 1 to the dataset ${\rm vec}(X)=\bm{a}+\bm{b}z^{(2)}$ and obtain $v^{(2)}$ such that ${\mathcal{A}}_{v^{(2)}}={\mathcal{A}}(\bm{a}+\bm{b}z^{(2)})$. We simply repeat this process up to $z=z_{\rm max}$. Finally, we enumerate the intervals $[L_{v^{(t)}},U_{v^{(t)}}]$ such that $v^{(t)}\in{\mathcal{V}}_{u}^{(k,h)}$ for some $u\in{\mathcal{U}}^{(k,h)}$, which enables us to obtain ${\mathcal{Z}}=\bigcup_{u\in{\mathcal{U}}^{(k,h)}}\bigcup_{v\in{\mathcal{V}}_{u}^{(k,h)}}{\mathcal{Z}}_{v}$.
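The following structural sketch summarizes this line search. Here `interval_at` is a hypothetical helper standing in for Algorithm 1 plus Lemma 2: given $z$, it runs detection on $\bm{a}+\bm{b}z$ and returns the detected (location, component) pairs together with the interval $[L_{v},U_{v}]$ on which the algorithm's intermediate quantities, and hence its output, stay unchanged; the actual method obtains this interval analytically from the quadratic inequalities.

def search_truncation_region(interval_at, target, z_min=-1e6, z_max=1e6, delta=1e-6):
    """Union of z-intervals on which `target` = (tau_k, theta_h^k) is among the detections."""
    region, z = [], z_min
    while z < z_max:
        detected_pairs, (lo, hi) = interval_at(z)   # Lemma 2 gives [L_v, U_v] analytically
        if target in detected_pairs:
            region.append((max(lo, z_min), min(hi, z_max)))
        z = hi + delta                              # step just past the current interval
    return region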

Once ${\mathcal{Z}}$ is computed, recalling that the test statistic is normally distributed, the selective $p$-value $p_{k,h}$ in (8) can be computed from a truncated normal distribution with truncation region ${\mathcal{Z}}$. Specifically,

p_{k,h} =2\min\left\{\pi_{k,h},1-\pi_{k,h}\right\},
\pi_{k,h} =1-F_{0,\bm{\eta}_{k,h}^{\top}(\Xi\otimes\Sigma)\bm{\eta}_{k,h}}^{{\mathcal{Z}}}(\bm{\eta}_{k,h}^{\top}{\rm vec}(X^{\rm obs})),

where $F_{\mu,\sigma^{2}}^{{\mathcal{Z}}}$ is the cumulative distribution function of the truncated normal distribution with mean $\mu$, variance $\sigma^{2}$, and truncation region ${\mathcal{Z}}$.
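A small sketch (SciPy's standard normal CDF; numerically naive in extreme tails, and our own code rather than the authors') of evaluating this truncated-normal $p$-value from the union of intervals returned by the line search, with null mean $0$, is given below.

import numpy as np
from scipy.stats import norm

def selective_p_value(stat_obs, sigma, intervals):
    """Two-sided selective p-value for a N(0, sigma^2) statistic truncated to `intervals`."""
    def mass(lo, hi):
        return norm.cdf(hi / sigma) - norm.cdf(lo / sigma)
    total = sum(mass(lo, hi) for lo, hi in intervals)                 # P(T in Z)
    upper = sum(mass(max(lo, stat_obs), hi)
                for lo, hi in intervals if hi > stat_obs)             # P(T > stat_obs, T in Z)
    pi = upper / total                                                # 1 - F^Z_{0, sigma^2}(stat_obs)
    return 2 * min(pi, 1 - pi)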

3.5 Discussion

In this section, we discuss the power of conditional SI. It is important to note that a conditional SI method remains valid (i.e., the type I error is properly controlled at the significance level $\alpha$) even in the case of over-conditioning (Fithian et al., 2014). In fact, in early papers on conditional SI (including the seminal paper by Lee et al. (2016)), the computations of selective $p$-values and CIs were made tractable by considering over-conditioning. In the past few years, a few methods have been proposed to resolve the over-conditioning issue and thereby improve the power of conditional SI (Liu et al., 2018; Le Duy and Takeuchi, 2021). In Liu et al. (2018), the authors pointed out that the power can be improved by conditioning only on individual features, rather than on the entire subset of features selected by the Lasso. In Le Duy and Takeuchi (2021), the authors showed how to avoid over-conditioning in Lasso conditional SI by interpreting the problem as a search problem on a line in the data space. In this study, we introduce these two ideas into the task of quantifying the statistical reliability of the locations and components of multiple CPs. Specifically, by only considering the $L$ locations before and after each CP, the statistical reliability of each location can be evaluated independently of the other locations, which results in improved power. Furthermore, the approach of searching for matching conditions along a straight line in the data space is inspired by the method proposed in Le Duy and Takeuchi (2021). In the experiments in §4, SI conditional on the over-conditioning event ${\mathcal{A}}(X)={\mathcal{A}}(X^{\rm obs})$ is denoted as “proposed-OC”, and it is compared with the minimal (i.e., optimal) conditioning case denoted as “proposed-MC”.

Optimization approaches for solving families of optimization problems parameterized by a scalar parameter are called parametric programming or homotopy methods. Parametric programming and homotopy methods are used in statistics and machine learning in the context of regularization paths. A regularization path refers to the path of the (optimal) models obtained when the regularization parameter is varied continuously. Regularization paths for various problems and various models have been studied in the literature (Osborne et al., 2000; Hastie et al., 2004; Rosset and Zhu, 2007; Bach et al., 2006; Tsuda, 2007; Lee and Scott, 2007; Takeuchi et al., 2009a; Takeuchi and Sugiyama, 2011; Karasuyama and Takeuchi, 2010; Hocking et al., 2011; Karasuyama et al., 2012; Ogawa et al., 2013; Takeuchi et al., 2013). The algorithm described in this section can be interpreted as a parametric optimization problem with respect to $z$ (see Lemma 1). In the context of conditional SI, such a parametric optimization problem was first used by Le Duy and Takeuchi (2021), and this part of our work can be interpreted as an application of their idea to the multi-dimensional CP detection problem.

4 Experiments

We demonstrate the effectiveness of the proposed method through numerical experiments. In all experiments, we set the significance level at α=0.05\alpha=0.05.

4.1 Synthetic Data Experiments

We compared the following four methods:

  • Proposed-MC: the proposed method with minimal (optimal) conditioning;

  • Proposed-OC: the proposed method with over-conditioning;

  • Data splitting (DS): an approach that divides the dataset into odd- and even-numbered locations, and uses one for CP detection and the other for inference;

  • Naive: naive inference without considering selection bias;

The goodness of each method was evaluated using the false positive rate (FPR) and the conditional power.

False Positive Rate

The FPR is the probability that the null hypothesis is rejected when it is actually true, i.e., the probability of reporting a spurious CP. In the FPR experiments, synthetic data were generated from the normal distribution ${\rm vec}(X)\sim N(\bm{0},I_{D}\otimes\Sigma)$. We considered the following two covariance matrices for $\Sigma$:

  • Independence: Σ=σ2IN\Sigma=\sigma^{2}\mathrm{I}_{N}

  • AR(1): Σ=σ2(ρ|ij|)\Sigma=\sigma^{2}(\rho^{|i-j|})

Here, we set $\sigma^{2}=1$ and $\rho=0.5$. We considered a setting in which the number of components is fixed at $D=5$ and the sequence length is varied over $N=20,30,40,50$, and a setting in which the sequence length is fixed at $N=30$ and the number of components is varied over $D=5,10,15,20$. We generated 1000 samples from $N(\bm{0},I_{D}\otimes\Sigma)$ and detected CPs with hyperparameters $K=2,L=5$ and $W=3$. Then, we chose one detected location at random and tested the changes in all the detected components. Figures 2 and 3 show the results of the FPR experiments. The naive method could not control the FPR at the significance level. Similarly, since DS is valid only when the locations are independent of each other, the FPR could not be controlled in the AR(1) case. On the other hand, the proposed methods (both proposed-MC and proposed-OC) controlled the FPR at the significance level.

Conditional Power

The power of conditional SI was evaluated by the proportion of correctly identified CP locations and components for which the null hypothesis was rejected. We set the mean matrix $M$ with $N=30,D=5$ as follows:

M1,j\displaystyle M_{1,j} ={0(1j10,21j30)Δμ(11j20),\displaystyle=\begin{cases}0&(1\leq j\leq 10,21\leq j\leq 30)\\ \Delta\mu&(11\leq j\leq 20)\end{cases},
M2,j,M3,j\displaystyle M_{2,j},M_{3,j} ={0(1j10)Δμ(11j30),\displaystyle=\begin{cases}0&(1\leq j\leq 10)\\ \Delta\mu&(11\leq j\leq 30)\end{cases},
M4,j,M5,j\displaystyle M_{4,j},M_{5,j} ={Δμ(1j20)0(21j30).\displaystyle=\begin{cases}\Delta\mu&(1\leq j\leq 20)\\ 0&(21\leq j\leq 30)\end{cases}.

Synthetic datasets were generated 1000 times from $N({\rm vec}(M),I_{D}\otimes\Sigma)$, where $\Delta\mu\in\{1,2,3,4\}$ and the same covariance structures as in the FPR experiments were considered. In this setting, we detected locations with $K=2,L=5,W=1$ and tested all the detected components. Proposed-MC and proposed-OC were evaluated using the following performance measure (Hyun et al., 2018):

Conditional Power=#correctly detected & rejected#correctly detected.\displaystyle\text{Conditional Power}=\frac{\text{\#correctly detected \& rejected}}{\text{\#correctly detected}}.

In this experiment, a detection was considered correct if the detected location was within $\pm 2$ of a true CP location. Figure 4 shows the results of the conditional power experiments. The power of proposed-MC is much greater than that of proposed-OC, indicating that it is crucially important to resolve the over-conditioning issue in conditional SI.

Refer to caption
(a) Independence
Refer to caption
(b) AR(1)
Figure 2: FPR of synthetic data experiments with N{20,30,40,50}N\in\{20,30,40,50\}.
Refer to caption
(a) Independence
Refer to caption
(b) AR(1)
Figure 3: FPR of synthetic data experiments with D{2,3,4,5}D\in\{2,3,4,5\}.
Refer to caption
(a) Independence
Refer to caption
(b) AR(1)
Figure 4: Conditional power of synthetic data experiments.

4.2 Real data

Here, we compared the performances of the proposed-MC and the naive method on two types of real datasets. Due to the space limitation, we only show parts of the results: the complete results are presented in Appendix C.

Array CGH

Array comparative genomic hybridization (Array CGH) is a molecular biology technique for identifying copy number abnormalities in cancer genomes. We analyzed an Array CGH dataset for $D=46$ malignant lymphoma patients studied in (Takeuchi et al., 2009b). The covariance matrix $\Sigma$ (for both the independence and AR(1) cases) was estimated using Array CGH data for 18 healthy people, for whom it is reasonable to assume that there are no changes (copy number abnormalities). Figure 5 shows a part of the experimental results. In the naive test, all the detected CP locations and components are declared statistically significant. In contrast, the proposed-MC suggests that some of the detected locations and components are not really statistically significant.

Refer to caption
(a) Independence, W=0W=0
Refer to caption
(b) AR(1), W=0W=0
Refer to caption
(c) Independence, W=2W=2
Refer to caption
(d) AR(1), W=2W=2
Figure 5: The results for chromosome 1 in the Array CGH data analysis with $K=10$ and $L=15$. Only the components (i.e., patients) in which a change is detected are shown. The color of the outer circle indicates the result of the naive method (red: statistically significant / blue: not statistically significant); in this example, the naive method declared all the detected CP locations and components statistically significant. The color of the inner dot indicates the result of the proposed-MC (red: statistically significant / blue: not statistically significant).

Human Activity Recognition

Here, we analyzed the human activity recognition dataset studied in (Chavarriaga et al., 2013), in which signals from sensors attached to the bodies of four subjects were recorded during their daily activities. The sensor signals were recorded at 30 Hz, and locomotion annotations (Stand, Walk, Sit, Lie) and gesture annotations (e.g., Open Door and Close Door) were added manually. We adopted the same preprocessing as (Hammerla et al., 2016): we used the accelerometers of both arms and the back and the IMU data of both feet, and signals containing missing values were removed, which resulted in $D=77$ sensor signals. The covariance structures were estimated using the signals in the longest continuous locomotion state “Lie”, because this state corresponds to the subject sleeping, i.e., no changes exist. In this experiment, the sequence length was compressed to 250 by taking non-overlapping averages along the $N$ direction, so that the dataset size was $N=250$ and $D=77$. We used 8 different combinations of hyperparameters $(K,L,W)\in\{5,10\}\times\{10,20\}\times\{0,2\}$. Figures 6 and 7 show parts of the experimental results. The background colors of the figures correspond to the locomotion states: blue for None, orange for Stand, green for Walk, red for Sit, and purple for Lie. Most of the detected CP locations and components are declared significant by the naive method. In contrast, the proposed-MC suggests that some of the detected locations and components are not really statistically significant.

Refer to caption
(a) Independence
Refer to caption
(b) AR(1)
Figure 6: S1, ADL1, K=5,L=10,W=0K=5,L=10,W=0 (see the caption in Fig. 5 for the details).
Refer to caption
(a) Independence
Refer to caption
(b) AR(1)
Figure 7: S1, ADL1, K=5,L=10,W=2K=5,L=10,W=2.

5 Conclusion

In this paper, we presented a valid and exact (non-asymptotic) statistical inference method for testing the statistical significance of CP locations and components. In many applied problems, it is important to identify the locations and components where changes occur. The proposed method makes it possible to accurately quantify the reliability of the detected changes.

Acknowledgement

This work was partially supported by Japanese Ministry of Education, Culture, Sports, Science and Technology 16H06538, 17H00758, JST CREST JPMJCR1302, JPMJCR1502, RIKEN Center for Advanced Intelligence Project.

Appendix A Algorithm for UpdateDP, SortComp

In this appendix, we present the pseudo-codes for SortComp and UpdateDP in Algorithm 1.

Algorithm 2 SortComp
0:  mm
  Define 𝜽~(m)=(θ~1(m),,θ~D(m))D\tilde{\bm{\theta}}^{(m)}=(\tilde{\theta}_{1}^{(m)},\ldots,\tilde{\theta}_{D}^{(m)})^{\top}\in\mathbb{R}^{D}
  where 𝜽~(m)\tilde{\bm{\theta}}^{(m)} is the vector of indices of the components sorted to satisfy [ZmL+1:m+L(θ~d(m))(m,W)]2>[ZmL+1:m+L(θ~d+1(m))(m,W)]2[Z_{m-L+1:m+L}^{\prime(\tilde{\theta}^{(m)}_{d})}(m,W)]^{2}>[Z_{m-L+1:m+L}^{\prime(\tilde{\theta}^{(m)}_{d+1})}(m,W)]^{2}, for d[D1]d\in[D-1]
  𝜽~(m)\tilde{\bm{\theta}}^{(m)}
Algorithm 3 UpdateDP
0:  j,k,F(k1,:),𝒮j,k,F_{(k-1,:)},{\mathcal{S}}
  F=maxLkmjL,d[D][F(k1,m)+CmL+1:m+L(𝜽:d(m))(m,W)]F=\displaystyle\max_{\begin{subarray}{c}Lk\leq m\leq j-L,\\ d\in[D]\end{subarray}}\left[F_{(k-1,m)}+C_{m-L+1:m+L}^{(\bm{\theta}^{(m)}_{:d})}(m,W)\right]
  τ^,d^=argmaxLkmjL,d[D][F(k1,m)+CmL+1:m+L(𝜽:d(m))(m,W)]\hat{\tau},\hat{d}=\displaystyle\mathop{\rm argmax}_{\begin{subarray}{c}Lk\leq m\leq j-L,\\ d\in[D]\end{subarray}}\left[F_{(k-1,m)}+C_{m-L+1:m+L}^{(\bm{\theta}^{(m)}_{:d})}(m,W)\right]
  Set $\hat{\bm{\theta}}=\tilde{\bm{\theta}}_{:\hat{d}}^{(\hat{\tau})}$ and output $F,\hat{\tau},\hat{\bm{\theta}}$

Appendix B Details of Lemma 2

In this appendix, we describe additional details of Lemma 2.

Definition

The matrices A(m,d)A_{(m,d)}, B(k,j,m,d)B_{(k,j,m,d)} and the scalar c(k,j,m,d)c_{(k,j,m,d)} in Lemma 2 are defined as follows.

A(m,d)\displaystyle A_{(m,d)} =2(G(m,{θ~d+1(m)})G(m,{θ~d(m)})),\displaystyle=\sqrt{2}(G_{(m,\{\tilde{\theta}^{(m)}_{d+1}\})}-G_{(m,\{\tilde{\theta}^{(m)}_{d}\})}),
B(k,j,m,d)\displaystyle B_{(k,j,m,d)} =k=1kG(τ^k(k,j),𝜽^k(k,j))+k=1k1G(τ^k(k1,m),𝜽^k(k1,m))+G(m,𝜽~:d(m)),\displaystyle=-\sum_{k^{\prime}=1}^{k}G_{(\hat{\tau}_{k^{\prime}}^{(k,j)},\hat{\bm{\theta}}_{k^{\prime}}^{(k,j)})}+\sum_{k^{\prime}=1}^{k-1}G_{(\hat{\tau}_{k^{\prime}}^{(k-1,m)},\hat{\bm{\theta}}_{k^{\prime}}^{(k-1,m)})}+G_{(m,\tilde{\bm{\theta}}_{:d}^{(m)})},
c(k,j,m,d)\displaystyle c_{(k,j,m,d)} =k=1k2|𝜽^k(k,j)|2k=1k12|𝜽^k(k1,m)|22d2,\displaystyle=\sum_{k^{\prime}=1}^{k}\frac{\sqrt{2|\hat{\bm{\theta}}_{k^{\prime}}^{(k,j)}|}}{2}-\sum_{k^{\prime}=1}^{k-1}\frac{\sqrt{2|\hat{\bm{\theta}}_{k^{\prime}}^{(k-1,m)}|}}{2}-\frac{\sqrt{2d}}{2},
G(τ,𝜽)\displaystyle G_{(\tau,\bm{\theta})} =VτL+1,τ,τ+L22|𝜽|diag(𝟏D,𝜽)(Δ=ww1L+Δ𝟏τL+1:τ+Δ1LΔ𝟏τ+Δ+1:τ+L)\displaystyle=\frac{V^{2}_{\tau-L+1,\tau,\tau+L}}{\sqrt{2|\bm{\theta}|}}\text{diag}(\bm{1}_{D,\bm{\theta}})\otimes\left(\sum_{\Delta=-w}^{w}\frac{1}{L+\Delta}\bm{1}_{\tau-L+1:\tau+\Delta}-\frac{1}{L-\Delta}\bm{1}_{\tau+\Delta+1:\tau+L}\right)
(Δ=ww1L+Δ𝟏τL+1:τ+Δ1LΔ𝟏τ+Δ+1:τ+L),\displaystyle\hskip 142.26378pt\left(\sum_{\Delta=-w}^{w}\frac{1}{L+\Delta}\bm{1}_{\tau-L+1:\tau+\Delta}-\frac{1}{L-\Delta}\bm{1}_{\tau+\Delta+1:\tau+L}\right)^{\top},

where 𝟏s:e\bm{1}_{s:e} is an NN-Dimensional binary vector in which the ithi^{\text{th}} element is 11 if sies\leq i\leq e and 0 otherwise.

Proof

The proof of Lemma 2 is as follows.

Proof.

To prove Lemma 2, it suffices to show that the event that the sorting result at location $m$ is the same as in the observed case and the event that the $(k,j)^{\rm th}$ entry of the DP table is the same as in the observed case are both written as intersections of quadratic inequalities in $z$, as defined in ${\mathcal{E}}_{\text{sort}}^{(m)}$ for $m\in[L,N-L]$ and ${\mathcal{E}}_{\text{table}}^{(k,j)}$ for $k\in[K]$, $j\in[Lk,N-L(K-k)]$, respectively.

First, we prove the claim on the sorting event. Using the matrix G(τ,𝜽)G_{(\tau,\bm{\theta})} defined above, we can write

[ZmL+1:m+L(θ~d(m))(m,W)]2\displaystyle[Z_{m-L+1:m+L}^{\prime(\tilde{\theta}_{d}^{(m)})}(m,W)]^{2} =2vec(X)G(m,{𝜽~d(m)})vec(X).\displaystyle=\sqrt{2}\text{vec}(X)^{\top}G_{(m,\{\tilde{\bm{\theta}}_{d}^{(m)}\})}\text{vec}(X).

Therefore, the sorting event is written as

[ZmL+1:m+L(θ~d(m))(m,W)]2[ZmL+1:m+L(θ~d+1(m))(m,W)]2vec(X)A(m,d)vec(X)0.\displaystyle[Z_{m-L+1:m+L}^{\prime(\tilde{\theta}_{d}^{(m)})}(m,W)]^{2}\geq[Z_{m-L+1:m+L}^{\prime(\tilde{\theta}_{d+1}^{(m)})}(m,W)]^{2}~{}\Leftrightarrow~{}\text{vec}(X)^{\top}A_{(m,d)}\text{vec}(X)\leq 0. (12)

for $d\in[D-1]$. By restricting to the line ${\rm vec}(X)=\bm{a}+\bm{b}z,z\in\mathbb{R}$, the range of $z$ for which the sorting results are the same as in the observed case is written as

zsort(m)=d=1D1{z|(𝒃A(m,d)𝒃)z2+2(𝒃A(m,d)𝒂)z+𝒂A(m,d)𝒂\displaystyle z\in{\mathcal{E}}_{\text{sort}}^{(m)}=\bigcap_{d=1}^{D-1}\{z~{}|~{}(\bm{b}^{\top}A_{(m,d)}\bm{b})z^{2}+2(\bm{b}^{\top}A_{(m,d)}\bm{a})z+\bm{a}^{\top}A_{(m,d)}\bm{a} 0}.\displaystyle\leq 0\}. (13)

Next, we prove the claim on the DP table. Let

Cs:e(𝜽)(τ,W)\displaystyle C_{s:e}^{(\bm{\theta})}(\tau,W) =vec(X)G(τ,𝜽)vec(X)2|𝜽|2.\displaystyle=\text{vec}(X)^{\top}G_{(\tau,\bm{\theta})}\text{vec}(X)-\frac{\sqrt{2|\bm{\theta}|}}{2}.

Then, the (k,j)th(k,j)^{\rm th} element of the DP table FF is written as

F(k,j)\displaystyle F_{(k,j)} =k=1kCτ^k(k,j)L+1:τ^k(k,j)+L(𝜽^k(k,j))(τ^k(k,j),W).\displaystyle=\sum_{k^{\prime}=1}^{k}C_{\hat{\tau}_{k^{\prime}}^{(k,j)}-L+1:\hat{\tau}_{k^{\prime}}^{(k,j)}+L}^{(\hat{\bm{\theta}}_{k^{\prime}}^{(k,j)})}(\hat{\tau}_{k^{\prime}}^{(k,j)},W).

Therefore, the DP table event is written as

F(k,j)F(k1,m)+CmL+1:m+L(𝜽~:d(m))(m,W),m[L(k+1),jL],d[D].\displaystyle F_{(k,j)}\geq F_{(k-1,m)}+C_{m-L+1:m+L}^{(\tilde{\bm{\theta}}_{:d}^{(m)})}(m,W),~{}m\in[L(k+1),j-L],d\in[D]. (14)

By restricting to the line ${\rm vec}(X)=\bm{a}+\bm{b}z,z\in\mathbb{R}$, the range of $z$ for which the $(k,j)^{\rm th}$ entry of the DP table is the same as in the observed case is written as

z\in{\mathcal{E}}_{\text{table}}^{(k,j)}=\bigcap_{m}\bigcap_{d}\{z~{}|~{}(\bm{b}^{\top}B_{(k,j,m,d)}\bm{b})z^{2}+2(\bm{b}^{\top}B_{(k,j,m,d)}\bm{a})z+\bm{a}^{\top}B_{(k,j,m,d)}\bm{a}+c_{(k,j,m,d)}\leq 0\}. \qquad (15)

From (13) and (15), the subset 𝒵v{\mathcal{Z}}_{v}\subseteq\mathbb{R} is represented as claimed in Lemma 2. ∎

Appendix C Additional information about the experiments

In this appendix, we describe the details of the experimental setups and additional experimental results.

C.1 Additional information about experimental setups

We executed all the experiments on an Intel(R) Xeon(R) Gold 6230 CPU at 2.10GHz, and Python 3.8.8 was used throughout. The two proposed methods, Proposed-MC and Proposed-OC, are implemented as conditional SI with the following conditional distributions:

  • Proposed-MC: 𝜼k,hvec(X)|(τk,θhk)(X)\bm{\eta}_{k,h}^{\top}{\rm vec}(X)~{}|~{}(\tau_{k},\theta_{h}^{k})\in{\mathcal{M}}(X);

  • Proposed-OC: 𝜼k,hvec(X)|𝒜(Xobs)=𝒜(X)\bm{\eta}_{k,h}^{\top}{\rm vec}(X)~{}|~{}{\mathcal{A}}(X^{\rm obs})={\mathcal{A}}(X).

C.2 Additional information about synthetic data experiments

In the synthetic data experiments, in addition to the FPR and the conditional power, we also compared the selective confidence intervals (CIs) of Proposed-MC and Proposed-OC. For the comparison of selective CIs, we set the mean matrix $M$ of the multi-dimensional sequence $X$ with $N=30,D=5$ as follows:

M_{1,j} =\begin{cases}4&(1\leq j\leq 10)\\ 2&(11\leq j\leq 20)\\ 0&(21\leq j\leq 30)\end{cases},
M_{2,j},M_{3,j} =\begin{cases}2&(1\leq j\leq 10)\\ 0&(11\leq j\leq 30)\end{cases},
M_{4,j},M_{5,j} =\begin{cases}0&(1\leq j\leq 30)\end{cases}.

We considered the same covariance structures as in the conditional power experiments. For each setting, we generated 100 multi-dimensional sequence samples and compared the lengths of the 95% CIs of proposed-MC and proposed-OC. Figure 8 shows the results, which are consistent with the conditional power results555Here, a few outliers are removed for visibility..

Refer to caption
(a) Independence
Refer to caption
(b) AR(1)
Figure 8: Selective CI lengths in synthetic data experiments.

C.3 Additional information about real data experiments

Here, we present the detail experimental setups and additional experimental results of the two real data experiments.

Array CGH

We applied the naive method and the proposed-MC method to each of the 22 chromosomes, where $N$ and $D$ correspond to the length of each chromosome and the number of subjects, respectively. In addition to the result for chromosome 1 in the main paper, we show here only the results for chromosomes 2–5, to avoid an excessively large file size. For the covariance matrix $\Sigma$, we considered two alternatives based on the Array CGH dataset of 18 healthy subjects, for whom it is reasonable to assume that no CP is present. The first option assumes an independence structure and the second assumes an AR(1) structure. The parameters $\sigma^{2}$ (for both the independence and AR(1) options) and $\gamma$ and $\rho$ (for the AR(1) option) were estimated as

\hat{\sigma}^{2} =\frac{1}{18N}\sum_{i=1}^{18}\sum_{j=1}^{N}(X_{i,j}-\mu_{i})^{2},
\hat{\gamma} =\frac{1}{18(N-1)}\sum_{i=1}^{18}\sum_{j=2}^{N}(X_{i,j}-\mu_{i})(X_{i,j-1}-\mu_{i}),
\hat{\rho} =\frac{\hat{\gamma}}{\hat{\sigma}^{2}},

where $\mu_{i}=\frac{1}{N}\sum_{j=1}^{N}X_{i,j}$. Here, we assumed that the components are uncorrelated because each component corresponds to an individual subject. Figures 9–12 show the results for chromosomes 2–5. As discussed in §4, the naive test declared that most of the detected CP locations and components are statistically significant, whereas the proposed-MC method suggested that some of them are not really statistically significant.
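The following direct sketch (NumPy; `R` is assumed to be the $(18, N)$ array of reference, change-free signals) implements these plug-in estimates.

import numpy as np

def estimate_ar1_params(R):
    """Estimate sigma^2 and rho from reference signals with no change points."""
    mu = R.mean(axis=1, keepdims=True)            # per-subject mean mu_i
    resid = R - mu
    sigma2 = np.mean(resid ** 2)                  # pooled variance estimate
    gamma = np.mean(resid[:, 1:] * resid[:, :-1]) # pooled lag-1 autocovariance
    rho = gamma / sigma2
    return sigma2, rho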

Refer to caption
(a) Independence, W=0W=0
Refer to caption
(b) AR(1), W=0W=0
Refer to caption
(c) Independence, W=2W=2
Refer to caption
(d) AR(1), W=2W=2
Figure 9: The results on Array CGH for chromosome 2 with K=10K=10 and L=15L=15.
Refer to caption
(a) Independence, W=0W=0
Refer to caption
(b) AR(1), W=0W=0
Refer to caption
(c) Independence, W=2W=2
Refer to caption
(d) AR(1), W=2W=2
Figure 10: The results on Array CGH for chromosome 3 with K=10K=10 and L=15L=15.
Refer to caption
(a) Independence, W=0W=0
Refer to caption
(b) AR(1), W=0W=0
Refer to caption
(c) Independence, W=2W=2
Refer to caption
(d) AR(1), W=2W=2
Figure 11: The results on Array CGH for chromosome 4 with K=10K=10 and L=15L=15.
Refer to caption
(a) Independence, W=0W=0
Refer to caption
(b) AR(1), W=0W=0
Refer to caption
(c) Independence, W=2W=2
Refer to caption
(d) AR(1), W=2W=2
Figure 12: The results on Array CGH for chromosome 5 with K=10K=10 and L=7L=7.

Human Activity Recognition

The original dataset in Chavarriaga et al. (2013) consists of multi-dimensional sensor signals taken from four subjects (S1-S4) in five settings (ADL1-ADL5). In this paper, we applied the naive method and the proposed-MC method only to the first setting (ADL1) of the first subject (S1). We considered the same covariance structure options as for the Array CGH data. The covariance structures were estimated using the signals in the longest continuous locomotion state “Lie”, because this state corresponds to the subject sleeping, i.e., no changes exist, as follows:

\hat{\sigma}^{2} = \frac{1}{77n}\sum_{i=1}^{77}\sum_{j=1}^{n}(X_{i,j}-\mu_{i})^{2},
\hat{\gamma} = \frac{1}{77(n-1)}\sum_{i=1}^{77}\sum_{j=2}^{n}(X_{i,j}-\mu_{i})(X_{i,j-1}-\mu_{i}),
\hat{\rho} = \frac{\hat{\gamma}}{\hat{\sigma}^{2}}.
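As a complement, the sketch below shows how the two covariance options for a single component could be assembled from these estimates; it assumes the AR(1) option corresponds to Cov(X_{i,j}, X_{i,k}) = σ²ρ^{|j−k|}, and the function name component_covariance is illustrative.

import numpy as np

def component_covariance(n, sigma2, rho, structure="independence"):
    # n x n within-component covariance matrix under the two options
    # used in the experiments.
    if structure == "independence":
        return sigma2 * np.eye(n)
    if structure == "ar1":
        # Assumed AR(1) form: Cov(X_{i,j}, X_{i,k}) = sigma^2 * rho^{|j-k|}.
        lags = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
        return sigma2 * rho ** lags
    raise ValueError("structure must be 'independence' or 'ar1'")

# Placeholder values in place of the estimates hat{sigma}^2 and hat{rho}.
Sigma_ind = component_covariance(200, 1.3, 0.4, "independence")
Sigma_ar1 = component_covariance(200, 1.3, 0.4, "ar1")

Since the components are treated as mutually uncorrelated, the full covariance of the vectorized sequence is block-diagonal with one such block per component.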

Here, we used 8 different combinations of hyperparameters (K, L, W) ∈ {5, 10} × {10, 20} × {0, 2}. Figures 13–20 show the results for all the parameter settings. As discussed in §4, the naive test declared that most of the detected CP locations and components are statistically significant, but the proposed-MC method suggested that some of them are not.
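For reference, the eight combinations can be enumerated as follows (the variable name settings is illustrative):

from itertools import product

# All 8 hyperparameter combinations (K, L, W) used in Figures 13-20.
settings = list(product([5, 10], [10, 20], [0, 2]))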

Figure 13: The results on Human Activity Recognition for the setting ADL1 of subject S1 with K=5, L=10, W=0. Panels: (a) Independence; (b) AR(1).
Figure 14: The results on Human Activity Recognition for the setting ADL1 of subject S1 with K=5, L=10, W=2. Panels: (a) Independence; (b) AR(1).
Figure 15: The results on Human Activity Recognition for the setting ADL1 of subject S1 with K=5, L=20, W=0. Panels: (a) Independence; (b) AR(1).
Figure 16: The results on Human Activity Recognition for the setting ADL1 of subject S1 with K=5, L=20, W=2. Panels: (a) Independence; (b) AR(1).
Figure 17: The results on Human Activity Recognition for the setting ADL1 of subject S1 with K=10, L=10, W=0. Panels: (a) Independence; (b) AR(1).
Figure 18: The results on Human Activity Recognition for the setting ADL1 of subject S1 with K=10, L=10, W=2. Panels: (a) Independence; (b) AR(1).
Figure 19: The results on Human Activity Recognition for the setting ADL1 of subject S1 with K=10, L=20, W=0. Panels: (a) Independence; (b) AR(1).
Figure 20: The results on Human Activity Recognition for the setting ADL1 of subject S1 with K=10, L=20, W=2. Panels: (a) Independence; (b) AR(1).

References

  • Bach et al. [2006] F. R. Bach, D. Heckerman, and E. Horvitz. Considering cost asymmetry in learning classifiers. Journal of Machine Learning Research, 7:1713–1741, 2006.
  • Bai and Perron [2003] Jushan Bai and Pierre Perron. Computation and analysis of multiple structural change models. Journal of applied econometrics, 18(1):1–22, 2003.
  • Bellman [1966] Richard Bellman. Dynamic programming. Science, 153(3731):34–37, 1966.
  • Bleakley and Vert [2011] Kevin Bleakley and Jean-Philippe Vert. The group fused lasso for multiple change-point detection. arXiv preprint arXiv:1106.4199, 2011.
  • Chavarriaga et al. [2013] Ricardo Chavarriaga, Hesam Sagha, Alberto Calatroni, Sundara Tejaswi Digumarti, Gerhard Tröster, José del R Millán, and Daniel Roggen. The opportunity challenge: A benchmark database for on-body sensor-based activity recognition. Pattern Recognition Letters, 34(15):2033–2042, 2013.
  • Chen and Bien [2019] Shuxiao Chen and Jacob Bien. Valid inference corrected for outlier removal. Journal of Computational and Graphical Statistics, pages 1–12, 2019.
  • Cho [2016] Haeran Cho. Change-point detection in panel data via double cusum statistic. Electronic Journal of Statistics, 10(2):2000–2038, 2016.
  • Cho and Fryzlewicz [2015] Haeran Cho and Piotr Fryzlewicz. Multiple-change-point detection for high dimensional time series via sparsified binary segmentation. Journal of the Royal Statistical Society: Series B: Statistical Methodology, pages 475–507, 2015.
  • Duy and Takeuchi [2021] Vo Nguyen Le Duy and Ichiro Takeuchi. Exact statistical inference for the wasserstein distance by selective inference. arXiv preprint arXiv:2109.14206, 2021.
  • Duy et al. [2020a] Vo Nguyen Le Duy, Shogo Iwazaki, and Ichiro Takeuchi. Quantifying statistical significance of neural network representation-driven hypotheses by selective inference. arXiv preprint arXiv:2010.01823, 2020a.
  • Duy et al. [2020b] Vo Nguyen Le Duy, Hiroki Toda, Ryota Sugiyama, and Ichiro Takeuchi. Computing valid p-value for optimal changepoint by selective inference using dynamic programming. In Advances in Neural Information Processing Systems, volume 33, pages 11356–11367, 2020b.
  • Enikeeva et al. [2019] Farida Enikeeva, Zaid Harchaoui, et al. High-dimensional change-point detection under sparse alternatives. Annals of Statistics, 47(4):2051–2079, 2019.
  • Fithian et al. [2014] William Fithian, Dennis Sun, and Jonathan Taylor. Optimal inference after model selection. arXiv preprint arXiv:1410.2597, 2014.
  • Guthery [1974] Scott B Guthery. Partition regression. Journal of the American Statistical Association, 69(348):945–947, 1974.
  • Hammerla et al. [2016] Nils Y Hammerla, Shane Halloran, and Thomas Plötz. Deep, convolutional, and recurrent models for human activity recognition using wearables. arXiv preprint arXiv:1604.08880, 2016.
  • Harchaoui et al. [2009] Zaid Harchaoui, Félicien Vallet, Alexandre Lung-Yut-Fong, and Olivier Cappé. A regularized kernel-based approach to unsupervised audio segmentation. In 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 1665–1668. IEEE, 2009.
  • Hastie et al. [2004] T. Hastie, S. Rosset, R. Tibshirani, and J. Zhu. The entire regularization path for the support vector machine. Journal of Machine Learning Research, 5:1391–1415, 2004.
  • Hocking et al. [2011] T. Hocking, J. P. Vert, F. Bach, and A. Joulin. Clusterpath: an algorithm for clustering using convex fusion penalties. In Proceedings of the 28th International Conference on Machine Learning, pages 745–752, 2011.
  • Horváth and Hušková [2012] Lajos Horváth and Marie Hušková. Change-point detection in panel data. Journal of Time Series Analysis, 33(4):631–648, 2012.
  • Hyun et al. [2018] Sangwon Hyun, Kevin Lin, Max G’Sell, and Ryan J Tibshirani. Post-selection inference for changepoint detection algorithms with application to copy number variation data. arXiv preprint arXiv:1812.03644, 2018.
  • Jewell [2020] Sean William Jewell. Estimation and Inference in Changepoint Models. University of Washington, 2020.
  • Jirak [2015] Moritz Jirak. Uniform change point tests in high dimension. The Annals of Statistics, 43(6):2451–2483, 2015.
  • Karasuyama and Takeuchi [2010] M. Karasuyama and I. Takeuchi. Nonlinear regularization path for quadratic loss support vector machines. IEEE Transactions on Neural Networks, 22(10):1613–1625, 2010.
  • Karasuyama et al. [2012] M. Karasuyama, N. Harada, M. Sugiyama, and I. Takeuchi. Multi-parametric solution-path algorithm for instance-weighted support vector machines. Machine Learning, 88(3):297–330, 2012.
  • Korostelev and Lepski [2008] A Korostelev and O Lepski. On a multi-channel change-point problem. Mathematical Methods of Statistics, 17(3):187–197, 2008.
  • Lavielle [2005] Marc Lavielle. Using penalized contrasts for the change-point problem. Signal processing, 85(8):1501–1510, 2005.
  • Le Duy and Takeuchi [2021] Vo Nguyen Le Duy and Ichiro Takeuchi. Parametric programming approach for more powerful and general lasso selective inference. In International Conference on Artificial Intelligence and Statistics, pages 901–909. PMLR, 2021.
  • Lee and Scott [2007] G. Lee and C. Scott. The one class support vector machine solution path. In Proc. of ICASSP 2007, pages II521–II524, 2007.
  • Lee et al. [2015] Jason D Lee, Yuekai Sun, and Jonathan E Taylor. Evaluating the statistical significance of biclusters. Advances in neural information processing systems, 28:1324–1332, 2015.
  • Lee et al. [2016] Jason D Lee, Dennis L Sun, Yuekai Sun, Jonathan E Taylor, et al. Exact post-selection inference, with application to the lasso. The Annals of Statistics, 44(3):907–927, 2016.
  • Lévy-Leduc and Roueff [2009] Céline Lévy-Leduc and François Roueff. Detection and localization of change-points in high-dimensional network traffic data. The Annals of Applied Statistics, pages 637–662, 2009.
  • Liu et al. [2018] Keli Liu, Jelena Markovic, and Robert Tibshirani. More powerful post-selection inference, with application to the lasso. arXiv preprint arXiv:1801.09037, 2018.
  • Loftus and Taylor [2014] Joshua R Loftus and Jonathan E Taylor. A significance test for forward stepwise model selection. arXiv preprint arXiv:1405.3920, 2014.
  • Loftus and Taylor [2015] Joshua R Loftus and Jonathan E Taylor. Selective inference in regression models with groups of variables. arXiv preprint arXiv:1511.01478, 2015.
  • Ogawa et al. [2013] Kohei Ogawa, Motoki Imamura, Ichiro Takeuchi, and Masashi Sugiyama. Infinitesimal annealing for training semi-supervised support vector machines. In International Conference on Machine Learning, pages 897–905, 2013.
  • Osborne et al. [2000] Michael R Osborne, Brett Presnell, and Berwin A Turlach. A new approach to variable selection in least squares problems. IMA journal of numerical analysis, 20(3):389–403, 2000.
  • Rigaill [2010] Guillem Rigaill. Pruned dynamic programming for optimal multiple change-point detection. arXiv preprint arXiv:1004.0887, 2010.
  • Rosset and Zhu [2007] S. Rosset and J. Zhu. Piecewise linear regularized solution paths. Annals of Statistics, 35:1012–1030, 2007.
  • Sharma et al. [2016] Shilpy Sharma, David A Swayne, and Charlie Obimbo. Trend analysis and change point techniques: a survey. Energy, Ecology and Environment, 1(3):123–130, 2016.
  • Shimodaira and Terada [2019] Hidetoshi Shimodaira and Yoshikazu Terada. Selective inference for testing trees and edges in phylogenetics. Frontiers in Ecology and Evolution, 7:174, 2019.
  • Sugiyama et al. [2021] Kazuya Sugiyama, Vo Nguyen Le Duy, and Ichiro Takeuchi. More powerful and general selective inference for stepwise feature selection using homotopy method. In International Conference on Machine Learning, pages 9891–9901. PMLR, 2021.
  • Suzumura et al. [2017] Shinya Suzumura, Kazuya Nakagawa, Yuta Umezu, Koji Tsuda, and Ichiro Takeuchi. Selective inference for sparse high-order interaction models. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 3338–3347. JMLR. org, 2017.
  • Takeuchi et al. [2009a] I. Takeuchi, K. Nomura, and T. Kanamori. Nonparametric conditional density estimation using piecewise-linear solution path of kernel quantile regression. Neural Computation, 21(2):539–559, 2009a.
  • Takeuchi and Sugiyama [2011] Ichiro Takeuchi and Masashi Sugiyama. Target neighbor consistent feature weighting for nearest neighbor classification. In Advances in neural information processing systems, pages 576–584, 2011.
  • Takeuchi et al. [2009b] Ichiro Takeuchi, Hiroyuki Tagawa, Akira Tsujikawa, Masao Nakagawa, Miyuki Katayama-Suguro, Ying Guo, and Masao Seto. The potential of copy number gains and losses, detected by array-based comparative genomic hybridization, for computational differential diagnosis of b-cell lymphomas and genetic regions involved in lymphomagenesis. haematologica, 94(1):61, 2009b.
  • Takeuchi et al. [2013] Ichiro Takeuchi, Tatsuya Hongo, Masashi Sugiyama, and Shinichi Nakajima. Parametric task learning. Advances in Neural Information Processing Systems, 26:1358–1366, 2013.
  • Tanizaki et al. [2020] Kosuke Tanizaki, Noriaki Hashimoto, Yu Inatsu, Hidekata Hontani, and Ichiro Takeuchi. Computing valid p-values for image segmentation by selective inference. 2020.
  • Taylor and Tibshirani [2015] Jonathan Taylor and Robert J Tibshirani. Statistical learning and selective inference. Proceedings of the National Academy of Sciences, 112(25):7629–7634, 2015.
  • Terada and Shimodaira [2019] Yoshikazu Terada and Hidetoshi Shimodaira. Selective inference after variable selection via multiscale bootstrap. arXiv preprint arXiv:1905.10573, 2019.
  • Tibshirani et al. [2016] Ryan J Tibshirani, Jonathan Taylor, Richard Lockhart, and Robert Tibshirani. Exact post-selection inference for sequential regression procedures. Journal of the American Statistical Association, 111(514):600–620, 2016.
  • Truong et al. [2018] Charles Truong, Laurent Oudre, and Nicolas Vayatis. A review of change point detection methods. 2018.
  • Tsuda [2007] K. Tsuda. Entire regularization paths for graph data. In In Proc. of ICML 2007, pages 919–925, 2007.
  • Tsukurimichi et al. [2021] Toshiaki Tsukurimichi, Yu Inatsu, Vo Nguyen Le Duy, and Ichiro Takeuchi. Conditional selective inference for robust regression and outlier detection using piecewise-linear homotopy continuation. arXiv preprint arXiv:2104.10840, 2021.
  • Umezu and Takeuchi [2017] Yuta Umezu and Ichiro Takeuchi. Selective inference for change point detection in multi-dimensional sequences. arXiv preprint arXiv:1706.00514, 2017.
  • Vert and Bleakley [2010] Jean-Philippe Vert and Kevin Bleakley. Fast detection of multiple change-points shared by many signals using group lars. Advances in neural information processing systems, 23:2343–2351, 2010.
  • Yamada et al. [2018a] Makoto Yamada, Yuta Umezu, Kenji Fukumizu, and Ichiro Takeuchi. Post selection inference with kernels. In International Conference on Artificial Intelligence and Statistics, pages 152–160. PMLR, 2018a.
  • Yamada et al. [2018b] Makoto Yamada, Denny Wu, Yao-Hung Hubert Tsai, Ichiro Takeuchi, Ruslan Salakhutdinov, and Kenji Fukumizu. Post selection inference with incomplete maximum mean discrepancy estimator. arXiv preprint arXiv:1802.06226, 2018b.
  • Zhang et al. [2010] Nancy R Zhang, David O Siegmund, Hanlee Ji, and Jun Z Li. Detecting simultaneous changepoints in multiple sequences. Biometrika, 97(3):631–645, 2010.