
Deep learning for intermittent gravitational wave signals

Takahiro S. Yamamoto [email protected] Department of Physics, Nagoya University, Nagoya, 464-8602, Japan    Sachiko Kuroyanagi Instituto de Física Teórica UAM-CSIC, Universidad Autonóma de Madrid, Cantoblanco, 28049 Madrid, Spain Department of Physics, Nagoya University, Nagoya, 464-8602, Japan    Guo-Chin Liu Department of Physics, Tamkang University, Tamsui, New Taipei City 25137, Taiwan
Abstract

The ensemble of unresolved compact binary coalescences is a promising source of the stochastic gravitational wave (GW) background. For stellar-mass black hole binaries, the astrophysical stochastic GW background is expected to exhibit non-Gaussianity due to its intermittent nature. We investigate the application of deep learning to detect such a non-Gaussian stochastic GW background and demonstrate it with the toy model employed in Drasco & Flanagan (2003), in which each burst is described by a single peak concentrated at one time bin. For the detection problem, we compare three neural networks with different structures: a shallower convolutional neural network (CNN), a deeper CNN, and a residual network. We show that the residual network can achieve sensitivity comparable to the conventional non-Gaussian statistic for signals with an astrophysical duty cycle of $\log_{10}\xi\in[-3,-1]$. Furthermore, we apply deep learning for parameter estimation with two approaches, in which the neural network (1) directly provides the duty cycle and the signal-to-noise ratio (SNR) and (2) classifies the data into four classes depending on the duty cycle value. This is the first step of a deep learning application for detecting a non-Gaussian stochastic GW background and extracting information on the astrophysical duty cycle.

I Introduction

The astrophysical stochastic gravitational-wave (GW) background is one of the most interesting targets for current and future GW experiments. It originates from the ensemble of many unresolved GW sources at high redshift and contains information about the mass function and redshift distribution of the sources.

Observations of binary black holes (BBH) and binary neutron stars (BNS) indicate that distant merger events could be detected as a stochastic GW background by the near-future ground-based detector network [1, 2, 3]. An estimation from the merger rate shows that the energy densities of the backgrounds of BBH and BNS origin would be similar, but their statistical behavior could be very different [2]. While sub-threshold BNS events typically overlap and create an approximately continuous background, the time interval between BBH events is much longer than the duration of the individual signal, and they do not overlap. Because of this, the BBH background could be highly non-stationary and non-Gaussian (sometimes referred to as an intermittent or popcorn signal). The information on the non-Gaussianity could be used to disentangle the different origins of the GW sources [4].

Detection strategies for such non-Gaussian backgrounds have been discussed in the literature. First, Drasco & Flanagan [5] (DF03) derived the maximum likelihood detection statistic for the case of co-located, co-aligned interferometers characterized by stationary, Gaussian white noise with burst-like non-Gaussian signals. Although the computational cost is significantly larger than that of the standard cross-correlation method, it has been shown that the maximum likelihood method can outperform the standard cross-correlation search. Subsequently, Thrane [6] presented a method that can be applied in the more realistic case of spatially separated interferometers with colored, non-Gaussian noise. Martellini & Regimbau [7, 8] proposed semiparametric maximum likelihood estimators. While these methods are framed in the context of frequentist statistics, Cornish & Romano [9] discussed the use of Bayesian model selection. Alternative methods were also discussed. Seto [10, 11] presented the use of the fourth-order correlation between four detectors. Smith & Thrane [12, 13] proposed a method to use sub-threshold BBHs in the matched-filtering search and demonstrated a Bayesian parameter estimation. Subsequently, Biscoveanu et al. [14] simulated the Bayesian parameter estimation of the primordial background (stationary, Gaussian) in the presence of an astrophysical foreground (non-stationary, non-Gaussian). Finally, Matas & Romano [15] showed that the hybrid frequentist Bayesian analysis is equivalent to a fully Bayesian approach and claimed that their method can be extended to a non-stationary GW background. See also Ref. [16] for a comprehensive review.

In the general context of GW data analysis, the application of deep learning has been actively studied in the last five years. George & Huerta [17, 18] showed that deep neural networks can achieve a sensitivity comparable to the matched filtering for detection and parameter estimation of BBH mergers. Although their neural network does not predict the statistical error, several authors proposed a method to predict the posterior distributions [19, 20, 21, 22, 23]. Also, deep learning has been widely applied for various types of signal (e.g., BBH mergers [24, 25, 26], black hole ringdown GWs [27, 28, 29, 30], continuous GWs [31, 32, 33, 34, 35], supernovae [36], and hyperbolic encounters [37]).

In this work, we use deep learning to analyze a non-Gaussian GW background. The great advantage of deep learning is that it is computationally cheaper than the matched-filter-based approach. This is because neural networks learn the features of the data through a training process before being applied to real data. The data analysis of a stochastic background is usually performed by dividing the long-duration data stream ($\sim$ years) into short time segments (typically $192$ seconds; see e.g. [3]). If we want to apply the non-Gaussian statistic of DF03, it will take a much longer time to analyze each segment compared to the standard cross-correlation statistic. On the other hand, in the case of deep learning, once the training is completed, it can quickly analyze each segment and repeat the same analysis for a large number of data segments with features similar to the training data. In this way, it is expected to reduce the total time for the analysis. Another advantage is that neural networks can extract features that are difficult to model. Thus, they could be applied to various types of non-Gaussian GW backgrounds even if the source waveform is not well understood. As a first step, we employ the toy model and the detection method proposed by DF03. We train the neural network with a dataset generated from the toy model and assess the neural network’s performance by comparing it with their detection method.

Finally, let us mention the work by Utina et al. [38], which has a similar purpose and developed neural network algorithms to detect the GW background from binary black hole mergers. In [38], the data is split into 1 or 2 second segments, and the neural network is trained with the injection of binary black hole events. On the other hand, our method is based on the toy model in DF03, which does not rely on a particular burst model and is designed to analyze segments with longer duration (as long as the computational power allows). In addition, we discuss the estimation of the intermittency (astrophysical duty cycle), while Utina et al. focused on the detection problem.

The paper is structured as follows. In Sec. II, we describe the signal model and the non-Gaussian statistic proposed by DF03, which is used for comparison in the results sections. Section III is dedicated to a review of the deep learning algorithms used in this paper. Then, we show the results of the detection problem in Sec. IV and parameter estimation in Sec. V. Finally, we summarize our work in Sec. VI.

II Signal model and maximum likelihood statistic

II.1 Signal model

We use the simple toy model of DF03. The assumptions are the following: the two detectors are co-located and co-aligned; the detector noises are white, stationary, Gaussian, and statistically independent; each astrophysical burst is represented by a sharp peak that has support only on a discretized time grid. The methodology could be easily extended to the case of spatially separated interferometers by introducing the overlap reduction function [39, 6]. Real detector noise is colored, highly non-Gaussian, and non-stationary, and these effects should be taken into account before applying our method to real data. In this paper, however, we focus on presenting the basic methodology of deep learning and the comparison with DF03's results. The assumption of a sharp peak signal is valid if the duration of the burst is shorter than the time resolution of the detector. In that case, the observed GW strain at the burst arrival time is the amplitude averaged over the sampling interval. However, this assumption cannot be applied to the expected astrophysical backgrounds from BBHs and BNSs, and again, we leave this as future work.

Figure 1: Example of the signal model. We see four bursts at times 5, 9, 13, and 18. Each burst is represented by a single peak.

The strain data obtained by each detector is denoted by $h^{k}_{i}$, where $i=1,2$ labels the different detectors, and $k=1,2,\cdots,N$ is a time index. We use $s^{k}$ to denote the GW signal. Including the detector noise, denoted by $n^{k}_{i}$, we can express the strain data of the $i$th detector as

h_{i}^{k}=s^{k}+n_{i}^{k}\,. (1)

The detector noise is randomly drawn from a Gaussian distribution, that is,

p(n_{i}^{k})=\mathcal{N}(n_{i}^{k};0,\sigma^{2}_{i})\,. (2)

$\mathcal{N}(x;\mu,\sigma^{2})$ is a one-dimensional Gaussian distribution with a mean $\mu$ and a variance $\sigma^{2}$, i.e.,

\mathcal{N}(x;\mu,\sigma^{2})=\frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\left[-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right]\,. (3)

Due to the assumption of stationarity and whiteness, the variance $\sigma^{2}$ is constant in time. We also assume that the two detectors have noise with the same variance, and we set

\sigma^{2}_{1}=\sigma^{2}_{2}=1\,, (4)

throughout this paper.

Assuming that the GW burst rate is low enough that the bursts do not overlap, we model the probability density function of the strain value at time $k$ as

p(s^{k})=\xi\mathcal{N}(s^{k};0,\alpha^{2})+(1-\xi)\delta(s^{k})\,, (5)

where $\alpha^{2}$ is the variance of the amplitude of the bursts, and $\xi$ is the so-called (astrophysical) duty cycle. The duty cycle describes the probability that a burst is present in the detector at any chosen time and takes a value in the range $0<\xi\leq 1$. The case of $\xi=1$ is equivalent to Gaussian stochastic GWs. On the other hand, $\xi=0$ reduces to the absence of a signal. The signal exhibits stronger non-Gaussianity as $\xi$ decreases. Figure 1 shows an example of the time-series signal generated based on Eq. (5). Following DF03, we define the signal-to-noise ratio (SNR) of the non-Gaussian stochastic background by

\rho=\frac{\xi\alpha^{2}\sqrt{N}}{\sigma_{1}\sigma_{2}}\,, (6)

and use it to describe the strength of the signal.
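As an illustration, the toy model defined by Eqs. (1), (2), (5), and (6) can be simulated with a few lines of Python. The following is a minimal NumPy sketch; the function name simulate_strain and its default arguments are our own choices and are not part of DF03 or the analysis code used in this paper.

```python
import numpy as np

def simulate_strain(N=10_000, xi=1e-2, rho=4.0, sigma1=1.0, sigma2=1.0, rng=None):
    """Generate two co-located strain series following the toy model.

    Each time bin hosts a burst with probability xi, and the burst amplitude is
    Gaussian with variance alpha^2 [Eq. (5)]; alpha^2 is fixed by inverting the
    SNR definition rho = xi * alpha^2 * sqrt(N) / (sigma1 * sigma2) [Eq. (6)].
    """
    rng = np.random.default_rng(rng)
    alpha2 = rho * sigma1 * sigma2 / (xi * np.sqrt(N))
    bursts = rng.random(N) < xi                        # which bins contain a burst
    s = np.where(bursts, rng.normal(0.0, np.sqrt(alpha2), N), 0.0)
    h1 = s + rng.normal(0.0, sigma1, N)                # Eq. (1), detector 1
    h2 = s + rng.normal(0.0, sigma2, N)                # Eq. (1), detector 2
    return h1, h2

h1, h2 = simulate_strain(xi=1e-2, rho=2.0, rng=0)
```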

II.2 Non-Gaussian statistic

As proposed in DF03, the likelihood ratio can be used as a detection statistic. Under the assumption of the noise model Eq. (2) and the signal model Eq. (5), the likelihood ratio can be reduced to

{\Lambda}^{\textrm{NG}}_{\textrm{ML}}=\max_{0<\xi\leq 1}\ \max_{\alpha^{2}>0}\ \max_{\sigma_{1}^{2}\geq 0}\ \max_{\sigma_{2}^{2}\geq 0}\ {\lambda}^{\textrm{NG}}_{\textrm{ML}}(\alpha^{2},\xi,\sigma_{1}^{2},\sigma_{2}^{2})\,, (7)

where

{\lambda}^{\textrm{NG}}_{\textrm{ML}}(\alpha^{2},\xi,\sigma_{1}^{2},\sigma_{2}^{2})\coloneqq\prod_{k=1}^{N}\left\{\frac{\bar{\sigma}_{1}\bar{\sigma}_{2}\xi}{\sqrt{\sigma_{1}^{2}\sigma_{2}^{2}+\sigma_{1}^{2}\alpha^{2}+\sigma_{2}^{2}\alpha^{2}}}\exp\left[\frac{(h_{1}^{k}/\sigma_{1}^{2}+h_{2}^{k}/\sigma_{2}^{2})^{2}\alpha^{2}}{2(\alpha^{2}/\sigma_{1}^{2}+\alpha^{2}/\sigma_{2}^{2}+1)}-\frac{(h_{1}^{k})^{2}}{2\sigma_{1}^{2}}-\frac{(h_{2}^{k})^{2}}{2\sigma_{2}^{2}}+1\right]\right.
\left.+\frac{\bar{\sigma}_{1}\bar{\sigma}_{2}}{\sigma_{1}\sigma_{2}}(1-\xi)\exp\left[-\frac{(h_{1}^{k})^{2}}{2\sigma_{1}^{2}}-\frac{(h_{2}^{k})^{2}}{2\sigma_{2}^{2}}+1\right]\right\}\,, (8)

and

\bar{\sigma}^{2}_{i}\coloneqq\frac{1}{N}\sum_{k=1}^{N}(h_{i}^{k})^{2}\,. (9)

More details of the non-Gaussian statistic are described in Appendix A.

In later sections, we compare the results of the non-Gaussian statistic and the neural networks for the detection problem. As seen in Eq. (7), we need to search the four-dimensional parameter space to find the maximum value of ${\lambda}^{\textrm{NG}}_{\textrm{ML}}$. To do that, we take grid points spanning the ranges $\rho\in[0.0,4.0]$, $\log_{10}\xi\in[-5.0,0.0]$, and $\sigma_{1}^{2},\sigma_{2}^{2}\in[0.95,1.05]$ with regular intervals of $\Delta\rho=0.1$, $\Delta\log_{10}\xi=0.05$, and $\Delta\sigma_{1}^{2}=\Delta\sigma_{2}^{2}=0.05$.
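A brute-force evaluation of Eqs. (7)-(9) over this grid could look like the following sketch. We work with the logarithm of $\lambda^{\textrm{NG}}_{\textrm{ML}}$ for numerical stability; the function names and the use of logaddexp are our choices, not taken from DF03, and the quadruple loop is purely illustrative (and slow).

```python
import numpy as np

def log_lambda_ng(h1, h2, alpha2, xi, s1sq, s2sq):
    """Logarithm of Eq. (8) for one parameter point (s1sq, s2sq are sigma_1^2, sigma_2^2)."""
    sbar1 = np.sqrt(np.mean(h1**2))          # Eq. (9)
    sbar2 = np.sqrt(np.mean(h2**2))
    common = -h1**2 / (2*s1sq) - h2**2 / (2*s2sq) + 1.0
    quad = (h1/s1sq + h2/s2sq)**2 * alpha2 / (2*(alpha2/s1sq + alpha2/s2sq + 1))
    log_a = (np.log(sbar1*sbar2*xi)
             - 0.5*np.log(s1sq*s2sq + s1sq*alpha2 + s2sq*alpha2)
             + quad + common)
    with np.errstate(divide="ignore"):       # xi = 1 makes the second branch vanish
        log_b = np.log(sbar1*sbar2*(1 - xi)) - 0.5*np.log(s1sq*s2sq) + common
    return np.logaddexp(log_a, log_b).sum()

def max_log_lambda_ng(h1, h2):
    """Brute-force maximization of Eq. (7) over the grid quoted in the text."""
    N = len(h1)
    best = -np.inf
    for rho in np.arange(0.0, 4.0 + 1e-9, 0.1):
        for log10xi in np.arange(-5.0, 0.0 + 1e-9, 0.05):
            xi = 10.0**log10xi
            for s1sq in np.arange(0.95, 1.05 + 1e-9, 0.05):
                for s2sq in np.arange(0.95, 1.05 + 1e-9, 0.05):
                    alpha2 = rho*np.sqrt(s1sq*s2sq) / (xi*np.sqrt(N))  # invert Eq. (6)
                    best = max(best, log_lambda_ng(h1, h2, alpha2, xi, s1sq, s2sq))
    return best
```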

III Basics of neural network

III.1 Structure

The fundamental unit of a neural network is an (artificial) neuron, which is a simplified model of a nerve cell in the brain. A neuron takes values signaled by other neurons as inputs and returns a single value as an output. A collection of neurons is called a layer, and a neural network consists of a sequence of layers. Each layer takes the output of the previous layer and passes its own output to the next layer. The input data of a neural network go through many layers, and the neural network returns the output data. In the following, we denote the input vector and the output vector of each layer by $\bm{x}$ and $\bm{y}$, respectively. The dimensions of the input and the output depend on the type of layer, as described below.

A fully connected layer is one of the fundamental layers of a neural network. It takes a one-dimensional real-valued vector as an input and returns a linear transformation of the input data. The operation can be described as

y_{i}=\sum_{j=0}^{N}w_{ij}x_{j}\,, (10)

where $N$ is the number of elements of the input vector. The 0th component is set to $x_{0}=1$ and represents the constant term of the linear transformation. The coefficients $w_{ij}$ are called weights, and we must tune them appropriately before applying the neural network to real data.

A linear transformation like a fully connected layer is usually followed by a nonlinear function called an activation function. Most activation functions have no tunable parameters. In this work, we use the rectified linear unit (ReLU), defined by

\mathrm{ReLU}(z)\coloneqq\begin{cases}z&\text{if }0\leq z\,,\\ 0&\text{if }z<0\,.\end{cases} (11)

An activation function can take an input of arbitrary size and is applied elementwise.

For image recognition, it is important to capture local patterns. To do so, filters much smaller than the input data are used. A convolutional layer carries out a convolution between the input data and the filters. A neural network containing one or more convolutional layers is often called a convolutional neural network (CNN). In this work, we use a one-dimensional convolutional layer. It takes a two-dimensional tensor as an input, denoted by $\bm{x}=x_{c,i}$. This represents the situation where each grid point of the data has multiple values. The different values contained in one pixel are called channels, represented by the first index $c$. For example, a color image has three channels corresponding to the primary colors red, green, and blue. In our case, the strain data has two channels corresponding to the two different detectors. The second index $i$ corresponds to a pixel. Formally, we can write a one-dimensional convolutional layer as

y_{c,i}=\sum_{c^{\prime}=1}^{C}\sum_{k=0}^{K-1}w_{c^{\prime},c,k}x_{c^{\prime},s(i-1)+k}\,, (12)

where the parameters $w_{c^{\prime},c,k}$ characterize the filter, $K$ is the filter size, and $C$ is the number of channels. The filter is applied multiple times to the input data by sliding it over the whole array. The parameter $s$ is called the stride and controls the step width of the slide.

A pooling layer reduces the size of the data by contracting several data points into one data point. There are several variants of pooling layers depending on how the information is reduced. In the present work, we use two types of pooling. The max pooling layer is defined by

y_{c,i}=\max_{j=0,1,\dots,K-1}\left[x_{c,s(i-1)+j}\right]\,. (13)

We also use average pooling, defined by

y_{c,i}=\frac{1}{K}\sum_{j=0}^{K-1}x_{c,s(i-1)+j}\,. (14)

The last layer of a neural network is called the output layer. It should be tailored to the problem to solve. For regression, the identity function

y_{i}=x_{i}\,, (15)

is often applied. Usually, the identity function is not written explicitly because it is trivial. The classification problem, on the other hand, requires a more elaborate output layer. In a classification problem, the neural network is constructed such that each element of the output corresponds to the probability that the input belongs to the corresponding class. To interpret the outputs as probabilities, they must satisfy the following relations,

\sum_{i=1}^{N_{\mathrm{class}}}y_{i}=1\,, (16)

and

y_{i}\geq 0\text{ for any }i\,. (17)

Here, $N_{\mathrm{class}}$ is the number of target classes. A softmax layer, defined by

y_{i}=\frac{\exp[x_{i}]}{\sum_{j=1}^{N_{\mathrm{class}}}\exp[x_{j}]}\,, (18)

is suitable for the classification problem. The output of a softmax layer (18) satisfies the conditions Eqs. (16) and  (17).

III.2 Residual block

Figure 2: Schematic picture of a residual block. A standard layer transforms an input $\bm{x}$ into an output $F(\bm{x})$, while a shortcut connection directly passes the input to the output. In total, the residual block returns their sum $F(\bm{x})+\bm{x}$. If the input $\bm{x}$ and the output $F(\bm{x})$ have different shapes, the data passing through the shortcut connection is reshaped appropriately to have the same shape.

One may naively expect that a deeper neural network shows better performance. This expectation is valid to some extent; however, it is empirically known that the performance saturates as the network depth increases and that, beyond a certain depth, the accuracy even gets worse. This is known as the degradation problem. He et al. [40] proposed residual learning to address the degradation problem. The idea of a residual network is to introduce a shortcut connection, as shown in Fig. 2, which enables us to efficiently train deep neural networks.

Figure 3: Structure of the residual block used in this work. The main path consists of three convolutional layers and a batch normalization layer. In the shortcut connection, a convolutional layer reshapes the data so that the size of its output matches that of the output of the main path.

Figure 3 shows another type of residual block, called a bottleneck block [40]. The main path has three convolutional layers. The first convolutional layer has a kernel size of one and reduces the number of channels. The second convolutional layer acts as an ordinary convolutional layer. The third convolutional layer has a kernel size of one and restores the number of channels. Both the first and the second convolutional layers are followed by the ReLU activation (11). Batch normalization [41] is located at the end of the main path, where the average and the variance of the input data are calculated elementwise over the batch,

\bm{\mu}\coloneqq\frac{1}{{N}_{\textrm{batch}}}\sum_{n=1}^{{N}_{\textrm{batch}}}\bm{x}_{n}\,, (19)
\bm{v}\coloneqq\frac{1}{{N}_{\textrm{batch}}}\sum_{n=1}^{{N}_{\textrm{batch}}}(\bm{x}_{n}-\bm{\mu})^{2}\,. (20)

Here, a batch is a subset of the training data, and $N_{\mathrm{batch}}$ is the size of a batch. The use of batches in the training is explained later, in Sec. III.4. Using Eqs. (19) and (20), the batch normalization transforms the input data into

\hat{\bm{x}}_{n}\coloneqq\frac{\bm{x}_{n}-\bm{\mu}}{\sqrt{\bm{v}+\epsilon}}\,, (21)
\bm{y}=\bm{\gamma}\ast\hat{\bm{x}}_{n}+\bm{\beta}\,, (22)

where $\bm{\gamma}$ and $\bm{\beta}$ are trainable parameters, the asterisk $\ast$ represents elementwise multiplication, and $\epsilon$ is introduced for numerical stability. We set $\epsilon=10^{-5}$.

The shortcut connection also has a convolutional layer with a kernel size of one and a stride of two. The input data is reshaped to match the size of the output of the main path.
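A bottleneck block of this kind could be written in pytorch as in the following sketch. The channel counts follow Table 3, but the bottleneck width (16 channels), the kernel size, stride, and padding of the middle convolution are our guesses, chosen so that the output lengths and the per-block parameter count (7456) of Table 3 are reproduced; the authors' exact hyperparameters may differ.

```python
import torch
import torch.nn as nn

class BottleneckBlock(nn.Module):
    """1D bottleneck residual block in the spirit of Fig. 3 (hypothetical hyperparameters)."""
    def __init__(self, channels=64, bottleneck=16, kernel=4, stride=2):
        super().__init__()
        self.main = nn.Sequential(
            nn.Conv1d(channels, bottleneck, kernel_size=1),        # reduce channels
            nn.ReLU(),
            nn.Conv1d(bottleneck, bottleneck, kernel_size=kernel,
                      stride=stride, padding=2),                   # "usual" convolution
            nn.ReLU(),
            nn.Conv1d(bottleneck, channels, kernel_size=1),        # restore channels
            nn.BatchNorm1d(channels),
        )
        # Shortcut: kernel size 1, stride 2, reshapes the input to the main path's size.
        self.shortcut = nn.Conv1d(channels, channels, kernel_size=1, stride=stride)

    def forward(self, x):
        return self.main(x) + self.shortcut(x)

block = BottleneckBlock()
x = torch.randn(1, 64, 9993)
print(block(x).shape)                                   # (1, 64, 4997), as in Table 3
print(sum(p.numel() for p in block.parameters()))       # 7456, as in Table 3
```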

III.3 Supervised learning

Before applying the neural network to real data, we optimize the neural network’s weights using a dataset prepared in advance. The optimization process is called training. To train a neural network, we prepare a dataset consisting of many pairs of input data and target data, which is hereafter denoted by $\bm{t}$. In this paper, the input data is the time-series strain data, and the target data should be chosen appropriately depending on the problem to solve.

In this work, we treat two types of problems: regression and classification. For the regression problem, the injected values of the parameters can be used as the target values. For the classification problem, in which the inputs are classified into several classes, the one-hot representation is widely used for the target vector. When the number of target classes is $N_{\mathrm{class}}$, the target vector is an $N_{\mathrm{class}}$-dimensional vector whose elements take 0 or 1. If the input is assigned to the $i$th class, only the $i$th element takes 1, and the others are 0. For example, in the detection problem presented in Sec. IV, we have two classes, corresponding to the absence and the presence of a GW signal. In this case, the target vector is chosen as

\bm{t}=\begin{cases}(1,0)&\text{in the absence of a GW signal}\\ (0,1)&\text{in the presence of a GW signal}\,.\end{cases} (23)

In Sec. V, we demonstrate estimation of the duty cycle. We first apply the ordinary parameter estimation approach; the injected values of the parameters (the signal amplitude and the duty cycle) are used as the target values. In the second approach, we reduce the parameter estimation to the classification problem, in which the inputs are classified into four classes of duty cycle values.

III.4 Training process

In the training process, we use a loss function to quantify the deviation between the neural network predictions and the target values. For regression, various choices exist. In this work, we employ the L1 loss, defined by

L_{\mathrm{L1}}(\bm{y},\bm{t})=\frac{1}{N_{\mathrm{param}}}\sum_{i=1}^{N_{\mathrm{param}}}|y_{i}-t_{i}|\,, (24)

where $N_{\mathrm{param}}$ is the number of parameters to be estimated. For the classification problem, the cross-entropy loss, defined by

{L}_{\textrm{cross-entropy}}(\bm{y},\bm{t})=-\sum_{i=1}^{N_{\mathrm{class}}}t_{i}\ln y_{i}\,, (25)

is widely used.
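Both losses are straightforward to write down explicitly; the following is a small sketch in plain NumPy, with the detection targets of Eq. (23) as an example. In pytorch, the corresponding built-ins are torch.nn.L1Loss and torch.nn.CrossEntropyLoss, but note that the latter expects unnormalized logits rather than the softmax probabilities used in Eq. (25).

```python
import numpy as np

def l1_loss(y, t):
    """Eq. (24): mean absolute error over the estimated parameters."""
    y, t = np.asarray(y, float), np.asarray(t, float)
    return np.mean(np.abs(y - t))

def cross_entropy_loss(y, t):
    """Eq. (25): y are softmax outputs (probabilities), t is a one-hot target."""
    y, t = np.asarray(y, float), np.asarray(t, float)
    return -np.sum(t * np.log(y))

y = np.array([0.2, 0.8])         # network assigns 80% probability to "signal present"
t = np.array([0.0, 1.0])         # the data actually contains a signal, Eq. (23)
print(cross_entropy_loss(y, t))  # = -ln 0.8 ~ 0.22
```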

The weights of a neural network are updated so that the sum of the loss function over all training data becomes small. In general, however, the minimization of the loss function cannot be done analytically. Thus, an iterative process is employed. We divide the training data into several subsets, called batches. In each step of the iteration, the prediction and the loss evaluation are made for all data contained in a given batch. The update process depends on the gradient of the loss function averaged over the batch, i.e.,

\mathcal{L}=\frac{1}{{N}_{\textrm{batch}}}\sum_{n=1}^{{N}_{\textrm{batch}}}L(\bm{y}_{n},\bm{t}_{n})\,, (26)

where $\bm{y}_{n}$ and $\bm{t}_{n}$ are the prediction of the neural network and the target vector for the $n$th data, respectively.

The simplest update procedure is the stochastic gradient descent method (SGD). In SGD, the derivatives of the loss function with respect to the neural network’s weights are calculated, and the weights are updated by the procedure

w\to w-\eta\frac{\partial{\mathcal{L}}}{\partial{w}}\,, (27)

where we omit all subscripts and superscripts of $w$ for brevity. The parameter $\eta$ is called the learning rate and characterizes how sensitive the weight update is to the loss function gradients. A batch is randomly chosen for every iteration step. Many update procedures have been proposed so far, and most of them exploit the gradients of the loss function with respect to the weights. In spite of the tremendous number of weights, the algorithm called backpropagation enables us to efficiently calculate all the gradients of the loss function.
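The gradient step of Eqs. (26) and (27) can be illustrated with pytorch autograd as in the following minimal sketch. The tiny model and the random batch are placeholders, not the networks or data of Sec. IV.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)   # eta = learning rate
loss_fn = nn.L1Loss()                                      # Eq. (24)

x = torch.randn(32, 8)          # a batch of 32 input vectors
t = torch.randn(32, 1)          # corresponding targets

optimizer.zero_grad()           # clear old gradients
loss = loss_fn(model(x), t)     # batch-averaged loss, Eq. (26)
loss.backward()                 # backpropagation: dL/dw for every weight
optimizer.step()                # w <- w - eta * dL/dw, Eq. (27)
```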

IV Detection of non-Gaussian stochastic GWs

In this section, we present the application of deep learning to the detection problem of the non-Gaussian stochastic GW background.

IV.1 Setup of neural networks

Table 1: Structure of the shallower CNN. The total number of tunable parameters is 668658. The first column shows the name of the layer, the second column the size of the output data, and the last column the number of tunable parameters. The network first has an input layer that passes the input data through, so the size of its output is equal to that of the input data. Before the flattening layer, the output size has two dimensions corresponding to the number of channels and the data length. The flattening layer transforms the data into a one-dimensional vector. It is followed by three fully connected layers and two activation layers. The last layer is the softmax layer returning the probabilities of the absence and the presence of the signal.
Layer Output size No. of parameters
Input (2, 10000) -
1D convolutional (16, 9993) 272
ReLU (16, 9993) -
1D max pooling (16, 2498) -
1D convolutional (32, 2491) 4128
ReLU (32, 2491) -
1D max pooling (32, 622) -
1D convolutional (64, 619) 8256
ReLU (64, 619) -
1D max pooling (64, 154) -
1D convolutional (128, 151) 32896
ReLU (128, 151) -
1D max pooling (128, 37) -
Flattening (4736,) -
Fully connected (128,) 606336
ReLU (128,) -
Fully connected (128,) 16512
ReLU (128,) -
Fully connected (2,) 258
Softmax (2,) -
Table 2: Structure of the deeper CNN. The total number of tunable parameters is 10127426. The description of the columns is the same as in Table 1.
Layer Output size No. of parameters
Input (2, 10000) -
1D convolutional (64, 9993) 1088
ReLU (64, 9993) -
1D convolutional (64, 9986) 32832
ReLU (64, 9986) -
1D max pooling (64, 2496) -
1D convolutional (64, 2489) 32832
ReLU (64, 2489) -
1D convolutional (64, 2482) 32832
ReLU (64, 2482) -
1D max pooling (64, 620) -
1D convolutional (64, 613) 32832
ReLU (64, 613) -
1D convolutional (64, 606) 32832
ReLU (64, 606) -
1D max pooling (64, 303) -
Flattening (19392,) -
Fully connected (512,) 9929216
ReLU (512,) -
Fully connected (64,) 32832
ReLU (64,) -
Fully connected (2,) 130
Softmax (2,) -
Table 3: Structure of the residual network. Each residual block has the structure shown in Fig. 3. The total number of trainable parameters is 10280546, which is comparable to that of the deeper CNN presented in Table 2.
Layer Output size No. of parameters
Input (2,10000) -
1D convolutional (64,9993) 1088
ReLU (64,9993) -
Residual block (64,4997) 7456
ReLU (64, 4997) -
Residual block (64, 2499) 7456
ReLU (64, 2499) -
Residual block (64, 1250) 7456
ReLU (64, 1250) -
1D average pooling (64, 312) -
Flatten (19968,) -
Fully connected (512,) 10224128
ReLU (512,) -
Fully connected (64,) 32832
ReLU (64,) -
Fully connected (2,) 130
Softmax (2,) -

We test two CNNs of different sizes (a shallower and a deeper CNN) and the residual network. The deeper CNN has about the same number of tunable parameters as the residual network, which is useful for making a fair comparison with the residual network, while the shallower CNN has fewer parameters. The two CNNs have similar structures, which are shown in Tables 1 and 2. Just before the first fully connected layer, the data is reshaped into a one-dimensional vector, which is called flattening and is often regarded as a layer. Table 3 shows the structure of the residual network.
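As an illustration, the shallower CNN could be assembled in pytorch as in the following sketch. The kernel and pooling sizes (8, 8, 4, 4 for the convolutions and 4 for every pooling layer) are inferred from the output shapes and parameter counts quoted in Table 1, not taken from the authors' code, but the resulting layer sizes and total parameter count reproduce the table.

```python
import torch.nn as nn

shallower_cnn = nn.Sequential(
    nn.Conv1d(2, 16, kernel_size=8),   nn.ReLU(), nn.MaxPool1d(4),  # (16, 9993) -> (16, 2498)
    nn.Conv1d(16, 32, kernel_size=8),  nn.ReLU(), nn.MaxPool1d(4),  # (32, 2491) -> (32, 622)
    nn.Conv1d(32, 64, kernel_size=4),  nn.ReLU(), nn.MaxPool1d(4),  # (64, 619)  -> (64, 154)
    nn.Conv1d(64, 128, kernel_size=4), nn.ReLU(), nn.MaxPool1d(4),  # (128, 151) -> (128, 37)
    nn.Flatten(),                                                   # (4736,)
    nn.Linear(4736, 128), nn.ReLU(),
    nn.Linear(128, 128),  nn.ReLU(),
    nn.Linear(128, 2),
    nn.Softmax(dim=1),                                              # class probabilities
)

print(sum(p.numel() for p in shallower_cnn.parameters()))           # 668658, as in Table 1
```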

We train the three networks in an equal manner. For the signal and noise model applied in this paper, generating data is not computationally costly, so we can generate new data for every iteration of the training. We set the batch size to 256 and divide a batch into two subsets. The first half contains only noise, and the other half contains the GW signal and noise. Each data consists of two simulated strain series $\{h^{k}_{1},h^{k}_{2}\}$ (see Eq. (1)) that are generated by using the noise model (2) and the signal model (5). We assign the target vector $\bm{t}=(1,0)$ and $\bm{t}=(0,1)$ for the absence and the presence of the GW signal, respectively. The data length is set to $N=10^{4}$. For the signal injection, the astrophysical duty cycle is sampled from a log-uniform distribution on $\xi\in[10^{-3},10^{-1}]$. The SNR is uniformly sampled from $[\rho_{\mathrm{min}},4.0]$ with

\rho_{\mathrm{min}}=\max[0.5,3.5+1.3\log_{10}\xi]\,. (28)

The lower bound $\rho_{\mathrm{min}}$ is set for the following reason. The sensitivity of the non-Gaussian statistic depends on the duty cycle, as described in Appendix A. We expect the sensitivity of the neural networks to show the same trend and not to significantly outperform the non-Gaussian statistic. If we used a lower bound on the SNR that is constant with respect to the duty cycle, then for larger duty cycles we could end up training the neural network with incorrectly labeled positive examples, whose signals are too weak to be detected, and this could degrade the neural network. Therefore, we manually set the lower bound (28) on the SNR slightly below the detectable SNR of the non-Gaussian statistic.

Before inputting the data to the neural network, we normalize them to make the mean zero and the variance unity. Thus, the normalized input is given by

\hat{h}^{k}_{i}=\frac{h^{k}_{i}-{\mu}_{\textrm{h}}}{{\sigma}_{\textrm{h}}}\,, (29)

where

{\mu}_{\textrm{h}}\coloneqq\frac{1}{2N}\sum_{k=1}^{N}(h^{k}_{1}+h^{k}_{2})\,, (30)

and

{\sigma}_{\textrm{h}}^{2}\coloneqq\frac{1}{2N}\sum_{k=1}^{N}\left\{(h^{k}_{1}-{\mu}_{\textrm{h}})^{2}+(h^{k}_{2}-{\mu}_{\textrm{h}})^{2}\right\}\,. (31)

We use the cross-entropy loss Eq. (25) with $N_{\mathrm{class}}=2$. The weight update is repeated for 100,000 iterations. We use Adam [42] as the update method. The learning rate is set to $10^{-5}$. The code is implemented with pytorch [43], a library for deep learning.
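Putting these pieces together, one training iteration could be sketched as follows. The helper names make_batch and train_step are hypothetical, the optimizer is assumed to be torch.optim.Adam(model.parameters(), lr=1e-5), and model is assumed to be one of the softmax-terminated networks of Tables 1-3; this is an illustration of the training setup described above, not the authors' code.

```python
import numpy as np
import torch

def make_batch(batch_size=256, N=10_000, rng=None):
    """One training batch: first half noise only (label 0), second half signal + noise (label 1)."""
    rng = np.random.default_rng(rng)
    data, labels = [], []
    for i in range(batch_size):
        s = np.zeros(N)
        if i >= batch_size // 2:                          # inject a signal in the second half
            log10xi = rng.uniform(-3.0, -1.0)
            xi = 10.0**log10xi
            rho = rng.uniform(max(0.5, 3.5 + 1.3*log10xi), 4.0)   # Eq. (28)
            alpha2 = rho / (xi * np.sqrt(N))              # Eq. (6) with sigma_1 = sigma_2 = 1
            s = np.where(rng.random(N) < xi, rng.normal(0.0, np.sqrt(alpha2), N), 0.0)
        h = np.stack([s + rng.normal(size=N), s + rng.normal(size=N)])
        h = (h - h.mean()) / h.std()                      # normalization, Eqs. (29)-(31)
        data.append(h)
        labels.append(int(i >= batch_size // 2))
    return (torch.tensor(np.array(data), dtype=torch.float32), torch.tensor(labels))

def train_step(model, optimizer):
    """One weight update with the cross-entropy loss of Eq. (25)."""
    x, t = make_batch()
    optimizer.zero_grad()
    p = model(x)                                          # softmax probabilities, shape (256, 2)
    loss = -torch.log(p[torch.arange(len(t)), t]).mean()  # batch-averaged Eq. (25)
    loss.backward()
    optimizer.step()
    return loss.item()
```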

IV.2 Result

Figure 4: Minimum detectable SNR with 90% detection probability for the non-Gaussian statistic, two convolutional neural networks (shallower and deeper), and the residual network. The false alarm rate is set at 5%. The black squares are the non-Gaussian statistic, the blue squares are the shallower CNN, the orange circles are the deeper CNN, and the green circles are the residual network. For visibility, the dots are slightly shifted in the horizontal direction. The error bar shows the standard deviation of $\rho_{90\%}$ evaluated from four independent runs. The shaded area is the parameter region not used for training.

Now we evaluate the detection efficiencies of the neural networks and compare them with the non-Gaussian statistic. First, we set the thresholds of the detection statistics by simulating noise-only data. The false alarm probability is the fraction of false positive events over the total test events, i.e.,

\mathrm{FAP}=\frac{N(\Gamma_{\ast}<\Gamma)}{N_{\mathrm{noise}}}\,, (32)

where $N_{\mathrm{noise}}$ is the number of simulated noise data, $\Gamma$ is the detection statistic, $\Gamma_{\ast}$ is the threshold value of $\Gamma$, and $N(\Gamma_{\ast}<\Gamma)$ is the number of events for which the detection statistic exceeds the threshold. The neural network returns the probability of each class, denoted by $\{p_{i}\}_{i=1,\dots,N}$, which satisfies $\sum_{i=1}^{N}p_{i}=1$. Here we have two classes ($N=2$) corresponding to the absence and presence of a GW signal. We set $\Gamma=p_{2}$, which is the probability that the data contains a GW signal, for the neural networks and $\Gamma=\Lambda^{\mathrm{NG}}_{\mathrm{ML}}$ for the non-Gaussian statistic. We set $\mathrm{FAP}=0.05$ and find the value of $\Gamma_{\ast}$ which satisfies Eq. (32). To determine the threshold, we use 500 test data of simulated Gaussian noise.

Once we obtain the threshold, we determine the minimum SNR for detection by simulating data with a GW signal and requiring the detection probability ${p}_{\textrm{det}}=0.9$. For the signal injection, the values of the SNR and duty cycle are taken from $\rho\in[0.2,4.0]$ and $\log_{10}\xi\in[-3.0,-1.0]$ with intervals of $\Delta\rho=0.2$ and $\Delta\log_{10}\xi=0.2$. We prepare 500 data for each injection value, count the number of data satisfying $\Gamma_{\ast}<\Gamma$, and obtain the detection probability as a function of $\rho$ for each $\xi$. From this, we find the minimum value of $\rho$ that gives ${p}_{\textrm{det}}=0.9$. To evaluate the statistical fluctuation due to the randomness of the signal and the noise, we independently carry out the whole process four times.
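Given arrays of simulated detection statistics, the threshold and the minimum detectable SNR can be extracted with a few lines of NumPy. The sketch below uses our own variable names, and the interpolation step is a simplification of the procedure described above (it assumes a monotonically increasing detection-probability curve).

```python
import numpy as np

def threshold_from_fap(gamma_noise, fap=0.05):
    """Gamma_* such that the fraction of noise-only events with Gamma > Gamma_* is fap, Eq. (32)."""
    return np.quantile(gamma_noise, 1.0 - fap)

def detection_probability(gamma_signal, gamma_star):
    """Fraction of injected events whose detection statistic exceeds the threshold."""
    return np.mean(gamma_signal > gamma_star)

def rho_90(rhos, det_probs):
    """Minimum SNR giving 90% detection probability, by linear interpolation
    of the measured detection-probability curve (assumed monotonically increasing)."""
    return np.interp(0.9, det_probs, rhos)
```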

Figure 4 summarises the results, showing the minimum detectable SNRs for the four methods, i.e., the non-Gaussian statistic based on DF03, the shallower CNN, the deeper CNN, and the residual network. For the range $-3.0\leq\log_{10}\xi\leq-2.0$, all deep learning methods show a performance comparable to the non-Gaussian statistic. For $-1.75<\log_{10}\xi$, we see that the residual network performs as well as the non-Gaussian statistic, while the performance of the shallower and deeper CNNs gets worse. The deviation between the residual network and the deeper CNN, which have almost the same number of tunable parameters, clearly shows the advantage of using the residual blocks.

IV.3 Computational time

At the end of this section, we list the computational times of the neural networks and the non-Gaussian statistic. The computational time of the non-Gaussian statistic is defined as the time needed to carry out the grid search for 500 test data. For the neural networks, we measure the time that the trained models spend analyzing 500 test data. We use a CPU, Intel(R) Xeon(R) CPU E5-1620 v4 @ 3.50GHz ($224$ GFLOPS), for the non-Gaussian statistic and a GPU, Quadro GV100 ($16.6$ TFLOPS in single precision), for the neural networks. Table 4 shows the comparison of the computational times and the ratios with respect to the non-Gaussian statistic. Note that here we performed a simple grid search to find the maximum value of the non-Gaussian statistic, but the computational time could be improved by applying a faster search algorithm. Even considering this point and the difference in computational power between the CPU and the GPU, deep learning shows a clear advantage in computational time. This can be fruitful when we apply deep learning to the longer strain data reasonably expected in a realistic situation.

Table 4: Computational times of different methods for the detection problem.
method time [sec] ratio
non-Gaussian statistic $1.13\times 10^{4}$ 1
shallower CNN $2.54\times 10^{-2}$ $2.25\times 10^{-6}$
deeper CNN $7.96\times 10^{-2}$ $7.04\times 10^{-6}$
residual network $7.66\times 10^{-2}$ $6.78\times 10^{-6}$

V Estimating duty cycle

In this section, we demonstrate neural network applications for parameter estimation. We take two approaches. First, the neural network is trained to output the estimated values of the duty cycle and the SNR. In the second approach, we treat parameter estimation as a classification problem by dividing the range of duty cycle values into four classes. The first method is more straightforward and can directly give the value of $\xi$, but we show that the estimation gets biased when the duty cycle is relatively small ($\xi\lesssim 10^{-3}$). The second approach can predict only the rough range of $\xi$, but it shows reasonable performance even for smaller duty cycles, $\xi\sim 10^{-4}$.

V.1 First approach: direct estimation of the duty cycle and the SNR

We train the neural network to predict the values of the duty cycle and the SNR. We use the structure of the residual network shown in Table 3 and Fig. 3 with the softmax layer removed. The weight update is repeated $10^{5}$ times. The training data is generated by sampling the duty cycle from a log-uniform distribution on $[10^{-2},10^{0}]$ and the SNR from a uniform distribution on $[1,60]$. To make the training easier, the injection parameters are normalized by

\hat{Q}=\frac{2Q-Q_{\mathrm{min}}-Q_{\mathrm{max}}}{Q_{\mathrm{max}}-Q_{\mathrm{min}}}\,, (33)

where $Q=\{\log_{10}\xi,\rho\}$ denotes the injected values, and $Q_{\mathrm{min}}$ and $Q_{\mathrm{max}}$ are the minimum and maximum values of the training range, respectively. With this normalization, $\hat{Q}$ lies in the range $[-1,1]$. The outputs of the neural network directly correspond to the estimated values of $\hat{Q}$. We use the L1 loss (Eq. (24)) as the loss function. We set the batch size to 512. The update algorithm is Adam with a learning rate of $10^{-5}$.
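The normalization of Eq. (33) and its inverse, which maps the network outputs back to physical parameters, are simple enough to sketch explicitly; the function names and the dictionary of training ranges below are ours.

```python
import numpy as np

RANGES = {"log10xi": (-2.0, 0.0), "rho": (1.0, 60.0)}   # training ranges quoted in the text

def normalize(q, qmin, qmax):
    return (2.0*q - qmin - qmax) / (qmax - qmin)        # Eq. (33), maps to [-1, 1]

def denormalize(qhat, qmin, qmax):
    return 0.5*((qmax - qmin)*qhat + qmin + qmax)       # inverse of Eq. (33)

print(normalize(-1.0, *RANGES["log10xi"]))              # 0.0, the centre of the training range
```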

Figure 5: Parameter estimation of the duty cycle (left) and the SNR (right) by the neural network. The scatter plot shows the true value on the horizontal axis and the predicted value on the vertical axis. The diagonal line represents equal values for predicted and true values.

We test the trained residual network with newly generated data whose parameters are sampled from the same distributions as those of the training data. Figure 5 is the scatter plot comparing the true values with the predicted values. We can see that the neural network recovers the true values reasonably well.

In order to evaluate the performance quantitatively, let us define the average and standard deviations of the error as

\overline{\delta Q}\coloneqq\frac{1}{N}\sum_{n=1}^{N}\left(Q^{\mathrm{pred}}_{n}-Q^{\mathrm{true}}_{n}\right)\,, (34)
\sigma[\delta Q]\coloneqq\sqrt{\frac{1}{N}\sum_{n=1}^{N}\left(Q^{\mathrm{pred}}_{n}-Q^{\mathrm{true}}_{n}\right)^{2}}\,, (35)

where $N$ is the number of test data, and $Q^{\mathrm{pred}}_{n}$ and $Q^{\mathrm{true}}_{n}$ are respectively the predicted value and the true value of the quantity $Q=\{\log_{10}\xi,\rho\}$ for the $n$th test data. Table 5 shows $\overline{\delta Q}$ and $\sigma[\delta Q]$ obtained by using 500 test data. The duty cycle and SNR are randomly sampled from uniform distributions on $\log_{10}\xi\in[-2,0]$ and $\rho\in[1,60]$. For both the duty cycle and the SNR, $\overline{\delta Q}$ is much smaller than $\sigma[\delta Q]$. From this, we conclude that the neural network predicts the duty cycle and the SNR without bias.

Table 5: Averages and standard deviations of errors in log10ξ\log_{10}\xi and ρ\rho.
$\overline{\delta\log_{10}\xi}$ $\sigma[\delta\log_{10}\xi]$ $\overline{\delta\rho}$ $\sigma[\delta\rho]$
$-1.29\times 10^{-5}$ 0.11 $-8.90\times 10^{-2}$ 2.97
Figure 6: Errors in the duty cycle (top) and the SNR (bottom) for different fiducial values of the duty cycle. The blue circles, orange squares, and green triangles respectively show the results of the data sets with the fiducial SNR of 10, 30, and 50. Each dot shows the average of the error, and the error bar represents the standard deviation of the errors. The left panels show the results of the neural network trained with $\log_{10}\xi\in[-2,0]$, and the right panels are those trained with $\log_{10}\xi\in[-4,0]$.

To further check the performance in detail, in the left panels of Fig. 6, we plot the average errors of $\log_{10}\xi$ (top) and $\rho$ (bottom) for different fiducial parameter values. The error bars indicate their standard deviations. To make this plot, we sample $\log_{10}\xi$ from $-2$ to 0 with an interval of $0.5$. For each duty cycle, we prepare datasets with SNR $10$, $30$, and $50$. Each dataset contains $500$ realizations. Note that we do not use the relative error for $\log_{10}\xi$ because the target value can be close to zero, which causes a divergence in the relative error. First, we find from both left panels that the error variance reasonably increases as the SNR decreases. The estimation of the duty cycle and SNR does not appear to be biased, except when the fiducial value is at the border of the training range ($\log_{10}\xi=0$ and $-2$) and the SNR is small ($\rho=10$).

In the right panels of Fig. 6, we show the results in which we include lower values of the duty cycle in the training, $\log_{10}\xi\in[-4,0]$. We find that the error variances of both the duty cycle and the SNR significantly increase as the duty cycle decreases for $\log_{10}\xi\lesssim-3$. For the duty cycle, the systematic bias is smaller than the variance. On the other hand, for the SNR, we find a clear bias: the neural network tends to output a larger SNR than the true value when $\rho=10$ and a smaller SNR when $\rho=50$. We find from test runs that such biases tend to increase when we use a shorter data length. From this, we infer that the bias arises because the data length is too short. In fact, with the data length of $N=10^{4}$ used throughout this paper, bursts can be entirely absent in the strain data for $\xi\sim 10^{-4}$.

V.2 Second approach: Classification problem

As a second approach, we consider the classification problem. We divide the range of duty cycle values into four categories and assign the class index as follows:

\textrm{class index}=\begin{cases}1&(-1\leq\log_{10}\xi<0)\\ 2&(-2\leq\log_{10}\xi<-1)\\ 3&(-3\leq\log_{10}\xi<-2)\\ 4&(-4\leq\log_{10}\xi<-3)\,.\end{cases} (36)
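For labeling the training and test data, this mapping amounts to the following one-liner; the function name is ours, and duty cycles outside the range of Eq. (36) are not handled beyond raising an error.

```python
import numpy as np

def duty_cycle_class(xi):
    """Class index of Eq. (36) for a given duty cycle."""
    log10xi = np.log10(xi)
    for index, lower in enumerate([-1.0, -2.0, -3.0, -4.0], start=1):
        if log10xi >= lower:
            return index
    raise ValueError("duty cycle below the range covered by Eq. (36)")

print([duty_cycle_class(x) for x in (0.5, 0.05, 5e-3, 5e-4)])   # [1, 2, 3, 4]
```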

Again, we use the residual network with the structure shown in Table 3 and Fig. 3, but the last fully connected layer and the softmax layer are modified to have four-dimensional outputs.

The training procedure is as follows. The weight update is repeated $10^{5}$ times. The input data are normalized in the same way as in the detection problem (see Eq. (29)). The duty cycle is sampled from a log-uniform distribution on $[10^{-4},10^{0}]$, and the SNR is sampled from a uniform distribution on $[1,60]$. The batch size is 512, and the update algorithm is Adam with a learning rate of $10^{-5}$. For the loss function, we use the cross-entropy loss Eq. (25) with $N_{\mathrm{class}}=4$.

Figure 7: Confusion matrix for the duty cycle estimation. The rows and columns represent the true label and the predicted label, respectively. Each class is labeled by an integer $\{1,2,3,4\}$, corresponding to $\log_{10}\xi\in[-1,0)$, $[-2,-1)$, $[-3,-2)$, and $[-4,-3)$, respectively. The numbers are in units of percent and represent the fraction of data classified from the true label to the predicted label.

The trained neural network is tested with four datasets; each consists of 512 data and corresponds to a different class. In the same way as for the training data, SNRs are uniformly sampled from the range [1, 60] for all test datasets. Figure 7 presents the confusion matrix of the classification by the residual network. We find that 93.1% of the test data are classified into the correct class on average. Also, unlike the direct parameter estimation shown in the previous subsection, the residual network works well even for small values of the duty cycle, $\log_{10}\xi\lesssim-2$. Thus, this method could be useful for giving an order-of-magnitude estimate of the duty cycle.

Figure 8: Distribution of the true values of $(\log_{10}\xi,\rho)$ of the misclassified events. The color of the dots represents the maximum value of the probability among $\{p_{i}\}_{i=1,2,3,4}$.
Figure 9: Cumulative fraction of the maximum value of the predicted probabilities in descending order. Blue and orange lines correspond to the misclassified and correctly classified events, respectively. Note that the numbers of correctly classified and misclassified events are different: 1905 events are correctly classified, and 143 events are misclassified.

Now, we further investigate the misclassified cases. Figure 8 shows the scatter plot of misclassified events in the ($\log_{10}\xi$, $\rho$) plane. It clearly shows that the duty cycles of the misclassified events are located at the boundaries of neighboring classes. As for the SNR distribution, we find that it is almost uniform, but, as expected, there is a tendency for misclassification to occur more often for $\rho\lesssim 5$. The colors of the dots in the scatter plot represent the maximum values of $p_{i}$ (the probability of each class), which indicates how confidently the neural network predicts the class. We can see that most of the misclassified events are given with low confidence.

Figure 9 shows the cumulative histograms (cumulated in descending order of $p_{i}$) for correctly classified events and misclassified events. We can see a clear difference between them. For most of the correctly classified events, a probability close to 1 is assigned. On the other hand, we again see that misclassified events tend to have low confidence. However, $20\%$ of the misclassified events have $\max[p_{i}]>0.9$. As seen from Fig. 8, they are at the boundaries of neighboring classes, and this would be unavoidable with the classification approach.

VI Conclusion

In this work, we studied applications of convolutional neural networks to the detection and parameter estimation of non-Gaussian stochastic GWs. As for the detection problem, we compared three different configurations of neural networks: a shallower CNN, a deeper CNN, and a residual network. We found that the residual network can achieve sensitivity comparable to the maximum likelihood statistic. We also showed that neural networks have an advantage in computational time compared to the non-Gaussian statistic.

Next, we investigated the estimation of the duty cycle by a neural network with two different approaches. In the first approach, we trained the residual network to directly estimate the values of the duty cycle and SNR. We found that the estimation error in $\log_{10}\xi$ is $\lesssim 0.2$. As for the SNR, the neural network can estimate it with a relative error of $10$-$20\%$. We found that the estimation of the duty cycle gets biased when we include small duty cycles in the training, $\log_{10}\xi\in[-4,0]$. This could be explained by the shortness of the data length used in this paper. In the second approach, the parameter estimation was reduced to a classification problem in which the neural network classifies the data depending on the duty cycle. The parameter range was $\log_{10}\xi\in[-4,0]$, and it was divided into four classes with bands of $\Delta\log_{10}\xi=1$. The neural network could classify the data with an accuracy of 93% on average.

The present work is the first attempt to apply deep learning to the astrophysical GW background. In this work, we employed the toy model used in DF03, where various realistic effects, such as the detector configuration, noise properties, and waveform model of the bursts, are neglected. In particular, detection of the astrophysical GW background would become challenging in the presence of glitch noise and the correlated magnetic noise from Schumann resonances. Further study of their effects will be extremely important for applying our method to real data. We leave it as future work with the expectation that deep learning has a high potential to distinguish such troublesome noises from the signal.

Acknowledgements.
This work is supported by JSPS KAKENHI Grant no. JP20H01899. TSY is supported by JSPS KAKENHI Grant no. JP17H06358 (and also JP17H06357), A01: Testing gravity theories using gravitational waves, as a part of the innovative research area, “Gravitational wave physics and astronomy: Genesis”. SK acknowledges support from the research project PGC2018-094773-B-C32, and the Spanish Research Agency through the Grant IFT Centro de Excelencia Severo Ochoa No CEX2020-001007-S, funded by MCIN/AEI/10.13039/501100011033. SK is supported by the Spanish Atracción de Talento contract no. 2019-T1/TIC-13177 granted by Comunidad de Madrid, the I+D grant PID2020-118159GA-C42 of the Spanish Ministry of Science and Innovation, the i-LINK 2021 grant LINKA20416 of CSIC, and JSPS KAKENHI Grant no. JP20H05853. This work was also supported in part by the Ministry of Science and Technology (MOST) of Taiwan, R.O.C., under Grants No. MOST 110-2123-M-007-002 and 110-2112-M-032 -007- (G.C.L.) The authors thank Joseph D. Romano for very useful comments.

Appendix A Review of non-Gaussian statistic

Here, we review the properties of the non-Gaussian statistic (7). DF03 compared the non-Gaussian statistic with the standard cross-correlation statistic that is defined by

{\Lambda}_{\textrm{CC}}(h)\coloneqq\frac{\hat{\alpha}^{2}}{\bar{\sigma}_{1}\bar{\sigma}_{2}}\,, (37)

where

\hat{\alpha}^{2}\coloneqq\bar{\alpha}^{2}\Theta(\bar{\alpha}^{2})\,,\qquad\bar{\alpha}^{2}\coloneqq\frac{1}{N}\sum_{k=1}^{N}h_{1}^{k}h_{2}^{k}\,, (38)

and $\Theta(x)$ is the Heaviside step function defined by

\Theta(x)=\begin{cases}1&\text{if}\ x\geq 0\\ 0&\text{if}\ x<0\,.\end{cases} (39)

This is obtained by assuming the Gaussian signal model, i.e., $\xi=1$ in Eq. (5).
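For reference, the cross-correlation statistic of Eqs. (37)-(39) is straightforward to evaluate; the following is a minimal NumPy sketch with our own function name.

```python
import numpy as np

def cross_correlation_statistic(h1, h2):
    """Standard cross-correlation statistic, Eq. (37)."""
    alpha_bar2 = np.mean(h1 * h2)                       # Eq. (38)
    alpha_hat2 = alpha_bar2 if alpha_bar2 > 0 else 0.0  # Heaviside cut, Eqs. (38) and (39)
    sbar1 = np.sqrt(np.mean(h1**2))                     # Eq. (9)
    sbar2 = np.sqrt(np.mean(h2**2))
    return alpha_hat2 / (sbar1 * sbar2)
```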

Figure 10: Minimum detectable SNR as a function of the duty cycle $\xi$. Both the false alarm probability and the false dismissal probability are set to 0.1. Error bars are obtained from four independent runs. The time-series data has a length of $N=10^{4}$, and the detector noise variances are $\sigma_{1}^{2}=\sigma_{2}^{2}=1$. Note that $\rho_{\mathrm{90\%}}$ represents the minimum detectable SNR with $90\%$ detection probability and is different from $\Omega_{\mathrm{detectable}}$ in Fig. 1 of DF03.

Here we aim to reproduce the results of DF03 and demonstrate the performance of Eq. (7) by simulating time-series strain data with length $N=10^{4}$. The maximization of ${\lambda}^{\textrm{NG}}_{\textrm{ML}}$ in Eq. (7) requires exploring the parameter space. Here, following DF03, we substitute the injected values into ${\lambda}^{\textrm{NG}}_{\textrm{ML}}$ instead of maximizing over the model parameters. Note that we perform the parameter search when simulating the non-Gaussian statistic for the comparison in the main part of the paper, but the general behavior does not change.

Figure 10 compares the minimum detectable SNRs of the standard cross-correlation statistic and the non-Gaussian statistic. For $\xi>0.1$, their performances are comparable. This can be interpreted as the non-Gaussianity of the signal not being very strong, so that taking the non-Gaussianity into account does not give a significant advantage. On the other hand, for $\xi<0.1$, the non-Gaussian statistic outperforms the cross-correlation statistic. This is reasonable because the non-Gaussian statistic is derived from the same signal model as the one we use for simulating the strain data.

Figure 11: The color map shows the logarithm of ${\lambda}^{\textrm{NG}}_{\textrm{ML}}(\alpha^{2},\xi)$. The data length is $N=10^{4}$, and we set the signal parameters to $\xi=0.2$ and $\rho=40$ (indicated by the blue square). The variances of the detector noise are $\sigma_{1}^{2}=\sigma_{2}^{2}=1$. The red circle indicates the parameter values of the maximum log-likelihood. Black dashed lines indicate constant SNR for $\rho=60,40,20$ from top to bottom. It is clearly seen that the likelihood has a degeneracy along the parameter combination that gives the same SNR value.

Next, parameter estimation is tested. Figure 11 shows an example of the distribution of the logarithm of ${\lambda}^{\textrm{NG}}_{\textrm{ML}}$ in the $\xi$-$\alpha^{2}$ plane. We injected a stochastic signal with $\xi=0.2$ and $\rho=40$. It is clearly seen that the duty cycle $\xi$ and the amplitude variance $\alpha^{2}$ of each burst are degenerate. We also draw three dashed lines corresponding to different SNRs, $\rho=20$, 40, and 60. We clearly see that a strong degeneracy exists along the lines of constant SNR. In other words, the non-Gaussian statistic is sensitive to the difference in SNR.

References