
Detecting Gravitational-waves from Extreme Mass Ratio Inspirals using Convolutional Neural Networks

Xue-Ting Zhang MOE Key Laboratory of TianQin Mission, TianQin Research Center for Gravitational Physics & School of Physics and Astronomy, Frontiers Science Center for TianQin, Gravitational Wave Research Center of CNSA, Sun Yat-sen University (Zhuhai Campus), Zhuhai 519082, China    Chris Messenger SUPA, School of Physics and Astronomy, University of Glasgow, Glasgow G12 8QQ, United Kingdom    Natalia Korsakova ARTEMIS, Observatoire de la Côte d’Azur, Boulevard de l’Observatoire, 06304 Nice, France    Man Leong Chan Department of Applied Physics, Fukuoka University, Nanakuma 8-19-1, Fukuoka 814-0180, Japan    Yi-Ming Hu [email protected] MOE Key Laboratory of TianQin Mission, TianQin Research Center for Gravitational Physics & School of Physics and Astronomy, Frontiers Science Center for TianQin, Gravitational Wave Research Center of CNSA, Sun Yat-sen University (Zhuhai Campus), Zhuhai 519082, China    Jing-dong Zhang MOE Key Laboratory of TianQin Mission, TianQin Research Center for Gravitational Physics & School of Physics and Astronomy, Frontiers Science Center for TianQin, Gravitational Wave Research Center of CNSA, Sun Yat-sen University (Zhuhai Campus), Zhuhai 519082, China
Abstract

Extreme mass ratio inspirals (EMRIs) are among the most interesting gravitational wave (GW) sources for space-borne GW detectors. However, successful GW data analysis remains challenging due to many issues, ranging from the difficulty of modeling accurate waveforms to the impractically large template bank required by the traditional matched-filtering search method. In this work, we introduce a proof-of-principle approach for EMRI detection based on convolutional neural networks (CNNs). We demonstrate its performance on simulated EMRI signals buried in Gaussian noise. We show that, over a wide range of physical parameters, the network is effective for EMRI systems with a signal-to-noise ratio larger than 50, and that its performance depends most strongly on the signal-to-noise ratio. The method also shows good generalization across different waveform models. Our study reveals the potential applicability of machine learning techniques like CNNs to more realistic EMRI data analysis.

EMRI, gravitational-wave, detection, convolutional neural network, CNN

I Introduction

The successful detection of Gravitational Waves has opened a new window for understanding the Universe Abbott et al. (2019, 2020, 2021a, 2021b). However, a wide band of GW frequencies is still inaccessible for observation. For example, systems like Double White Dwarfs Huang et al. (2020); Korol et al. (2017), Massive Binary Black Holes Wang et al. (2019); Klein et al. (2016), stellar-mass Binary Black Holes Liu et al. (2020); Sesana (2016), Extreme Mass Ratio Inspirals Babak et al. (2017a); Fan et al. (2020), and the stochastic gravitational-wave background Liang et al. (2021); Chen et al. (2019) are expected to emit GW signals in the mHz frequency band. The future detection of these signals can help us better understand the nature of gravity (e.g. Shi et al., 2019; Zi et al., 2021) and the Universe (e.g. Zhu et al., 2021a, b). In order to properly understand the physics and astronomy behind GW events, sophisticated and efficient data analysis algorithms play a significant role.

Space-borne GW missions are capable of detecting GW signals in the mHz band Luo et al. (2016); Mei et al. (2020). Of all expected GW sources, EMRIs are among the most interesting. A typical EMRI system is composed of a stellar-mass Compact Object (CO) and a Massive Black Hole (MBH) ($10^{4}$–$10^{7}\,M_{\odot}$). Such systems are believed to form in the centers of galaxies. The detection of EMRIs, among other things, can help us better understand the growth mechanism of MBHs, and also put more stringent constraints on their population properties, like the mass function and the distribution of the surrounding COs Baker et al. (2019); Amaro-Seoane et al. (2007). In addition, an EMRI signal can last for thousands to millions of cycles in the mHz frequency band. This feature makes EMRI systems ideal laboratories for studying gravity in the strong-field regime Zi et al. (2021); Moore et al. (2017); Yunes et al. (2012); Canizares et al. (2012); Barack and Cutler (2007).

Despite their great scientific potential, there remain great challenges for the end-to-end data analysis of EMRI signals. For example, the ideal EMRI waveform should consider the effects of the self-force. Some progress on self-force waveforms has been made recently, such as Lynch et al. (2021) for an eccentric orbit to second order, and Isoyama et al. (2021) for a generic orbit to the adiabatic order. However, a fast and accurate EMRI waveform for the most general case is yet to be implemented. Most of the widely used waveform models are expected to quickly dephase from the physical waveform, which restricts their applicability in EMRI data analysis. The computational cost of EMRI waveforms is also usually very high, since it involves orbit integration over a period of months to years. It has been estimated that in order to implement a classical signal search with matched filtering, a template bank containing at least $10^{40}$ templates would be needed Gair et al. (2004).

Some attempts have been made to explore efficient data analysis algorithms for EMRIs. Both template-based and template-free methods have been proposed to detect EMRI signals. The former include the semi-coherent method Gair et al. (2004) and the $\mathcal{F}$-statistic algorithm Wang et al. (2012, 2015), and the latter include the time-frequency algorithm Gair and Wen (2005); Wen and Gair (2005); Gair and Jones (2007); Gair et al. (2008a). On the parameter estimation side, methods like Metropolis-Hastings search Gair et al. (2008b); Babak et al. (2009), parallel tempering MCMC Ali et al. (2012), and Gaussian processes Chua (2016); Chua et al. (2020) have been implemented.

In recent years, machine learning has gained huge popularity in various complex tasks, such as computer vision Krizhevsky et al. (2012); Simonyan and Zisserman (2014); Szegedy et al. (2015); He et al. (2015); Chollet (2016); Szegedy et al. (2016), speech recognition Deng et al. (2013), and natural language processing Collobert et al. (2011), and some methods have been successfully applied to GW astronomy George and Huerta (2018a, b); Huerta et al. (2019); Xia et al. (2020); Gabbard et al. (2018); Schäfer et al. (2020); Krastev et al. (2021); Chan et al. (2020); Bayley et al. (2019, 2020); Gabbard et al. (2019); Williams et al. (2021); George et al. (2017); Shen et al. (2019). Deep learning is a sub-field of machine learning that is based on learning several levels of data representations Heaton (2017), corresponding to a hierarchy of features, factors, or concepts, where higher-level concepts are defined from lower-level ones. The Convolutional Neural Network (CNN) is a deep learning algorithm that has been applied in many fields, such as image classification, object detection, and part-of-speech tagging, because it can automatically extract data features and easily process various kinds of high-dimensional data. In GW data analysis, CNNs have been used to detect GW signals, such as those from Binary Black Hole (BBH) mergers George and Huerta (2018a, b); Huerta et al. (2019); Xia et al. (2020); Gabbard et al. (2018), Binary Neutron Star (BNS) mergers Schäfer et al. (2020); Krastev et al. (2021), supernova explosions Chan et al. (2020), and continuous gravitational waves Bayley et al. (2019, 2020). Once properly trained, deep learning methods also demand a low computational cost, which improves efficiency in an end-to-end detection system. In this work, we use TianQin, a space-borne GW detector, as a reference to explore EMRI signal detection with a CNN.

This paper is organized as follows. Section II introduces the characteristics of EMRI and its waveform. Section III provides the architecture of a CNN, and discusses how to prepare datasets. Section IV describes the search procedure. The results are presented in Section V. Conclusions and discussions are provided in Section VI.

II Basics of EMRI detection

II.1 Basic astronomy of EMRIs

Galactic nuclei are extreme environments for stellar dynamics Amaro-Seoane (2018). The stellar densities can be as high as $10^{6}\ M_{\odot}\ {\rm pc^{-3}}$ and the velocity dispersion can exceed $1000\ {\rm km\ s^{-1}}$. Within a galactic nucleus, a large number of COs, including stellar-mass Black Holes, White Dwarfs, and Neutron Stars, are expected to exist. They are considered to be within the “loss cone” if their orbits would lead to their capture by the central MBH, leading to the formation of EMRIs. COs outside the loss cone can refill it through dynamical processes like two-body relaxation Amaro-Seoane (2018).

In addition to this traditional channel, some new formation channels of EMRI systems have also been proposed. It has been conjectured that a CO can be captured by the accretion disk around the central MBH (i.e., the Active Galactic Nucleus (AGN) disk). Effects like winds, density wave generation, and dynamical friction can assist the inward migration of COs, which further boosts the formation of EMRIs Pan and Yang (2021); Pan et al. (2021).

Although EMRIs are speculated to exist, no direct observational evidence is available yet (cf. Arcodia et al. (2021)). Therefore, the population properties of EMRIs are currently derived from theoretical modeling. However, components within the theoretical modeling, like the mass and spin distribution of the central MBHs, the erosion of the stellar cusp, the M-$\sigma$ relation, etc., are uncertain, which leads to a very uncertain prediction for the EMRI population Babak et al. (2017b). Indeed, different assumptions on the EMRI population lead to orders-of-magnitude differences in the detection rate by TianQin Fan et al. (2020). Throughout this analysis, a model is adopted to simulate the EMRI population; to be exact, we adopt the model M12 from Babak et al. (2017b).

II.2 Waveform models of EMRIs

Theoretical modelling of EMRI waveforms is essential for the successful detection of GWs from these systems. To accurately compute the EMRI waveform, one has to consider the effects of the gravitational self-force, which leads to a deviation of the CO orbit from the test-particle geodesics of the Kerr metric of the MBH Barack (2009). Unfortunately, waveform simulations with numerical relativity cannot be used for EMRIs due to the prohibitively high computational costs caused by the very high mass ratio of such systems. Although some progress has been made in adding the self-force into waveform generation Barack (2009); Van De Meent (2018); Miller and Pound (2021); Hughes et al. (2021); McCart et al. (2021); Lynch et al. (2021); Isoyama et al. (2021), a fully realistic solution for EMRIs is still yet to be achieved. Nevertheless, there are methods that can generate EMRI waveforms. For example, the Teukolsky-based waveform Hughes et al. (2005) and the Numerical Kludge (NK) waveform Babak et al. (2007) solve the Teukolsky equation with geodesics of the Kerr metric. Such methods suffer from low efficiency, while the Analytic Kludge (AK) method, which adopts the post-Newtonian (PN) formula, can achieve waveform generation in a much faster way Barack and Cutler (2004). The Augmented Analytic Kludge (AAK) waveform Chua and Gair (2015a) can reach an accuracy similar to NK with the generating speed of AK. For our study, since a large number of simulated waveforms are needed for the training, testing, and validation of our CNN method, we focus on the analytic family of waveform models and concentrate on both the AK and AAK waveforms. We remark that during the writing of this manuscript, a new and fast waveform has been implemented Katz et al. (2021), which could be used to improve efficiency in the future, but we do not expect any significant difference in our findings. Besides, Katz et al. (2021) adopts the Schwarzschild Black Hole (BH) assumption, while we would also like to study the impact of the MBH spin.

The AK method first calculates the orbit of the CO through post-Newtonian equations, considering the radiation back-reaction and the Lense-Thirring effect. The overall evolution of the CO orbit is governed by three fundamental frequencies, namely the orbital angular frequency $f_{\rm o}$, the periapsis precession frequency $f_{\rm p}$, and the Lense-Thirring precession frequency $f_{\rm LT}$. The waveform is then calculated in the quadrupolar approximation. Notice that this construction is not self-consistent, hence the name “kludge”. However, this choice of formulation allows a fast way of producing waveforms at the cost of accuracy.

Under the transverse-traceless gauge, the two polarizations of the AK waveform are defined via an $n$-harmonic expansion, which is a function of the physical parameters (the details of the parameters adopted are presented in Table 2):

\begin{split}h_{+}=&\sum_{n}-\left[1+(\hat{L}\cdot\hat{n})^{2}\right]\left[a_{n}\cos 2\gamma-b_{n}\sin 2\gamma\right]\\&+\left[1-(\hat{L}\cdot\hat{n})^{2}\right]c_{n},\\h_{\times}=&\sum_{n}2(\hat{L}\cdot\hat{n})\left[b_{n}\cos 2\gamma+a_{n}\sin 2\gamma\right],\end{split}

where $\hat{L}$ is the direction of the orbital angular momentum, $\hat{n}$ is the direction to the source, and $\gamma$ is an azimuthal angle measuring the direction of the pericenter. The coefficients ($a_{n}$, $b_{n}$, $c_{n}$), with $n$ ranging from 1 to 5, are determined by the eccentricity and the mean anomaly of the orbital phase Barack and Cutler (2004). The orbital phase is determined by the three fundamental frequencies $f_{\rm o}$, $f_{\rm p}$, and $f_{\rm LT}$. More details on waveform generation can be found in Barack and Cutler (2004); Fan et al. (2020). Notice that since the self-force cannot be properly incorporated by the AK waveform, the spin parameter of the CO is not included.
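As a hedged illustration of how the harmonic coefficients can be evaluated, the sketch below follows the Peters–Mathews-type expressions given in Barack and Cutler (2004), using standard Bessel functions of the first kind. The overall amplitude prefactor `A`, which in the full formula depends on the masses, orbital frequency, and distance, is left here as a free parameter; this is a simplified sketch, not a production waveform code.

```python
import numpy as np
from scipy.special import jv  # Bessel function of the first kind J_n

def ak_harmonic_coeffs(n, e, Phi, A=1.0):
    """Sketch of the (a_n, b_n, c_n) harmonic coefficients as functions of
    eccentricity e and mean anomaly Phi, following Barack & Cutler (2004).
    `A` stands in for the physical amplitude prefactor (assumption)."""
    a_n = -n * A * (jv(n - 2, n * e) - 2 * e * jv(n - 1, n * e)
                    + (2.0 / n) * jv(n, n * e)
                    + 2 * e * jv(n + 1, n * e)
                    - jv(n + 2, n * e)) * np.cos(n * Phi)
    b_n = -n * A * np.sqrt(1 - e**2) * (jv(n - 2, n * e)
                                        - 2 * jv(n, n * e)
                                        + jv(n + 2, n * e)) * np.sin(n * Phi)
    c_n = 2 * A * jv(n, n * e) * np.cos(n * Phi)
    return a_n, b_n, c_n
```

In the circular limit ($e=0$) only the $n=2$ harmonic survives, which provides a quick sanity check of the expressions.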

The AK waveform quickly deviates from physical systems due to its intrinsic inconsistency. Under the AK model, instead of using the physical parameters, one can adopt effective parameters to evolve the orbit and produce a waveform similar to that of the NK model, which is more accurate as it combines PN orbital evolution with Kerr geodesics. Chua and Gair implemented the AAK waveform by establishing the link between the physical and effective parameters. To obtain this link, one evolves the NK waveform for a short period and determines the mapping that recovers the correct fundamental frequencies $f_{\rm o}$, $f_{\rm p}$, and $f_{\rm LT}$ Chua and Gair (2015b). In this way, the AAK waveform achieves both the accuracy of NK and the speed of AK.

II.3 The TianQin mission

For a given gravitational wave detector, one can use the antenna pattern to describe its response to a source at a given location, polarization, and frequency Cutler (1998); Cornish and Larson (2001); Cornish and Rubbo (2003). A typical EMRI signal lasts for a long time, so viewed in the detector frame, a given source changes in apparent location as well as in polarisation. One needs to know the orbit of the spacecraft to derive the overall evolution of the signal as seen by the detector. In this work, we focus our attention on the proposed TianQin mission Luo et al. (2016); Zhang et al. (2020) for the following analysis.

The TianQin mission is designed as a constellation of three satellites, which accommodate test masses in free fall and use drag-free control to follow them. The satellites share a geocentric orbit and form a regular triangle. They are connected by laser links and can form Michelson interferometers to detect GWs. The geocentric nature of the TianQin orbit leads it to adopt a “3-month on, 3-month off” observation scheme: the detector operates continuously for three months, followed by a three-month break. Each pair of neighboring laser arms can form an interferometer at the vertex satellite, whose noise can be described by the power spectral density (PSD):

\begin{split}S_{n}(f)=&\frac{1}{L_{\rm arm}^{2}}\left[\frac{4S_{a}}{(2\pi f)^{4}}\left(1+\frac{10^{-4}\,{\rm Hz}}{f}\right)+S_{x}\right]\\&\times\left[1+0.6\left(\frac{f}{f_{*}}\right)^{2}\right],\end{split} (1)

where $f$ is the frequency, $S_{a}=1\times 10^{-30}\ {\rm m^{2}\,s^{-4}\,Hz^{-1}}$ describes the acceleration noise, $S_{x}=1\times 10^{-24}\ {\rm m^{2}\,Hz^{-1}}$ describes the positional noise, and $f_{*}=c/(2\pi L_{\rm arm})$ is the transfer frequency, with $c$ the speed of light and $L_{\rm arm}=\sqrt{3}\times 10^{8}\ {\rm m}$ the arm length.
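Equation 1 translates directly into code. The sketch below uses only the numerical values quoted above; the function name is ours.

```python
import numpy as np

# TianQin noise parameters quoted in the text (Eq. 1)
L_ARM = np.sqrt(3) * 1e8          # arm length [m]
S_A = 1e-30                        # acceleration noise [m^2 s^-4 / Hz]
S_X = 1e-24                        # positional noise [m^2 / Hz]
C = 299792458.0                    # speed of light [m/s]
F_STAR = C / (2 * np.pi * L_ARM)   # transfer frequency [Hz], ~0.28 Hz

def tianqin_psd(f):
    """One-sided noise PSD S_n(f) of a TianQin Michelson channel (Eq. 1)."""
    acc = 4 * S_A / (2 * np.pi * f) ** 4 * (1 + 1e-4 / f)
    return (acc + S_X) / L_ARM**2 * (1 + 0.6 * (f / F_STAR) ** 2)
```

The same function can be evaluated on a frequency array, which is how it is used later for whitening and SNR computation.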

At low frequencies, where $f<f_{*}$, the three interferometers can be combined into two orthogonal channels, corresponding to the two GW polarisations, so the recorded GW signal can be expressed as:

h_{I,II}(t)=\frac{\sqrt{3}}{2}\Big{(}h_{+}(t)F_{I,II}^{+}(\theta_{S},\phi_{S},\psi_{S})+h_{\times}(t)F_{I,II}^{\times}(\theta_{S},\phi_{S},\psi_{S})\Big{)}, (2)

where $F_{I,II}^{+}$ and $F_{I,II}^{\times}$ are the antenna patterns for the plus and cross polarisations, and

\begin{split}F_{I}^{+}(\theta_{S},\phi_{S},\psi_{S})=&\frac{1}{2}(1+\cos^{2}\theta_{S})(\cos 2\phi_{S}\cos 2\psi_{S}\\&-\cos\theta_{S}\sin 2\phi_{S}\sin 2\psi_{S}),\\F_{I}^{\times}(\theta_{S},\phi_{S},\psi_{S})=&\frac{1}{2}(1+\cos^{2}\theta_{S})(\cos 2\phi_{S}\sin 2\psi_{S}\\&+\cos\theta_{S}\sin 2\phi_{S}\cos 2\psi_{S}).\end{split} (3)

Here $\theta_{S}$ and $\phi_{S}$ give the sky location of the source, and $\psi_{S}$ is the polarization angle. All quantities are defined in the detector frame, which makes them time-dependent Liu et al. (2020).

The two effective channels differ only by $\pi/4$ in polarisation angle, which means

\begin{split}F_{II}^{+}(\theta_{S},\phi_{S},\psi_{S})&=F_{I}^{+}(\theta_{S},\phi_{S},\psi_{S}-\pi/4),\\F_{II}^{\times}(\theta_{S},\phi_{S},\psi_{S})&=F_{I}^{\times}(\theta_{S},\phi_{S},\psi_{S}-\pi/4).\end{split} (4)
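The antenna patterns of Equations 3 and 4 are straightforward to code. The sketch below follows the equations exactly as written in the text (with the $\frac{1}{2}(1+\cos^{2}\theta_{S})$ factor multiplying both terms); the function names are ours.

```python
import numpy as np

def antenna_pattern_I(theta_s, phi_s, psi_s):
    """Channel-I antenna patterns (F_I^+, F_I^x) of Eq. (3),
    with all angles defined in the (time-dependent) detector frame."""
    amp = 0.5 * (1 + np.cos(theta_s) ** 2)
    f_plus = amp * (np.cos(2 * phi_s) * np.cos(2 * psi_s)
                    - np.cos(theta_s) * np.sin(2 * phi_s) * np.sin(2 * psi_s))
    f_cross = amp * (np.cos(2 * phi_s) * np.sin(2 * psi_s)
                     + np.cos(theta_s) * np.sin(2 * phi_s) * np.cos(2 * psi_s))
    return f_plus, f_cross

def antenna_pattern_II(theta_s, phi_s, psi_s):
    """Channel II via the pi/4 polarisation-angle shift of Eq. (4)."""
    return antenna_pattern_I(theta_s, phi_s, psi_s - np.pi / 4)
```

With this form, the $\pi/4$ shift maps $\cos 2\psi\to\sin 2\psi$ and $\sin 2\psi\to-\cos 2\psi$, so $F_{II}^{+}=F_{I}^{\times}$ and $F_{II}^{\times}=-F_{I}^{+}$, a convenient consistency check.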

Finally, a correction term is added to the phase to account for the Doppler effect induced by the orbital motion of the detector Liu et al. (2020):

\Phi_{\rm Doppler}=2\pi f(t)\,R\sin\bar{\theta}_{S}\cos\left(\bar{\Phi}(t)-\bar{\phi}_{S}\right), (5)

where $f$ is the GW frequency, $R=1\,{\rm AU}$, and barred symbols denote parameters in the Solar System Barycentre frame; $\bar{\Phi}(t)=\bar{\phi}_{0}+2\pi t/T$, with $T=1\,{\rm yr}$ being the Earth's orbital period around the Sun and $\bar{\phi}_{0}$ the initial location of TianQin at time $t=0$. Notice that at higher frequencies, where $f\geq f_{*}$, the long-wavelength approximation breaks down and the two channels are no longer uncorrelated, so an extra correction term is needed in Equation 2 Cornish and Larson (2001); Cornish and Rubbo (2003). Fortunately, the EMRI signals we are interested in lie mostly within the low-frequency range, because the highest orbital frequency of the CO is set by the last stable orbit (LSO), where $\nu_{\rm LSO}=1/(2\pi\,6^{3/2}M)$ in geometric units. For a typical MBH of $10^{5}\,M_{\odot}$, $\nu_{\rm LSO}=0.04\ {\rm Hz}<f_{*}=0.28$ Hz. Therefore, we stick with the long-wavelength approximation throughout this work.
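Equation 5 can be sketched as follows; note that with $c=1$ units implied by the phase formula, $R=1\,{\rm AU}$ must be expressed in light-seconds so that $2\pi f R$ is dimensionless. The function name and the default $\bar{\phi}_{0}=0$ are our assumptions.

```python
import numpy as np

AU_SEC = 499.004783836  # 1 AU expressed in light-seconds

def doppler_phase(f, t, theta_bar, phi_bar, phi0_bar=0.0):
    """Doppler phase correction of Eq. (5).

    Barred angles are source coordinates in the Solar System Barycentre
    frame; Phi_bar(t) tracks the yearly orbital motion around the Sun."""
    T = 365.25 * 86400.0                      # one year [s]
    Phi_bar = phi0_bar + 2 * np.pi * t / T    # orbital phase of the detector
    return 2 * np.pi * f * AU_SEC * np.sin(theta_bar) * np.cos(Phi_bar - phi_bar)
```

At $t=0$ with the source in the ecliptic direction of the detector, the correction reaches its maximum amplitude $2\pi f R$.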

Figure 1: An example EMRI signal compared with the sensitivity curve of TianQin. A total observation time of 3 months is assumed.

As an illustration, we show in Figure 1 an example EMRI signal (on both channels) against the TianQin sensitivity curve. This EMRI system consists of a $10\,M_{\odot}$ BH and an MBH with a mass of $4.5\times 10^{6}\,M_{\odot}$ and a spin of 0.97. The Signal-to-Noise Ratio (SNR) of this signal equals 50.

We define the overall SNR as the root-sum-square of the inner products of the waveform with itself across the two channels of TianQin,

\begin{split}\rho&=\sqrt{\sum_{i={\rm I,II}}\langle h_{i}|h_{i}\rangle}\\&=\sqrt{4\,\Re\sum_{i={\rm I,II}}\int^{\infty}_{0}\frac{\tilde{h}_{i}(f)\tilde{h}_{i}^{*}(f)}{S_{n}(f)}\,{\rm d}f}\\&\approx\sqrt{4\sum_{i={\rm I,II}}\int^{f_{\rm high}}_{f_{\rm low}}\left(\frac{\tilde{h}_{i}(f)}{\sqrt{S_{n}(f)}}\right)\left(\frac{\tilde{h}_{i}(f)}{\sqrt{S_{n}(f)}}\right)^{*}{\rm d}f}.\end{split} (6)

Here we use $\langle g|h\rangle$ to represent the inner product between two frequency-domain waveforms $\tilde{g}(f)$ and $\tilde{h}(f)$. Notice that in the last line, the squared SNR in a channel is simply the integral of the squared modulus of the whitened signal over the frequency range between $f_{\rm low}=10^{-4}\ {\rm Hz}$ and $f_{\rm high}=1\ {\rm Hz}$, which brackets the sensitive frequency band of TianQin.
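The last line of Equation 6 translates into a discrete sum over FFT bins. The sketch below assumes whitened time series ($\tilde{h}_{w}(f)=\tilde{h}(f)/\sqrt{S_{n}(f)}$) as input and uses the continuous-FT convention $\tilde{h}(f)\approx\Delta t\,{\rm FFT}(h)$; the function name and conventions are our choices.

```python
import numpy as np

def snr_from_whitened(h_channels, dt):
    """Approximate the SNR of Eq. (6) from whitened time series.

    `h_channels`: list of whitened signals (one per TianQin channel).
    The per-channel SNR^2 is 4 * sum |h_w(f)|^2 df over positive
    frequencies, summed over channels under the square root."""
    rho_sq = 0.0
    for h in h_channels:
        n = len(h)
        hf = np.fft.rfft(h) * dt        # discrete FFT -> continuous FT
        df = 1.0 / (n * dt)             # frequency resolution
        rho_sq += 4.0 * np.sum(np.abs(hf) ** 2) * df
    return np.sqrt(rho_sq)
```

For a whitened sinusoid of amplitude $A$ and duration $T$ on an exact frequency bin, this gives $\rho\approx A\sqrt{T}$, a useful sanity check of the normalization.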

III Convolutional neural networks for detection

III.1 Data preparation

For the detection of a signal in noise, we use CNNs. In order to train and verify a CNN properly, one needs to divide the data into three groups: training data, validation data, and testing data. In each of these datasets, half of the samples are signal-plus-noise and the rest are noise-only. Using the training data, the CNN learns how to classify samples. Validation data are incorporated in the training process to verify that the CNN is learning correctly. Finally, testing data assess the efficiency of the trained CNN.

In our case, we aim to classify the data into two categories: with signal, or without signal. One can express the data $d$ as the sum of random Gaussian noise $n$ and the GW signal $h$ (in the case it is present in the data):

d(t)=\begin{cases}h(t)+n(t),&\text{if a signal is present,}\\n(t),&\text{if there is no signal.}\end{cases} (7)

The raw output of TianQin consists of three correlated time-domain data channels. After some preprocessing, they can be reduced to two channels with uncorrelated noise, plus a channel in which the response to astrophysical signals is highly suppressed. This process is implemented through Time Delay Interferometry (TDI) Estabrook et al. (2000); Tinto and Armstrong (1999). For simplicity, we do not implement TDI, but rather treat TianQin as two orthogonal Michelson interferometers. In Figure 2, we illustrate a data sample that contains an EMRI signal, using the parameters of Figure 1.

Figure 2: An example of whitened data in channel I in comparison with the signal $h_{I}$ alone. For this event, the SNR is set to 50. We draw the reader's attention to the difference in scale between the noise and the signal.

We prepare the generation of noise and signal separately. Due to the “3-month on, 3-month off” observation scheme of TianQin, the longest chunk of continuous data lasts only three months. To avoid the complexity introduced by the data gap issue, we focus our analysis on a data length of three months (or 7864320 seconds). Given the long-wavelength approximation, it is also sufficient to adopt a sample rate of $1/30$ Hz, which leads to a data size of 262144 samples. The parameters are summarised in Table 1.

Table 1: The settings of the data simulation.
Symbol — Physical Meaning — Value
$T$ — observation time — 7864320 seconds
$f_{s}$ — sample rate — 1/30 Hz
$N_{t}$ — data size — 262144 samples
Table 2: Summary of the physical parameters and their meaning under the analytic kludge model.
Symbol — Physical Meaning — Distribution
$M$ — MBH mass — uniform in $\ln M$ over [$10^{4}\,M_{\odot}$, $10^{7}\,M_{\odot}$]
$\mu$ — CO mass — fixed to $10\,M_{\odot}$
$a$ — MBH spin magnitude, defined as $|\vec{S}|/M^{2}$, with $\vec{S}$ being the MBH spin angular momentum — uniform [0, 1]
$e_{\rm lso}$ — orbital eccentricity at plunge — uniform [0, 0.2]
$\Phi_{0}$ — mean anomaly at plunge — uniform [0, $2\pi$]
$\alpha_{0}$ — azimuthal angle of the orbital angular momentum $\vec{L}$ at plunge — uniform [0, $2\pi$]
$\lambda$ — angle between $\vec{L}$ and $\vec{S}$ — $\cos\lambda$ uniform [$-1$, 1]
$\gamma_{0}$ — angle between $\vec{L}\times\vec{S}$ and the pericenter at plunge — uniform [0, $2\pi$]
$\nu_{\rm lso}$ — orbital frequency at the last stable orbit — depends on $e_{\rm lso}$
$\theta_{S}$ — polar angle of the source direction — $\cos\theta_{S}$ uniform [$-1$, 1]
$\phi_{S}$ — azimuthal angle of the source direction — uniform [0, $2\pi$]
$\theta_{K}$ — polar angle of the MBH spin — $\cos\theta_{K}$ uniform [$-1$, 1]
$\phi_{K}$ — azimuthal angle of the MBH spin — uniform [0, $2\pi$]
$t_{c}$ — time at which the CO plunges into the MBH — fixed to 7864320 seconds
$D_{L}$ — luminosity distance to the source — irrelevant, as signals are rescaled to an SNR uniform in [50, 120]

The detector noise is assumed to be Gaussian and is generated by sampling from the one-sided PSD given by Equation 1. For the simulation of EMRI waveform signals, we use both the AK and AAK models. For different purposes that we detail in the following sections, we generate 7 groups of signals, whose properties are explained in Table 4; the shared default parameter distributions are listed in Table 2. The data are whitened in the frequency domain. When the data are converted back to the time domain, we apply a Tukey window with coefficient $\alpha=1/4$ to avoid spectral leakage.
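The whitening-plus-windowing step can be sketched as follows. The $\sqrt{2\Delta t}$ normalization is one common convention for producing unit-variance whitened noise and is our assumption; the Tukey window with $\alpha=1/4$ follows the text.

```python
import numpy as np
from scipy.signal.windows import tukey

def whiten(strain, psd_func, dt, alpha=0.25):
    """Whiten a time series in the frequency domain, then apply a
    Tukey window (alpha = 1/4 in the text) on the way back to the
    time domain to suppress spectral leakage.

    `psd_func` maps frequency [Hz] to the one-sided PSD (e.g. Eq. 1)."""
    n = len(strain)
    freqs = np.fft.rfftfreq(n, d=dt)
    hf = np.fft.rfft(strain)
    # Guard the f = 0 bin before evaluating the PSD (assumption).
    asd = np.sqrt(psd_func(np.maximum(freqs, freqs[1])))
    white = np.fft.irfft(hf / asd, n=n) * np.sqrt(2 * dt)
    return white * tukey(n, alpha)
```

With a flat unit PSD and no windowing ($\alpha=0$), whitening reduces to a pure rescaling, which is a convenient consistency check.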

Notice that Table 2 does not reflect an astrophysical model, and one might argue that an astrophysically motivated distribution of EMRI parameters would be more realistic. However, such a distribution is not necessarily the best choice for training the CNN, because astrophysically motivated models are subject to huge uncertainties; sticking with any specific model runs the risk of biasing the network.

The SNR of a signal is inversely proportional to its distance. When assessing the performance of a CNN, it is more convenient to compare signals with similar SNRs than with similar distances. Therefore, we do not fix the distribution of the distance $D_{L}$ in Table 2, but rescale the signals to SNRs drawn uniformly from [50, 120]. The choice of SNR range is determined by balancing hardware requirements among data storage, waveform generation speed, and I/O speed during training.

III.2 Mathematical model

A CNN is a specific realization of a Neural Network (NN), featuring the inclusion of convolution operations Ng et al. (2013); Heaton (2017). In an NN, each “neuron” represents an affine transformation followed by a non-linear mapping, where the non-linearity is achieved by an activation function. A typical realization of a CNN includes convolution layers, pooling layers, and fully-connected layers (or dense layers) LeCun et al. (2015), and can be treated as a classifier. Such a CNN, composed of multiple neurons, can be understood as a non-linear function $f_{w}(\cdot)$ that maps the input space of the data to the output space:

y_{i}=f_{w}(d_{i}), (8)

where $y_{i}$ is the output probability for the $i$-th data sample $d_{i}$ to belong to a certain class, and $w$ are the weights of the CNN.

Based on the training data, a CNN learns features of the input by updating the weights associated with different neurons so as to minimize the difference between the output predictions and the training labels. Since our training set is large, we cannot optimize the weights with all the data at once and instead adopt mini-batching. The training process is divided into multiple epochs; in each epoch, the weights are updated by the optimizer algorithm. When the training process converges, the CNN can classify new data into the different categories.

To achieve a reliable classification ability, one first needs to define the loss function $L(y_{\rm true},y_{\rm pred})$, which quantifies the difference between the CNN output and the labels during training. This can be done by adopting the binary cross-entropy function:

L=\sum_{j=1}^{N}\left[-y^{j}_{\rm true}\log(y^{j}_{\rm pred})-(1-y^{j}_{\rm true})\log(1-y^{j}_{\rm pred})\right], (9)

where $N$ is the total number of training samples and $j$ indexes the data instances. Noise-only samples are labeled $y^{j}_{\rm true}=0$, and samples containing signals are labeled $y^{j}_{\rm true}=1$. The CNN output $y^{j}_{\rm pred}$ represents the assigned probability for each category. The loss function is minimized when the predicted probability for the labeled class is highest and matches the training labels. In the training process, the weights are updated through the Nadam algorithm, which minimizes the loss by adaptive gradient descent with Nesterov momentum Dozat (2016). During training, the validation data are used to test the trained CNN after each epoch, and those results serve to evaluate the performance of the CNN and its robustness against overfitting.
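Equation 9 can be written down in a few lines of NumPy; the clipping constant `eps`, which keeps the logarithms finite when a prediction saturates at 0 or 1, is a standard numerical safeguard and our addition.

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Binary cross-entropy loss of Eq. (9), summed over all samples.

    y_true: array of labels (0 for noise-only, 1 for signal-plus-noise).
    y_pred: array of predicted probabilities for the signal class."""
    y_pred = np.clip(y_pred, eps, 1 - eps)  # keep log() finite
    return np.sum(-y_true * np.log(y_pred)
                  - (1 - y_true) * np.log(1 - y_pred))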

III.3 The CNN architecture

In a CNN, different layers play different roles. In the convolution layer, features are extracted by applying convolutions between the input and the convolution kernels. After multiple convolution layers with various filters, a CNN can learn different features of the input data, from lower-level to higher-level. The purpose of the pooling layer is to reduce the size of the convolved features while retaining the key features, so that one can reduce the computational complexity and avoid over-fitting. In the fully-connected layer, neurons are connected to all activation values of the previous layer, with the final output indicating the predicted class probabilities. In this work, the rectified linear unit (ReLU) is selected as the activation function for all but the output layer, for which the softmax function is adopted. The ReLU function conveniently and efficiently avoids the vanishing-gradient problem when training a CNN Krizhevsky et al. (2012), and the softmax function at the last layer normalizes the output to the range (0,1), so that the output appropriately represents the probability that the input sample belongs to each of the classes. The performance of a CNN depends on many hyper-parameters, such as the network depth, the number of convolutional layers, the number of kernels, and their respective sizes.
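The roles of the individual layers described above can be illustrated with minimal NumPy stand-ins. These are pedagogical sketches, not the actual Keras layers used in this work; in particular, the "convolution" of a CNN is implemented as a cross-correlation, which is why the kernel is flipped below.

```python
import numpy as np

def relu(x):
    """Rectified linear unit, used in all hidden layers."""
    return np.maximum(x, 0.0)

def softmax(x):
    """Output-layer softmax: maps raw scores to class probabilities."""
    e = np.exp(x - np.max(x))  # subtract max for numerical stability
    return e / e.sum()

def max_pool1d(x, size):
    """Max pooling: keep the maximum of each non-overlapping patch."""
    n = len(x) // size
    return x[:n * size].reshape(n, size).max(axis=1)

def conv1d_valid(x, kernel):
    """'Valid' 1-D cross-correlation, the core convolution-layer operation."""
    return np.convolve(x, kernel[::-1], mode="valid")
```

Stacking `conv1d_valid`, `relu`, and `max_pool1d`, followed by a dense layer and `softmax`, reproduces the structure of the classifier in miniature.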

IV Search procedure

In this section, we explain the details of the CNN application to the EMRI signal search. (In our implementation of the CNN, we use Keras 2.2.4 and Tensorflow 1.1.18. We run our software on a Tesla V100 PCIe 16 GB GPU and an Intel Xeon Gold 6140 CPU @ 2.3 GHz, with 256 GB of memory.)

IV.1 Training phase

In order for the CNN to gain classification ability, a certain amount of input data is necessary for training, and more samples are needed to train the CNN to detect weaker signals. In practice, it turns out that the I/O of the training data becomes the bottleneck. Therefore, we are limited in the total data size, which in turn determines the lower SNR limit. After some trial and error, we converged on 50 as the SNR lower limit of our training samples. On the other hand, stronger signals can be detected more easily; we concentrate on signals with SNRs comparable to the lower limit, and therefore adopt an upper SNR limit of 120. A total of $2.5\times 10^{5}$ signal-plus-noise samples are used to train the final CNN.

With the data ready, we need to train the network to maximize its classification ability without over-fitting. In general, a more complicated network is expected to perform better on the training data. However, an overly complicated neural network runs the risk of over-fitting, that is, becoming so acquainted with the training data that its performance does not generalize to unfamiliar data. One therefore needs to check the consistency of the CNN performance on both the training data and validation data: over-fitting can be identified when the performance on the training data is significantly better than on the validation data. It can be relieved by adopting a less complicated neural network, adding dropout layers, or increasing the size of the training set.

Finally, we need to fix the training architecture. We started with a convolution kernel of size (1,5) as suggested by previous studies Gabbard et al. (2018); Bayley et al. (2020), and finally settled on a size of (1,34) through trial and error. We apply max-pooling in the pooling layers, which retains the maximum value of each patch of the input. The mini-batch size is set to 56, and we train the CNN for up to 300 epochs. For each mini-batch, 56 simulated datasets are randomly selected from the samples, half of which are signal-plus-noise samples and the other half noise-only samples. The wall clock time for each epoch of training is about 2 hours. Strategies like early stopping Yao (2007) are also employed to increase efficiency. In our experiments, the training process stops when either 300 epochs are reached or the accuracy on the validation data has not increased for 50 epochs. In total, it takes about 10.5 days to train the final CNN. The trained model can afterwards be used for testing. We summarise the architecture of the CNN in Table 3.
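The early-stopping rule described above can be sketched as follows; `run_epoch` is a hypothetical callback standing in for one pass over the training mini-batches, not part of the actual pipeline:

```python
def train_with_early_stopping(run_epoch, max_epochs=300, patience=50):
    """Stop when the validation accuracy has not improved for `patience`
    epochs, or when `max_epochs` is reached, whichever comes first.

    `run_epoch(epoch)` performs one training epoch and returns the
    accuracy measured on the validation data.
    """
    best_acc, best_epoch = -1.0, 0
    for epoch in range(1, max_epochs + 1):
        val_acc = run_epoch(epoch)
        if val_acc > best_acc:
            best_acc, best_epoch = val_acc, epoch
        elif epoch - best_epoch >= patience:
            break  # validation accuracy has plateaued
    return best_epoch, best_acc
```

For example, if the validation accuracy rises until epoch 100 and then plateaus, training halts at epoch 150 and the epoch-100 model is kept.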

Table 3: The architecture of the CNN
Layers kernel number kernel size Activation function
1 Input / matrix (size: 2×262144) /
2 Convolution 32 matrix (size: 1×34) relu
3 Pooling 16 matrix (size: 1×8) relu
4 Convolution 16 matrix (size: 1×8) relu
5 Pooling 16 matrix (size: 1×6) relu
6 Convolution 16 matrix (size: 1×6) relu
7 Pooling 16 matrix (size: 1×4) relu
8 Flatten / / /
9 Dense / vector (size: 128) relu
10 Dense / vector (size: 32) relu
11 Output / vector (size: 2) softmax
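As a rough consistency check of Table 3, one can propagate the input length through the layers. The padding and strides are not specified in the text, so the sketch below assumes unpadded ('valid', stride-1) convolutions and non-overlapping pooling; the numbers in the real implementation may differ under other conventions:

```python
def conv_out(n, k):
    # 'valid' convolution: no padding, stride 1
    return n - k + 1

def pool_out(n, k):
    # non-overlapping max-pooling with stride equal to the pool size
    return n // k

# Propagate the 262144-sample axis of the input (Table 3, layer 1)
# through the convolution/pooling pairs of layers 2-7.
n = 262144
n = pool_out(conv_out(n, 34), 8)   # Conv (1x34) -> Pool (1x8)
n = pool_out(conv_out(n, 8), 6)    # Conv (1x8)  -> Pool (1x6)
n = pool_out(conv_out(n, 6), 4)    # Conv (1x6)  -> Pool (1x4)
flat = n * 16                      # 16 feature maps enter the Flatten layer
```

Under these assumptions the flattened vector has 21808 elements, which is then reduced by the dense layers (128, 32, and finally 2 softmax outputs).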
Table 4: Different waveform setups used as testing data. The first column indicates the index; the second column is the waveform model; the third column describes the parameter distribution, where the ranges shown in the table apply to some astrophysical parameters, while unspecified parameters follow the same distribution as the training data (see Table 2).
number waveform model physical parameter distribution signal sample number
1 AK ρ uniform in [50, 120] 500
2 AK ρ > 50, astrophysical model M12 500
3 AAK ρ uniform in [50, 120] 500
4 AK ρ enumerates 10, 20, ..., 130 1000 × 13
5 AK M enumerates 10^4, 10^4.5, ..., 10^7 M⊙; a = 0.98; z = 0.1, 0.2, 0.3 1000 × 7 × 3
6 AK M = 10^6 M⊙; a enumerates 0.0, 0.2, 0.4, 0.6, 0.8, 0.98; z = 0.1, 0.2, 0.3 1000 × 6 × 3
7 AK z enumerated for each of: M = 10^5.5 M⊙, a = 0.98; M = 10^6 M⊙, a = 0.0; M = 10^6 M⊙, a = 0.98 1000 × 3 × 3

IV.2 Testing phase

Once the CNN is trained, we test its performance with 7 different groups of testing data, whose properties are summarized in Table 4. The setup of these groups is motivated by two main points to test, namely the validity of the CNN and the sensitivity of the CNN. For testing the validity, we prepare three groups of signal-plus-noise samples: signals that follow the same distribution as the training data, signals that follow an astrophysically motivated distribution, and signals produced by AAK but with the same distribution as the training data. For testing the sensitivity, we apply the CNN to the remaining four groups, in which many intrinsic parameters are fixed, but either mass, redshift, or spin is enumerated over a certain list. Taking group 4 as an example, it replaces the SNR distribution of group 1 with a set of fixed SNRs, taking values 10, 20, 30, ..., 130. Similarly, group 5 enumerates the mass of the MBH, taking values of 10^4, 10^4.5, ..., 10^7 M⊙. Group 6 enumerates the spin of the MBH in a similar way, taking values of 0.0, 0.2, 0.4, 0.6, 0.8, 0.98, with the MBH mass fixed at 10^6 M⊙. Group 7 focuses on EMRI systems with three kinds of MBHs: 1. M = 10^5.5 M⊙ and a = 0.98; 2. M = 10^6 M⊙ and a = 0.98; 3. M = 10^6 M⊙ and a = 0. Each of these enumerates the redshift, taking values of 0.1, 0.15, 0.2, 0.25, 0.3, 0.35.

The trained CNN takes the testing data and outputs a probability y_pred, which predicts the probability that the data contains a signal. Conversely, 1 − y_pred predicts the probability that it is purely Gaussian noise.

V Results

V.1 Validity

To measure the effectiveness of the trained CNN model, we use the testing data of groups 1-3 in Table 4. One can compare the efficiency of detection methods through the Receiver Operating Characteristic (ROC) curve, where the True Alarm Probability (TAP) is plotted against the False Alarm Probability (FAP).

In the calculation of the ROC curve, one first chooses a value y* as the detection threshold. Any output with y_pred > y* is regarded as an alarm. At a given y*, the FAP and the TAP are constructed as the fractions of noise-only and signal-plus-noise samples, respectively, that are reported as alarms. By varying y*, one obtains pairs of (FAP, TAP) values, and the line linking all these pairs is the ROC curve. By definition, the TAP is a monotonic function of the FAP.
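This construction can be sketched directly from the network outputs. The helper and the toy scores below are illustrative only, not the analysis code used in this work:

```python
import numpy as np

def roc_curve(noise_scores, signal_scores):
    # Sweep the detection threshold y* over all observed scores (in
    # decreasing order); an output above y* counts as an alarm.  FAP is
    # the alarm fraction among noise-only samples, TAP among
    # signal-plus-noise samples.
    thresholds = np.sort(np.concatenate([noise_scores, signal_scores]))[::-1]
    fap = np.array([(noise_scores > t).mean() for t in thresholds])
    tap = np.array([(signal_scores > t).mean() for t in thresholds])
    return fap, tap

# Toy example: overlapping score distributions for the two classes.
rng = np.random.default_rng(1)
noise = rng.uniform(0.0, 0.6, 500)    # y_pred for noise-only samples
signal = rng.uniform(0.4, 1.0, 500)   # y_pred for signal-plus-noise samples
fap, tap = roc_curve(noise, signal)
```

Because the thresholds are swept monotonically, both the FAP and the TAP grow monotonically along the curve, as stated above.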

Figure 3: The ROC curves of the signals from testing groups 1-3 are shown with the blue, purple, and red lines, respectively. For group 1 (blue), the parameters follow the same distribution as the training data; for group 2 (purple), the parameters are drawn from an astrophysical model; for group 3 (red), the distribution is the same as for group 1 and the training data, but the AAK waveform model is used. The 1-σ confidence intervals are indicated by the shaded regions.

The blue line in Figure 3 represents the ROC curve for group 1. The dotted line marks the worst possible performance, indicating a total inability to distinguish signal from noise. One can observe that for a FAP of 1%, the corresponding TAP is approximately 92%. This indicates that when the CNN is applied to data constructed similarly to the training data, it has a very good ability to distinguish signal from noise. We remark that the results from group 1 are expected to be the upper limit among all groups and can be used as a benchmark.

The purple line in Figure 3 corresponds to the ROC curve for group 2. Compared with the blue line, the purple line is lower: for a FAP of 1%, the TAP is about 75%. Since group 2 is an astrophysically motivated population, the majority of signals are expected to have relatively low SNRs. Due to the SNR threshold of 50 chosen for the training data, we observe a clustering of SNRs just above 50, which explains the significant decrease in TAP. On the other hand, even though the training data differ in parameter distribution from the testing data, the trained CNN still has a good ability to detect signals, showing good generalization ability.

The red line in Figure 3 corresponds to the ROC curve of group 3. In this group, the parameter distribution is the same as in group 1, but a different waveform model, namely the AAK waveform, is adopted. This group can be used to study the robustness of the CNN against the choice of EMRI waveform. The similarity between the blue and red lines indicates that a CNN trained on AK waveforms can achieve good detection ability on AAK waveforms. One should be cautious not to take this conclusion for granted: although AK and AAK adopt similar equations, the setup of the waveform calculation makes them dephase quickly, on a timescale of around an hour. This observation gives us confidence that the CNN method can be robust; even if a fully realistic waveform is not available when space-borne GW detectors are operating, one can still rely on a CNN trained on approximate waveforms to achieve good detection potential up to certain SNR values.

V.2 Sensitivity

Figure 4: The comparison of the CNN sensitivity over EMRIs with different parameters, shown in panels (a)-(d). The vertical axis is the TAP, while the horizontal axis is the single varying parameter. The 1-σ confidence intervals are indicated by the shaded regions.

One can present the sensitivity of the CNN over parameters such as SNR, MBH mass, MBH spin, and redshift, using the signal parameter distributions of groups 4-7 in Table 4. We adopt fixed FAPs of 0.1 and 0.01. The results are presented in Figure 4 in the form of efficiency curves.
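A fixed-FAP efficiency measurement can be sketched as follows; this is a hypothetical helper, assuming the threshold is set from the empirical distribution of the noise-only outputs:

```python
import numpy as np

def tap_at_fap(noise_scores, signal_scores, fap=0.01):
    # Choose the threshold so that the target fraction `fap` of the
    # noise-only outputs exceeds it, then measure the detected fraction
    # of the signal-plus-noise outputs at that same threshold.
    threshold = np.quantile(noise_scores, 1.0 - fap)
    return (signal_scores > threshold).mean()
```

Repeating this for each enumerated parameter value (SNR, mass, spin, or redshift) yields one point on the corresponding efficiency curve.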

Figure 4(a) displays the TAP versus the SNR for signals from group 4. Both efficiency curves are consistent with the expectation that the CNN exhibits higher sensitivity towards stronger signals; for SNRs higher than about 100, the CNN can guarantee detection at false alarm probabilities of both 0.01 and 0.1.

We study the effectiveness of the CNN over MBH masses with group 5. The MBH spin is fixed at a = 0.98, and the redshift is varied through z = 0.1, 0.2, 0.3. It can be seen from Figure 4(b) that EMRI systems containing a 10^5.5 M⊙ MBH are most easily detected by the trained CNN. For such an EMRI at a redshift of 0.1, the CNN can achieve a TAP of 78% conditioned on a FAP of 0.1. The sensitivity peak with respect to MBH mass stems from two factors: the detector sensitivity curve and the GW signal frequency distribution. Larger MBH masses lead to larger amplitudes but lower frequencies, while the detector is most sensitive only in a limited frequency band. This result is also consistent with the horizon distance analysis in a previous study Fan et al. (2020).

EMRI signals contained in group 6 have a fixed MBH mass (M = 10^6 M⊙) and fixed redshifts (z = 0.1, 0.2, 0.3). From Figure 4(c) one can observe almost no apparent dependence of the detection ability on MBH spin. This is indirect evidence that the detection ability is most sensitive to the SNR. Further investigation shows that EMRI systems with different spins share similar SNRs. This can be roughly explained as follows: the EMRI systems have a smooth evolution of the overall GW amplitude before the final merger, and according to our setup, the merger is fixed at the end of the three-month observation period. Therefore, the spin parameter has only a minor impact on the overall SNR.

We present the efficiency curves for group 7 in Figure 4(d). The SNR of a signal is roughly inversely proportional to the luminosity distance, which for the nearby universe is approximately proportional to the redshift. Therefore, on the vertical axis we plot the TAP multiplied by the redshift: if the TAP were linearly correlated with the SNR, this product would remain constant. This is exactly what we observe for the blue solid line (M = 10^5.5 M⊙, a = 0.98, FAP of 0.1); the SNRs within this redshift range (0.1-0.35) roughly span 20-60, over which the TAP is close to linear in the SNR in Figure 4(a). The prediction is most strongly violated for the purple dashed line (M = 10^6 M⊙, a = 0.98, FAP of 0.01): closer events gain more detection efficiency, as expected from the non-linearity of the dashed line in the corresponding SNR range in Figure 4(a).

From the above analyses, we conclude that among all factors, the SNR has the strongest influence on the expected detection efficiency. Changes in other parameters can also lead to different TAPs, but such differences can mostly be explained by the corresponding SNRs.

VI Conclusions and future work

The GW detection of EMRI signals is challenging from multiple perspectives. Accurate modeling of the EMRI waveform is a challenging task for theorists, and at the same time a template bank with ∼10^40 templates would be needed if one were to adopt the traditional matched filtering method Gair et al. (2004). In this work, we demonstrate a proof-of-principle application of a CNN to the detection of EMRI signals. Specifically, an EMRI source with a signal-to-noise ratio uniformly drawn from 50 to 120 can be detected with a detection rate of 91% at a search false alarm probability of 1%. The effectiveness of the CNN varies with the physical parameters; however, most of the variation can be explained by the corresponding SNRs. EMRI systems with higher SNRs have a higher chance of being detected.

Our work indicates that detecting EMRI signals with a CNN is appealing for multiple reasons. For example, after being properly trained, the CNN can be applied to three months' data and finish the classification within one second. Moreover, the CNN shows good generalization ability against a change of waveform model: although trained on the less physical AK waveform, the network can successfully recognize signals injected with AAK waveforms. This implies that the CNN method has the potential to achieve actual detections even without accurately modeled waveforms.

It is important to benchmark the performance of the CNN against other methods. However, to the best of the authors' knowledge, there is no other study that both uses physical EMRI waveforms to search for the signal and provides a FAP estimate. For example, some studies use physical waveforms to search for the signal, but no FAP is available Gair et al. (2004); other studies provide FAP estimates, but in a very different context, and the authors adopt phenomenological waveforms for the detection Wang et al. (2012).

Although this proof-of-principle study is encouraging, we recognize that many challenges remain in implementing a reliable CNN to detect EMRI signals. For example, one needs to push the SNR threshold to values lower than 50. This translates into a much larger amount of training data; data generation, data storage, and the training procedure would all become more demanding on hardware. We leave this issue for future studies, with the potential of applying more sophisticated methods or constructing a hierarchical search-and-examination strategy for the detection pipeline.

At present, many Machine Learning (ML) algorithms have found wide application in GW data analysis. For different problems in the GW field, developers use different methods, or different architectures of the same method, such as the CNN. Similar approaches might also apply to the EMRI search, and we could borrow from this existing wisdom in the future.

VII Acknowledgments

The authors thank Hui-Min Fan, Lianggui Zhu, En-Kun Li, Jianwei Mei, and Micheal Williams for helpful discussions. This work has been supported by Guangdong Major Project of Basic and Applied Basic Research (Grant No. 2019B030302001), the Natural Science Foundation of China (Grants No. 12173104, No. 11805286, and No. 11690022). CM is supported by the Science and Technology Facilities Council grant ST/V005634/1. The authors would like to thank the Gravitational-wave Excellence through Alliance Training (GrEAT) network for facilitating this collaborative project. The GrEAT network is funded by the Science and Technology Facilities Council UK grant no. ST/R002770/1. NK would like to thank the support of the CNES fellowship. The authors acknowledge the uses of the calculating utilities of numpy van der Walt et al. (2011), scipy Virtanen et al. (2020), and the plotting utilities of matplotlib Hunter (2007).

References