
Finding Quasars behind the Galactic Plane. I. Candidate Selections with Transfer Learning

Yuming Fu, Department of Astronomy, School of Physics, Peking University, Beijing 100871, China; Kavli Institute for Astronomy and Astrophysics, Peking University, Beijing 100871, China
Xue-Bing Wu, Department of Astronomy, School of Physics, Peking University, Beijing 100871, China; Kavli Institute for Astronomy and Astrophysics, Peking University, Beijing 100871, China
Qian Yang, Department of Astronomy, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
Anthony G. A. Brown, Leiden Observatory, Leiden University, Niels Bohrweg 2, 2333 CA, Leiden, The Netherlands
Xiaotong Feng, Department of Astronomy, School of Physics, Peking University, Beijing 100871, China; Kavli Institute for Astronomy and Astrophysics, Peking University, Beijing 100871, China
Qinchun Ma, Department of Astronomy, School of Physics, Peking University, Beijing 100871, China; Kavli Institute for Astronomy and Astrophysics, Peking University, Beijing 100871, China
Shuyan Li, Department of Astronomy, School of Physics, Peking University, Beijing 100871, China
(Received 10 Sep, 2020; Revised 30 Jan, 2021; Accepted 18 Feb, 2021)
Abstract

Quasars behind the Galactic plane (GPQs) are important astrometric references and useful probes of Milky Way gas. However, the search for GPQs is difficult due to large extinctions and high source densities in the Galactic plane. Existing selection methods for quasars developed using high Galactic latitude (high-$b$) data cannot be applied to the Galactic plane directly because the photometric data obtained from high-$b$ regions and the Galactic plane follow different probability distributions. To alleviate this dataset shift problem for quasar candidate selection, we adopt a Transfer Learning Framework at both data and algorithm levels. At the data level, to make a training set in which dataset shift is modeled, we synthesize quasars and galaxies behind the Galactic plane based on SDSS sources and a Galactic dust map. At the algorithm level, to reduce the effect of class imbalance, we transform the three-class classification problem for stars, galaxies, and quasars into two binary classification tasks. We apply the XGBoost algorithm to Pan-STARRS1 (PS1) and AllWISE photometry for classification, and an additional cut on Gaia proper motion to remove stellar contaminants. We obtain a reliable GPQ candidate catalog with 160,946 sources located at $|b|\leq 20^{\circ}$ in the PS1-AllWISE footprint. Photometric redshifts of GPQ candidates obtained with the XGBoost regression algorithm show that our selection method can identify quasars in a wide redshift range ($0<z\lesssim 5$). This study extends the systematic searches for quasars to the dense stellar fields and shows the feasibility of using astronomical knowledge to improve data mining under complex conditions in the Big Data era.

Active galactic nuclei (16), Astrostatistics techniques (1886), Catalogs (205), Classification (1907), Galactic and extragalactic astronomy (563), Quasars (1319)
journal: ApJS
software: astropy (Astropy Collaboration et al., 2013; Price-Whelan et al., 2018), dustmaps (Green, 2018), GNU Parallel (Tange, 2011), healpy (Zonca et al., 2019), HEALPix (Górski et al., 2005), optuna (Akiba et al., 2019), scikit-learn (Pedregosa et al., 2011), TOPCAT (Taylor, 2005), XGBoost (Chen & Guestrin, 2016).

1 Introduction

The Galactic plane has long been the “zone of avoidance” for extragalactic astronomy, including quasar surveys. The Half Million Quasar (HMQ; Flesch, 2015) catalog contains a total of 510,764 objects, but only 35,105 are located at $|b|\leq 30^{\circ}$ (half of the whole sky area), 3,730 at $|b|\leq 20^{\circ}$, and 255 at $|b|\leq 10^{\circ}$. Although it is difficult to search for Quasars behind the Galactic Plane (hereafter GPQs), such quasars are important references for astrometry and useful probes of Milky Way gas.

Quasars are used as astrometric references due to their small parallaxes and proper motions. GPQs enable the accurate measurement of positions, distances, and proper motions of stars in the Galactic disk, which is key to understanding our own Galaxy. The high-precision astrometry provided by the Gaia mission defines a celestial reference frame through the positions of 556,869 candidate quasars; however, only a tiny fraction of these quasars are located at $|b|\leq 15^{\circ}$ (Gaia Collaboration et al., 2018a). A large sample of GPQs will help build a better reference frame in the optical, through direct coverage of the sky in the Galactic plane, and will help to better understand the systematic astrometric errors of Gaia in the Galactic plane region (Arenou et al., 2018).

Line-of-sight absorption towards quasars can probe gas structures of the Milky Way. While quasars at high Galactic latitude have been useful in studying the Milky Way halo gas (e.g. Savage et al., 1993, 2000; Ben Bekhti et al., 2008, 2012), GPQs allow absorption line studies of gaseous structures in the Galactic plane (e.g. Anti-Center Shell, H Complex; see Westmeier, 2018). Moreover, a high-density sample of GPQs can map the gas distribution with a higher angular resolution than is possible with 21 cm surveys.

Another application of GPQs is adaptive-optics observation of quasar host galaxies, which is made possible by their proximity to bright stars that can serve as natural guide stars (Im et al., 2007; Fischer et al., 2019). For adaptive optics, natural guide stars should be located within a few arcseconds of the science target, which rarely occurs outside of the Galactic plane but is more common in the plane.

The difficulty of finding quasars behind the Galactic plane is caused by several challenges, including:

  • In comparison to objects at high Galactic latitude (high-$b$), sources in the Galactic plane suffer from higher extinction and reddening. As a result, many sources (especially extragalactic sources) cannot be detected within the survey detection limit. For the sources that remain detectable, their colors are different from those at high Galactic latitude.

  • The source density in the Galactic plane is high. The quality of photometry can be worse in dense regions, because sources can be easily contaminated by visible or unseen neighbors.

  • A lot of “unusual” stars are located within the Galactic plane, including some white dwarfs, M/L/T dwarfs, and Young Stellar Objects (YSOs), that share many similar observational properties with quasars. These sources can be contaminants for quasars at different redshifts (e.g. Kirkpatrick et al., 1997; Vennes et al., 2002; Chiu et al., 2006; Kozłowski & Kochanek, 2009).

Since the first identification of a quasar (3C 273; Schmidt, 1963), many methods for quasar candidate selection have been developed, including ultraviolet excess (e.g. Sandage, 1965; Green et al., 1986), radio sources (e.g. Gregg et al., 1996; White et al., 2000; Becker et al., 2001), X-ray sources (e.g. Pounds, 1979; Grazian et al., 2000), optical/near-infrared (near-IR) colors (e.g. Richards et al., 2002; Fan et al., 2001; Wu & Jia, 2010), mid-IR colors (e.g. Lacy et al., 2004; Stern et al., 2005, 2012; Mateos et al., 2012; Wu et al., 2012; Yan et al., 2013), and quasar variability (e.g. Dobrzycki et al., 2003; Palanque-Delabrouille et al., 2011). In addition, tools based on statistical machine learning (e.g. Richards et al., 2004; Bovy et al., 2011) and deep learning (e.g. Yèche et al., 2010; Pasquet-Itam & Pasquet, 2018) have also been established to find quasars with the various data available.

A few studies have focused on finding quasars/AGNs behind dense stellar fields such as the Galactic plane, the Magellanic Clouds, and the M31 and M33 galaxies. Most of these studies used infrared selection methods to efficiently find quasars. For example, Im et al. (2007) discovered 40 bright quasars at $|b|\leq 20^{\circ}$ by combining a near-IR color cut of $J-K>1.4$ on the Two Micron All Sky Survey (2MASS; Skrutskie et al., 2006) with the detection of a radio counterpart in the NRAO VLA Sky Survey (NVSS; Condon et al., 1998). Kozłowski & Kochanek (2009) identified 5,000 AGNs behind the Magellanic Clouds with mid-IR color cuts modified from the method of Stern et al. (2005). Huo et al. (2010, 2013, 2015) discovered 1,870 new quasars around the Andromeda (M31) and Triangulum (M33) galaxies, with the Large Sky Area Multi-Object Fiber Spectroscopic Telescope (LAMOST) from 2009 to 2013.

Recently, searches for quasars have focused on large datasets with big data volumes and large sky coverage. Secrest et al. (2015) obtained an all-sky catalog of AGN candidates with $\sim 1.4$ million sources using two-color infrared photometric selection criteria from the Wide-field Infrared Survey Explorer final catalog release (AllWISE) (Wright et al., 2010; Mainzer et al., 2011). Assef et al. (2018) built two catalogs of AGN candidates also based on AllWISE photometry, while excluding regions around the Galactic center and Galactic plane. Jin et al. (2019) selected quasar candidates with a machine learning method using Pan-STARRS1 (PS1; Chambers et al., 2016) and AllWISE data. Bailer-Jones et al. (2019) classified objects in Gaia Data Release 2 (Gaia DR2; Gaia Collaboration et al., 2016, 2018b) as stars, quasars, and galaxies with a Gaussian Mixture Model and addressed the problem of class imbalance in Gaia DR2.

However, the studies listed above either treated sources in the Galactic plane and at high Galactic latitude as the same, or removed the Galactic plane from consideration. Selection methods for quasars at high Galactic latitude are not generic and cannot be applied to the Galactic plane directly, because data (e.g. PS1 and AllWISE photometry) obtained from high-$b$ and low-$b$ regions follow different probability distributions. For example, the apparent colors of quasars (stars) vary from high-$b$ to low-$b$ regions, and so does the source density of quasars (stars). Such behavior of data is a kind of non-stationarity called dataset shift (Quionero-Candela et al., 2009), which leads to significant estimation bias in supervised machine learning algorithms. The color cuts for quasar selection can also be regarded as simple decision tree models in the machine learning regime. Previous color cuts obtained from high Galactic latitude regions fail in the Galactic plane due to the dataset shift.

To deal with these dataset shift problems, transfer learning (Pan & Yang, 2009) has been proposed and studied extensively by data scientists. The idea of transfer learning is to use knowledge gained in one problem and apply it to a different but related problem. Although spectroscopically identified (i.e. “labeled”) samples of extragalactic objects are inadequate in the Galactic plane, such labeled samples are available at high Galactic latitude. The labeled data make it possible to build a good selection method for GPQs, once the knowledge transfer from high Galactic latitude to low Galactic latitude is successful.

This paper is the first in a series on finding GPQs. In this paper we present a transfer learning method for quasar selection, as well as a GPQ candidate catalog with 160,946 sources. In Section 2, we introduce the archival data used for this study. In Section 3, we describe the algorithm design for GPQ selection. In Section 4, we synthesize quasars and galaxies behind the Galactic plane with extragalactic objects at high Galactic latitude from the Sloan Digital Sky Survey (SDSS; York et al., 2000), to make a training set in which dataset shift is modeled. In Section 5, we transform the three-class classification problem for stars, galaxies, and quasars into two binary classification tasks (stars versus extragalactic objects, and quasars versus galaxies) to reduce the class imbalance and class-balance change. In Section 6, we calculate the photometric redshifts for GPQ candidates. In Section 7, we present the GPQ candidate catalog and some statistical properties of the sample. We summarize the results in Section 8. Throughout this paper, we use AB magnitudes for PS1 photometry and Vega magnitudes for AllWISE photometry unless otherwise stated.

2 Data

We make use of optical and infrared photometric data from PS1 and AllWISE, and astrometric data from Gaia DR2. We also retrieve samples of spectroscopically identified objects from SDSS and LAMOST.

2.1 PS1 DR1 photometry

Pan-STARRS1 (PS1; Chambers et al., 2016) has carried out a set of synoptic imaging sky surveys including the $3\pi$ Steradian Survey and the Medium Deep Survey in 5 bands $(grizy_{P1})$. The mean $5\sigma$ point source limiting sensitivities in the stacked $3\pi$ Steradian Survey in $(grizy_{P1})$ are (23.3, 23.2, 23.1, 22.3, 21.4) and the single epoch $5\sigma$ depths in $(grizy_{P1})$ are (22.0, 21.8, 21.5, 20.9, 19.7). For better astrometry in the crowded Galactic plane field, we use mean coordinates from the PS1 MeanObject table. Mean PSF magnitudes are used for all bands $(grizy_{P1})$, and mean Kron magnitudes (Kron, 1980) are used for the $i_{P1}$ and $z_{P1}$ bands. The Galactic extinction coefficients for $(grizy_{P1})$ are $R_{g},R_{r},R_{i},R_{z},R_{y}=3.5805,2.6133,1.9468,1.5097,1.2245$. These coefficients are calculated using $R_{\lambda}=A_{\lambda}/A_{V}\times R_{V}$, where $A_{\lambda}/A_{V}$ is the relative extinction value for band $\lambda$ given by a new optical to mid-IR extinction law (Wang & Chen, 2019), and $R_{V}=3.1$.
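As a minimal sketch of how such coefficients are typically used, the snippet below assumes the standard relation $A_{\lambda}=R_{\lambda}\,E(B-V)$, which follows from $R_{V}=A_{V}/E(B-V)$. The function names and the example magnitudes are hypothetical and only serve as an illustration.

```python
# Extinction coefficients R_lambda quoted in the text for the PS1 bands.
R_PS1 = {"g": 3.5805, "r": 2.6133, "i": 1.9468, "z": 1.5097, "y": 1.2245}
R_V = 3.1

def extinction(band, ebv):
    """Return A_lambda (mag) for a PS1 band, given E(B-V) in mag.

    Assumes A_lambda = R_lambda * E(B-V).
    """
    return R_PS1[band] * ebv

def deredden(mags, ebv):
    """Subtract the line-of-sight extinction from observed magnitudes."""
    return {band: m - extinction(band, ebv) for band, m in mags.items()}

def relative_extinction(band):
    """Recover A_lambda / A_V from R_lambda = (A_lambda / A_V) * R_V."""
    return R_PS1[band] / R_V

# Hypothetical source with E(B-V) = 0.5 mag (illustrative values only).
observed = {"g": 21.0, "r": 20.2, "i": 19.8, "z": 19.6, "y": 19.5}
intrinsic = deredden(observed, ebv=0.5)
```

The same pattern would apply to the AllWISE bands once their coefficients are added to the dictionary.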

We set a few constraints on the PS1 data to ensure the data quality. All sources should be: (i) detected in all PS1 bands ($grizy_{P1}>0$) and significantly detected in $i_{P1}$ (error in PSF mag of $i_{P1}$ band $i\_err<0.2171$, equivalent to $i_{P1}$-band SNR larger than 5); (ii) not too bright in $i_{P1}$ to avoid possible saturation ($i>14$); (iii) measured with Kron magnitude (Kron, 1980) in the $i_{P1}$ and $z_{P1}$ bands ($i_{\mathrm{Kron}}>0\ \&\ z_{\mathrm{Kron}}>0$). For simplification, we use ($g,r,i,z,y$) to represent the PSF magnitudes of PS1 bands ($grizy_{P1}$) in color indexes (e.g. $g-r$, $g-W1$) and derived quantities $i-i_{\mathrm{Kron}}$ and $z-z_{\mathrm{Kron}}$. The $z_{P1}$ PSF magnitude does not appear alone and will not be confused with the redshift symbol $z$.
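The constraints above can be sketched as a simple filter. The snippet also shows where the threshold 0.2171 comes from, using the first-order relation $\sigma_{m}\approx(2.5/\ln 10)/\mathrm{SNR}$. The dictionary keys are hypothetical placeholders, not actual PS1 column names.

```python
import math

def mag_err_from_snr(snr):
    """Magnitude error for a given flux signal-to-noise ratio.

    Uses the first-order relation sigma_m = (2.5 / ln 10) / SNR.
    """
    return 2.5 / math.log(10) / snr

def passes_ps1_cuts(src):
    """Apply constraints (i)-(iii) to one source.

    `src` is a dict with hypothetical keys: PSF magnitudes g, r, i, z, y,
    the i-band PSF magnitude error i_err, and Kron magnitudes i_kron, z_kron.
    """
    detected = all(src[band] > 0 for band in ("g", "r", "i", "z", "y"))
    significant = src["i_err"] < 0.2171      # i-band SNR larger than 5
    unsaturated = src["i"] > 14              # avoid possible saturation
    has_kron = src["i_kron"] > 0 and src["z_kron"] > 0
    return detected and significant and unsaturated and has_kron

# Illustrative source (made-up values).
example = {"g": 20.1, "r": 19.7, "i": 19.4, "z": 19.3, "y": 19.2,
           "i_err": 0.03, "i_kron": 19.5, "z_kron": 19.4}
keep = passes_ps1_cuts(example)
```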

2.2 AllWISE photometry for point-like sources

The AllWISE catalog is built upon the work of the Wide-field Infrared Survey Explorer mission (WISE; Wright et al., 2010) by combining data from the WISE cryogenic and NEOWISE (Mainzer et al., 2011) post-cryogenic surveys. WISE has 4 bands at 3.4, 4.6, 12, and 22 $\mu$m (W1, W2, W3, and W4). The $5\sigma$ limiting magnitudes of the AllWISE catalog in the W1, W2, W3, and W4 bands are 19.6, 19.3, 16.7, and 14.6 mag. The Galactic extinction coefficients for W1, W2, W3 used in this study are $R_{W1},R_{W2},R_{W3}=0.1209,0.0806,0.124$. These coefficients are also calculated with relative extinction $A_{\lambda}/A_{V}$ values from Wang & Chen (2019).

We cross-match the PS1 sources with AllWISE using a radius of $1^{\prime\prime}$ to avoid source confusion in the dense fields of the Galactic plane. We also set a few constraints on the AllWISE data. All sources should be: (i) AllWISE point sources ($ext\_flg=0$); (ii) not too bright to avoid possible saturation ($W1>8\ \&\ W2>7$); (iii) significantly detected in the $W1$ and $W2$ bands ($W1snr>5\ \&\ W2snr>5$); (iv) unaffected by prioritized image artifacts in each band ($cc\_flags=$“0000”); (v) unblended with nearby detections, so that only one component is used in each profile-fitting for each source ($nb=1$).
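A minimal sketch of the cross-match and the AllWISE constraints is given below. It uses a brute-force nearest-neighbor search with the haversine formula, which is an illustrative choice rather than the tool used in this work; for catalogs of realistic size a spatial index would be needed. The dictionary keys are assumed to follow the AllWISE column names (w1mpro, w2mpro, w1snr, w2snr, cc_flags, ext_flg, nb).

```python
import math

def angular_sep_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation in arcsec (haversine); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h)))) * 3600.0

def nearest_match(ra, dec, catalog, radius_arcsec=1.0):
    """Index of the closest catalog entry within the radius, or None.

    `catalog` is a list of (ra, dec) tuples in degrees.
    """
    best_index, best_sep = None, radius_arcsec
    for index, (cat_ra, cat_dec) in enumerate(catalog):
        sep = angular_sep_arcsec(ra, dec, cat_ra, cat_dec)
        if sep <= best_sep:
            best_index, best_sep = index, sep
    return best_index

def passes_allwise_cuts(src):
    """Apply constraints (i)-(v) to one AllWISE source given as a dict."""
    return (src["ext_flg"] == 0                          # (i) point source
            and src["w1mpro"] > 8 and src["w2mpro"] > 7  # (ii) not saturated
            and src["w1snr"] > 5 and src["w2snr"] > 5    # (iii) significant
            and src["cc_flags"] == "0000"                # (iv) no artifacts
            and src["nb"] == 1)                          # (v) unblended
```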

2.3 Gaia DR2 astrometry

Gaia DR2 (Gaia Collaboration et al., 2016, 2018b) contains celestial positions and the apparent brightness in the $G$ band for approximately $1.7$ billion sources. For $1.3$ billion of those sources, parallaxes and proper motions are available. Broad-band photometry in the $G_{\mathrm{BP}}$ (330–680 nm) and $G_{\mathrm{RP}}$ (630–1050 nm) bands is available for $1.4$ billion sources. We use the proper motions and their uncertainties from the Gaia DR2 catalog (columns pmra, pmra_error, pmdec, and pmdec_error) to find quasars.

2.4 SDSS Quasar Catalog: the fourteenth data release

SDSS (York et al., 2000) has mapped the high Galactic latitude northern sky and obtained imaging as well as spectroscopy data for millions of objects including stars, galaxies, and quasars. The 14th data release of the SDSS Quasar Catalog (SDSS DR14Q; Pâris et al., 2018) contains 526,356 quasars. We cross-match the DR14Q catalog with PS1 and AllWISE, both with a radius of $1^{\prime\prime}$. To ensure the data quality, we use the same constraints as in Section 2.1 and Section 2.2 to retrieve a subset of DR14Q. This subset has 289,271 sources and is denoted as $GoodQSO$ hereafter. As can be seen from the HEALPix (Górski et al., 2005) density map of $GoodQSO$ (Figure 1), very few sources of $GoodQSO$ are located at $|b|\leq 20^{\circ}$.

Figure 1: HEALPix density map of $GoodQSO$ sources from SDSS DR14Q (in Galactic coordinate system) with a median density of 20.3 $\mathrm{deg}^{-2}$. The HEALPix parameter $N_{side}=64$ and the sky area per pixel is 0.839 $\mathrm{deg}^{2}$.
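The pixel area quoted in the caption follows from the HEALPix definition of $12N_{side}^{2}$ equal-area pixels over the full sky. A short check, with a hypothetical helper function written for this illustration:

```python
import math

def healpix_pixel_area_deg2(nside):
    """Area of one HEALPix pixel in square degrees.

    HEALPix divides the sphere into 12 * nside**2 equal-area pixels.
    """
    npix = 12 * nside ** 2
    full_sky_deg2 = 4 * math.pi * (180 / math.pi) ** 2  # about 41253 deg^2
    return full_sky_deg2 / npix

area = healpix_pixel_area_deg2(64)

# Median number of sources per pixel implied by the quoted median density.
median_sources_per_pixel = 20.3 * area
```

For $N_{side}=64$ this gives 49,152 pixels of about 0.839 $\mathrm{deg}^{2}$ each, so the median density of 20.3 $\mathrm{deg}^{-2}$ corresponds to roughly 17 sources per pixel.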

2.5 SDSS spectroscopically identified stars and galaxies

In order to compare high-$b$ sources with the Galactic plane sources that we use, a sample of stars and a sample of galaxies are extracted from the SpecPhotoAll table of SDSS Data Release 15 (Blanton et al., 2017; Aguado et al., 2019). We cross-match both the star and galaxy samples with PS1 and AllWISE with a radius of $1^{\prime\prime}$. The SDSS star sample has 23,693 sources. We also apply the quality constraints in Section 2.1 and Section 2.2 to select a galaxy subset with good photometry for later use. The resulting galaxy subset (denoted as $GoodGal$ hereafter) has 1,635,053 sources. Most SDSS stars and galaxies are located at high Galactic latitude ($|b|>20^{\circ}$).

2.6 Stars from LAMOST general catalog

The Large Sky Area Multi-Object Fiber Spectroscopic Telescope (LAMOST, also called the Guoshoujing Telescope) is a special reflecting Schmidt telescope, the design of which allows both a large effective aperture of 3.6 m–4.9 m and a wide field of view of $5^{\circ}$ (Wang et al., 1996; Su & Cui, 2004; Cui et al., 2012). The LAMOST spectral survey (Zhao et al., 2012; Luo et al., 2012, 2015) consists of two major components, i.e. the LAMOST Experiment for Galactic Understanding and Exploration (LEGUE; Deng et al., 2012), and the LAMOST ExtraGAlactic Survey (LEGAS). The LEGUE observes stars in different sky regions with different magnitude ranges, including the Galactic halo with $r<16.8$ mag at $|b|>30^{\circ}$, the Galactic anti-center with $14.0<r<17.8$ mag at $150^{\circ}\leq l\leq 210^{\circ}$ and $|b|<30^{\circ}$ (Yuan et al., 2015), as well as the Galactic disk with $r\lesssim 16$ mag at $|b|\leq 20^{\circ}$ with uniform coverage along Galactic longitude. The LEGAS mainly identifies galaxies and quasars that are within the SDSS footprint but complementary to the SDSS spectroscopic samples (e.g. Shen et al., 2016; Yao et al., 2019). Nevertheless, extragalactic objects in the LEGUE plates are also targets of the LEGAS. The LAMOST spectral survey has obtained the largest stellar spectra sample to date. We retrieve a star sample from the LAMOST general catalog from DR1 to DR7v0. A total of 3,940,076 LAMOST stars meet the same constraints as in Section 2.1 and Section 2.2. From this LAMOST star sample, we select 1,334,577 Galactic plane stars with $|b|\leq 20^{\circ}$ (denoted as $T_{Star}$ hereafter). Most $T_{Star}$ sources are from the LEGUE survey and are brighter than 18 mag in the $i_{P1}$ band.

2.7 The Million Quasars (Milliquas) Catalog

The Million Quasars (Milliquas) Catalog (Flesch, 2019) is a compilation of quasars and quasar candidates from the literature. The Milliquas v6.4c update includes 758,908 type-I QSOs and AGN up to 31 December 2019. We use this catalog to extract the existing GPQ sample within the PS1 footprint. There are 4,344 quasars (with “Q” label in the “Descrip” column) located at $|b|\leq 20^{\circ}$ in Milliquas v6.4c. Cross-matching these 4,344 known GPQs with PS1 and AllWISE, both with a radius of $1^{\prime\prime}$, gives 2,757 sources. After applying the same constraints as in Section 2.1 and Section 2.2, we get a subset of 1,853 sources. This Galactic plane subset of Milliquas quasars, denoted as $MLQSUB$, will be used later for candidate validation.

3 Design of the Transfer Learning Framework

3.1 Dataset shift problem in the Galactic plane

The task of quasar selection can be described as a classification problem in machine learning. Here we look into the three-class classification for stars, galaxies, and quasars with photometric data. The learning process requires two independent datasets for model training and model validation respectively. Training and validation sets can be two nonoverlapping subsets from a common parent sample with both features (colors and/or magnitudes) and class labels (star, galaxy, and quasar). Usually the class labels are given by spectroscopic identifications. The classification algorithm learns a mapping relation from features to class labels with the training set. Often, the trained classification model (classifier) is applied to another dataset without class labels (i.e. no spectroscopic identifications), which is called the application set or test set. The classifier takes features from the test set as inputs $X$ (a.k.a. covariates) and gives class labels as outputs $Y$.

A basic assumption for traditional machine learning is that training and test data follow the same probability distribution (Bishop, 2006; Hastie et al., 2009; Vapnik, 2013). However, this assumption no longer holds if we use high-$b$ data for model training and low-$b$ data for application, because the joint distribution of inputs and outputs $P(X,Y)$ differs between training and test data (i.e. dataset shift; Quionero-Candela et al., 2009).

For our GPQ selections, the dataset shift includes changes in both source colors and prior probabilities of different classes. Sources in the Galactic plane become fainter and redder than those at high Galactic latitude due to greater reddening, which changes the distribution of input features and the conditional probability of the output labels given the inputs $P(Y|X)$ (i.e. covariate shift; Shimodaira, 2000; Sugiyama & Kawanabe, 2012). Prior probabilities of stars are much higher than those of quasars (and galaxies) in the Galactic plane, which means the marginal probability $P(Y)$ differs from that at high Galactic latitude (i.e. class-balance change; Saerens et al., 2002; Du Plessis & Sugiyama, 2014). Moreover, the class ratio between extragalactic objects and stars may vary significantly from one place to another in the Galactic plane, which we refer to as “internal” class-balance change of the test data.
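For reference, the quantities discussed above can be written compactly. This is only a notational sketch based on the standard factorization of the joint distribution; the subscripts “train” and “test” are introduced here for illustration and are not notation used elsewhere in this paper.

```latex
\begin{align}
P(X,Y) &= P(Y\mid X)\,P(X) = P(X\mid Y)\,P(Y), \\
\text{dataset shift:}\quad & P_{\mathrm{train}}(X,Y) \neq P_{\mathrm{test}}(X,Y), \\
\text{class-balance change:}\quad & P_{\mathrm{train}}(Y) \neq P_{\mathrm{test}}(Y).
\end{align}
```

Because the joint distribution factorizes in this way, a change in the distribution of the input features, in the conditional probability $P(Y|X)$, or in the class priors $P(Y)$ is each sufficient to produce a dataset shift.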

Transfer learning can be applied to improve the learning performance under dataset shift from a source domain to the target domain (see a review in Pan & Yang, 2009), where a domain is a set $\mathcal{D}$ that consists of a feature space $\mathcal{X}$ and a marginal probability distribution $P(X)$, $\mathcal{D}=\{\mathcal{X},P(X)\}$. For our classification task, the source domain data are from high Galactic latitude ($|b|>20^{\circ}$) and the target domain data are from the Galactic plane ($|b|\leq 20^{\circ}$). In this study we only consider areas at $\delta>-30^{\circ}$ due to the limit of PS1 survey coverage. A comparison of some properties of the source and target domains is given in Table 1.

Table 1: Comparison of the two domains of learning
Domain of learning   Location              Labels of stars   Labels of quasars/galaxies   Internal class-balance change
Source domain        $|b|>20^{\circ}$      Available         Available                    Moderate
Target domain        $|b|\leq 20^{\circ}$  Available         Unavailable                  Severe

As large numbers of stars, quasars, and galaxies have been spectroscopically identified at $|b|>20^{\circ}$, labels for these three classes are available in the source domain. Since spectroscopically identified samples of quasars and galaxies are largely lacking at $|b|\leq 20^{\circ}$, labels for these two classes are unavailable in the target domain. Nevertheless, labels of many stars in the target domain are available with the help of the LAMOST spectroscopic survey.

According to the classification scheme for different settings of transfer learning by Pan & Yang (2009), the set-up of classification in the Galactic plane can be categorized into Transductive Transfer Learning, where source domain labels are available and target domain labels are unavailable. A popular approach to Transductive Transfer Learning is Feature-based Transfer (e.g. Blitzer et al., 2006; Argyriou et al., 2006), which reduces the difference between the source and target domain through feature transformation in either one or both of the domains.

To solve the dataset shift problem of classification in the Galactic plane, we borrow the idea of Feature-based Transfer Learning. Using the mapping relation between the features of high-$b$ and low-$b$ objects, we can generate mock samples of quasars and galaxies in the Galactic plane to simulate the covariate change of their colors and magnitudes. The LAMOST Galactic plane stars also contribute to a more accurate probability distribution of data in the target domain. To reduce the effect of class-balance change, we manually go through two binary classification steps rather than running a three-class classification algorithm only once.

3.2 Modelling covariate change with mock samples

As data of LAMOST Galactic plane stars are available, we only focus on reducing the differences in features of extragalactic objects between training and test data. For our classification problem, all features will be constructed with photometric data from PS1 and AllWISE. We assume that the differences in photometric properties between extragalactic objects in the Galactic plane and those off the plane are only caused by different extinctions/reddening along their sight-lines. In this way, we can simply generate mock extragalactic objects behind the Galactic plane with data obtained at high Galactic latitude, using the mapping relation determined by the Galactic extinction law and Galactic dust map.

The covariate change can then be shown as color change of extragalactic objects on a set of color-color diagrams. Our classification will perform better by adding mock samples of quasars and galaxies behind the Galactic plane into the training set.

3.3 Dealing with class imbalance and class-balance change in machine learning

With the data-level improvements above, the covariate change can be reduced. Efforts on the algorithm level are required to handle the class imbalance and class-balance change. During the GPQ selections, instead of performing a star-quasar binary classification, we additionally take galaxies into account and perform a three-class classification.

Many machine learning software packages support multi-class classification jobs by transforming the task into multiple binary classification problems. However, the built-in treatment is often inflexible and sometimes destructive when dealing with class imbalance problems. For example, when the one-vs-rest (also known as one-vs-all) strategy is used for multi-class classification, at some stages samples of one class are regarded as the positive samples while all samples of the other classes are regarded as negative samples. Even if all the classes in the training set have the same sample size, the binary classification is imbalanced because the positive class (the “one”) has fewer samples than the negative class (the “rest”). In our case, the GPQ set (in both training and test sets) has significantly fewer samples than the sets of galaxies and stars, so severe class imbalance will occur.

To reduce the disadvantage of the one-vs-rest strategy, which is commonly used in machine learning algorithms, we manually convert this three-class classification problem into two binary classification problems. In the first step, the Galactic plane sources are classified into two classes: stars and extragalactic objects. Extragalactic objects are then classified into quasars and galaxies in the second step. By combining the two minority classes of quasar and galaxy into one, we expect the class imbalance to be better controlled in the first step. The physical basis for merging the quasar and galaxy classes is that quasars are a special type of galaxy. For the second step, we expect the quasar-to-galaxy ratio to be nearly constant across different locations in the Galactic plane. Thus the variable quasar-to-star or galaxy-to-star ratio is avoided and the internal class-balance change is lessened in the learning process.
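The two-step scheme can be sketched as a cascade of two binary decisions. In the snippet below, the two classifier arguments are hypothetical callables standing in for trained binary models, and the class counts are illustrative numbers rather than the sample sizes used in this work.

```python
def classify_two_step(features, is_extragalactic, is_quasar):
    """Cascade of two binary classifiers replacing one three-class model.

    Step 1 separates stars from extragalactic objects; step 2 is applied
    only to the extragalactic objects and separates quasars from galaxies.
    """
    if not is_extragalactic(features):
        return "star"
    return "quasar" if is_quasar(features) else "galaxy"

def one_vs_rest_ratio(counts, positive):
    """Positive-to-negative sample ratio in a one-vs-rest split."""
    negatives = sum(n for label, n in counts.items() if label != positive)
    return counts[positive] / negatives

# Even with balanced classes, one-vs-rest yields a 1:2 split per class.
balanced = {"star": 1000, "galaxy": 1000, "quasar": 1000}
ratio_balanced = one_vs_rest_ratio(balanced, "quasar")

# Illustrative imbalanced counts: merging the two minority classes
# gives a less extreme ratio in the first step of the cascade.
imbalanced = {"star": 10000, "galaxy": 2000, "quasar": 500}
ratio_quasar_vs_rest = one_vs_rest_ratio(imbalanced, "quasar")
ratio_extragalactic_vs_star = (
    (imbalanced["galaxy"] + imbalanced["quasar"]) / imbalanced["star"]
)
```

In this toy example the quasar-vs-rest ratio is about 0.04, while the extragalactic-vs-star ratio in the first step is 0.25.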

4 Mock catalogs for quasars and galaxies behind the Galactic plane

In order to construct training samples for extragalactic objects, as well as to understand their covariate shift from high Galactic latitude to the Galactic plane, we synthesize quasars and galaxies behind the Galactic plane using the $GoodQSO$ and $GoodGal$ samples. The synthesis is plausible if we assume that the distribution of quasars on the celestial sphere is homogeneous and isotropic on large scales, as the cosmological principle suggests. In this modeling process we not only observe the changes in colors of quasars and galaxies as they are placed at low Galactic latitudes, but also get a rough estimate of the sky distributions of the sources that could be detected by a certain sky survey.

4.1 Synthesizing procedures

Let $E$ be a set of extragalactic objects ($E$ can be $GoodQSO$ or $GoodGal$). The synthesis process consists of the following steps.

1. Correcting for extinctions. Extinctions of objects in set $E$ are corrected according to a two-dimensional dust map provided by Planck Collaboration et al. (2014, hereafter Planck14), and the optical to mid-IR extinction law from Wang & Chen (2019) with $R_{V}=3.1$. The $E(B-V)$ values are retrieved using a Python module, dustmaps (Green, 2018).

2. Assigning new locations. We generate a random sample of points that are uniformly distributed on the sky with $|b|\leq 20^{\circ}$. The number of these random points is equal to the sample size of $E$. Coordinates of these points are randomly assigned to objects of $E$ as their new locations. Now we get a new set $E_{m}$ ($MockGPQ$, $MockGal$) without line-of-sight extinctions.

3. Adding new extinctions. We add extinctions to the $E_{m}$ sample using the Planck14 dust map based on their new (mock) locations.

4. Setting limiting magnitudes. We obtain a subset of $E_{m}$ by choosing sources brighter than the PS1 single epoch $5\sigma$ depths in all PS1 passbands: $(grizy_{P1})<(22.0,\ 21.8,\ 21.5,\ 20.9,\ 19.7)$. This subset, denoted as $E_{gm}$ ($GoodMockGPQ$, $GoodMockGal$), represents the “good” mock sample that can be detected by the PS1 survey in all bands. However, we do not apply similar constraints to the AllWISE bands, as the magnitude that corresponds to a $5\sigma$ sensitivity varies with location. Also, this extinction-selection effect relies more on the optical survey depth than on the IR survey depth. Factors such as observation strategies and source confusion in dense fields are not taken into consideration in this step. Therefore we may overestimate the detection rate of GPQs (and galaxies) through this synthesis. We select sources within the PS1 footprint (i.e. $\delta\geq-30^{\circ}$) and obtain set $E_{gm\textnormal{-}\mathrm{PS1}}$ ($GoodMockGPQ\textnormal{-PS1}$, $GoodMockGal\textnormal{-PS1}$).

5. Constructing training sets with mock and real data. For mock quasars that are not included in $GoodMockGPQ$-PS1, their original counterparts (high-$b$ quasars in the input set $GoodQSO$; denoted as $C_{QSO}$) are also added to the training and validation sets along with $GoodMockGPQ$-PS1. For mock galaxies that are not included in $GoodMockGal$-PS1, 25% of their original counterparts (high-$b$ galaxies in the input set $GoodGal$; denoted as $C_{Gal}$) are added to the training and validation sets. The resulting quasar and galaxy samples for training and validation are denoted as $T_{QSO}$ and $T_{Gal}$, respectively. $T_{QSO}$, $T_{Gal}$, and the LAMOST Galactic plane star sample $T_{Star}$ form the training and validation sets for machine learning classification.

By adding good mock samples and real data ($C_{QSO}$ and $C_{Gal}$) together instead of using only good mock samples as training data for quasars or galaxies, we increase the data diversity as well as the sample size of the training set. This data diversification ensures that the training set can provide more discriminative information for the machine learning model (Gong et al., 2019). In addition, more training data can help reduce overfitting.
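Steps 1 to 4 above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' pipeline: `toy_ebv` is a hypothetical stand-in for a Planck14 dust-map query, the coefficients in `K_BAND` are placeholder values rather than the Wang & Chen (2019) law, and the PS1 footprint cut ($\delta\geq-30^{\circ}$) is omitted.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical extinction coefficients A_band / E(B-V) for g, r, i, z, y.
# Placeholder values for illustration only; not the Wang & Chen (2019) law.
K_BAND = np.array([3.5, 2.6, 1.9, 1.5, 1.2])
# PS1 single-epoch 5-sigma depths quoted in Step 4 (g, r, i, z, y).
PS1_DEPTH = np.array([22.0, 21.8, 21.5, 20.9, 19.7])


def toy_ebv(l_deg, b_deg):
    """Hypothetical stand-in for a dust-map query: E(B-V) falls off with |b|."""
    return 2.0 * np.exp(-np.abs(b_deg) / 5.0)


def random_plane_positions(n, b_max=20.0):
    """Draw n points uniformly on the sphere, restricted to |b| <= b_max."""
    l = rng.uniform(0.0, 360.0, n)
    sin_b_max = np.sin(np.radians(b_max))
    # Uniform in sin(b) gives uniform surface density on the sphere.
    b = np.degrees(np.arcsin(rng.uniform(-sin_b_max, sin_b_max, n)))
    return l, b


def synthesize(mags, ebv_orig, ebv_func=toy_ebv):
    """Apply Steps 1-4 to an (n, 5) array of observed grizy magnitudes."""
    intrinsic = mags - ebv_orig[:, None] * K_BAND      # Step 1: de-redden
    l, b = random_plane_positions(len(mags))           # Step 2: new locations
    ebv_new = ebv_func(l, b)                           # Step 3: new extinction
    mock = intrinsic + ebv_new[:, None] * K_BAND
    good = np.all(mock < PS1_DEPTH, axis=1)            # Step 4: depth cut
    return mock, l, b, ebv_new, good


# Toy input: 1000 high-latitude sources with small original extinction.
mags = rng.normal(19.5, 0.8, size=(1000, 5))
ebv_orig = rng.uniform(0.0, 0.1, 1000)
mock, l, b, ebv_new, good = synthesize(mags, ebv_orig)
print("fraction surviving the depth cut:", good.mean())
```

Sampling uniformly in $\sin b$ rather than in $b$ is what keeps the mock positions uniform per unit solid angle.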

The flowchart of the synthesizing procedures is displayed in Figure 2.

Figure 2: Flowchart of synthesizing procedures for the mock catalogs.

In the synthesizing process, we adopt the Planck14 dust map because it detects dust at a greater depth and better estimates the 2D extinctions in the Galactic plane than do the dust maps constructed with stellar photometry (e.g. Green et al., 2018, 2019). We assume a uniform $R_{V}=3.1$ although $R_{V}$ varies slightly in the Galaxy with a dispersion of about 0.18 (Schlafly et al., 2016). Such minor variations in $R_{V}$ can lead to small uncertainties in the magnitudes and colors of individual mock quasars (galaxies), but have limited impact on the statistical properties of the training sample because mock sources with large extinctions (and thus large uncertainties caused by $R_{V}$ variations) are removed by the magnitude limits, as we shall see in Section 4.2.

4.2 Synthesizing results and dataset shift

We define the extinction-based selection rate in the Galactic plane as $R=|E_{gm}|/|E|$, where $|E|$ is the cardinality, i.e. the number of elements (sources) of set $E$. The source numbers of the input samples are $|GoodQSO|=289,271$ and $|GoodGal|=1,635,053$; the source numbers of the output samples are $|GoodMockGPQ|=101,482$ and $|GoodMockGal|=771,392$. Therefore the selection rates for GPQs and galaxies are $R_{GPQ}=|GoodMockGPQ|/|GoodQSO|=0.35$ and $R_{Gal}=|GoodMockGal|/|GoodGal|=0.47$, respectively. The selection rate of galaxies is higher than that of GPQs because the input galaxies are on average brighter than the input quasars. After Step 2 in Section 4.1, sources of $MockGPQ$ and $MockGal$ are randomly and evenly distributed in the Galactic plane ($|b|\leq 20^{\circ}$). After Step 4, however, the densities of the remaining sources ($GoodMockGPQ$ and $GoodMockGal$) are inversely related to the dust map (Figure 3): more extragalactic sources remain detectable in regions with smaller $E(B-V)$, and detection voids appear in regions with large $E(B-V)$. A sky survey deeper than PS1 might help fill some fraction of the gap in the middle of the Galactic plane. The $GoodMockGPQ$ sample is sparser than $GoodMockGal$ simply because the input quasars are fewer than the input galaxies. Most sources of $GoodMockGPQ$ and $GoodMockGal$ have a line-of-sight color excess of $E(B-V)<1.5$, which corresponds to an extinction of $A_{V}<4.65$ with $R_{V}=3.1$. The medians of the line-of-sight $E(B-V)$ of $GoodMockGPQ$-PS1 and $GoodQSO$ are 0.21 and 0.03, respectively. In general, the $GoodMockGPQ$-PS1 sample has significantly larger $E(B-V)$ than $GoodQSO$ (see Figure 4 (a)). Therefore the covariate change of color indexes from high-$b$ to low-$b$ regions cannot be ignored. In addition, $GoodMockGPQ$-PS1 sources are fainter than $GoodQSO$ sources (Figure 4 (b)).
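As a quick arithmetic check, the selection rates and the extinction limit quoted above follow directly from the source counts and $A_{V}=R_{V}\,E(B-V)$. The variable names below are ours, introduced only for illustration.

```python
# Source counts quoted in the text.
n_good_qso, n_good_gal = 289_271, 1_635_053
n_mock_gpq, n_mock_gal = 101_482, 771_392

# Extinction-based selection rates R = |E_gm| / |E|.
r_gpq = n_mock_gpq / n_good_qso
r_gal = n_mock_gal / n_good_gal

# A_V corresponding to E(B-V) = 1.5 for R_V = 3.1.
R_V = 3.1
a_v_max = R_V * 1.5

print(f"R_GPQ = {r_gpq:.2f}, R_Gal = {r_gal:.2f}, A_V < {a_v_max:.2f}")
```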

Figure 3: Dust extinction map along the Galactic plane retrieved from Planck14 (top panel), sky density of $GoodMockGPQ$ (middle panel) and $GoodMockGal$ (bottom panel).
Figure 4: Histograms of (a) line-of-sight $E(B-V)$ and (b) $i_{P1}$ band magnitudes of $GoodMockGPQ$-PS1 and $GoodQSO$. The $i_{P1}$ band magnitudes are not corrected for extinction.

A series of color-color diagrams for $GoodQSO$ and $GoodMockGPQ$-PS1 along with SDSS stars and Galactic plane point sources are shown in Figure 5 and Figure 6. In Figure 5, from the left to the middle panels, the covariate change of quasar colors from high Galactic latitude to the Galactic plane can be directly observed. The Galactic reddening makes the cluster of GPQs in a color-color plane extend towards redder colors (to the upper right along the reddening vector) and scatter more than high-$b$ quasars. The scatter is greater in color indexes of bluer bands and smaller in those of redder bands. This trend is also observable in the quasar evolutionary tracks with $E(B-V)=0,0.75,1.5$: from the top to the bottom panels in Figure 5, the distance between two quasar evolutionary tracks with different reddening decreases.

Figure 5: Color-color diagrams of (1a, 2a, 3a) reddening-corrected $GoodQSO$ (color coded dots) and SDSS stars (black dots); (1b, 2b, 3b) $GoodMockGPQ$-PS1 (color coded dots) and a random sample of PS1-AllWISE point sources (black dots) in the Galactic plane ($|b|\leq 20^{\circ}$); and (1c, 2c, 3c) the same sample of PS1-AllWISE point sources in the Galactic plane. In the left panels (1a, 2a, 3a), quasar evolutionary tracks from redshift 0 to 4 without Galactic reddening ($E(B-V)=0$) are shown in red. In the middle panels (1b, 2b, 3b), quasar evolutionary tracks from redshift 0 to 4 with $E(B-V)=0,0.75,1.5$ are displayed. The black arrows (reddening vectors) indicate the direction in which source colors move with increasing $E(B-V)$. Yellow crosses denote points on the quasar evolutionary tracks without Galactic reddening with $z=0,1,2,3,4$ in (1b, 2b) and $z=0,1,1.5$ in (3b). The quasar evolutionary tracks are calculated using the optical composite quasar spectrum from Vanden Berk et al. (2001) and the near-IR composite quasar spectrum from Glikman et al. (2006).

The covariate change of stellar colors is also evident from the color-color diagrams. The stellar loci are simple and clear for high-$b$ (SDSS) stars (see Figure 5 (1a, 2a, 3a)). However, additional spikes along the direction of increasing $E(B-V)$ appear in the stellar loci of Galactic plane stars due to reddening, as can be seen from Figure 5 (1b, 1c, 2b, 2c). Therefore we expect to better distinguish stars from other sources using Galactic plane stars from LAMOST instead of high-$b$ SDSS stars in the training set.

Since mid-IR bands are less sensitive to extinction and reddening, the covariate change of AllWISE colors is less obvious than that of PS1 colors. For instance, in Figure 6 (a), the quasar evolutionary tracks with $E(B-V)=0,0.75,1.5$ stay very close to each other. Since the AllWISE colors of quasars do not change much from high Galactic latitude to the Galactic plane, we can use the $W1-W2$ versus $W2-W3$ color-color diagram to examine the purity of the final quasar candidates, by comparing the probability distributions of the candidates and $GoodMockGPQ$-PS1 sources.

The AllWISE color-color diagram also gives “hardness” information on the classification problems: separating quasars from galaxies is harder than separating quasars from stars. In general, quasars have redder $W1-W2$ and $W2-W3$ colors than stars and galaxies due to the power-law SEDs and hot dust of quasars. From Figure 6 (b), we can recognize the stellar locus ($W1-W2\approx 0$; on the lower-left) and the galaxy locus ($W1-W2\approx 0.5,\ W2-W3\approx 3.5$; on the middle-right) in the density plot. The quasar reference contour line marks a “quasar region” where most quasars are located in the $W1-W2$ versus $W2-W3$ diagram. Most stars are away from the quasar region, while a large number of galaxies enter the quasar region, indicating that such galaxies can contaminate the mid-IR quasar selection.

Figure 6: (a) The $W1-W2$ versus $W2-W3$ color-color diagram of $GoodMockGPQ$-PS1 (color coded dots) and PS1-AllWISE point sources (black dots) in the Galactic plane ($|b|\leq 20^{\circ}$), and (b) the $W1-W2$ versus $W2-W3$ color-color diagram of PS1-AllWISE point sources in the Galactic plane (color coded according to density) with marginal probability distribution plots along the $W1-W2$ and $W2-W3$ axes. In panel (a), contour lines from a 2D Kernel Density Estimation (KDE) of the $GoodMockGPQ$-PS1 sample are plotted, and a reference contour line (in magenta) with a density of 0.02 is specified. Quasar evolutionary tracks that begin at $z=0.6$ (due to the template coverage) and end at $z=7$ with $E(B-V)=0,0.75,1.5$ are displayed. The black arrow (reddening vector) indicates the direction in which source colors move with increasing $E(B-V)$. Gold crosses denote points on the quasar evolutionary tracks without Galactic reddening with $z=1,\cdots,7$. The mean color of quasars with $z<0.1$ is marked with a gold triangle. In panel (b), the same reference contour line of the $GoodMockGPQ$-PS1 sample from panel (a) is shown as a red dashed line; the lowest $W1-W2$ value of the reference contour line is plotted as a black dashed line over the $W1-W2$ marginal distribution plot. The quasar evolutionary tracks are calculated based on the template from Hernán-Caballero et al. (2016).

Some comparisons between mock quasars ($GoodMockGPQ$-PS1) and mock galaxies ($GoodMockGal$-PS1) behind the Galactic plane are also shown in Figure 7. In the color-color diagrams of PS1 bands, galaxies largely overlap with quasars (Figure 7 (a, b, c)), while on the $W1-W2$ versus $W2-W3$ plane these two classes are slightly more separable (Figure 7 (d)). Besides colors, the difference between the PSF magnitude and the Kron (1980) magnitude is often used as a morphological separator (Strauss et al., 2002; Farrow et al., 2014) between galaxies and point sources including quasars. However, separating quasars from galaxies with $i_{\mathrm{PSF}}-i_{\mathrm{Kron}}$ becomes harder at the faint end (Figure 7 (e)), as has been pointed out by Yang et al. (2017). Among all the $\sim 1.6$ million $GoodGal$ sources, $\sim 200$ are point sources with $i_{\mathrm{PSF}}<18$ and $i_{\mathrm{PSF}}-i_{\mathrm{Kron}}<0$, which can also be seen from Figure 7 (e). These point sources include a few quasars with “galaxy” labels and may also include some stars that are misclassified as galaxies. We do not treat these point sources separately because they contribute only a tiny fraction of the whole galaxy sample.

Figure 7: Color-color diagrams of $GoodMockGPQ$-PS1 and $GoodMockGal$-PS1 (a, b, c, d), and $i_{\mathrm{PSF}}-i_{\mathrm{Kron}}$ versus $i_{\mathrm{PSF}}$ plot for the two samples (e). Orange circles represent $GoodMockGPQ$-PS1 sources while blue circles represent $GoodMockGal$-PS1 sources.

To sum up, we examine the properties of $GoodMockGPQ$, $GoodMockGal$ and PS1-AllWISE point-like sources in the color-color spaces. For quasar candidate selection, contamination from both stars and galaxies must be taken into account. Simple PS1 color cuts are only capable of selecting quasars that are away from the stellar loci. Using a series of PS1 colors in a high-dimensional space might help reduce the overlap between the stellar loci and the clusters of quasars and galaxies. Moreover, with AllWISE colors, quasars can be better separated from stars and galaxies. Therefore we expect that the combination of PS1 and AllWISE data will make quasar selection more efficient.

4.3 A rough estimation on the lower limit to the sky density of GPQs

An estimation of the sky density of GPQs will be useful for evaluating the final GPQ candidate sample and the selection method. However, the $GoodMockGPQ$ sky distribution in the middle panel of Figure 3 does not reflect the true density of GPQs for two reasons: (i) the synthesizing process does not consider the source crowdedness and its effects on the photometric data quality; (ii) the source number of $GoodMockGPQ$ only depends on the size of the input $GoodQSO$ sample when the dust extinction map is fixed.

Let the density of $GoodMockGPQ$ be $D_{\mathrm{old}}$; then the relative density of quasars with good photometry across the Galactic plane is:

$$D^{\prime}_{\mathrm{new}}=D_{\mathrm{old}}\times\frac{D_{\mathrm{goodph}}}{D_{\mathrm{all}}} \tag{1}$$

where $D_{\mathrm{all}}$ is the sky density of all PS1-AllWISE sources in the Galactic plane, and $D_{\mathrm{goodph}}$ is the sky density of sources with good photometry as defined in Section 2.1 and Section 2.2. The fraction $D_{\mathrm{goodph}}/D_{\mathrm{all}}$ roughly quantifies the effects of source crowdedness on the photometric quality. We expect that the median sky density of GPQs is no higher than that of $GoodQSO$ ($\mathrm{Median}(D_{\mathrm{new}})\leq\mathrm{Median}(D_{GoodQSO})$), therefore the lower limit “absolute” sky density of GPQs can be computed as:

$$D_{\mathrm{new}}\geq D^{\prime}_{\mathrm{new}}\times\frac{\mathrm{Median}(D_{GoodQSO})}{\mathrm{Median}(D^{\prime}_{\mathrm{new}})} \tag{2}$$

where $\mathrm{Median}(D_{GoodQSO})=20.3~\mathrm{deg}^{-2}$, and $\mathrm{Median}(D^{\prime}_{\mathrm{new}})=2.9~\mathrm{deg}^{-2}$. Figure 8 shows the sky distribution of $D_{\mathrm{all}}$, $D_{\mathrm{goodph}}$, $D_{\mathrm{goodph}}/D_{\mathrm{all}}$ and $D_{\mathrm{new}}$. The estimated $D_{\mathrm{new}}$ has a median of $20.3~\mathrm{deg}^{-2}$ and a maximum of $66.7~\mathrm{deg}^{-2}$.
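The rescaling in Equations (1) and (2) can be illustrated with a short sketch operating on per-pixel density arrays. The function name and the toy input arrays are assumptions made for illustration; only the reference median of $20.3~\mathrm{deg}^{-2}$ is taken from the text. By construction, the rescaled map has the same median as the reference sample.

```python
import numpy as np

MEDIAN_GOODQSO = 20.3  # deg^-2, value quoted in the text


def gpq_density_estimate(d_old, d_goodph, d_all, median_ref=MEDIAN_GOODQSO):
    """Return (D'_new, right-hand side of Eq. 2) for per-pixel densities."""
    d_old = np.asarray(d_old, dtype=float)
    d_goodph = np.asarray(d_goodph, dtype=float)
    d_all = np.asarray(d_all, dtype=float)
    # Fraction of sources with good photometry; empty pixels are set to 0.
    frac = np.divide(d_goodph, d_all, out=np.zeros_like(d_all), where=d_all > 0)
    d_rel = d_old * frac                               # Eq. 1
    d_new = d_rel * median_ref / np.median(d_rel)      # right-hand side of Eq. 2
    return d_rel, d_new


# Toy per-pixel densities (hypothetical numbers, not survey data).
rng = np.random.default_rng(0)
d_all = rng.uniform(1e4, 1e5, 500)
d_goodph = d_all * rng.uniform(0.05, 0.6, 500)
d_old = rng.uniform(1.0, 30.0, 500)

d_rel, d_new = gpq_density_estimate(d_old, d_goodph, d_all)
print("median of rescaled density:", np.median(d_new))
```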

Figure 8: Sky density of all PS1-AllWISE Galactic plane sources (a) and its subset with good photometry (b), fraction of sources with good photometry in the PS1-AllWISE sample (c), and estimated lower limit to the sky density of GPQs (d).

The predicted marginal probability of GPQs in the PS1-AllWISE sample with good photometry is $D_{\mathrm{new}}/D_{\mathrm{goodph}}$, which ranges from $2\times 10^{-4}$ to 0.17 with a median of $3\times 10^{-3}$. The maximum value of 0.17 is not reliable because it occurs at the edges of the HEALPix map ($\delta\sim-30^{\circ}$), where the source count in a pixel does not correspond to the true number in the sky region.

5 GPQ candidate selections with XGBoost

We use XGBoost (Chen & Guestrin, 2016), a scalable tree boosting system, to perform machine learning classification for GPQ selection. XGBoost is an implementation of the original gradient boosting framework (Friedman et al., 2000; Friedman, 2001), known for high efficiency and outstanding performance in machine learning competitions (Chen & Guestrin, 2016). Compared to traditional Gradient Boosting Machines (GBM), XGBoost makes a few improvements at the algorithm level. For example, XGBoost includes regularization terms in the objective function to control the model complexity, which reduces overfitting and improves model generalization; XGBoost is optimized for sparse input data, i.e. data with missing values; and, in addition to the greedy algorithm of Friedman (2001), XGBoost supports a weighted quantile sketch algorithm that can find the optimal split points more efficiently. Moreover, system enhancements for parallelization, tree pruning, and cache optimization have been integrated into XGBoost. Recently, XGBoost has been applied to astronomy and has shown its capability to handle astronomical problems, including identifying Galactic candidates among unassociated sources from the Third Fermi Large Area Telescope (LAT) catalog (3FGL; Acero et al., 2015) (e.g. Mirabal et al., 2016), distinguishing M giants from M dwarfs for spectral surveys (e.g. Yi et al., 2019), and selecting quasar candidates with photometric data (e.g. Jin et al., 2019).

In order to obtain the optimal models, we use optuna (Akiba et al., 2019), a hyperparameter optimization framework to tune the learning hyperparameters. As has been mentioned in Section 3.3, we transform the three-class classification problem into two binary classification problems (stars versus extragalactic objects, and galaxies versus quasars). Under this setting, hyperparameters can be fine-tuned separately for the two classification steps. After classifying the Galactic plane sources with the two classifiers, we may use necessary additional criteria to ensure the purity of GPQ candidates. The classification scheme is shown in Figure 9.

A few evaluation metrics are used in the machine learning process: $Accuracy$, $Precision$, $Recall$, $F_{1}$, $MCC$ (Matthews correlation coefficient) and $AUCPR$ (Area Under the Precision-Recall Curve). With true positives denoted as $TP$, true negatives as $TN$, false positives as $FP$ and false negatives as $FN$, the first five metrics are defined as:

$$Accuracy=\frac{TP+TN}{TP+TN+FP+FN} \tag{3}$$
$$Precision=\frac{TP}{TP+FP} \tag{4}$$
$$Recall=\frac{TP}{TP+FN} \tag{5}$$
$$F_{1}=2\times\frac{Precision\times Recall}{Precision+Recall} \tag{6}$$
$$MCC=\frac{TP\times TN-FP\times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}. \tag{7}$$

The Precision-Recall (PR) curve can be constructed by plotting precision-recall pairs (operating points) that are obtained using different thresholds on a probabilistic or other continuous-output classifier (Boyd et al., 2013). The $AUCPR$ can then be calculated with numerical integration methods.

Among the six metrics, $Accuracy$, $Precision$, $Recall$ and $F_{1}$ are commonly used. However, the $Accuracy$ and $F_{1}$ metrics fail to measure the classification performance correctly under class-imbalanced situations, because they are heavily biased towards the majority class. For example, given a sample with 95 instances from the negative class and 5 from the positive class, simply classifying all instances as negative produces $Accuracy=0.95$ and, when computed for the majority (negative) class, $F_{1}=0.9744$. These two scores are misleading because all the positive instances are wrongly classified while the $Accuracy$ and $F_{1}$ are high. The last two metrics, $MCC$ and $AUCPR$, are considered better evaluation measures in class-imbalanced cases. The $MCC$ takes the four confusion matrix categories ($TP$, $TN$, $FP$, $FN$) into account, and it is high only if the classifier makes good predictions on both the positive and negative classes, independently of their ratios in the overall dataset (Chicco & Jurman, 2020). Studies also suggest that the PR curve is more informative than the better-known Receiver Operator Characteristic (ROC) curve (first recommended by Provost et al., 1998), especially on imbalanced datasets (Davis & Goadrich, 2006; Saito & Rehmsmeier, 2015). $AUCPR$ is useful as a measure of the overall performance of the model.
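The 95-versus-5 example can be reproduced with a small helper implementing Equations (3) to (7). This is a sketch under two assumptions of ours: undefined ratios (zero denominators) are reported as 0, a common convention that the text does not specify, and the scores of 0.95 and 0.9744 are obtained by counting relative to the class to which every instance is assigned (the majority class).

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Accuracy, Precision, Recall, F1 and MCC from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    # Assumption: ratios with a zero denominator are reported as 0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "mcc": mcc}


# Every instance is assigned to the majority class (95 of 100 instances).
# Counting relative to the majority class:
majority_view = binary_metrics(tp=95, tn=0, fp=5, fn=0)
# The same predictions counted relative to the minority class:
minority_view = binary_metrics(tp=0, tn=95, fp=0, fn=5)

for name, scores in [("majority", majority_view), ("minority", minority_view)]:
    print(name, {k: round(v, 4) for k, v in scores.items()})
```

In both views the $MCC$ stays at 0 under this convention, which reflects that the trivial classifier carries no information about the minority class even though $Accuracy$ is 0.95.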

A total of 13 features are chosen for the two classification steps and the later photometric redshift regression, including 11 colors: $g-r$, $r-i$, $i-z$, $z-y$, $g-W1$, $r-W1$, $i-W1$, $z-W1$, $y-W1$, $W1-W2$, and $W2-W3$; and two morphological features: $i-i_{\mathrm{Kron}}$ and $z-z_{\mathrm{Kron}}$. As discussed in Section 4.2, using a set of PS1 colors ($g-r$, $r-i$, $i-z$, and $z-y$) can help reduce the overlap between clusters of quasars and stellar loci on two-dimensional diagrams. Quasars have redder $W1-W2$ and $W2-W3$ colors than stars and galaxies, which makes these two colors good features for quasar selection. Jin et al. (2019) showed that three PS1-AllWISE colors (i.e. $i-W1$, $y-W1$, and $z-W2$) can be used to efficiently distinguish quasars from stars and improve the performance of XGBoost classification. We construct similar colors as features by combining all PS1 bands with $W1$ (i.e. $g-W1$, $r-W1$, $i-W1$, $z-W1$ and $y-W1$), because $W1$ is the most sensitive of the AllWISE bands. These five new colors provide rough optical Spectral Energy Distributions (SEDs) for the objects and can characterize different objects over broader wavelength ranges than other optical colors (e.g. $g-r$) do. The differences between the PSF magnitude and the Kron magnitude in the $i_{P1}$ and $z_{P1}$ bands ($i-i_{\mathrm{Kron}}$ and $z-z_{\mathrm{Kron}}$) are used as morphological features to separate point sources (stars and quasars) from extended sources (galaxies). We convert Vega magnitudes to AB magnitudes for AllWISE data when constructing all the features. As we do not set constraints on the $W3$ magnitude or $W3snr$, some sources may have poor or missing $W3$ (and $W2-W3$) data. Nevertheless, the use of $W2-W3$ will not be a problem, because XGBoost can handle missing values, and data with lower $SNR$ are more informative than missing values.
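A minimal sketch of how the 13 features could be assembled for a single source is given below. The dictionary keys and the function name are hypothetical, and the Vega-to-AB offsets are the commonly quoted WISE values, which we assume here because the text does not list them. A missing $W3$ is represented as NaN, which XGBoost treats as a missing value by default.

```python
import math

# Assumed Vega-to-AB offsets for the WISE bands (commonly quoted values;
# not stated in the text).
VEGA_TO_AB = {"W1": 2.699, "W2": 3.339, "W3": 5.174}


def build_features(m):
    """Build the 13 features from a dict of magnitudes.

    Expected keys (hypothetical naming): PS1 PSF magnitudes 'g', 'r', 'i',
    'z', 'y' (AB), Kron magnitudes 'i_kron', 'z_kron', and AllWISE Vega
    magnitudes 'W1', 'W2', 'W3' (W3 may be NaN when missing).
    """
    w1 = m["W1"] + VEGA_TO_AB["W1"]
    w2 = m["W2"] + VEGA_TO_AB["W2"]
    w3 = m["W3"] + VEGA_TO_AB["W3"]
    return {
        "g-r": m["g"] - m["r"],
        "r-i": m["r"] - m["i"],
        "i-z": m["i"] - m["z"],
        "z-y": m["z"] - m["y"],
        "g-W1": m["g"] - w1,
        "r-W1": m["r"] - w1,
        "i-W1": m["i"] - w1,
        "z-W1": m["z"] - w1,
        "y-W1": m["y"] - w1,
        "W1-W2": w1 - w2,
        "W2-W3": w2 - w3,          # NaN if W3 is missing
        "i-i_kron": m["i"] - m["i_kron"],
        "z-z_kron": m["z"] - m["z_kron"],
    }


# Toy source with a missing W3 measurement (hypothetical magnitudes).
source = {"g": 20.1, "r": 19.8, "i": 19.6, "z": 19.5, "y": 19.4,
          "i_kron": 19.62, "z_kron": 19.51,
          "W1": 15.0, "W2": 14.0, "W3": float("nan")}
features = build_features(source)
print(len(features), "features; W2-W3 =", features["W2-W3"])
```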

Figure 9: Flowchart of GPQ selection and photometric redshift calculation.

5.1 Binary classification for stars and extragalactic objects

In the first classification step, the input data for training and validation consist of the synthetic quasar sample $T_{QSO}$, the synthetic galaxy sample $T_{Gal}$ (see Section 4.1), and the LAMOST Galactic plane star sample $T_{Star}$ (Section 2.6). The input data have more than 3 million rows. For binarization, we assign the label EXT (extragalactic object) to all $T_{QSO}$ and $T_{Gal}$ instances, and keep the label for $T_{Star}$ as STAR. Here we regard extragalactic objects as the positive class and stars as the negative class.

We first apply five-fold cross validations with optuna to find the optimal setting of hyperparameters that minimizes the log loss among 500 trials. For a binary classification problem with a true label $y\in\{0,1\}$ and a probability estimate $p=\mathrm{Pr}(y=1)$, the log loss per sample is the negative log-likelihood of the classifier given the true label:

$$\begin{align}
log\_loss(y,p) &= -\log\mathrm{Pr}(y|p) \tag{8}\\
&= -(y\log(p)+(1-y)\log(1-p)). \tag{9}
\end{align}$$
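For illustration, Equation (9) can be evaluated for a single sample as follows. The natural logarithm and the clipping constant `eps` are our assumptions; the clipping only guards against $\log(0)$.

```python
import math


def log_loss(y, p, eps=1e-15):
    """Per-sample log loss for a true label y in {0, 1} and probability p."""
    # Clip p away from 0 and 1 so that log() stays finite (assumed safeguard).
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


# A confident correct prediction is penalized far less than an uncertain one.
confident = log_loss(1, 0.99)
uncertain = log_loss(1, 0.60)
print(f"confident: {confident:.4f}, uncertain: {uncertain:.4f}")
```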

Then we randomly split the whole input data into a training set and a validation set with a $4:1$ ratio and calculate scores of the six metrics on the validation set. This $4:1$ split ratio is consistent with that of the five-fold cross validations. The large sample size of the input data also ensures that both the training and validation sets have enough samples.

Some fixed parameters in our programs are: objective=binary:logistic; booster=gbtree; tree_method=hist. For hyperparameters that are tuned, the default values, the optimal values found by the cross validations, and the corresponding metric scores are listed in Table 2. During tuning, the number of boosting rounds (num_boost_round, a.k.a. n_estimators in the scikit-learn API of XGBoost) is fixed to 100 and not tuned together with eta (a.k.a. learning_rate), because the effect of increasing num_boost_round can cancel that of decreasing eta, and vice versa. For the final training, we lower the learning rate eta and increase num_boost_round to reduce the generalization error. Classifier No.1 (CLF-1) is trained using $\mathtt{eta}=0.02$ and $\mathtt{num\_boost\_round}=1,200$ with the other optimal parameters in Table 2.

Table 2: Default and optimal hyperparameter settings for CLF-1 (star versus extragalactic object classification)
| Hyperparameter | Default | Optimal |
|---|---|---|
| eta (learning_rate) | 0.3 | 0.3 |
| lambda (reg_lambda) | 1 | 2.32 |
| alpha (reg_alpha) | 0 | 1.13 |
| max_depth | 6 | 9 |
| gamma (min_split_loss) | 0 | 0.60 |
| grow_policy | depthwise | depthwise |
| min_child_weight | 1 | 1 |
| subsample | 1 | 0.96 |
| colsample_bytree | 1 | 0.92 |
| max_delta_step | 0 | 3 |
| $Accuracy$ | 0.9993 | 0.9995 |
| $Precision^{+}$ | 0.9993 | 0.9995 |
| $Recall^{+}$ | 0.9995 | 0.9996 |
| $F_{1}$ | 0.9994 | 0.9996 |
| $MCC$ | 0.9985 | 0.9990 |
| $AUCPR$ | 0.9992 | 0.9995 |
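The final CLF-1 configuration described above can be written as a plain parameter dictionary. This is a sketch of how the fixed parameters, the tuned values from Table 2, and the lowered learning rate might be combined; the variable names are ours, and the training call is shown only as a comment because it requires the xgboost package and a training matrix.

```python
# Parameters kept fixed in all runs, as stated in the text.
FIXED_PARAMS = {
    "objective": "binary:logistic",
    "booster": "gbtree",
    "tree_method": "hist",
}

# Optimal values from Table 2 (eta is handled separately below).
TUNED_CLF1 = {
    "lambda": 2.32,
    "alpha": 1.13,
    "max_depth": 9,
    "gamma": 0.60,
    "grow_policy": "depthwise",
    "min_child_weight": 1,
    "subsample": 0.96,
    "colsample_bytree": 0.92,
    "max_delta_step": 3,
}

# Final training uses a lower learning rate and more boosting rounds.
params_clf1 = {**FIXED_PARAMS, **TUNED_CLF1, "eta": 0.02}
NUM_BOOST_ROUND_CLF1 = 1200

# With the xgboost package installed, these would typically be passed as
#   xgboost.train(params_clf1, dtrain, num_boost_round=NUM_BOOST_ROUND_CLF1)
# where dtrain is an xgboost.DMatrix built from the training features.
print(params_clf1)
```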

We then classify the PS1-AllWISE point-like sources with CLF-1. To exclude as many stars as possible, we adopt a high threshold on $p_{\mathrm{EXT}}$ (the model-predicted probability of a source being extragalactic) to select extragalactic candidates. Sources with $p_{\mathrm{EXT}}>0.99$ are labeled as EXT, and the others are labeled as STAR and removed.

5.2 Binary classification for galaxies and quasars

We use the $T_{QSO}$ and $T_{Gal}$ samples as input data for training and validation in the second classification step. Here we regard quasars as the positive class and galaxies as the negative class.

The same processes of parameter tuning and training as those of CLF-1 are applied to build CLF-2. We keep the same fixed parameters: objective=binary:logistic; booster=gbtree; tree_method=hist. For hyperparameters that are tuned, the default values, the optimal values found by the cross validations, and the corresponding metric scores are listed in Table 3. CLF-2 is trained using $\mathtt{eta}=0.02$ and $\mathtt{num\_boost\_round}=1,500$ with the other optimal parameters in Table 3.

The optimal scores of the six metrics in Table 3 are all lower than those in Table 2, indicating that the quasar-galaxy classification is “harder” than the star-extragalactic classification. Here we also use a high probability threshold to select sources of our target class. We classify the sources labeled as EXT with CLF-2. Sources with $p_{\mathrm{QSO}}>0.95$ are kept as GPQ candidates, where $p_{\mathrm{QSO}}$ is the probability of a source being a quasar predicted by the XGBoost model.

Table 3: Default and optimal hyperparameter settings for CLF-2 (quasar versus galaxy classification)
| Hyperparameter | Default | Optimal |
|---|---|---|
| eta (learning_rate) | 0.3 | 0.2 |
| lambda (reg_lambda) | 1 | 2.32 |
| alpha (reg_alpha) | 0 | 1.10 |
| max_depth | 6 | 9 |
| gamma (min_split_loss) | 0 | 0.81 |
| grow_policy | depthwise | depthwise |
| min_child_weight | 1 | 2 |
| subsample | 1 | 0.95 |
| colsample_bytree | 1 | 0.86 |
| max_delta_step | 0 | 6 |
| $Accuracy$ | 0.9969 | 0.9974 |
| $Precision^{+}$ | 0.9894 | 0.9905 |
| $Recall^{+}$ | 0.9894 | 0.9918 |
| $F_{1}$ | 0.9894 | 0.9912 |
| $MCC$ | 0.9875 | 0.9896 |
| $AUCPR$ | 0.9938 | 0.9951 |
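The two-step selection of Sections 5.1 and 5.2 amounts to a cascade of probability thresholds. The sketch below assumes that the predicted probabilities of both classifiers are already available as arrays; the function name and the toy probabilities are hypothetical. In practice CLF-2 only needs to be evaluated on sources that pass the first threshold.

```python
import numpy as np


def select_gpq_candidates(p_ext, p_qso, ext_threshold=0.99, qso_threshold=0.95):
    """Return boolean masks for EXT sources and for GPQ candidates."""
    p_ext = np.asarray(p_ext, dtype=float)
    p_qso = np.asarray(p_qso, dtype=float)
    is_ext = p_ext > ext_threshold                 # CLF-1: star vs. extragalactic
    is_candidate = is_ext & (p_qso > qso_threshold)  # CLF-2: galaxy vs. quasar
    return is_ext, is_candidate


# Toy probabilities for four sources (hypothetical values).
p_ext = [0.999, 0.995, 0.90, 0.9999]
p_qso = [0.97, 0.50, 0.99, 0.951]
is_ext, is_candidate = select_gpq_candidates(p_ext, p_qso)
print(is_ext, is_candidate)
```

The third toy source illustrates why the order matters: it has a high $p_{\mathrm{QSO}}$ but is rejected because it fails the $p_{\mathrm{EXT}}$ threshold.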

5.3 Additional cut based on Gaia proper motion to remove stellar contaminants

In the first classification process, we classify all PS1-AllWISE point-like sources into stars and extragalactic objects. We ignore stars in the second classification step. Although the metrics of CLF-1 are high (Table 2), some stars can be misclassified as extragalactic objects, and then be classified as either quasars or galaxies. Faint stars are more likely to be misclassified than bright stars because stars in the training sample ($T_{Star}$) are biased towards the bright end. When using optical and near-IR colors for candidate selection, white dwarfs are major contaminants for low redshift quasars, and M/L/T dwarfs are typical contaminants for high redshift quasars (e.g. Kirkpatrick et al., 1997; Vennes et al., 2002; Chiu et al., 2006). In the mid-IR regime, potential stellar contaminants for quasars are Young Stellar Objects (YSO), Asymptotic Giant Branch (AGB) stars, and Planetary Nebulae (PNe) (Kozłowski & Kochanek, 2009; Koenig & Leisawitz, 2014; Assef et al., 2018).

The YSOs are stars at the early stages of evolution, and are often divided into four subclasses (Lada, 1987): Class I, Class II, Flat spectrum, and Class III. Among them, Class II and Flat spectrum YSOs are the most likely contaminants since they have optical and mid-IR SEDs similar to those of quasars. Since we require both optical and mid-IR detections for classification, optically faint Class I YSOs are eliminated in the first place. As shown by Koenig & Leisawitz (2014), Class III YSOs are clustered around $W2-W3=0$ and $W1-W2=0$, while Class I, II, and Flat spectrum YSOs occupy the region approximately with $W1-W2>0.25$ and $1.0<W2-W3<4.5$ (see their Figure 5). The latter YSO region overlaps with the quasar region shown in Figure 6, so this contamination must be taken into account.

AGB stars are evolved stars with low temperatures and high luminosities. They are surrounded by circumstellar envelopes, and IR excess exists in their broad SEDs. According to Figure 5 from Koenig & Leisawitz (2014), only a minority of AGB stars actually overlap with Class I, II, and Flat spectrum YSOs (and thus quasars) in the $W1-W2$ versus $W2-W3$ diagram. Therefore we expect the contamination from AGB stars to be less than that from YSOs.

The PNe have a series of narrow emission lines as well as IR excess. Known PNe can be later removed from the GPQ candidate sample by cross-matching with the Simbad database (Wenger et al., 2000).

In order to remove stellar contaminants such as white dwarfs, M/L/T dwarfs, YSOs, and AGB stars from GPQ candidates, we apply an additional cut based on Gaia proper motion, because the proper motion distribution of quasars is different from that of Milky Way stars. Although quasars should have negligible transverse motions, non-zero proper motions are measured for them by Gaia due to various effects, such as photocenter variability of quasars (see Bachchan et al., 2016, and references therein). In addition, proper motions with large uncertainties are not reliable. Therefore we need a probabilistic cut instead of a cut on the total proper motion. We define the probability density of zero proper motion ($f_{\mathrm{PM0}}$) of a source based on the bivariate normal distribution of proper motion measurements of the source as:

$$f_{\mathrm{PM0}}=\frac{1}{2\pi\sigma_{x}\sigma_{y}\sqrt{1-\rho^{2}}}\exp\left\{-\frac{1}{2(1-\rho^{2})}\left[\left(\frac{x}{\sigma_{x}}\right)^{2}-\frac{2\rho xy}{\sigma_{x}\sigma_{y}}+\left(\frac{y}{\sigma_{y}}\right)^{2}\right]\right\} \tag{10}$$

where $x=\mathrm{pmra}$, $y=\mathrm{pmdec}$, and $\rho=\mathrm{pmra\_pmdec\_corr}$ (the correlation coefficient between pmra and pmdec) are obtained from the Gaia DR2 catalog, while $\sigma_{x}$ and $\sigma_{y}$ are the true external proper motion uncertainties calculated with the method suggested by Lindegren et al. (2018a, b). The external proper motion uncertainty can be expressed as $\sigma_{\mathrm{ext}}=(k^{2}\sigma_{i}^{2}+\sigma_{s}^{2})^{\frac{1}{2}}$, where $\sigma_{\mathrm{ext}}$ can be $\sigma_{x}$ or $\sigma_{y}$, $k=1.08$ is a multiplicative factor, $\sigma_{i}$ is the catalog uncertainty (pmra_error or pmdec_error), and $\sigma_{s}$ is the systematic error. For bright sources ($G<13$), $\sigma_{s}=0.032~\mathrm{mas/yr}$; for faint sources ($G>13$), $\sigma_{s}=0.066~\mathrm{mas/yr}$. At the same uncertainty level, sources with smaller proper motions have higher $f_{\mathrm{PM0}}$ by definition.
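Equation (10) and the external uncertainty formula can be sketched with NumPy as follows. The function names are ours, the arguments mirror the Gaia DR2 column names mentioned above, and sources with exactly $G=13$ are assigned the faint-source systematic error, which is an assumption since the text does not specify that boundary case.

```python
import numpy as np


def external_pm_error(sigma_i, g_mag, k=1.08):
    """Inflate a catalog proper-motion error to the external uncertainty."""
    # Systematic term in mas/yr; G == 13 is treated as faint (assumption).
    sigma_s = np.where(np.asarray(g_mag) < 13.0, 0.032, 0.066)
    return np.sqrt(k**2 * np.asarray(sigma_i) ** 2 + sigma_s**2)


def f_pm0(pmra, pmdec, pmra_error, pmdec_error, corr, g_mag):
    """Probability density of zero proper motion (Eq. 10)."""
    sx = external_pm_error(pmra_error, g_mag)
    sy = external_pm_error(pmdec_error, g_mag)
    one_minus_rho2 = 1.0 - np.asarray(corr) ** 2
    quad = ((pmra / sx) ** 2
            - 2.0 * corr * pmra * pmdec / (sx * sy)
            + (pmdec / sy) ** 2)
    norm = 2.0 * np.pi * sx * sy * np.sqrt(one_minus_rho2)
    return np.exp(-quad / (2.0 * one_minus_rho2)) / norm


# Toy faint sources with identical errors: one at rest, one moving (mas/yr).
at_rest = f_pm0(0.0, 0.0, 1.0, 1.0, 0.0, 18.0)
moving = f_pm0(5.0, -5.0, 1.0, 1.0, 0.0, 18.0)
print(np.log10(at_rest), np.log10(moving))
```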

We take the logarithm of $f_{\mathrm{PM0}}$ for better comparison between samples. Figure 10 shows the distributions of $\mathrm{log}(f_{\mathrm{PM0}})$ of the stars, galaxies and quasars used in this study. For stellar samples, in addition to $T_{Star}$ (the LAMOST Galactic plane star sample), a subsample of the SDSS Stripe 82 Standard Star Catalog (hereafter S82 star; Ivezić et al., 2007) that meets the same constraints in Section 2.1 and Section 2.2 is also included for comparison. We choose a $\mathrm{log}(f_{\mathrm{PM0}})\geq-4$ cut that excludes 94.1% of both LAMOST Galactic plane stars and S82 stars, while retaining 99.8% of the quasars. Nevertheless, faint stars can be major contaminants even with such a strict cut on $\mathrm{log}(f_{\mathrm{PM0}})$.

Figure 10: Histograms of $\mathrm{log}(f_{\mathrm{PM0}})$ of $T_{Star}$ (LAMOST Galactic plane star), $GoodGal$ (from SDSS galaxy), $GoodQSO$ (from SDSS DR14Q), and sources from the SDSS Stripe 82 Standard Star Catalog. Because $f_{\mathrm{PM0}}$ is a probability density, which can be greater than 1 (the integral of the probability density function over the entire space is equal to 1), $\mathrm{log}(f_{\mathrm{PM0}})$ can have positive values.

We calculate $\mathrm{log}(f_{\mathrm{PM0}})$ for GPQ candidates after cross-matching them with Gaia DR2. For candidates without Gaia DR2 proper motion records, we assign a default value of 99 for $\mathrm{log}(f_{\mathrm{PM0}})$, which keeps them above the threshold. Sources with $\mathrm{log}(f_{\mathrm{PM0}})\geq-4$ are kept as reliable GPQ candidates.
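The handling of sources without proper motions can be sketched as follows. The function name is hypothetical, and representing missing Gaia records as NaN is an assumption of this example; only the default value of 99 and the threshold of $-4$ come from the text.

```python
import numpy as np


def passes_pm_cut(log_fpm0, threshold=-4.0, missing_value=99.0):
    """Boolean mask of sources kept by the log(f_PM0) >= threshold cut."""
    values = np.asarray(log_fpm0, dtype=float)
    # Assumption: candidates without Gaia DR2 proper motions arrive as NaN.
    filled = np.where(np.isnan(values), missing_value, values)
    return filled >= threshold
```

Because the default value is far above the threshold, candidates with no Gaia counterpart are never rejected by this step.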

6 Photometric redshift estimation for GPQ candidates

Measuring redshifts is an important step for quasar surveys. For quasar candidates, photometric redshift (photo-$z$) estimation is key to follow-up studies. Many different approaches have been proposed for calculating photo-$z$s of quasars, including quasar template fitting (e.g. Budavári et al., 2001; Babbedge et al., 2004; Salvato et al., 2009), the empirical color-redshift relation (e.g. Richards et al., 2001; Weinstein et al., 2004; Wu et al., 2004; Wu & Jia, 2010; Wu et al., 2012), machine learning (e.g. Yèche et al., 2010; Laurino et al., 2011; Brescia et al., 2013; Zhang et al., 2013; Pasquet-Itam & Pasquet, 2018), the XDQSOz method (Bovy et al., 2012), and the Skew-QSO method (Yang et al., 2017). As photo-$z$ estimation can be naturally cast as a regression problem in machine learning, we also use XGBoost to train a regression model and predict photo-$z$s for our reliable GPQ candidates.

To build the training set and validation set, we randomly split the de-reddened $GoodQSO$ sample with a ratio of 4:1. Our application set (reliable GPQ candidates) is also de-reddened. The same 13 features as those in Section 5 are used for photo-$z$ regression: $g-r$, $r-i$, $i-z$, $z-y$, $g-W1$, $r-W1$, $i-W1$, $z-W1$, $y-W1$, $W1-W2$, $W2-W3$, $i-i_{\mathrm{Kron}}$ and $z-z_{\mathrm{Kron}}$. The morphological features $i-i_{\mathrm{Kron}}$ and $z-z_{\mathrm{Kron}}$ are included as they may help distinguish quasars at different cosmological distances. To obtain the optimal model, we also tune the hyperparameters with five-fold cross-validation using optuna.
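The feature construction and the 4:1 split can be sketched as below. This is only an illustration under stated assumptions: the dictionary keys, function names, and random seed are hypothetical, the input magnitudes are assumed to be de-reddened already, and the XGBoost training and optuna tuning steps are not reproduced here.

```python
import numpy as np

# The 13 features listed in the text, written as "A-B" magnitude differences.
FEATURES = [
    "g-r", "r-i", "i-z", "z-y",
    "g-W1", "r-W1", "i-W1", "z-W1", "y-W1",
    "W1-W2", "W2-W3",
    "i-iKron", "z-zKron",
]


def build_features(mags):
    """Return an (n_sources, 13) array from a dict of magnitude arrays.

    `mags` maps hypothetical keys such as "g", "W1" or "iKron" to
    equal-length arrays of de-reddened magnitudes.
    """
    columns = []
    for name in FEATURES:
        first, second = name.split("-")
        columns.append(np.asarray(mags[first], dtype=float)
                       - np.asarray(mags[second], dtype=float))
    return np.column_stack(columns)


def split_4_to_1(n_sources, seed=0):
    """Random 4:1 split; returns (training indices, validation indices)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_sources)
    n_train = (4 * n_sources) // 5
    return order[:n_train], order[n_train:]
```

The resulting feature matrix and index arrays could then be passed to any regressor; the choice of model and its tuning follow the text rather than this sketch.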

Figure 11: Photometric redshift obtained with the XGBoost regression model against spectral redshift of the de-reddened validation set with 57,855 quasars. The red dashed line denotes $z_{\mathrm{phot}}=z_{\mathrm{spec}}$ and the blue dotted lines mark the margin within one RMSE from the red dashed line.

The performance of the XGBoost photo-$z$ regression model on the validation set can be examined in the $z_{\mathrm{phot}}$-$z_{\mathrm{spec}}$ (photometric redshift versus spectral redshift) plot (Figure 11) or with two quantities: the root-mean-square error (RMSE) and the photo-$z$ accuracy. For a validation set with sample size $n$, the root-mean-square error is $\mathrm{RMSE}=\sqrt{{\sum^{n}_{i=1}(z_{\mathrm{phot},i}-z_{\mathrm{spec},i})^{2}}/{n}}$. On our validation set with a sample size of 57,855, the RMSE is 0.35. The photo-$z$ accuracy $R_{0.1}$ is defined as the fraction of quasars with $|\Delta z|\leq 0.1$, where $|\Delta z|=|z_{\mathrm{spec}}-z_{\mathrm{phot}}|/(1+z_{\mathrm{spec}})$. Our XGBoost regression model yields a photo-$z$ accuracy of 74% on the validation set, which is comparable to that of Yang et al. (2017) on PS1 and WISE data (79%). Yang et al. (2017) adopted a multivariate Skew-t model and prior probabilities from the quasar luminosity function (QLF) to achieve the high photo-$z$ accuracy. Figure 12 shows the photo-$z$ accuracy $R_{0.1}$ as a function of spectral redshift (left panel) and de-reddened $i_{P1}$-band magnitude (right panel). $R_{0.1}$ has maxima at $z\approx 2.3$ and $z\approx 4$, and reaches a minimum at $z\approx 3$. Most $z\approx 3$ quasars have underestimated photometric redshifts (see Figure 11) due to a degeneracy of broad-band photometry in response to quasar SEDs at different redshifts. The strong Ly$\alpha$ emission line enters the $g_{P1}$ band at $z\approx 2.4$ and moves into the $r_{P1}$ band at $z\approx 3.5$, which leads to a large excess in $g_{P1}$ magnitudes and hence similar PS1 colors of quasars within $2.4\lesssim z\lesssim 3.5$. This kind of degeneracy can be alleviated if SDSS $u$-band data are available to characterize the Lyman limit systems (see Section 4 of Yang et al., 2017). The photo-$z$ accuracy improves at $z\gtrsim 3.5$, because the Lyman limit enters the $g_{P1}$ band. $R_{0.1}$ also drops at low redshift ($z<1$), and at both the bright and faint ends, because the training sample is biased towards intermediate redshifts and magnitudes.
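The two performance measures defined above can be computed as in the following minimal sketch. The function names are hypothetical and the redshift values in the usage note are made up for illustration; they are not taken from the validation set.

```python
import numpy as np


def rmse(z_phot, z_spec):
    """Root-mean-square error between photometric and spectral redshifts."""
    z_phot = np.asarray(z_phot, dtype=float)
    z_spec = np.asarray(z_spec, dtype=float)
    return float(np.sqrt(np.mean((z_phot - z_spec) ** 2)))


def photoz_accuracy(z_phot, z_spec, tolerance=0.1):
    """Fraction of sources with |z_spec - z_phot| / (1 + z_spec) <= tolerance."""
    z_phot = np.asarray(z_phot, dtype=float)
    z_spec = np.asarray(z_spec, dtype=float)
    delta_z = np.abs(z_spec - z_phot) / (1.0 + z_spec)
    return float(np.mean(delta_z <= tolerance))
```

For example, with the toy inputs `z_spec = [1.0, 2.0, 3.0]` and `z_phot = [1.1, 2.0, 2.0]`, two of the three sources satisfy the $|\Delta z|\leq 0.1$ criterion, so `photoz_accuracy` returns 2/3, while the single large outlier dominates the RMSE.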

Figure 12: Photo-$z$ accuracy $R_{0.1}$ (the fraction of quasars with $|\Delta z|\leq 0.1$, where $|\Delta z|=|z_{\mathrm{spec}}-z_{\mathrm{phot}}|/(1+z_{\mathrm{spec}})$) as a function of redshift (left panel) and magnitude (de-reddened; right panel).

7 The GPQ candidate catalog

7.1 Validation of the GPQ candidates with Simbad, Milliquas, and SDSS DR16Q

With our Transfer Learning Framework and the aforementioned additional selection criteria, we obtain a reliable GPQ candidate sample with 161,532 sources from PS1 and AllWISE. We cross-match the GPQ candidates with the Simbad database (Wenger et al., 2000), and find 2,786 matches. The object types and a summary are shown in Table 4.

We categorize all matched sources into four groups: AGN/QSO, star (including PNe and PN candidates), galaxy, and other-type objects. Among all the matches, 53.98% (1,504) are recorded as AGN/QSO (including candidates), 8.97% (250) are recorded as star (including candidates), 4.02% (112) are recorded as galaxy, and 33.02% (920) are other-type objects labeled according to their detection properties (e.g. wavelength). Those other-type objects are more likely to be AGN/QSO than stars, as most (728+27) of them are radio sources, and 64 are X-ray sources (see Table 4). Of the 4.02% of sources labeled as galaxy, a number may also host AGN/QSO, as we have applied careful selection criteria to remove possible galaxy contaminants. Among the 250 sources labeled as star, 40 are candidates, and the other 210 are known stars. 101 of the known stars were once selected as QSO candidates using SDSS photometry, and then identified as stars by the 2dF-SDSS LRG and QSO Survey (Croom et al., 2009). From these analyses, we conclude that the purity of our GPQ candidates on the small subset of 2,786 Simbad matches can be as high as $\sim 90$%. The true purity may vary at different locations in the Galactic plane.

Table 4: Matching results of GPQ candidates and Simbad database
| AGN/QSO | Number | Star | Number | Galaxy | Number | Other types | Number |
|---|---|---|---|---|---|---|---|
| QSO | 1121 | Star | 175 | Galaxy | 106 | Radio source | 728 |
| AGN candidate | 150 | YSO Candidate | 22 | Radio galaxy | 3 | IR source | 77 |
| BL Lac object | 143 | YSO | 13 | Brightest galaxy in a cluster (BCG) | 2 | X-ray source | 64 |
| Seyfert 1 galaxy | 38 | Cataclysmic binary candidate | 7 | Cluster of galaxies | 1 | Centimetric radio source | 27 |
| AGN | 31 | Planetary Nebula (PN) | 5 | | | Blue object | 22 |
| Other subclasses | 21 | Other subclasses | 28 | | | Far-IR source ($\lambda\geq 30\,\mu\mathrm{m}$) | 2 |
| Total | 1504 | Total | 250 | Total | 112 | Total | 920 |
| Fraction | 53.98% | Fraction | 8.97% | Fraction | 4.02% | Fraction | 33.02% |
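The fractions in Table 4 can be reproduced from the group totals with the short check below. The final quantity, the share of matches not labeled as star, is an illustrative upper bound under the assumption that every non-stellar match is a genuine quasar; it is not necessarily how the $\sim 90$% purity quoted in the text was derived.

```python
# Group totals taken from Table 4 (Simbad matches of the GPQ candidates).
simbad_matches = {"AGN/QSO": 1504, "star": 250, "galaxy": 112, "other": 920}

total_matches = sum(simbad_matches.values())

# Percentage of each group among all Simbad matches.
fractions = {
    group: 100.0 * count / total_matches
    for group, count in simbad_matches.items()
}

# Assumption of this sketch: treat every match not labeled as star as a quasar.
non_star_share = 100.0 * (total_matches - simbad_matches["star"]) / total_matches

for group, percent in fractions.items():
    print(f"{group:8s} {percent:6.2f}%")
print(f"not labeled as star: {non_star_share:.2f}%")
```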

As another test of stellar contamination, we cross-match our GPQ candidates with the LAMOST Galactic plane star sample. This match identifies 29 LAMOST stars, none of which is recorded in Simbad. Therefore the total number of known stars in the GPQ candidates is 239 (210 from Simbad and 29 from LAMOST).

We also examine the fraction of known GPQs that can be recovered with our candidate table. The known GPQ sample is $MLQSUB$ with 1,853 sources, which is retrieved from the Milliquas catalog and described in Section 2.7. $MLQSUB$ is selected with the same constraints as those on our application PS1-AllWISE data, to obtain a consistent analysis result. Cross-matching $MLQSUB$ with our GPQ candidates results in 1,763 matches, meaning that 95.14% of GPQs from Milliquas can be selected with our methods under the same quality constraints on the photometric data. The recent sixteenth data release of the SDSS Quasar Catalog (DR16Q; Lyke et al., 2020) has a total of 750,414 sources, of which 3,727 are located at $|b|<20^{\circ}$. Only 1,320 of these SDSS GPQs meet the photometric quality constraints in Section 2.1 and Section 2.2. Cross-matching our GPQ candidates with SDSS DR16Q gives 1,292 matches, which corresponds to a recall rate of 97.88% under the same photometric quality constraints, or a recall rate of 34.67% relative to all 3,727 SDSS GPQs. The overall completeness of the candidate sample is mainly limited by the photometric quality constraints.
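The recall rates quoted above follow directly from the match counts, as the following arithmetic check shows. The helper name is hypothetical; all counts are taken from the text.

```python
def recall_percent(n_recovered, n_known):
    """Percentage of known quasars that appear among the candidates."""
    return 100.0 * n_recovered / n_known


# Milliquas subsample (MLQSUB) under the photometric quality constraints.
recall_milliquas = recall_percent(1763, 1853)

# SDSS DR16Q quasars at |b| < 20 deg that pass the same quality constraints.
recall_dr16q_good = recall_percent(1292, 1320)

# Same matches, but relative to all DR16Q quasars at |b| < 20 deg.
recall_dr16q_all = recall_percent(1292, 3727)

print(f"{recall_milliquas:.2f}% {recall_dr16q_good:.2f}% {recall_dr16q_all:.2f}%")
```

The gap between the last two numbers shows how strongly the photometric quality constraints, rather than the classifier, limit the overall completeness.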

7.2 Description of the GPQ candidate catalog

We remove 239 known stars (see Section 7.1) from our GPQ candidate sample. We then match the remaining GPQ candidates by coordinates with TOPCAT internally, and find 347 close pairs within $0.2^{\prime\prime}$. These pairs are very likely duplicated sources, because the PS1 survey cannot resolve two sources within an angular distance of $0.2^{\prime\prime}$: the median image quality for the PS1 $3\pi$ survey is FWHM = (1.31, 1.19, 1.11, 1.07, 1.02) arcseconds for ($grizy_{\mathrm{P1}}$) (Magnier et al., 2016). Therefore we keep only one source for each close pair, and obtain the final GPQ candidate sample with 160,946 sources. The GPQ candidate catalog is compiled based on this sample, with photometric data from PS1 DR1 and AllWISE, and astrometric data from Gaia DR2. The descriptions for the catalog are displayed in Table 5.
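The bookkeeping and the duplicate removal can be illustrated with the sketch below. It is not the TOPCAT internal match used in this work: the function name is hypothetical, and the brute-force pairwise loop is only practical for small arrays.

```python
import numpy as np


def keep_one_per_close_pair(ra_deg, dec_deg, radius_arcsec=0.2):
    """Boolean mask that keeps the first member of every close pair."""
    ra = np.radians(np.asarray(ra_deg, dtype=float))
    dec = np.radians(np.asarray(dec_deg, dtype=float))
    radius = np.radians(radius_arcsec / 3600.0)
    keep = np.ones(ra.size, dtype=bool)
    for i in range(ra.size - 1):
        if not keep[i]:
            continue
        # Haversine formula for the separation to all later sources.
        d_ra = ra[i + 1:] - ra[i]
        d_dec = dec[i + 1:] - dec[i]
        a = (np.sin(d_dec / 2.0) ** 2
             + np.cos(dec[i]) * np.cos(dec[i + 1:]) * np.sin(d_ra / 2.0) ** 2)
        separation = 2.0 * np.arcsin(np.sqrt(a))
        # Drop later sources that fall within the matching radius.
        keep[i + 1:] &= separation > radius
    return keep


# Counts quoted in the text: initial sample, known stars, close pairs.
n_final = 161_532 - 239 - 347
```

With the counts from the text, `n_final` evaluates to 160,946, the size of the final catalog.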

Table 5: Contents of the GPQ candidate catalog
| Column | Units | Label | Explanations |
|---|---|---|---|
| 1 | | Designation | Catalog designation hhmmss.ss+ddmmss.s (J2000) based on Pan-STARRS1 (PS1) coordinates |
| 2 | deg | ra | PS1 right ascension in decimal degrees (J2000) (weighted mean) at mean epoch |
| 3 | deg | dec | PS1 declination in decimal degrees (J2000) (weighted mean) at mean epoch |
| 4 | deg | l | Galactic longitude in decimal degrees |
| 5 | deg | b | Galactic latitude in decimal degrees |
| 6 | | photoz | Photometric redshift predicted with XGBoost regressor |
| 7 | | p_star | Probability of the object to be a star, predicted by the first XGBoost classifier, a.k.a. $p_{\mathrm{star}}$ (p_star+p_ext=1) |
| 8 | | p_ext | Probability of the object to be an extragalactic object, predicted by the first XGBoost classifier, a.k.a. $p_{\mathrm{ext}}$ (p_star+p_ext=1) |
| 9 | | p2_gal | Probability of the object to be a galaxy, predicted by the second XGBoost classifier, a.k.a. $p_{\mathrm{gal}}$ (p2_gal+p2_qso=1) |
| 10 | | p2_qso | Probability of the object to be a quasar, predicted by the second XGBoost classifier, a.k.a. $p_{\mathrm{QSO}}$ (p2_gal+p2_qso=1) |
| 11 | | fpm0 | Probability density of zero proper motion ($f_{\mathrm{PM0}}$) of the source |
| 12 | | log_fpm0 | The logarithm of fpm0 ($\mathrm{log}(f_{\mathrm{PM0}})$) |
| 13 | mag | ebv | Line-of-sight $E(B-V)$ given by the Planck14 dust map |
| 14 | | PS_objID | Pan-STARRS1 (PS1) unique object identifier |
| 15 | mag | gmag | Mean PSF AB magnitude from PS1 g filter detections |
| 16 | mag | e_gmag | Error in gmag |
| 17 | mag | gKmag | Mean Kron AB magnitude from PS1 g filter detections |
| 18 | mag | e_gKmag | Error in gKmag |
| 19 | mag | rmag | Mean PSF AB magnitude from PS1 r filter detections |
| 20 | mag | e_rmag | Error in rmag |
| 21 | mag | rKmag | Mean Kron AB magnitude from PS1 r filter detections |
| 22 | mag | e_rKmag | Error in rKmag |
| 23 | mag | imag | Mean PSF AB magnitude from PS1 i filter detections |
| 24 | mag | e_imag | Error in imag |
| 25 | mag | iKmag | Mean Kron AB magnitude from PS1 i filter detections |
| 26 | mag | e_iKmag | Error in iKmag |
| 27 | mag | zmag | Mean PSF AB magnitude from PS1 z filter detections |
| 28 | mag | e_zmag | Error in zmag |
| 29 | mag | zKmag | Mean Kron AB magnitude from PS1 z filter detections |
| 30 | mag | e_zKmag | Error in zKmag |
| 31 | mag | ymag | Mean PSF AB magnitude from PS1 y filter detections |
| 32 | mag | e_ymag | Error in ymag |
| 33 | mag | yKmag | Mean Kron AB magnitude from PS1 y filter detections |
| 34 | mag | e_yKmag | Error in yKmag |
| 35 | | AllWISE_ID | AllWISE unique source ID |
| 36 | mag | W1mag | W1 (Vega) magnitude (3.35 $\mu$m) |
| 37 | mag | e_W1mag | Mean error on W1 magnitude |
| 38 | mag | W2mag | W2 (Vega) magnitude (4.6 $\mu$m) |
| 39 | mag | e_W2mag | Mean error on W2 magnitude |
| 40 | mag | W3mag | W3 (Vega) magnitude (11.6 $\mu$m) |
| 41 | mag | e_W3mag | Mean error on W3 magnitude |
| 42 | mag | W4mag | W4 (Vega) magnitude (22.1 $\mu$m) |
| 43 | mag | e_W4mag | Mean error on W4 magnitude |
| 44 | mag | Jmag | 2MASS J (Vega) magnitude (1.25 $\mu$m) |
| 45 | mag | e_Jmag | Mean error on J magnitude |
| 46 | mag | Hmag | 2MASS H (Vega) magnitude (1.65 $\mu$m) |
| 47 | mag | e_Hmag | Mean error on H magnitude |
| 48 | mag | Kmag | 2MASS Ks (Vega) magnitude (2.17 $\mu$m) |
| 49 | mag | e_Kmag | Mean error on Ks magnitude |
| 50 | | Gaia_source_id | Gaia DR2 unique source identifier |
| 51 | mas | parallax | Gaia DR2 parallax |
| 52 | mas | parallax_error | Standard error of parallax |
| 53 | mas/yr | pmra | Gaia DR2 proper motion in right ascension direction |
| 54 | mas/yr | pmra_error | Standard error of pmra |
| 55 | mas/yr | pmdec | Gaia DR2 proper motion in declination direction |
| 56 | mas/yr | pmdec_error | Standard error of pmdec |
| 57 | | pmra_pmdec_corr | Correlation between pmra and pmdec |
| 58 | mas/yr | pmra_error_ext | True external uncertainty of pmra |
| 59 | mas/yr | pmdec_error_ext | True external uncertainty of pmdec |
| 60 | | sb_main_id | Main identifier for an object in Simbad database |
| 61 | | sb_main_type | Main object type for an object in Simbad database |
| 62 | | sb_redshift | Redshift of an object recorded in Simbad database |

Note. — This table is published in its entirety in the machine-readable format.

The sky density of sources from the GPQ candidate catalog is shown in Figure 13. In general, the sky distribution of the GPQ candidates is consistent with the prediction in Section 4.3. The highest sky density of the candidates is 72.7 $\mathrm{deg}^{-2}$, which is slightly higher than the estimation (66.7 $\mathrm{deg}^{-2}$). The median density is 16.7 $\mathrm{deg}^{-2}$, which is comparable to but lower than the estimated value (or the median density of $GoodQSO$). As can be seen from Figure 13, the sky densities of GPQ candidates at $|b|\lesssim 10^{\circ}$ are lower than those of the estimation in Figure 8 (d), which indicates that the modelling process overestimates the sky density of GPQs at lower Galactic latitudes. The region with $\delta\lesssim-30^{\circ}$ ($0^{\circ}\lesssim l\lesssim 240^{\circ}$) is blank, because it is not covered by the PS1 $3\pi$ survey.

Figure 13: Sky density plot of GPQ candidates in Galactic coordinates.

The distributions of de-reddened $i_{P1}$ magnitudes and photometric redshifts of our GPQ candidates are displayed in Figure 14 (a). The lowest and highest photometric redshifts are $z_{\mathrm{phot}}=0.016$ and $z_{\mathrm{phot}}=4.777$, respectively. Taking into account the uncertainties in photo-$z$ estimations, the actual highest redshift of the GPQs can be up to 5. Five peaks appear in the histogram of photometric redshift (Figure 14 (a)) at $z_{\mathrm{phot}}\approx$ (0.8, 1.2, 1.7, 2.1, 2.4), which are caused by selection effects and sample bias of the training set. Quasars at these redshifts have higher chances of being selected with PS1 photometry: (i) when $z\approx 0.8$, the Mg II emission line enters the $g_{\mathrm{P1}}$ band; (ii) when $z\approx 1.2$, Mg II enters the $r_{\mathrm{P1}}$ band; (iii) when $z\approx 1.7$, C III] enters the $g_{\mathrm{P1}}$ band, and Mg II enters the $i_{\mathrm{P1}}$ band; (iv) when $z\approx 2.1$, both the Si IV and C IV lines enter the $g_{\mathrm{P1}}$ band, and C III] enters the $r_{\mathrm{P1}}$ band; and (v) when $z\approx 2.4$, Ly$\alpha$ and Si IV enter the $g_{\mathrm{P1}}$ band.
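The line-in-band statements above follow from $\lambda_{\mathrm{obs}}=\lambda_{\mathrm{rest}}(1+z)$, as the sketch below illustrates. The rest wavelengths are standard approximate values and the PS1 band edges are rounded approximations; both are assumptions of this example rather than numbers given in the text, and the function name is hypothetical.

```python
# Approximate rest-frame wavelengths in Angstrom (assumed values).
REST_WAVELENGTH = {
    "Lya": 1215.67,
    "Si IV": 1396.76,
    "C IV": 1549.06,
    "C III]": 1908.73,
    "Mg II": 2798.75,
}

# Rounded PS1 band edges in Angstrom (assumed, approximate values).
PS1_BAND_EDGES = {
    "g": (4000.0, 5500.0),
    "r": (5500.0, 7000.0),
    "i": (6900.0, 8200.0),
    "z": (8200.0, 9200.0),
    "y": (9200.0, 11000.0),
}


def bands_containing(line, redshift):
    """PS1 bands whose approximate range contains the redshifted line."""
    observed = REST_WAVELENGTH[line] * (1.0 + redshift)
    return [
        band
        for band, (blue_edge, red_edge) in PS1_BAND_EDGES.items()
        if blue_edge <= observed <= red_edge
    ]


for line, redshift in [("Mg II", 0.8), ("Mg II", 1.2), ("C III]", 1.7),
                       ("Mg II", 1.7), ("C IV", 2.1), ("Lya", 2.4)]:
    print(line, redshift, bands_containing(line, redshift))
```

With these rounded edges, the band assignments at the five peak redshifts agree with cases (i) to (v); redshifts at which a line crosses a band edge depend on the exact filter curves and are only approximate here.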

Figure 14: (a) De-reddened $i_{P1}$ magnitude and photometric redshift distribution of GPQ candidates. (b) De-reddened $i_{P1}$ magnitude and redshift distribution of the $GoodQSO$ sample. The $i_{P1}$ magnitude is de-reddened according to the Planck14 dust map.

The distributions of de-reddened $i_{P1}$ magnitudes and spectroscopic redshifts of the $GoodQSO$ sample from SDSS DR14Q are also shown in Figure 14 (b) for comparison. The $GoodQSO$ sample and the sample of GPQ candidates have similar redshift distributions with some subtle differences. The magnitude distributions are also similar to each other, except that $GoodQSO$ has a larger fraction of bright sources ($i_{P1}<19$) than the GPQ candidates. Such differences in both redshift and magnitude distributions of these two samples are mainly due to their different target selection strategies. Our GPQ candidates are selected from one single parent sample, while SDSS DR14Q includes many quasar samples in various redshift and magnitude ranges (see Section 2.2 of Pâris et al., 2018).

The color-color properties of sources from the GPQ candidate catalog are shown in Figure 15. In general, GPQ candidates have color-color distributions that are well matched to those of $GoodMockQSO$-PS1 (see Figures 5 and 6). The unimodal structures seen in both AllWISE and PS1 colors imply a low level of contamination from stars and galaxies. However, contamination from stars can be recognized in the $i-z$ versus $r-i$ diagram, where some sources are concentrated along the stellar locus (see the slightly contaminated region “SC” in Figure 15 (c)). We apply no cut on the “SC” region because it contains only 7,892 sources (4.90% of the whole catalog) and any cut is likely to also remove reddened quasars (see Figure 5).

Figure 15: AllWISE (a) and PS1 (b, c, d) color-color diagrams of sources from the GPQ candidate catalog. Contour lines based on 2d KDE are displayed on the density plots. The red dashed lines and “SC” in subplot (c) mark the region which is slightly contaminated by stars ($r-i>0.6\ \&\ i-z>0.25$).

8 Summary and conclusions

We present a Transfer Learning Framework for quasar selection, and its application to finding GPQs. We construct mock samples of quasars and galaxies behind the Galactic plane, by assigning new locations and extinction values to the extinction-corrected high-$b$ SDSS extragalactic sources. We use PS1 limiting magnitudes to select good mock sources, and compare them with high-$b$ sources in color-color spaces. We show that the covariate change of source colors is significant from high-$b$ regions to the Galactic plane. We synthesize training and validation data for machine learning with: (i) good mock samples, (ii) SDSS extragalactic sources which do not have counterparts in the good mock samples, and (iii) the real LAMOST Galactic plane star sample.

We apply the XGBoost algorithm for machine learning in this study. To help reduce the effects of class imbalance and class-balance change, we turn the three-class classification task (star, galaxy, and quasar) into two binary classification problems. A total of 13 features are used for the two classification steps: $g-r$, $r-i$, $i-z$, $z-y$, $g-W1$, $r-W1$, $i-W1$, $z-W1$, $y-W1$, $W1-W2$, $W2-W3$, $i-i_{\mathrm{Kron}}$ and $z-z_{\mathrm{Kron}}$. In order to remove star and galaxy contaminants, we use high thresholds of model-predicted probabilities ($p_{\mathrm{EXT}}>0.99\ \&\ p_{\mathrm{QSO}}>0.95$) to select extragalactic and quasar candidates. We perform an additional cut on the probability density of zero proper motion ($\mathrm{log}(f_{\mathrm{PM0}})\geq-4$) based on Gaia DR2 data to further reduce stellar contamination. Using the extinction-corrected SDSS DR14Q sources, we build the photometric redshift estimator with $\mathrm{RMSE}=0.35$ on the validation set.
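The three selection criteria summarized above can be combined into a single mask, as in this minimal sketch. The function name is hypothetical, and the inputs are assumed to be arrays holding the two classifier probabilities and $\mathrm{log}(f_{\mathrm{PM0}})$ for each source.

```python
import numpy as np


def select_reliable_gpq(p_ext, p_qso, log_fpm0):
    """Boolean mask for p_EXT > 0.99, p_QSO > 0.95 and log(f_PM0) >= -4."""
    p_ext = np.asarray(p_ext, dtype=float)
    p_qso = np.asarray(p_qso, dtype=float)
    log_fpm0 = np.asarray(log_fpm0, dtype=float)
    return (p_ext > 0.99) & (p_qso > 0.95) & (log_fpm0 >= -4.0)
```

Only sources that pass all three conditions are retained, so a high quasar probability alone is not sufficient if the proper motion is significant.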

Our GPQ candidate sample is validated with the Simbad database and the Milliquas catalog. The purity of quasars is $\sim 90$% on the Simbad matches. Under our constraints for good PS1 and AllWISE photometry, 95.14% of the GPQs in the Milliquas catalog and 97.88% of the GPQs from SDSS DR16Q can be recalled with our GPQ candidate sample. The photometric quality constraints ensure reliability of the candidates, but at the cost of lower overall completeness of the candidate sample. The sky density of GPQ candidates is consistent with the estimation based on the mock GPQ catalog. The median marginal probability of GPQs relative to the PS1-AllWISE sample with good photometry is $\sim 10^{-3}$ and the lowest marginal probability is $\sim 10^{-4}$. We compile the GPQ candidate catalog after removing known stars in Simbad and LAMOST, and some duplicated sources. The GPQ candidate catalog consists of 160,946 sources. In addition to our machine learning predictions, we include PS1 and AllWISE photometry, as well as Gaia DR2 astrometry in the table. The GPQ candidate sample has a broad redshift coverage ($0<z\lesssim 5$), indicating that our selection methods can be used over wide redshift ranges.

Colors of GPQ candidates agree well with those of the mock GPQ catalog, which also indicates high purity of the candidates even though the marginal probability is low. Contamination from stars and galaxies still exists in the GPQ candidate sample, but at a low level. Because most stars in the training sample ($T_{Star}$) are bright ones, identifying and removing faint stars can be challenging for the XGBoost classification model (CLF-1). Using colors instead of magnitudes as features helps to lessen the effects of such training sample bias. The strict $\mathrm{log}(f_{\mathrm{PM0}})\geq-4$ cut can additionally remove most stellar contaminants. Galaxies overlap heavily with quasars on PS1 color-color diagrams, and show a $\mathrm{log}(f_{\mathrm{PM0}})$ distribution similar to that of quasars. The use of AllWISE colors ($W1-W2$ and $W2-W3$) and morphological separators ($i-i_{\mathrm{Kron}}$ and $z-z_{\mathrm{Kron}}$) largely aids the galaxy-quasar classification. For future GPQ candidate selections, we expect to improve the machine learning performance by compiling a Galactic plane star training sample with more stars at the faint end.

We have been carrying out a series of spectroscopic identifications of GPQ candidates since 2018, using optical telescopes including two-meter telescopes based at Lijiang and Xinglong in China and Siding Spring in Australia, and the 200-inch Hale Telescope in the US. The success rate of identifying new GPQs is $\sim$90% in our spectroscopic campaign, which is consistent with the estimated reliability of the GPQ candidate catalog. We have also been exploring the LAMOST spectral data to find new GPQs. All these efforts yield promising results that will be presented in the next paper of this series.

We thank the support from the National Key R&D Program of China (2016YFA0400703) and the National Science Foundation of China (11533001, 11721303 & 11927804). We thank the referee very much for constructive and helpful suggestions to improve this paper. We thank Prof. Yanxia Zhang (NAOC) for helping us cross-match PS1 and AllWISE catalogs. We thank Dr. Jinyi Yang and Dr. Feige Wang from Steward Observatory for useful suggestions. YF thanks Dr. Hassen Yesuf (KIAA-PKU), Dr. Bojin Zhuang (Ping An Technology), Hongdong Zheng (PKU) and Dinghuai Zhang (Mila/PKU) for helpful discussions on machine learning. YF thanks Prof. Gregory J. Herczeg, Bitao Wang, Mingyang Zhuang, Yun Zheng, Yuanhang Ning and Niankun Yu from PKU for helping edit the draft. This publication makes use of data from the Pan-STARRS1 Surveys. The Pan-STARRS1 Surveys (PS1) and the PS1 public science archive have been made possible through contributions by the Institute for Astronomy, the University of Hawaii, the Pan-STARRS Project Office, the Max-Planck Society and its participating institutes, the Max Planck Institute for Astronomy, Heidelberg and the Max Planck Institute for Extraterrestrial Physics, Garching, The Johns Hopkins University, Durham University, the University of Edinburgh, the Queen’s University Belfast, the Harvard-Smithsonian Center for Astrophysics, the Las Cumbres Observatory Global Telescope Network Incorporated, the National Central University of Taiwan, the Space Telescope Science Institute, the National Aeronautics and Space Administration under Grant No. NNX08AR22G issued through the Planetary Science Division of the NASA Science Mission Directorate, the National Science Foundation Grant No. AST-1238877, the University of Maryland, Eotvos Lorand University (ELTE), the Los Alamos National Laboratory, and the Gordon and Betty Moore Foundation. 
This publication makes use of data products from the Wide-field Infrared Survey Explorer, which is a joint project of the University of California, Los Angeles, and the Jet Propulsion Laboratory/California Institute of Technology, funded by the National Aeronautics and Space Administration. This work has made use of data from the European Space Agency (ESA) mission Gaia (https://www.cosmos.esa.int/gaia), processed by the Gaia Data Processing and Analysis Consortium (DPAC, https://www.cosmos.esa.int/web/gaia/dpac/consortium). Funding for the DPAC has been provided by national institutions, in particular the institutions participating in the Gaia Multilateral Agreement. The Guoshoujing Telescope (the Large Sky Area Multi-object Fiber Spectroscopic Telescope LAMOST) is a National Major Scientific Project built by the Chinese Academy of Sciences. Funding for the project has been provided by the National Development and Reform Commission. LAMOST is operated and managed by the National Astronomical Observatories, Chinese Academy of Sciences. Funding for the Sloan Digital Sky Survey IV has been provided by the Alfred P. Sloan Foundation, the U.S. Department of Energy Office of Science, and the Participating Institutions. SDSS-IV acknowledges support and resources from the Center for High-Performance Computing at the University of Utah. The SDSS web site is www.sdss.org. 
SDSS-IV is managed by the Astrophysical Research Consortium for the Participating Institutions of the SDSS Collaboration including the Brazilian Participation Group, the Carnegie Institution for Science, Carnegie Mellon University, the Chilean Participation Group, the French Participation Group, Harvard-Smithsonian Center for Astrophysics, Instituto de Astrofísica de Canarias, The Johns Hopkins University, Kavli Institute for the Physics and Mathematics of the Universe (IPMU) / University of Tokyo, the Korean Participation Group, Lawrence Berkeley National Laboratory, Leibniz Institut für Astrophysik Potsdam (AIP), Max-Planck-Institut für Astronomie (MPIA Heidelberg), Max-Planck-Institut für Astrophysik (MPA Garching), Max-Planck-Institut für Extraterrestrische Physik (MPE), National Astronomical Observatories of China, New Mexico State University, New York University, University of Notre Dame, Observatário Nacional / MCTI, The Ohio State University, Pennsylvania State University, Shanghai Astronomical Observatory, United Kingdom Participation Group, Universidad Nacional Autónoma de México, University of Arizona, University of Colorado Boulder, University of Oxford, University of Portsmouth, University of Utah, University of Virginia, University of Washington, University of Wisconsin, Vanderbilt University, and Yale University.

References

  • Acero et al. (2015) Acero, F., Ackermann, M., Ajello, M., et al. 2015, ApJS, 218, 23
  • Aguado et al. (2019) Aguado, D. S., Ahumada, R., Almeida, A., et al. 2019, ApJS, 240, 23
  • Akiba et al. (2019) Akiba, T., Sano, S., Yanase, T., Ohta, T., & Koyama, M. 2019, in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2623–2631
  • Arenou et al. (2018) Arenou, F., Luri, X., Babusiaux, C., et al. 2018, A&A, 616, A17
  • Argyriou et al. (2006) Argyriou, A., Evgeniou, T., & Pontil, M. 2006, in Proceedings of the 19th International Conference on Neural Information Processing Systems, 41–48
  • Assef et al. (2018) Assef, R. J., Stern, D., Noirot, G., et al. 2018, ApJS, 234, 23
  • Astropy Collaboration et al. (2013) Astropy Collaboration, Robitaille, T. P., Tollerud, E. J., et al. 2013, A&A, 558, A33
  • Babbedge et al. (2004) Babbedge, T. S. R., Rowan-Robinson, M., Gonzalez-Solares, E., et al. 2004, MNRAS, 353, 654
  • Bachchan et al. (2016) Bachchan, R. K., Hobbs, D., & Lindegren, L. 2016, A&A, 589, A71
  • Bailer-Jones et al. (2019) Bailer-Jones, C. A. L., Fouesneau, M., & Andrae, R. 2019, MNRAS, 490, 5615
  • Becker et al. (2001) Becker, R. H., White, R. L., Gregg, M. D., et al. 2001, ApJS, 135, 227
  • Ben Bekhti et al. (2008) Ben Bekhti, N., Richter, P., Westmeier, T., & Murphy, M. T. 2008, A&A, 487, 583
  • Ben Bekhti et al. (2012) Ben Bekhti, N., Winkel, B., Richter, P., et al. 2012, A&A, 542, A110
  • Bishop (2006) Bishop, C. M. 2006, Pattern recognition and machine learning (springer)
  • Blanton et al. (2017) Blanton, M. R., Bershady, M. A., Abolfathi, B., et al. 2017, AJ, 154, 28
  • Blitzer et al. (2006) Blitzer, J., McDonald, R., & Pereira, F. 2006, in Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, 120–128
  • Bovy et al. (2011) Bovy, J., Hennawi, J. F., Hogg, D. W., et al. 2011, ApJ, 729, 141
  • Bovy et al. (2012) Bovy, J., Myers, A. D., Hennawi, J. F., et al. 2012, ApJ, 749, 41
  • Boyd et al. (2013) Boyd, K., Eng, K. H., & Page, C. D. 2013, in Joint European conference on machine learning and knowledge discovery in databases, Springer, 451–466
  • Brescia et al. (2013) Brescia, M., Cavuoti, S., D’Abrusco, R., Longo, G., & Mercurio, A. 2013, ApJ, 772, 140
  • Budavári et al. (2001) Budavári, T., Csabai, I., Szalay, A. S., et al. 2001, AJ, 122, 1163
  • Chambers et al. (2016) Chambers, K. C., Magnier, E. A., Metcalfe, N., et al. 2016, arXiv e-prints, arXiv:1612.05560
  • Chen & Guestrin (2016) Chen, T., & Guestrin, C. 2016, in Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, 785–794
  • Chicco & Jurman (2020) Chicco, D., & Jurman, G. 2020, BMC genomics, 21, 6
  • Chiu et al. (2006) Chiu, K., Fan, X., Leggett, S. K., et al. 2006, AJ, 131, 2722
  • Condon et al. (1998) Condon, J. J., Cotton, W., Greisen, E., et al. 1998, AJ, 115, 1693
  • Croom et al. (2009) Croom, S. M., Richards, G. T., Shanks, T., et al. 2009, MNRAS, 392, 19
  • Cui et al. (2012) Cui, X.-Q., Zhao, Y.-H., Chu, Y.-Q., et al. 2012, Research in Astronomy and Astrophysics, 12, 1197
  • Davis & Goadrich (2006) Davis, J., & Goadrich, M. 2006, in Proceedings of the 23rd international conference on Machine learning, 233–240
  • Deng et al. (2012) Deng, L.-C., Newberg, H. J., Liu, C., et al. 2012, Research in Astronomy and Astrophysics, 12, 735
  • Dobrzycki et al. (2003) Dobrzycki, A., Macri, L. M., Stanek, K. Z., & Groot, P. J. 2003, AJ, 125, 1330
  • Du Plessis & Sugiyama (2014) Du Plessis, M. C., & Sugiyama, M. 2014, Neural Networks, 50, 110
  • Fan et al. (2001) Fan, X., Narayanan, V. K., Lupton, R. H., et al. 2001, AJ, 122, 2833
  • Farrow et al. (2014) Farrow, D. J., Cole, S., Metcalfe, N., et al. 2014, MNRAS, 437, 748
  • Fischer et al. (2019) Fischer, T. C., Rigby, J. R., Mahler, G., et al. 2019, ApJ, 875, 102
  • Flesch (2015) Flesch, E. W. 2015, PASA, 32, e010
  • Flesch (2019) —. 2019, arXiv e-prints, arXiv:1912.05614
  • Friedman et al. (2000) Friedman, J., Hastie, T., Tibshirani, R., et al. 2000, The annals of statistics, 28, 337
  • Friedman (2001) Friedman, J. H. 2001, Annals of statistics, 1189
  • Gaia Collaboration et al. (2016) Gaia Collaboration, Prusti, T., De Bruijne, J., et al. 2016, A&A, 595, A1
  • Gaia Collaboration et al. (2018a) Gaia Collaboration, Mignard, F., Klioner, S. A., et al. 2018a, A&A, 616, A14
  • Gaia Collaboration et al. (2018b) Gaia Collaboration, Brown, A. G. A., Vallenari, A., et al. 2018b, A&A, 616, A1
  • Glikman et al. (2006) Glikman, E., Helfand, D. J., & White, R. L. 2006, ApJ, 640, 579
  • Gong et al. (2019) Gong, Z., Zhong, P., & Hu, W. 2019, IEEE Access, 7, 64323
  • Górski et al. (2005) Górski, K. M., Hivon, E., Banday, A. J., et al. 2005, ApJ, 622, 759
  • Grazian et al. (2000) Grazian, A., Cristiani, S., D’Odorico, V., Omizzolo, A., & Pizzella, A. 2000, AJ, 119, 2540
  • Green (2018) Green, G. 2018, The Journal of Open Source Software, 3, 695
  • Green et al. (2019) Green, G. M., Schlafly, E., Zucker, C., Speagle, J. S., & Finkbeiner, D. 2019, ApJ, 887, 93
  • Green et al. (2018) Green, G. M., Schlafly, E. F., Finkbeiner, D., et al. 2018, MNRAS, 478, 651
  • Green et al. (1986) Green, R. F., Schmidt, M., & Liebert, J. 1986, ApJS, 61, 305
  • Gregg et al. (1996) Gregg, M. D., Becker, R. H., White, R. L., et al. 1996, AJ, 112, 407
  • Hastie et al. (2009) Hastie, T., Tibshirani, R., & Friedman, J. 2009, The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Springer Science & Business Media)
  • Hernán-Caballero et al. (2016) Hernán-Caballero, A., Hatziminaoglou, E., Alonso-Herrero, A., & Mateos, S. 2016, MNRAS, 463, 2064
  • Huo et al. (2010) Huo, Z.-Y., Liu, X.-W., Yuan, H.-B., et al. 2010, Research in Astronomy and Astrophysics, 10, 612
  • Huo et al. (2013) Huo, Z.-Y., Liu, X.-W., Xiang, M.-S., et al. 2013, AJ, 145, 159
  • Huo et al. (2015) Huo, Z.-Y., Liu, X.-W., Xiang, M.-S., et al. 2015, Research in Astronomy and Astrophysics, 15, 1438
  • Im et al. (2007) Im, M., Lee, I., Cho, Y., et al. 2007, ApJ, 664, 64
  • Ivezić et al. (2007) Ivezić, Ž., Smith, J. A., Miknaitis, G., et al. 2007, AJ, 134, 973
  • Jin et al. (2019) Jin, X., Zhang, Y., Zhang, J., et al. 2019, MNRAS, 485, 4539
  • Kirkpatrick et al. (1997) Kirkpatrick, J. D., Henry, T. J., & Irwin, M. J. 1997, AJ, 113, 1421
  • Koenig & Leisawitz (2014) Koenig, X. P., & Leisawitz, D. T. 2014, ApJ, 791, 131
  • Kozłowski & Kochanek (2009) Kozłowski, S., & Kochanek, C. S. 2009, ApJ, 701, 508
  • Kron (1980) Kron, R. G. 1980, ApJS, 43, 305
  • Lacy et al. (2004) Lacy, M., Storrie-Lombardi, L. J., Sajina, A., et al. 2004, ApJS, 154, 166
  • Lada (1987) Lada, C. J. 1987, in Symposium - International Astronomical Union, Vol. 115, Cambridge University Press, 1–18
  • Laurino et al. (2011) Laurino, O., D’Abrusco, R., Longo, G., & Riccio, G. 2011, MNRAS, 418, 2165
  • Lindegren et al. (2018a) Lindegren, L., Hernández, J., Bombrun, A., et al. 2018a, A&A, 616, A2
  • Lindegren et al. (2018b) Lindegren, L., Hernández, J., Bombrun, A., et al. 2018b, in IAU 30 GA - Division A: Fundamental Astronomy
  • Luo et al. (2012) Luo, A. L., Zhang, H.-T., Zhao, Y.-H., et al. 2012, Research in Astronomy and Astrophysics, 12, 1243
  • Luo et al. (2015) Luo, A.-L., Zhao, Y.-H., Zhao, G., et al. 2015, Research in Astronomy and Astrophysics, 15, 1095
  • Lyke et al. (2020) Lyke, B. W., Higley, A. N., McLane, J. N., et al. 2020, ApJS, 250, 8
  • Magnier et al. (2016) Magnier, E. A., Schlafly, E. F., Finkbeiner, D. P., et al. 2016, arXiv e-prints, arXiv:1612.05242
  • Mainzer et al. (2011) Mainzer, A., Bauer, J., Grav, T., et al. 2011, ApJ, 731, 53
  • Mateos et al. (2012) Mateos, S., Alonso-Herrero, A., Carrera, F. J., et al. 2012, MNRAS, 426, 3271
  • Mirabal et al. (2016) Mirabal, N., Charles, E., Ferrara, E. C., et al. 2016, ApJ, 825, 69
  • Palanque-Delabrouille et al. (2011) Palanque-Delabrouille, N., Yèche, C., Myers, A. D., et al. 2011, A&A, 530, A122
  • Pan & Yang (2009) Pan, S. J., & Yang, Q. 2009, IEEE Transactions on Knowledge and Data Engineering, 22, 1345
  • Pâris et al. (2018) Pâris, I., Petitjean, P., Aubourg, É., et al. 2018, A&A, 613, A51
  • Pasquet-Itam & Pasquet (2018) Pasquet-Itam, J., & Pasquet, J. 2018, A&A, 611, A97
  • Pedregosa et al. (2011) Pedregosa, F., Varoquaux, G., Gramfort, A., et al. 2011, Journal of Machine Learning Research, 12, 2825
  • Planck Collaboration et al. (2014) Planck Collaboration, Abergel, A., Ade, P. A. R., et al. 2014, A&A, 571, A11
  • Pounds (1979) Pounds, K. A. 1979, Proceedings of the Royal Society of London Series A, 366, 375
  • Price-Whelan et al. (2018) Price-Whelan, A. M., Sipőcz, B. M., Günther, H. M., et al. 2018, AJ, 156, 123
  • Provost et al. (1998) Provost, F., Fawcett, T., & Kohavi, R. 1998, in ICML Conference
  • Quiñonero-Candela et al. (2009) Quiñonero-Candela, J., Sugiyama, M., Schwaighofer, A., & Lawrence, N. D. 2009, Dataset Shift in Machine Learning (The MIT Press)
  • Richards et al. (2001) Richards, G. T., Weinstein, M. A., Schneider, D. P., et al. 2001, AJ, 122, 1151
  • Richards et al. (2002) Richards, G. T., Fan, X., Newberg, H. J., et al. 2002, AJ, 123, 2945
  • Richards et al. (2004) Richards, G. T., Nichol, R. C., Gray, A. G., et al. 2004, ApJS, 155, 257
  • Saerens et al. (2002) Saerens, M., Latinne, P., & Decaestecker, C. 2002, Neural Computation, 14, 21
  • Saito & Rehmsmeier (2015) Saito, T., & Rehmsmeier, M. 2015, PLoS ONE, 10
  • Salvato et al. (2009) Salvato, M., Hasinger, G., Ilbert, O., et al. 2009, ApJ, 690, 1250
  • Sandage (1965) Sandage, A. 1965, ApJ, 141, 1560
  • Savage et al. (1993) Savage, B. D., Lu, L., Bahcall, J. N., et al. 1993, ApJ, 413, 116
  • Savage et al. (2000) Savage, B. D., Wakker, B., Jannuzi, B. T., et al. 2000, ApJS, 129, 563
  • Schlafly et al. (2016) Schlafly, E. F., Meisner, A. M., Stutz, A. M., et al. 2016, ApJ, 821, 78
  • Schmidt (1963) Schmidt, M. 1963, Nature, 197, 1040
  • Secrest et al. (2015) Secrest, N. J., Dudik, R. P., Dorland, B. N., et al. 2015, ApJS, 221, 12
  • Shen et al. (2016) Shen, S.-Y., Argudo-Fernández, M., Chen, L., et al. 2016, Research in Astronomy and Astrophysics, 16, 43
  • Shimodaira (2000) Shimodaira, H. 2000, Journal of Statistical Planning and Inference, 90, 227
  • Skrutskie et al. (2006) Skrutskie, M., Cutri, R., Stiening, R., et al. 2006, AJ, 131, 1163
  • Stern et al. (2005) Stern, D., Eisenhardt, P., Gorjian, V., et al. 2005, ApJ, 631, 163
  • Stern et al. (2012) Stern, D., Assef, R. J., Benford, D. J., et al. 2012, ApJ, 753, 30
  • Strauss et al. (2002) Strauss, M. A., Weinberg, D. H., Lupton, R. H., et al. 2002, AJ, 124, 1810
  • Su & Cui (2004) Su, D.-Q., & Cui, X.-Q. 2004, Chinese J. Astron. Astrophys., 4, 1, doi: 10.1088/1009-9271/4/1/1
  • Sugiyama & Kawanabe (2012) Sugiyama, M., & Kawanabe, M. 2012, Machine Learning in Non-Stationary Environments: Introduction to Covariate Shift Adaptation (MIT Press)
  • Tange (2011) Tange, O. 2011, ;login: The USENIX Magazine, 36, 42
  • Taylor (2005) Taylor, M. B. 2005, in Astronomical Society of the Pacific Conference Series, Vol. 347, Astronomical Data Analysis Software and Systems XIV, ed. P. Shopbell, M. Britton, & R. Ebert, 29
  • Vanden Berk et al. (2001) Vanden Berk, D. E., Richards, G. T., Bauer, A., et al. 2001, AJ, 122, 549
  • Vapnik (2013) Vapnik, V. 2013, The Nature of Statistical Learning Theory (Springer Science & Business Media)
  • Vennes et al. (2002) Vennes, S., Smith, R. J., Boyle, B. J., et al. 2002, MNRAS, 335, 673
  • Wang & Chen (2019) Wang, S., & Chen, X. 2019, ApJ, 877, 116
  • Wang et al. (1996) Wang, S.-G., Su, D.-Q., Chu, Y.-Q., Cui, X., & Wang, Y.-N. 1996, Applied Optics, 35, 5155
  • Weinstein et al. (2004) Weinstein, M. A., Richards, G. T., Schneider, D. P., et al. 2004, ApJS, 155, 243
  • Wenger et al. (2000) Wenger, M., Ochsenbein, F., Egret, D., et al. 2000, A&AS, 143, 9
  • Westmeier (2018) Westmeier, T. 2018, MNRAS, 474, 289
  • White et al. (2000) White, R. L., Becker, R. H., Gregg, M. D., et al. 2000, ApJS, 126, 133
  • Wright et al. (2010) Wright, E. L., Eisenhardt, P. R., Mainzer, A. K., et al. 2010, AJ, 140, 1868
  • Wu et al. (2012) Wu, X.-B., Hao, G., Jia, Z., Zhang, Y., & Peng, N. 2012, AJ, 144, 49
  • Wu & Jia (2010) Wu, X.-B., & Jia, Z. 2010, MNRAS, 406, 1583
  • Wu et al. (2004) Wu, X.-B., Zhang, W., & Zhou, X. 2004, Chinese J. Astron. Astrophys., 4, 17
  • Yan et al. (2013) Yan, L., Donoso, E., Tsai, C.-W., et al. 2013, AJ, 145, 55
  • Yang et al. (2017) Yang, Q., Wu, X.-B., Fan, X., et al. 2017, AJ, 154, 269
  • Yao et al. (2019) Yao, S., Wu, X.-B., Ai, Y. L., et al. 2019, ApJS, 240, 6
  • Yèche et al. (2010) Yèche, C., Petitjean, P., Rich, J., et al. 2010, A&A, 523, A14
  • Yi et al. (2019) Yi, Z., Chen, Z., Pan, J., et al. 2019, ApJ, 887, 241
  • York et al. (2000) York, D. G., Adelman, J., Anderson, Jr., J. E., et al. 2000, AJ, 120, 1579
  • Yuan et al. (2015) Yuan, H.-B., Liu, X.-W., Huo, Z.-Y., et al. 2015, MNRAS, 448, 855
  • Zhang et al. (2013) Zhang, Y., Ma, H., Peng, N., Zhao, Y., & Wu, X.-B. 2013, AJ, 146, 22
  • Zhao et al. (2012) Zhao, G., Zhao, Y.-H., Chu, Y.-Q., Jing, Y.-P., & Deng, L.-C. 2012, Research in Astronomy and Astrophysics, 12, 723
  • Zonca et al. (2019) Zonca, A., Singer, L., Lenz, D., et al. 2019, Journal of Open Source Software, 4, 1298