Finding Quasars behind the Galactic Plane. I. Candidate Selections with Transfer Learning
Abstract
Quasars behind the Galactic plane (GPQs) are important astrometric references and useful probes of Milky Way gas. However, the search for GPQs is difficult due to large extinctions and high source densities in the Galactic plane. Existing selection methods for quasars developed using high Galactic latitude (high-) data cannot be applied to the Galactic plane directly because the photometric data obtained from high- regions and the Galactic plane follow different probability distributions. To alleviate this dataset shift problem for quasar candidate selection, we adopt a Transfer Learning Framework at both data and algorithm levels. At the data level, to make a training set in which dataset shift is modeled, we synthesize quasars and galaxies behind the Galactic plane based on SDSS sources and Galactic dust map. At the algorithm level, to reduce the effect of class imbalance, we transform the three-class classification problem for stars, galaxies, and quasars to two binary classification tasks. We apply XGBoost algorithm on Pan-STARRS1 (PS1) and AllWISE photometry for classification, and additional cut on Gaia proper motion to remove stellar contaminants. We obtain a reliable GPQ candidate catalog with sources located at in PS1-AllWISE footprint. Photometric redshifts of GPQ candidates achieved with XGBoost regression algorithm show that our selection method can identify quasars in a wide redshift range (). This study extends the systematic searches for quasars to the dense stellar fields and shows the feasibility of using astronomical knowledge to improve data mining under complex conditions in the Big Data era.
1 Introduction
The Galactic plane has long been the “zone of avoidance” for extragalactic astronomy, including quasar surveys. The Half Million Quasar (HMQ; Flesch, 2015) catalog contains a total of 510,764 objects, but only 35,105 located at (half of the whole sky area), 3,730 at , and 255 at . Although it is difficult to search for Quasars behind the Galactic Plane (hereafter GPQs), such quasars are important references for astrometry and useful probes of Milky Way gas.
Quasars are used as astrometric references due to their small parallaxes and proper motions. GPQs enable the accurate measurement of positions, distances, and proper motions of stars in the Galactic disk, which is key to understanding our own Galaxy. The high-precision astrometry provided by the Gaia mission defines a celestial reference frame through the positions of 556,869 candidate quasars, however only a tiny fraction of these quasars are located at (Gaia Collaboration et al., 2018a). A large sample of GPQs will help build a better reference frame in the optical, through direct coverage of the sky in the Galactic plane, and will help to better understand the systematic astrometry errors of Gaia in the Galactic plane region (Arenou et al., 2018).
Line-of-sight absorption towards quasars can probe gas structures of the Milky Way. While quasars at high Galactic latitude have been useful in studying the Milky Way halo gas (e.g. Savage et al., 1993, 2000; Ben Bekhti et al., 2008, 2012), GPQs allow absorption line studies on gaseous structures in the Galactic plane (e.g. Anti-Center Shell, H Complex; see Westmeier, 2018). Moreover, a high density sample of GPQs can map the gas distribution with a higher angular resolution than that is possible with the 21 cm surveys.
Another application of GPQs is adaptive-optics observation on quasar host galaxies, which is achieved by their proximity to nearby bright stars as natural guide stars (Im et al., 2007; Fischer et al., 2019). For adaptive optics, natural guide stars should be located within a few arcseconds of the science target, which rarely occurs outside of the Galactic plane but is more common in the plane.
The difficulty of finding quasars behind the Galactic plane is caused by several challenges, including:
-
•
In comparison to objects at high Galactic latitude (high-), sources in the Galactic plane suffer from higher extinction and reddening. As a result, many sources (especially extragalactic sources) can not be detected within the survey detection limit. For other detectable sources, their colors are different from those at high Galactic latitude.
-
•
The source density in the Galactic plane is high. The quality of photometry can be worse in dense regions, because sources can be easily contaminated by visible or unseen neighbors.
-
•
A lot of “unusual” stars are located within the Galactic plane, including some white dwarfs, M/L/T dwarfs, and Young Stellar Objects (YSOs), that share many similar observational properties with quasars. These sources can be contaminants for quasars at different redshifts (e.g. Kirkpatrick et al., 1997; Vennes et al., 2002; Chiu et al., 2006; Kozłowski & Kochanek, 2009).
Since the first identification of quasar (3C 273; Schmidt, 1963), many methods for quasar candidate selection have been developed, including ultraviolet excess (e.g. Sandage, 1965; Green et al., 1986), radio sources (e.g. Gregg et al., 1996; White et al., 2000; Becker et al., 2001), X-ray sources (e.g. Pounds, 1979; Grazian et al., 2000), optical/near-infrared (near-IR) colors (e.g. Richards et al., 2002; Fan et al., 2001; Wu & Jia, 2010), mid-IR colors (e.g. Lacy et al., 2004; Stern et al., 2005, 2012; Mateos et al., 2012; Wu et al., 2012; Yan et al., 2013), and quasar variability (e.g. Dobrzycki et al., 2003; Palanque-Delabrouille et al., 2011). In addition, tools based on statistical machine learning (e.g. Richards et al., 2004; Bovy et al., 2011) and deep learning (e.g. Yèche et al., 2010; Pasquet-Itam & Pasquet, 2018) have also been established to find quasars with various data that are available.
A few studies have focused on finding quasars/AGNs behind dense stellar fields such as the Galactic plane, Magellanic Clouds, and M31 and M33 galaxies. Most of these studies used infrared selection methods to efficiently find quasars. For example, Im et al. (2007) discovered 40 bright quasars at by applying the combination of a near-IR color cut of on Two Micron All Sky Survey (2MASS; Skrutskie et al., 2006) and detection of a radio counterpart from the NRAO VLA Sky Survey (NVSS; Condon et al., 1998). Kozłowski & Kochanek (2009) identified 5,000 AGNs behind the Magellanic Clouds with mid-IR color cuts modified from method of Stern et al. (2005). Huo et al. (2010, 2013, 2015) discovered 1,870 new quasars around the Andromeda (M31) and Triangulum (M33) galaxies, with the Large Sky Area Multi-Object Fiber Spectroscopic Telescope (LAMOST) from 2009 to 2013.
Recently, searches for quasars have been focused on large datasets with big data volumes and large sky coverage. Secrest et al. (2015) obtained an all-sky AGN candidates catalog with million sources using two-color infrared photometric selection criteria from the Wide-field Infrared Survey Explorer final catalog release (AllWISE) (Wright et al., 2010; Mainzer et al., 2011). Assef et al. (2018) built two catalogs of AGN candidates also based on AllWISE photometry, while excluding regions around Galactic center and Galactic plane. Jin et al. (2019) selected quasar candidates with machine learning method using Pan-STARRS1 (PS1; Chambers et al., 2016) and AllWISE data. Bailer-Jones et al. (2019) classified objects in Gaia Data Release 2 (Gaia DR2; Gaia Collaboration et al., 2016, 2018b) as stars, quasars, and galaxies with Gaussian Mixture Model and addressed the problem of class imbalance in Gaia DR2.
However, the studies listed above either treated sources in the Galactic plane and high Galactic latitude as the same, or removed the Galactic plane from consideration. Selection methods for quasars at high Galactic latitude are not generic and can not be applied to the Galactic plane directly, because data (e.g. PS1 and AllWISE photometry) obtained from high- and low- follow different probability distributions. For example, apparent colors of quasars (stars) vary from high- to low- regions, and so do the source density of quasars (stars). Such behavior of data is a kind of non-stationarity called dataset shift (Quionero-Candela et al., 2009), which leads to significant estimation bias of supervised machine learning algorithms. The color cuts for quasar selection can also be regarded as simple decision tree models in machine learning regime. Previous color cuts obtained from high Galactic latitude regions fail in the Galactic plane due to the dataset shift.
To deal with these dataset shift problems, transfer learning (Pan & Yang, 2009) has been proposed and studied extensively by data scientists. The idea of transfer learning is to use knowledge gained in one problem and apply it to a different but related problem. Although spectroscopically identified (i.e. “labeled”) samples of extragalactic objects are inadequate in the Galactic plane, such labeled samples are available at high Galactic latitude. The labeled data make it possible to build a good selection method for GPQs, once the knowledge transfer from high Galactic latitude to low Galactic latitude is successful.
This paper is the first one of this series for finding GPQs. In this paper we present a transfer learning method for quasar selection, as well as a GPQ candidate catalog with sources. In Section 2, we introduce the archival data used for this study. In Section 3, we describe the algorithm design for GPQ selection. In Section 4, we synthesize quasars and galaxies behind the Galactic plane with extragalactic objects at high Galactic latitude from the Sloan Digital Sky Survey (SDSS; York et al., 2000), to make a training set in which dataset shift is modeled. In Section 5, we transform the three-class classification problem for stars, galaxies and quasars to two binary classification tasks: stars versus extragalactic objects, and quasars versus galaxies to reduce the class imbalance and class-balance change. In Section 6, we calculate the photometric redshifts for GPQ candidates. In Section 7, we present the GPQ candidate catalog and some statistical properties of the sample. We summarize the results in Section 8. Throughout this paper, we use AB magnitude for PS1 photometry and Vega magnitude for AllWISE photometry unless mentioned.
2 Data
We make use of optical and infrared photometric data from PS1 and AllWISE, and astrometric data from Gaia DR2. We also retrieve samples of spectroscopically identified objects from SDSS and LAMOST.
2.1 PS1 DR1 photometry
Pan-STARRS1 (PS1; Chambers et al., 2016) has carried out a set of synoptic imaging sky surveys including the 3 Steradian Survey and the Medium Deep Survey in 5 bands . The mean 5 point source limiting sensitivities in the stacked 3 Steradian Survey in are (23.3, 23.2, 23.1, 22.3, 21.4) and the single epoch 5 depths in are (22.0, 21.8, 21.5, 20.9, 19.7). For better astrometry in the crowded Galactic plane field, we use mean coordinates from the PS1 MeanObject table. Mean PSF magnitudes are used for all bands , and mean Kron magnitudes (Kron, 1980) are used for and bands. The Galactic extinction coefficients for are . These coefficients are calculated using , where is the relative extinction value for band given by a new optical to mid-IR extinction law (Wang & Chen, 2019), and .
We set a few constraints on the PS1 data to ensure the data quality. All sources should be: (i) detected in all PS1 bands () and significantly detected in (error in PSF mag of band , equivalent to -band SNR larger than 5); (ii) not too bright in to avoid possible saturation (); (iii) measured with Kron magnitude (Kron, 1980) in the and bands (). For simplification, we use () to represent the PSF magnitudes of PS1 bands () in color indexes (e.g. , ) and derived quantities and . The PSF magnitude does not appear alone and will not be confused with the redshift symbol .
2.2 AllWISE photometry for point-like sources
The AllWISE catalog is built upon the work of the Wide-field Infrared Survey Explorer mission (WISE; Wright et al., 2010) by combining data from the WISE cryogenic and NEOWISE (Mainzer et al., 2011) post-cryogenic survey. WISE has 4 bands at 3.4, 4.6, 12, and 22 m (W1, W2, W3, and W4). The 5 limiting magnitudes of the AllWISE catalog in W1, W2, W3, and W4 bands are 19.6, 19.3, 16.7, and 14.6 mag. The Galactic extinction coefficients for W1, W2, W3 used in this study are . These coefficients are also calculated with relative extinction values from Wang & Chen (2019).
We cross-match the PS1 sources with AllWISE using a radius of to avoid source confusion in the dense fields of the Galactic plane. We also set a few constraints on the AllWISE data. All sources should be: (i) AllWISE point sources (); (ii) not too bright to avoid possible saturation (); (iii) significantly detected in and bands (); (iv) unaffected by prioritized image artifacts in each band (“0000”); (v) unblended with nearby detections, so that only one component is used in each profile-fitting for each source ().
2.3 Gaia DR2 astrometry
Gaia DR2 (Gaia Collaboration et al., 2016, 2018b) contains celestial positions and the apparent brightness in band for approximately billion sources. For billion of those sources, parallaxes and proper motions are available. Broad-band photometry in the (330–680 nm) and (630–1050 nm) bands are available for billion sources. We use the proper motions and their uncertainties from the Gaia DR2 catalog (columns pmra, pmra_error, pmdec, and pmdec_error) to find quasars.
2.4 SDSS Quasar Catalog: the fourteenth data release
SDSS (York et al., 2000) has mapped the high Galactic latitude northern sky and obtained imaging as well as spectroscopy data for millions of objects including stars, galaxies, and quasars. The 14th data release of the SDSS Quasar Catalog (SDSS DR14Q; Pâris et al., 2018) contains 526,356 quasars. We cross-match the DR14Q catalog with PS1 and AllWISE both with a radius of . To ensure the data quality, we use the same constraints in Section 2.1 and Section 2.2 to retrieve a subset of DR14Q. This subset has 289,271 sources and is denoted as hereafter. As can be seen from the HEALPix (Górski et al., 2005) density map of (Figure 1), very few sources of are located at .

2.5 SDSS spectroscopically identified stars and galaxies
In order to compare high- sources with Galactic plane sources that we use, a sample of stars and a sample of galaxies are extracted from SpecPhotoAll table of SDSS Data Release 15 (Blanton et al., 2017; Aguado et al., 2019). We cross-match both the star and galaxy sample with PS1 and AllWISE with a radius of . The SDSS star sample has 23,693 sources. We also apply quality constraints in Section 2.1 and Section 2.2 to select galaxy subset with good photometry for later use. The resulted subset of galaxy (denoted as hereafter) has 1,635,053 sources. Most SDSS stars and galaxies are located at high Galactic latitude ().
2.6 Stars from LAMOST general catalog
The Large Sky Area Multi-Object Fiber Spectroscopic Telescope (LAMOST, also called the Guoshoujing Telescope) is a special reflecting Schmidt telescope, the design of which allows both a large effective aperture of 3.6 m–4.9 m and a wide field of view of (Wang et al., 1996; Su & Cui, 2004; Cui et al., 2012). The LAMOST spectral survey (Zhao et al., 2012; Luo et al., 2012, 2015) consists of two major components, i.e. the LAMOST Experiment for Galactic Understanding and Exploration (LEGUE; Deng et al., 2012), and the LAMOST ExtraGAlactic Survey (LEGAS). The LEGUE observes stars in different sky regions with different magnitude ranges, including the Galactic halo with mag at , the Galactic anti-center with mag at and (Yuan et al., 2015), as well as the Galactic disk with mag at with uniform coverage along Galactic longitude. The LEGAS mainly identifies galaxies and quasars that are within the SDSS footprint but complementary to the SDSS spectroscopic samples (e.g. Shen et al., 2016; Yao et al., 2019). Nevertheless, extragalactic objects in the LEGUE plates are also targets of the LEGAS. The LAMOST spectral survey has obtained the largest stellar spectra sample to date. We retrieve star sample from LAMOST general catalog from DR1 to DR7v0. A total number of 3,940,076 LAMOST stars meet the the same constraints in Section 2.1 and Section 2.2. From this LAMOST star sample, we select 1,334,577 Galactic plane stars with (denoted as hereafter). Most sources are from the LEGUE survey and are brighter than 18 mag in band.
2.7 The Million Quasars (Milliquas) Catalog
The Million Quasars (Milliquas) Catalog (Flesch, 2019) is a compilation of quasars and quasar candidates from the literature. The Milliquas v6.4c update includes 758,908 type-I QSOs and AGN up to 31 December 2019. We use this catalog to extract extant GPQ sample within PS1 footprint. There are 4,344 quasars (with “Q” label in the “Descrip” column) located in in Milliquas v6.4c. Cross-matching these 4,344 known GPQs with PS1 and AllWISE both with a radius of gives 2,757 sources. After applying same constraints as in Section 2.1 and Section 2.2, we get a subset of 1,853 sources. This Galactic plane subset of Milliquas quasars, denoted as , will be used later for candidate validation.
3 Design of the Transfer Learning Framework
3.1 Dataset shift problem in the Galactic plane
The task of quasar selections can be described by classification problems in machine learning. Here we look into the three-class classification for stars, galaxies, and quasars with photometric data. The learning process requires two independent datasets for model training and model validation respectively. Training and validation sets can be two nonoverlapping subsets from a common parent sample with both features (colors and/or magnitues) and class labels (star, galaxy, and quasar). Usually the class labels are given by spectroscopic identifications. The classification algorithm learns a mapping relation from features to class labels with the training set. Often, the trained classification model (classifier) is applied to another dataset without class labels (i.e. no spectroscopic identifications), which is called application set or test set. The classifier takes features from the test set as inputs (a.k.a. covariates) and gives class labels as outputs .
A basic assumption for traditional machine learning is that training and test data follow the same probability distribution (Bishop, 2006; Hastie et al., 2009; Vapnik, 2013). However, this assumption no longer holds if we use high- data for model training and low- data for application, because the joint distribution of inputs and outputs differs between training and test data (i.e. dataset shift; Quionero-Candela et al., 2009).
For our GPQ selections, the dataset shift includes changes in both source colors and prior probabilities of different classes. Sources in the Galactic plane become fainter and redder than those at high Galactic latitude due to greater reddening, which changes the distribution of input features and the conditional probability of the output labels given the inputs (i.e. covariate shift; Shimodaira, 2000; Sugiyama & Kawanabe, 2012). Prior probabilities of stars are much higher than those of quasars (and galaxies) in the Galactic plane, which means the marginal probability differs from that at high Galactic latitude (i.e. class-balance change; Saerens et al., 2002; Du Plessis & Sugiyama, 2014). Moreover, class ratio between extragalactic objects and stars may vary significantly from one place to another in the Galactic plane, which we refer to as “internal” class-balance change of the test data.
Transfer learning can be applied to improve the learning performance under dataset shift from a source domain to the target domain (see a review in Pan & Yang, 2009), where domain is a set that consists of a feature space and a marginal probability distribution , . For our classification task, the source domain data are from high Galactic latitude () and the target domain data are from the Galactic plane (). In this study we only care about areas at due to the limit of PS1 survey coverage. Comparison of some properties of the source and target domains are listed in Table 1.
Domains of learning | Location | Labels of stars | Labels of quasars/galaxies | Internal class-balance change |
---|---|---|---|---|
Source domain | Available | Available | Moderate | |
Target domain | Available | Unavailable | Severe |
As large numbers of stars, quasars and galaxies have been spectroscopically identified at , labels for these three classes are available in the source domain. Since spectroscopically identified samples of quasars and galaxies are significantly lacked at , labels for these two classes are unavailable in the target domain. Nevertheless, labels of many stars in the target domain are available with the help of LAMOST spectroscopic survey.
According to the classification scheme for different settings of transfer learning by Pan & Yang (2009), the set-up of classification in the Galactic plane can be categorized into Transductive Transfer Learning, where source domain labels are available and target domain labels are unavailable. A popular approach to Transductive Transfer Learning is Feature-based Transfer (e.g. Blitzer et al., 2006; Argyriou et al., 2006), which reduces the difference between the source and target domain through feature transformation in either one or both of the domains.
To solve the dataset shift problem of classification in the Galactic plane, we borrow the idea of Feature-based Transfer Learning. Using the mapping relation between the features of high- and low- objects, we can generate mock samples of quasars and galaxies in the Galactic plane to simulate the covariate change of their colors and magnitudes. The LAMOST Galactic plane stars also contribute to a more accurate probability distribution of data in the target domain. To reduce the effect of class-balance change, we manually go through two binary classification steps rather than running a three-class classification algorithm only once.
3.2 Modelling covariate change with mock samples
As data of LAMOST Galactic plane stars are available, we only focus on reducing the differences in features of extragalactic objects between training and test data. For our classification problem, all features will be constructed with photometric data from PS1 and AllWISE. We assume that the differences in photometric properties between extragalactic objects in the Galactic plane and those off the plane are only caused by different extinctions/reddening along their sight-lines. In this way, we can simply generate mock extragalactic objects behind the Galactic plane with data obtained at high Galactic latitude, using the mapping relation determined by the Galactic extinction law and Galactic dust map.
The covariate change can then be shown as color change of extragalactic objects on a set of color-color diagrams. Our classification will perform better by adding mock samples of quasars and galaxies behind the Galactic plane into the training set.
3.3 Dealing with class imbalance and class-balance change in machine learning
With the data-level improvements above, the covariate change can be reduced. Efforts on the algorithm level are required to handle the class imbalance and class-balance change. During the GPQ selections, instead of performing a star-quasar binary classification, we additionally take galaxies into account and perform a three-class classification.
Many machine learning software packages support multi-class classification jobs, by transforming the task into multiple binary classification problems. However, the built-in treatment is often inflexible and sometimes destructive when dealing with class imbalance problems. For example, in the scenario of using one-vs-rest (also known as one-vs-all) strategy for multi-class classification, at some stages, samples of one class are regarded as the positive samples while all samples of other classes are regarded as negative samples. Even if all the classes in the training set have a same sample size, the binary classification situation is imbalanced as the positive class (the “one”) has less samples than the negative class (the “rest”). In our case, the GPQ set (in both training and test set) has significantly less samples than the sets of galaxies and stars, thus severe class imbalance will happen.
To reduce the disadvantage of one-vs-rest strategy which is commonly used in machine learning algorithms, we convert this three-class classification problem into two binary classification problems manually. In the first step, the Galactic plane sources are classified into two classes: stars and extragalactic objects. Extragalactic objects are then classified into quasars and galaxies in the second step. By combining the two minority classes of quasar and galaxy into one, we would expect the class imbalance to be better controlled in the first step. The physical basis for merging the quasar and galaxy classes is that quasars are a special type of galaxies. For the second step, we expect that the quasar-to-galaxy ratio to be nearly constant across different locations in the Galactic plane. Thus the variable quasar-to-star or galaxy-to-star ratio is avoided and the internal class-balance change is lessened in the learning process.
4 Mock catalogs for quasars and galaxies behind the Galactic plane
In order to construct training samples for extragalactic objects, as well as understand the covariate shift of them from high Galactic latitude to the Galactic plane, we synthesize quasars and galaxies behind the Galactic plane using and samples. The synthesis is plausible if we assume the distribution of quasars on the celestial sphere is homogeneous and isotropic on large scale, just as the Cosmological principle has suggested. We not only observe the changes in colors of quasars and galaxies as they are placed in low Galactic latitudes in this modeling process, but also get a rough estimation on the sky distributions of the sources that could be detected by a certain sky survey.
4.1 Synthesizing procedures
Let be a set of extragalactic objects ( can be or ). The synthesis process consists of following steps.
1. Correcting for extinctions. Extinctions of objects in set are corrected according to a two-dimensional dust map provided by Planck Collaboration et al. (2014, hereafter Planck14), and the optical to mid-IR extinction law from Wang & Chen (2019) with . The values are retrieved using a Python module, dustmaps (Green, 2018).
2. Assigning new locations. We generate a random sample of points that are uniformly distributed on the sky with . The number of these random points is equal to the sample size of . Coordinates of these points are randomly assigned to objects of as their new locations. Now we get a new set (, ) without line-of-sight extinctions.
3. Adding new extinctions. We add extinctions to the sample using the Planck14 dust map based on their new (mock) locations.
4. Setting limiting magnitudes. We obtain a subset of by choosing sources brighter than the PS1 single epoch 5 depths in all PS1 passbands: . This subset, denoted as (, ), represents “good” mock sample that can be detected by PS1 survey in all bands. However, we don’t apply similar constraints to AllWISE bands as the magnitude which corresponds to a 5 sensitivity varies with location. Also, this extinction-selection effect relies more on the optical survey depth than the IR survey depth. Factors such as observation strategies and source confusions in dense fields are not taken into consideration in this step. Therefore we may overestimate the detection rate of GPQs (and galaxies) through this synthesis. We select sources within the PS1 footprint (i.e. ) and obtain set (, ).
5. Constructing training sets with mock and real data. For mock quasars that are not included in -PS1, their original counterparts (high- quasars in the input set ; denoted as ) are also added to the training and validation sets along with -PS1. For mock galaxies that are not included in-PS1, 25% of their original counterparts (high- galaxies in the input set ; denoted as ) are added to the training and validation sets. The resulted quasar and galaxy samples for training and validation are denoted as and , respectively. , , and the LAMOST Galactic plane star sample form the training and validation sets for machine learning classification.
By adding good mock samples and real data ( and ) together instead of using only good mock samples as training data for quasars or galaxies, we increase the data diversity as well as sample size of the training set. This data diversification ensures that the training set can provide more discriminative information for the machine learning model (Gong et al., 2019). In addition, more training data can help reduce overfitting.
The flowchart of the synthesizing procedures is displayed in Figure 2.

In the synthesizing process, we adopt the Planck14 dust map because it detects dust at a greater depth and better estimates the 2D extinctions in the Galactic plane than do the dust maps constructed with stellar photometry (e.g. Green et al., 2018, 2019). We assume a uniform although varies slightly in the Galaxy with a dispersion of about 0.18 (Schlafly et al., 2016). Such minor variations in can lead to small uncertainties of magnitudes and colors of individual mock quasars (galaxies), but have limited impacts on the statistical properties of the training sample because mock sources with large extinctions (and thus large uncertainties caused by variations) are removed by the magnitude limits, as we shall see in Section 4.2.
4.2 Synthesizing results and dataset shift
We define the extinction-based selection rate in the Galactic plane as , where is the cardinality, i.e. number of elements/sources of set . The source numbers of the input samples are and ; the source numbers of the output samples are and . Therefore the selection rates for GPQs and galaxies are and , respectively. The selection rate of galaxies is higher than that of GPQs because the input galaxies are on average brighter than the input quasars. With Step 2 in Section 4.1, sources of and are randomly and evenly distributed in the Galactic plane (). But after Step 4, the densities of remaining sources ( and ) are inversely related to the dust map (Figure 3): more extragalactic sources remain detectable in regions with smaller , and voids of detection present at regions with large . A sky survey deeper than PS1 might help make up some fraction of the gap in the middle of the Galactic plane. The sample is sparser compared to , simply because the input quasars are fewer than galaxies. Most sources of and have line-of-sight color excess of , which corresponds to extinction of with . The medians of line-of-sight of -PS1 and are 0.21 and 0.03, respectively. In general, -PS1 sample has significantly larger compared to (see Figure 4 (a)). Therefore covariate change for color indexes from high- to low- regions cannot be ignored. In addtion, -PS1 sources are fainter than sources (Figure 4 (b)).

A series of color-color diagrams for and -PS1 along with SDSS stars and Galactic plane point sources are shown in Figure 5 and Figure 6. In Figure 5, from the left to the middle panels, the covariate change of colors of quasars from high Galactic laititude to the Galactic plane can be directly observed. The Galactic reddening makes the cluster of GPQs in a color-color plane extend towards redder colors (to the upper right along the reddening vector) and scatter more than high- quasars. The scattering is greater in color indexes of bluer bands, while less at redder bands. This trend is also observable in the quasar evolutionary tracks with . From the top to the bottom panels in Figure 5, the distance between two quasar evolutionary tracks with different reddening decreases.
The covariate change of stellar colors is also evident from the color-color diagrams. The stellar loci are simple and clear for high- (SDSS) stars (see Figure 5 (1a, 2a, 3a)). However, additional spikes along the direction of increasing appear in the stellar loci of Galactic plane stars due to reddening, as can be seen from Figure 5 (1b, 1c, 2b, 2c). Therefore we expect to better distinguish stars from other sources using Galactic plane stars from LAMOST instead of high- SDSS stars in the training set.
Since mid-IR bands are less sensitive to extinction and reddening, the covariate change of AllWISE colors are less obvious than that of PS1 colors. For instance, in Figure 6 (a), the quasar evolutionary tracks with stay very close to each other. Since the AllWISE colors of quasars do not change much from high Galactic latitude to the Galactic plane, we can use the versus color-color diagram to examine the purity of the final quasar candidates, by comparing probability distributions of the candidates and -PS1 sources.
The AllWISE color-color diagram also gives “hardness” information on the classification problems that separating quasars from galaxies is harder than separating quasars from stars. In general, quasars have redder and colors than stars and galaxies due to the power-law SEDs and hot dust of quasars. From Figure 6 (b), we can recognize stellar locus (; on the lower-left) and galaxy locus (; on the middle-right) from the density plot. The quasar reference contour line marks a “quasar region” where most quasars are located in the versus diagram. Most stars are away from the quasar region, while a large number of galaxies enter the quasar region, indicating that such galaxies can contaminate the mid-IR quasar selection.
Some comparisons between mock quasars (-PS1) and mock galaxies (-PS1) behind the Galactic plane are also shown in Figure 7. In the color-color diagrams of PS1 bands, galaxies largely overlap with quasars (Figure 7 (a, b, c)); while on the versus plane, these two classes are slightly more separable (Figure 7 (d)). Except for the colors, the difference between PSF magnitude and Kron (1980) magnitude is often used as a morphological separator (Strauss et al., 2002; Farrow et al., 2014) for galaxies and point sources including quasars. However, separating quasars from galaxies becomes harder with at the faint end (Figure 7 (e)), as has been pointed out by Yang et al. (2017). Among all the million sources, are point sources with and , which can also been seen from Figure 7 (e). These point sources include few quasars with “galaxy” labels and may also include some stars that are misclassified as galaxies. We do not pay attention to these point sources because they only contribute to a tiny fraction of the whole galaxy sample.
To sum up, we examine the properties of , and PS1-AllWISE point-like sources in the color-color spaces. For quasar candidates selection, contamination from both stars and galaxies should be taken care of. Simple PS1 color cuts are only capable of selecting quasars that are away from the stellar loci. Using a series of PS1 colors in high-dimensional space might help reduce the overlap between the stellar loci and clusters of quasar and galaxy. Moreover, with AllWISE colors, quasars can be better separated from stars and galaxies. Therefore we expect that the combination of PS1 and AllWISE data will make quasar selection more efficient.
4.3 A rough estimation on the lower limit to the sky density of GPQs
An estimation on the sky density of GPQs will be useful for evaluating the final GPQ candidate sample and the selection method. However, the sky distribution in the middle panel of Figure 3 does not reflect the true density of GPQs due to two reasons: (i) the synthesizing process does not consider the source crowdedness and its effects on the photometric data quality; (ii) the source number of only depends on the size of input sample when the dust extinction map is fixed.
Let the density of be , then the relative density of quasars with good photometry across the Galactic plane is:
(1) |
where is the sky density of all PS1-AllWISE sources in the Galactic plane, and is the sky density of sources with good photometry as defined in Section 2.1 and Section 2.2. The fraction roughly quantifies the effects of source crowdedness on the photometric quality. We expect that the median sky density of GPQs is no higher than that of (), therefore the lower limit “absolute” sky density of GPQs can be computed as:
(2) |
where , and . Figure 8 shows the sky distribution of , , and . The estimated has a median of and a maximum of .

The predicted marginal probability of GPQs to the PS1-AllWISE sample with good photometry is , which ranges from to 0.17 with a median of . The maximum value of 0.17 is not reliable because it locates at edges of the HEALPix map (), where source count in a pixel does not correspond to the true number in the sky region.
5 GPQ candidate selections with XGBoost
We use XGBoost (Chen & Guestrin, 2016), a scalable tree boosting system, to perform machine learning classification for GPQ selection. XGBoost is an implementation of the original gradient boosting framework (Friedman et al., 2000; Friedman, 2001), known for high efficiency and outstanding performance in machine learning competitions (Chen & Guestrin, 2016). Compared to traditional Gradient Boosting Machines (GBM), XGBoost has made a few improvements in the algorithm level. For example, XGBoost includes regularization terms in the objective function to control the model complexity, therefore can reduce overfitting and improve the model generalization; XGBoost is optimized for sparse input data, i.e. data with missing values; Other than greedy algorithm by Friedman (2001), XGBoost supports a weighted quantile sketch algorithm that can more effectively find the optimal split points. Moreover, system enhancements for parallelization, tree pruning, and cache optimization have been integrated into XGBoost. Recently, XGBoost has been applied to astronomy and showed its capabilities of handling astronomical problems, including identifying Galactic candidates among unassociated sources from the Third Fermi Large Area Telescope (LAT) catalog (3FGL; Acero et al., 2015) (e.g. Mirabal et al., 2016), distinguishing M giants from M dwarfs for spectral surveys (e.g. Yi et al., 2019), and selecting quasar candidates with photometric data (e.g. Jin et al., 2019).
In order to obtain the optimal models, we use optuna (Akiba et al., 2019), a hyperparameter optimization framework to tune the learning hyperparameters. As has been mentioned in Section 3.3, we transform the three-class classification problem into two binary classification problems (stars versus extragalactic objects, and galaxies versus quasars). Under this setting, hyperparameters can be fine-tuned separately for the two classification steps. After classifying the Galactic plane sources with the two classifiers, we may use necessary additional criteria to ensure the purity of GPQ candidates. The classification scheme is shown in Figure 9.
A few evaluation metrics are used in the machine learning process: , , , , (Matthews correlation coefficient) and (Area Under the Precision-Recall Curve). With true positive denoted as , true negative as , false positive as and false negative as , the first five metrics are defined as:
(3) | |||
(4) | |||
(5) | |||
(6) | |||
(7) |
The Precision-Recall (PR) curve can be constructed by plotting precision-recall pairs (operating points) that are obtained using different thresholds on a probabilistic or other continuous-output classifier (Boyd et al., 2013). The can then be calculated with numerical integration methods.
Among the six metrics, , , and are commonly used. However, the and metrics fail to measure the classification performance correctly under class-imbalanced situations, because they will be heavily biased towards the majority class. For example, given a sample with 95 from the negative class and 5 from the positive class, simply classifying all instances as negative produces and . These two scores of metrics are misleading because all the positive instances are wrongly classified while the and are high. The last two metrics, and are considered better evaluation measures in class-imbalanced cases. The takes the four confusion matrix categories (, , , ) into account, and it is high only if the classifier makes good predictions on both positive and negative classes, independently of their ratios in the overall dataset (Chicco & Jurman, 2020). It is also suggested by studies that the PR curve is more informative than the more famous Receiver Operator Characteristic (ROC) curve (first recommended by Provost et al., 1998), especially on imbalanced datasets (Davis & Goadrich, 2006; Saito & Rehmsmeier, 2015). is useful as a measure of the overall performance of the model.
A total of 13 features are chosen for the two classification steps and the later photometric redshift regression, including 11 colors: , , , , , , , , , , and ; and two morphological features: and . As has been discussed in Section 4.2, using a set of PS1 colors (, , , and ) can help reduce the overlap between clusters of quasars and stellar loci on two-dimensional diagrams. Quasars have redder and colors than stars and galaxies, which makes these two colors good features for quasars selection. Jin et al. (2019) showed that three PS1-AllWISE colors (i.e. , , and ) can be used to efficiently distinguish quasars from stars and improve the performance of XGBoost classification. We construct similar colors as features by combining all PS1 bands and (i.e. , , , and ), because is the most sensitive one of AllWISE bands. These five new colors provide rough optical Spectral Energy Distributions (SEDs) for the objects and can characterize different objects with broader wavelength ranges than other optical colors (e.g. ) do. The difference between PSF magnitude and Kron magnitude in and bands ( and ) are used as morphological features to separate point sources (stars and quasars) from extended sources (galaxies). We convert Vega magnitude to AB magnitude for AllWISE data when constructing all the features. As we don’t set constraints on magnitude or , some sources may have poor or missing (and ) data. Nevertheless, the use of will not be a problem, because XGBoost can handle the missing values, and data with lower are more informative than missing values.

5.1 Binary classification for stars and extragalactic objects
In the first classification step, the input data for training and validation consist of synthetic quasar sample , synthetic galaxy sample (see Section 4.1), and LAMOST Galactic plane star sample (Section 2.6). The input data have more than 3 million rows. For binarization, we assign the label EXT (extragalactic object) to all and instances, and keep the label for as STAR. Here we regard extragalactic objects as the positive class and stars as the negative class.
We first apply five-fold cross validations with optuna to find the optimal setting of hyperparameters that minimizes the log loss among 500 trials. For a binary classification problem with a true label and a probability estimate , the log loss per sample is the negative log-likelihood of the classifier given the true label:
(8) | ||||
(9) |
Then we randomly split the whole input data into training set and validation set according to a ratio and calculate scores of the six metrics with the validation set. This split ratio is consistent with that of the five-fold cross validations. The large sample size of input data also ensures both training and validation sets have enough samples.
Some fixed parameters in our programs are: objective=binary:logistic; booster=gbtree; tree_method=hist. For hyperparameters that are tuned, the default values, optimal values found by the cross validations, and corresponding metric scores of these parameters are listed in Table 2. The number of boosting rounds (num_boost_round, a.k.a. n_estimators in scikit-learn API of XGBoost) is fixed to 100 and not tuned together with eta (a.k.a. learning_rate), because the effects of increasing num_boost_round can cancel that of decreasing eta, and vice versa. In the training process, we need to lower the learning rate eta and increase the num_boost_round to reduce the generalization error. The Classifier No.1 (CLF-1) is trained using , with other optimal parameters in Table 2.
Hyperparameter | Default | Optimal |
---|---|---|
eta (learning_rate) | 0.3 | 0.3 |
lambda (reg_lambda) | 1 | 2.32 |
alpha (reg_alpha) | 0 | 1.13 |
max_depth | 6 | 9 |
gamma (min_split_loss) | 0 | 0.60 |
grow_policy | depthwise | depthwise |
min_child_weight | 1 | 1 |
subsample | 1 | 0.96 |
colsample_bytree | 1 | 0.92 |
max_delta_step | 0 | 3 |
0.9993 | 0.9995 | |
0.9993 | 0.9995 | |
0.9995 | 0.9996 | |
0.9994 | 0.9996 | |
0.9985 | 0.9990 | |
0.9992 | 0.9995 |
We then classify the PS1-AllWISE point-like sources with CLF-1. To exclude as many stars as possible, we adopt a high threshold on (model-predicted probabilities of sources for being extragalactic) to select extragalactic candidates. Sources with are labeled as EXT, and the others are labeled as STAR and removed.
5.2 Binary classification for galaxies and quasars
We use and samples as input data for training and validation in the second classification step. Here we regard quasars as the positive class and galaxies as the negative class.
The same processes of parameter tuning and training as those of CLF-1 are applied to build CLF-2. We keep some parameters unchanged as: objective=binary:logistic; booster=gbtree; tree_method=hist. For hyperparameters that are tuned, the default values, optimal values found by the cross validations, and corresponding metric scores of these parameters are listed in Table 3. The CLF-2 is trained using , with other optimal parameters in Table 3.
The optimal scores of the six metrics in Table 3 are all lower than those in Table 2, indicating that the quasar–galaxy classification “hardness” is higher than that of star–extragalactic problem. Here we also use a high threshold of probability to select sources of our target class. We classify the sources labeled as EXT with CLF-2. Sources with are kept as GPQ candidates, where is the probability of a source for being a quasar predicted by the XGBoost model.
Hyperparameter | Default | Optimal |
---|---|---|
eta (learning_rate) | 0.3 | 0.2 |
lambda (reg_lambda) | 1 | 2.32 |
alpha (reg_alpha) | 0 | 1.10 |
max_depth | 6 | 9 |
gamma (min_split_loss) | 0 | 0.81 |
grow_policy | depthwise | depthwise |
min_child_weight | 1 | 2 |
subsample | 1 | 0.95 |
colsample_bytree | 1 | 0.86 |
max_delta_step | 0 | 6 |
0.9969 | 0.9974 | |
0.9894 | 0.9905 | |
0.9894 | 0.9918 | |
0.9894 | 0.9912 | |
0.9875 | 0.9896 | |
0.9938 | 0.9951 |
5.3 Additional cut based on Gaia proper motion to remove stellar contaminants
In the first classification process, we classify all PS1-AllWISE point-like sources to stars and extragalactic objects. We ignore stars in the second classification step. Although the metrics of CLF-1 are high (Table 2), some stars can be misclassified as extragalactic objects, and then be classified either as quasars or galaxies. Faint stars are more likely to be misclassified than bright stars because stars in the training sample () are biased towards the bright end. When using optical and near-IR colors for candidate selection, white dwarfs are major contaminants for low redshift quasars, and M/L/T dwarfs are typical contaminants for high redshift quasars (e.g. Kirkpatrick et al., 1997; Vennes et al., 2002; Chiu et al., 2006). In the mid-IR regime, potential stellar contaminants for quasars are Young Stellar Objects (YSO), Asymptotic Giant Branch (AGB) stars, and Planetary Nebulae (PNe) (Kozłowski & Kochanek, 2009; Koenig & Leisawitz, 2014; Assef et al., 2018).
The YSOs are stars at the early stages of evolution, and are often divided into four subclasses (Lada, 1987): Class I, Class II, Flat spectrum, and Class III. Among them, Class II and Flat spectrum YSOs are the most-likely contaminants since they have optical and mid-IR SEDs similar to those of quasars. Since we require both optical and mid-IR detections for classification, optically faint Class I YSOs are eliminated in the first place. As has been studied by Koenig & Leisawitz (2014), Class III YSOs are clustered around and , while Class I, II, and Flat spectrum YSOs occupy the region approximately with and (see their Figure 5). The latter YSO region is overlapped with the quasar region shown in Figure 6, therefore the contamination should be taken care of.
AGB stars are evolved stars with low temperatures and high luminosities. They are surrounded by circumstellar envelopes, and IR excess exists in their broad SEDs. According to Figure 5 from Koenig & Leisawitz (2014), only a minority of AGB stars actually overlap with Class I, II, and Flat spectrum YSOs (and thus quasars) in the versus diagram. Therefore we expect the contamination from AGB stars is less than that from YSOs.
The PNe have a series of narrow emission lines as well as IR excess. Known PNe can be later removed from the GPQ candidate sample by cross-matching with the Simbad database (Wenger et al., 2000).
In order to remove stellar contaminants such as white dwarfs, M/L/T dwarfs, YSOs, and AGB stars from GPQ candidates, we apply an additional cut based on Gaia proper motion, because the proper motion distribution of quasars is different from that of Milky Way stars. Although quasars should have negligible transverse motions, non-zero proper motions of them are measured by Gaia due to various effects, such as photocenter variability of quasars (see Bachchan et al., 2016, and references therein). In addition, proper motions with large uncertainties are not reliable. Therefore we need a probabilistic cut instead of a cut on the total proper motion. We define the probability density of zero proper motion () of a source based on the bivariate normal distribution of proper motion measurements of the source as:
(10) |
where , , and (correlation coefficient between pmra and pmdec) are obtained from Gaia DR2 catalog, while and are the true external proper motion uncertainties calculated with the method suggested by Lindegren et al. (2018a, b). The external proper motion uncertainty can be expressed as , where can be or , is a multiplicative factor, is the catalog uncertainty (pmra_error or pmdec_error), and is the systematic error. For bright sources (), ; for faint sources (), . Under the same uncertainty level, sources with smaller proper motions will have higher by definition.
We take the logarithm of for better comparison between samples. Figure 10 shows distributions of of stars, galaxies and quasars used in this study. For stellar samples, in addition to (LAMOST Galactic plane star sample), a subsample of the SDSS Stripe 82 Standard Star Catalog (hereafter S82 star; Ivezić et al., 2007) that meets the same constraints in Section 2.1 and Section 2.2 is also included for comparison. We choose a cut that excludes 94.1% of both LAMOST Galactic plane stars and S82 stars, while retains 99.8% of the quasars. Nevertheless, faint stars can be major contaminants even with such strict cut on .

We calculate for GPQ candidates after cross-matching them with Gaia DR2. For candidates without Gaia DR2 proper motion records, we assign a default value of 99 for . Sources with are kept as reliable GPQ candidates.
6 Photometric redshift estimation for GPQ candidates
Measuring redshifts is an important step for quasar surveys. For quasar candidates, photometric redshifts (photo-) estimation is a key to follow-up studies. Many different approaches have been proposed for calculating photo-s of quasars, including quasar template fitting (e.g. Budavári et al., 2001; Babbedge et al., 2004; Salvato et al., 2009), the empirical color-redshift relation (e.g. Richards et al., 2001; Weinstein et al., 2004; Wu et al., 2004; Wu & Jia, 2010; Wu et al., 2012), machine learning (e.g. Yèche et al., 2010; Laurino et al., 2011; Brescia et al., 2013; Zhang et al., 2013; Pasquet-Itam & Pasquet, 2018), XDQSOz method (Bovy et al., 2012), and Skew-QSO method (Yang et al., 2017). As the photo- estimation problem can be well described by the regression problem in machine learning, we also use XGBoost to train the regression model and predict photo-s for our reliable GPQ candidates.
To build the training set and validation set, we randomly split the de-reddened sample with a ratio of 4:1. Our application set (reliable GPQ candidates) is also de-reddened. The same 13 features as those in Section 5 are used for photo- regression: , , , , , , , , , , , and . The morphological features and are included as they may help distinguish quasars at different cosmological distances. To obtain the optimal model, we also tune the parameters with five-fold cross-validations using optuna.

The performance of the XGBoost photo- regression model on the test set can be examined in - (photometric redshift versus spectral redshift) plot (Figure 11) or with two quantities: the root-mean-square error (RMSE) and photo- accuracy. For a validation set with sample size , the root-mean-square error is . On our validation set with a sample size of 57,855, the RMSE is 0.35. The photo- accuracy is defined as the fraction of quasars with , where . Our XGBoost regression model yields a photo- accuracy of 74% on the validation set, which is comparable to that of Yang et al. (2017) on PS1 and WISE data (79%). Yang et al. (2017) adopted a multivariate Skew-t model and prior probabilities from the quasar luminosity function (QLF) to achieve the high photo- accuracy. Figure 12 shows the photo- accuracy as a function of spectral redshift (left panel) and de-reddened -band magnitude (right panel) respectively. has maximum values at and , and reaches a minimum at . Most quasars have underestimated photometric redshifts (see Figure 11) due to a degeneracy of broad-band photometry in response to quasar SEDs at different redshifts. The strong Ly emission line enters band at and moves into band at , which leads to large excess in magnitudes and hence similar PS1 colors of quasars within . This kind of degeneracy can be alleviated if SDSS -band data are available to characterize the Lyman limit systems (see Section 4 of Yang et al., 2017). The photo- accuracy is improved at , because the Lyman limit enters band. also drops at low redshift (), and at both bright and faint ends, because the training sample is biased towards intermediate redshifts and magnitudes.

7 The GPQ candidate catalog
7.1 Validation of the GPQ candidates with Simbad, Milliquas, and SDSS DR16Q
With our Transfer Learning Framework and aforementioned additional selection criteria, we obtain a reliable GPQ candidate sample with 161,532 sources from PS1 and AllWISE. We cross-match the GPQ candidates with Simbad database (Wenger et al., 2000), and find 2,786 matches. The object types and summary are shown in Table 4.
We categorize all matched sources to four groups: AGN/QSO, star (including PNe and PN candidates), galaxy, and other-type objects. Among all the matches, 53.98% (1,504) are recorded as AGN/QSO (including candidates), 8.97% (250) are recorded as star (including candidates), 4.02% (112) are recorded as galaxy, and 33.02% (920) are other-type objects labeled according to the detection properties (e.g. wavelength). Those other-type objects have higher probabilities to be AGN/QSO than stars, as most (728+27) of them are radio sources, and 64 are X-ray sources (see Table 4). For the 4.02% sources labeled as galaxy, a number of them may also host AGN/QSO as we have applied careful selection criteria to remove possible galaxy contaminants. Among the 250 sources labeled as star, 40 are candidates, and the other 210 are known stars. 101 of the known stars were once selected as QSO candidates using SDSS photometry, and then identified as stars by the 2dF-SDSS LRG and QSO Survey (Croom et al., 2009). From these analyses, we can conclude that, the purity of our GPQ candidates on the small subset of 2,786 Simbad matches can be as high as %. The true purity may vary at different locations in the Galactic plane.
AGN/QSO | Number | Star | Number | Galaxy | Number | Other types | Number |
---|---|---|---|---|---|---|---|
QSO | 1121 | Star | 175 | Galaxy | 106 | Radio source | 728 |
AGN candidate | 150 | YSO Candidate | 22 | Radio galaxy | 3 | IR source | 77 |
BL Lac object | 143 | YSO | 13 | Brightest galaxy in a cluster (BCG) | 2 | X-ray source | 64 |
Seyfert 1 galaxy | 38 | Cataclysmic binary candidate | 7 | Cluster of galaxies | 1 | Centimetric radio source | 27 |
AGN | 31 | Planetary Nebula (PN) | 5 | Blue object | 22 | ||
Other subclasses | 21 | Other subclasses | 28 | Far-IR source () | 2 | ||
Total | 1504 | Total | 250 | Total | 112 | Total | 920 |
Fraction | 53.98% | Fraction | 8.97% | Fraction | 4.02% | Fraction | 33.02% |
As another test on stellar contamination, we cross-match our GPQ candidates with LAMOST Galactic plane star sample. This match identifies 29 LAMOST stars, all of which are not recorded in Simbad. Therefore the total number of known stars in the GPQ candidates is 239.
We also examine the fraction of known GPQs that can be recovered with our candidates table. The known GPQ sample is with 1,853 sources, which is retrieved from Milliquas catalog and described in Section 2.7. The is selected with same constraints as those on our application PS1-AllWISE data, to get a consistent analysis result. Cross-matching with our GPQ candidates results in 1,763 matches, meaning that 95.14% of GPQs from Milliquas can be selected with our methods, under same quality constraints on the photometric data. The recent sixteenth data release of SDSS Quasar Catalog (DR16Q; Lyke et al., 2020) has a total of 750,414 sources, in which 3,727 sources are located at . Only 1,320 of these SDSS GPQs meet the photometric quality constraints in Section 2.1 and Section 2.2. Cross-matching our GPQ candidates with SDSS DR16Q gives 1,292 matches, which corresponds to a recall rate of 97.88% under the same photometric quality constraints, or a recall rate of 34.67% to the whole identified sample. The overall completeness of the sample of candidates is mainly limited by the photometric quality constraints.
7.2 Description of the GPQ candidate catalog
We remove 239 known stars (see Section 7.1) from our GPQ candidate sample. We then match the remaining GPQ candidates by coordinates with TOPCAT internally, and found 347 close pairs within . These pairs are very likely duplicated sources, because the PS1 survey can not resolve two sources within an angular distance of . The median image quality for PS1 survey is FWHM = (1.31, 1.19, 1.11, 1.07, 1.02) arcseconds for () (Magnier et al., 2016). Therefore we only keep one source for each close pair, and obtain the final GPQ candidates sample with sources. The GPQ candidate catalog is compiled based on this sample, with photometric data from PS1 DR1, AllWISE, and astrometric data from Gaia DR2. The descriptions for the catalog are displayed in Table 5.
Column | Units | Label | Explanations |
---|---|---|---|
1 | — | Designation | Catalog designation hhmmss.ss+ddmmss.s (J2000) based on Pan-STARRS1 (PS1) coordinates |
2 | deg | ra | PS1 right ascension in decimal degrees (J2000) (weighted mean) at mean epoch |
3 | deg | dec | PS1 declination in decimal degrees (J2000) (weighted mean) at mean epoch |
4 | deg | l | Galactic longitude in decimal degrees |
5 | deg | b | Galactic latitude in decimal degrees |
6 | — | photoz | Photometric redshift predicted with XGBoost regressor |
7 | — | p_star | Probability of the object to be a star, predicted by the first XGBoost classifier, a.k.a. (p_star+p_ext=1) |
8 | — | p_ext | Probability of the object to be an extragalactic object, predicted by the first XGBoost classifier, a.k.a. (p_star+p_ext=1) |
9 | — | p2_gal | Probability of the object to be a galaxy, predicted by the second XGBoost classifier, a.k.a. (p2_gal+p2_qso=1) |
10 | — | p2_qso | Probability of the object to be a quasar, predicted by the second XGBoost classifier, a.k.a. (p2_gal+p2_qso=1) |
11 | — | fpm0 | Probability density of zero proper motion () of the source |
12 | — | log_fpm0 | The logarithm of fpm0 () |
13 | mag | ebv | Line-of-sight given by the Planck14 dust map |
14 | — | PS_objID | Pan-STARRS1 (PS1) unique object identifier |
15 | mag | gmag | Mean PSF AB magnitude from PS1 g filter detections |
16 | mag | e_gmag | Error in gmag |
17 | mag | gKmag | Mean Kron AB magnitude from PS1 g filter detections |
18 | mag | e_gKmag | Error in gKmag |
19 | mag | rmag | Mean PSF AB magnitude from PS1 r filter detections |
20 | mag | e_rmag | Error in rmag |
21 | mag | rKmag | Mean Kron AB magnitude from PS1 r filter detections |
22 | mag | e_rKmag | Error in rKmag |
23 | mag | imag | Mean PSF AB magnitude from PS1 i filter detections |
24 | mag | e_imag | Error in imag |
25 | mag | iKmag | Mean Kron AB magnitude from PS1 i filter detections |
26 | mag | e_iKmag | Error in iKmag |
27 | mag | zmag | Mean PSF AB magnitude from PS1 z filter detections |
28 | mag | e_zmag | Error in zmag |
29 | mag | zKmag | Mean Kron AB magnitude from PS1 z filter detections |
30 | mag | e_zKmag | Error in zKmag |
31 | mag | ymag | Mean PSF AB magnitude from PS1 y filter detections |
32 | mag | e_ymag | Error in ymag |
33 | mag | yKmag | Mean Kron AB magnitude from PS1 y filter detections |
34 | mag | e_yKmag | Error in yKmag |
35 | — | AllWISE_ID | AllWISE unique source ID |
36 | mag | W1mag | W1 (Vega) magnitude (3.35m) |
37 | mag | e_W1mag | Mean error on W1 magnitude |
38 | mag | W2mag | W2 (Vega) magnitude (4.6m) |
39 | mag | e_W2mag | Mean error on W2 magnitude |
40 | mag | W3mag | W3 (Vega) magnitude (11.6m) |
41 | mag | e_W3mag | Mean error on W3 magnitude |
42 | mag | W4mag | W4 (Vega) magnitude (22.1m) |
43 | mag | e_W4mag | Mean error on W4 magnitude |
44 | mag | Jmag | 2MASS J (Vega) magnitude (1.25m) |
45 | mag | e_Jmag | Mean error on J magnitude |
46 | mag | Hmag | 2MASS H (Vega) magnitude (1.65m) |
47 | mag | e_Hmag | Mean error on H magnitude |
48 | mag | Kmag | 2MASS Ks (Vega) magnitude (2.17m) |
49 | mag | e_Kmag | Mean error on Ks magnitude |
50 | — | Gaia_source_id | Gaia DR2 unique source identifier |
51 | mas | parallax | Gaia DR2 parallax |
52 | mas | parallax_error | Standard error of parallax |
53 | mas/yr | pmra | Gaia DR2 proper motion in right ascension direction |
54 | mas/yr | pmra_error | Standard error of pmra |
55 | mas/yr | pmdec | Gaia DR2 proper motion in declination direction |
56 | mas/yr | pmdec_error | Standard error of pmdec |
57 | — | pmdec_pmdec_corr | Correlation between pmra and pmdec |
58 | mas/yr | pmra_error_ext | True external uncertainty of pmra |
59 | mas/yr | pmdec_error_ext | True external uncertainty of pmdec |
60 | — | sb_main_id | Main identifier for an object in Simbad database |
61 | — | sb_main_type | Main object type for an object in Simbad database |
62 | — | sb_redshift | Redshift of an object recorded in Simbad database |
Note. — This table is published in its entirety in the machine-readable format.
The sky density of sources from the GPQ candidate catalog is shown in Figure 13. In general, the sky distribution of the GPQ candidates is consistent with the prediction in Section 4.3. The highest sky density of the candidates is 72.7 , which is slightly higher than the estimation (66.7 ). The median density is 16.7 , which is comparable but lower than the estimated value (or the median density of ). As can be seen from Figure 13, the sky densities of GPQ candidates at are lower than those of the estimation in Figure 8 (d), which indicates that the modelling process overestimates the sky density of GPQs at lower Galactic latitudes. The region with () is blank, because it is not covered by the PS1 survey.

The distributions of de-reddened magnitudes and photometric redshifts of our GPQ candidates are displayed in Figure 14 (a). The lowest and highest photometric redshift are and , respectively. Taking into account the uncertainties in photo- estimations, the actual highest redshift of the GPQs can be up to 5. Five peaks appear in the histogram of photometric redshift (Figure 14 (a)) at (0.8, 1.2, 1.7, 2.1, 2.4), which are caused by selection effects and sample bias of the training set. Quasars with these redshifts have higher chances to be selected with PS1 photometry: (i) when , the Mg ii emission line enters band; (ii) when , Mg ii enters band; (iii) when , C iii] enters band, and Mg ii enters band; (iv) when , both Si iv and C iv line enter band, and C iii] enters band; and (v) when , Ly and Si iv enter band.
The distributions of de-reddened magnitudes and spectroscopic redshifts of sample from SDSS DR14Q are also shown in Figure 14 (b) for comparison. The sample and the sample of GPQ candidates have similar redshift distributions with some subtle differences. The magnitude distributions are also similar to each other, except that has a larger fraction of bright sources () than the GPQ candidates. Such differences in both redshift and magnitude distributions of these two samples are mainly due to their different target selection strategies. Our GPQ candidates are selected from one single parent sample, while SDSS DR14Q includes many quasar samples in various redshift and magnitude ranges (see Section 2.2 of Pâris et al., 2018).
The color-color properties of sources from the GPQ candidate catalog are shown in Figure 15. In general, GPQ candidates have color-color distributions that are well matched to those of -PS1 (see Figure 5 and 6). The unimodal structures seen from both AllWISE and PS1 colors imply a low level of contamination from stars and galaxies. However, contamination from stars can be recognized from the versus diagram, where some sources are concentrated along the stellar locus (see the slightly contaminated region “SC” in Figure 15 (c)). We apply no cut on the “SC” region because it only contains 7,892 sources (4.90% of whole catalog) and any cut is likely to also remove reddened quasars (see Figure 5).
8 Summary and conclusions
We present a Transfer Learning Framework for quasar selection, and its application on finding GPQs. We construct mock samples of quasars and galaxies behind the Galactic plane, by assigning new locations and extinction values to the extinction-corrected high- SDSS extragalactic sources. We use PS1 limiting magnitudes to select good mock sources, and compare them with high- sources in color-color spaces. We show that the covariate change of source colors is significant from high- regions to the Galactic plane. We synthesize training and validation data for machine learning with: (i) good mock samples, (ii) SDSS extragalactic sources which do not have counterparts in the good mock samples, and (iii) real LAMOST Galactic plane star sample.
We apply XGBoost algorithm for machine learning in this study. To help reduce the effects of class imbalance and class-balance change, we turn the three-class classification task (star, galaxy, and quasar) into two binary classification problems. A total of 13 features are used for the two classification steps: , , , , , , , , , , , and . In order to remove star and galaxy contaminants, we use high thresholds of model-predicted probabilities () to select extragalactic and quasar candidates. We perform an additional cut on probability density of zero proper motion () based on Gaia DR2 data to further reduce stellar contamination. Using the extinction-corrected SDSS DR14Q sources, we build the photometric redshift estimator with on the validation set.
Our GPQ candidate sample is validated with Simbad database and Milliquas catalog. The purity of quasars is % on the Simbad matches. Under our constraints for good PS1 and AllWISE photometry, 95.14% of the GPQs in the Milliquas catalog and 97.88% of the GPQs from SDSS DR16Q can be recalled with our GPQ candidate sample. The photometric quality constraints ensure reliability of the candidates, but at the cost of lower overall completeness of the candidate sample. The sky density of GPQ candidates is consistent with the estimation based on mock GPQ catalog. The median marginal probability of GPQs to the PS1-AllWISE sample with good photometry is and the lowest marginal probability is . We compile the GPQ candidate catalog after removing known stars in Simbad and LAMOST, and some duplicated sources. The GPQ candidate catalog consists of sources. In addtion to our machine learning predictions, we include PS1 and AllWISE photometry, as well as Gaia DR2 astrometry in the table. The GPQ candidate sample has a broad redshift coverage (), indicating that our selection methods can be used on wide redshift ranges.
Colors of GPQ candidates agree well with those of the mock GPQ catalog, which also indicates high purity of the candidates even though the marginal probability is low. Contamination from stars and galaxies still exists in the GPQ candidate sample, but at a low level. Because most stars in the training sample () are bright ones, identifying and removing faint stars can be challenging for the XGBoost classification model (CLF-1). Using colors instead of magnitudes as features helps to lessen the effects of such training sample bias. The strict cut can additionally remove most stellar contaminants. Galaxies overlap heavily with quasars on PS1 color-color diagrams, and show similar distribution with quasars. The use of AllWISE colors ( and ) and morphological separators ( and ) largely aids the galaxy-quasar classification. For future GPQ candidate selections, we expect to improve the machine learning performance by compiling a Galactic plane star training sample with more stars in the faint end.
We have been carrying out a series of spectroscopic identification for GPQ candidates since 2018, using optical telescopes including two-meter telescopes based at Lijiang and Xinglong in China and Siding Spring in Australia, and 200-inch Hale Telescope in the US. The success rate of identifying new GPQs is 90% in our spectroscopic campaign, which is consistent with the estimated reliability of the GPQ candidate catalog. We have also been exploring the LAMOST spectral data to find new GPQs. All these efforts yield promising results that will be presented in the next paper of this series.
References
- Acero et al. (2015) Acero, F., Ackermann, M., Ajello, M., et al. 2015, ApJS, 218, 23
- Aguado et al. (2019) Aguado, D. S., Ahumada, R., Almeida, A., et al. 2019, ApJS, 240, 23
- Akiba et al. (2019) Akiba, T., Sano, S., Yanase, T., Ohta, T., & Koyama, M. 2019, in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2623–2631
- Arenou et al. (2018) Arenou, F., Luri, X., Babusiaux, C., et al. 2018, A&A, 616, A17
- Argyriou et al. (2006) Argyriou, A., Evgeniou, T., & Pontil, M. 2006, in Proceedings of the 19th International Conference on Neural Information Processing Systems, 41–48
- Assef et al. (2018) Assef, R. J., Stern, D., Noirot, G., et al. 2018, ApJS, 234, 23
- Astropy Collaboration et al. (2013) Astropy Collaboration, Robitaille, T. P., Tollerud, E. J., et al. 2013, A&A, 558, A33
- Babbedge et al. (2004) Babbedge, T. S. R., Rowan-Robinson, M., Gonzalez-Solares, E., et al. 2004, MNRAS, 353, 654
- Bachchan et al. (2016) Bachchan, R. K., Hobbs, D., & Lindegren, L. 2016, A&A, 589, A71
- Bailer-Jones et al. (2019) Bailer-Jones, C. A. L., Fouesneau, M., & Andrae, R. 2019, MNRAS, 490, 5615
- Becker et al. (2001) Becker, R. H., White, R. L., Gregg, M. D., et al. 2001, ApJS, 135, 227
- Ben Bekhti et al. (2008) Ben Bekhti, N., Richter, P., Westmeier, T., & Murphy, M. T. 2008, A&A, 487, 583
- Ben Bekhti et al. (2012) Ben Bekhti, N., Winkel, B., Richter, P., et al. 2012, A&A, 542, A110
- Bishop (2006) Bishop, C. M. 2006, Pattern recognition and machine learning (springer)
- Blanton et al. (2017) Blanton, M. R., Bershady, M. A., Abolfathi, B., et al. 2017, AJ, 154, 28
- Blitzer et al. (2006) Blitzer, J., McDonald, R., & Pereira, F. 2006, in Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, 120–128
- Bovy et al. (2011) Bovy, J., Hennawi, J. F., Hogg, D. W., et al. 2011, ApJ, 729, 141
- Bovy et al. (2012) Bovy, J., Myers, A. D., Hennawi, J. F., et al. 2012, ApJ, 749, 41
- Boyd et al. (2013) Boyd, K., Eng, K. H., & Page, C. D. 2013, in Joint European conference on machine learning and knowledge discovery in databases, Springer, 451–466
- Brescia et al. (2013) Brescia, M., Cavuoti, S., D’Abrusco, R., Longo, G., & Mercurio, A. 2013, ApJ, 772, 140
- Budavári et al. (2001) Budavári, T., Csabai, I., Szalay, A. e. S., et al. 2001, AJ, 122, 1163
- Chambers et al. (2016) Chambers, K. C., Magnier, E. A., Metcalfe, N., et al. 2016, arXiv e-prints, arXiv:1612.05560
- Chen & Guestrin (2016) Chen, T., & Guestrin, C. 2016, in Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, 785–794
- Chicco & Jurman (2020) Chicco, D., & Jurman, G. 2020, BMC genomics, 21, 6
- Chiu et al. (2006) Chiu, K., Fan, X., Leggett, S. K., et al. 2006, AJ, 131, 2722
- Condon et al. (1998) Condon, J. J., Cotton, W., Greisen, E., et al. 1998, AJ, 115, 1693
- Croom et al. (2009) Croom, S. M., Richards, G. T., Shanks, T., et al. 2009, MNRAS, 392, 19
- Cui et al. (2012) Cui, X.-Q., Zhao, Y.-H., Chu, Y.-Q., et al. 2012, Research in Astronomy and Astrophysics, 12, 1197
- Davis & Goadrich (2006) Davis, J., & Goadrich, M. 2006, in Proceedings of the 23rd international conference on Machine learning, 233–240
- Deng et al. (2012) Deng, L.-C., Newberg, H. J., Liu, C., et al. 2012, Research in Astronomy and Astrophysics, 12, 735
- Dobrzycki et al. (2003) Dobrzycki, A., Macri, L. M., Stanek, K. Z., & Groot, P. J. 2003, AJ, 125, 1330
- Du Plessis & Sugiyama (2014) Du Plessis, M. C., & Sugiyama, M. 2014, Neural Networks, 50, 110
- Fan et al. (2001) Fan, X., Narayanan, V. K., Lupton, R. H., et al. 2001, AJ, 122, 2833
- Farrow et al. (2014) Farrow, D. J., Cole, S., Metcalfe, N., et al. 2014, MNRAS, 437, 748
- Fischer et al. (2019) Fischer, T. C., Rigby, J. R., Mahler, G., et al. 2019, ApJ, 875, 102
- Flesch (2015) Flesch, E. W. 2015, PASA, 32, e010
- Flesch (2019) —. 2019, arXiv e-prints, arXiv:1912.05614
- Friedman et al. (2000) Friedman, J., Hastie, T., Tibshirani, R., et al. 2000, The annals of statistics, 28, 337
- Friedman (2001) Friedman, J. H. 2001, Annals of statistics, 1189
- Gaia Collaboration et al. (2016) Gaia Collaboration, Prusti, T., De Bruijne, J., et al. 2016, A&A, 595, A1
- Gaia Collaboration et al. (2018a) Gaia Collaboration, Mignard, F., Klioner, S. A., et al. 2018a, A&A, 616, A14
- Gaia Collaboration et al. (2018b) Gaia Collaboration, Brown, A. G. A., Vallenari, A., et al. 2018b, A&A, 616, A1
- Glikman et al. (2006) Glikman, E., Helfand, D. J., & White, R. L. 2006, ApJ, 640, 579
- Gong et al. (2019) Gong, Z., Zhong, P., & Hu, W. 2019, IEEE Access, 7, 64323
- Górski et al. (2005) Górski, K. M., Hivon, E., Banday, A. J., et al. 2005, ApJ, 622, 759
- Grazian et al. (2000) Grazian, A., Cristiani, S., D’Odorico, V., Omizzolo, A., & Pizzella, A. 2000, AJ, 119, 2540
- Green (2018) Green, G. 2018, The Journal of Open Source Software, 3, 695
- Green et al. (2019) Green, G. M., Schlafly, E., Zucker, C., Speagle, J. S., & Finkbeiner, D. 2019, ApJ, 887, 93
- Green et al. (2018) Green, G. M., Schlafly, E. F., Finkbeiner, D., et al. 2018, MNRAS, 478, 651
- Green et al. (1986) Green, R. F., Schmidt, M., & Liebert, J. 1986, ApJS, 61, 305
- Gregg et al. (1996) Gregg, M. D., Becker, R. H., White, R. L., et al. 1996, AJ, 112, 407
- Hastie et al. (2009) Hastie, T., Tibshirani, R., & Friedman, J. 2009, The elements of statistical learning: data mining, inference, and prediction (Springer Science & Business Media)
- Hernán-Caballero et al. (2016) Hernán-Caballero, A., Hatziminaoglou, E., Alonso-Herrero, A., & Mateos, S. 2016, MNRAS, 463, 2064
- Huo et al. (2010) Huo, Z.-Y., Liu, X.-W., Yuan, H.-B., et al. 2010, Research in Astronomy and Astrophysics, 10, 612
- Huo et al. (2013) Huo, Z.-Y., Liu, X.-W., Xiang, M.-S., et al. 2013, AJ, 145, 159
- Huo et al. (2015) Huo, Z.-Y., Liu, X.-W., Xiang, M.-S., et al. 2015, Research in Astronomy and Astrophysics, 15, 1438
- Im et al. (2007) Im, M., Lee, I., Cho, Y., et al. 2007, ApJ, 664, 64
- Ivezić et al. (2007) Ivezić, Ž., Smith, J. A., Miknaitis, G., et al. 2007, AJ, 134, 973
- Jin et al. (2019) Jin, X., Zhang, Y., Zhang, J., et al. 2019, MNRAS, 485, 4539
- Kirkpatrick et al. (1997) Kirkpatrick, J. D., Henry, T. J., & Irwin, M. J. 1997, AJ, 113, 1421
- Koenig & Leisawitz (2014) Koenig, X. P., & Leisawitz, D. T. 2014, ApJ, 791, 131
- Kozłowski & Kochanek (2009) Kozłowski, S., & Kochanek, C. S. 2009, ApJ, 701, 508
- Kron (1980) Kron, R. G. 1980, ApJS, 43, 305
- Lacy et al. (2004) Lacy, M., Storrie-Lombardi, L. J., Sajina, A., et al. 2004, ApJS, 154, 166
- Lada (1987) Lada, C. J. 1987, in Symposium-International astronomical union, Vol. 115, Cambridge University Press, 1–18
- Laurino et al. (2011) Laurino, O., D’Abrusco, R., Longo, G., & Riccio, G. 2011, MNRAS, 418, 2165
- Lindegren et al. (2018a) Lindegren, L., Hernández, J., Bombrun, A., et al. 2018a, A&A, 616, A2
- Lindegren et al. (2018b) Lindegren, L., Hernández, J., Bombrun, A., et al. 2018b, in IAU 30 GA - Division A: Fundamental Astronomy
- Luo et al. (2012) Luo, A. L., Zhang, H.-T., Zhao, Y.-H., et al. 2012, Research in Astronomy and Astrophysics, 12, 1243
- Luo et al. (2015) Luo, A.-L., Zhao, Y.-H., Zhao, G., et al. 2015, Research in Astronomy and Astrophysics, 15, 1095
- Lyke et al. (2020) Lyke, B. W., Higley, A. N., McLane, J. N., et al. 2020, ApJS, 250, 8
- Magnier et al. (2016) Magnier, E. A., Schlafly, E. F., Finkbeiner, D. P., et al. 2016, arXiv e-prints, arXiv:1612.05242
- Mainzer et al. (2011) Mainzer, A., Bauer, J., Grav, T., et al. 2011, ApJ, 731, 53
- Mateos et al. (2012) Mateos, S., Alonso-Herrero, A., Carrera, F. J., et al. 2012, MNRAS, 426, 3271
- Mirabal et al. (2016) Mirabal, N., Charles, E., Ferrara, E. C., et al. 2016, ApJ, 825, 69
- Palanque-Delabrouille et al. (2011) Palanque-Delabrouille, N., Yeche, C., Myers, A. D., et al. 2011, A&A, 530, A122
- Pan & Yang (2009) Pan, S. J., & Yang, Q. 2009, IEEE Transactions on knowledge and data engineering, 22, 1345
- Pâris et al. (2018) Pâris, I., Petitjean, P., Aubourg, É., et al. 2018, A&A, 613, A51
- Pasquet-Itam & Pasquet (2018) Pasquet-Itam, J., & Pasquet, J. 2018, A&A, 611, A97
- Pedregosa et al. (2011) Pedregosa, F., Varoquaux, G., Gramfort, A., et al. 2011, the Journal of machine Learning research, 12, 2825
- Planck Collaboration et al. (2014) Planck Collaboration, Abergel, A., Ade, P. A. R., et al. 2014, A&A, 571, A11
- Pounds (1979) Pounds, K. A. 1979, Proceedings of the Royal Society of London Series A, 366, 375
- Price-Whelan et al. (2018) Price-Whelan, A. M., Sipőcz, B. M., Günther, H. M., et al. 2018, AJ, 156, 123
- Provost et al. (1998) Provost, F., Fawcett, T., & Kohavi, R. 1998, in ICML Conference
- Quionero-Candela et al. (2009) Quionero-Candela, J., Sugiyama, M., Schwaighofer, A., & Lawrence, N. D. 2009, Dataset shift in machine learning (The MIT Press)
- Richards et al. (2001) Richards, G. T., Weinstein, M. A., Schneider, D. P., et al. 2001, AJ, 122, 1151
- Richards et al. (2002) Richards, G. T., Fan, X., Newberg, H. J., et al. 2002, AJ, 123, 2945
- Richards et al. (2004) Richards, G. T., Nichol, R. C., Gray, A. G., et al. 2004, ApJS, 155, 257
- Saerens et al. (2002) Saerens, M., Latinne, P., & Decaestecker, C. 2002, Neural computation, 14, 21
- Saito & Rehmsmeier (2015) Saito, T., & Rehmsmeier, M. 2015, PloS one, 10
- Salvato et al. (2009) Salvato, M., Hasinger, G., Ilbert, O., et al. 2009, ApJ, 690, 1250
- Sandage (1965) Sandage, A. 1965, ApJ, 141, 1560
- Savage et al. (1993) Savage, B. D., Lu, L., Bahcall, J. N., et al. 1993, ApJ, 413, 116
- Savage et al. (2000) Savage, B. D., Wakker, B., Jannuzi, B. T., et al. 2000, ApJS, 129, 563
- Schlafly et al. (2016) Schlafly, E. F., Meisner, A. M., Stutz, A. M., et al. 2016, ApJ, 821, 78
- Schmidt (1963) Schmidt, M. 1963, Nature, 197, 1040
- Secrest et al. (2015) Secrest, N. J., Dudik, R. P., Dorland, B. N., et al. 2015, ApJS, 221, 12
- Shen et al. (2016) Shen, S.-Y., Argudo-Fernández, M., Chen, L., et al. 2016, Research in Astronomy and Astrophysics, 16, 43
- Shimodaira (2000) Shimodaira, H. 2000, Journal of statistical planning and inference, 90, 227
- Skrutskie et al. (2006) Skrutskie, M., Cutri, R., Stiening, R., et al. 2006, AJ, 131, 1163
- Stern et al. (2005) Stern, D., Eisenhardt, P., Gorjian, V., et al. 2005, ApJ, 631, 163
- Stern et al. (2012) Stern, D., Assef, R. J., Benford, D. J., et al. 2012, ApJ, 753, 30
- Strauss et al. (2002) Strauss, M. A., Weinberg, D. H., Lupton, R. H., et al. 2002, AJ, 124, 1810
- Su & Cui (2004) Su, D.-Q., & Cui, X.-Q. 2004, Chinese J. Astron. Astrophys., 4, 1, doi: 10.1088/1009-9271/4/1/1
- Sugiyama & Kawanabe (2012) Sugiyama, M., & Kawanabe, M. 2012, Machine learning in non-stationary environments: Introduction to covariate shift adaptation (MIT press)
- Tange (2011) Tange, O. 2011, ;login: The USENIX Magazine, 36, 42
- Taylor (2005) Taylor, M. B. 2005, in Astronomical Society of the Pacific Conference Series, Vol. 347, Astronomical Data Analysis Software and Systems XIV, ed. P. Shopbell, M. Britton, & R. Ebert, 29
- Vanden Berk et al. (2001) Vanden Berk, D. E., Richards, G. T., Bauer, A., et al. 2001, AJ, 122, 549
- Vapnik (2013) Vapnik, V. 2013, The nature of statistical learning theory (Springer science & business media)
- Vennes et al. (2002) Vennes, S., Smith, R. J., Boyle, B. J., et al. 2002, MNRAS, 335, 673
- Wang & Chen (2019) Wang, S., & Chen, X. 2019, ApJ, 877, 116
- Wang et al. (1996) Wang, S.-g., Su, D.-q., Chu, Y.-q., Cui, X., & Wang, Y.-n. 1996, Applied Optics, 35, 5155
- Weinstein et al. (2004) Weinstein, M. A., Richards, G. T., Schneider, D. P., et al. 2004, ApJS, 155, 243
- Wenger et al. (2000) Wenger, M., Ochsenbein, F., Egret, D., et al. 2000, A&AS, 143, 9
- Westmeier (2018) Westmeier, T. 2018, MNRAS, 474, 289
- White et al. (2000) White, R. L., Becker, R. H., Gregg, M. D., et al. 2000, ApJS, 126, 133
- Wright et al. (2010) Wright, E. L., Eisenhardt, P. R., Mainzer, A. K., et al. 2010, AJ, 140, 1868
- Wu et al. (2012) Wu, X.-B., Hao, G., Jia, Z., Zhang, Y., & Peng, N. 2012, AJ, 144, 49
- Wu & Jia (2010) Wu, X.-B., & Jia, Z. 2010, MNRAS, 406, 1583
- Wu et al. (2004) Wu, X.-B., Zhang, W., & Zhou, X. 2004, Chinese J. Astron. Astrophys., 4, 17
- Yan et al. (2013) Yan, L., Donoso, E., Tsai, C.-W., et al. 2013, AJ, 145, 55
- Yang et al. (2017) Yang, Q., Wu, X.-B., Fan, X., et al. 2017, AJ, 154, 269
- Yao et al. (2019) Yao, S., Wu, X.-B., Ai, Y. L., et al. 2019, ApJS, 240, 6
- Yèche et al. (2010) Yèche, C., Petitjean, P., Rich, J., et al. 2010, A&A, 523, A14
- Yi et al. (2019) Yi, Z., Chen, Z., Pan, J., et al. 2019, ApJ, 887, 241
- York et al. (2000) York, D. G., Adelman, J., Anderson, Jr., J. E., et al. 2000, AJ, 120, 1579
- Yuan et al. (2015) Yuan, H.-B., Liu, X.-W., Huo, Z.-Y., et al. 2015, MNRAS, 448, 855
- Zhang et al. (2013) Zhang, Y., Ma, H., Peng, N., Zhao, Y., & Wu, X.-b. 2013, AJ, 146, 22
- Zhao et al. (2012) Zhao, G., Zhao, Y.-H., Chu, Y.-Q., Jing, Y.-P., & Deng, L.-C. 2012, Research in Astronomy and Astrophysics, 12, 723
- Zonca et al. (2019) Zonca, A., Singer, L., Lenz, D., et al. 2019, Journal of Open Source Software, 4, 1298