
Photometric redshifts for quasars from WISE-PS1-STRM

Sándor Kunsági-Máté (1), Róbert Beck (1), István Szapudi (2), István Csabai (1)
(1) Department of Physics of Complex Systems, ELTE Eötvös Loránd University, Pázmány Péter sétány 1/a, Budapest 1117, Hungary
(2) Institute for Astronomy, University of Hawaii, 2680 Woodlawn Drive, Honolulu, HI, 96822
E-mail: [email protected]
(Accepted XXX. Received YYY; in original form ZZZ)
Abstract

Three-dimensional wide-field galaxy surveys are fundamental for cosmological studies. At higher redshifts ($z\gtrsim 1.0$), where galaxies are too faint, quasars still trace the large-scale structure of the Universe. Since available telescope time limits spectroscopic surveys, photometric methods are an efficient way to estimate redshifts for many quasars. Machine learning methods have recently become increasingly successful at estimating quasar photometric redshifts; however, they hinge on the distribution of the training set. A rigorous estimation of reliability is therefore critical. We extracted optical and infrared photometric data from the cross-matched catalogue of the WISE All-Sky and PS1 3$\pi$ DR2 sky surveys. We trained an XGBoost regressor and an artificial neural network on the relation between color indices and spectroscopic redshift. We approximated the effective training set coverage with the K nearest neighbors algorithm. We estimated reliable photometric redshifts of 2,879,298 quasars which overlap with the training set in feature space. We validated the derived redshifts with an independent, clustering-based redshift estimation technique. The final catalog is publicly available.

keywords:
methods: data analysis – methods: statistical – galaxies: distances and redshifts – catalogues
pubyear: 2022

1 Introduction

The three-dimensional distribution of objects in our Universe is a crucial input to many cosmological studies. Although the known optically observable edge of the Universe is more than 33 billion light years ($z>11$) away from us (Harikane et al. 2022), we have relatively dense redshift measurements only for the nearby ($z<1$) region, which is a tiny part of the whole observable volume. Precise redshift determination requires spectroscopic surveys such as SDSS (Blanton et al. 2017). However, faraway objects become so faint that their spectra either cannot be measured or would require extremely long exposures. Because of these difficulties, most recent and upcoming sky surveys (DES, The Dark Energy Survey Collaboration 2005; LSST, Abate et al. 2012; WISE, Wright et al. 2010; Pan-STARRS, Chambers et al. 2016) provide imaging data only. Several methods have therefore been developed in recent decades to estimate the redshift from fluxes measured in broadband filters. These approaches fall into two groups: template-fitting methods (Benitez 2000; Bolzonella et al. 2000; Csabai et al. 2000; Ilbert et al. 2006; Coe et al. 2006; Brammer et al. 2008; Leistedt et al. 2016; Beck et al. 2016) and empirical methods (Wadadekar 2005; Boris et al. 2007; Miles et al. 2007; Budavári 2009; Carliles et al. 2010; O’Mill et al. 2011; Krone-Martins et al. 2014; Elliott et al. 2015; Hogan et al. 2015). Since template fitting relies on a physical model, it typically generalizes and extrapolates better than the empirical methods. However, the empirical methods, mostly based on machine learning, are better at interpolating within the subregion of feature space covered by the spectroscopic training sample, and they avoid errors from unknown observational biases.
In this work we used two empirical models, XGBoost and an artificial neural network, to provide reliable photometric redshift estimates for close to three million quasars detected in the PS1-WISE cross-match catalog. For comparison, one of the most recent quasar catalogs with photometric redshifts contains about one million objects (Nakoneczny et al. 2021). Other photometric redshift catalogs, such as those of Yang et al. (2017) and Wu & Jia (2010), contain far fewer quasars; these works confirmed the relevance of the infrared bands for redshift estimation.
This paper is organized as follows: Section 2 describes the data, Section 3 describes the methods, Section 4 presents and discusses the results, and Section 5 summarizes our work.

2 Data

We used the cross-matched catalogue of the WISE All-Sky and PS1 3$\pi$ DR2 sky surveys presented by Beck et al. 2022 (submitted). They provided a highly accurate source classification with 97.67% purity and 94.28% completeness with respect to quasars. After training our photo-z model on the spectroscopically identified quasars, we applied it to the quasar candidates found by Beck et al. 2022 (submitted). The Pan-STARRS survey performed broad-band photometric measurements of about three quarters of the sky, mainly in the optical regime, using the g, r, i, z, y filters (Tonry et al. 2012; Chambers et al. 2016; Magnier et al. 2020a,b,c; Waters et al. 2020). We used the Kron and PSF (point-spread function) magnitudes of the objects measured in these filters. The WISE survey scanned the full sky in four infrared photometric bands (W1, W2, W3, W4) with effective wavelengths of 3.4, 4.6, 12 and 22 $\mu$m, respectively (Wright et al. 2010). Because of the high noise level and the relatively large number of missing error estimates in the W3 and W4 filters, we used only the measurements obtained in the W1 and W2 filters. We selected the profile-fitting photometry, which essentially fits a point-spread function to the data, as well as the aperture magnitudes measured in circular apertures of 8.25" radius. Finally, we determined the color indices, namely the magnitude differences of neighboring filters, by pairing the PS1 Kron magnitudes with the WISE aperture magnitudes and the PS1 PSF magnitudes with the WISE PSF magnitudes. Hence, we ended up with a 12-dimensional feature space. Note that, because of model extrapolation considerations, it is very important to rely on input parameters (in our case color indices) that are less sensitive to the actual magnitudes, since the spectroscopic training set is typically brighter than the photometric inference set.
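The construction of the 12-dimensional feature space can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the band ordering, the dictionary keys, the feature names such as `y_w1`, and the magnitudes are assumptions chosen for the example.

```python
# Band order assumed from the text: PS1 g, r, i, z, y, then WISE W1, W2.
BANDS = ["g", "r", "i", "z", "y", "w1", "w2"]

def color_indices(mags):
    """Magnitude differences of neighboring bands: g-r, r-i, ..., y-w1, w1-w2."""
    return {f"{a}_{b}": mags[a] - mags[b] for a, b in zip(BANDS, BANDS[1:])}

def feature_vector(kron_aper, psf_psf):
    """Concatenate the 6 Kron/aperture colors with the 6 PSF/PSF colors."""
    first = color_indices(kron_aper)
    second = color_indices(psf_psf)
    return list(first.values()) + list(second.values())

# Made-up magnitudes for a single object (illustrative only).
kron_aper = {"g": 20.1, "r": 19.8, "i": 19.6, "z": 19.5, "y": 19.4, "w1": 16.9, "w2": 16.1}
psf_psf = {"g": 20.0, "r": 19.7, "i": 19.5, "z": 19.4, "y": 19.3, "w1": 16.8, "w2": 16.0}

features = feature_vector(kron_aper, psf_psf)
print(len(features))  # six colors per magnitude pairing, two pairings
```

Seven bands give six neighboring-band colors, and the two magnitude pairings double this to twelve features.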
The PS1 magnitudes were corrected for Galactic dust extinction using the corresponding extinction coefficients ($\alpha_{g}=3.172$, $\alpha_{r}=2.271$, $\alpha_{i}=1.682$, $\alpha_{z}=1.322$, $\alpha_{y}=1.087$) and the E(B-V) dust extinction values of a map based on PS1 observations of Milky Way stars (Schlafly et al. 2014). For spectroscopic redshifts we used SDSS data (York et al. 2000; Lyke et al. 2020); the derived training set consisted of 346,691 quasars.
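A minimal sketch of the extinction correction is given below. It assumes the common form in which the extinction in each band is the coefficient times E(B-V) and is subtracted from the observed magnitude; the text does not spell out the formula, so this form, the function name and the example values are assumptions.

```python
# Extinction coefficients quoted in the text for the PS1 bands.
ALPHA = {"g": 3.172, "r": 2.271, "i": 1.682, "z": 1.322, "y": 1.087}

def deredden(mags, ebv):
    """Subtract alpha_band * E(B-V) from each PS1 magnitude (assumed form)."""
    return {band: m - ALPHA[band] * ebv for band, m in mags.items()}

# Made-up observed magnitudes and reddening for one object.
observed = {"g": 20.50, "r": 20.10, "i": 19.90, "z": 19.80, "y": 19.70}
corrected = deredden(observed, ebv=0.05)

print(corrected["g"] - corrected["r"])  # dereddened g-r color
```

Because the coefficients decrease towards redder bands, the corrected colors come out bluer than the observed ones.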

3 Methods

First of all we estimated the training set coverage to provide a well-defined boundary in the normalized feature space (all features were transformed to have zero mean and a standard deviation of 1) within which our model predictions are reliable. To do this we searched for the 20 nearest neighbors in the 12-dimensional feature space using the ball tree algorithm (Liu et al. 2006). We then calculated the mean distance from the neighbors for each data point and investigated its distribution (see Figure 1).

Figure 1: Frequency distribution of the mean distance of each data point from their neighbors in the 12-dimensional normalized feature space with respect to the spectroscopic data set (orange continuous line). Blue bars denote the distribution of the average distance measured between the inference data points and their spectroscopic neighbors. The red vertical line indicates the cut off distance value which corresponds to the 95th percentile of the spectroscopic data set.

The red vertical line indicates the cutoff distance, which corresponds to the 95th percentile. Next, using the previously fitted K nearest neighbors model, we calculated the distance of each inference data point from its 20 nearest neighbors in the training set, and again took the mean of these distances. In this way we can determine the region where the training and inference sets overlap in feature space. The distribution of these distances is shown by the blue bars in Figure 1. Altogether 2,879,298 of the total 4,849,611 quasar candidates in the inference set are closer to the training set than the cutoff value. The photometric redshift estimation based on our training set is therefore most reliable for this subset of the quasar candidates; for the remaining objects we extrapolate into a less well represented region.
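The coverage estimate described above can be sketched on toy data. The sketch below swaps the ball tree for a brute-force neighbor search to stay dependency-free, uses a two-dimensional Gaussian toy training set instead of the 12 real features, and assumes that a training point is excluded from its own neighbor list and that a nearest-rank percentile is used; none of these details are stated in the text.

```python
import math
import random
import statistics

K = 20  # number of neighbors, as in the text

def fit_scaler(rows):
    """Per-feature mean and standard deviation of the training set."""
    cols = list(zip(*rows))
    return [statistics.fmean(c) for c in cols], [statistics.pstdev(c) for c in cols]

def transform(rows, mean, std):
    """Standardize rows with the training-set statistics."""
    return [[(x - m) / s for x, m, s in zip(row, mean, std)] for row in rows]

def mean_knn_distance(point, reference, k, skip_self=False):
    """Mean Euclidean distance from `point` to its k nearest reference points."""
    dists = sorted(math.dist(point, ref) for ref in reference)
    if skip_self:
        dists = dists[1:]  # drop the zero distance of a training point to itself
    return statistics.fmean(dists[:k])

def percentile(values, q):
    """Nearest-rank percentile (assumed convention)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[min(rank, len(ordered)) - 1]

rng = random.Random(42)
train_raw = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]  # toy training set
infer_raw = [[0.0, 0.0], [8.0, 8.0]]  # one central and one far-away toy object

mean, std = fit_scaler(train_raw)
train = transform(train_raw, mean, std)
infer = transform(infer_raw, mean, std)

train_dist = [mean_knn_distance(p, train, K, skip_self=True) for p in train]
cutoff = percentile(train_dist, 95)
infer_dist = [mean_knn_distance(p, train, K) for p in infer]
inside = [d <= cutoff for d in infer_dist]  # True where predictions are not extrapolated
```

The central object lies well inside the training distribution, while the distant one exceeds the cutoff and would be flagged as an extrapolation.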

We trained an XGBoost regressor (XGB) and an artificial neural network (ANN) on the complex relation between the feature space and the spectroscopic redshifts. We used XGB (Chen & Guestrin 2016) as a baseline photo-z model and to measure the feature importances. XGB is a boosting algorithm that was developed on the basis of the Gradient Boosting Decision Tree (Friedman 2001). The main difference between the two models is that during the training phase XGB uses both the first and second derivatives of the loss function. We set the number of estimators to 12 and the maximum depth to 15. The final photo-z catalogue, however, was created using the ANN, which outperformed XGB on the training set. We used four hidden layers, each with 512 neurons and the Exponential Linear Unit (ELU) activation function. ELU was proposed as an improvement over the widely used Rectified Linear Unit (ReLU): it alleviates the vanishing gradient problem while providing lower training times and higher accuracy. To avoid overfitting we added a dropout layer with a dropout rate of 0.2 after each hidden layer.
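As a small illustration of the network described above, the sketch below defines the ELU and ReLU activations and counts the trainable parameters of the architecture. It assumes fully connected layers with biases, 12 inputs, a single linear output neuron and the usual ELU scale of 1; only the four hidden layers of 512 neurons are taken from the text.

```python
import math

def relu(x):
    """Rectified Linear Unit: zero for negative inputs."""
    return max(0.0, x)

def elu(x, alpha=1.0):
    """Exponential Linear Unit: smooth, nonzero response for negative inputs."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def dense_parameter_count(layer_sizes):
    """Weights plus biases of a stack of fully connected layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# 12 color indices -> four hidden layers of 512 neurons -> 1 redshift (output size assumed).
layers = [12, 512, 512, 512, 512, 1]
n_params = dense_parameter_count(layers)

print(relu(-1.0), elu(-1.0), n_params)
```

Unlike ReLU, ELU keeps a nonzero output and gradient for negative inputs, which is the property the text refers to.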

4 Results

4.1 Application of XGBoost

We split the spectroscopic data set into training, test and validation sets in the ratio 70%-15%-15%, respectively. To quantify the goodness of the model predictions we used the following metrics ($N$: number of data points, $z_{phot}$: photometric redshift, $z_{spec}$: spectroscopic redshift):

  • Mean Squared Error (MSE):

    $MSE=\frac{1}{N}\sum\limits_{i=1}^{N}(z_{phot,i}-z_{spec,i})^{2}$ (1)
  • $\delta z_{norm,i}$:

    $\delta z_{norm,i}=\frac{z_{phot,i}-z_{spec,i}}{1+z_{spec,i}}$ (2)
  • Median Absolute Deviation of $\delta z_{norm,i}$ (MAD)

  • Mean Absolute Difference (MeanAD):

    $\frac{1}{N}\sum\limits_{i=1}^{N}|z_{phot,i}-z_{spec,i}|$ (3)
  • Bias (B):

    $\frac{1}{N}\sum\limits_{i=1}^{N}\delta z_{norm,i}$ (4)
  • Outlier rate (O): fraction of objects for which $|z_{phot,i}-z_{spec,i}|>3\sqrt{MSE}$
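The metrics listed above can be computed with a few lines of code. The following is a sketch with made-up redshifts; it assumes that MAD is the unscaled median of the absolute deviations from the median, which the text does not specify.

```python
import math
import statistics

def photoz_metrics(z_phot, z_spec):
    """Return MSE, MAD, MeanAD, bias and outlier rate for paired redshifts."""
    n = len(z_spec)
    diff = [p - s for p, s in zip(z_phot, z_spec)]
    dz_norm = [d / (1.0 + s) for d, s in zip(diff, z_spec)]

    mse = sum(d * d for d in diff) / n
    median_dz = statistics.median(dz_norm)
    mad = statistics.median([abs(v - median_dz) for v in dz_norm])  # unscaled (assumed)
    mean_ad = sum(abs(d) for d in diff) / n
    bias = sum(dz_norm) / n
    outlier_rate = sum(abs(d) > 3.0 * math.sqrt(mse) for d in diff) / n

    return {"MSE": mse, "MAD": mad, "MeanAD": mean_ad, "B": bias, "O": outlier_rate}

# Made-up values for illustration only.
z_spec = [1.0, 1.0, 1.0, 1.0]
z_phot = [1.2, 0.8, 1.0, 1.0]
metrics = photoz_metrics(z_phot, z_spec)
print(metrics)
```

Note that the outlier threshold depends on the MSE of the same sample, so a few very large errors raise the threshold for all objects.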

First we applied XGB to the training set and set the final number of estimators to 12, the value at which the loss function measured on the validation set was smallest. Figure 3 shows the predicted redshifts as a function of the spectroscopic redshifts, together with the residuals. The median prediction is very close to the spectroscopic value, but the scatter is relatively large and some bias remains in the residuals. At low redshifts this bias is mostly positive, which means that XGB placed many nearby ($z<1$) quasars at larger distances. We also plotted the feature importances in Figure 4. These values are based on the so-called information gain, that is, the average reduction in training loss obtained when a feature is used for splitting. According to the results, the near-infrared range is the most informative for XGB. To understand the high relevance of the y_w1_dered feature we calculated the median value of each standardized feature in redshift bins, and then plotted the redshift bin centers as a function of these median feature values (see Figure 2). Many of the features fluctuate strongly, which means that the same feature values (color indices) correspond to several redshift ranges, so a correct prediction needs many splits in these feature domains. In contrast, for y_w1_dered there is a broad redshift region with a one-to-one relation between the feature and the redshift.
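The binned medians behind Figure 2 can be sketched as below. The bin width, the function name and the toy data are hypothetical, since the text does not state which redshift bins were used.

```python
import statistics
from collections import defaultdict

def median_feature_per_bin(redshifts, feature, bin_width):
    """Return (bin center, median feature value) pairs for occupied redshift bins."""
    bins = defaultdict(list)
    for z, x in zip(redshifts, feature):
        bins[int(z // bin_width)].append(x)
    return [((b + 0.5) * bin_width, statistics.median(v)) for b, v in sorted(bins.items())]

# Toy standardized feature that grows with redshift (illustrative only).
redshifts = [0.1, 0.2, 0.3, 0.6, 0.7, 0.9]
feature = [1.0, 2.0, 3.0, 10.0, 20.0, 30.0]
curve = median_feature_per_bin(redshifts, feature, bin_width=0.5)
print(curve)
```

A feature whose median varies monotonically across bins, as in this toy case, maps one-to-one onto redshift and is therefore easy for a tree-based model to split on.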

Figure 2: The used redshift bin centers as a function of median feature values. The y_w1_dered feature is marked with the red line.
Figure 3: Photometric redshift estimates of XGBoost regressor and the residuals. The red continuous and dashed lines refer to the median and the 68% confidence interval, respectively.
Figure 4: Feature importance value of the different color indices.

4.2 Application of ANN

We then applied the ANN to the data and obtained better results after a few thousand iterations with a batch size of 1024. The results are shown in Figure 5. The plot is similar to Figure 3, but the scatter is now narrower. We achieved a mean squared error of 0.18 and a mean absolute difference of 0.226, which is very similar to the result of Jin et al. (2019), whose MeanAD was 0.22. However, the characteristic fluctuation around the ground truth value remained, similarly to the results of Jin et al. (2019).

Figure 5: Photometric redshift estimates of ANN and the residuals. The red continuous and dashed lines refer to the median and the 68% confidence interval, respectively.

4.3 Error estimation

To estimate the error of the ANN model we used a Monte Carlo approach. We randomly perturbed the magnitudes by adding a Gaussian random variable with zero mean and a standard deviation equal to the reported magnitude error. We then recalculated the color indices and applied the ANN to the perturbed data. We repeated this process 100 times and took the mean value as the final prediction and the standard deviation as the error of the estimated redshift. Figure 6 again shows the results for the test set, now with the estimated error included as color coding. The uncertainty is consistent with the accuracy of the predictions: photo-z estimates with lower uncertainty are closer to the spectroscopic redshift values.
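The Monte Carlo procedure can be sketched as follows. The trained ANN is replaced by a hypothetical linear stand-in (`toy_model`), the magnitudes and errors are made up, and the sample standard deviation is used as the error; these are assumptions for illustration only.

```python
import random
import statistics

def monte_carlo_photoz(mags, mag_errs, predict, n_draws=100, rng=None):
    """Perturb magnitudes, recompute colors, and summarize the predictions."""
    rng = rng or random.Random(0)
    draws = []
    for _ in range(n_draws):
        perturbed = [m + rng.gauss(0.0, e) for m, e in zip(mags, mag_errs)]
        colors = [a - b for a, b in zip(perturbed, perturbed[1:])]  # neighboring bands
        draws.append(predict(colors))
    return statistics.fmean(draws), statistics.stdev(draws)

def toy_model(colors):
    """Hypothetical stand-in for the trained ANN (not the real model)."""
    return 1.0 + 0.5 * sum(colors)

# Made-up magnitudes in three neighboring bands with 0.1 mag errors.
mags = [20.0, 19.5, 19.0]
mag_errs = [0.1, 0.1, 0.1]
z_mean, z_err = monte_carlo_photoz(mags, mag_errs, toy_model)
print(z_mean, z_err)
```

With zero magnitude errors every draw is identical, so the estimated redshift error vanishes; larger photometric errors widen the spread of the predictions.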

Figure 6: Photometric redshift predictions of ANN with error estimation.

4.4 Explanation of the step-like structure

We now explain the step-like structure visible in the ANN predictions (Figure 5). First, recall the basic idea behind using color indices to predict redshift. The color indices carry information about the flux ratio between neighbouring photometric passbands. As the emission lines of the quasar spectrum move to longer wavelengths with increasing redshift, the color indices change as well. However, since quasars have fewer features in their continuum spectra than galaxies, the change in the color indices is typically smaller than the photometric error, and therefore the model cannot capture the relation. The model performs better only where the color change is significant, which occurs when a strong emission line moves from one passband to another. These relatively large changes occur only at specific redshifts, so the machine learning models are primarily able to identify the redshift interval, bounded by such transitions, in which a quasar lies. Since we minimize the mean squared error during optimization, the model will in most cases predict the middle of the redshift interval, which produces the smallest error over all quasars in that interval. Hence, the resulting plot contains several "steps" between the mentioned redshifts. To demonstrate this concept we used a composite quasar spectrum for the optical and near-infrared regime downloaded from the website of the Space Telescope Science Institute (STScI; https://www.stsci.edu/hst/instrumentation/reference-data-for-calibration-and-tools/astronomical-catalogs/composite-qso-spectra-for-nir). In Figure 7 we plot this spectrum, the transmission curves of the PS1 and WISE passbands and the calculated color indices.
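The argument that minimizing the mean squared error drives the prediction to the middle of an interval can be checked numerically. The sketch below assumes a hypothetical interval $z\in[1.0,1.5]$ that is uniformly populated and within which the colors are indistinguishable, so the model can only output a single constant.

```python
import statistics

def mse_of_constant(c, z_true):
    """Mean squared error when every object is assigned the same redshift c."""
    return sum((c - z) ** 2 for z in z_true) / len(z_true)

# Hypothetical interval in which the colors carry no redshift information.
z_true = [1.0 + 0.01 * i for i in range(51)]       # 1.00, 1.01, ..., 1.50
candidates = [1.0 + 0.01 * i for i in range(51)]   # constants the model could output

best = min(candidates, key=lambda c: mse_of_constant(c, z_true))
print(best, statistics.fmean(z_true))
```

The best constant coincides with the mean of the true redshifts, which for a uniformly populated interval is its midpoint; this produces a flat "step" in the predicted versus spectroscopic redshift plot.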

Figure 7: Composite quasar spectra at $z=1.5$, transmission curves of the PanSTARRS and WISE passbands as well as the determined synthetic color indices.

Using these data we calculated the six color indices at 4000 redshift values in the range $z\in[0,4]$. We then calculated the standard deviation of each color index in redshift bins of width 0.04. Finally, we took the maximum standard deviation in each bin and plotted the results in Figure 8. The jumps in the redshift predictions clearly occur at the positions where the standard deviation of the color indices is relatively high, which confirms our assumption.
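The binned standard deviation can be sketched with toy color curves. The two logistic curves below are hypothetical stand-ins for the synthetic color indices (the real ones come from the composite spectrum and the passband transmission curves); only the grid of 4000 redshifts in $[0,4]$ and the bin width of 0.04 follow the text.

```python
import math
import statistics

def toy_color(z, z_jump, amplitude):
    """Hypothetical color curve: flat except for a sharp change at z_jump."""
    return amplitude / (1.0 + math.exp(-(z - z_jump) / 0.01))

n_values, z_max, bin_width = 4000, 4.0, 0.04
n_bins = round(z_max / bin_width)
zs = [z_max * i / (n_values - 1) for i in range(n_values)]

# Two toy color indices with jumps at z = 1.5 and z = 2.9 (made-up positions).
colors = [[toy_color(z, 1.5, 0.3) for z in zs], [toy_color(z, 2.9, 0.5) for z in zs]]

max_std = []
for b in range(n_bins):
    idx = [i for i, z in enumerate(zs) if min(int(z / bin_width), n_bins - 1) == b]
    stds = [statistics.pstdev([c[i] for i in idx]) for c in colors]
    max_std.append(max(stds))  # largest scatter among the color indices in this bin

peak_bin = max(range(n_bins), key=max_std.__getitem__)
peak_z = (peak_bin + 0.5) * bin_width
print(peak_z)
```

In this toy setup the maxima of the binned standard deviation mark the redshifts at which a color changes rapidly, which is where the steps in the predictions are expected to begin or end.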

Figure 8: Photometric redshift estimation of ANN as well as the standard deviation of synthetic quasar color indices (blue dashed line) as a function of redshift. The positions of local maxima are marked with the green dashed vertical lines.

4.5 Independent validation of the results

In this section we demonstrate the reliability of the derived non-extrapolated photometric redshifts by applying a completely independent algorithm, the so-called clustering-based redshift estimation (Ménard et al. 2013). This approach uses a set of spatial cross-correlations between a photometric and a reference spectroscopic sample. We therefore provide a map of objects within a defined photometric redshift range and let the method infer the change in number density as a function of redshift ($\frac{dN}{dz}$). We also calculated the frequency distribution of the quasars with respect to the predicted photometric redshifts, accounting for the estimated uncertainty of each prediction with a Gaussian distribution. We consider the validation successful if the two distributions are close to each other. We used the publicly available Tomographer (tomographer.org) web user interface for this validation. We created HEALPix maps of the spatial distribution of quasars in the Galactic coordinate system using the healpy Python package with NSIDE=128 and the WMAP DR4 temperature analysis exclusion mask (https://lambda.gsfc.nasa.gov/product/wmap/dr4/masks_get.html). We created these maps in redshift bins of width 0.5; the resulting maps are shown in Figure 9. The correlation results of Tomographer are plotted in Figure 10. The clustering-based redshift distributions that Tomographer derives against its reference data set are very similar to the frequency distributions of quasars along the photometric redshift, especially for $z<2.5$. This confirms the reliability of the photometric redshift catalog. Only for the last redshift bin, $z\in[2.5,3]$, does Tomographer predict significantly larger redshift values, which is not surprising: Figure 5 shows that in precisely this redshift range the model predicts systematically lower redshifts than the true values, so a significant number of more distant quasars fall into the $z\in[2.5,3]$ photometric redshift bin.
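One way to build the photometric redshift distribution while accounting for the estimated uncertainties is to let every quasar contribute a normalized Gaussian centered on its photo-z. The sketch below assumes this stacking approach and uses made-up redshifts and errors; the text states only that a Gaussian distribution was used.

```python
import math

def stacked_nz(z_phot, z_err, grid):
    """Sum of unit-area Gaussians, one per object, evaluated on a redshift grid."""
    total = [0.0] * len(grid)
    for mu, sigma in zip(z_phot, z_err):
        norm = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
        for j, z in enumerate(grid):
            total[j] += norm * math.exp(-0.5 * ((z - mu) / sigma) ** 2)
    return total

# Made-up photometric redshifts and Monte Carlo errors for three objects.
z_phot = [1.0, 1.1, 2.0]
z_err = [0.1, 0.2, 0.1]

dz = 0.01
grid = [dz * i for i in range(401)]  # z from 0 to 4
nz = stacked_nz(z_phot, z_err, grid)
area = sum(nz) * dz                  # should be close to the number of objects
print(area)
```

The resulting curve integrates to the number of objects and can be compared, after a common normalization, with the $\frac{dN}{dz}$ estimate returned by the clustering-based method.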

Figure 9: Number count of quasars in the different redshift slices. The WMAP DR4 temperature analysis exclusion mask was used.
Figure 10: Prediction of redshift dependence of frequency distribution calculated by Tomographer related to the different redshift slices (blue continuous line with errorbars). For comparison the frequency distribution of quasars along the photometric redshift has been plotted as well (orange dashed line).

5 Conclusions

We created a photometric redshift catalog, including error estimates, for a total of 4,849,611 quasars. Of these, 2,879,298 quasars are within the training set coverage, and their redshift estimates are therefore more reliable. We presented an XGBoost machine learning model as a baseline method and used a more advanced artificial neural network model for the final predictions. We provided a detailed analysis of the results and an explanation for the observed bias in the data. Finally, we validated our redshift catalog using a completely independent, clustering-based redshift estimation method. We found good agreement between the results of the two methods for $z<2.5$, so the published catalog will be useful for several cosmological large-scale structure studies.

Acknowledgements

This work was supported by the Ministry of Innovation and Technology NRDI Office grants OTKA NN 129148 and the MILAB Artificial Intelligence National Laboratory Program. IS acknowledges support from the National Science Foundation (NSF) award 1616974.

Data Availability

The derived photometric redshift catalogue is publicly available at https://doi.org/10.5281/zenodo.6609756.

References