¹¹institutetext: CAS Key Laboratory of Optical Astronomy, National Astronomical Observatories, Beijing 100101, China
²²institutetext: School of Astronomy and Space Science, University of Chinese Academy of Sciences, Beijing 100049, China

Distance and stellar parameter estimations of solar-like stars from the LAMOST spectroscopic survey

Yue-Yue Shen 1122 0000-0001-8636-2990 A-Li Luo^, A-Li Luo email: [email protected] 0000-0001-7865-2648

(Received mm:dd, yyyy; accepted mm:dd, yyyy)

Abstract

Context. The Gaia mission has opened up a new era for the precise astrometry of stars, thus revolutionizing our understanding of the Milky Way. However, beyond a few kiloparseconds from the Sun, parallax measurements become less reliable, and even within 2 kpc, there still exist stars with large uncertainties.

Aims. Our aim was to determine the distance and stellar parameters of 521,424 solar-like stars from LAMOST DR9; these stars lacked precise distance measurements (uncertainties higher than 20% or even without any distance estimations) when checked with Gaia.

Methods. We proposed a convolutional neural network (CNN) model to predict the absolute magnitudes, colors, and stellar parameters ( $T_{\rm eff}$ , log $\it g$ , and $\rm[Fe/H]$ ) directly from low-resolution spectra. For spectra with signal-to-noise ratios at $g$ band ( $\rm{S/N_{g}}$ ) greater than 10, the model achieves a precision of 85 K for $T_{\rm eff}$ , 0.07 dex for log $\it g$ , 0.06 dex for $\rm[Fe/H]$ , 0.25 mag for $M_{\rm G}$ , and 0.03 mag for $(\rm{{BP}-{RP}})_{0}$ . The estimated distances have a median fractional error of 4% with a standard deviation of 8%.

Results. We applied the trained CNN model to 521,424 solar-like stars to derive the distance and stellar parameters. Compared with other distance estimation studies and spectroscopic surveys, the results show good consistency. Additionally, we investigated the metallicity gradients of the Milky Way from a subsample, and find a radial gradient ranging from $-0.05<\Delta\hbox{$\rm[Fe/H]$}/\Delta R<0.0\ {\rm{dex\ kpc^{-1}}}$ and a vertical gradient ranging from $-0.26<\Delta\hbox{$\rm[Fe/H]$}/\Delta Z<-0.07\ \rm{dex\ kpc^{-1}}$ .

Conclusions. We conclude that our method is effective in estimating distances and stellar parameters for solar-like stars with limited astrometric data. Our measurements are reliable for Galactic structure studies and hopefully will be useful for exoplanet researches.

Key Words.:

methods: machine learning – techniques: spectroscopic – catalogs – stars: solar-like, fundamental parameters

1 Introduction

As we know, planet-hunting missions such as the Convection, Rotation and planetary Transits (CoRoT; Baglin et al., 2006; Bordé et al., 2003), Kepler (Borucki et al., 2010), and the Transiting Exoplanet Survey Satellite (TESS; Ricker et al., 2015) have revealed that solar-like stars commonly harbor exoplanets. Approximately one-half of these stars are observed to possess rocky planets located within their habitable zones. The distance of solar-like stars significantly influences exoplanet searches, affecting their detectability. Bright nearby solar-like stars represent the most promising targets for exoplanet detection, necessitating follow-up observations using diverse methods.

With the accurate Gaia parallax measurements (Bailer-Jones et al., 2021), their accurate luminosities can thus be derived. However, for stars at larger distances, accurate parallaxes are more difficult to measure; for example, at 5 kpc the median error in Gaia data is 15%, and at 8 kpc it grows to 34% (Gaia Collaboration et al., 2023b). Previous work has used the combination of astrometric, photometric, and spectroscopic data along with different methods to determine distances (see, e.g., Hawley et al. 2002, Burnett & Binney 2010, Santiago et al. 2016, Queiroz et al. 2018, Das & Sanders 2019, Leung & Bovy 2019a, Stone-Martinez et al. 2024). Santiago et al. (2016) developed a Bayesian inference code to determine spectro-photometric distances, followed by Queiroz et al. (2018), who presented StarHorse with extinctions and other parameters added, and improved the code with extinction treatment in Queiroz et al. (2020). Instead of pure Bayesian inference, Das & Sanders (2019) trained MADE, a Bayesian artificial neural network (ANN) that learns from and replaces stellar isochrones to estimate distance.

Beyond the reach of Gaia’s parallax, we can use the relation between spectra and luminosity (or absolute magnitude) to determine distances. Meanwhile, thanks to the large amount of stars with precise parallaxes that Gaia has already provided, data-driven methods, particularly machine learning (ML), are attractive methods for calculating spectro-photometric distances. Leung & Bovy (2019a) determined the spectro-photometric distances for stars observed by the Apache Point Observatory Galactic Evolution Experiment (APOGEE; Majewski et al., 2017) and provided a flexible model to calibrate parallax zero-point biases in Gaia Data Release 2 (DR2). Stone-Martinez et al. (2024) used a simple neural network to drive the absolute magnitude from stellar parameters and provided a value-added catalog (VAC) with 733,901 spectra.

Mapping the stellar parameters derived from spectra to absolute magnitude, the ML technologies have been successfully applied. Zhang et al. (2020) proposed the Stellar LAbel Machine (SLAM), which is a data-driven method based on the support vector regression (SVR), shows high performance in deriving stellar parameters from the Large Sky Area Multi-Object Fiber Spectroscopic Telescope (LAMOST; Luo et al., 2015; Zhao et al., 2012; Cui et al., 2012) low-resolution spectra. Fabbro et al. (2018) introduced StarNet, leveraging APOGEE observed spectra alongside corresponding ASSET synthetic data. Zhang et al. (2019) employed StarNet to analyze giants using low-resolution spectra from LAMOST. Leung & Bovy (2019a) presented astroNN, an open-source Python tool with flexible neural network features. Rui et al. (2019) introduced generative spectrum networks (GSN) for synthesizing spectra, and integrated a Bayesian framework for inferring stellar atmospheric parameters from low-resolution LAMOST spectroscopy. Following this, Wang et al. (2020) developed SPCANet, which was specifically designed for medium-resolution LAMOST spectroscopy and produced 13 high-precision chemical abundances. Ting et al. (2019) presented The Payne, an approach enabling the precise interpolation and prediction of spectra within high-dimensional label spaces. Building upon The Payne, Xiang et al. (2022) introduced HotPayne, tailored for 330,000 hot stars using LAMOST low-resolution spectra. By predicting projected rotation velocity and micro-turbulence velocity, HotPayne expands its utility. Cycle-StarNet was developed by O’Briain et al. (2021) using a hybrid generative domain-adaptation approach to convert simulated star spectra into their real-world equivalents. By effectively applying this methodology to APOGEE, the discrepancy between theoretical predictions and observational data was reduced. Expanding the scope, Wang et al. (2023b) implemented Cycle-StarNet on LAMOST medium-resolution spectra, employing the L-BFGS algorithm to optimize the fitting of synthetic spectra to estimate stellar parameters and 11 chemical abundances in a staggering dataset of 1.38 million FGKM-type stars.

In this study we focus on solar-like stars in LAMOST DR9. We found that at least 521,424 of these stars (representing 13% of the LAMOST solar-like sample) lack precise distance measurements, with uncertainties higher than 20% or even without any distance estimations when checked with Gaia. Similar to the approaches of Leung & Bovy (2019a), we use a convolutional neural network (CNN) to predict the absolute magnitudes, colors, and stellar parameters directly from the spectra. With the predicted absolute magnitudes, we can subsequently calculate the distances using the observed apparent magnitudes and extinction values. This approach provides a significant supplement to distance measurements for solar-like stars with inadequate astrometric and photometric data.

This paper is organized as follows. Section 2 describes the data used in this study, including LAMOST low-resolution spectra, stellar atmosphere parameters from APOGEE, and astrometric data from Gaia. In Sect. 3 the CNN model architecture, model training, model performance, and model comparison are detailed. In Sect. 4 the trained model is applied to 521,424 solar-like stars from LAMOST DR9 to derive distances and stellar parameters. The validation and analysis of the results, along with the metallicity gradients of the Milky Way calculated from a subsample, are presented. Finally, in Sect. 5 we discuss the feature importance and present the conclusions of this work.

2 Data

For this work we employed low-resolution spectra from LAMOST DR9, astrometric and photometric data from Gaia DR3, and stellar atmospheric parameters from APOGEE DR17. The stellar parameters ( $T_{\rm eff}$ , log $\it g$ , $\rm[Fe/H]$ ) were obtained from the APOGEE Stellar Parameters and Abundances Pipeline (ASPCAP), and the absolute magnitudes were derived using geometric distances from Bailer-Jones et al. (2021).

2.1 Data selection and calculation

High-quality low-resolution spectra of solar-like stars are selected from the LAMOST LRS Stellar Parameter Catalog of A, F, G and K Stars of LAMOST DR9 v1.1 (hereafter LAMOST LRS AFGK Catalog),¹¹1https://www.lamost.org/dr9/ which contains 7,060,679 spectra with determined stellar parameters ( $T_{\rm eff}$ , log $\it g$ , $\rm[Fe/H]$ ) by the official LAMOST Stellar Parameter Pipeline (LASP). A total of 4,067,525 solar-like stellar spectra were selected according to the criteria in Table 1, which is similar to the criteria used by Zhang et al. (2022).

Table 1: Selection criteria for solar-like stars.

parameters	condition
$T_{\rm eff}$	4800 $\leq$ $T\rm_{eff}$ $\leq$ 6800K
log $\it g$	$\hbox{log\,$\it g$}\leq 5.98-0.00035\times\hbox{$T_{\rm eff}$}$
[Fe/H]	-1.0 $<$ [Fe/H] $<$ 1.0

Solar-like stars of the late F-type, G-type, and early K-type are included in this study. The effective temperature range for these stars ranges from $4800$ K to $6800$ K, which encompasses the effective temperature of the Sun ( $\sim$ 5800K) with a fluctuation of $\pm$ 1000K. The surface gravity log $\it g$ is constrained by the empirical formula proposed by Zhang et al. (2022) in the $T_{\rm eff}$ -log $\it g$ Kiel diagram, as shown in Fig. 1 and Table 1. This empirical formula distinguishes between main-sequence stars and giants. The metallicity range is set to $-1.0<\hbox{$\rm[Fe/H]$}<1.0$ dex, which is around the metallicity of the Sun ( $\hbox{$\rm[Fe/H]$}=0.0$ ).

Refer to caption — Figure 1: $\hbox{$T_{\rm eff}$}-\hbox{log\,$\it g$}$ Kiel diagram of LAMOST DR9 AFGK-type stars. The black dashed line represents the empirical formula proposed by Zhang et al. (2022), with an effective temperature range of $4800-6800$ K; stars within this area are defined as solar-like stars in this work. For reference, the Sun ( $\hbox{$T_{\rm eff}$}\sim 5800$ K, $\hbox{log\,$\it g$}\sim 4.4$ dex) is denoted by an orange star. The inset color-magnitude diagram shows the distribution of the training data, which is a subset of stars within the black dashed area, color-coded by $\rm[Fe/H]$ .

In the LAMOST LRS AFGK catalog, there are 4,022,857 solar-like stars with Gaia data. To find the most reliable distance for training, we considered three primary sources of distance data: inverse parallax (including raw parallax data and parallax after zero-point correction), distance from the General Stellar Parametrizer from Photometry (GSP-Phot; Andrae et al., 2023), and distance obtained by Bailer-Jones et al. (2021) using the geometric methods. Before calculating the inverse parallax, the zero-point correction for parallax is determined separately for five-parameter and six-parameter astrometric solutions, as explained in Lindegren et al. (2021). The method for calculating zero points follows the Python code.²²2https://gitlab.com/icc-ub/public/gaiadr3_zeropoint

According to the distribution of distances shown in Fig. 2, as $\varpi/\sigma_{\varpi}$ increases, the amount of distance data decreases, especially for distances greater than 1 kpc. Compared to the corrected inverse parallax, the raw inverse parallax has a larger dispersion and relatively overestimates distances in panel (a). In panel (b), the geometric distance has a similar distribution to the corrected inverse parallax. In panel (c), the GSP-Phot distance has a smaller dispersion at close distances and tends to underestimate the distance.

In Fig. 3 GSP-Phot distances and geometric distances are shown for $\varpi/\sigma_{\varpi}>5$ . GSP-Phot distances are significantly underestimated when compared to geometric distances, and this underestimation worsens at larger distances. However, for larger $\varpi/\sigma_{\varpi}$ , indicating higher data quality, the underestimation is mitigated with increasing distance. In this study, geometric distances, which are derived purely through a Bayesian approach that excludes the influence of photometric data, are chosen to determine absolute magnitudes.

We calculated the absolute magnitude from the known apparent magnitude g_mean_mag in the Gaia DR3 catalog and the geometric distance $d_{\rm{geo}}$ (r_med_geo) from Bailer-Jones et al. (2021):

M_{\rm{G}}=m_{G}-5\log{d_{\rm{geo}}}+5-A_{\rm{G}},

(1)

Here $M_{\rm{G}}$ is the calculated absolute magnitude, which is used as the label for the training process and $A_{\rm{G}}$ (ag_gspphot) is the extinction in G band from GSP-Phot Aeneas best library using BP/RP spectra. Additionally, we use the extinction information from GSP-Phot (ebpminrp_gspphot) to determine $(\rm{{BP}-{RP}})_{0}$ of the color index $(\rm{{BP}-{RP}})$ (bp_rp).

Finally, a dataset containing 18,573 stars with stellar atmosphere parameters provided by APOGEE is used for training in this study. Table 2 details the step-by-step filtering process of the training data, including the filtering criteria and the remaining number of spectra at each stage. The color-magnitude diagram of the training data, color-coded by metallicity from APOGEE, is shown in the inset of Fig. 1. Additionally, the number density distribution of each parameter is presented in Fig. 17; further details on how the reference and test sets are distinguished can be found in Sect. 3.2.

Table 2: Cuts applied for the training dataset.

Cut	Number of spectra
LAMOST LRS AFGK Catalog	7,060,679
solar-like stars according to LASP	4,067,525
APOGEE DR17 cross-match within 1 arcsec	54,971
solar-like in APOGEE	47,557
Gaia DR3 cross-match within 3 arcsec	44,593
parallax_over_error $>5$	39,416
$\rm{S/N_{g}}>10$ and fibermask $=0$	18,573

2.2 Pre-processing of the input data

The radial velocity of the object causes the spectral lines to undergo the Doppler effect. Hence, for the input spectrum, we first needed to convert the wavelengths to the rest frame.

Second, to ensure alignment, we interpolated the raw spectra to a uniform wavelength range. Typically, both ends of the spectrum are noisy and contain less information about the spectral features. Therefore, we interpolated the entire sample set to a wavelength range of 3925–8800Å, using 1Å as the sampling point, resulting in a total of 4875 data points for spectral flux data.

Finally, each stellar spectrum in the training set was normalized by dividing it with its pseudocontinuum. The pseudocontinuum was obtained by fitting a polynomial to the spectrum. For this work, we found that despite FIBERMAS=0, there are still bad pixels that were not completely removed, leading to incorrect spectral lines and extremely high pixel values that affected the model prediction. To address this, we implemented two additional steps. First, the original spectrum was smoothed using the Savitzky-Golay filter, and then the smoothed spectrum was normalized. Using this approach, the normalized spectra outperformed those provided in the official LAMOST pipelines. Second, the pixels were thresholded in the normalized spectrum by simply setting to zero those above 1.2 and below 0.1. This straightforward threshold processing is effective because the anomalous pixels appear at random wavelengths. It is believed that as long as their values are not too far from the normalized flux range ( $0-1$ ), individual pixels do not significantly affect the prediction ability of the neural network. Two examples illustrating these preprocessing steps are shown in Fig. 18.

3 Method

3.1 The CNN architecture

A neural network has three components: an input layer, some hidden layers, and an output layer. Given a training set of input spectra with known labels ( $T_{\rm eff}$ , log $\it g$ , $\rm[Fe/H]$ , $M_{\rm G}$ , $(\rm{{BP}-{RP}})_{0}$ ), a neural network model can generate a function that maps the input spectra to the output parameters. This nonlinear capability is attributed to the activation function, which determines the significance of each neuron in the network. The trained model can then be used to predict the physical properties of spectra with unknown stellar parameters in another dataset.

In this work, we propose a one-dimensional convolutional neural network (CNN) regression model. The architecture of the CNN model is shown in Fig. 4. It consists of three convolutional blocks for feature extraction and two fully connected layers for regression. The activation function for both the convolutional and fully connected layers is Leaky ReLU. The output layer uses a linear activation function and has five neurons, each corresponding to a stellar parameter.

We tested different network architectures specifically for our dataset, including various combinations of convolutional blocks and fully connected layers, similar to Fig. 3 in Zhang et al. (2019). We found that the architecture with three convolutional blocks and two fully connected layers is well-suited for our dataset. It is important to note that for different tasks and datasets, the model must be modified or fine-tuned.

3.2 Model training

The training process was carried out using the dataset described in Sect. 2.1, which was divided into reference and test sets in an 8:2 ratio. A total of 14,858 stars were used for training and cross-validation in the reference set, while 3,715 stars were used in the test set to evaluate the performance of the trained models. Before training, the input pseudo-continuous spectra were already normalized, while the output labels were standardized using Z-score standardization.

To ensure robust model performance, we employed five-fold cross-validation, dividing the reference set into five non-overlapping subsets. One subset served as the validation set, and the remaining four served as the training set. This strategy ensured the efficient use of data and the proper examination of the model’s generalization ability by utilizing the entire dataset for both training and validation.

For the model optimization we used the mean squared error (MSE) loss function, defined in Eq. 2, to measure the difference between the target and predicted parameters:

\displaystyle\rm{MSE}=\frac{1}{n}\sum_{i=1}^{n}(y_{i}-\hat{y_{i}})^{2}

(2)

Here $n$ is the number of samples, $y_{i}$ is the true value, and $\hat{y_{i}}$ is the model’s predicted value. The Adam optimizer (Kingma & Ba, 2014) was employed for gradient descent, with an initial learning rate of 0.0001. This rate automatically decreases using the adaptive learning rate tuning method, ReduceOnPlateau (Ioffe & Szegedy, 2015), which halves the learning rate if the MSE value does not improve after five epochs. This approach allows the model to find a more precise solution and avoids premature convergence to a local optimum.

The CNN model was trained using the NN library Keras (Chollet et al., 2015), which provides a high-level API for the TensorFlow (Abadi et al., 2015) machine intelligence package. The train and test split, standardization, and $k$ -fold cross-validation were performed using the scikit-learn library (Pedregosa et al., 2011).

3.3 Model performance and error analysis

The performance of the CNN model was evaluated on the test set, as shown in Fig. 5. The results indicate that the CNN model’s predictions for $T_{\rm eff}$ , log $\it g$ , and $\rm[Fe/H]$ closely align with the APOGEE parameters, while $M_{\rm G}$ and $(\rm{{BP}-{RP}})_{0}$ closely match the Gaia parameters. There is no significant bias for these parameters. The scatter is 86 K for $T_{\rm eff}$ , 0.07 dex for log $\it g$ , 0.06 dex for $\rm[Fe/H]$ , 0.25 mag for $M_{\rm G}$ , and 0.03 mag for $(\rm{{BP}-{RP}})_{0}$ . To calculate the distance, we assumed $M_{\rm{true}}\sim\mathcal{N}(\mu_{M},\sigma_{M})$ , where $\mu_{M}$ is the $M_{\rm G}$ predicted by CNN, and $\sigma_{M}$ is the scatter between the labeled values and the predicted values, as defined in Eq. 3. The predicted absolute magnitudes $M_{\rm G}$ were then used to estimate the distance distribution, as shown in Eq. 5:

$\displaystyle P_{M}(M\mid\mu_{M},\sigma_{M})$	$\displaystyle=\dfrac{1}{\sqrt{2\pi}\sigma_{M}}\cdot\exp[-\dfrac{\left(M-\mu_{M}\right)^{2}}{2\sigma_{M}^{2}}]$	(3)
$\displaystyle h(d)$	$\displaystyle=m-5\log d+5-A_{\rm{G}}$	(4)
$\displaystyle P_{d}(d\mid m,A_{\rm{G}},\mu_{M},\sigma_{M})$	$\displaystyle=\left\|h(d)^{\prime}\right\|\cdot P_{M}\left[h(d)\right]$	(5)

These estimated distances are illustrated in the lower right panel of Fig. 5, where the predicted distances are compared to the geometric distances, with a scatter of 65 pc. We also calculated the fractional error, defined as the difference between the predicted and geometric distances divided by the geometric distances. The median fractional distance error is 3.7% with a standard deviation of 8.2%, as shown in the fourth panel in Fig. 7. It should be noted that our dataset contains some contamination from unresolved binaries. These stars tend to have higher residuals because the model predicts a dimmer luminosity for them, leading to a slight bimodal distribution in the fractional distance error. However, since our focus is on studying solar-like stars, and these potential unresolved binaries are fairly uniformly distributed across the parameter space, their influence on the model is naturally averaged out across each type of spectrum during the training process. Moreover, as shown in Fig. 2 of El-Badry et al. (2018), for LAMOST-like spectra with $\rm{S/N=30}$ , the typical systematic error caused by unresolved binaries is approximately 100 K in $T_{\rm eff}$ , while the errors in log $\it g$ and $\rm[Fe/H]$ are smaller than 0.1 dex and 0.05 dex, respectively. This degree of systematic error is acceptable for most scientific objectives.

To further evaluate, we compared the fractional distance error and residual of the CNN predictions to the labeled values with $T_{\rm eff}$ , log $\it g$ , and $\rm[Fe/H]$ separately, as shown in Fig. 6. In the left three panels, the mean fractional distance errors (gray dots) do not exhibit significant deviations from $T_{\rm eff}$ . However, the standard deviation of fractional error (represented by the error bars on the gray dots) shows a slight increase at higher $T_{\rm eff}$ , lower log $\it g$ , and lower $\rm[Fe/H]$ . At lower log $\it g$ ( $\hbox{log\,$\it g$}<4.0$ ) and $\rm[Fe/H]$ ( $\hbox{$\rm[Fe/H]$}<-0.5$ ), our model shows larger deviations from the Gaia distances. These deviations likely arise because the quality of the model predictions depends on the number of stars in the training set that span the parameter space of the test set. Consequently, we suggest that the deviations are due to an insufficient number of stars in our reference set, which is a common issue in data-driven models. Additionally, the precision of predicted distances improves with smaller distances due to Gaia’s higher accuracy in these ranges.

The right three panels of Fig. 6 show the residuals of $T_{\rm eff}$ , log $\it g$ , and $\rm[Fe/H]$ , respectively. The model predictions show excellent agreement with the ASPCAP results for the high $\rm{S/N_{g}}$ spectra. However, larger scatters are observed between CNN and ASPCAP for the lower $\rm{S/N_{g}}$ spectra. From the projected residual distributions, $\rm[Fe/H]$ is the most sensitive parameter to the signal-to-noise ratio, while $T_{\rm eff}$ and log $\it g$ are less sensitive. The metallicity depends on the strength of the spectral lines, which are more affected by the signal-to-noise ratio, whereas $T_{\rm eff}$ and log $\it g$ rely on the continuum shape and are less affected. Meanwhile, for lower effective temperatures ( $\hbox{$T_{\rm eff}$}<5000$ K), lower surface gravity ( $\hbox{log\,$\it g$}<4.1$ ), and lower metallicity ( $\hbox{$\rm[Fe/H]$}<-0.5$ ), the CNN model tends to predict higher values compared to ASPCAP. This discrepancy is likely due to the nature of the spectra themselves, where weaker spectral features would make it challenging for the CNN model to identify the most significant features during training. Similarly, the parameters determined by ASPCAP have larger intrinsic uncertainties, which in turn affect the model performance.

In this work we find no strong evidence that the derived distance errors are affected by surface gravity, as reported in previous studies (Carlin et al., 2015; Leung & Bovy, 2019b; Stone-Martinez et al., 2024); this is due to the limited surface gravity range of our dataset, which is mostly concentrated around $3.5-4.5$ dex for dwarf stars. In the left three panels of Fig. 7, the fractional distance error is compared to the errors in $T_{\rm eff}$ , log $\it g$ , and $\rm[Fe/H]$ . There is no significant dependence of distance error on the uncertainties in these three stellar parameters. The second panel shows a slight correlation between $\Delta\hbox{log\,$\it g$}$ and log $\it g$ : the model tends to overestimate log $\it g$ at lower values and underestimate it at higher values, which is also reflected in the second panel on the right in Fig. 6. The last panel shows the distribution of fractional distance error, with a median of 3.7% and a standard deviation of 8.2%, which demonstrate that this CNN model has a profound ability for distance estimations.

We also compared our CNN model with other ML algorithms, such as LightGBM boosted trees (Ke et al., 2017), Random Forests, and $k$ -Nearest Neighbors. We find that our CNN model is a better choice in both precision and efficiency; the details are described in Appendix C.

4 Results

4.1 The solar-like stars from LAMOST DR9

Based on the solar-like stars obtained using the second screening condition in Table 2, we selected spectra with ${\rm{S/N_{g}}}\geq 10$ . Among these, we identified spectra with problematic distance measurements: those where Gaia did not provide parallaxes, those where Gaia GSP-Phot did not provide distances, and those where the parallaxes had large errors ( $\texttt{parallax\_over\_error}\leq 5$ , indicating fractional error larger than 20%). This selection process resulted in a total of 521,455 spectra, representing 13% of the LAMOST solar-like sample. More details are shown in the Venn diagram in Fig. 8. Even with data lacking parallax values, it can be seen that GSP-Phot can provide distances in some circumstances; however, only 9,449 spectra meet this condition. Additionally, larger parallax errors lead to less precise distance estimates (Ulla et al., 2022). Fortunately, our model offers a solution for these problematic cases.

Using the same preprocessing described in Sect. 2.2, we obtained 521,424 normalized spectra spanning wavelengths from 3925Å to 8800Å. Based on the trained CNN model, we predicted the stellar parameters $T_{\rm eff}$ , log $\it g$ , $\rm[Fe/H]$ , $M_{\rm G}$ , and $(\rm{{BP}-{RP}})_{0}$ for 521,424 solar-like stars. The results are shown in Fig. 9, which displays a color–magnitude diagram of these stars using the predicted $M_{\rm G}$ and $(\rm{{BP}-{RP}})_{0}$ . The CNN predictions are compared to stellar isochrones to show the expected trend in the parameter relationships. The four lines are the theoretical isochrones from the PAdova and TRieste Stellar Evolution Code (PARSEC) model (Bressan et al., 2012) with $\hbox{$\rm[Fe/H]$}=-0.8,-0.5,0.0,$ and $0.3$ dex at an age of 1.5 Gyr, as marked by the corresponding colors. The trend aligns well with the isochrones, though the large scatter is due to the multiple populations in the sample. The results including stellar parameters and distance are saved in a CSV format table and can be found online.³³3https://nadc.china-vo.org/res/r101400/ The column descriptions of the final catalog are given in Table 6.

4.2 Comparison with other distances

To ensure the reliability of the distances obtained by our CNN model, we compared them with those from other studies, including Queiroz et al. (2018) (hereafter StarHorse), Stone-Martinez et al. (2024) (hereafter DistMass), and Leung & Bovy (2019b) (hereafter astroNN).

StarHorse (Santiago et al., 2016; Queiroz et al., 2018, 2020) is a Bayesian isochrone-fitting code that derives distances, extinctions, and astrophysical parameters for APOGEE stars, with a typical distance uncertainty of $\sim 5\%$ . DistMass (Stone-Martinez et al., 2024) is a simple neural network that predicts distances and masses of stars in APOGEE DR17. The distances are trained from stellar parameters using Gaia DR3 distances and literature distances for star clusters, achieving a median fractional distance error of $\sim 10\%$ at higher log $\it g$ and a standard deviation of $\sim 11\%$ . The deep learning model astroNN (Leung & Bovy, 2019a, b) uses APOGEE DR17 spectra and Gaia eDR3 data to determine stellar parameters, distances, and other stellar properties. These three catalogs are provided as the VACs in APOGEE DR17.⁴⁴4https://www.sdss4.org/dr17/data_access/value-added-catalogs/

We cross-matched the 521,424 stars in our results with three catalogs, resulting in 5443, 5240, and 5648 common stars for StarHorse, DistMass, and astroNN, respectively. The top three panels in Fig. 10 show the comparisons, demonstrating the consistency of the distances obtained by our CNN model with those from other studies. The scatter of the difference between this work and astroNN is 117 pc with an 18.74% standard deviation of fractional error, which is less than the 123 pc for StarHorse and 165 pc for DistMass. The mean fractional error between this work and both DistMass and astroNN is less than 5%, while for StarHorse it is 7%. Except for StarHorse, our results appear to be slightly underestimated compared to DistMass and astroNN. This discrepancy may be due to the different training sets and methodologies used in the models.

To provide a comprehensive comparison, we include pairwise comparisons between these studies in the bottom three panels of Fig. 10, involving 4401, 4788, and 5242 common stars, respectively. DistMass and astroNN appear to overestimate compared to StarHorse, while DistMass shows higher estimates compared to astroNN. Among these three comparisons, astroNN and DistMass exhibit the best agreement with a 3.8% mean fractional error, compared to 9.4% between astroNN and StarHorse, and 15.2% between DistMass and StarHorse. However, astroNN and StarHorse show the smallest scatter of 90 pc, while it is 118 pc between astroNN and DistMass, and 157 pc between DistMass and StarHorse. The pairwise comparison shows that astroNN and DistMass, both neural network models, exhibit greater consistency with each other than with StarHorse, a Bayesian isochrone-fitting model, reflecting the methodological differences. In summary, our distance estimation agrees well with those of other studies.

4.3 Comparison with other surveys

To ensure the accuracy of the stellar parameters obtained with our CNN model, we compared our results with those from high-resolution observations, specifically the APOGEE DR17 (Abdurro’uf et al., 2022) and the third data release of the GALactic Archaeology with HERMES surveys (GALAH DR3; Buder et al., 2021), two of the most comprehensive spectroscopic surveys.

APOGEE DR17 (Abdurro’uf et al., 2022) published medium-high-resolution (R $\sim 22,500$ ) near-infrared spectra of over 650,000 stars from the APOGEE-North and APOGEE-South surveys. The stellar parameters were derived by ASPCAP using MARCS model atmospheres with new spectral grids accounting for non-LTE level populations. GALAH DR3 (Buder et al., 2021) includes 768,423 high-resolution (R $\sim 28,000$ ) optical spectra from 342,682 stars. The parameters were estimated using the Spectroscopy Made Easy (SME) model-driven approach and one-dimensional MARCS model atmospheres. Additionally, GALAH DR3 mitigated spectroscopic degeneracies using astrometry from Gaia DR2 and photometry from 2MASS.

To perform the comparison, we cross-matched our results with APOGEE DR17 and GALAH DR3, identifying 6,040 and 6,329 stars with corresponding stellar parameters. To ensure the reliability of the comparison, we applied STARFLAG $=0$ for the APOGEE spectra; flag_sp $=0$ , flag_fe_h $=0$ , and red_flag $=0$ for the GALAH spectra; and snrg $>50$ for the LAMOST spectra. Moreover, we impose constraints on the errors of $T_{\rm eff}$ , log $\it g$ , and $\rm[Fe/H]$ given by ASPCAP and GALAH, setting them to less than 200 K, 0.2 dex, and 0.2 dex, respectively. Finally, we obtain 2,677 common stars with APOGEE and 2,218 with GALAH.

Fig. 11 shows the differences of $T_{\rm eff}$ , log $\it g$ , and $\rm[Fe/H]$ between this work and the two surveys. In terms of effective temperature, our results are more consistent with those of GALAH, with a scatter of 97 K, while APOGEE shows a scatter of 121 K. This difference may be attributed to the wavelength ranges observed by the surveys, as LAMOST and GALAH both observe in the optical range, while APOGEE in the near-infrared. For surface gravity, the scatter for GALAH is 0.14 dex, which is slightly larger than the 0.12 dex for APOGEE; however, the bias for APOGEE is smaller than that for GALAH. Our results slightly overestimate log $\it g$ compared to both surveys. Regarding metallicity, this work exhibits more consistency with APOGEE than with GALAH, with APOGEE having a scatter of 0.06 dex compared to 0.13 dex for GALAH. Nonetheless, there is no significant bias for either survey. In summary, our results are in good agreement with both APOGEE and GALAH, demonstrating the reliability of our CNN model.

4.4 Metallicity gradients of the Milky Way

Using the 521,424 solar-like stars, we investigated the radial and vertical metallicity gradients to explore the utility of our distance and metallicity measurements for Galactic structure studies. Previous studies indicate that the Milky Way’s disk shows a negative radial metallicity gradient, with the inner Galaxy (where the galactocentric radius $R$ is less than the solar value) is typically more metal-rich than the outer disk. This gradient ranges from $-0.1<\Delta\hbox{$\rm[Fe/H]$}/\Delta R<0.0\ {\rm{dex\ kpc^{-1}}}$ in the Galactic plane. Additionally, the Galactic disk exhibits a negative absolute vertical metallicity gradient, varying in the range $-0.25<\Delta\hbox{$\rm[Fe/H]$}/\Delta Z<-0.10\ {\rm{dex\ kpc^{-1}}}$ , depending on the tracer population and survey volume (Önal Taş et al., 2016; Yan et al., 2019; Gaia Collaboration et al., 2023a; Wang et al., 2023a; Hawkins, 2023; Imig et al., 2023, and references therein). These negative metallicity gradients strongly suggest that the Galactic disk formed in an inside-out manner, with the inner Galaxy forming early and rapidly, followed by the outer Galaxy later on (Frankel et al., 2019).

Based on our distances, we calculated the galactocentric Cartesian coordinates using astropy.⁵⁵5https://docs.astropy.org/en/stable/coordinates/index.html In Fig. 12 we show the X-Z (top panel) and R-Z (bottom panel) spatial distribution of the solar-like stars in our results. From these we selected a subsample of 458,061 stars within $7\leq R\leq 14$ kpc and $|Z|\leq 2$ kpc to investigate the metallicity gradients. In Fig. 13 we display the metallicity distribution in the same planes as Fig. 12. The top panel shows that the inner Galaxy has a higher metallicity than the outer regions (indicating a negative radial metallicity gradient), while the bottom panel reveals that metallicity decreases as $|Z|$ increases (indicating a negative vertical metallicity gradient).

For this work we chose a single linear model for both radial and vertical metallicity gradients. The radial metallicity gradient is defined as

\hbox{$\rm[Fe/H]$}_{R}=\frac{\Delta\hbox{$\rm[Fe/H]$}}{\Delta R}R+b_{R},

(6)

where $\hbox{$\rm[Fe/H]$}_{R}$ is the metallicity at a particular galactocentric radius $R$ , $\Delta\hbox{$\rm[Fe/H]$}/\Delta R$ is the radial metallicity gradient, and $b_{R}$ is the intercept. The vertical metallicity gradient is defined as

\hbox{$\rm[Fe/H]$}_{Z}=\frac{\Delta\hbox{$\rm[Fe/H]$}}{\Delta Z}Z+b_{Z},

(7)

where $\hbox{$\rm[Fe/H]$}_{Z}$ is the metallicity at a particular absolute vertical height $|Z|$ , $\Delta\hbox{$\rm[Fe/H]$}/\Delta Z$ is the vertical metallicity gradient, and $b_{Z}$ is the intercept. However, since the metallicity gradients are not the main focus of this work, we applied a simple linear least-squares regression to estimate the gradients as well as their uncertainties ( $\sigma_{\Delta\hbox{$\rm[Fe/H]$}/\Delta R}$ , $\sigma_{b_{R}}$ ; $\sigma_{\Delta\hbox{$\rm[Fe/H]$}/\Delta Z}$ , $\sigma_{b_{Z}}$ ) using the scipy library.⁶⁶6https://docs.scipy.org/doc/scipy-1.14.0/reference/generated/scipy.stats.linregress.html To calculate the radial metallicity gradient, we divided our sample into equally spaced vertical bins with a width of 0.2 kpc, covering the range $0<|Z|<2$ kpc. Similarly, for the vertical metallicity gradient, the stars were divided into equally spaced radial bins from $7<R<14$ kpc, each with a bin width of 1 kpc.

The results are summarized in Tables 3 and 4. The radial metallicity gradient ranges from $-0.05$ to $0.0\ {\rm{dex\ kpc^{-1}}}$ , while the vertical metallicity gradient ranges from $-0.26$ to $-0.07\ {\rm{dex\ kpc^{-1}}}$ . These results are consistent with previous studies (Gaia Collaboration et al., 2023a; Hawkins, 2023; Imig et al., 2023), as shown in Fig. 14. In the top panel our result show the best agreement with those of Gaia Collaboration et al. (2023a), which is based on FGK dwarfs and giants from the Gaia DR3 survey. Imig et al. (2023) and Hawkins (2023) used red giants from APOGEE and OBAF-type stars from LAMOST as tracers, respectively, and found steeper gradients compared to ours. These differences likely arise from the distinct tracer populations and observational surveys used. In the bottom panel, our results are also generally consistent with those of Hawkins (2023) and Imig et al. (2023). The discrepancies between these studies can be attributed to variations in binning methods and sample sizes. Additionally, it can be seen that the radial and vertical gradients generally flatten at larger distances from the Galactic center and greater vertical heights, respectively, which aligns well with previous findings. Our results for the metallicity gradients serve as a simple demonstration, illustrating that our distance and stellar parameter estimations have properties similar to what is expected based on our knowledge of metallicity distributions of Milky Way components.

Table 3: Radial metallicity gradients for

|Z|

bins.

$\|Z\|$	$\Delta\hbox{$\rm[Fe/H]$}/\Delta R$	$\sigma_{\Delta\hbox{$\rm[Fe/H]$}/\Delta R}$	$b_{R}$	$\sigma_{b_{R}}$	N
(kpc)	( $\rm{dex\ kpc^{-1}}$ )	( $\rm{dex\ kpc^{-1}}$ )	(dex)	(dex)
0.1	$-0.054$	0.001	0.367	0.008	110 637
0.3	$-0.040$	0.001	0.192	0.007	111 861
0.5	$-0.030$	0.001	0.062	0.007	72 406
0.7	$-0.022$	0.001	$-0.051$	0.008	46 237
0.9	$-0.011$	0.001	$-0.202$	0.008	31 200
1.1	$0.000$	0.001	$-0.360$	0.009	23 342
1.3	$0.003$	0.001	$-0.408$	0.010	19 706
1.5	$0.004$	0.001	$-0.434$	0.011	16 660
1.7	$0.005$	0.001	$-0.456$	0.011	14 009
1.9	$0.008$	0.001	$-0.497$	0.011	12 003

Table 4: Vertical metallicity gradients for

R

bins.

$R$	$\Delta\hbox{$\rm[Fe/H]$}/\Delta Z$	$\sigma_{\Delta\hbox{$\rm[Fe/H]$}/\Delta Z}$	$b_{Z}$	$\sigma_{b_{Z}}$	N
(kpc)	( $\rm{dex\ kpc^{-1}}$ )	( $\rm{dex\ kpc^{-1}}$ )	(dex)	(dex)
7.5	$-0.263$	0.002	$-0.044$	0.001	67 685
8.5	$-0.237$	0.001	$-0.073$	0.001	247 688
9.5	$-0.163$	0.001	$-0.124$	0.001	67 250
10.5	$-0.098$	0.002	$-0.223$	0.002	38 715
11.5	$-0.078$	0.002	$-0.283$	0.003	24 062
12.5	$-0.065$	0.003	$-0.299$	0.004	9 809
13.5	$-0.069$	0.006	$-0.293$	0.008	2 852

5 Discussion and Conclusion

5.1 Feature importance

To examine which parts of the spectra the CNN is weighting when predicting stellar parameters, we analyzed the feature importance through Shapley values. Using the Python package SHAP (Shapley Additive Explanations) (Lundberg & Lee, 2017), we calculated the Shapley values of each feature for our CNN model, where each feature corresponds to a 1Å wavelength.

We calculated the Shapley values for 2,000 stars. To quantify the impact, we computed the normalized average absolute Shapley values for each parameter, where larger values at a given wavelength indicate a greater influence on the parameter predictions. Comparisons are then made between hot ( $\hbox{$T_{\rm eff}$}>5000$ K) and cool stars ( $\hbox{$T_{\rm eff}$}\leq 5000$ K), and between metal-poor stars ( $\hbox{$\rm[Fe/H]$}<0.0$ dex) and metal-rich stars ( $\hbox{$\rm[Fe/H]$}\geq 0.0$ dex), as shown in Figs. 15 and 16. Only the results for the wavelength range of $3925-5500$ Å are demonstrated, as the $5500-8800$ Å range does not show significant importance. A few notable features include the following:

(i)

The helium line plays a crucial role in the determination of surface gravity and color, as well as the metallicity estimation for hot stars.
(ii)

Atomic metal lines, such as FeI, MgI, CrI, Ca, and Mg, significantly contribute to the determination of temperature and metallicity throughout our stellar parameter range.
(iii)

Spectral lines that substantially impact a particular parameter also influence other parameters, although their significance may vary. For instance, the FeI line at 4272Å has a pronounced effect on absolute magnitude, but a weaker influence on surface gravity.
(iv)

Unidentified or weaker lines, including those in the 5500–8800Å range that may not be prominent individually, collectively contribute significantly to the measurement of stellar parameters. Therefore, employing full spectrum analysis might be preferable to using spectral indices.
(v)

In Fig. 15 for hot stars, the CH line plays an important role in determining temperature and absolute magnitude, while the helium line contributes substantially to the estimation of metallicity.

5.2 Conclusion

In this work we proposed a convolutional neural network model to derive distances and stellar parameters ( $T_{\rm eff}$ , log $\it g$ , and $\rm[Fe/H]$ ) directly from LAMOST low-resolution spectra, combined with photometric data from Gaia. The model is trained on the LAMOST DR9 spectra, using APOGEE DR17 stellar parameters as well as Gaia DR3 for solar-like stars. For spectra with $\rm{S/N_{g}}>10$ , the test shows a scatter of 86 K for $T_{\rm eff}$ , 0.07 dex for log $\it g$ , 0.06 dex for $\rm[Fe/H]$ , 0.25 mag for $M_{\rm G}$ , and 0.03 mag for $(\rm{{BP}-{RP}})_{0}$ . The distance predictions have a scatter of 65 pc, indicating a median fractional error of less than 4% and a standard deviation of 8%.

Using the CNN model, we determined the distance and stellar parameters of 521,424 solar-like stars from LAMOST DR9, thus compensating for those that lack precise distance measurements. The reliability of the distance estimation was evaluated by comparing it with three other studies: StarHorse, DistMass, and astroNN. The comparisons yield mean fractional errors of $7\%$ , $-4\%$ , and $-4\%$ , respectively, indicating good agreement with these studies. Additionally, the accuracy of the stellar parameters is validated by comparison with APOGEE DR17 and GALAH DR3, showing minor biases and different degrees of scatter but overall consistency with these datasets.

Moreover, using a subsample of solar-like stars from our dataset, we calculated the radial and vertical metallicity gradients of the Milky Way. The radial gradient ranges from $-0.05<\Delta\hbox{$\rm[Fe/H]$}/\Delta R<0.0\ {\rm{dex\ kpc^{-1}}}$ for stars located at $0<|Z|<2$ kpc, while the vertical gradient ranges from $-0.26<\Delta\hbox{$\rm[Fe/H]$}/\Delta Z<-0.07\ \rm{dex\ kpc^{-1}}$ for stars situated at $7<R<14$ kpc. These results align well with previous studies, indicating that our distance and metallicity measurements are reliable for Galactic structure studies.

In the future, we plan to extend this model to include a broader range of stellar types, beyond solar-like stars. This includes improving the determination of distances and stellar parameters for a larger number of stars in spectroscopic surveys.

Acknowledgements.

This work is supported by the National Natural Science Foundation of China (12261141689, 12273075), the National Key R&D Program of China No. 2019YFA0405502, and the science research grants from the China Manned Space Project with NO.CMS-CSST-2021-B05. We sincerely thank the anonymous referee for the helpful comments. Yue-Yue Shen thanks Bo Zhang for the insightful suggestions with the paper. This work has made use of data from the Guoshoujing Telescope (the Large Sky Area Multi-Object Fiber Spectroscopic Telescope LAMOST), a National Major Scientific Project built by the Chinese Academy of Sciences. Funding for the project has been provided by the National Development and Reform Commission. LAMOST is operated and managed by the National Astronomical Observatories, Chinese Academy of Sciences. This work has made use of data from the European Space Agency (ESA) mission Gaia (https://www.cosmos.esa.int/gaia), processed by the Gaia Data Processing and Analysis Consortium (DPAC, https://www.cosmos.esa.int/web/gaia/dpac/consortium). Funding for the DPAC has been provided by national institutions, in particular the institutions participating in the Gaia Multilateral Agreement. Funding for the Sloan Digital Sky Survey IV has been provided by the Alfred P. Sloan Foundation, the U.S. Department of Energy Office of Science, and the Participating Institutions. SDSS-IV acknowledges support and resources from the Center for High Performance Computing at the University of Utah. The SDSS website is www.sdss4.org. SDSS-IV is managed by the Astrophysical Research Consortium for the Participating Institutions of the SDSS Collaboration including the Brazilian Participation Group, the Carnegie Institution for Science, Carnegie Mellon University, Center for Astrophysics — Harvard & Smithsonian, the Chilean Participation Group, the French Participation Group, Instituto de Astrofísica de Canarias, The Johns Hopkins University, Kavli Institute for the Physics and Mathematics of the Universe (IPMU) / University of Tokyo, the Korean Participation Group, Lawrence Berkeley National Laboratory, Leibniz Institut für Astrophysik Potsdam (AIP), Max-Planck-Institut für Astronomie (MPIA Heidelberg), Max-Planck-Institut für Astrophysik (MPA Garching), Max-Planck-Institut für Extraterrestrische Physik (MPE), National Astronomical Observatories of China, New Mexico State University, New York University, University of Notre Dame, Observatário Nacional / MCTI, The Ohio State University, Pennsylvania State University, Shanghai Astronomical Observatory, United Kingdom Participation Group, Universidad Nacional Autónoma de México, University of Arizona, University of Colorado Boulder, University of Oxford, University of Portsmouth, University of Utah, University of Virginia, University of Washington, University of Wisconsin, Vanderbilt University, and Yale University. This work made use of the Third Data Release of the GALAH Survey (Buder et al. 2021). The GALAH Survey is based on data acquired through the Australian Astronomical Observatory, under programs: A/2013B/13 (The GALAH pilot survey); A/2014A/25, A/2015A/19, A2017A/18 (The GALAH survey phase 1); A2018A/18 (Open clusters with HERMES); A2019A/1 (Hierarchical star formation in Ori OB1); A2019A/15 (The GALAH survey phase 2); A/2015B/19, A/2016A/22, A/2016B/10, A/2017B/16, A/2018B/15 (The HERMES-TESS program); and A/2015A/3, A/2015B/1, A/2015B/19, A/2016A/22, A/2016B/12, A/2017A/14 (The HERMES K2-follow-up program). We acknowledge the traditional owners of the land on which the AAT stands, the Gamilaraay people, and pay our respects to elders past and present. This paper includes data that has been provided by AAO Data Central (datacentral.org.au). The preparation of this work has made use of TOPCAT(Taylor, 2005), NASA’s Astrophysics Data System Bibliographic Services, as well as the open-source Python packages astropy(Astropy Collaboration et al., 2013, 2018, 2022), NumPy(Harris et al., 2020), tensorflow(Abadi et al., 2015), scipy(Virtanen et al., 2020), scikit-learn(Pedregosa et al., 2011) and pandas(Wes McKinney, 2010). The figures in this paper were produced with matplotlib(Hunter, 2007).

References

Abadi et al. (2015) Abadi, M., Agarwal, A., Barham, P., et al. 2015, TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems, software available from tensorflow.org
Abdurro’uf et al. (2022) Abdurro’uf, Accetta, K., Aerts, C., et al. 2022, ApJS, 259, 35
Andrae et al. (2023) Andrae, R., Fouesneau, M., Sordo, R., et al. 2023, A&A, 674, A27
Astropy Collaboration et al. (2022) Astropy Collaboration, Price-Whelan, A. M., Lim, P. L., et al. 2022, ApJ, 935, 167
Astropy Collaboration et al. (2018) Astropy Collaboration, Price-Whelan, A. M., Sipőcz, B. M., et al. 2018, AJ, 156, 123
Astropy Collaboration et al. (2013) Astropy Collaboration, Robitaille, T. P., Tollerud, E. J., et al. 2013, A&A, 558, A33
Baglin et al. (2006) Baglin, A., Auvergne, M., Boisnard, L., et al. 2006, in 36th COSPAR Scientific Assembly, Vol. 36, 3749
Bailer-Jones et al. (2021) Bailer-Jones, C. A. L., Rybizki, J., Fouesneau, M., Demleitner, M., & Andrae, R. 2021, AJ, 161, 147
Bordé et al. (2003) Bordé, P., Rouan, D., & Léger, A. 2003, A&A, 405, 1137
Borucki et al. (2010) Borucki, W. J., Koch, D., Basri, G., et al. 2010, Science, 327, 977
Bressan et al. (2012) Bressan, A., Marigo, P., Girardi, L., et al. 2012, MNRAS, 427, 127
Buder et al. (2021) Buder, S., Sharma, S., Kos, J., et al. 2021, MNRAS, 506, 150
Burnett & Binney (2010) Burnett, B. & Binney, J. 2010, MNRAS, 407, 339
Carlin et al. (2015) Carlin, J. L., Liu, C., Newberg, H. J., et al. 2015, AJ, 150, 4
Chollet et al. (2015) Chollet, F. et al. 2015, Keras
Cui et al. (2012) Cui, X.-Q., Zhao, Y.-H., Chu, Y.-Q., et al. 2012, RAA, 12, 1197
Das & Sanders (2019) Das, P. & Sanders, J. L. 2019, MNRAS, 484, 294
Dorogush et al. (2018) Dorogush, A. V., Ershov, V., & Gulin, A. 2018, arXiv preprint arXiv:1810.11363
El-Badry et al. (2018) El-Badry, K., Rix, H.-W., Ting, Y.-S., et al. 2018, MNRAS, 473, 5043
Erickson et al. (2020) Erickson, N., Mueller, J., Shirkov, A., et al. 2020, arXiv e-prints, arXiv:2003.06505
Fabbro et al. (2018) Fabbro, S., Venn, K. A., O’Briain, T., et al. 2018, MNRAS, 475, 2978
Frankel et al. (2019) Frankel, N., Sanders, J., Rix, H.-W., Ting, Y.-S., & Ness, M. 2019, ApJ, 884, 99
Gaia Collaboration et al. (2023a) Gaia Collaboration, Recio-Blanco, A., Kordopatis, G., et al. 2023a, A&A, 674, A38
Gaia Collaboration et al. (2023b) Gaia Collaboration, Vallenari, A., Brown, A. G. A., et al. 2023b, A&A, 674, A1
Harris et al. (2020) Harris, C. R., Millman, K. J., van der Walt, S. J., et al. 2020, Nature, 585, 357
Hawkins (2023) Hawkins, K. 2023, MNRAS, 525, 3318
Hawley et al. (2002) Hawley, S. L., Covey, K. R., Knapp, G. R., et al. 2002, AJ, 123, 3409
Hunter (2007) Hunter, J. D. 2007, Computing in Science & Engineering, 9, 90
Imig et al. (2023) Imig, J., Price, C., Holtzman, J. A., et al. 2023, ApJ, 954, 124
Ioffe & Szegedy (2015) Ioffe, S. & Szegedy, C. 2015, arXiv e-prints, arXiv:1502.03167
Ke et al. (2017) Ke, G., Meng, Q., Finley, T., et al. 2017, Advances in neural information processing systems, 30
Kingma & Ba (2014) Kingma, D. P. & Ba, J. 2014, arXiv e-prints, arXiv:1412.6980
Leung & Bovy (2019a) Leung, H. W. & Bovy, J. 2019a, MNRAS, 483, 3255
Leung & Bovy (2019b) Leung, H. W. & Bovy, J. 2019b, MNRAS, 489, 2079
Lindegren et al. (2021) Lindegren, L., Bastian, U., Biermann, M., et al. 2021, A&A, 649, A4
Lundberg & Lee (2017) Lundberg, S. & Lee, S.-I. 2017, arXiv e-prints, arXiv:1705.07874
Luo et al. (2015) Luo, A.-L., Zhao, Y.-H., Zhao, G., et al. 2015, RAA, 15, 1095
Majewski et al. (2017) Majewski, S. R., Schiavon, R. P., Frinchaboy, P. M., et al. 2017, AJ, 154, 94
Önal Taş et al. (2016) Önal Taş, Ö., Bilir, S., Seabroke, G., et al. 2016, PASP, 33, e044
O’Briain et al. (2021) O’Briain, T., Ting, Y.-S., Fabbro, S., et al. 2021, ApJ, 906, 130
Pedregosa et al. (2011) Pedregosa, F., Varoquaux, G., Gramfort, A., et al. 2011, Journal of Machine Learning Research, 12, 2825
Queiroz et al. (2020) Queiroz, A. B. A., Anders, F., Chiappini, C., et al. 2020, A&A, 638, A76
Queiroz et al. (2018) Queiroz, A. B. A., Anders, F., Santiago, B. X., et al. 2018, MNRAS, 476, 2556
Ricker et al. (2015) Ricker, G. R., Winn, J. N., Vanderspek, R., et al. 2015, Journal of Astronomical Telescopes, Instruments, and Systems, 1, 014003
Rui et al. (2019) Rui, W., A-li, L., Shuo, Z., et al. 2019, PASP, 131, 024505
Santiago et al. (2016) Santiago, B. X., Brauer, D. E., Anders, F., et al. 2016, A&A, 585, A42
Stone-Martinez et al. (2024) Stone-Martinez, A., Holtzman, J. A., Imig, J., et al. 2024, AJ, 167, 73
Taylor (2005) Taylor, M. B. 2005, in Astronomical Society of the Pacific Conference Series, Vol. 347, Astronomical Data Analysis Software and Systems XIV, ed. P. Shopbell, M. Britton, & R. Ebert, 29
Ting et al. (2019) Ting, Y.-S., Conroy, C., Rix, H.-W., & Cargile, P. 2019, ApJ, 879, 69
Ulla et al. (2022) Ulla, A., Creevey, O. L., Álvarez, M. A., et al. 2022, Gaia DR3 documentation Chapter 11: Astrophysical parameters
Virtanen et al. (2020) Virtanen, P., Gommers, R., Oliphant, T. E., et al. 2020, Nature Methods, 17, 261
Wang et al. (2023a) Wang, C., Yuan, H., Xiang, M., et al. 2023a, A&A, 674, A129
Wang et al. (2020) Wang, R., Luo, A.-L., Chen, J.-J., et al. 2020, ApJ, 891, 23
Wang et al. (2023b) Wang, R., Luo, A.-L., Zhang, S., et al. 2023b, ApJS, 266, 40
Wes McKinney (2010) Wes McKinney. 2010, in Proceedings of the 9th Python in Science Conference, ed. Stéfan van der Walt & Jarrod Millman, 56 – 61
Xiang et al. (2022) Xiang, M., Rix, H.-W., Ting, Y.-S., et al. 2022, A&A, 662, A66
Yan et al. (2019) Yan, Y., Du, C., Liu, S., et al. 2019, ApJ, 880, 36
Zhang et al. (2020) Zhang, B., Liu, C., & Deng, L.-C. 2020, ApJS, 246, 9
Zhang et al. (2022) Zhang, W., Zhang, J., He, H., et al. 2022, ApJS, 263, 12
Zhang et al. (2019) Zhang, X., Zhao, G., Yang, C. Q., Wang, Q. X., & Zuo, W. B. 2019, PASP, 131, 094202
Zhao et al. (2012) Zhao, G., Zhao, Y.-H., Chu, Y.-Q., Jing, Y.-P., & Deng, L.-C. 2012, RAA, 12, 723

Appendix A Dataset for CNN model

The parameter distribution of the dataset with a total of 18,573 stars is shown in Fig. 17. The reference set consists of 14,858 stars, while the test set contains 3,715 stars.

Appendix B Pre-processing of the spectra

Two examples of pre-processing are shown in Fig. 18.

Appendix C Model comparison

The primary goal of this study is to solve a multi-label regression problem, which can be addressed by many Machine Learning (ML) algorithms. To compare our CNN model with other ML algorithms, we use the AutoGluon toolkit (Erickson et al. 2020), an open-source⁷⁷7https://github.com/autogluon/autogluon AutoML framework. With AutoGluon, we can automatically train and tune several traditional models specifically on our dataset, including LightGBM boosted trees (Ke et al. 2017), CatBoost boosted trees (Dorogush et al. 2018), Random Forests, Extremely Randomized Trees, $k$ -Nearest Neighbors, and multilayer perceptrons (MLP). This approach ensures a fair and comprehensive comparison by effectively optimizing each traditional model’s hyperparameters.

To further elaborate, AutoGluon tackles a multi-label problem by breaking it down into multiple single-label regression tasks. The process is sequential: the model first predicts one label, then uses this prediction as an input for predicting the next label. Correlations between labels are accounted for by imposing an order on the labels, ensuring that the predictions for later labels incorporate the information from earlier ones. This sequential process captures the dependencies between labels more effectively.

We evaluated these models using the same test set and performance metric as our CNN model to enable model comparison, as presented in Table 5. The CNN represents our approach, while the other models are automatically tuned by AutoGluon without any human intervention. LightGBMXT, LightGBM, and LightGBMLarge are three different versions of the LightGBM boosted model, each with different hyperparameters. NeuralNetFastAI and NeuralNetTorch are both MLPs with different implementations. KNeighborsUnif and KNeighborsDist are both $k$ -Nearest Neighbors models, differing in their weight functions. For $T_{\rm eff}$ prediction, the CNN outperforms all other models, achieving the smallest MSE value. The CNN also performs well for log $\it g$ , $\rm[Fe/H]$ , $M_{\rm G}$ , and $(\rm{{BP}-{RP}})_{0}$ . Although the CNN model does not exhibit the best overall performance, it remains competitive given its computational efficiency. It is evident that AutoGluon requires additional computation time and storage space to train and tune these various models. Furthermore, the precision differences to three decimal places are unlikely to significantly improve the results. In such cases, the CNN model is a better choice.

Table 5: Comparison of our CNN model with other models.

Model	$T_{\rm eff}$	log $\it g$	$\rm[Fe/H]$	$M_{\rm G}$	$(\rm{{BP}-{RP}})_{0}$	Weight mean MSE
CNN	7390	0.00497	0.00318	0.0649	0.00117	0.0685
LightGBMXT	7653	0.00409	0.00236	0.0531	0.0011	0.0606
LightGBM	7803	0.00409	0.00259	0.0558	0.00108	0.0619
LightGBMLarge	7815	0.0045	0.0033	0.0539	0.00109	0.0648
CatBoost	7925	0.00401	0.0023	0.0536	0.00117	0.0618
NeuralNetFastAI	8671	0.00438	0.00165	0.0615	0.00136	0.0668
NeuralNetTorch	8313	0.00561	0.00323	0.0672	0.00158	0.0782
ExtraTreesMSE	8470	0.00628	0.0051	0.0607	0.00128	0.0803
RandomForestMSE	8523	0.00586	0.00568	0.0614	0.00123	0.0804
KNeighborsUnif	12530	0.01385	0.02056	0.1644	0.00212	0.1891
KNeighborsDist	12607	0.01377	0.02015	0.1635	0.00211	0.1875

⁸⁸8Other models are automatically tuned by AutoGluon using the same training set. The performance values are evaluated using the test set. The first five columns show the MSE values of the five parameters under different models, demonstrating the parameter prediction ability for each parameter. The last column shows the weighted mean MSE of all five parameters, reflecting the overall performance of each model.

Appendix D Catalog of 521,424 solar-like stars

Table 6 presents the final catalog of this work, including the stellar parameters and distance of solar-like stars of LAMOST and Gaia. The complete catalog is publicly available through the China-VO Paper Data Repository and can be accessed at https://nadc.china-vo.org/res/r101400/. In addition, both the dataset and the trained model are accessible via Zenodo at doi: [10.5281/zenodo.13748129]. The source code is hosted on GitHub at https://github.com/sarashenyy/SolarDis.

Table 6: The description of columns of the final catalog.

Column	Unit	Type	Description
obsid		integer	LAMOST observation identifier
ra	degree	double	right ascension of object
dec	degree	double	declination of object
snrg		float	S/N at g band
snrr		float	S/N at r band
flag		short	flag=1 means existing bad pixels
cnn_teff	K	double	effective temperature determined by this work
cnn_logg	dex	double	surface gravity determined by this work
cnn_feh	dex	double	Metallicity determined by this work
cnn_mg	mag	double	absolute magnitude of Gaia G band determined by this work
cnn_bprp	mag	double	BP - RP color of Gaia determined by this work
cnn_distance	pc	double	distance determined by this work
lasp_teff	K	float	effective temperature obtained by LASP
lasp_teff_err	K	float	effective temperature uncertainty obtained by LASP
lasp_logg	dex	float	surface gravity obtained by LASP
lasp_logg_err	dex	float	surface gravity uncertainty obtained by LASP
lasp_feh	dex	float	metallicity obtained by LASP
lasp_feh_err	dex	float	metallicity uncertainty obtained by LASP
lasp_rv	km/s	float	radial velocity obtained by LASP
lasp_rv_err	km/s	float	radial velocity uncertainty obtained by LASP
gaia_source_id		long	source identifier in Gaia DR3
parallax	mas	double	parallax provided by Gaia DR3
parallax_over_error		double	parallax divided by its standard error
parallax_zeropoint	mas	double	parallax zero-point according to Lindegren et al. (2021)
parallax_correction	mas	double	parallax after zero-point correction
gaia_g_mean_mag	mag	double	G mag provided by Gaia DR3
gaia_bp_rp	mag	double	BP - RP color provided by Gaia DR3
r_med_geo	pc	double	distance from Bailer-Jones et al. (2021)
APOGEE_ID		string	identifier in APOGEE