
Domain adaptation in application to gravitational lens finding

Hanna Parul, Department of Physics & Astronomy, University of Alabama, Tuscaloosa, AL 35401, USA
Sergei Gleyzer, Department of Physics & Astronomy, University of Alabama, Tuscaloosa, AL 35401, USA
Pranath Reddy, University of Florida, Gainesville, FL 32611, USA
Michael W. Toomey, Center for Theoretical Physics, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
Abstract

The next decade is expected to see a tenfold increase in the number of strong gravitational lenses, driven by new wide-field imaging surveys. To discover these rare objects, efficient automated detection methods need to be developed. In this work, we assess the performance of three domain adaptation techniques – Adversarial Discriminative Domain Adaptation (ADDA), Wasserstein Distance Guided Representation Learning (WDGRL), and Supervised Domain Adaptation (SDA) – in enhancing lens-finding algorithms trained on simulated data when applied to observations from the Hyper Suprime-Cam Subaru Strategic Program. We find that WDGRL combined with an ENN-based encoder provides the best performance in an unsupervised setting and that supervised domain adaptation is able to enhance the model’s ability to distinguish between lenses and common similar-looking false positives, such as spiral galaxies, which is crucial for future lens surveys.

Software: Astropy (Astropy Collaboration et al., 2013, 2018, 2022), SciPy (Virtanen et al., 2020), Matplotlib (Hunter, 2007), NumPy (Harris et al., 2020), Lenstronomy (Birrer & Amara, 2018; Birrer et al., 2021; https://github.com/lenstronomy/lenstronomy), pyHalo (Gilman et al., 2020; https://github.com/dangilman/pyHalo), PyTorch (Ansel et al., 2024).
Report number: MIT-CTP/5784

1 Introduction

Gravitational lensing, the distortion of light from a distant source due to the presence of a massive object between that source and the observer, is one of the most important phenomena predicted by the theory of general relativity and a powerful tool for studying the universe. For example, the sensitivity of lensing observables to the distribution of mass in the lens provides a way to study the dark matter profiles of individual galaxies, galaxy groups, and galaxy clusters (Koopmans et al., 2009; Barnabè et al., 2011; Sonnenfeld et al., 2013; Newman et al., 2015; Li et al., 2018; Etherington et al., 2023). It also makes it possible to probe dark matter substructures on subgalactic scales (Daylan et al., 2018; Brehmer et al., 2019; Diaz Rivero & Dvorkin, 2020; Bayer et al., 2023) and to test various dark matter theories (Li et al., 2016; Gilman et al., 2018; Alexander et al., 2020; Şengül & Dvorkin, 2022; Vegetti et al., 2023; Anau Montel et al., 2023; Gilman et al., 2023). Time-delay measurements of gravitationally lensed time-variable sources, such as supernovae and quasars, allow for measuring absolute distances and constraining the Hubble constant and other cosmological parameters (Refsdal, 1964; Shajib et al., 2020; Wong et al., 2020; Treu et al., 2022; Birrer et al., 2022; Suyu et al., 2024). Cluster-scale lenses provide magnification that enables spectroscopic studies of high-redshift galaxies, helping to uncover star formation processes in the early universe (Rigby et al., 2018; Sharon et al., 2022; Zhang et al., 2023; Mainali et al., 2023; Keerthi Vasan G. et al., 2024).

Strong gravitational lensing, characterized by the occurrence of multiple images, arcs, and rings, requires precise alignment between foreground and background sources. Therefore, these systems are very rare, and their discovery demands extensive search efforts. Upcoming wide-field surveys, such as the Euclid Wide Survey (Laureijs et al., 2011; Euclid Collaboration et al., 2022) and the Legacy Survey of Space and Time (LSST; LSST Science Collaboration et al., 2009), are expected to deliver on the order of $10^{5}$ strong lenses (Collett, 2015). However, finding them will require searching through hundreds of millions of sources, which is infeasible for human inspection. Over the last 20 years, various methods have been employed to discover gravitational lenses, such as citizen science (e.g. Marshall et al., 2016), spectroscopic selection of lens candidates (Bolton et al., 2004, 2006; Shu et al., 2017), and automated finders of arcs and ring-like features (More et al., 2012; Gavazzi et al., 2014; Sonnenfeld et al., 2018). Nevertheless, in recent years, the majority of lens discoveries have come from deep learning algorithms.

Typically, the lens finding problem is treated as an image classification problem, which is well-suited for Convolutional Neural Networks (CNNs; Lecun et al., 1998). Supervised CNN-based algorithms have demonstrated superior performance in the gravitational lens finding challenge (Metcalf et al., 2019) and have recently been applied to various wide-field surveys, yielding a few thousand lens candidates (e.g. Petrillo et al., 2017; Pourrahmani et al., 2018; Canameras et al., 2021; Cañameras et al., 2020; Li et al., 2021; Rojas et al., 2022; Shu et al., 2022a; Storfer et al., 2022; Huang et al., 2020, 2021; Jacobs et al., 2019). To ensure that a model trained in a supervised way is able to learn relevant features and generalize well to unseen data, the training dataset typically requires a large number of labeled samples (on the order of $10^{3}$–$10^{4}$), which exceeds the number of known lenses. For that reason, it is common to use simulated lenses at the training stage (however, see e.g. Huang et al., 2020, 2021 for supervised training only with observational data).

Despite efforts to increase the realism of simulations, differences between real observations and mock images are unavoidable and can significantly degrade the performance of the model (Ćiprijanović et al., 2020, 2022). Possible ways to tackle this issue include training a model purely on observational data with unsupervised methods, as in Cheng et al. (2020) and Stein et al. (2022), or applying domain adaptation techniques to reduce the gap between the simulated and observational domains.

Domain adaptation (DA) is a class of methods applied in situations when the training and test datasets come from different, but related domains, namely the source and the target domains. Based on the availability of labels in the target dataset, DA methods can be divided into three categories: unsupervised domain adaptation (UDA), when ground truth for the target data is unknown; semi-supervised domain adaptation (SSDA), where labels are available for a fraction of the target data; and supervised domain adaptation (SDA), where labels for the entire dataset are accessible. In astrophysics, domain adaptation has found various applications, including, but not limited to: improving the performance of galactic morphology classifiers across different surveys (Xu et al., 2023; Ćiprijanović et al., 2023), identifying merging galaxies in simulated and observational datasets (Ćiprijanović et al., 2021), and classifying strong gravitational lenses with different dark matter substructure simulated for surveys with varying observational characteristics (Alexander et al., 2023).

In this paper, we assess how much domain adaptation techniques can enhance the performance of lens-finding algorithms trained in a supervised way on simulated or real data. Building upon the work in Alexander et al. (2023), we examine three domain adaptation methods: Adversarial Discriminative Domain Adaptation (ADDA), Wasserstein Distance Guided Representation Learning (WDGRL), and Supervised Domain Adaptation (SDA). The paper is structured as follows: Section 2 describes the algorithms we employ; Section 3 presents the datasets used in the experiments; and Section 4 outlines the neural network architecture and training process. We discuss our results in Section 5 and conclude in Section 6.

2 Domain Adaptation methods

In the context of gravitational lens finding, our goal is to train a model on a source dataset containing simulated lenses and adapt it to a smaller unlabeled target dataset with real lenses, using either the Adversarial Discriminative Domain Adaptation (ADDA) or the Wasserstein Distance Guided Representation Learning (WDGRL) method. We will also explore the supervised setting for domain adaptation and compare the supervised classifier trained purely on real data with the Supervised Domain Adaptation (SDA) algorithm trained on a labeled target dataset. A brief description of each domain adaptation algorithm follows. For the remainder of the section, $(\mathbf{X}^{s}, Y^{s})$ denotes images and labels of the source dataset, while $(\mathbf{X}^{t}, Y^{t})$ represents images and labels of the target dataset.

2.1 Adversarial Discriminative Domain Adaptation

The ADDA framework (Tzeng et al., 2017) includes source and target encoders $M_{s}$ and $M_{t}$, respectively, which map the input images to a lower-dimensional latent space; a discriminator $D$, which learns to identify whether the input representation comes from the source or target domain; and classifiers $C_{s}$ and $C_{t}$, which act on the representations and output class labels for source and target data, respectively. The goal of the training process is to minimize the distance between the source and target representations, $M_{s}(\mathbf{X}^{s})$ and $M_{t}(\mathbf{X}^{t})$. In this case, the source classifier $C_{s}$, trained on source data in a supervised manner, can be directly applied to the target data, such that $C_{t}=C_{s}$.

The minimization of the distance between representations is achieved through the alternating optimization of the discriminator and the target encoder. The discriminator is updated to minimize the discriminator loss:

\mathcal{L}_{\mathrm{adv}_{D}}(\mathbf{X}^{s},\mathbf{X}^{t},M_{s},M_{t}) = -\mathbb{E}_{\mathbf{x}^{s}\sim\mathbf{X}^{s}}[\log D(M_{s}(\mathbf{x}^{s}))] - \mathbb{E}_{\mathbf{x}^{t}\sim\mathbf{X}^{t}}[\log(1-D(M_{t}(\mathbf{x}^{t})))] (1)

The target encoder learns to fool the discriminator by finding representations of the target images that are indistinguishable from the representations of the source images. It is optimized according to a standard loss function with inverted labels:

\mathcal{L}(\mathbf{X}^{t},D) = -\mathbb{E}_{\mathbf{x}^{t}\sim\mathbf{X}^{t}}[\log D(M_{t}(\mathbf{x}^{t}))] (2)
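As an illustration, one alternating ADDA update could be implemented along the following lines in PyTorch. This is a minimal sketch, not the authors' code: the encoders, discriminator, and optimizers are assumed to be defined elsewhere, and the discriminator is assumed to output two logits per sample.

```python
import torch
import torch.nn.functional as F

def adda_step(x_s, x_t, source_encoder, target_encoder, discriminator,
              opt_disc, opt_target):
    """One alternating ADDA update on a mini-batch (illustrative sketch)."""
    # --- discriminator update: source embeddings -> label 1, target -> label 0 ---
    with torch.no_grad():
        feat_s = source_encoder(x_s)                # source encoder stays frozen
        feat_t = target_encoder(x_t)
    logits = discriminator(torch.cat([feat_s, feat_t]))
    labels = torch.cat([torch.ones(len(x_s)), torch.zeros(len(x_t))]).long().to(x_s.device)
    loss_disc = F.cross_entropy(logits, labels)     # Eq. (1)
    opt_disc.zero_grad(); loss_disc.backward(); opt_disc.step()

    # --- target-encoder update with inverted labels, to fool the discriminator ---
    logits_t = discriminator(target_encoder(x_t))
    inverted = torch.ones(len(x_t), dtype=torch.long, device=x_t.device)
    loss_enc = F.cross_entropy(logits_t, inverted)  # Eq. (2)
    opt_target.zero_grad(); loss_enc.backward(); opt_target.step()
    return loss_disc.item(), loss_enc.item()
```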

2.2 Wasserstein Distance Guided Representation Learning

The WDGRL method (Shen et al., 2017) is inspired by the idea of Wasserstein Generative Adversarial Networks (WGANs). WGANs (Arjovsky et al., 2017) were introduced to address a failure mode of standard GANs in which the discriminator quickly learns to distinguish between fake and real samples and ceases to provide reliable gradient information. The use of the Wasserstein distance yields more stable gradients even when the two distributions are far apart.

The Wasserstein distance is a measure of similarity between two probability distributions. For the case of the first Wasserstein distance, it can be written in the following form:

W_{1}(\mathbb{P},\mathbb{Q}) = \sup_{\|f\|_{L}\leq 1} \mathbb{E}_{x\sim\mathbb{P}}[f(x)] - \mathbb{E}_{x\sim\mathbb{Q}}[f(x)] (3)

where $\mathbb{P}$ and $\mathbb{Q}$ are two probability measures and $f$ is a 1-Lipschitz function.

WDGRL applies this concept to learn domain-invariant features through the iterative training of the following components: a domain critic $D$, a feature extractor $M$, and a classifier $C$, with parameters $\theta_{D}$, $\theta_{M}$, and $\theta_{C}$, respectively.

The goal of the domain critic is to approximate the Wasserstein distance between the representations by maximizing the domain critic loss $\mathcal{L}_{wd}$:

\mathcal{L}_{wd}(\mathbf{x}^{s},\mathbf{x}^{t}) = \frac{1}{n^{s}}\sum_{\mathbf{x}^{s}\in\mathbf{X}^{s}} D(M(\mathbf{x}^{s})) - \frac{1}{n^{t}}\sum_{\mathbf{x}^{t}\in\mathbf{X}^{t}} D(M(\mathbf{x}^{t})) (4)

where $n^{s}$ and $n^{t}$ are the numbers of source and target samples, respectively.

To ensure that $D$ is a 1-Lipschitz continuous function, a gradient penalty is enforced on the domain critic parameters:

\mathcal{L}_{grad}(\hat{h}) = (\|\nabla_{\hat{h}} D(\hat{h})\|_{2} - 1)^{2}, (5)

where $\hat{h}$ are points sampled between the source and target representations.

During training, for each mini-batch, the domain critic is first trained to optimality by maximizing the domain critic loss $\mathcal{L}_{wd}-\gamma\mathcal{L}_{grad}$, where $\gamma$ is a balancing coefficient. In practice, the number of iterations used to train the domain critic is set by the parameter n_critic. Next, the optimal domain critic parameters are fixed, and the feature extractor network is trained to minimize the Wasserstein distance and learn invariant representations. Since the learned representations should be informative enough for the classifier $C$, the source labels are incorporated, and the classifier is optimized by minimizing the cross-entropy loss $\mathcal{L}_{c}(\mathbf{x}^{s},y^{s})$, where $y^{s}$ represents the source labels. Therefore, the combined objective function for this step is $\min_{\theta_{M},\theta_{C}}\{\mathcal{L}_{c}+\lambda\max_{\theta_{D}}\mathcal{L}_{wd}\}$, where $\lambda$ is a trade-off parameter.
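A minimal PyTorch sketch of one WDGRL training step is given below; it is illustrative only. The encoder, classifier, and critic networks and their optimizers are assumed to be defined elsewhere, and the source and target mini-batches are assumed to have equal size.

```python
import torch
import torch.nn.functional as F

def wdgrl_step(x_s, y_s, x_t, encoder, classifier, critic,
               opt_critic, opt_main, n_critic=5, gamma=1.0, lam=1.0):
    """One WDGRL update on a pair of source/target mini-batches (illustrative sketch)."""
    # --- 1) train the domain critic for n_critic iterations (encoder frozen) ---
    for _ in range(n_critic):
        h_s, h_t = encoder(x_s).detach(), encoder(x_t).detach()
        # interpolate between source and target embeddings for the gradient penalty
        eps = torch.rand(h_s.size(0), 1, device=h_s.device)
        h_hat = (eps * h_s + (1 - eps) * h_t).requires_grad_(True)
        grad = torch.autograd.grad(critic(h_hat).sum(), h_hat, create_graph=True)[0]
        l_grad = ((grad.norm(2, dim=1) - 1) ** 2).mean()      # Eq. (5)
        l_wd = critic(h_s).mean() - critic(h_t).mean()        # Eq. (4)
        loss_critic = -(l_wd - gamma * l_grad)   # maximize l_wd - gamma * l_grad
        opt_critic.zero_grad(); loss_critic.backward(); opt_critic.step()

    # --- 2) update encoder and classifier: classification loss + Wasserstein estimate ---
    h_s, h_t = encoder(x_s), encoder(x_t)
    loss = F.cross_entropy(classifier(h_s), y_s) + lam * (critic(h_s).mean() - critic(h_t).mean())
    opt_main.zero_grad(); loss.backward(); opt_main.step()
    return loss.item()
```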

2.3 Supervised Domain Adaptation

Typically, unsupervised DA methods require a large amount of target data to be effective. For a target dataset of limited size with available labels, supervised DA methods tend to be more powerful.

The SDA approach (Motiian et al., 2017) aims to align the source and target representations in the embedding space, similar to the unsupervised methods described above. However, access to target labels enables contrastive semantic alignment. This means that the encoder $M$ maps samples from different domains but with the same class label close to each other in the embedding space, while keeping samples with different class labels well separated.

The alignment of the same-class samples is achieved by minimizing the semantic alignment loss:

\mathcal{L}_{SA} = \sum_{a=1}^{N} d(M(\mathbf{X}_{a}^{s}), M(\mathbf{X}_{a}^{t})), (6)

where $N$ is the number of class labels, $\mathbf{X}_{a}^{s}=\mathbf{X}^{s}|_{Y^{s}=a}$ and $\mathbf{X}_{a}^{t}=\mathbf{X}^{t}|_{Y^{t}=a}$ are conditional random variables, and $d$ is the distance metric in the embedding space.

To maximize the distance between the representations of samples from different domains and different labels, the separation loss is included:

\mathcal{L}_{S} = \sum_{a,b\,|\,a\neq b} k(M(\mathbf{X}_{a}^{s}), M(\mathbf{X}_{b}^{t})), (7)

where $k$ is the similarity metric between the distributions of representations, which penalizes them if they come close. Following the implementation in Motiian et al. (2017), we compute the distance, $d$, and similarity, $k$, as average pairwise distances and similarities between points in the embedding space, where for each pair of points they are defined as:

d(M(x_{i}^{s}), M(x_{j}^{t})) = \frac{1}{2}\|M(x_{i}^{s}) - M(x_{j}^{t})\|^{2} (8)
k(M(x_{i}^{s}), M(x_{j}^{t})) = \frac{1}{2}\max(0,\, m - \|M(x_{i}^{s}) - M(x_{j}^{t})\|)^{2}, (9)

where the margin $m$ is a hyperparameter that defines the minimum desired separation between samples from different classes in the embedding space.

The model also includes a classifier $C$ trained purely on the source dataset in a supervised manner. Therefore, the combined classification and contrastive semantic alignment loss takes the following form:

\mathcal{L}_{CCSA} = (1-\alpha)\mathcal{L}_{C} + \alpha(\mathcal{L}_{SA} + \mathcal{L}_{S}), (10)

where $\mathcal{L}_{C}$ is the classifier loss and $\alpha$ is a balancing coefficient.
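A minimal PyTorch sketch of the combined CCSA loss of Eqs. (6)–(10), with the distance and similarity computed as average pairwise terms over a mini-batch, could look as follows. Tensor names and the default hyperparameter values are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def ccsa_loss(logits_s, feat_s, y_s, feat_t, y_t, margin=1.0, alpha=0.25):
    """Classification and contrastive semantic alignment loss (illustrative sketch)."""
    d = torch.cdist(feat_s, feat_t)                           # pairwise ||M(x_s) - M(x_t)||
    same = (y_s.unsqueeze(1) == y_t.unsqueeze(0)).float()     # 1 if labels match, else 0
    # semantic alignment (Eqs. 6, 8): pull same-class cross-domain pairs together
    l_sa = (0.5 * d.pow(2) * same).sum() / same.sum().clamp(min=1)
    # separation (Eqs. 7, 9): push different-class pairs at least `margin` apart
    l_sep = (0.5 * F.relu(margin - d).pow(2) * (1 - same)).sum() / (1 - same).sum().clamp(min=1)
    # combined objective (Eq. 10), with the classifier loss computed on source logits
    return (1 - alpha) * F.cross_entropy(logits_s, y_s) + alpha * (l_sa + l_sep)
```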

3 Data

As a source dataset, we use a mixture of simulated lenses and real non-lensed galaxies. For the target dataset, we use purely observational data, combining previously discovered lens candidates with non-lensed galaxies. We describe the construction of these datasets below.

We begin by selecting a parent dataset from the Hyper Suprime-Cam Subaru Strategic Program (HSC-SSP) PDR2 Wide field (Aihara et al., 2019), following the criteria in Shu et al. (2022b) and Cañameras et al. (2021). The selection targets extended sources brighter than 26 mag in the g, r, and i bands with the following color cuts: $0.6 < \texttt{g\_cmodel\_mag} - \texttt{r\_cmodel\_mag} < 3$ and $2 < \texttt{g\_cmodel\_mag} - \texttt{i\_cmodel\_mag} < 5$. The color cuts serve to narrow the sample to preferentially red galaxies, as massive red ellipticals are expected to dominate the lens population. We cross-matched the parent dataset with the list of known lens candidates and removed matches with separation less than 40 arcsec to reduce possible contamination among the non-lensed sources. We then used this cleaned parent dataset to draw negative examples (non-lenses) for both the source and target datasets. While there is a possibility of undiscovered lenses in the parent sample, we expect the fraction of such serendipitous lenses to be negligible.
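For illustration, the magnitude and color cuts above could be applied to a catalog table as follows. The file name and the pandas-based workflow are assumptions; in practice the selection is typically performed through the HSC database query interface.

```python
import pandas as pd

# Hypothetical catalog dump with HSC cmodel magnitudes; column names follow the cuts above.
cat = pd.read_csv("hsc_pdr2_wide_extended.csv")
g, r, i = cat["g_cmodel_mag"], cat["r_cmodel_mag"], cat["i_cmodel_mag"]

# Extended sources brighter than 26 mag in g, r, i with the red-galaxy color cuts
parent = cat[(g < 26) & (r < 26) & (i < 26)
             & (g - r).between(0.6, 3.0)
             & (g - i).between(2.0, 5.0)]
```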

3.1 Mock lenses

The essential components for creating a mock gravitational lens are the light profiles of the foreground and background galaxies and the gravitational potential of the lens. To enhance the realism of the mock systems, we employ images of real galaxies for both the foreground and background sources, following the approach in Cañameras et al. (2020).

For the deflectors (or foreground galaxies) we selected galaxies from the parent dataset that also have spectroscopic redshift and velocity dispersion measurements in the Sloan Digital Sky Survey (SDSS; Kollmeier et al., 2019). We used the HSC Cutout service (https://hsc-release.mtk.nao.ac.jp/das_cutout/pdr2/) to extract 72 x 72 pixel cutouts, which correspond to 11.5 x 11.5 arcseconds, in the g, r, and i filters.

For the background sources, we used a sample of galaxies observed in the Hubble eXtra Deep Field (Illingworth et al., 2013) in the F435W, F606W, and F775W bands with redshifts from Inami et al. (2017). All the HST data used in this paper can be found in MAST: http://dx.doi.org/10.17909/T9RG6J.

To simulate strong galaxy-galaxy lens systems, we used the package lenstronomy (https://github.com/lenstronomy/lenstronomy; Birrer & Amara, 2018; Birrer et al., 2021). The process followed several steps. We begin by drawing a random foreground galaxy with lens redshift $z_{lens}$ and velocity dispersion $\sigma_{V}$ from the deflector sample. From the set of background sources, we select a random galaxy with source redshift $z_{src} > z_{lens}$. We approximate the potential of the deflector with a singular isothermal sphere (SIS) profile, which is set by the velocity dispersion $\sigma_{V}$:

\rho(r) = \frac{\sigma_{V}^{2}}{2\pi G r^{2}} (11)

In addition to the deflector potential, we also add external shear due to the large-scale structure. We compute the Einstein radius, $\theta_{E}$, of the system and only keep lenses with $0.75" < \theta_{E} < 3.0"$. The lower limit is chosen to exceed the seeing in the HSC i band, and the upper limit is set based on the size of the cutout (64 pix = $10.2"$) to ensure that the resulting lens fits well within our 64$\times$64 image. We place the background source at a random location in the source plane with an offset ($\Delta X$, $\Delta Y$) drawn from the uniform distribution $[-0.3\theta_{E}, 0.3\theta_{E}]$. For the case of an SIS, it can be shown that multiple images are produced when the source offset from the lens center is smaller than the Einstein radius of the lens.
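For reference, the Einstein radius of an SIS deflector follows from its velocity dispersion and the lens and source redshifts as $\theta_{E} = 4\pi(\sigma_{V}/c)^{2}\,D_{ds}/D_{s}$. A small sketch of this calculation with astropy is shown below; the Planck 2018 cosmology is an assumed fiducial choice, not necessarily the one used for the simulations.

```python
import numpy as np
from astropy import units as u
from astropy.constants import c
from astropy.cosmology import Planck18  # assumed fiducial cosmology

def sis_einstein_radius(sigma_v, z_lens, z_src, cosmo=Planck18):
    """Einstein radius of a singular isothermal sphere:
    theta_E = 4 * pi * (sigma_v / c)**2 * D_ds / D_s."""
    d_s = cosmo.angular_diameter_distance(z_src)
    d_ds = cosmo.angular_diameter_distance_z1z2(z_lens, z_src)
    theta_e = 4 * np.pi * (sigma_v / c.to(u.km / u.s)) ** 2 * (d_ds / d_s)
    return (theta_e * u.rad).to(u.arcsec)

# e.g. a deflector with sigma_v = 250 km/s at z = 0.5 lensing a source at z = 2
print(sis_einstein_radius(250 * u.km / u.s, 0.5, 2.0))
```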

The image of the background source might contain multiple sources. To remove contaminants and compute the lensed signal only for the central source, we convolve the source image with a Gaussian filter fitted to the central source. We compute the distorted light profile of the background source in each filter with the lenstronomy module ImSim. We inspect the resulting image and compare the brightest pixel of the lensed light profile with the corresponding pixel in the deflector image to discard overly faint lenses from the dataset. If the lensed signal is not bright enough, we brighten the background source in each filter by 0.5 mag and simulate the lens again. If we cannot obtain a bright enough lensed image after a total brightening of 5 mag, we select a different source. We coadd the successful lensed image with the deflector image and apply the background noise simulated with lenstronomy for the PSF and observing conditions (e.g. seeing and sky brightness) typical for HSC. Fig. 1 shows randomly selected examples of mock lenses in the i band.

Panels in Fig. 2 compare the distributions of different properties of the simulated lens population, predictions for LSST from Collett (2015), and a subsample of real lens candidates for which these properties are available in the literature. The top two panels compare the distributions of redshifts for the deflectors and the background sources. While the distribution of lens redshifts is similar in shape among the three datasets and displays a peak at $z_{lens}\approx 0.5$–$0.75$, the distribution of redshifts for the background sources differs significantly for the mock lenses and displays a bimodal behavior, which results from the two peaks in the redshift distribution of the background sources from Inami et al. (2017). The lower left panel compares the distributions of velocity dispersion for the deflector galaxies, considered as a proxy for galactic mass. All three samples are dominated by lenses with $\sigma_{V}\approx 250$–$300~\mathrm{km/s}$. The lower right panel compares the Einstein radii of the systems and shows a peak in all three distributions at $\theta_{E}\approx 0.75"$–$1.25"$, with the LSST-predicted population displaying slightly larger radii. The sharp cutoff at $\theta_{E,\mathrm{mock}}=0.75"$ comes from the threshold imposed in the simulation.

Figure 1: Examples of mock lenses simulated with lenstronomy.
Figure 2: Distributions of lens redshift (top-left), source redshift (top-right), velocity dispersion (bottom-left), and Einstein radius (bottom-right) for the simulated lens systems, real lens systems, and LSST predictions.

3.2 Real lenses

To construct positive examples in the target dataset, we compiled a comprehensive list of known lens systems and lens candidates. These systems were discovered in previous campaigns through various methods, including both machine learning and traditional approaches. We added all sources from the Master Lens Database (https://test.masterlens.org/index.php, version July 2021), the SuGOHI Candidate List (https://www-utap.phys.s.u-tokyo.ac.jp/~oguri/sugohi/), and other published catalogs: Canameras et al. (2021); Cañameras et al. (2020); Diehl et al. (2017); Garvin et al. (2022); Huang et al. (2020, 2021); Jacobs et al. (2019); Petrillo et al. (2017); Rojas et al. (2022); Shu et al. (2022a); Stein et al. (2022); Li et al. (2021); Storfer et al. (2022). We performed an internal cross-match to remove duplicate sources and obtained a final list of around 12 000 galaxy-galaxy systems that were identified across multiple surveys. We cross-matched the full catalog with the HSC PDR2 Wide layer and extracted cutouts in the g, r, and i bands. We performed a visual inspection of the extracted images and excluded group-scale lenses and objects with barely visible or unclear lensed features, which is possible when the lens was discovered in a survey with higher resolution or depth than HSC. The final sample of real lenses included 1954 objects. We set aside 200 randomly selected lenses for the test set and used the rest in the target dataset.

In summary, we prepared the following datasets:

  1. A balanced source dataset consisting of 30 000 mock lenses and 30 000 non-lensed galaxies.

  2. A balanced target dataset comprising 1754 real lenses and 1754 non-lenses.

  3. An unbalanced real train dataset comprising 1754 real lenses (from the target) and 31 754 non-lenses (combining the source and target non-lensed sources).

  4. A test dataset containing 200 real lenses and 20 000 non-lenses to reflect the unbalanced nature of real-world datasets.

We explore domain adaptation methods in two settings. For unsupervised domain adaptation, we compare it against a model trained solely on the source dataset with simulated lenses and applied to the test dataset with real lenses in a naive way, i.e. without any domain adaptation. The supervised domain adaptation is compared against a model trained on the unbalanced real dataset. In both cases, we run inference on the test dataset containing only real data.

4 Network architecture and training

We compare the performance of three domain adaptation methods described in Sec. 2. While these methods differ in their training approaches, they share common components: an encoder that maps input images to lower-dimensional representations, and a classifier that outputs binary classes for these representations.

For the encoder, we explored two different architectures: the residual neural network (ResNet, He et al., 2016) and the equivariant neural network (ENN, Weiler & Cesa, 2019). ResNet is a type of convolutional neural network that was developed to overcome the problem of degrading accuracy with an increasing number of layers in earlier models such as AlexNet or VGG. Its key component is the residual block with skip connections: the input of a block is propagated through it unchanged and added to the block's output, so the block learns the difference between the target function and the identity transformation. Since their introduction, ResNets have gained popularity in image recognition tasks; in the field of gravitational lens detection specifically, a ResNet-based model won the gravitational lens finding challenge (Metcalf et al., 2019; Lanusse et al., 2018).

We based the encoder on the ResNet-18 architecture. The encoder’s core consists of four residual blocks, each containing two convolutional layers with batch normalization and ReLU activations. We modified the standard ResNet-18 architecture by replacing its final fully connected layer with a custom module: a dropout layer for regularization, followed by a linear layer that outputs a 256-dimensional embedding vector of the input image.
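A minimal sketch of this modification with torchvision is shown below; the dropout rate is an illustrative choice, not a value quoted in the paper.

```python
import torch.nn as nn
from torchvision.models import resnet18

# ResNet-18 backbone whose final fully connected layer is replaced by a
# dropout + linear head producing a 256-dimensional embedding.
encoder = resnet18(weights=None)          # torchvision >= 0.13 API
encoder.fc = nn.Sequential(
    nn.Dropout(p=0.5),                    # illustrative dropout rate
    nn.Linear(encoder.fc.in_features, 256),
)
```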

In addition to ResNet, we explored equivariant neural networks (ENN), which are designed to preserve inherent symmetries in the data through their architecture. For gravitational lenses, which often display rotational and reflectional symmetries, the use of ENNs can lead to more efficient learning, better generalization, and a reduced need for data augmentation. For example, in Alexander et al. (2023) domain adaptation with an ENN-based encoder showed superior performance in the task of classifying signatures of dark matter substructures in simulated strong gravitational lenses.

We used the e2cnn (Cesa et al., 2022) package and implemented an ENN with the dihedral group D4, which includes the identity transformation, rotations by $\pm\pi/2$ and $\pi$, and horizontal/vertical reflections. Our model comprises six equivariant convolutional blocks, each composed of a convolutional layer, a batch normalization layer, and a ReLU activation function. The final layer also outputs a 256-dimensional representation.
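For illustration, one equivariant convolutional block of this kind could be assembled with e2cnn roughly as follows; the layer width and kernel size are illustrative choices.

```python
import torch
from e2cnn import gspaces
from e2cnn import nn as enn

# D4 group: rotations by multiples of pi/2 combined with reflections
r2_act = gspaces.FlipRot2dOnR2(N=4)
in_type = enn.FieldType(r2_act, 3 * [r2_act.trivial_repr])      # g, r, i input channels
out_type = enn.FieldType(r2_act, 16 * [r2_act.regular_repr])    # illustrative width

block = enn.SequentialModule(
    enn.R2Conv(in_type, out_type, kernel_size=5, padding=2),
    enn.InnerBatchNorm(out_type),
    enn.ReLU(out_type),
)

x = enn.GeometricTensor(torch.randn(1, 3, 64, 64), in_type)
y = block(x)   # equivariant feature maps; y.tensor is a plain torch tensor
```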

For the classification task, both encoders are followed by a simple classifier network, consisting of two fully connected layers for the binary classification.

We used data augmentation with rotations and horizontal/vertical flips. While augmentation is not required for ENNs due to symmetries being incorporated in the structure of the network, it turned out to be crucial for the successful training of the ResNet-based model.

We used the Adam optimizer (Kingma & Ba, 2014) to minimize the losses and a 1-cycle scheduler to adjust the learning rate. We trained both the ENN- and ResNet-based networks for up to 100 epochs with a patience of 5 epochs, so the training stops if the validation loss of the model does not improve for 5 consecutive epochs.
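A schematic sketch of this training loop in PyTorch is shown below; the model, data loaders, loss function, and the evaluate helper are assumed to exist, and the learning rate and weight decay correspond to the supervised ResNet setting in Table 1.

```python
import torch

# model, criterion, train_loader, val_loader, and evaluate() are assumed to exist.
optimizer = torch.optim.Adam(model.parameters(), lr=6e-4, weight_decay=1e-3)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=6e-4, epochs=100, steps_per_epoch=len(train_loader))

best_val, patience, wait = float("inf"), 5, 0
for epoch in range(100):
    model.train()
    for x, y in train_loader:
        loss = criterion(model(x), y)
        optimizer.zero_grad(); loss.backward(); optimizer.step(); scheduler.step()
    val_loss = evaluate(model, val_loader)       # assumed validation helper
    if val_loss < best_val:
        best_val, wait = val_loss, 0
    else:
        wait += 1
        if wait >= patience:                     # early stopping after 5 stagnant epochs
            break
```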

After the source encoder and the classifier are trained on the source dataset in a supervised way, we continue with the domain adaptation training.

In the ADDA method, the training step is split into two parts. First, we update a discriminator, which operates on the embeddings and determines whether they come from the source or the target dataset. In our case, the discriminator is implemented as a simple three-layer network. Next, we update the target encoder, initialized with the same weights as the source encoder, using inverted labels. For the ResNet-based model, we update the encoder 5 times for each discriminator step for more stable training.

For the WDGRL approach, we train a critic network and update the pre-trained encoder and classifier. The goal of the critic network is to estimate the Wasserstein distance between the representations of source and target images. For each mini-batch, we first train the critic for n_critic steps and then train the encoder and classifier with cross-entropy loss for the classifier and Wasserstein distance as the loss for the encoder.

For the supervised domain adaptation, we also update the pre-trained classifier and encoder; however we use a contrastive semantic alignment (CSA) loss. In this case, the most important hyperparameters are the balancing coefficient α\alpha, which determines the relative contribution of classifier loss and CSA loss, and the margin mm, which sets the minimal separation distance for the examples from different domains and different class labels.

In all cases we use the Adam optimizer to optimize the losses. All hyperparameters were fine-tuned via grid search and are listed in Table 1. The models were trained on the high-performance computing cluster at The University of Alabama.

Table 1: Hyperparameters used for training DA algorithms
Method Parameters ResNet-based ENN-based
Supervised learning rate (lr) 6e-4 1e-5
scheduler lr 1e-4 0.002
weight decay 0.001 1e-6
batch size 128 128
ADDA target encoder lr 1e-6 1e-6
discriminator lr 5e-5 1e-5
epochs 20 20
WDGRL domain critic lr 5e-5 1e-4
classifier lr 1e-5 1e-4
encoder lr 1e-3 1e-4
n_critic 5 5
γ\gamma 0.3 1
epochs 20 20
SDA encoder lr 1e-4 1e-3
classifier lr 1e-4 1e-3
α\alpha 0.3 0.75
mm 8 15
epochs 40 20
Table 2: Results for unsupervised domain adaptation algorithms
Method AUROC $\mathrm{TPR}_{1\%}$ $\mathrm{F1}_{1\%}$ $\mathrm{TPR}_{0.1\%}$ $\mathrm{F1}_{0.1\%}$
ENN (naive) 0.921 0.354 0.303 0. 0.
ADDA+ENN 0.942 0.462 0.377 0. 0.
WDGRL+ENN 0.940 0.528 0.416 0.241 0.360
ResNet (naive) 0.858 0.262 0.231 0.046 0.080
ADDA+ResNet 0.886 0.236 0.210 0.036 0.061
WDGRL+ResNet 0.903 0.277 0.242 0.072 0.122

5 Results

To compare the efficiency of domain adaptation techniques for lens finding, we use the receiver operating characteristic (ROC) curve as our main metric. The ROC curve illustrates how the false positive rate (FPR) and true positive rate (TPR) change with the variation of the classification threshold. TPR, also known as sensitivity or recall, is defined as the number of correctly classified lenses relative to the total number of lenses in the test dataset:

TPR=TPTP+FN\mathrm{TPR=\frac{TP}{TP+FN}} (12)

FPR is the ratio of incorrectly classified non-lensed objects to the total number of non-lenses:

FPR=FPTN+FP\mathrm{FPR=\frac{FP}{TN+FP}} (13)

For an ideal classifier, FPR is close to 0 when the TPR is close to 1. For a random classifier, FPR equals TPR for all thresholds. The model that performs better has higher area under the ROC curve (AUROC).

Besides the AUROC, it is also particularly interesting to compare the behaviour of the ROC curve at low FPRs; we quantify it by measuring $\mathrm{TPR}_{1\%}$ and $\mathrm{TPR}_{0.1\%}$ – the true positive rates at the thresholds that yield false positive rates of 1% and 0.1%, respectively. In practice, all objects classified as positive examples by automated methods are visually validated. To save time in human inspection, the ideal algorithm should have a very low contamination fraction while maintaining a high number of identified lenses.
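These quantities can be read directly off the ROC curve; a small sketch using scikit-learn (array names are illustrative):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

def tpr_at_fpr(y_true, y_score, target_fpr=0.01):
    """Recall at a fixed false positive rate, read off the ROC curve."""
    fpr, tpr, _ = roc_curve(y_true, y_score)
    return np.interp(target_fpr, fpr, tpr)

# auroc  = roc_auc_score(y_true, y_score)
# tpr_1  = tpr_at_fpr(y_true, y_score, 0.01)    # TPR_1%
# tpr_01 = tpr_at_fpr(y_true, y_score, 0.001)   # TPR_0.1%
```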

Fig. 3 displays ROC curves for unsupervised domain adaptation methods with a ResNet-based encoder on the left panel and an ENN-based encoder on the right, evaluated on the imbalanced test dataset with 200 lenses and 20 000 non-lenses. For the ResNet, both ADDA and WDGRL result in an overall higher AUROC; however, the increase in performance is rather small, and at low false positive rates the recall of all three algorithms is similar (see Table 2). Compared to the ResNet, the ENN-based model works much better even without domain adaptation (AUROC of 0.921 vs 0.858), and the improvement from applying domain adaptation is more prominent, especially at low false positive rates: $\mathrm{TPR}_{1\%}=0.354$ for the naive approach, 0.462 for ADDA, and 0.528 for WDGRL. In both cases, the WDGRL method outperforms ADDA.

Figure 3: ROC curves for ResNet-18-based (left) and ENN-based (right) algorithms with unsupervised domain adaptation (blue and orange curves) compared to the model trained on a source dataset and applied to the test dataset without domain adaptation (red curve). Dashed line represents ROC curve of a random classifier.

Fig. 4 demonstrates the results for the supervised domain adaptation. In this setting, we compared the adapted model (blue curve) with a model trained on an unbalanced dataset containing only observational images of lensed and non-lensed sources (red curve). The supervised domain adaptation with the ENN-based encoder gives some improvement; however, it is rather small (AUROC of 0.987 vs 0.991). For the ResNet-based classifier, SDA performed even worse than the naive approach. One interesting result is that the model trained on real sources significantly outperforms the classifiers trained on simulations (compare the red curves in Figs. 3 and 4), with the AUROC for the ENN improving from 0.921 to 0.987 and for the ResNet from 0.858 to 0.978, even though the number of real lenses is $\sim 15$ times smaller than the number of mock lenses. While the unbalanced proportions of the training dataset are in better agreement with the unbalanced nature of the test dataset and of real-world data, it is usually considered good practice to balance the training dataset to make sure that the model sees a sufficient number of examples from all classes and learns relevant features. The fact that the model trained on a smaller set of real data performs better than the model trained on a larger dataset with simulations highlights that simulations are not a perfect representation of real data. Another explanation might come from the fact that the majority of real lens candidates that constitute the training and test sets were found with machine learning algorithms and might represent a subsample of objects with similar properties, for example, in terms of the relative brightness and size of the lens features. Therefore, in future work it would be interesting to apply both models to a larger sample of unseen data and compare the properties of the identified lens candidates.

Figure 4: ROC curves for ResNet-18-based (left) and ENN-based (right) algorithms with supervised domain adaptation (blue curve) compared to the model trained on a source dataset and applied to the test dataset without domain adaptation (red curve). Dashed line represents ROC curve of a random classifier.
Table 3: Results for supervised domain adaptation algorithm
Method AUROC $\mathrm{TPR}_{1\%}$ $\mathrm{F1}_{1\%}$ $\mathrm{TPR}_{0.1\%}$ $\mathrm{F1}_{0.1\%}$
ENN (naive) 0.987 0.861 0.597 0.624 0.725
SDA+ENN 0.991 0.887 0.636 0.734 0.537
ResNet (naive) 0.978 0.759 0.542 0.451 0.583
SDA+ResNet 0.973 0.687 0.506 0.154 0.245

5.1 Spiral contaminants

One of the major challenges in reducing the number of false positives is the correct classification between strong lenses and objects with similar arc-like features, such as spiral galaxies, ring galaxies, or galaxy mergers. While these objects are not as rare as gravitational lenses, a randomly sampled subset of non-lenses typically includes only a small fraction of such objects. Consequently, a trained classifier can easily become confused by these similar-looking patterns.

To explore the potential of domain adaptation in mitigating this problem, we conducted the following experiment. Using the catalog of spiral galaxies from Tadaki et al. (2020), we randomly drew 1754 spiral galaxies (to match the number of real lenses) from a subsample of spirals with redshift $z>0.4$ to align with the distribution of redshifts of non-lenses and deflectors. We used our best performing model, which uses an ENN-based encoder combined with supervised domain adaptation, and retrained it while including the spiral subsample at different stages of the training. We obtained three models: 1) with spiral galaxies added to the source dataset; 2) with spiral galaxies added to the target dataset; and 3) with spiral galaxies split equally between the source and the target dataset. We compared the SDA model with the model trained solely on the unbalanced real dataset, to which we also added the same spiral galaxies. We also constructed a modified test dataset: to the set of 200 real lenses and 20 000 randomly sampled non-lenses we added 2000 spirals from the same catalog.

The results of the evaluation on the modified test dataset are shown in Fig. 5. The naive supervised classifier has the lowest AUROC; however, it shows much higher sensitivity at low FPR than the SDA model that has seen spiral galaxies only in the source dataset. The SDA model trained on a target dataset containing all spiral galaxies shows the best result, both in terms of AUROC and of $\mathrm{TPR}_{1\%}$ and $\mathrm{TPR}_{0.1\%}$, while the SDA model trained in the setting where the sample of spirals was split between the source and target datasets shows the second-best result. The summary of results is listed in Table 4.

Figure 5: ROC curves for the ENN-based algorithm tested on a dataset with spiral contaminants. The red curve represents the supervised classifier without SDA trained on the dataset containing spiral galaxies among the non-lenses. The remaining curves represent the results for the SDA model with spiral galaxies added only to the source dataset (blue), only to the target dataset (green), or split between the source and target (orange). Dashed line represents the ROC curve for a random classifier.
Table 4: Results for supervised domain adaptation algorithm trained on a dataset with spiral galaxies.
Method AUROC $\mathrm{TPR}_{1\%}$ $\mathrm{F1}_{1\%}$ $\mathrm{TPR}_{0.1\%}$ $\mathrm{F1}_{0.1\%}$
ENN (naive) 0.975 0.774 0.539 0.492 0.613
spirals in src 0.983 0.723 0.511 0.169 0.265
spirals in tgt 0.993 0.897 0.586 0.626 0.722
spirals in src & tgt 0.987 0.846 0.574 0.605 0.707

This experiment demonstrates that supervised domain adaptation with a carefully constructed target dataset is able to improve the classifier’s ability to differentiate between lenses and similar-looking objects, such as spiral galaxies. This approach could be extended to other common contaminants such as ring galaxies or mergers, potentially leading to more robust and reliable automated lens detection. Future work could also explore the integration of superresolution techniques (e.g. Reddy et al., 2024) to enhance the quality of input images, potentially improving the performance of both domain adaptation methods and traditional classifiers.

6 Discussion & Conclusion

In this work, we investigate whether domain adaptation is able to enhance the performance and reliability of CNN-based gravitational lens detection algorithms. Due to the limited sample of known lenses, deep learning models are typically trained on simulated strong lenses, which can lead to degraded performance on real data. The shift between training and test domains can be alleviated with domain adaptation techniques.

We used simulated mock lenses as positive examples in the source dataset and observations of lens candidates from HSC SSP PDR2 Wide layer as constituents in the target dataset. We explored two neural network architectures, ResNet-18 and Equivariant Neural Network, for the encoder and found that ENNs are more successful in identifying lenses.

For unsupervised domain adaptation, we implemented two methods, ADDA and WDGRL. The WDGRL approach resulted in the largest increase in AUROC compared to the naive inference of a classifier trained with mock lenses on real data. For the supervised domain adaptation, we compared the performance of the adapted algorithm with a model trained on an unbalanced dataset containing purely observational data. Somewhat surprisingly, despite the relatively small number of lenses in the training sample (1754 real lenses vs 30 000 mock lenses in the source dataset), the model showed superior performance. Supervised domain adaptation provided only marginal improvement in the case of the ENN-based algorithm, and did not yield any improvement for the ResNet-based model.

In the common setting for lens searches, algorithms are applied to large datasets ($10^{6}$–$10^{7}$ objects), and objects with a classification score above a predefined threshold are visually inspected by a team of experts. A higher threshold leads to increased sample purity at the cost of reduced completeness, and vice versa. This trade-off means that human experts must examine a large number of lens candidates. For example, in Jaelani et al. (2023), 20 241 sources with a score higher than 0.9 were visually inspected, first reduced to 1522 cutouts, and then to 43 definite and 269 probable lenses. Upcoming surveys are expected to contain $\mathcal{O}(10^{5})$ lenses, which makes the task of reducing false positives particularly important. In our experiments, the improvement in recall from the combination of the ENN-based algorithm and the WDGRL domain adaptation method is particularly noticeable at low false positive rates.

One of the challenges of the lens finding task is the correct classification between lenses and similar-looking contaminants, such as ring galaxies or spiral galaxies. In the supervised domain adaptation setting, we show that adding a relatively small number of spirals to the target dataset enhances the model's ability to distinguish between lenses and spiral galaxies. The model's performance increases with the number of spiral galaxies in the target dataset. However, adding spirals only to the source dataset leads to performance degradation.

Domain adaptation techniques can be seamlessly integrated into lens finding pipelines and easily combined with other methods. Our results emphasize that DA methods should be considered an essential component in future lens finding campaigns.

7 Acknowledgments

We acknowledge useful conversations with Stephon Alexander. H. P. and S. G. were supported in part by U.S. National Science Foundation award No. 2108645. P. R. was a participant in the Google Summer of Code 2023 program. This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of High Energy Physics of U.S. Department of Energy under grant Contract Number DE-SC0012567. M. W. T. acknowledges financial support from the Simons Foundation (Grant Number 929255).

References

  • Aihara et al. (2019) Aihara, H., AlSayyad, Y., Ando, M., et al. 2019, PASJ, 71, 114, doi: 10.1093/pasj/psz103
  • Alexander et al. (2020) Alexander, S., Gleyzer, S., McDonough, E., Toomey, M. W., & Usai, E. 2020, ApJ, 893, 15, doi: 10.3847/1538-4357/ab7925
  • Alexander et al. (2023) Alexander, S., Gleyzer, S., Parul, H., et al. 2023, ApJ, 954, 28, doi: 10.3847/1538-4357/acdfc7
  • Anau Montel et al. (2023) Anau Montel, N., Coogan, A., Correa, C., Karchev, K., & Weniger, C. 2023, MNRAS, 518, 2746, doi: 10.1093/mnras/stac3215
  • Ansel et al. (2024) Ansel, J., Yang, E., He, H., et al. 2024, in 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (ASPLOS ’24) (ACM), doi: 10.1145/3620665.3640366
  • Arjovsky et al. (2017) Arjovsky, M., Chintala, S., & Bottou, L. 2017, arXiv e-prints, arXiv:1701.07875, doi: 10.48550/arXiv.1701.07875
  • Astropy Collaboration et al. (2013) Astropy Collaboration, Robitaille, T. P., Tollerud, E. J., et al. 2013, A&A, 558, A33, doi: 10.1051/0004-6361/201322068
  • Astropy Collaboration et al. (2018) Astropy Collaboration, Price-Whelan, A. M., Sipőcz, B. M., et al. 2018, AJ, 156, 123, doi: 10.3847/1538-3881/aabc4f
  • Astropy Collaboration et al. (2022) Astropy Collaboration, Price-Whelan, A. M., Lim, P. L., et al. 2022, ApJ, 935, 167, doi: 10.3847/1538-4357/ac7c74
  • Barnabè et al. (2011) Barnabè, M., Czoske, O., Koopmans, L. V. E., Treu, T., & Bolton, A. S. 2011, MNRAS, 415, 2215, doi: 10.1111/j.1365-2966.2011.18842.x
  • Bayer et al. (2023) Bayer, D., Koopmans, L. V. E., McKean, J. P., et al. 2023, MNRAS, 523, 1326, doi: 10.1093/mnras/stad1403
  • Birrer & Amara (2018) Birrer, S., & Amara, A. 2018, Physics of the Dark Universe, 22, 189, doi: 10.1016/j.dark.2018.11.002
  • Birrer et al. (2022) Birrer, S., Millon, M., Sluse, D., et al. 2022, arXiv e-prints, arXiv:2210.10833, doi: 10.48550/arXiv.2210.10833
  • Birrer et al. (2021) Birrer, S., Shajib, A. J., Gilman, D., et al. 2021, Journal of Open Source Software, 6, 3283, doi: 10.21105/joss.03283
  • Bolton et al. (2006) Bolton, A. S., Burles, S., Koopmans, L. V. E., Treu, T., & Moustakas, L. A. 2006, ApJ, 638, 703, doi: 10.1086/498884
  • Bolton et al. (2004) Bolton, A. S., Burles, S., Schlegel, D. J., Eisenstein, D. J., & Brinkmann, J. 2004, AJ, 127, 1860, doi: 10.1086/382714
  • Brehmer et al. (2019) Brehmer, J., Mishra-Sharma, S., Hermans, J., Louppe, G., & Cranmer, K. 2019, ApJ, 886, 49, doi: 10.3847/1538-4357/ab4c41
  • Cañameras et al. (2020) Cañameras, R., Schuldt, S., Suyu, S. H., et al. 2020, A&A, 644, A163, doi: 10.1051/0004-6361/202038219
  • Cañameras et al. (2021) Cañameras, R., Schuldt, S., Shu, Y., et al. 2021, A&A, 653, L6, doi: 10.1051/0004-6361/202141758
  • Canameras et al. (2021) Canameras, R., Schuldt, S., Shu, Y., et al. 2021, VizieR Online Data Catalog, J/A+A/653/L6
  • Cesa et al. (2022) Cesa, G., Lang, L., & Weiler, M. 2022, in International Conference on Learning Representations. https://openreview.net/forum?id=WE4qe9xlnQw
  • Cheng et al. (2020) Cheng, T.-Y., Li, N., Conselice, C. J., et al. 2020, MNRAS, 494, 3750, doi: 10.1093/mnras/staa1015
  • Ćiprijanović et al. (2020) Ćiprijanović, A., Kafkes, D., Jenkins, S., et al. 2020, arXiv e-prints, arXiv:2011.03591, doi: 10.48550/arXiv.2011.03591
  • Ćiprijanović et al. (2023) Ćiprijanović, A., Lewis, A., Pedro, K., et al. 2023, Machine Learning: Science and Technology, 4, 025013, doi: 10.1088/2632-2153/acca5f
  • Ćiprijanović et al. (2021) Ćiprijanović, A., Kafkes, D., Downey, K., et al. 2021, MNRAS, 506, 677, doi: 10.1093/mnras/stab1677
  • Ćiprijanović et al. (2022) Ćiprijanović, A., Kafkes, D., Snyder, G., et al. 2022, Machine Learning: Science and Technology, 3, 035007, doi: 10.1088/2632-2153/ac7f1a
  • Collett (2015) Collett, T. E. 2015, ApJ, 811, 20, doi: 10.1088/0004-637X/811/1/20
  • Şengül & Dvorkin (2022) Şengül, A. Ç., & Dvorkin, C. 2022, MNRAS, 516, 336, doi: 10.1093/mnras/stac2256
  • Daylan et al. (2018) Daylan, T., Cyr-Racine, F.-Y., Diaz Rivero, A., Dvorkin, C., & Finkbeiner, D. P. 2018, ApJ, 854, 141, doi: 10.3847/1538-4357/aaaa1e
  • Diaz Rivero & Dvorkin (2020) Diaz Rivero, A., & Dvorkin, C. 2020, Phys. Rev. D, 101, 023515, doi: 10.1103/PhysRevD.101.023515
  • Diehl et al. (2017) Diehl, H. T., Buckley-Geer, E. J., Lindgren, K. A., et al. 2017, ApJS, 232, 15, doi: 10.3847/1538-4365/aa8667
  • Etherington et al. (2023) Etherington, A., Nightingale, J. W., Massey, R., et al. 2023, MNRAS, 521, 6005, doi: 10.1093/mnras/stad582
  • Euclid Collaboration et al. (2022) Euclid Collaboration, Scaramella, R., Amiaux, J., et al. 2022, A&A, 662, A112, doi: 10.1051/0004-6361/202141938
  • Garvin et al. (2022) Garvin, E. O., Kruk, S., Cornen, C., et al. 2022, A&A, 667, A141, doi: 10.1051/0004-6361/202243745
  • Gavazzi et al. (2014) Gavazzi, R., Marshall, P. J., Treu, T., & Sonnenfeld, A. 2014, ApJ, 785, 144, doi: 10.1088/0004-637X/785/2/144
  • Gilman et al. (2020) Gilman, D., Birrer, S., Nierenberg, A., et al. 2020, MNRAS, 491, 6077, doi: 10.1093/mnras/stz3480
  • Gilman et al. (2018) Gilman, D., Birrer, S., Treu, T., Keeton, C. R., & Nierenberg, A. 2018, MNRAS, 481, 819, doi: 10.1093/mnras/sty2261
  • Gilman et al. (2023) Gilman, D., Zhong, Y.-M., & Bovy, J. 2023, Phys. Rev. D, 107, 103008, doi: 10.1103/PhysRevD.107.103008
  • Harris et al. (2020) Harris, C. R., Millman, K. J., van der Walt, S. J., et al. 2020, Nature, 585, 357, doi: 10.1038/s41586-020-2649-2
  • He et al. (2016) He, K., Zhang, X., Ren, S., & Sun, J. 2016, in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 770, doi: 10.1109/CVPR.2016.90
  • Huang et al. (2020) Huang, X., Storfer, C., Ravi, V., et al. 2020, ApJ, 894, 78, doi: 10.3847/1538-4357/ab7ffb
  • Huang et al. (2021) Huang, X., Storfer, C., Gu, A., et al. 2021, ApJ, 909, 27, doi: 10.3847/1538-4357/abd62b
  • Hunter (2007) Hunter, J. D. 2007, Computing in Science & Engineering, 9, 90, doi: 10.1109/MCSE.2007.55
  • Illingworth et al. (2013) Illingworth, G. D., Magee, D., Oesch, P. A., et al. 2013, ApJS, 209, 6, doi: 10.1088/0067-0049/209/1/6
  • Inami et al. (2017) Inami, H., Bacon, R., Brinchmann, J., et al. 2017, A&A, 608, A2, doi: 10.1051/0004-6361/201731195
  • Jacobs et al. (2019) Jacobs, C., Collett, T., Glazebrook, K., et al. 2019, ApJS, 243, 17, doi: 10.3847/1538-4365/ab26b6
  • Jaelani et al. (2023) Jaelani, A. T., More, A., Wong, K. C., et al. 2023, arXiv e-prints, arXiv:2312.07333, doi: 10.48550/arXiv.2312.07333
  • Keerthi Vasan G. et al. (2024) Keerthi Vasan G., C., Jones, T., Shajib, A. J., et al. 2024, arXiv e-prints, arXiv:2402.00942, doi: 10.48550/arXiv.2402.00942
  • Kingma & Ba (2014) Kingma, D. P., & Ba, J. 2014, arXiv e-prints, arXiv:1412.6980, doi: 10.48550/arXiv.1412.6980
  • Kollmeier et al. (2019) Kollmeier, J., Anderson, S. F., Blanc, G. A., et al. 2019, in Bulletin of the American Astronomical Society, Vol. 51, 274
  • Koopmans et al. (2009) Koopmans, L. V. E., Bolton, A., Treu, T., et al. 2009, ApJ, 703, L51, doi: 10.1088/0004-637X/703/1/L51
  • Lanusse et al. (2018) Lanusse, F., Ma, Q., Li, N., et al. 2018, MNRAS, 473, 3895, doi: 10.1093/mnras/stx1665
  • Laureijs et al. (2011) Laureijs, R., Amiaux, J., Arduini, S., et al. 2011, arXiv e-prints, arXiv:1110.3193, doi: 10.48550/arXiv.1110.3193
  • Lecun et al. (1998) Lecun, Y., Bottou, L., Bengio, Y., & Haffner, P. 1998, Proceedings of the IEEE, 86, 2278, doi: 10.1109/5.726791
  • Li et al. (2016) Li, R., Frenk, C. S., Cole, S., et al. 2016, MNRAS, 460, 363, doi: 10.1093/mnras/stw939
  • Li et al. (2018) Li, R., Shu, Y., & Wang, J. 2018, MNRAS, 480, 431, doi: 10.1093/mnras/sty1813
  • Li et al. (2021) Li, R., Napolitano, N. R., Spiniello, C., et al. 2021, ApJ, 923, 16, doi: 10.3847/1538-4357/ac2df0
  • LSST Science Collaboration et al. (2009) LSST Science Collaboration, Abell, P. A., Allison, J., et al. 2009, arXiv e-prints, arXiv:0912.0201, doi: 10.48550/arXiv.0912.0201
  • Mainali et al. (2023) Mainali, R., Stark, D. P., Jones, T., et al. 2023, MNRAS, 520, 4037, doi: 10.1093/mnras/stad387
  • Marshall et al. (2016) Marshall, P. J., Verma, A., More, A., et al. 2016, MNRAS, 455, 1171, doi: 10.1093/mnras/stv2009
  • Metcalf et al. (2019) Metcalf, R. B., Meneghetti, M., Avestruz, C., et al. 2019, A&A, 625, A119, doi: 10.1051/0004-6361/201832797
  • More et al. (2012) More, A., Cabanac, R., More, S., et al. 2012, ApJ, 749, 38, doi: 10.1088/0004-637X/749/1/38
  • Motiian et al. (2017) Motiian, S., Piccirilli, M., Adjeroh, D. A., & Doretto, G. 2017, arXiv e-prints, arXiv:1709.10190, doi: 10.48550/arXiv.1709.10190
  • Newman et al. (2015) Newman, A. B., Ellis, R. S., & Treu, T. 2015, ApJ, 814, 26, doi: 10.1088/0004-637X/814/1/26
  • Petrillo et al. (2017) Petrillo, C. E., Tortora, C., Chatterjee, S., et al. 2017, MNRAS, 472, 1129, doi: 10.1093/mnras/stx2052
  • Pourrahmani et al. (2018) Pourrahmani, M., Nayyeri, H., & Cooray, A. 2018, ApJ, 856, 68, doi: 10.3847/1538-4357/aaae6a
  • Reddy et al. (2024) Reddy, P., Toomey, M. W., Parul, H., & Gleyzer, S. 2024, arXiv e-prints, arXiv:2406.08442, doi: 10.48550/arXiv.2406.08442
  • Refsdal (1964) Refsdal, S. 1964, MNRAS, 128, 307, doi: 10.1093/mnras/128.4.307
  • Rigby et al. (2018) Rigby, J. R., Bayliss, M. B., Sharon, K., et al. 2018, AJ, 155, 104, doi: 10.3847/1538-3881/aaa2ff
  • Rojas et al. (2022) Rojas, K., Savary, E., Clément, B., et al. 2022, A&A, 668, A73, doi: 10.1051/0004-6361/202142119
  • Shajib et al. (2020) Shajib, A. J., Birrer, S., Treu, T., et al. 2020, MNRAS, 494, 6072, doi: 10.1093/mnras/staa828
  • Sharon et al. (2022) Sharon, K., Mahler, G., Rivera-Thorsen, T. E., et al. 2022, ApJ, 941, 203, doi: 10.3847/1538-4357/ac927a
  • Shen et al. (2017) Shen, J., Qu, Y., Zhang, W., & Yu, Y. 2017, arXiv e-prints, arXiv:1707.01217, doi: 10.48550/arXiv.1707.01217
  • Shu et al. (2022a) Shu, Y., Cañameras, R., Schuldt, S., et al. 2022a, A&A, 662, A4, doi: 10.1051/0004-6361/202243203
  • Shu et al. (2022b) —. 2022b, A&A, 662, A4, doi: 10.1051/0004-6361/202243203
  • Shu et al. (2017) Shu, Y., Brownstein, J. R., Bolton, A. S., et al. 2017, ApJ, 851, 48, doi: 10.3847/1538-4357/aa9794
  • Sonnenfeld et al. (2013) Sonnenfeld, A., Treu, T., Gavazzi, R., et al. 2013, ApJ, 777, 98, doi: 10.1088/0004-637X/777/2/98
  • Sonnenfeld et al. (2018) Sonnenfeld, A., Chan, J. H. H., Shu, Y., et al. 2018, PASJ, 70, S29, doi: 10.1093/pasj/psx062
  • Stein et al. (2022) Stein, G., Blaum, J., Harrington, P., Medan, T., & Lukić, Z. 2022, ApJ, 932, 107, doi: 10.3847/1538-4357/ac6d63
  • Storfer et al. (2022) Storfer, C., Huang, X., Gu, A., et al. 2022, arXiv e-prints, arXiv:2206.02764, doi: 10.48550/arXiv.2206.02764
  • Suyu et al. (2024) Suyu, S. H., Goobar, A., Collett, T., More, A., & Vernardos, G. 2024, Space Sci. Rev., 220, 13, doi: 10.1007/s11214-024-01044-7
  • Tadaki et al. (2020) Tadaki, K.-i., Iye, M., Fukumoto, H., et al. 2020, MNRAS, 496, 4276, doi: 10.1093/mnras/staa1880
  • Treu et al. (2022) Treu, T., Suyu, S. H., & Marshall, P. J. 2022, A&A Rev., 30, 8, doi: 10.1007/s00159-022-00145-y
  • Tzeng et al. (2017) Tzeng, E., Hoffman, J., Saenko, K., & Darrell, T. 2017, arXiv e-prints, arXiv:1702.05464, doi: 10.48550/arXiv.1702.05464
  • Vegetti et al. (2023) Vegetti, S., Birrer, S., Despali, G., et al. 2023, arXiv e-prints, arXiv:2306.11781, doi: 10.48550/arXiv.2306.11781
  • Virtanen et al. (2020) Virtanen, P., Gommers, R., Oliphant, T. E., et al. 2020, Nature Methods, 17, 261, doi: 10.1038/s41592-019-0686-2
  • Weiler & Cesa (2019) Weiler, M., & Cesa, G. 2019, arXiv e-prints, arXiv:1911.08251, doi: 10.48550/arXiv.1911.08251
  • Wong et al. (2020) Wong, K. C., Suyu, S. H., Chen, G. C. F., et al. 2020, MNRAS, 498, 1420, doi: 10.1093/mnras/stz3094
  • Xu et al. (2023) Xu, Q., Shen, S., de Souza, R. S., et al. 2023, MNRAS, 526, 6391, doi: 10.1093/mnras/stad3181
  • Zhang et al. (2023) Zhang, Y., Manwadkar, V., Gladders, M. D., et al. 2023, ApJ, 950, 58, doi: 10.3847/1538-4357/acc9be