Sketch-based Medical Image Retrieval

Kazuma Kobayashi [email protected] Lin Gu [email protected] Ryuichiro Hataya [email protected] Takaaki Mizuno [email protected]
Mototaka Miyake
[email protected]
Hirokazu Watanabe [email protected] Masamichi Takahashi [email protected] Yasuyuki Takamizawa [email protected]
Yukihiro Yoshida
[email protected]
Satoshi Nakamura [email protected] Nobuji Kouno [email protected] Amina Bolatkan [email protected] Yusuke Kurose [email protected]
Tatsuya Harada
[email protected]
Ryuji Hamamoto [email protected] Division of Medical AI Research and Development, National Cancer Center Research Institute,
5-1-1 Tsukiji, Chuo-ku, Tokyo 104-0045, Japan
Cancer Translational Research Team, RIKEN Center for Advanced Intelligence Project,
1-4-1 Nihonbashi, Chuo-ku, Tokyo 103-0027, Japan
Machine Intelligence for Medical Engineering Team, RIKEN Center for Advanced Intelligence Project,
1-4-1 Nihonbashi, Chuo-ku, Tokyo 103-0027, Japan
Research Center for Advanced Science and Technology, The University of Tokyo,
4-6-1 Komaba, Meguro-ku, Tokyo 153-8904, Japan
Medical Data Deep Learning Team, Advanced Data Science Project, RIKEN Information R&D and Strategy Headquarters,
1-4-1 Nihonbashi, Chuo-ku, Tokyo 103-0027, Japan
Department of Experimental Therapeutics, National Cancer Center Hospital,
5-1-1 Tsukiji, Chuo-ku, Tokyo 104-0045, Japan
Department of Diagnostic Radiology, National Cancer Center Hospital,
5-1-1 Tsukiji, Chuo-ku, Tokyo 104-0045, Japan
Department of Neurosurgery and Neuro-Oncology, National Cancer Center Hospital,
5-1-1 Tsukiji, Chuo-ku, Tokyo 104-0045, Japan
Department of Colorectal Surgery, National Cancer Center Hospital,
5-1-1 Tsukiji, Chuo-ku, Tokyo 104-0045, Japan
Department of Thoracic Surgery, National Cancer Center Hospital,
5-1-1 Tsukiji, Chuo-ku, Tokyo 104-0045, Japan
Radiation Safety and Quality Assurance Division, National Cancer Center Hospital,
5-1-1 Tsukiji, Chuo-ku, Tokyo 104-0045, Japan
Division of Research and Development for Boron Neutron Capture Therapy, National Cancer Center, Exploratory Oncology Research & Clinical Trial Center,
5-1-1 Tsukiji, Chuo-ku, Tokyo 104-0045, Japan
Medical Physics Laboratory, Division of Health Science, Graduate School of Medicine, Osaka University
Yamadaoka 1-7, Suita-shi, Osaka 565-0871, Japan
Department of Surgery, Kyoto University Graduate School of Medicine
54 Shogoin Kawahara-cho, Sakyo-ku, Kyoto 606-8507, Japan
(December 2022)
Abstract

The amount of medical images stored in hospitals is increasing faster than ever; however, utilization of the accumulated medical images has been limited. This is because existing content-based medical image retrieval (CBMIR) systems usually require example images to construct query vectors, yet example images cannot always be prepared. Moreover, some images have rare characteristics that make it difficult to find similar example images; we call these isolated samples. Here, we introduce a novel sketch-based medical image retrieval (SBMIR) system that enables users to find images of interest without example images. The key idea lies in the feature decomposition of medical images, whereby the entire feature of a medical image can be decomposed into and reconstructed from normal and abnormal features. Extending this idea, our SBMIR system provides an easy-to-use two-step graphical user interface: users first select a template image to specify a normal feature and then draw a semantic sketch of the disease on the template image to represent an abnormal feature. The system then integrates the two kinds of input into a query vector and retrieves reference images with the closest reference vectors. Ten healthcare professionals with various clinical backgrounds participated in user tests on two datasets. As a result, our SBMIR system enabled users to overcome previous challenges, including image retrieval based on fine-grained image characteristics, image retrieval without example images, and image retrieval for isolated samples. Our SBMIR system achieves flexible medical image retrieval on demand, thereby expanding the utility of medical image databases.

keywords:
Sketch-based image retrieval, content-based image retrieval, feature decomposition, query by sketch, query by example

Abbreviations: 2D, two-dimensional; 3D, three-dimensional; AC, anatomy code; CBMIR, content-based medical image retrieval; CT, computed tomography; ED, peritumoral edema; ET, gadolinium-enhancing tumor; FLAIR, fluid-attenuated inversion recovery sequence; GAN, generative adversarial network; Gd, gadolinium; GUI, graphical user interface; KL, Kullback–Leibler; MRI, magnetic resonance imaging; nDCG, normalized discounted cumulative gain; NET, necrotic and non-enhancing tumor core; NN, nearest neighbor; PT, primary tumor; SBMIR, sketch-based medical image retrieval; SVM, support vector machine; T1, T1-weighted sequence; T1CE, T1-weighted contrast-enhanced sequence; VAE, variational autoencoder

1 Introduction

As the amount of medical images stored in hospital databases is increasing much faster than healthcare professionals can manage, the need for effective medical image retrieval has become greater owing to its potential in patient care, research, and development (Quellec et al., 2011; Allan et al., 2012; Tschandl et al., 2020; Chen et al., 2022). Indeed, healthcare professionals refer not only to evidence in the literature but also to the accumulated past cases because they often provide valuable insights into what differentiates individual patients from common characteristics in the patient population. Because the radiographic phenotype of a disease is closely related to its diagnosis, treatment response, and prognosis (Aerts et al., 2014), content-based medical image retrieval (CBMIR), which can calculate the similarity of medical images based on image contents, has been the mainstay for the development of medical image retrieval.

To date, CBMIR has been treated as a computer science problem to devise how to measure the similarity between medical images by capturing the unstructured nature of clinical findings. Based on a query-by-example approach (Pinho et al., 2019), most CBMIR systems exploit an example image as a query image, from which a query vector representing requested information is extracted (see the left part of Fig. 1a). As a search result, reference images with the closest reference vectors to the query vector are retrieved from a database. CBMIR emerged as a method to calculate the similarity based on visual features such as color, texture, shape, and spatial relationships among regions of interest (Long et al., 2003; Li et al., 2018). More recently, it has incorporated deep learning (LeCun et al., 2015) to efficiently extract semantic features from example images in order to create query vectors (Hosny et al., 2018). This achieves better results by minimizing the semantic gap between high-level semantic concepts and low-level visual features in medical images (Zheng et al., 2018).

However, the information-seeking objectives of healthcare professionals can be more diverse than what simply calculating the similarity between the query and reference images can handle. For example, to validate the clinical decision for a present case, healthcare professionals may need not only past cases with exactly the same clinical findings in the same anatomical location but also cases with similar findings in a different location or with different findings in the same location. In addition, suppose that one needs to search for images with particular clinical findings in medical image repositories on the Internet, which have recently become popular (Prior et al., 2017). In such cases, it is usually difficult for users to find the right image to initiate the search because they cannot prepare an example image in advance. Furthermore, clinical medicine often places high reference value on rare cases, including rare diseases and rare clinical findings (Turro et al., 2020). Nevertheless, retrieving such relevant images from a database can be challenging for healthcare professionals because similar example images are hardly available due to their rarity.

Based on these considerations, we focus on two potential limitations of conventional CBMIR systems that use the query-by-example approach: the usability problem and the searchability problem. Usability refers to a qualitative attribute representing how easily users can satisfy their information-seeking objectives whenever necessary. Because query-by-example approaches depend on example images, searching is challenging in situations where no example image is available in advance. In addition, adding or subtracting arbitrary features to or from a query vector extracted from an example image is difficult, hindering users from refining their queries based on the search results. Searchability, in turn, indicates the scope of potentially retrievable images among all images stored in a database. Consider a latent space in which feature vectors corresponding to image characteristics are distributed. Whereas a typical image is surrounded by other typical images in its vicinity, a rare image may be located far away from the others owing to its unique characteristics; we call such an image an isolated sample in a database. An isolated sample is unlikely to appear in the nearest neighborhood of commonly available example images, implying that the rarer the case, the more difficult its retrieval can be. Using ResNet-101 features (He et al., 2016) (see A), we demonstrate in B that medical image datasets contain a substantial number of isolated samples.

To mitigate these problems, one alternative could be a query-by-sketch approach (Sangkloy et al., 2016; Li and Li, 2018; Dutta and Akata, 2020; Zhang et al., 2020; Vinker et al., 2022; Bhunia et al., 2022), which has achieved notable success in the field of computer vision. A user’s query sketch can convey the shape, orientation, and fine-grained details of objects without preparing example images. Nevertheless, sketch-based queries have not been demonstrated in medical image retrieval, perhaps because sketching all anatomical information seems too laborious for a real-world application. Therefore, a practical CBMIR system that does not require users to provide example images or to comprehensively sketch anatomical features is needed.

Figure 1: Overview of the algorithm for the sketch-based medical image retrieval (SBMIR) system. a The conventional query-by-example approach in content-based medical image retrieval extracts a query vector from a query image as an example; nevertheless, it is not always easy to prepare an example image when users need to refer to images with specific findings. Here, we introduce a feasible query-by-sketch approach in which users can specify a query vector by selecting a template image and by sketching the findings of interest onto the template image. b The technical basis is feature decomposition of medical images, where the entire feature (the entire anatomy code [AC]) of a medical image can be decomposed into and reconstructed from a normal feature (a normal AC) and an abnormal feature (an abnormal AC). c In our SBMIR system, a query vector is composed of a normal AC extracted from a selected template image and an abnormal AC approximated from a sketched label, while reference vectors are automatically computed from reference images. The reference images with the reference vector closest to the query vector are obtained as search results. AC, anatomy code.

1.1 SBMIR: sketch-based medical image retrieval system

Here, we introduce a feasible query-by-sketch approach that minimizes the effort required for sketching (see the right part of Fig. 1a), which we used to establish the SBMIR system. The underlying assumption is as follows. Individual disease phenotypes are diverse and indefinite, whereas the surrounding normal anatomy shares many common features within a given population. Hence, if we can specify the semantic features of a disease with sketches and those of the surrounding normal anatomy by selecting a template image that approximates the normal anatomical features, we can construct a query vector conveying the target image content with its desired anatomical location. Based on this assumption, our SBMIR system uses two different modalities–a template image and a semantic sketch of the disease–to construct a query vector. The technological basis for constructing a query vector from these different modalities is the feature decomposition of medical images (Kobayashi et al., 2021), whereby the entire feature vector of a medical image, which is referred to as an entire anatomy code (AC), can be decomposed into and reconstructed from two semantically different feature vectors, a normal AC and an abnormal AC (see Fig. 1b). The relationship among the three ACs can be formulated as follows:

\mathrm{Entire\,AC}=\mathrm{Normal\,AC}+\mathrm{Abnormal\,AC}. \quad (1)

As demonstrated in Fig. 1b, the normal AC represents the counterfactual normal anatomy that should have existed in the absence of an abnormality, and the abnormal AC represents any abnormality as a residual from the normal baseline. By extending this concept, we devised a deep-learning-based algorithm to extract a normal AC from a template image and an abnormal AC from a semantic sketch of the disease to construct a query vector by adding them together.

Our SBMIR system consists of a feature extraction module and a similarity calculation module (see Fig. 1c). A query vector is calculated through the following two-step user operation (see the left part of Fig. 1c). First, users select a two-dimensional (2D) template image by slicing through a three-dimensional (3D) template volume (see “Step 1” in Fig. 2a for an example). The 3D template volume comprises a series of 2D slices of a specific organ (e.g., brain) or anatomical region (e.g., chest). We assume that normal anatomy varies only within a small range in a given population, so users can specify the location of a disease by selecting, as the template image, the slice that matches the area where the clinical findings should be present. Second, users sketch semantic segmentation labels representing the disease on the selected template image (see “Step 2” in Fig. 2a for an example). These semantic segmentation labels are predefined and then learned for each specific clinical finding in medical images (e.g., a segmentation label representing a tumor region or a particular component of a disease, such as a necrotic region) as the image content that should be located therein. The feature extraction module then extracts a normal AC from the template image and an abnormal AC from the semantic sketch of the disease, and the two are summed to give the query vector according to Eq. 1. As a result, users obtain the reference images with the reference vectors closest to the query vector, as computed by the similarity calculation module (see the middle part of Fig. 1c). These reference vectors are calculated in advance from the reference images (see the right part of Fig. 1c). Note that each of the retrieved top-$K$ similar images belongs to a different reference volume (e.g., an individual magnetic resonance imaging [MRI] or computed tomography [CT] scan) to avoid redundancy between consecutive 2D slices within a 3D volume.
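
A minimal sketch of this two-step query construction and volume-wise retrieval, assuming trained encoder functions with the interfaces described above (the function and argument names, and the use of the posterior mean as the normal AC, are illustrative assumptions, not the released API):

```python
import torch

def build_query_vector(template_image: torch.Tensor,
                       sketch_label: torch.Tensor,
                       normal_ac_encoder,
                       label_encoder) -> torch.Tensor:
    """Construct a query vector from a template image and a semantic sketch.

    Following Eq. 1, the query is the sum of a normal AC extracted from the
    selected template slice and an abnormal AC estimated from the sketched
    segmentation label.
    """
    with torch.no_grad():
        mu, _ = normal_ac_encoder(template_image.unsqueeze(0))  # posterior mean as normal AC (assumption)
        abnormal_ac = label_encoder(sketch_label.unsqueeze(0))  # abnormal AC from the sketch
    return (mu + abnormal_ac).squeeze(0)                        # entire AC used as the query vector

def retrieve_top_k(query_vec: torch.Tensor,
                   reference_vecs: torch.Tensor,   # (num_slices, dim), precomputed
                   volume_ids: list,               # reference volume of each slice
                   k: int = 5):
    """Return the k closest slices, keeping at most one slice per reference volume."""
    dists = torch.cdist(query_vec.unsqueeze(0), reference_vecs).squeeze(0)
    best_per_volume = {}
    for idx in torch.argsort(dists).tolist():
        vol = volume_ids[idx]
        if vol not in best_per_volume:              # keep only the closest slice of each volume
            best_per_volume[vol] = (idx, dists[idx].item())
        if len(best_per_volume) == k:
            break
    return list(best_per_volume.values())
```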

For model training and evaluation, a dataset of brain MRI scans with gliomas and a dataset of chest CT scans with lung cancers were used. To show that our SBMIR system is easy for healthcare professionals to use and can mitigate the usability and searchability problems of conventional CBMIR systems, 10 healthcare professionals with various clinical backgrounds participated in user tests based on a dedicated graphical user interface (GUI) (see Fig. 2b). The evaluators completed a practice stage followed by three testing stages: Test-1 assessed the image retrieval performance when example images were available, Test-2 assessed the image retrieval performance without example images, and Test-3 assessed the image retrieval performance for isolated samples. Supplementary Video 1 demonstrates our SBMIR system in action. In addition, we will soon release the source code and materials to show researchers how our SBMIR system works.

The main contributions of this study can be summarized into the following:

  • 1.

    By extending the concept of the feature decomposition of medical images, we devised an algorithm that realizes a feasible query-by-sketch approach for constructing a query vector, requiring neither an example image nor a detailed sketch of all anatomical structures.

  • 2.

    We implemented the first SBMIR system that achieves flexible medical image retrieval on demand through an easy-to-use two-step user operation, the search results of which can change according to which template image is selected and how the disease is sketched.

  • 3.

    The user test showed that our SBMIR system could overcome the usability and searchability problems of conventional CBMIR systems through better image retrieval according to fine-grained image characteristics (Test-1), image retrieval without example images (Test-2), and image retrieval for isolated samples (Test-3).

Figure 2: Illustration of the software demonstrating our sketch-based medical image retrieval system. a A user query is constructed through the following two-step user operation. First, a template image is selected to specify the location where a disease should exist, and second, a semantic segmentation label of the disease is sketched therein. b An easy-to-use graphical user interface was developed for the user tests. In the left field, an image is presented as a question item in Test-1. In the middle field, a ring-enhancing brain tumor surrounded by peritumoral edema located in the left temporal pole is drawn on a selected template image by an evaluator, the intention of which is to correspond with the presented image. The resulting top-5 retrieved images showing the corresponding reference images extracted from the testing dataset are listed in the right field. Each evaluator judges the consistency of the given criteria by checking the checkboxes. The orange highlighting of the retrieved images indicates that the image belongs to the same volume as the presented image (i.e., the same-volume image). CE, contrast enhancement; Gd, gadolinium; T1, T1-weighted sequence; T1CE, T1-weighted contrast-enhanced sequence; FLAIR, fluid-attenuated inversion recovery sequence; SEG, tumor-associated segmentation labels.

2 Algorithm

Our SBMIR system has a unique feature extraction module at its core, which is based on the feature decomposition of medical images extended with a semantically organized latent space. This extension is critical to achieving a practical SBMIR system, as we demonstrate in E. Here, we describe the training algorithm of the feature extraction module, which is combined with the similarity calculation module to realize our real-time, large-scale image retrieval system.

Figure 3: Basic components of the feature extraction module. The deep-learning framework for feature extraction consists of three parts. a A variational autoencoder, consisting of a normal anatomy code (AC) encoder and an image decoder, learns normal ACs using sub-batches comprising only healthy images. b An encoder–decoder type of segmentation network, consisting of an abnormal AC encoder and a label decoder, acquires abnormal ACs through the prediction of segmentation labels for the abnormal lesions. c An encoding network, consisting of a label encoder, maps ground-truth segmentation labels representing abnormal lesions in order to estimate the corresponding abnormal ACs that are output by the abnormal AC encoder using a corresponding diseased image as input.

2.1 Network architecture for feature decomposition

The feature extraction module of our SBMIR system is a deep-learning framework that performs feature decomposition of medical images (Kobayashi et al., 2021) with extended mapping functions between the image space $\mathcal{X}$ and the label space $\mathcal{L}$ via the latent space $\mathcal{Z}$ ($\mathcal{X}\leftrightarrow\mathcal{Z}\leftrightarrow\mathcal{L}$). Training the neural networks requires a dataset consisting of pairs of an image and a corresponding segmentation label of tumor-associated regions, $(\bm{x},\bm{l})\in\mathcal{X}\times\mathcal{L}$. Below, we assume that the image space $\mathcal{X}$ has a subspace of healthy images $\mathcal{X}^{\mathrm{h}}$ and a subspace of diseased images $\mathcal{X}^{\mathrm{d}}$. Although feature decomposition of medical images often involves elaborate algorithms such as conditional generation using generative adversarial networks (GANs) (Goodfellow et al., 2014; Liu et al., 2022), our SBMIR system achieves this objective by combining three simple components without adversarial training, which stabilizes the training process.

2.1.1 Variational auto-encoder for learning normal ACs

As illustrated in Fig. 3a, the first component of our feature extraction module is a variational autoencoder (VAE) (Kingma and Welling, 2014; Rezende et al., 2014) that learns the mapping function of the normal ACs $\bm{n}$ between the image space and the latent space ($\mathcal{X}\xrightarrow{\bm{n}}\mathcal{Z}\xrightarrow{\bm{n}}\mathcal{X}$). This VAE consists of a pair of a normal AC encoder $E_{\mathrm{NAC}}$ and an image decoder $D_{\mathrm{Img}}$, which together take a healthy image $\bm{x}^{\mathrm{h}}\sim\mathcal{X}^{\mathrm{h}}$ as input and output a reconstructed image $\hat{\bm{x}}^{\mathrm{h}}$ by simultaneously producing a normal AC $\bm{n}$ in the latent space ($D_{\mathrm{Img}}(E_{\mathrm{NAC}}(\bm{x}^{\mathrm{h}}))=D_{\mathrm{Img}}(\bm{n})=\hat{\bm{x}}^{\mathrm{h}}$). Based on the idea that the distribution of normal ACs $\bm{n}$ should lie within a certain range, reflecting the limited range of normal anatomic variation within a population, we impose an isotropic multivariate Gaussian $\mathcal{N}(\bm{n};\bm{0},\bm{I})$ as a prior distribution over the normal ACs ($p(\bm{n})=\mathcal{N}(\bm{0},\bm{I})$). As such, using the two output variables $\bm{\mu}$ and $\bm{\sigma}$, the posterior distribution estimated by the encoder, $E_{\mathrm{NAC}}(\bm{n}\,|\,\bm{x}^{\mathrm{h}})=\mathcal{N}(\bm{\mu}(\bm{x}^{\mathrm{h}}),\bm{\sigma}(\bm{x}^{\mathrm{h}}))$, is forced to be close to the prior by minimizing the Kullback–Leibler (KL) divergence between the prior and posterior distributions.
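
A compact sketch of such a VAE component follows; layer sizes, channel counts, and the latent dimension are placeholders rather than the values used in the paper:

```python
import torch
import torch.nn as nn

class NormalACEncoder(nn.Module):
    """Encodes an image into the parameters of the normal-AC posterior N(mu, sigma)."""
    def __init__(self, in_ch=3, latent_dim=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(in_ch, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.fc_mu = nn.Linear(64, latent_dim)
        self.fc_logvar = nn.Linear(64, latent_dim)

    def forward(self, x):
        h = self.backbone(x)
        return self.fc_mu(h), self.fc_logvar(h)

def sample_normal_ac(mu, logvar):
    """Reparameterization trick: n = mu + sigma * eps."""
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * logvar) * eps

def kl_regularization(mu, logvar):
    """KL divergence between N(mu, sigma^2) and the standard Gaussian prior (cf. Eq. 3)."""
    return -0.5 * torch.mean(torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1))
```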

2.1.2 Segmentation network for learning abnormal ACs

Figure 3b shows the second component of our feature extraction module, an encoder–decoder type of segmentation network that learns the mapping function for the abnormal ACs $\bm{a}$ along the sequence $\mathcal{X}\xrightarrow{\bm{a}}\mathcal{Z}\xrightarrow{\bm{a}}\mathcal{L}$, based on a pair of a diseased image $\bm{x}^{\mathrm{d}}\sim\mathcal{X}^{\mathrm{d}}$ and a segmentation label of tumor-associated regions $\bm{l}^{\mathrm{d}}$. The abnormal ACs $\bm{a}$ are acquired as the outputs of an abnormal AC encoder $E_{\mathrm{AAC}}$ and are decoded into a semantic segmentation label $\hat{\bm{l}}^{\mathrm{d}}$ through a label decoder $D_{\mathrm{Lbl}}$, as follows: $D_{\mathrm{Lbl}}(E_{\mathrm{AAC}}(\bm{x}^{\mathrm{d}}))=D_{\mathrm{Lbl}}(\bm{a})=\hat{\bm{l}}^{\mathrm{d}}$. Note that there is no skip connection between the abnormal AC encoder $E_{\mathrm{AAC}}$ and the label decoder $D_{\mathrm{Lbl}}$. Therefore, through training with segmentation losses between $\hat{\bm{l}}^{\mathrm{d}}$ and $\bm{l}^{\mathrm{d}}$, the abnormal ACs $\bm{a}$ can be optimized to encode the semantic features particularly relevant to the tumor-associated regions.
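
A minimal sketch of this bottleneck segmentation network (no skip connections between encoder and decoder; channel counts, the latent dimension, and the number of label classes are illustrative assumptions):

```python
import torch.nn as nn

class AbnormalACEncoder(nn.Module):
    """Maps an image to an abnormal AC (a single latent vector)."""
    def __init__(self, in_ch=3, latent_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, latent_dim))

    def forward(self, x):
        return self.net(x)

class LabelDecoder(nn.Module):
    """Decodes an abnormal AC into a semantic segmentation label of tumor-associated regions."""
    def __init__(self, latent_dim=128, n_classes=4, out_size=256):
        super().__init__()
        self.fc = nn.Linear(latent_dim, 64 * 8 * 8)
        self.up = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, n_classes, 4, stride=2, padding=1),
            nn.Upsample(size=(out_size, out_size), mode='bilinear', align_corners=False))

    def forward(self, a):
        h = self.fc(a).view(-1, 64, 8, 8)
        return self.up(h)   # per-class logits; no skip connections from the encoder
```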

2.1.3 Encoding network to estimate abnormal ACs from semantic segmentation labels

Figure 3c describes the third component of our feature extraction module, a label encoder $E_{\mathrm{Lbl}}$ that enables the mapping function from the label space to the latent space with respect to the abnormal ACs $\bm{a}$ ($\mathcal{L}\xrightarrow{\bm{a}}\mathcal{Z}$). Specifically, the label encoder $E_{\mathrm{Lbl}}$ estimates an abnormal AC $\hat{\bm{a}}$ from a ground-truth segmentation label $\bm{l}^{\mathrm{d}}$ as input ($E_{\mathrm{Lbl}}(\bm{l}^{\mathrm{d}})=\hat{\bm{a}}$), acting as an inverse function of the label decoder ($D_{\mathrm{Lbl}}(\bm{a})=\hat{\bm{l}}^{\mathrm{d}}$). In the training of the label encoder $E_{\mathrm{Lbl}}$, the corresponding abnormal AC $\bm{a}$ that gives the segmentation prediction closest to the ground-truth label through the label decoder $D_{\mathrm{Lbl}}$ serves as the ground truth for the estimated abnormal AC $\hat{\bm{a}}$.
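
A sketch of the label encoder and its consistency objective; detaching the target abnormal AC mirrors the stop-gradient in Algorithm 1, and all names are illustrative:

```python
import torch.nn as nn
import torch.nn.functional as F

class LabelEncoder(nn.Module):
    """Estimates an abnormal AC directly from a (sketched or ground-truth) segmentation label."""
    def __init__(self, n_classes=4, latent_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(n_classes, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, latent_dim))

    def forward(self, label):
        return self.net(label)

def consistency_loss(estimated_a, target_a):
    """L1 distance between the estimated abnormal AC and the encoder-derived target.
    The target is detached so the label encoder is pulled toward the abnormal AC,
    not the other way around."""
    return F.l1_loss(estimated_a, target_a.detach())
```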

2.1.4 The whole network architecture

Figure 4: Overall architecture of the feature extraction module for our sketch-based medical image retrieval system. The deep-learning framework can learn normal anatomy codes (ACs), abnormal ACs, and entire ACs according to each semantic concept and mapping function between the image space and the label space via the latent space. The training is performed differently, depending on whether the input images are healthy or diseased. Particularly, the VAE component is switched to be trainable when healthy images are given, whereas its inference results are utilized when diseased images are given. This is reasonable because there are no ground-truth pseudo-normal images corresponding to the diseased images.

By combining these components, the deep-learning framework for the feature extraction module in our SBMIR system is trained as illustrated in Fig. 4. The training process changes according to whether the input image is healthy ($\bm{x}^{\mathrm{h}}\sim\mathcal{X}^{\mathrm{h}}$) or diseased ($\bm{x}^{\mathrm{d}}\sim\mathcal{X}^{\mathrm{d}}$). Notably, the VAE component is trained only when healthy images $\bm{x}^{\mathrm{h}}$ are given as input, whereas only its inference results are used when diseased images $\bm{x}^{\mathrm{d}}$ are given.

When a healthy image $\bm{x}^{\mathrm{h}}$ is given as input, the VAE component is switched to be trainable (see the upper part of Fig. 4). At the encoding step, from the healthy image $\bm{x}^{\mathrm{h}}$, the normal AC encoder $E_{\mathrm{NAC}}$ estimates the posterior distribution of the normal AC ($E_{\mathrm{NAC}}(\bm{n}\,|\,\bm{x}^{\mathrm{h}})=\mathcal{N}(\bm{\mu}(\bm{x}^{\mathrm{h}}),\bm{\sigma}(\bm{x}^{\mathrm{h}}))$), and the abnormal AC encoder $E_{\mathrm{AAC}}$ outputs the abnormal AC $\bm{a}$ in the latent space ($E_{\mathrm{AAC}}(\bm{x}^{\mathrm{h}})=\bm{a}$). A normal AC $\bm{n}$ can be sampled from $\mathcal{N}(\bm{n};\bm{\mu}(\bm{x}^{\mathrm{h}}),\bm{\sigma}(\bm{x}^{\mathrm{h}}))$ using the reparameterization trick (Kingma and Welling, 2014; Rezende et al., 2014), that is, $\bm{n}=\bm{\mu}+\bm{\sigma}\odot\bm{\epsilon}$, where $\bm{\epsilon}\sim\mathcal{N}(\bm{0},\bm{I})$ and $\odot$ denotes the Hadamard product. Then, an entire AC $\bm{e}$ is calculated as the sum of the normal AC $\bm{n}$ and the abnormal AC $\bm{a}$ ($\bm{e}=\bm{n}+\bm{a}$). Note that, because the input image does not include any abnormality, the abnormal AC $\bm{a}$ is trained to be the zero vector ($\bm{a}\approx\bm{0}$) so as not to convey any abnormal information about the image, making the entire AC $\bm{e}$ and the normal AC $\bm{n}$ effectively identical ($\bm{e}\approx\bm{n}$). At the decoding step, the normal AC $\bm{n}$ and the entire AC $\bm{e}$ are independently fed into the image decoder $D_{\mathrm{Img}}$ to reconstruct the same healthy input image $\hat{\bm{x}}^{\mathrm{h}}$ ($D_{\mathrm{Img}}(\bm{n})=\hat{\bm{x}}^{\mathrm{h}}$ and $D_{\mathrm{Img}}(\bm{e})=\hat{\bm{x}}^{\mathrm{h}}$). Additionally, the abnormal AC $\bm{a}$, which is trained to be the zero vector, is fed into the label decoder $D_{\mathrm{Lbl}}$ to generate a segmentation label $\hat{\bm{l}}^{\mathrm{h}}$ that is encouraged to match a ground-truth label $\bm{l}^{\mathrm{h}}$ filled with zeros ($D_{\mathrm{Lbl}}(\bm{a})=\hat{\bm{l}}^{\mathrm{h}}$). Finally, the ground-truth label $\bm{l}^{\mathrm{h}}$ is fed into the label encoder $E_{\mathrm{Lbl}}$, which is trained to estimate the corresponding abnormal AC $\hat{\bm{a}}$, which should likewise be the zero vector ($E_{\mathrm{Lbl}}(\bm{l}^{\mathrm{h}})=\hat{\bm{a}}$).

Conversely, when a diseased image $\bm{x}^{\mathrm{d}}$ is given as input, the VAE component is not used for learning, and only its inference results are utilized (see the lower part of Fig. 4). The assumption is that the normal ACs $\bm{n}$, trained only on healthy images $\bm{x}^{\mathrm{h}}$ to encode normal anatomical information, should be incapable of reconstructing abnormal lesions in diseased images $\bm{x}^{\mathrm{d}}$ (Schlegl et al., 2017). Thus, the inference result from the diseased image $\bm{x}^{\mathrm{d}}$ can serve as the normal AC $\bm{n}$ that matches a pseudo-normal image corresponding to the input image. The rest of the training process is essentially analogous to the process starting with a healthy image. At the encoding step, from the diseased image $\bm{x}^{\mathrm{d}}$, the normal AC encoder $E_{\mathrm{NAC}}$ infers the normal AC $\bm{n}$ ($E_{\mathrm{NAC}}(\bm{x}^{\mathrm{d}})=\bm{n}$), and the abnormal AC encoder $E_{\mathrm{AAC}}$ outputs the abnormal AC $\bm{a}$ ($E_{\mathrm{AAC}}(\bm{x}^{\mathrm{d}})=\bm{a}$). Then, the entire AC $\bm{e}$ is calculated as the sum of the normal AC $\bm{n}$ and the abnormal AC $\bm{a}$ ($\bm{e}=\bm{n}+\bm{a}$). At the decoding step, the entire AC $\bm{e}$ is fed into the image decoder $D_{\mathrm{Img}}$ to reconstruct the whole input image with its abnormal findings $\hat{\bm{x}}^{\mathrm{d}}$ ($D_{\mathrm{Img}}(\bm{e})=\hat{\bm{x}}^{\mathrm{d}}$). Note that reconstruction from the normal AC $\bm{n}$ is not performed because there is no ground truth for the pseudo-normal image corresponding to the input image. Then, the abnormal AC $\bm{a}$ is fed into the label decoder $D_{\mathrm{Lbl}}$ to predict the segmentation label of the abnormalities $\hat{\bm{l}}^{\mathrm{d}}$, which should match the ground-truth segmentation label $\bm{l}^{\mathrm{d}}$ ($D_{\mathrm{Lbl}}(\bm{a})=\hat{\bm{l}}^{\mathrm{d}}$). Finally, the ground-truth segmentation label $\bm{l}^{\mathrm{d}}$ is taken as input by the label encoder $E_{\mathrm{Lbl}}$ to estimate the corresponding abnormal AC $\hat{\bm{a}}$ ($E_{\mathrm{Lbl}}(\bm{l}^{\mathrm{d}})=\hat{\bm{a}}$).

2.2 Creating the semantically organized latent space

Figure 5: Semantically organized latent space. a The abnormal anatomy codes (ACs) extracted from healthy images should be close to zero vectors, such that the normal ACs and entire ACs should be identical. On the other hand, the abnormal ACs extracted from diseased images should contain meaningful information, such that the normal ACs and the entire ACs are distributed separately. b The normal ACs from healthy images, the entire ACs from healthy images, and the normal ACs from diseased images should overlap with each other, whereas the entire ACs from diseased images should be far away from the others by introducing a margin parameter.

The semantically organized latent space is critical for our SBMIR system to perform image retrieval based on semantics (i.e., whether an image is healthy or diseased). That is, a query vector conveying the information of a diseased region should retrieve only diseased images $\bm{x}^{\mathrm{d}}$, and one conveying no disease-region information should retrieve only healthy images $\bm{x}^{\mathrm{h}}$. We call this semantic consistency, which will be quantitatively evaluated later in E.1. To achieve this, the latent space $\mathcal{Z}$ needs to be separated into a subspace representing healthy images $\mathcal{Z}^{\mathrm{h}}$ (i.e., the healthy subspace) and another subspace representing diseased images $\mathcal{Z}^{\mathrm{d}}$ (i.e., the diseased subspace). In other words, our SBMIR system should enable the corresponding semantic mapping between the image space and the latent space as follows: $\mathcal{X}^{\mathrm{h}}\leftrightarrow\mathcal{Z}^{\mathrm{h}}$ and $\mathcal{X}^{\mathrm{d}}\leftrightarrow\mathcal{Z}^{\mathrm{d}}$. Here, we explain how the semantically organized latent space is configured.

2.2.1 Four latent distributions in the latent space

How the information conveyed by an abnormal AC $\bm{a}$ changes with the semantics is essential for configuring the semantically organized latent space. As shown in Fig. 5a, the entire AC $\bm{e}$ of a diseased image $\bm{x}^{\mathrm{d}}$ is represented as the sum of the normal AC $\bm{n}$ and the abnormal AC $\bm{a}$ ($\bm{e}=\bm{n}+\bm{a}$); in contrast, that of a healthy image $\bm{x}^{\mathrm{h}}$ can be approximated by the normal AC $\bm{n}$ alone ($\bm{e}\approx\bm{n}$), because the abnormal AC $\bm{a}$ should be the zero vector, reflecting the absence of abnormality ($\bm{a}\approx\bm{0}$). Therefore, there are four latent distributions in the latent space: the distribution of entire ACs from healthy images $\mathcal{D}(\bm{e}\,|\,\bm{x}^{\mathrm{h}})$, that of normal ACs from healthy images $\mathcal{D}(\bm{n}\,|\,\bm{x}^{\mathrm{h}})$, that of entire ACs from diseased images $\mathcal{D}(\bm{e}\,|\,\bm{x}^{\mathrm{d}})$, and that of normal ACs from diseased images $\mathcal{D}(\bm{n}\,|\,\bm{x}^{\mathrm{d}})$. Considering the semantics, the healthy subspace $\mathcal{Z}^{\mathrm{h}}$ should enclose the distribution of entire ACs from healthy images, that of normal ACs from healthy images, and that of normal ACs from diseased images ($\{\mathcal{D}(\bm{e}\,|\,\bm{x}^{\mathrm{h}}),\mathcal{D}(\bm{n}\,|\,\bm{x}^{\mathrm{h}}),\mathcal{D}(\bm{n}\,|\,\bm{x}^{\mathrm{d}})\}\subset\mathcal{Z}^{\mathrm{h}}$), because all of these distributions represent healthy images. On the other hand, the diseased subspace $\mathcal{Z}^{\mathrm{d}}$ includes the remaining distribution of entire ACs from diseased images ($\{\mathcal{D}(\bm{e}\,|\,\bm{x}^{\mathrm{d}})\}\subset\mathcal{Z}^{\mathrm{d}}$) because only this distribution represents diseased images. Note that, since the abnormal ACs $\bm{a}$ express only the amount of change from the healthy subspace $\mathcal{Z}^{\mathrm{h}}$ to the diseased subspace $\mathcal{Z}^{\mathrm{d}}$ (i.e., as shown in Fig. 1b, an abnormal AC does not carry enough information to reconstruct a whole image), the distributions of the abnormal ACs $\bm{a}$ are not taken into account.

2.2.2 Configuration of the healthy subspace

Figure 5b illustrates how the three distributions in the healthy subspace ($\{\mathcal{D}(\bm{e}\,|\,\bm{x}^{\mathrm{h}}),\mathcal{D}(\bm{n}\,|\,\bm{x}^{\mathrm{h}}),\mathcal{D}(\bm{n}\,|\,\bm{x}^{\mathrm{d}})\}\subset\mathcal{Z}^{\mathrm{h}}$) are trained to overlap with each other. During model training, the three distributions in the healthy subspace $\mathcal{Z}^{\mathrm{h}}$ are forced to follow the isotropic multivariate Gaussian distribution $\mathcal{N}(\bm{n};\bm{0},\bm{I})$ that is formulated by the VAE component to learn normal ACs $\bm{n}$. In particular, the distribution of normal ACs from healthy images $\mathcal{D}(\bm{n}\,|\,\bm{x}^{\mathrm{h}})$ is directly optimized to follow the Gaussian distribution during the training of the VAE component (see the process starting with a healthy image in Fig. 4). Additionally, the distribution of normal ACs from diseased images $\mathcal{D}(\bm{n}\,|\,\bm{x}^{\mathrm{d}})$ is ensured to follow the Gaussian distribution because each normal AC $\bm{n}$ from a diseased image $\bm{x}^{\mathrm{d}}$ is sampled from the posterior distribution of the VAE component as an inference result (see the process starting with a diseased image in Fig. 4). The distribution of entire ACs from healthy images $\mathcal{D}(\bm{e}\,|\,\bm{x}^{\mathrm{h}})$ can be indirectly optimized to follow the Gaussian distribution by forcing the abnormal ACs extracted from healthy images toward the zero vector ($\bm{a}\approx\bm{0}$), so that $\mathcal{D}(\bm{e}\,|\,\bm{x}^{\mathrm{h}})=\mathcal{D}(\bm{n}+\bm{a}\,|\,\bm{x}^{\mathrm{h}})\approx\mathcal{D}(\bm{n}\,|\,\bm{x}^{\mathrm{h}})$, where $\mathcal{D}(\bm{n}\,|\,\bm{x}^{\mathrm{h}})$ is trained to be the Gaussian distribution by the learning objective of the VAE component.
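
A minimal sketch of the abnormality constraint described above, which pushes abnormal ACs from healthy images toward the zero vector so that the entire ACs of healthy images collapse onto their normal ACs (the specific choice of norm here is an assumption; the text only states that the norm is driven to zero):

```python
import torch

def abnormality_loss(abnormal_ac: torch.Tensor) -> torch.Tensor:
    """Drive abnormal ACs from healthy images toward zero (a ~ 0), so that for healthy
    inputs the entire AC e = n + a effectively equals the normal AC n."""
    # Mean absolute value per batch; any vector norm that vanishes at zero would serve.
    return abnormal_ac.abs().mean()
```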

2.2.3 Configuration of the diseased subspace

Figure 5b also depicts how the distribution of entire ACs from diseased images in the diseased subspace ($\{\mathcal{D}(\bm{e}\,|\,\bm{x}^{\mathrm{d}})\}\subset\mathcal{Z}^{\mathrm{d}}$) should be separated from the Gaussian distribution $\mathcal{N}(\bm{0},\bm{I})$ that represents healthy images $\bm{x}^{\mathrm{h}}$. To achieve this, a margin parameter is provided as a hyperparameter that determines how far apart the two subspaces should be, indicated by the arrow labeled “Margin” in Fig. 5b. The distance between the distribution of entire ACs from diseased images $\mathcal{D}(\bm{e}\,|\,\bm{x}^{\mathrm{d}})$ and the Gaussian distribution $\mathcal{N}(\bm{0},\bm{I})$ is measured by the KL divergence and is optimized to exceed the margin parameter, which is formulated as a margin loss as follows:

L_{\mathrm{margin}}=\max\bigl(0,\,m-\mathrm{KL}(\mathcal{D}(\bm{e}\,|\,\bm{x}^{\mathrm{d}})\,\|\,\mathcal{N}(\bm{0},\bm{I}))\bigr). \quad (2)

Here, $m$ is the margin parameter, and $\mathrm{KL}(\cdot\|\cdot)$ denotes the KL divergence between two distributions. As the margin parameter increases, the two distributions are pushed farther apart in terms of KL divergence.
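
A sketch of the margin loss in Eq. 2, under the assumption that the batch of entire ACs from diseased images is summarized by a diagonal Gaussian before the KL divergence to $\mathcal{N}(\bm{0},\bm{I})$ is taken (this moment-matching step is one reasonable implementation, not necessarily the paper's exact one):

```python
import torch

def margin_loss(entire_acs_diseased: torch.Tensor, margin: float) -> torch.Tensor:
    """Hinge on the KL divergence between the (diagonal-Gaussian) batch distribution
    of entire ACs from diseased images and the standard Gaussian prior (Eq. 2)."""
    mu = entire_acs_diseased.mean(dim=0)
    var = entire_acs_diseased.var(dim=0, unbiased=False) + 1e-8
    # KL( N(mu, diag(var)) || N(0, I) ) for a diagonal Gaussian
    kl = 0.5 * torch.sum(var + mu.pow(2) - 1.0 - torch.log(var))
    return torch.clamp(margin - kl, min=0.0)
```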

2.3 Learning objectives

To train the deep-learning framework, several loss functions are defined. The reconstruction loss $L_{\mathrm{recon}}$ is a composite of a perceptual loss using the VGG network (Simonyan and Zisserman, 2015) and a mean squared error loss, which forces the reconstructed images $\hat{\bm{x}}$ to be similar to the input images $\bm{x}$. When necessary, a reconstruction loss focusing on the tumor-associated regions is additionally calculated for the entire reconstruction to force the model to generate the abnormal regions more precisely (see Section 3.2.4). The segmentation loss $L_{\mathrm{seg}}$ is a composite of a Dice loss (Dice, 1945) and a focal loss (Lin et al., 2020) between the predicted segmentation label for abnormal areas $\hat{\bm{l}}$ and its ground-truth label $\bm{l}$. The consistency loss $L_{\mathrm{cons}}$ calculates the L1 distance between the abnormal ACs $\hat{\bm{a}}$ estimated by the label encoder $E_{\mathrm{Lbl}}$ and the corresponding abnormal ACs $\bm{a}$, forcing the estimated abnormal ACs $\hat{\bm{a}}$ to be close to the corresponding abnormal ACs $\bm{a}$. The abnormality loss $L_{\mathrm{abn}}$ applies only when a healthy image is given and forces the norm of the abnormal ACs $\bm{a}$ to zero. The regularization loss $L_{\mathrm{reg}}$ also applies only when a healthy image is given, matching the posterior distribution estimated by the encoder, $E_{\mathrm{NAC}}(\bm{n}\,|\,\bm{x}^{\mathrm{h}})=\mathcal{N}(\bm{\mu}(\bm{x}^{\mathrm{h}}),\bm{\sigma}(\bm{x}^{\mathrm{h}}))$, to the prior distribution $\mathcal{N}(\bm{0},\bm{I})$ by minimizing the KL divergence between them, as follows:

L_{\mathrm{reg}}=\mathrm{KL}\bigl(\mathcal{N}(\bm{\mu}(\bm{x}^{\mathrm{h}}),\bm{\sigma}(\bm{x}^{\mathrm{h}}))\,\|\,\mathcal{N}(\bm{0},\bm{I})\bigr). \quad (3)

Finally, the margin loss $L_{\mathrm{margin}}$ applies only when a diseased image is given, imposing a distance between the distribution of entire ACs from diseased images $\mathcal{D}(\bm{e}\,|\,\bm{x}^{\mathrm{d}})$ and the Gaussian distribution $\mathcal{N}(\bm{0},\bm{I})$, as formulated in Eq. 2. In the training process, the weighted sum of the abovementioned loss functions is minimized:

L_{\mathrm{total}}=w_{\mathrm{recon}}L_{\mathrm{recon}}+w_{\mathrm{seg}}L_{\mathrm{seg}}+w_{\mathrm{cons}}L_{\mathrm{cons}}+w_{\mathrm{abn}}L_{\mathrm{abn}}+w_{\mathrm{reg}}L_{\mathrm{reg}}+w_{\mathrm{margin}}L_{\mathrm{margin}}. \quad (4)

Here, $w_{\mathrm{recon}}$, $w_{\mathrm{seg}}$, $w_{\mathrm{cons}}$, $w_{\mathrm{abn}}$, $w_{\mathrm{reg}}$, and $w_{\mathrm{margin}}$ are the weights for the reconstruction loss $L_{\mathrm{recon}}$, segmentation loss $L_{\mathrm{seg}}$, consistency loss $L_{\mathrm{cons}}$, abnormality loss $L_{\mathrm{abn}}$, regularization loss $L_{\mathrm{reg}}$, and margin loss $L_{\mathrm{margin}}$, respectively. The full algorithm for training the feature extraction module of our SBMIR system is summarized in Algorithm 1.

Algorithm 1: Training of the feature extraction module of our SBMIR system
sg: a stop-gradient operator
while not converged do
       /* Forward path for healthy images */
       Sample a batch of healthy images $\bm{x}^{\mathrm{h}}$ and corresponding segmentation labels $\bm{l}^{\mathrm{h}}$ from the training dataset.
       $\bm{\mu},\bm{\sigma}\leftarrow E_{\mathrm{NAC}}(\bm{x}^{\mathrm{h}})$
       $\bm{n}=\bm{\mu}+\bm{\sigma}\odot\bm{\epsilon},\ \bm{\epsilon}\sim\mathcal{N}(\bm{0},\bm{I})$
       $\bm{a}=E_{\mathrm{AAC}}(\bm{x}^{\mathrm{h}})$
       $\bm{e}=\bm{n}+\bm{a}$
       $\hat{\bm{x}}^{\mathrm{h}}=D_{\mathrm{Img}}(\bm{e})$
       $\hat{\bm{x}}^{\mathrm{h}}=D_{\mathrm{Img}}(\bm{n})$
       $\hat{\bm{l}}^{\mathrm{h}}=D_{\mathrm{Lbl}}(\bm{a})$
       $\hat{\bm{a}}=E_{\mathrm{Lbl}}(\bm{l}^{\mathrm{h}})$
       Compute $L_{\mathrm{recon}}(\hat{\bm{x}}^{\mathrm{h}},\bm{x}^{\mathrm{h}})$, $L_{\mathrm{seg}}(\hat{\bm{l}}^{\mathrm{h}},\bm{l}^{\mathrm{h}})$, $L_{\mathrm{cons}}(\hat{\bm{a}},\mathrm{sg}(\bm{a}))$, $L_{\mathrm{abn}}(\bm{a})$, and $L_{\mathrm{reg}}(\bm{\mu},\bm{\sigma})$.
       Update the parameters of $E_{\mathrm{NAC}}$, $E_{\mathrm{AAC}}$, $E_{\mathrm{Lbl}}$, $D_{\mathrm{Img}}$, and $D_{\mathrm{Lbl}}$ to minimize $w_{\mathrm{recon}}L_{\mathrm{recon}}+w_{\mathrm{seg}}L_{\mathrm{seg}}+w_{\mathrm{cons}}L_{\mathrm{cons}}+w_{\mathrm{abn}}L_{\mathrm{abn}}+w_{\mathrm{reg}}L_{\mathrm{reg}}$ using stochastic gradient descent (e.g., Adam).

       /* Forward path for diseased images */
       Sample a batch of diseased images $\bm{x}^{\mathrm{d}}$ and corresponding segmentation labels $\bm{l}^{\mathrm{d}}$ from the training dataset.
       $\bm{\mu}\leftarrow E_{\mathrm{NAC}}(\bm{x}^{\mathrm{d}})$
       $\bm{n}\leftarrow\mathrm{sg}(\bm{\mu})$
       $\bm{a}=E_{\mathrm{AAC}}(\bm{x}^{\mathrm{d}})$
       $\bm{e}=\bm{n}+\bm{a}$
       $\hat{\bm{x}}^{\mathrm{d}}=D_{\mathrm{Img}}(\bm{e})$
       $\hat{\bm{l}}^{\mathrm{d}}=D_{\mathrm{Lbl}}(\bm{a})$
       $\hat{\bm{a}}=E_{\mathrm{Lbl}}(\bm{l}^{\mathrm{d}})$
       Compute $L_{\mathrm{recon}}(\hat{\bm{x}}^{\mathrm{d}},\bm{x}^{\mathrm{d}})$, $L_{\mathrm{seg}}(\hat{\bm{l}}^{\mathrm{d}},\bm{l}^{\mathrm{d}})$, $L_{\mathrm{cons}}(\hat{\bm{a}},\mathrm{sg}(\bm{a}))$, and $L_{\mathrm{margin}}(\bm{e})$.
       Update the parameters of $E_{\mathrm{AAC}}$, $E_{\mathrm{Lbl}}$, $D_{\mathrm{Img}}$, and $D_{\mathrm{Lbl}}$ to minimize $w_{\mathrm{recon}}L_{\mathrm{recon}}+w_{\mathrm{seg}}L_{\mathrm{seg}}+w_{\mathrm{cons}}L_{\mathrm{cons}}+w_{\mathrm{margin}}L_{\mathrm{margin}}$ using stochastic gradient descent (e.g., Adam).
end while
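
A rough PyTorch-style rendering of one pass of Algorithm 1 follows. It is a sketch under assumptions: the module, loss, and weight containers are illustrative, and the two separate parameter updates of Algorithm 1 are merged here into a single optimizer step for brevity.

```python
import torch

def training_step(batch_h, batch_d, modules, losses, weights, optimizer):
    """One combined update over a healthy sub-batch and a diseased sub-batch.
    `modules` bundles E_NAC, E_AAC, E_Lbl, D_Img, D_Lbl; `losses`/`weights` follow Eq. 4."""
    x_h, l_h = batch_h            # healthy images and all-zero labels
    x_d, l_d = batch_d            # diseased images and tumor-associated labels

    # --- Healthy path: the VAE component is trainable ---
    mu, logvar = modules.e_nac(x_h)
    n = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterization
    a = modules.e_aac(x_h)
    e = n + a
    loss_h = (weights.recon * (losses.recon(modules.d_img(e), x_h)
                               + losses.recon(modules.d_img(n), x_h))
              + weights.seg * losses.seg(modules.d_lbl(a), l_h)
              + weights.cons * losses.cons(modules.e_lbl(l_h), a.detach())
              + weights.abn * losses.abn(a)
              + weights.reg * losses.reg(mu, logvar))

    # --- Diseased path: the VAE component is used for inference only ---
    with torch.no_grad():
        mu_d, _ = modules.e_nac(x_d)
    n_d = mu_d                    # stop-gradient normal AC (pseudo-normal baseline)
    a_d = modules.e_aac(x_d)
    e_d = n_d + a_d
    loss_d = (weights.recon * losses.recon(modules.d_img(e_d), x_d)
              + weights.seg * losses.seg(modules.d_lbl(a_d), l_d)
              + weights.cons * losses.cons(modules.e_lbl(l_d), a_d.detach())
              + weights.margin * losses.margin(e_d))

    optimizer.zero_grad()
    (loss_h + loss_d).backward()
    optimizer.step()
    return loss_h.item(), loss_d.item()
```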

3 Implementation

3.1 Datasets

Figure 6: Examples in the datasets. a Example images with tumor-associated labels and normal-anatomy-associated labels in the glioma testing dataset are shown. b Example images with tumor-associated labels and normal-anatomy-associated labels in the lung cancer testing dataset are shown. Note that in the normal-anatomy-associated labels, regions annotated with tumor-associated labels were assigned normal anatomical classes that should be present therein and vice versa. ET, gadolinium-enhancing tumor; ED, peritumoral edema; NET, necrotic and non-enhancing tumor-core; T1CE, T1-weighted contrast-enhanced sequence.

Two types of datasets, a glioma dataset and a lung cancer dataset, were used in the model training and evaluation, each of which was split into a training dataset and a testing dataset.

3.1.1 Glioma dataset

The glioma dataset initially aggregated three sets of brain MRI scans with glioma obtained from the MICCAI 2019 BraTS Challenge (Menze et al., 2015; Bakas et al., 2017c, a, b): a dataset of 51,925 slices from 335 patients (MICCAI_BraTS_Training), a dataset of 19,375 slices from 125 patients (MICCAI_BraTS_Validation), and a dataset of 25,730 slices from 166 patients (MICCAI_BraTS_Testing). The patients were from multiple hospitals. Each MRI scan consists of T1-weighted (T1), T1-weighted contrast-enhanced (T1CE), T2-weighted, and fluid-attenuated inversion recovery (FLAIR) sequences. As shown in Fig. 6a, the tumor-associated segmentation labels include three classes: gadolinium (Gd)-enhancing tumor (ET), peritumoral edema (ED), and necrotic and non-enhancing tumor core (NET), while the normal-anatomy-associated labels include six classes: left cerebrum, right cerebrum, left cerebellum, right cerebellum, left ventricle, and right ventricle. Except for the tumor-associated labels in MICCAI_BraTS_Training, we supplemented each dataset with tumor-associated labels according to the procedure described in a previous study (Kobayashi et al., 2021). Normal-anatomy-associated labels were generated using the software BrainSuite (version 19a) (Shattuck and Leahy, 2002). We then assigned the datasets for training and testing our SBMIR system as follows: the glioma training dataset ($N=$ 45,105 slices from 291 patients) consists of MICCAI_BraTS_Validation and MICCAI_BraTS_Testing, and the glioma testing dataset ($N=$ 51,925 slices from 335 patients) consists of MICCAI_BraTS_Training. Note that the original names of the datasets in the 2019 BraTS Challenge differ from the notations used here, which reflect their purpose in this study. Following the previous study (Kobayashi et al., 2021), this assignment was deemed appropriate because the performance of our SBMIR system is then evaluated on data with widely accepted, publicly available ground-truth tumor-associated labels (MICCAI_BraTS_Training), ensuring objectivity and reproducibility.

3.1.2 Lung cancer dataset

The lung cancer dataset consists of chest CT scans from 1,000 patients with lung cancer collected from a single hospital. The study, data use, and data protection procedures were approved by the Ethics Committee of the National Cancer Center, Tokyo, Japan (protocol number 2016-496). The requirement for informed consent was waived in view of the retrospective nature of the study. All procedures followed applicable laws and regulations and the Declaration of Helsinki. As shown in Fig. 6b, the tumor-associated segmentation labels included one class, primary tumor (PT), the region of which was segmented by an expert radiation oncologist (K.K.). Other potential tumor-associated regions, such as lymph node metastases, were not annotated because the model training was conducted in the lung window, making it difficult to identify diseased areas in soft tissues such as the mediastinum. The normal-anatomy-associated labels included five classes: right upper lobe, right middle lobe, right lower lobe, left upper lobe, and left lower lobe. These five labels were generated by an off-the-shelf deep-learning model available from a public repository (Hofmanninger et al., 2020). We then split the patients randomly into the lung cancer training dataset ($N=$ 49,696 slices from 600 patients) and the lung cancer testing dataset ($N=$ 33,572 slices from 400 patients). Notably, the large testing dataset was beneficial for assessing the effectiveness of image search because it helps ensure that corresponding images exist for individual user queries.

3.2 Training settings

The detailed training settings of the models are described here. We first determined the hyperparameters using the glioma training dataset and then applied most of them to the lung cancer training dataset, except for the margin parameter. Note that the two datasets differ in the spatial resolution of the input images; the image size was set to $256\times 256$ for the glioma training dataset and $512\times 512$ for the lung cancer training dataset. It is also important to note that the average voxel volume of the tumor-associated regions in the lung cancer testing dataset ($1.0\times 10^{4}$) was much smaller than that in the glioma testing dataset ($4.1\times 10^{4}$).

3.2.1 Preprocessing of the datasets

For the glioma dataset (i.e., the glioma training dataset and the glioma testing dataset), the T1, T1CE, and FLAIR sequences were concatenated into a three-channel MR volume. Then, $Z$-score normalization was applied channel-wise, restricted to the area inside the body. Subsequently, each 3D MR volume was decomposed into a collection of three-channel 2D axial slices, which were resized to $3\times 256\times 256$. For the lung cancer dataset (i.e., the lung cancer training dataset and the lung cancer testing dataset), the voxel values were normalized into a lung window with a window width of 1500 and a window center of -550. Then, to set the number of channels to three, as in the glioma dataset, each CT volume was decomposed into a collection of three-channel 2D axial slices by concatenating each slice with its adjacent upper and lower slices, giving an input tensor of size $3\times 512\times 512$. This 2.5-dimensional approach for the CT slices is valid because the diagnosis of lung nodules usually requires inspecting adjacent slices to distinguish abnormal structures from normal structures such as vessels. Random rotation and random horizontal flipping were applied as data augmentation for training the models.
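
A rough sketch of the CT preprocessing described above; only the window width/center and the 2.5D stacking follow the text, while the exact clipping and scaling convention and the edge-slice handling are assumptions:

```python
import numpy as np

def apply_lung_window(volume_hu: np.ndarray, center: float = -550.0, width: float = 1500.0) -> np.ndarray:
    """Normalize CT voxel values (HU) into the lung window and scale to [0, 1]."""
    lo, hi = center - width / 2.0, center + width / 2.0
    return (np.clip(volume_hu, lo, hi) - lo) / (hi - lo)

def to_25d_slices(volume: np.ndarray) -> np.ndarray:
    """Turn a (Z, H, W) volume into (Z, 3, H, W) by stacking each slice with its
    upper and lower neighbors (edge slices reuse themselves here)."""
    z = volume.shape[0]
    upper = volume[np.clip(np.arange(z) - 1, 0, z - 1)]
    lower = volume[np.clip(np.arange(z) + 1, 0, z - 1)]
    return np.stack([upper, volume, lower], axis=1)
```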

3.2.2 Implementation of the neural networks

All neural networks were implemented in Python 3.8 with PyTorch library 1.10.0 (Paszke et al., 2019) using an NVIDIA Tesla V100 graphics processing unit with CUDA 10.2. We implemented almost the same neural network architecture for the two training datasets (i.e., the glioma training dataset and the lung cancer training dataset), the purpose of which was to maintain the same compression ratio from the input image to the ACs as latent representations. See C for the detailed network architectures.

3.2.3 Hyperparameters for the glioma dataset

For the glioma training dataset, the hyperparameters shared across the configurations were as follows: batch size = 200, number of training epochs = 300, learning rate = $1.0\times 10^{-4}$, weight decay = $1.0\times 10^{-5}$, $w_{\mathrm{recon}}=1.0$, $w_{\mathrm{seg}}=10.0$, $w_{\mathrm{cons}}=1.0$, $w_{\mathrm{abn}}=0.1$, $w_{\mathrm{reg}}=0.1$, $w_{\mathrm{margin}}=0.1$, and $m=10$. We determined these hyperparameters by grid search over the following candidate values: $w_{\mathrm{recon}}\in\{1.0,10.0,100.0\}$, $w_{\mathrm{seg}}\in\{1.0,10.0\}$, $w_{\mathrm{cons}}\in\{0.1,1.0\}$, $w_{\mathrm{abn}}\in\{0.1,1.0\}$, $w_{\mathrm{reg}}\in\{0.1,1.0\}$, $w_{\mathrm{margin}}\in\{0.1,1.0\}$, and $m\in\{0,5,10,20,40\}$.

3.2.4 Hyperparameters for the lung cancer dataset

For the lung cancer training dataset, the hyperparameters shared across the configurations were as follows: batch size = 144, number of training epochs = 50, learning rate = $1.0\times 10^{-4}$, weight decay = $1.0\times 10^{-5}$, $w_{\mathrm{recon}}=1.0$, $w_{\mathrm{seg}}=10.0$, $w_{\mathrm{cons}}=1.0$, $w_{\mathrm{abn}}=0.1$, $w_{\mathrm{reg}}=0.1$, $w_{\mathrm{margin}}=0.1$, and $m=40$. Owing to the relatively small areas of the tumor-associated regions, the reconstruction loss focusing on the tumor-associated regions was additionally calculated on top of the entire reconstruction. Almost all of the above hyperparameters were carried over from the model trained on the glioma training dataset, except for the margin parameter $m$, which was determined by grid search over the candidate values $\{0,20,40,60,80\}$.

3.3 Image retrieval pipeline

Figure 7: Similarity calculation module of our sketch-based medical image retrieval (SBMIR) system. The similarity calculation module of our SBMIR system is staged in five steps. First, a query vector extracted from a user query, which is the combination of a template image and a semantic sketch, is obtained from the feature extraction module. Second, the approximate Euclidean distances between the query vector and the reference vectors of every slice in the reference volumes are computed. Third, the slices with reference vectors close to the query vector are identified irrespective of the reference volumes. Fourth, the extracted slices are rearranged to select the most similar slice in each reference volume. Fifth, the top-$K$ most similar images belonging to different reference volumes are presented to the user as the search results.

After the model is trained, the whole SBMIR system combining the feature extraction module and the similarity calculation module can be implemented. Using the feature extraction module, reference images in a database were converted into reference vectors before user operation (see the right part of Fig. 1c). Each reference vector was constructed from a reference image as the sum of a normal AC (through the normal AC encoder) and an abnormal AC (through the abnormal AC encoder). At the time of the image retrieval, the two-step user operation constructs a query vector to meet the information needs (see the left part of Fig. 1c): first, selecting a template image to specify the location where the target image content should exist, and second, sketching the semantic segmentation label of the disease to express the image content therein. Then, the normal AC encoder extracts a normal AC from the template image, and the label encoder extracts an abnormal AC from the semantic sketch of the disease, both of which are summed as the query vector according to Eq. 1.

In the similarity calculation module (see the middle part of Fig. 1c), the distances between the reference vectors and the query vector are computed as approximate Euclidean distances using the Annoy library (Bernhardsson, 2022). The reference images whose reference vectors are close to the query vector are then rearranged according to their reference volumes. Finally, the top-$K$ similar images, each belonging to a different reference volume, are obtained. Note that this volume-wise similarity calculation is essential to avoid the redundancy caused by consecutive slices with similar appearances within a single 3D volume. Figure 7 details the similarity calculation module that realizes volume-wise image retrieval in our SBMIR system.
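
A sketch of the similarity calculation using the Annoy library; the index parameters (number of trees, oversampling factor) and the deduplication details are illustrative choices, while the Euclidean metric and volume-wise selection follow the text and Fig. 7:

```python
from annoy import AnnoyIndex

def build_reference_index(reference_vecs, dim, n_trees=50):
    """Build an approximate nearest-neighbor index over all reference slice vectors."""
    index = AnnoyIndex(dim, 'euclidean')
    for i, vec in enumerate(reference_vecs):
        index.add_item(i, vec)
    index.build(n_trees)
    return index

def search_volume_wise(index, query_vec, slice_to_volume, k=5, oversample=20):
    """Retrieve the top-k most similar slices, keeping only the best slice per volume."""
    ids, dists = index.get_nns_by_vector(query_vec, k * oversample, include_distances=True)
    best = {}
    for i, d in zip(ids, dists):
        vol = slice_to_volume[i]
        if vol not in best:          # keep the closest slice of each reference volume
            best[vol] = (i, d)
        if len(best) == k:
            break
    return list(best.values())
```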

3.4 Implementation of our SBMIR system on the glioma dataset

The feature extraction module was trained on the glioma training dataset. Then, the image retrieval pipeline was implemented on the glioma testing dataset for evaluation. The template volume was selected as the image series with the minimum volume of tumor-associated regions in the glioma testing dataset, because the smaller the tumor volume, the smaller the expected deviation from normal structures. An ideal template volume would be a totally healthy image series; however, the glioma dataset did not contain an image series without abnormal findings, so this choice can be regarded as the second-best setting for the template volume. For gliomas, we found that users can specify the following types of information regarding the target images to be retrieved: the location, shape, size, and internal characteristics of the tumor (see Section 5.1), which we call fine-grained characteristics of the disease. Among these, the location, shape, and size of the target diseased area can be defined by selecting a template image and sketching the outer edge of the tumor. Furthermore, depending on which combination of tumor-associated labels (i.e., ET, ED, and NET) is used to sketch the tumor, the internal characteristics of the tumor (e.g., contrast enhancement effect, necrosis, presence of edema, etc.) can be expressed.

Table 1: Backgrounds of the evaluators and participation in the user test. Ten and nine healthcare professionals participated in the evaluations of gliomas and lung cancers, respectively.
Evaluators | Background | Years of experience
#1 | Radiation oncologist | 10 - 19
#2 | Medical oncologist | 10 - 19
#3 | Diagnostic radiologist | 20 - 29
#4 | Diagnostic radiologist | 20 - 29
#5 | Neurosurgeon | 20 - 29
#6 | Colorectal surgeon | 10 - 19
#7 | Thoracic surgeon | 20 - 29
#8 | Medical physicist | 10 - 19
#9 | General surgeon | 1 - 9
#10 | Medical researcher | 1 - 9

3.5 Implementation of our SBMIR system on the lung cancer dataset

After training the feature extraction module using the lung cancer training dataset, the image retrieval pipeline of our SBMIR system was implemented on the lung cancer testing dataset for evaluation. The template image series was also selected to be the one with the minimum volume of tumor-associated regions in the lung cancer testing dataset, similar to the glioma dataset. For lung cancers, we found that users can identify fine-grained characteristics of the disease, including the location, shape, and size of a tumor (see Section 5.1); however, in contrast with gliomas, the internal characteristics of the tumor (e.g., ground-glass opacity and solid tumor components) are not explicitly expressed by the model because only a single-class tumor-associated segmentation label (i.e., PT) was given in the training dataset.

4 Evaluation

The present study comprehensively evaluates our SBMIR system from technological and clinical standpoints. As for technical evaluations, the training results of the feature decomposition, hyperparameter optimization focusing on the image retrieval performance, and ablation studies are described in D, E, and F, respectively. In this section, we explain the details of the clinical evaluation, which focuses on how our SBMIR system can help the information-seeking objectives of healthcare professionals.

The flexibility of the query-by-sketch approach comes at a cost: user queries cannot be standardized, and ground truth for the retrieved images cannot be prepared in advance. Therefore, user testing is the most valid evaluation scheme for assessing the image retrieval performance of our SBMIR system. A group of healthcare professionals with various clinical backgrounds, including radiologists, physicians, surgeons, medical physicists, and researchers, participated in the user tests as the evaluators (see Table 1). The user tests were conducted using a dedicated GUI (see Fig. 2b).

4.1 Definition of clinical similarity

We developed criteria for the evaluators to determine whether the retrieved image was clinically similar to the features specified by the user query for each dataset.

4.1.1 Clinical similarity of glioma images

For gliomas, our SBMIR system can specify the location, shape, size, and internal characteristics of the tumor, as demonstrated in Section 5.1. Among these characteristics, we considered that determining whether the shape is consistent or not can be too subjective to standardize among evaluators. Hence, the clinically similar images were defined to be the images that met the following three criteria: (1) a difference in the maximum tumor diameter of within 2 cm; (2) the same location of the tumor according to the brain lobe (i.e., right frontal lobe, right parietal lobe, right occipital lobe, right temporal lobe, left frontal lobe, left parietal lobe, left occipital lobe, and left temporal lobe); (3) the same pattern of the contrast enhancement (e.g., the presence of contrast-enhancement and tumor necrosis). When a user intention included a more detailed location (e.g., the relative position in a brain lobe), the evaluators were required to judge whether the retrieved images matched the detailed location. Each criterion was judged with a score of 0 or 1, and a maximum score of 3 points was possible for each image, which we call the similarity score. When the similarity score was 3/3, the retrieved image was considered clinically similar to the query.

4.1.2 Clinical similarity of lung cancer images

For lung cancers, our SBMIR system can specify the location, shape, and size of the tumor, as demonstrated in Section 5.1. Note that the internal characteristics of the tumor are not explicitly expressed, as only a single class of tumor-associated segmentation labels (i.e., PT) was applied to the datasets. We also concluded that determining the concordance in the shape of lung cancer can be too subjective. Therefore, the clinical similarity was defined according to the following two criteria: (1) a difference in the maximum tumor diameter of within 2 cm; (2) the same location of the tumor according to the lung lobe (i.e., right upper lobe, right middle lobe, right lower lobe, left upper lobe, left lower lobe). When a user intention included a more detailed location (e.g., the lung segment, the apex of the lung, the pleural contact, and the hilar region), the evaluators were asked to judge whether the retrieved images matched the detailed location. Each criterion was evaluated with a score of 0 or 1, and a maximum similarity score of 2 points was possible for each image. When the similarity score was 2/2, the retrieved image was considered clinically similar to the query.

Figure 8: Evaluation flow for our sketch-based medical image retrieval (SBMIR) system. a To prepare question items for each test, we conducted the following steps. First, a set of representative images was identified in a testing dataset. Second, isolated samples were identified and excluded from the set, and five of them were randomly assigned to Test-3. Third, from the remaining representative images, those with at least one clinically similar image were extracted in the pre-evaluation process based on the direct comparison of Dice similarities. Fourth, expert radiologists selected representative images for the question items of the practice stage, Test-1, and Test-2, in order to vary the clinical features as much as possible. b In the user tests, the healthcare professionals first learned how to operate our SBMIR system in the practice stage. After that, retrieval tests for presented images (Test-1), presented sentences (Test-2), and isolated samples (Test-3) were conducted sequentially.

4.2 Evaluation metrics for the retrieved images

We devised two types of evaluation metrics for the retrieved images – user-oriented and objective metrics. Every evaluation metric was averaged among the evaluators (N = 10 for gliomas and N = 9 for lung cancers, as shown in Table 1), and the mean ± standard deviation of each metric was reported.

4.2.1 User-oriented evaluation metrics

The user-oriented metrics included precision@K, reciprocal rank, and normalized discounted cumulative gain (nDCG), which reflect how clinically similar the retrieved results were to the intent of the query based on the judgment of the evaluators. These were evaluated on a 2D slice basis; that is, when assigning the similarity score used to calculate these metrics, each evaluator interpreted the consistency between the intention of the query and the retrieved images without considering the adjacent slices of each retrieved image. Because the user tests are characterized by the large size of the datasets (N = 51,925 images in the glioma testing dataset and N = 33,572 images in the lung cancer testing dataset) and by the fine-grained attributes that user queries can specify, the user-oriented metrics provide a straightforward assessment of how many of the top-K retrieved images satisfy the intent of the query. We defined the following three user-oriented metrics.

Precision@K (Shirahatti and Barnard, 2005) is the ratio between the number of clinically similar images in the top-K retrieved images and the number of retrieved images K, which is formulated as

\mathrm{Precision@K}=\frac{|\mathcal{S}\cap\mathcal{L}_{K}|}{K}, (5)

where $\mathcal{S}$ is the set of clinically similar images judged by an evaluator, $\mathcal{L}_{K}$ is the list of the top-K retrieved images, and K is the number of retrieved images.

The reciprocal rank (Pedronette and Torres, 2015) is calculated from the inverse of the rank position of the first clinically similar image in the top-K retrieved images, which is defined as

\mathrm{Reciprocal\,rank}=\frac{1}{k}, (6)

where $k$ is the rank position of the first clinically similar image in the top-K retrieved images. When there is no clinically similar image in the retrieval list, the reciprocal rank is set to 0.

The nDCG (Wang et al., 2013) aggregates the similarity scores of the retrieved images with a rank-dependent discount, defined as

\mathrm{nDCG}=\frac{1}{\mathrm{DCG}_{\mathrm{perfect}}}\left(s_{1}+\sum^{K}_{k=2}\frac{s_{k}}{\log_{2}k}\right), (7)

where $s_{k}$ indicates the similarity score of the $k$-th ranked image, $k$ is the rank position, and $\mathrm{DCG}_{\mathrm{perfect}}$ represents the maximum DCG of an ideal retrieval result, normalizing the value to the range from 0 to 1. The role of the logarithmic terms is to discount the scores of images retrieved at lower ranks.
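A minimal sketch of the three user-oriented metrics is shown below; it assumes the per-image judgments have already been collected, and the ideal result used for DCG_perfect is assumed to assign the maximum similarity score to all K images.

```python
import math

def precision_at_k(is_similar):
    # is_similar: booleans for the top-K retrieved images, in retrieved order (Eq. 5).
    return sum(is_similar) / len(is_similar)

def reciprocal_rank(is_similar):
    # Inverse rank of the first clinically similar image (Eq. 6); 0 if none is found.
    for rank, hit in enumerate(is_similar, start=1):
        if hit:
            return 1.0 / rank
    return 0.0

def ndcg(scores, max_score):
    # scores: similarity scores s_k of the top-K images in retrieved order (Eq. 7).
    # max_score: maximum per-image score (3 for gliomas, 2 for lung cancers);
    # the ideal result is assumed to assign max_score to every one of the K images.
    dcg = scores[0] + sum(s / math.log2(k) for k, s in enumerate(scores[1:], start=2))
    dcg_perfect = max_score * (1 + sum(1 / math.log2(k) for k in range(2, len(scores) + 1)))
    return dcg / dcg_perfect

# Example: top-5 glioma retrievals with similarity scores out of 3.
scores = [3, 2, 3, 1, 0]
hits = [s == 3 for s in scores]
print(precision_at_k(hits), reciprocal_rank(hits), ndcg(scores, max_score=3))
```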

Note that precision@5 can be influenced by how many clinically similar images for each user query were originally included in the testing dataset. As such, precision@5 for rare images can be small even when the image retrieval works successfully. Hence, we prepared the user tests to guarantee that the lower bound of precision@5 will be 0.2 when the image retrieval works properly, which will be described in detail in Section 4.3.

4.2.2 Objective evaluation metrics

Recall@K was defined as an objective metric that can be assessed independently of the evaluator’s interpretation of the retrieval results. This objective metric was determined on a 3D volume basis, reflecting whether images belonging to the same volume as the presented image, which we call the same-volume images, are listed among the top-K retrieved images.

Recall@K (Shirahatti and Barnard, 2005) is automatically calculated based on whether the same-volume images are acquired in the top-K retrieved images, which is formulated as

\mathrm{Recall@K}=\begin{cases}1&(i\in\mathcal{L}_{K})\\ 0&(i\notin\mathcal{L}_{K}),\end{cases} (8)

where $i$ is the same-volume image, and $\mathcal{L}_{K}$ is the list of the top-K retrieved images.
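A minimal sketch of this check, assuming the retrieved list is represented by the volume IDs of the top-K slices (an illustrative convention, not the released interface):

```python
def recall_at_k(presented_volume_id, retrieved_volume_ids):
    # 1 if a slice from the same volume as the presented image appears among the
    # top-K retrieved images, and 0 otherwise (Eq. 8).
    return int(presented_volume_id in retrieved_volume_ids)

print(recall_at_k(12, [3, 12, 40, 7, 25]))  # -> 1
```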

4.3 Preparation for the user tests

As illustrated in Fig. 8a, we prepared question items for the practice stage and the three stages of user tests, including Test-1, Test-2, and Test-3. Each stage included five question items, consisting of either images or text descriptions, presented to the evaluators to retrieve medical images consistent with the clinical characteristics in each question item. To prepare question items, we focused on representative images. A representative image is defined as the 2D axial slice containing the largest tumor-associated region in each 3D volume, which is usually considered to be the image that best characterizes the clinical meaning of the volume.

Each question item was developed based on a specific representative image assigned in the following steps. First, from a set of representative images in each testing dataset, we identified isolated samples that were “not” included in the 5-nearest neighbors (NN) groups of any other images using a fine-tuned ResNet-101 (see B) for exclusion from Test-1 and Test-2. Five of these isolated samples were randomly selected for Test-3. Then, pre-evaluation was conducted on the remaining representative images. The purpose of the pre-evaluation was to identify a set of representative images that were certain to have at least one clinically similar image in the testing dataset. To confirm this, we directly compared Dice similarities (Dice, 1945) to maximize the average overlap of the tumor-associated labels and the normal-anatomy-associated labels (see Fig. 6). As described previously (Kobayashi et al., 2021), a CBMIR based on the direct comparison of Dice similarities can be considered an oracle for retrieving images similar to a query image. The Dice similarities were computed between each representative image and all the other images in the dataset, and the similar images were then rearranged volume-wise, in the same manner as shown in Fig. 7. Subsequently, whether the top-5 retrieved images based on the Dice similarities included at least one clinically similar image according to the similarity score was evaluated from a clinical perspective, and the representative images that did not meet this criterion were excluded (i.e., the lower bound of precision@5 can be 0.2). Lastly, the manual selection was performed to assign as diverse a selection of disease phenotypes as possible in each testing stage, considering the tumor location and other disease characteristics.
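A minimal sketch of the Dice similarity underlying this pre-evaluation, assuming binary NumPy masks per label class; the averaging over the tumor-associated and normal-anatomy-associated labels is shown schematically.

```python
import numpy as np

def dice_similarity(mask_a, mask_b, eps=1e-8):
    # Dice coefficient (Dice, 1945) between two binary masks of the same shape.
    mask_a = mask_a.astype(bool)
    mask_b = mask_b.astype(bool)
    intersection = np.logical_and(mask_a, mask_b).sum()
    return 2.0 * intersection / (mask_a.sum() + mask_b.sum() + eps)

def average_dice(masks_a, masks_b):
    # Mean Dice over the per-class masks (e.g., tumor-associated and
    # normal-anatomy-associated labels) of two images, used to rank candidates.
    return float(np.mean([dice_similarity(a, b) for a, b in zip(masks_a, masks_b)]))

# Example with two classes of 8x8 placeholder masks.
rng = np.random.default_rng(0)
image_a = [rng.integers(0, 2, (8, 8)) for _ in range(2)]
image_b = [rng.integers(0, 2, (8, 8)) for _ in range(2)]
print(average_dice(image_a, image_b))
```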

After assigning five representative images to each stage, the practice stage, Test-1, and Test-3 used the assigned images as question items presented to the evaluators. For Test-2, text descriptions expressing specific clinical findings, written to simulate radiology reports, were prepared based on the assigned images and used as question items. Because it is generally difficult to fully communicate the image characteristics of a medical image by text, the originally assigned representative images were only used as references for writing the text descriptions and were not used in the evaluation process.

4.4 Flow of the user tests

Figure 8b illustrates the flow of the user tests, consisting of the practice stage, Test-1, Test-2, and Test-3. For each of the testing datasets, the evaluators first underwent the practice stage. In the practice stage, five sets of a representative image and a ground-truth tumor-associated label were consecutively presented to the evaluators. By referring to the ground-truth tumor-associated labels as guidance for sketching the diseases, they were able to learn how to construct a user query to specify the content of medical images. Also, they received automated feedback on whether the same-volume image was listed in the top-5 retrieved images to confirm the validity of the user query. This practice stage is important to demonstrate that our SBMIR system can be learned easily with minimum practice. Then, three types of user tests were conducted as follows. Test-1 demonstrated the image retrieval performance when example images were available (see Section 5.2). Test-2 revealed the image retrieval performance without example images (see Section 5.3). Test-3 investigated the image retrieval performance for isolated samples (see Section 5.4). Each test stage contained five question items indicated by images or text descriptions, which is described in detail in G. User-oriented metrics were independently evaluated by each evaluator in Test-1 and Test-2, while an objective metric was automatically assessed in Test-1 and Test-3.

4.5 Comparison with conventional CBMIR methods

A conventional CBMIR system using the query-by-example approach was implemented for comparison. The fine-tuned ResNet-101 was used as a feature extractor (see A), and the image retrieval pipeline was built in a similar manner to our SBMIR system (see Fig. 7). In Test-1, where example images were available, the image retrieval results were evaluated based on the user-oriented metrics. An expert diagnostic radiologist (M.M.) and an expert radiation oncologist (K.K.) were responsible for the aforementioned processes that required clinical perspectives, including the pre-evaluation and manual selection in Section 4.3.

5 Results and discussion

Figure 9: Image retrieval performed according to fine-grained characteristics. a Example search results from our sketch-based medical image retrieval (SBMIR) system on the glioma testing dataset. b Example search results of our SBMIR system on the lung cancer testing dataset. These results show that the fine-grained characteristics of the retrieved images change depending on how the tumor is sketched and which template image is selected. Gd, gadolinium; T1, T1-weighted sequence; T1CE, T1-weighted contrast-enhanced sequence; FLAIR, fluid-attenuated inversion recovery sequence.

5.1 SBMIR retrieved medical images according to fine-grained imaging characteristics

Here, we demonstrate that our SBMIR system enables medical image retrieval to be performed according to fine-grained characteristics, including the location, shape, size, and internal characteristics of a disease. Because a query vector can be specified based on which 2D template image is selected from the 3D template volume and how the disease is sketched, we observed how the search results changed when each piece of information in a user query was varied.

Figure 9a demonstrates medical image retrieval performed according to the fine-grained characteristics of gliomas. The radiological components of gliomas were categorized into three classes using tumor-associated segmentation labels (ET, ED, and NET), all of which can be sketched on a selected template image. We started with a user query consisting of only a template image without a sketch, to retrieve a normal image with similar anatomy to the template image (see the first row). This result was consistent with the assumption that normal ACs convey only information characterizing normal anatomy. We then drew a semantic sketch representing a non-enhancing tumor surrounded by mild peritumoral edema in the left frontal lobe on the same template image, which successfully retrieved an image containing the intended characteristics (see the second row). When we changed the internal characteristics of the tumor to exhibit ring enhancement, the retrieval results again changed according to the intention of the query (see the third row). Finally, we sketched a similar disease at a different location, the left temporal lobe, on a different template image, whereby the retrieved image also contained a tumor with the specified features at the intended location (see the fourth row). The retrieved image in the second row suggests a low-grade glioma, and those in the third and fourth rows suggest high-grade gliomas (The Cancer Genome Atlas Research Network, 2015). Hence, by configuring the three types of tumor-associated segmentation labels to specify the image content, even an image with a particular differential diagnosis can be retrieved.

Figure 9b shows that flexible medical image retrieval can also be successfully performed for lung cancers. Only a single class of tumor-associated segmentation label representing PT can be sketched on a selected template image because, in contrast with gliomas, the internal characteristics of lung cancers were not explicitly modeled. We started by sketching a tumor with a triangular shape in the left upper lobe on a template image, successfully retrieving an image with a tumor exhibiting the corresponding shape and location (see the first row). Subsequently, we altered the shape of the tumor to round, changing the retrieval results to include an image with a relatively round tumor located in the same region (see the second row). The detailed anatomical location of the semantic sketch in the same template image can also affect the search results. To demonstrate this, we sketched a round tumor at a different position on the same template image to contact the pleura, thus retrieving a tumor with pleural contact (see the third row). Lastly, we sketched a similar tumor in the right lower lobe on a different template image, changing the location as intended (see the fourth row). Hence, the detailed characteristics regarding the size, shape, and anatomical location of lung cancers can be reflected in the search results.

Figure 10: Five question items and their evaluation metrics in Test-1. a Based on the glioma testing dataset, question items Q.1 to Q.5 were presented to the evaluators in Test-1 to reproduce the image characteristics of the presented images in user queries for retrieving clinically similar images, including the presented one. The means ± standard deviations of the evaluation metrics, including precision@5, reciprocal rank, normalized discounted cumulative gain (nDCG), and recall@5, were calculated among the evaluators (N = 10). Example queries that successfully retrieved at least one clinically similar image are shown. b The five question items in Test-1 using the lung cancer testing dataset are shown with example queries. The means ± standard deviations of the same evaluation metrics were calculated among the evaluators (N = 9). Image retrieval based on our sketch-based medical image retrieval system was successful in terms of clinical similarity according to the fine-grained characteristics in most cases for both gliomas and lung cancers. Gd, gadolinium; T1, T1-weighted sequence; T1CE, T1-weighted contrast-enhanced sequence; FLAIR, fluid-attenuated inversion recovery sequence.
Figure 11: Five question items and their evaluation metrics in Test-2. a The five question items in Test-2 on the glioma testing dataset are shown with example user queries and retrieved images. The retrieved images are shown as T1-weighted contrast-enhanced sequences. The means ± standard deviations of the evaluation metrics, including precision@5, reciprocal rank, and normalized discounted cumulative gain (nDCG), were calculated among the evaluators (N = 10). b The five question items in Test-2 on the lung cancer testing dataset are shown with example user queries and retrieved images. The means ± standard deviations of the same evaluation metrics were calculated among the evaluators (N = 9). At least one or more clinically similar images were obtained for each statement in both datasets. Hence, our sketch-based medical image retrieval system enables medical image retrieval even without example images. Gd, gadolinium.
Figure 12: Five question items and their evaluation metrics in Test-3. a The five question items, which present isolated samples to retrieve in Test-3 on the glioma testing dataset, and their evaluation metrics using recall@5 are shown. The means ± standard deviations of recall@5 were calculated among the evaluators (N = 10). b The five question items in Test-3 on the lung cancer testing dataset and their evaluation metrics are shown. The means ± standard deviations of recall@5 were calculated among the evaluators (N = 9). For both datasets, images belonging to the same volume as the isolated samples (i.e., same-volume images) were successfully retrieved by our sketch-based medical image retrieval system. Gd, gadolinium; T1, T1-weighted sequence; T1CE, T1-weighted contrast-enhanced sequence; FLAIR, fluid-attenuated inversion recovery sequence.

5.2 SBMIR outperformed the existing CBMIR even when example images were available

In Test-1, the evaluators were asked to translate the characteristics of a presented image into a user query to retrieve clinically similar images, including the images belonging to the same volume as the presented one, which we call the same-volume images. There were two primary purposes of this test. First, by observing whether the same-volume images were retrievable or not, the representation power of user queries was assessed. Second, the image retrieval performance of our SBMIR system was compared with that of the existing CBMIR system based on the query-by-example approach, which used the presented images as example images (see Section 4.5).

For gliomas, five images with various sizes, locations, and patterns of contrast enhancement were presented to the evaluators (see Fig. 10a and G.1). The average recall@5 among the evaluators was consistently high, above 0.9. Such high recall@5 values indicate that the user queries could capture most of the characteristics of the presented images and thereby successfully retrieve the same-volume images. Before the user tests, we had thought that it might be difficult for some evaluators to construct appropriate user queries based on their understanding of the presented image; however, such a skill-based limitation hindering image retrieval performance was not substantial for gliomas. Furthermore, all presented images had a precision@5 averaging above 0.6, indicating that at least three clinically similar images were listed on average among the top-5 retrieved images (see Section 4.1 for the definition of clinical similarity). For comparison, the means ± standard deviations of precision@5, reciprocal rank, nDCG, and recall@5 for the retrieval results among the question items based on the query-by-example approach were 0.32 ± 0.18, 1.00 ± 0.00, 0.68 ± 0.07, and 1.00 ± 0.00, respectively. As the presented images were also subject to image retrieval, it is natural that the recall@5 and reciprocal rank values were as high as 1.00, reflecting the fact that all the presented images were the top-ranked images. Nevertheless, the consistently higher precision@5 of our SBMIR system (i.e., all precision@5 values in Fig. 10a are above 0.32) supports its superior image retrieval performance according to fine-grained image characteristics even when example images are available. Figure 1a shows the results for Q.2, whose recall@5 (1.00 ± 0.00) was the highest, while Fig. 1b shows the results for Q.3, whose recall@5 (0.90 ± 0.32) was the lowest. These examples demonstrate the robustness of our system, as most of the retrieved images were judged to be clinically similar, even though each evaluator sketched the disease with unique details on the same or slightly different template images.

For lung cancers, five images with different sizes and locations were presented to the evaluators (see Fig. 10b and G.1). Notably, recall@5 varied among the question items. A low recall@5 suggests that the representation power of the user query was insufficient to retrieve the same-volume images. To examine the causes of this, we compared the user queries in Q.4, which exhibited the highest recall@5 (0.90 ± 0.32) (see Fig. 2a), with those in Q.1, which showed the lowest recall@5 (0.10 ± 0.32) (see Fig. 2b). Because the user queries resembled each other among the evaluators for each question item, we concluded that skill-based limitations were not the primary cause. Instead, a sketch-based limitation, which arises when a semantic sketch has insufficient representation power to specify the image characteristics of the disease phenotype, might have caused the variability of recall@5. This was because the single class of tumor-associated labels (i.e., PT) seemed insufficient for differentiating the internal characteristics of lung cancers, including ground-glass opacity and solid components. Meanwhile, the means ± standard deviations of precision@5, reciprocal rank, nDCG, and recall@5 of the query-by-example approach were 0.28 ± 0.11, 1.00 ± 0.00, 0.66 ± 0.03, and 1.00 ± 0.00, respectively. The consistently higher precision@5 (i.e., all precision@5 values in Fig. 10b are above 0.28) suggests that our SBMIR system’s overall performance for retrieving clinically similar images was still superior to that of the conventional CBMIR system despite the possible sketch-based limitation for lung cancers.

5.3 SBMIR enabled image retrieval without example images

Test-2 demonstrated that our SBMIR system could retrieve clinically similar images even without example images. This would be impossible using a conventional CBMIR system based on the query-by-example approach, posing the usability problem. Five descriptions of clinical findings, which simulate radiology reports, were presented to the evaluators. Each evaluator constructed a user query according to clinical findings recalled by interpreting the description and evaluated the clinical similarity of retrieved images to the intention of their query using the user-oriented metrics.

For gliomas, the five descriptions included various types of tumors (see Fig. 11a and G.2). The mean precision@5 ranged from 0.48 in Q.5 to 0.86 in Q.3. The mean reciprocal rank and mean nDCG were intermediate-to-high, exceeding 0.6 for all question items. Thus, most top-5 retrieved images were given high similarity scores, including more than two clinically similar images ranked high on average (see Section 4.1 for the definitions of the similarity scores). As can be seen in the example results of Q.3 (see Fig. 3a), even though each evaluator was presented with the description independently, the user queries resembled each other, retrieving clinically similar images as intended. Notably, our SBMIR system was effective in retrieving images with distinctive yet rare characteristics, for example, the so-called butterfly glioma in Q.5 (see Fig. 3b). These promising results were obtained in a situation that required sufficiently detailed descriptions for the evaluators to recall common clinical findings, which makes the evaluation stricter and more complex than in Test-1.

For lung cancers, in a more challenging task, we presented descriptions that specify the detailed anatomical location of the disease (see Fig. 11b and G.2), including the lung segment, the apex of the lung, the pleural contact, and the hilar region. This is because the radiological phenotypes of lung cancers are so diverse that the information about the size and lung lobes alone would be insufficient for the evaluators to specify their user queries. Except for Q.4, the mean precision@5 (ranging from 0.46 in Q.1 to 0.84 in Q.3) and mean reciprocal rank (ranging from 0.77 in Q.5 to 1.00 in Q.3) showed intermediate-to-high values, suggesting that at least two clinically similar images were successfully ranked higher on average. See Fig. 4a for the results of successful image retrieval with relatively similar user queries in Q.3. In contrast, for particular cases, such as the central type of lung cancer in Q.4 (see Fig. 4b), even though the mean precision@5 of 0.24 exceeded 0.2 (i.e., as described in Section 4.2.1, the lower bound of precision@5 will be 0.2 when image retrieval works properly), the mean reciprocal rank of 0.47 was still low. The detailed anatomical location, such as the hilar region, may not have been fully learned by the model, impeding localization of the area by normal ACs from the template images. We call this a template-image-based limitation, which can hinder the image retrieval performance of our SBMIR system.

5.4 SBMIR enabled image retrieval for isolated samples

Test-3 confirmed that our SBMIR system could obtain the same-volume images even when the isolated samples were presented, overcoming the searchability problem. See B for the details of how we identified the isolated samples, which were defined as images that are “not” included within the k-NN of any other images in the database. Five randomly selected isolated samples, identified based on the fine-tuned ResNet-101 features, were presented to the evaluators. Recall@5 was automatically evaluated for each isolated sample.

For gliomas, the five isolated samples were presented to the evaluators (see Fig. 12a and G.3). The mean recall@5 was greater than or equal to 0.7, implying that most evaluators succeeded in retrieving the same-volume images. Example results in Q.1 with the highest recall@5 (1.00 ± 0.00) and those in Q.2 with the lowest recall@5 (0.70 ± 0.48) are presented in Fig. 5a and Fig. 5b, respectively. Despite the slightly different user queries, the same-volume images were effectively retrieved within the top-5 retrieved images, as highlighted by the red boxes.

For lung cancers, we presented the five isolated samples to the evaluators (see Fig. 12b and G.3). Recall@5 exceeded 0.5 for all question items except Q.2, indicating that more than half of the evaluators successfully retrieved the same-volume images. The example results in Q.4, showing the highest recall@5 (0.89 ± 0.33), are shown in Fig. 6a. In contrast, the isolated sample in Q.2 was difficult to retrieve, as shown by the recall@5 of 0.44 ± 0.53 (see Fig. 6b). The low recall@5 for Q.2 may be related to a template-image-based limitation: the body and chest wall of the presented image appear much larger than those of the corresponding template image, possibly limiting the ability of the template image to represent such individual anatomical differences.

6 Conclusion

Despite tremendous advancements in CBMIR (Haq et al., 2021; Zhong et al., 2021; Fang et al., 2021; Rossi et al., 2021), little attention has been paid to the practical information-seeking objectives of healthcare professionals, overlooking the usability and searchability limitations. To overcome this, we developed the SBMIR system, which requires neither preparing example images nor sketching the entire anatomical appearance. Its most innovative aspect is the simple two-step user operation (see Fig. 2a), which can specify fine-grained characteristics of image content. The user test showed that our SBMIR system could overcome previous limitations through better image retrieval performance according to fine-grained image characteristics (see Fig. 10), image retrieval without example images (see Fig. 11), and image retrieval for isolated samples (see Fig. 12). Our SBMIR system enables users to retrieve images of interest on demand, expanding the utility of medical image databases.

Three possible sources of limitations of our SBMIR system were observed. Skill-based limitations seemed to be minimized by the practice stage. Sketch-based and template-image-based limitations are associated with the algorithm, indicating room for improvement. As malignant tumors are characterized not only by shape and size but also by internal characteristics (Aerts et al., 2014), learning internal characteristics may mitigate sketch-based limitations, which were particularly evident for lung cancers. Besides, modeling normal anatomy using detailed segmentation labels could reduce template-image-based limitations, whereas in the present study information about normal anatomy was learned in a self-supervised manner. Because our SBMIR system can be applied to other diseases as long as lesions are segmentable, the development of medical image retrieval technologies will hopefully be accelerated by addressing these issues.

The present study reminds us that database searching is an inherently interactive process. Indeed, the bidirectional communication between the user and the system to refine user queries for better results has been a fundamental concern for practical information access (Lamine, 2008; Miao et al., 2021). However, in conventional query-by-example approaches, there has been no room for such interaction in the whole process. In our SBMIR system, on the other hand, users can seek a better way to express their user intention by observing how the search results change according to which template image is selected and how the disease is sketched (see Fig. 9 and the Supplementary Video 1). Such human-machine interaction has the potential to establish reliable and trustworthy data-driven applications in medicine (Cutillo et al., 2020; Liang et al., 2022).

Acknowledgement

The authors thank the members of the Division of Medical AI Research and Development of the National Cancer Center Research Institute for their kind support. The RIKEN AIP Deep Learning Environment (RAIDEN) supercomputer system was used in this study to perform computations.

Funding

This work was supported by JST CREST (Grant Number JPMJCR1689), JST AIP-PRISM (Grant Number JPMJCR18Y4), JSPS Grant-in-Aid for Scientific Research on Innovative Areas (Grant Number JP18H04908), and JSPS KAKENHI (Grant Number JP22K07681).

Competing interests

Kazuma Kobayashi and Ryuji Hamamoto have received research funding from Fujifilm Corporation.

Contributions

K.K. conceived the study, devised the algorithms, developed the software, coordinated the user tests, performed the technical evaluation, and analyzed the results. K.K., L.G., and R. Hataya prepared the manuscript. K.K. and M.M. prepared the datasets and the question items for the user tests. K.K., T.M., M.M., and M.T. designed the framework of the user tests. K.K., T.M., M.M., H.W., M.T., Y.T., Y.Y., S.N., N.K., and A.B. participated in the user tests. Y.K. provided technical advice. T.H. and R. Hamamoto supervised the research.

Data availability

The glioma dataset is available on the website of the MICCAI BraTS Challenge (Menze et al., 2015; Bakas et al., 2017c, a, b). Note that the chest CT scans collected from the hospital and used in the lung cancer dataset remain under the custody of the hospital.

Code availability

All source codes for the training and evaluation of the present study will be publicly available on GitHub (https://github.com/Kaz-K/sketch-based-medical-image-retrieval).

Appendix A Preparation of ResNet-101 to extract image-level features

We prepared two types of feature extractors that represent the characteristics of entire images: an ImageNet-trained ResNet-101 and a fine-tuned ResNet-101. ResNet-101 is a 101-layer deep neural network that produces a 2,048-dimensional feature vector from the layer just before the final fully connected layer (He et al., 2016). Because ImageNet contains 1.28 million natural training images spanning 1,000 classes (Russakovsky et al., 2015), the ImageNet-trained ResNet-101 can be expected to capture general image features; however, there are substantial differences between ImageNet classification and medical image diagnosis (Raghu et al., 2019). Hence, we fine-tuned the ImageNet-trained ResNet-101 using the training datasets (see Section 3.1). For fine-tuning, the final 1,000-node classification layer of the ResNet-101 was removed and replaced by a one-node layer that outputs the probability of whether the input image contains abnormal findings. As our datasets contain tumor-associated labels corresponding to a relatively small portion of a large image covering a body region (see Fig. 6), we expected that fine-tuning would force the model to focus on the local pathological change in the image rather than on the global subject of the image. All the model parameters were optimized according to the training settings described previously (Kobayashi et al., 2021). For each training dataset, the models were trained for 100 epochs. The resulting image-wise abnormality classification performance of the fine-tuned ResNet-101 was as follows: the means ± standard deviations of accuracy, precision, recall, and specificity were 0.94 ± 0.07, 0.94 ± 0.20, 0.84 ± 0.24, and 0.97 ± 0.08 for the glioma testing dataset and 0.97 ± 0.04, 0.81 ± 0.30, 0.86 ± 0.29, and 0.98 ± 0.04 for the lung cancer testing dataset.
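A minimal sketch of this fine-tuning setup using torchvision is shown below; the loss function, optimizer, and learning rate are illustrative placeholders, as the actual training settings follow Kobayashi et al. (2021).

```python
import torch
import torch.nn as nn
from torchvision import models

# Start from ImageNet-pretrained weights (torchvision >= 0.13 weights API) and
# replace the 1,000-class head with a single-node output for abnormality classification.
model = models.resnet101(weights=models.ResNet101_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, 1)

criterion = nn.BCEWithLogitsLoss()                          # placeholder loss
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # placeholder optimizer

def training_step(images, has_abnormality):
    # images: (B, 3, H, W) tensor; has_abnormality: (B,) float tensor of 0/1 targets.
    logits = model(images).squeeze(1)
    loss = criterion(logits, has_abnormality)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```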

Appendix B Identification of isolated samples in the conventional query-by-example approach

Figure 1: Isolated samples in a database as a fundamental limitation of conventional query-by-example approaches. a In a latent space, an ordinary sample can be surrounded by other ordinary samples, while an isolated sample with unique image characteristics is considered to be located far away from the others. Thus, when such isolated samples exist in a database, one may not be able to find them using the query-by-example approach because it is challenging to prepare an example image with similar unique characteristics. This can be quantitatively evaluated by counting the number of samples that do not appear in the k-nearest neighbors (k-NN) of any other samples in the latent space. b The number of isolated samples in the glioma testing dataset among k-NN samples. The fine-tuning of ResNet-101 reduced the number of isolated samples. For example, the number of isolated samples in the 5-NN group was reduced from 21 to 10 (indicated by the arrow). c The number of isolated samples in the lung cancer testing dataset among k-NN samples. Similarly, the number of isolated samples in the 5-NN group was reduced from 48 to 13 by fine-tuning (indicated by the arrow). Nevertheless, a substantial number of isolated samples existed in both databases, even when the fine-tuned feature extractor was used.

We defined an isolated sample as an image that is “not” included among the k-nearest neighbors (k-NN) of any other images (see Fig. 1a). To count the isolated samples in the testing datasets, we focused on representative images. A representative image is defined as the 2D axial slice containing the largest tumor-associated region in each 3D volume. This is because 3D volume data, such as CT and MRI scans, usually contain similar abnormal findings in consecutive slices, which can raise a concern about duplicative counting of similar images. We decided that the evaluation using the representative images is valid because the slice that best characterizes the clinical significance of 3D volume data is often the one in which a lesion appears largest. After identifying the representative images from each 3D volume, a feature extractor (i.e., the ImageNet-trained ResNet-101 or the fine-tuned ResNet-101, as described in A) extracted an image-level feature from each representative image. Then, the L2 distances to all the other representative images were calculated based on the feature vectors to obtain the k-NN samples of each representative image. The number of representative images that were “not” included in the k-NN of any other representative image was counted. The numbers of isolated samples in the glioma and lung cancer testing datasets are shown in Fig. 1b and Fig. 1c, respectively.
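A minimal sketch of this counting procedure, assuming a matrix of image-level feature vectors (one row per representative image); the function and variable names are illustrative.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def count_isolated_samples(features, k=5):
    # features: (N, D) array of image-level features of the representative images.
    # An image is isolated if it never appears among the k-NN of any other image.
    nn = NearestNeighbors(n_neighbors=k + 1, metric="euclidean").fit(features)
    _, indices = nn.kneighbors(features)   # column 0 is each query's self-match
    neighbor_ids = indices[:, 1:]          # k-NN of every image, excluding itself
    appears_as_neighbor = np.zeros(len(features), dtype=bool)
    appears_as_neighbor[np.unique(neighbor_ids)] = True
    return int((~appears_as_neighbor).sum())

# Example with random placeholder features.
rng = np.random.default_rng(0)
print(count_isolated_samples(rng.random((200, 2048)), k=5))
```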

Appendix C Detailed network architecture

For the glioma training dataset, the input image size was 3 × 256 × 256 (= 196,608) and the segmentation label size was 4 × 256 × 256, corresponding to three classes of tumor-associated regions (i.e., ET, ED, and NET) plus a background class. Table 1 shows the detailed implementation of the normal AC encoder, which mainly consists of a repeated structure with residual blocks (He et al., 2016) and a strided convolution (ResBlock + StridedConv). For the abnormal AC encoder, the network architecture demonstrated in Table 2 was employed. Almost the same network architecture as in Table 2 was used for the label encoder, except for its input size of 4 × 256 × 256, which is consistent with the segmentation label size. The image decoder employed the neural network architecture shown in Table 3. For the label decoder, almost the same architecture as in Table 3 was used, except for its final output size, which was adjusted to be 4 × 256 × 256.

For the lung cancer training dataset, the network architectures were the same, but the spatial resolutions of the output shapes in Table 1, Table 2, and Table 3 were doubled because the spatial resolution of the input image was doubled, to 3 × 512 × 512 (= 786,432). Additionally, the segmentation label size was 2 × 512 × 512, reflecting one class of tumor-associated region (i.e., PT) plus a background class. Because the input image was concatenated with the adjacent upper and lower slices to reach a channel size of 3, the segmentation label corresponding to the center slice was set as the learning objective. Consequently, the sizes of the ACs were 512 × 2 × 2 = 2,048 for the glioma dataset and 512 × 4 × 4 = 8,192 for the lung cancer dataset, maintaining the same compression ratio (2,048/196,608 = 8,192/786,432 = 1/96) across the datasets irrespective of the difference in the spatial resolution of the images.

Table 1: Basic architecture of the encoder networks of the variational autoencoder component
Module | Activation | Output shape
Input image | - | 3 × 256 × 256
Conv | [3 × 3, 32] | 32 × 256 × 256
StridedConv | [3 × 3, 64] | 64 × 128 × 128
ResBlock | [3 × 3, 64; 3 × 3, 64] | 64 × 128 × 128
StridedConv | [3 × 3, 128] | 128 × 64 × 64
ResBlock | [3 × 3, 128; 3 × 3, 128] | 128 × 64 × 64
StridedConv | [3 × 3, 256] | 256 × 32 × 32
ResBlock | [3 × 3, 256; 3 × 3, 256] | 256 × 32 × 32
StridedConv | [3 × 3, 512] | 512 × 16 × 16
ResBlock | [3 × 3, 512; 3 × 3, 512] | 512 × 16 × 16
StridedConv | [3 × 3, 512] | 512 × 8 × 8
ResBlock | [3 × 3, 512; 3 × 3, 512] | 512 × 8 × 8
StridedConv | [3 × 3, 512] | 512 × 4 × 4
ResBlock | [3 × 3, 512; 3 × 3, 512] | 512 × 4 × 4
StridedConv | [3 × 3, 512] | 512 × 2 × 2
Split | - | -
Conv | [1 × 1, 512], [1 × 1, 512] | 512 × 2 × 2, 512 × 2 × 2
Table 2: Basic architecture of the encoder networks
Module | Activation | Output shape
Input image | - | 3 × 256 × 256
Conv | [3 × 3, 32] | 32 × 256 × 256
StridedConv | [3 × 3, 64] | 64 × 128 × 128
ResBlock | [3 × 3, 64; 3 × 3, 64] | 64 × 128 × 128
StridedConv | [3 × 3, 128] | 128 × 64 × 64
ResBlock | [3 × 3, 128; 3 × 3, 128] | 128 × 64 × 64
StridedConv | [3 × 3, 256] | 256 × 32 × 32
ResBlock | [3 × 3, 256; 3 × 3, 256] | 256 × 32 × 32
StridedConv | [3 × 3, 512] | 512 × 16 × 16
ResBlock | [3 × 3, 512; 3 × 3, 512] | 512 × 16 × 16
StridedConv | [3 × 3, 512] | 512 × 8 × 8
ResBlock | [3 × 3, 512; 3 × 3, 512] | 512 × 8 × 8
StridedConv | [3 × 3, 512] | 512 × 4 × 4
ResBlock | [3 × 3, 512; 3 × 3, 512] | 512 × 4 × 4
StridedConv | [3 × 3, 512] | 512 × 2 × 2
Table 3: Basic architecture of the decoder networks
Module | Activation | Output shape
Latent representation | - | 512 × 2 × 2
Conv | [3 × 3, 512] | 512 × 2 × 2
ResBlock | [3 × 3, 512; 3 × 3, 512] | 512 × 2 × 2
Conv | [1 × 1, 512] | 512 × 2 × 2
Upsample | - | 512 × 4 × 4
ResBlock | [3 × 3, 512; 3 × 3, 512] | 512 × 4 × 4
Conv | [1 × 1, 512] | 512 × 4 × 4
Upsample | - | 512 × 8 × 8
ResBlock | [3 × 3, 512; 3 × 3, 512] | 512 × 8 × 8
Conv | [1 × 1, 256] | 256 × 8 × 8
Upsample | - | 256 × 16 × 16
ResBlock | [3 × 3, 256; 3 × 3, 256] | 256 × 16 × 16
Conv | [1 × 1, 128] | 128 × 16 × 16
Upsample | - | 128 × 32 × 32
ResBlock | [3 × 3, 128; 3 × 3, 128] | 128 × 32 × 32
Conv | [1 × 1, 64] | 64 × 32 × 32
Upsample | - | 64 × 64 × 64
ResBlock | [3 × 3, 64; 3 × 3, 64] | 64 × 64 × 64
Conv | [1 × 1, 32] | 32 × 64 × 64
Upsample | - | 32 × 128 × 128
ResBlock | [3 × 3, 32; 3 × 3, 32] | 32 × 128 × 128
Conv | [1 × 1, 3] | 3 × 128 × 128
Upsample | - | 3 × 256 × 256
ResBlock | [3 × 3, 3; 3 × 3, 3] | 3 × 256 × 256
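As a rough PyTorch sketch of the repeated ResBlock + StridedConv pattern summarized in Tables 1 and 2; normalization layers, activation functions, and other implementation details are guesses made for illustration, not the released implementation.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    # Two 3x3 convolutions with an identity shortcut (channel count unchanged).
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
    def forward(self, x):
        return torch.relu(x + self.body(x))

def strided_conv(in_ch, out_ch):
    # 3x3 convolution with stride 2: halves the spatial resolution.
    return nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1)

# Encoder backbone following Table 2: 3x256x256 input -> 512x2x2 anatomy code.
encoder = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1),            # 32 x 256 x 256
    strided_conv(32, 64),                      # 64 x 128 x 128
    ResBlock(64),  strided_conv(64, 128),      # 128 x 64 x 64
    ResBlock(128), strided_conv(128, 256),     # 256 x 32 x 32
    ResBlock(256), strided_conv(256, 512),     # 512 x 16 x 16
    ResBlock(512), strided_conv(512, 512),     # 512 x 8 x 8
    ResBlock(512), strided_conv(512, 512),     # 512 x 4 x 4
    ResBlock(512), strided_conv(512, 512),     # 512 x 2 x 2
)

print(encoder(torch.zeros(1, 3, 256, 256)).shape)  # torch.Size([1, 512, 2, 2])
```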

Appendix D Training results of the feature extraction module of our SBMIR system

We trained the feature extraction module of our SBMIR system using the glioma training dataset and the lung cancer training dataset independently (see Section 3.2.3 and Section 3.2.4). We qualitatively and quantitatively evaluated its training results to verify the algorithm from technical viewpoints.

D.1 Qualitative evaluation of the image reconstruction

Figure 1: Training results of the image reconstruction and segmentation. The input images, the entire reconstructions from entire anatomy codes (ACs), and the pseudo-normal reconstructions from normal ACs are presented in the first, second, and third rows, respectively. The ground-truth and predicted segmentation labels are shown in the fourth and fifth rows, respectively. Arrows indicate the diseased areas in the entire reconstructions, which are diminished in the pseudo-normal reconstructions. Gd, gadolinium.

How the feature decomposition was achieved can be assessed by visualizing images generated by the image decoder and the label decoder. Recall that the reconstructed images from the normal ACs should be pseudo-normal images, while those from the entire ACs should be entire images with some abnormalities if they exist (see Fig. 1b). Additionally, the abnormal ACs produce segmentation labels for the tumor-associated regions when they are inputted into the label decoder.

Figure 1 presents the images and segmentation labels decoded from the ACs using the glioma testing dataset and the lung cancer testing dataset. The first row shows the input images. The second row demonstrates the entire reconstructions of the input images, which were decoded by the image decoder taking the entire ACs as input. The third row indicates the pseudo-normal reconstructions, which were decoded by the image decoder taking the normal ACs as input. Note that the abnormal imaging features that appear both in the input images and in the entire reconstructions (see the arrows in Fig. 1) are diminished in the pseudo-normal images, recovering the normal anatomy that should have existed therein. These results suggest successful feature decomposition. Moreover, the fourth row presents the ground-truth segmentation labels for the tumor-associated regions, and the fifth row shows the predicted segmentation labels that were decoded by the label decoder taking the abnormal ACs as input. Note that the segmentation was trained only for the tumor-associated regions and not for the normal anatomy-associated regions, as shown in the results.

One may argue that the quality of the reconstructed images and segmentation labels was insufficient, as observed from the blurred and rounded appearance that did not recover the detailed image characteristics. These tendencies are reasonable because the image information is compressed into a latent representation of limited size, which imposes a trade-off between reconstruction quality and latent size (Razavi et al., 2019; Kobayashi et al., 2021). Indeed, we did not pursue the generation quality of the reconstructions as a primary purpose because the lower dimensionality of the latent representations is advantageous for computational efficiency in similarity search. Besides, although fine image details were not perfectly reproduced, the reconstructions were still sufficient for recognizing the anatomical location and the presence of abnormalities.

D.2 Qualitative evaluation of the semantically organized latent space

Figure 2: Relationship between the margin parameter and cluster formation in the latent space. Four latent distributions, the entire anatomy codes (ACs) from healthy images (green dots), the normal ACs from healthy images (blue dots), the normal ACs from diseased images (orange dots), and the entire ACs from diseased images (red dots) are visualized using t-distributed stochastic neighbor embedding (t-SNE) plotting. a In the glioma testing dataset, two clusters emerged with a margin parameter of more than 5. b In the lung cancer testing dataset, two clusters emerged when the margin parameter was more than 40. Note that the entire ACs from diseased images (red dots) were intermingled with the other data points, particularly with a margin parameter of 0, possibly reflecting the relatively small area of each lung cancer.

We applied t-distributed stochastic neighbor embedding (t-SNE) analysis (van der Maaten and Hinton, 2011) to evaluate how the latent space was organized according to the semantics by changing the margin parameter (see Section 2.2.3). We randomly selected 50 individual volumes from each dataset and extracted entire ACs and normal ACs in an image-wise manner. The same randomly selected samples were reused in the subsequent t-SNE analyses for comparison, particularly in F.

The results using the glioma testing dataset and the lung cancer testing dataset are shown in Fig. 2a and Fig. 2b, respectively. When the margin parameter was set to 0, there was no visible cluster in the latent space, where the different ACs intermingled with each other (see the leftmost images in Fig. 2a and Fig. 2b). This is particularly evident in Fig. 2b, possibly reflecting that the primary site of lung cancer usually occupies such a small region that a latent feature representing an abnormality does not convey meaningful information when no margin parameter is assigned. On the other hand, when the margin parameter was increased, two clusters appeared and moved away from each other, particularly when the margin parameter was more than 10 for the glioma testing dataset and more than 40 for the lung cancer testing dataset. One cluster consisting of the entire ACs from healthy images (green dots), the normal ACs from healthy images (blue dots), and the normal ACs from diseased images (orange dots) can be interpreted as corresponding to the healthy subspace. In contrast, the other cluster consisting of the entire ACs from diseased images (red dots) can be considered the diseased distribution. Therefore, sufficiently large margin parameters are necessary for configuring the semantically organized latent space (see Fig. 5b).
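A minimal sketch of this visualization, assuming the four groups of ACs have already been collected into arrays; the group sizes, t-SNE settings, and plotting details are illustrative rather than those used in the paper.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Placeholder AC arrays; each row is one flattened anatomy code (AC).
rng = np.random.default_rng(0)
groups = {
    "entire AC, healthy": rng.random((200, 2048)),
    "normal AC, healthy": rng.random((200, 2048)),
    "normal AC, diseased": rng.random((200, 2048)),
    "entire AC, diseased": rng.random((200, 2048)),
}

features = np.concatenate(list(groups.values()), axis=0)
embedded = TSNE(n_components=2, random_state=0).fit_transform(features)

# Plot each group of ACs with its own color to inspect cluster formation.
start = 0
for name, arr in groups.items():
    end = start + len(arr)
    plt.scatter(embedded[start:end, 0], embedded[start:end, 1], s=4, label=name)
    start = end
plt.legend()
plt.show()
```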

D.3 Quantitative evaluation of the image reconstruction

Given entire ACs, the image decoder was trained to generate entire reconstructions $\hat{\bm{x}}$ of input images $\bm{x}$. The reconstruction error between the entire reconstructions $\hat{\bm{x}}$ and the input images $\bm{x}$ was evaluated as $\|\hat{\bm{x}}-\bm{x}\|_{2}^{2}$. The means $\pm$ standard deviations of the reconstruction error were $0.22 \pm 0.23$ and $0.16 \pm 0.05$ in the glioma testing dataset and the lung cancer testing dataset, respectively.
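For concreteness, a per-image version of this error could be computed as in the following sketch; the pixel-wise reduction (summation here) is an assumption, as the exact convention is not restated in this appendix.

```python
import torch

def reconstruction_error(x_hat: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    """Per-image squared L2 reconstruction error ||x_hat - x||_2^2.

    Both tensors are expected to have shape (batch, channels, height, width);
    the error is summed over all non-batch dimensions (reduction is assumed).
    """
    return (x_hat - x).pow(2).flatten(start_dim=1).sum(dim=1)
```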

D.4 Quantitative evaluation of the segmentation

The label decoder predicts segmentation labels of the tumor-associated regions $\hat{\bm{l}}$, which should be close to the ground-truth labels $\bm{l}$. The segmentation performance was evaluated based on the Dice score with respect to the tumor-associated labels. To calculate the Dice score for each volume, the segmentation outputs of the 2D axial images were concatenated into a 3D volume. The means $\pm$ standard deviations of the Dice score were $0.37 \pm 0.21$, $0.60 \pm 0.14$, and $0.44 \pm 0.21$ for NET, ED, and ET, respectively, in the glioma testing dataset. The mean $\pm$ standard deviation of the Dice score was $0.64 \pm 0.18$ for PT in the lung cancer testing dataset. These intermediate Dice scores are reasonable because the model was built without skip connections, which are essential for precise medical image segmentation (Drozdzal et al., 2016), in order to concentrate the information about the diseased regions in the abnormal ACs.
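A minimal sketch of this volume-wise Dice evaluation is given below, assuming the per-slice predictions and ground-truth labels are binary 2D NumPy arrays stacked along the axial axis; the function names are illustrative.

```python
import numpy as np

def dice_score(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8) -> float:
    """Dice score between two binary volumes: 2|P and G| / (|P| + |G|)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    return float(2.0 * np.logical_and(pred, gt).sum() / (pred.sum() + gt.sum() + eps))

def volume_dice(pred_slices, gt_slices) -> float:
    """Stack per-slice 2D predictions into a 3D volume before scoring,
    as done for the volume-wise evaluation."""
    return dice_score(np.stack(pred_slices, axis=0), np.stack(gt_slices, axis=0))
```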

D.5 Quantitative evaluation of the mapping function from the label space to the latent space

We evaluated the performance of the label encoder, which takes the ground-truth labels as input and estimates the corresponding abnormal ACs $\hat{\bm{a}}$; these should be similar to the output $\bm{a}$ of the abnormal AC encoder given the corresponding image. The means $\pm$ standard deviations of the squared L2 distance between the output of the label encoder and the corresponding abnormal ACs, $\|\hat{\bm{a}}-\bm{a}\|_{2}^{2}$, were $3.1\times 10^{-2} \pm 4.5\times 10^{-2}$ and $0.7\times 10^{-2} \pm 3.6\times 10^{-2}$ in the glioma testing dataset and the lung cancer testing dataset, respectively. Because these distances are small, the inverse mapping from the semantic labels to the corresponding abnormal ACs is sufficiently precise, enabling users to specify the characteristics of abnormal findings of interest by drawing semantic sketches.

D.6 Quantitative evaluation of the separability in latent space

To quantitatively evaluate how well the latent space $\mathcal{Z}$ can be separated into the healthy subspace $\mathcal{Z}^{\mathrm{h}}$ and the diseased subspace $\mathcal{Z}^{\mathrm{d}}$ (see Fig. 5b and Fig. 2), we trained support vector machines (SVMs) to discriminate between the entire ACs from healthy images $\mathcal{D}(\bm{e}\rvert\bm{x}^{\mathrm{h}})$ and the entire ACs from diseased images $\mathcal{D}(\bm{e}\rvert\bm{x}^{\mathrm{d}})$ using the training datasets (i.e., the glioma training dataset and the lung cancer training dataset). Because the SVM learns the hyperplane with the maximum margin between the two distributions, its classification performance reflects how well they are separated. In the glioma testing dataset, the SVM achieved an accuracy of 0.94, a precision of 0.96, a recall of 0.89, and an F-score of 0.92. In the lung cancer testing dataset, it achieved an accuracy of 0.97, a precision of 0.90, a recall of 0.82, and an F-score of 0.86. The high classification performance of the SVMs indicates that the semantically organized latent space was successfully configured.
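The following scikit-learn sketch illustrates this separability analysis; the variable names, the linear kernel, and the default hyperparameters are assumptions rather than the exact configuration used.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def evaluate_separability(train_healthy, train_diseased, test_healthy, test_diseased):
    """Train an SVM on entire ACs (healthy vs. diseased) and report test metrics."""
    X_train = np.concatenate([train_healthy, train_diseased])
    y_train = np.concatenate([np.zeros(len(train_healthy), dtype=int),
                              np.ones(len(train_diseased), dtype=int)])
    X_test = np.concatenate([test_healthy, test_diseased])
    y_test = np.concatenate([np.zeros(len(test_healthy), dtype=int),
                             np.ones(len(test_diseased), dtype=int)])

    clf = SVC(kernel="linear")  # linear kernel assumed; learns a max-margin hyperplane
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
    }
```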

Appendix E Technological evaluation focusing on the image retrieval performance

Here, we demonstrate that extending the feature decomposition of medical images by imposing a semantically organized latent space is critical for achieving a practical SBMIR system. A practical SBMIR system should retrieve images according to their similarity while simultaneously reflecting their semantics. That is, a query vector that conveys information about a diseased region should retrieve only diseased images, whereas one that does not should retrieve only healthy images. From this perspective, we devised two evaluation methods to objectively demonstrate how well the retrieved image features preserve information about both the semantics and the similarity of images. The image retrieval performance was affected by the margin parameter used to configure the semantically organized latent space (see Fig. 5b).

E.1 Semantic consistency in image retrieval

Figure 1: Evaluation of the semantic consistency in image retrieval. a When image retrieval can be realized according to the semantics of an image, healthy or diseased, a query vector as an entire anatomy code (AC) from a healthy or diseased image should retrieve healthy or diseased images, respectively. b To achieve semantic consistency, the latent space should be semantically organized, with data points with similar semantics located close to each other. c The semantic consistency in the glioma testing dataset. d The semantic consistency in the lung cancer testing dataset. Note that a higher margin parameter is essential for maintaining semantic consistency, as is particularly evident in the lung cancer testing dataset.
Figure 2: Evaluation of latent proximity in image retrieval. Efficient medical image retrieval requires similar images to be near each other in latent space. a To quantitatively evaluate latent proximity, we noted that the contiguous slices in an image volume are similar to each other. Then, whether a query vector as an entire anatomy code (AC) from a center image can retrieve the slices that were cranially or caudally consecutive to the center image was evaluated. b Low latent proximity indicates that images that are originally similar to each other are mixed in the latent space with dissimilar images, whereas high latent proximity suggests the similarity in the image space is also preserved in the latent space. c The latent proximity was evaluated using the glioma testing dataset. d The latent proximity was evaluated using the lung cancer testing dataset. Too large a margin parameter was found to adversely affect latent proximity, especially for the lung cancer testing dataset.

To evaluate how well image retrieval reflects semantics, we developed an evaluation method that quantifies semantic consistency. As shown in Fig. 1a, when an entire AC is used as the query vector, image retrieval reflecting semantics should return healthy images for an entire AC from a healthy image and diseased images for an entire AC from a diseased image. To achieve this, each sample in the latent space should be grouped with samples sharing the same semantics rather than with samples of different semantics (see Fig. 1b), which is what the margin parameter is expected to enforce. We therefore evaluated semantic consistency with the following procedure. First, each image in a testing dataset was used as a query image, and the corresponding entire AC was computed as the query vector. For each query vector, the five closest slices from volumes other than the one the query belongs to were retrieved. Then, the ratio of retrieved images whose semantics were consistent with the query vector was computed. This ratio is reported as the semantic consistency value, calculated separately according to the semantics of the query images (i.e., healthy images or diseased images).
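A compact sketch of this procedure is given below; it assumes the entire ACs, healthy/diseased labels, and volume identifiers of a testing dataset are available as NumPy arrays, and it uses a brute-force nearest-neighbor search instead of whichever similarity-search index the actual system uses.

```python
import numpy as np

def semantic_consistency(acs, labels, volume_ids, k=5):
    """For every slice used as a query, retrieve the k closest slices from
    *other* volumes and return the fraction whose healthy/diseased label
    matches the query, averaged separately for healthy and diseased queries."""
    acs = acs.reshape(len(acs), -1)
    results = {0: [], 1: []}  # 0: healthy queries, 1: diseased queries
    for i in range(len(acs)):
        mask = volume_ids != volume_ids[i]        # exclude the query's own volume
        dists = np.linalg.norm(acs[mask] - acs[i], axis=1)
        top_k = np.argsort(dists)[:k]
        results[int(labels[i])].append(np.mean(labels[mask][top_k] == labels[i]))
    return {sem: float(np.mean(v)) for sem, v in results.items() if v}
```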

In the glioma testing dataset (see Fig. 1c), the semantic consistency of diseased images increased from 0.86 to 0.88 as the margin parameter increased from 0 to 10. More remarkable effects were observed in the lung cancer testing dataset (see Fig. 1d). When the margin parameter was set to 0, the semantic consistency of diseased images was as low as 0.25, indicating that even when the user intends to retrieve images with abnormalities, the results can be contaminated with healthy images. In contrast, the semantic consistency improved to 0.72 when the margin parameter was increased to 40, showing that the margin parameter had a substantial effect on the semantic consistency of diseased images. Because the average tumor volume in the lung cancer testing dataset was significantly smaller than that in the glioma testing dataset (see Section 3.2), we infer that the latent features representing such small diseased areas can be overwhelmed by those representing the much larger surrounding body region. Consequently, the margin parameter is essential for semantically consistent image retrieval, as it ensures that imaging features relevant to abnormal findings are effectively conveyed.

E.2 Latent proximity in image retrieval

To evaluate how well the similarity in the image space is preserved in the latent space, we devised an index called latent proximity, based on the notion that similar images should also be near each other in the latent space; it was evaluated as follows (see Fig. 2a). First, each image in a testing dataset was used as a query image, and the corresponding entire AC was computed as the query vector. For each query vector, the closest slice was retrieved from all volumes, including the volume the query belongs to. Then, whether the retrieved image was among the five slices cranially or caudally consecutive to the query image in the original volume was assessed, separately according to the semantics of the query image. Finally, the proportion of retrieved images that fell within these consecutive slices was reported as the latent proximity value. As shown in Fig. 2b, low latent proximity indicates that even similar images are not close to each other in the latent space, which can cause unintended search results containing dissimilar images. In contrast, high latent proximity guarantees that similarity in the image space is also reproduced in the latent space, leading to retrieval results that are faithful to the user's intention. The evaluation was repeated with varying margin parameters for both the glioma and lung cancer testing datasets.
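The latent proximity index can be sketched analogously; again the array names are hypothetical, and a brute-force nearest-neighbor search stands in for the system's similarity-search backend.

```python
import numpy as np

def latent_proximity(acs, volume_ids, slice_indices, window=5):
    """Fraction of queries whose single nearest neighbour (excluding the query
    slice itself) lies within `window` cranially or caudally adjacent slices
    of the same volume."""
    acs = acs.reshape(len(acs), -1)
    hits = 0
    for i in range(len(acs)):
        dists = np.linalg.norm(acs - acs[i], axis=1)
        dists[i] = np.inf                          # do not retrieve the query itself
        j = int(np.argmin(dists))
        adjacent = 0 < abs(int(slice_indices[j]) - int(slice_indices[i])) <= window
        hits += int(volume_ids[j] == volume_ids[i] and adjacent)
    return hits / len(acs)
```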

As shown in Fig. 2c, the latent proximity in the glioma testing dataset did not differ dramatically with the magnitude of the margin parameter. In contrast, the latent proximity of diseased images in the lung cancer testing dataset changed clearly with the margin parameter (see Fig. 2d). Therefore, an excessively large margin parameter may weaken the correspondence between the image space and the latent space, hindering retrieval according to image similarity.

Appendix F Ablation studies of the feature extraction module of our SBMIR system

Among the loss functions used to train the feature extraction module (i.e., the reconstruction, segmentation, consistency, abnormality, regularization, and margin losses; see Section 2.3), the necessity and suitable values of the margin parameter were demonstrated in the previous sections. Here, we confirm that the abnormality loss and the regularization loss are also essential for the semantically organized latent space. For simplicity, only the results based on the glioma testing dataset are shown.

F.1 Ablation study of abnormality loss

Figure 1: Effects of abnormality loss on the organization of latent space. Four latent distributions–namely, the entire anatomy codes (ACs) from healthy images (green dots), the normal ACs from healthy images (blue dots), the normal ACs from diseased images (orange dots), and the entire ACs from diseased images (red dots)–are visualized using t-distributed stochastic neighbor embedding (t-SNE) plotting. Without abnormality loss, the entire ACs from healthy images (green dots) moved away from the Gaussian distribution representing normal anatomy, rendering the latent space disorganized according to the image semantics.

The abnormality loss forces the abnormal AC encoder to output zero vectors for healthy images, as the abnormal ACs from healthy images should not contain any relevant information (see Fig. 5a). Because the segmentation output from the subsequent label decoder is also trained to be all zeros, reflecting the absence of tumor-associated regions, one could argue that the abnormality loss is redundant. However, when the model was trained without the abnormality loss, the entire ACs from healthy images (green dots) moved away from the Gaussian distribution, as shown in Fig. 1. This may be because the abnormal ACs, even those from healthy images, had significant norms when trained without the abnormality loss. In this case, a discrepancy emerged between the entire ACs from healthy images and the normal ACs from healthy images, which weakened the rationale that the former distribution represents the normal anatomy and undermined the assumption of the semantically organized latent space (see Fig. 5b).
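A minimal sketch of such a penalty is shown below; the exact form of the abnormality loss (a squared L2 norm here) is an assumption made for illustration.

```python
import torch

def abnormality_loss(abnormal_ac_healthy: torch.Tensor) -> torch.Tensor:
    """Push abnormal ACs extracted from healthy images toward the zero vector
    by penalizing their squared L2 norm (the exact penalty form is assumed)."""
    return abnormal_ac_healthy.pow(2).flatten(start_dim=1).sum(dim=1).mean()
```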

F.2 Ablation study of regularization loss

Figure 2: Effects of regularization loss. a Four latent distributions, namely the entire anatomy codes (ACs) from healthy images (green dots), the normal ACs from healthy images (blue dots), the normal ACs from diseased images (orange dots), and the entire ACs from diseased images (red dots), are visualized using t-distributed stochastic neighbor embedding (t-SNE) plotting. Without the regularization loss forcing the distribution of normal ACs from healthy images to follow the Gaussian distribution, the latent space lost its organization according to the image semantics. b Owing to the lack of the prior distribution, the pair of the normal AC encoder and the image decoder acted like an identity function. As a result, the reconstructed image, even from the normal ACs, contained abnormal imaging features (see the arrowhead).

The regularization loss constrains the distribution of the normal ACs from healthy images to follow a Gaussian distribution, which is essential for training the VAE component (see Fig. 3a). An underlying assumption of the VAE component is that normal anatomical variation can be modeled by a Gaussian distribution, analogous to the fact that many medical indicators, such as body height, follow a Gaussian distribution. Here, we show that this assumption is also essential for the semantically organized latent space (see Fig. 5b). When the regularization loss was not imposed on the distribution of normal ACs, no clusters formed in the latent space, as demonstrated in Fig. 2a. Moreover, owing to the lack of a prior distribution, the pair of the normal AC encoder and the image decoder (i.e., the neural networks trained as the VAE component) acted like an identity function, resulting in image reconstructions that contained abnormal findings even from the normal ACs (see the arrowhead in Fig. 2b). In summary, the Gaussian assumption for the normal ACs is critical in our implementation not only for the semantically organized latent space but also for the feature decomposition of medical images.
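As an illustration, the regularization term could take the standard VAE form of a KL divergence toward a standard Gaussian, as sketched below; the variable names and the use of this exact KL formulation are assumptions.

```python
import torch

def kl_regularization(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """Standard VAE KL divergence KL(N(mu, sigma^2) || N(0, I)), applied to the
    normal ACs from healthy images; mu and logvar come from the normal AC encoder."""
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp())
    return kl.flatten(start_dim=1).sum(dim=1).mean()
```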

Appendix G Description of question items for the user tests

G.1 Question items in Test-1

For gliomas, the clinical characteristics of the five presented images are as follows (see Fig. 10a):

  • Q.1: A 95-mm non-enhancing tumor with peritumoral edema in the left frontal lobe.
  • Q.2: A 90-mm ring-enhancing tumor with multiple cores and peritumoral edema in the left frontal lobe.
  • Q.3: A 70-mm non-enhancing tumor in the left temporal lobe.
  • Q.4: A 50-mm ring-enhancing tumor with peritumoral edema in the right parietal lobe.
  • Q.5: An 85-mm ring-enhancing tumor with peritumoral edema in the right temporal lobe.

For lung cancers, the clinical characteristics of the five presented images are as follows (see Fig. 10b):

  • Q.1: A 25-mm part-solid nodule in the left upper lobe.
  • Q.2: A 30-mm tumor in the left lower lobe.
  • Q.3: A 10-mm solid nodule in the right upper lobe.
  • Q.4: A 45-mm tumor in the right middle lobe.
  • Q.5: A 35-mm tumor in the right lower lobe.

G.2 Question items in Test-2

For gliomas, the five descriptions presented to the evaluators are as follows (see Fig. 11a):

  • Q.1: A 60-mm ring-enhancing tumor is located primarily in the left temporal lobe. It is associated with massive peritumoral edema (about 100 mm in maximum length) extending through the left temporal lobe.
  • Q.2: A 50-mm non-enhancing tumor is located in the right frontal lobe. It is associated with mild peritumoral edema (about 70 mm in maximum length).
  • Q.3: A 25-mm ring-enhancing tumor is located in the left temporal pole (the tip of the left temporal lobe). It is associated with mild peritumoral edema (about 40 mm in maximum length).
  • Q.4: A 30-mm ring-enhancing tumor is localized in the right occipital lobe. It is associated with peritumoral edema extending anteriorly (about 60 mm in maximum length).
  • Q.5: A 60-mm ring-enhancing tumor is located in the midline of the bilateral frontal lobes. It is associated with extensive peritumoral edema in the bilateral frontal lobes (about 90 mm in maximum length).

For lung cancers, the five descriptions presented to the evaluators are as follows (see Fig. 11b):

  • Q.1: A 20-mm nodule in the left upper lobe that contacts the pleura on the mediastinal side.
  • Q.2: A 40-mm tumor in the posterior-basal segment of the left lower lobe in contact with the pleura.
  • Q.3: A 50-mm tumor in the apex of the right lung.
  • Q.4: A 70-mm tumor in the right pulmonary hilar region, possibly invading the mediastinum.
  • Q.5: A 20-mm peripheral nodule in the right lateral-basal segment in contact with the chest wall pleura.

G.3 Question items in Test-3

For gliomas, the clinical characteristics of the five isolated samples presented to the evaluators are as follows (see Fig. 12a):

  • Q.1: A 65-mm ring-enhancing tumor with peritumoral edema located in the deep white matter in the right parietal lobe.
  • Q.2: A 90-mm ring-enhancing tumor with extensive edema located primarily in the right frontal lobe.
  • Q.3: A 45-mm enhancing tumor with peritumoral edema located primarily in the left insular lobe.
  • Q.4: A 70-mm non-enhancing tumor with peritumoral edema in the left frontal lobe.
  • Q.5: A 35-mm ring-enhancing tumor with peritumoral edema located at the anterior edge of the left temporal lobe.

For lung cancers, the clinical characteristics of the five isolated samples presented to the evaluators are as follows (see Fig. 12b):

  • Q.1: A 25-mm part-solid nodule in the left upper lobe.
  • Q.2: A 65-mm tumor in the left lower lobe.
  • Q.3: A 65-mm tumor in the right lower lobe.
  • Q.4: A 70-mm tumor in the left lower lobe.
  • Q.5: A 55-mm tumor in the right lower lobe.

Appendix H Example results of the user tests

H.1 Example results of Test-1

Example results of Test-1 for gliomas and those for lung cancers are shown in Fig. 1 and Fig. 2, respectively.

H.2 Example results of Test-2

Example results of Test-2 for gliomas and those for lung cancers are shown in Fig. 3 and Fig. 4, respectively.

H.3 Example results of Test-3

Example results of Test-3 for gliomas and those for lung cancers are shown in Fig. 5 and Fig. 6, respectively.

References

  • Aerts et al. (2014) Aerts, H.J.W.L., Velazquez, E.R., Leijenaar, R.T.H., Parmar, C., Grossmann, P., Carvalho, S., Bussink, J., Monshouwer, R., Haibe-Kains, B., Rietveld, D., Hoebers, F., Rietbergen, M.M., Leemans, C.R., Dekker, A., Quackenbush, J., Gillies, R.J., Lambin, P., 2014. Decoding tumour phenotype by noninvasive imaging using a quantitative radiomics approach. Nat. Commun. 5, 4006. URL: https://doi.org/10.1038/ncomms5006, doi:10.1038/ncomms5006.
  • Allan et al. (2012) Allan, C., Burel, J.M., Moore, J., Blackburn, C., Linkert, M., Loynton, S., MacDonald, D., Moore, W.J., Neves, C., Patterson, A., Porter, M., Tarkowska, A., Loranger, B., Avondo, J., Lagerstedt, I., Lianas, L., Leo, S., Hands, K., Hay, R.T., Patwardhan, A., Best, C., Kleywegt, G.J., Zanetti, G., Swedlow, J.R., 2012. OMERO: Flexible, model-driven data management for experimental biology. Nat. Methods 9, 245–253. URL: https://doi.org/10.1038/nmeth.1896, doi:10.1038/nmeth.1896.
  • Bakas et al. (2017a) Bakas, S., Akbari, H., Sotiras, A., Bilello, M., Rozycki, M., Kirby, J., Freymann, J., Farahani, K., Davatzikos, C., 2017a. Segmentation labels and radiomic features for the pre-operative scans of the TCGA-GBM collection. The Cancer Imaging Archive URL: https://doi.org/10.7937/K9/TCIA.2017.KLXWJJ1Q, doi:10.7937/K9/TCIA.2017.KLXWJJ1Q.
  • Bakas et al. (2017b) Bakas, S., Akbari, H., Sotiras, A., Bilello, M., Rozycki, M., Kirby, J., Freymann, J., Farahani, K., Davatzikos, C., 2017b. Segmentation labels and radiomic features for the pre-operative scans of the TCGA-LGG collection. The Cancer Imaging Archive URL: https://doi.org/10.7937/K9/TCIA.2017.GJQ7R0EF, doi:10.7937/K9/TCIA.2017.GJQ7R0EF.
  • Bakas et al. (2017c) Bakas, S., Akbari, H., Sotiras, A., Bilello, M., Rozycki, M., Kirby, J.S., Freymann, J.B., Farahani, K., Davatzikos, C., 2017c. Advancing the cancer genome atlas glioma MRI collections with expert segmentation labels and radiomic features. Sci. Data 4, 1–13. URL: https://doi.org/10.1038/sdata.2017.117, doi:10.1038/sdata.2017.117.
  • Bernhardsson (2022) Bernhardsson, E., 2022. Approximate nearest neighbors oh yeah. https://github.com/spotify/annoy.
  • Bhunia et al. (2022) Bhunia, A.K., Koley, S., Khilji, A.F.U.R., Sain, A., Chowdhury, P.N., Xiang, T., Song, Y.Z., 2022. Sketching without worrying: Noise-tolerant sketch-based image retrieval, in: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE. pp. 999–1008. URL: https://doi.org/10.1109/cvpr52688.2022.00107, doi:10.1109/cvpr52688.2022.00107.
  • Chen et al. (2022) Chen, C., Lu, M.Y., Williamson, D.F.K., Chen, T.Y., Schaumberg, A.J., Mahmood, F., 2022. Fast and scalable search of whole-slide images via self-supervised deep learning. Nat. Biomed. Eng. , 1–15URL: https://doi.org/10.1038/s41551-022-00929-8, doi:10.1038/s41551-022-00929-8.
  • Cutillo et al. (2020) Cutillo, C.M., Sharma, K.R., Foschini, L., Kundu, S., Mackintosh, M., Mandl, K.D., Beck, T., Collier, E., Colvis, C., Gersing, K., Gordon, V., Jensen, R., Shabestari, B., Southall, N., 2020. Machine intelligence in healthcare—perspectives on trustworthiness, explainability, usability, and transparency. npj Digit. Med. 3, 1–5. URL: https://doi.org/10.1038/s41746-020-0254-2, doi:10.1038/s41746-020-0254-2.
  • Dice (1945) Dice, L.R., 1945. Measures of the amount of ecologic association between species. Ecology 26, 297–302. URL: https://doi.org/10.2307/1932409, doi:10.2307/1932409.
  • Drozdzal et al. (2016) Drozdzal, M., Vorontsov, E., Chartrand, G., Kadoury, S., Pal, C., 2016. The importance of skip connections in biomedical image segmentation, in: Carneiro, G., Mateus, D., Peter, L., Bradley, A., Tavares, J.a.M.R.S., Belagiannis, V., Papa, J.a.P., Nascimento, J.C., Loog, M., Lu, Z., Cardoso, J.S., Cornebise, J. (Eds.), Deep Learning and Data Labeling for Medical Applications, Springer International Publishing, Cham. pp. 179–187. URL: https://link.springer.com/chapter/10.1007/978-3-319-46976-8_19.
  • Dutta and Akata (2020) Dutta, A., Akata, Z., 2020. Semantically tied paired cycle consistency for any-shot sketch-based image retrieval. Int. J. Comput. Vision 128, 2684–2703. URL: https://doi.org/10.1007/s11263-020-01350-x, doi:10.1007/s11263-020-01350-x.
  • Fang et al. (2021) Fang, J., Fu, H., Liu, J., 2021. Deep triplet hashing network for case-based medical image retrieval. Med. Image Anal. 69, 101981. URL: https://doi.org/10.1016/J.MEDIA.2021.101981, doi:10.1016/J.MEDIA.2021.101981.
  • Goodfellow et al. (2014) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y., 2014. Generative adversarial nets, in: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N., Weinberger, K. (Eds.), Advances in Neural Information Processing Systems 27 (NIPS), Curran Associates, Inc.. pp. 1–9. URL: https://proceedings.neurips.cc/paper/2014/file/5ca3e9b122f61f8f06494c97b1afccf3-Paper.pdf.
  • Haq et al. (2021) Haq, N.F., Moradi, M., Wang, Z.J., 2021. A deep community based approach for large scale content based x-ray image retrieval. Med. Image Anal. 68, 101847. URL: https://doi.org/10.1016/J.MEDIA.2020.101847, doi:10.1016/J.MEDIA.2020.101847.
  • He et al. (2016) He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE. pp. 770–778. URL: https://doi.org/10.1109/CVPR.2016.90, doi:10.1109/CVPR.2016.90.
  • Hofmanninger et al. (2020) Hofmanninger, J., Prayer, F., Pan, J., Röhrich, S., Prosch, H., Langs, G., 2020. Automatic lung segmentation in routine imaging is primarily a data diversity problem, not a methodology problem. Eur. Radiol. Exp. 4, 1–13. URL: https://doi.org/10.1186/s41747-020-00173-2, doi:10.1186/s41747-020-00173-2.
  • Hosny et al. (2018) Hosny, A., Parmar, C., Quackenbush, J., Schwartz, L.H., Aerts, H.J.W.L., 2018. Artificial intelligence in radiology. Nat. Rev. Cancer 18, 500–510. URL: https://doi.org/10.1038/s41568-018-0016-5, doi:10.1038/s41568-018-0016-5.
  • Kingma and Welling (2014) Kingma, D.P., Welling, M., 2014. Auto-encoding variational bayes, in: The 2nd International Conference on Learning Representations (ICLR), pp. 1–14. URL: https://arxiv.org/abs/1312.6114.
  • Kobayashi et al. (2021) Kobayashi, K., Hataya, R., Kurose, Y., Miyake, M., Takahashi, M., Nakagawa, A., Harada, T., Hamamoto, R., 2021. Decomposing normal and abnormal features of medical images for content-based image retrieval of glioma imaging. Med. Image Anal. 74, 102227. URL: https://doi.org/10.1016/j.media.2021.102227, doi:10.1016/j.media.2021.102227.
  • Lamine (2008) Lamine, M., 2008. Review of human-computer interaction issues in image retrieval, in: Pinder, S. (Ed.), Advances in Human Computer Interaction. InTech, Rijeka. chapter 14, pp. 215–240. URL: https://doi.org/10.5772/5929, doi:10.5772/5929.
  • LeCun et al. (2015) LeCun, Y., Bengio, Y., Hinton, G., 2015. Deep learning. Nature 521, 436–444. URL: https://doi.org/10.1038/nature14539, doi:10.1038/nature14539.
  • Li and Li (2018) Li, Y., Li, W., 2018. A survey of sketch-based image retrieval. Mach. Vision Appl. 29, 1083–1100. URL: https://doi.org/10.1007/s00138-018-0953-8, doi:10.1007/s00138-018-0953-8.
  • Li et al. (2018) Li, Z., Zhang, X., Müller, H., Zhang, S., 2018. Large-scale retrieval for medical image analytics: A comprehensive review. Med. Image Anal. 43, 66–84. URL: https://doi.org/10.1016/j.media.2017.09.007, doi:10.1016/j.media.2017.09.007.
  • Liang et al. (2022) Liang, W., Tadesse, G.A., Ho, D., Fei-Fei, L., Zaharia, M., Zhang, C., Zou, J., 2022. Advances, challenges and opportunities in creating data for trustworthy AI. Nat. Mach. Intell. 4, 669–677. URL: https://doi.org/10.1038/s42256-022-00516-1, doi:10.1038/s42256-022-00516-1.
  • Lin et al. (2020) Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollar, P., 2020. Focal loss for dense object detection. IEEE Trans. Pattern Anal. Mach. Intell. 42, 318–327. URL: https://doi.org/10.1109/tpami.2018.2858826, doi:10.1109/tpami.2018.2858826.
  • Liu et al. (2022) Liu, X., Sanchez, P., Thermos, S., O’Neil, A.Q., Tsaftaris, S.A., 2022. Learning disentangled representations in the imaging domain. Med. Image Anal. 80, 102516. URL: https://doi.org/10.1016/j.media.2022.102516, doi:10.1016/j.media.2022.102516.
  • Long et al. (2003) Long, F., Zhang, H., Feng, D.D., 2003. Fundamentals of content-based image retrieval, in: Feng, D.D., Siu, W.C., Zhang, H.J. (Eds.), Multimedia Information Retrieval and Management. Springer, Berlin, Heidelberg, pp. 1–26. URL: https://doi.org/10.1007/978-3-662-05300-3_1, doi:10.1007/978-3-662-05300-3_1.
  • van der Maaten and Hinton (2011) van der Maaten, L., Hinton, G., 2011. Visualizing non-metric similarities in multiple maps. Mach. Learn. 87, 33–55. URL: https://doi.org/10.1007/s10994-011-5273-4, doi:10.1007/s10994-011-5273-4.
  • Menze et al. (2015) Menze, B.H., Jakab, A., Bauer, S., Kalpathy-Cramer, J., Farahani, K., Kirby, J., Burren, Y., Porz, N., Slotboom, J., Wiest, R., Lanczi, L., Gerstner, E., Weber, M.A., Arbel, T., Avants, B.B., Ayache, N., Buendia, P., Collins, D.L., Cordier, N., Corso, J.J., Criminisi, A., Das, T., Delingette, H., Demiralp, C., Durst, C.R., Dojat, M., Doyle, S., Festa, J., Forbes, F., Geremia, E., Glocker, B., Golland, P., Guo, X., Hamamci, A., Iftekharuddin, K.M., Jena, R., John, N.M., Konukoglu, E., Lashkari, D., Mariz, J.A., Meier, R., Pereira, S., Precup, D., Price, S.J., Raviv, T.R., Reza, S.M.S., Ryan, M., Sarikaya, D., Schwartz, L., Shin, H.C., Shotton, J., Silva, C.A., Sousa, N., Subbanna, N.K., Szekely, G., Taylor, T.J., Thomas, O.M., Tustison, N.J., Unal, G., Vasseur, F., Wintermark, M., Ye, D.H., Zhao, L., Zhao, B., Zikic, D., Prastawa, M., Reyes, M., Van Leemput, K., 2015. The multimodal brain tumor image segmentation benchmark (BRATS). IEEE Trans. Med. Imaging 34, 1993–2024. URL: https://doi.org/10.1109/tmi.2014.2377694, doi:10.1109/tmi.2014.2377694.
  • Miao et al. (2021) Miao, Z., Liu, Z., Gaynor, K.M., Palmer, M.S., Yu, S.X., Getz, W.M., 2021. Iterative human and automated identification of wildlife images. Nat. Mach. Intell. 3, 885–895. URL: https://doi.org/10.1038/s42256-021-00393-0, doi:10.1038/s42256-021-00393-0.
  • Paszke et al. (2019) Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., Chintala, S., 2019. PyTorch: An imperative style, high-performance deep learning library, in: Advances in Neural Information Processing Systems 32 (NeurIPS), pp. 8024–8035. URL: https://proceedings.neurips.cc/paper/2019/hash/bdbca288fee7f92f2bfa9f7012727740-Abstract.html.
  • Pedronette and Torres (2015) Pedronette, D.C.G., Torres, R.D.S., 2015. Unsupervised effectiveness estimation for image retrieval using reciprocal rank information, in: 2015 28th SIBGRAPI Conference on Graphics, Patterns and Images, IEEE. pp. 321–328. URL: https://doi.org/10.1109/SIBGRAPI.2015.28, doi:10.1109/SIBGRAPI.2015.28.
  • Pinho et al. (2019) Pinho, E., Silva, J.F., Costa, C., 2019. Volumetric feature learning for query-by-example in medical imaging archives, in: 2019 IEEE 32nd International Symposium on Computer-Based Medical Systems (CBMS), IEEE. pp. 138–143. URL: https://doi.org/10.1109/CBMS.2019.00038, doi:10.1109/CBMS.2019.00038.
  • Prior et al. (2017) Prior, F., Smith, K., Sharma, A., Kirby, J., Tarbox, L., Clark, K., Bennett, W., Nolan, T., Freymann, J., 2017. The public cancer radiology imaging collections of the cancer imaging archive. Sci. data 4. URL: https://pubmed.ncbi.nlm.nih.gov/28925987/, doi:10.1038/SDATA.2017.124.
  • Quellec et al. (2011) Quellec, G., Lamard, M., Cazuguel, G., Roux, C., Cochener, B., 2011. Case retrieval in medical databases by fusing heterogeneous information. IEEE Trans. Med. Imaging 30, 108–118. URL: https://doi.org/10.1109/tmi.2010.2063711, doi:10.1109/tmi.2010.2063711.
  • Raghu et al. (2019) Raghu, M., Zhang, C., Kleinberg, J., Bengio, S., 2019. Transfusion: Understanding transfer learning for medical imaging, in: Advances in Neural Information Processing Systems 32 (NeurIPS), Curran Associates Inc.. pp. 1–11. URL: https://proceedings.neurips.cc/paper/2019/file/eb1e78328c46506b46a4ac4a1e378b91-Paper.pdf.
  • Razavi et al. (2019) Razavi, A., van den Oord, A., Vinyals, O., 2019. Generating diverse high-fidelity images with VQ-VAE-2, in: Advances in Neural Information Processing Systems 32 (NeurIPS), pp. 1–11. URL: http://papers.nips.cc/paper/9625-generating-diverse-high-fidelity-images-with-vq-vae-2.
  • Rezende et al. (2014) Rezende, D.J., Mohamed, S., Wierstra, D., 2014. Stochastic backpropagation and approximate inference in deep generative models, in: Proceedings of the 31st International Conference on Machine Learning (ICML), pp. 1278–1286. URL: https://proceedings.mlr.press/v32/rezende14.html.
  • Rossi et al. (2021) Rossi, A., Hosseinzadeh, M., Bianchini, M., Scarselli, F., Huisman, H., 2021. Multi-modal siamese network for diagnostically similar lesion retrieval in prostate mri. IEEE Trans. Med. Imaging 40, 986–995. URL: https://doi.org/10.1109/TMI.2020.3043641, doi:10.1109/TMI.2020.3043641.
  • Russakovsky et al. (2015) Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L., 2015. ImageNet large scale visual recognition challenge. Int. J. Comput. Vision 115, 211–252. URL: https://doi.org/10.1007/s11263-015-0816-y, doi:10.1007/s11263-015-0816-y.
  • Sangkloy et al. (2016) Sangkloy, P., Burnell, N., Ham, C., Hays, J., 2016. The sketchy database. ACM Trans. Graphics 35, 1–12. URL: https://doi.org/10.1145/2897824.2925954, doi:10.1145/2897824.2925954.
  • Schlegl et al. (2017) Schlegl, T., Seeböck, P., Waldstein, S.M., Schmidt-Erfurth, U., Langs, G., 2017. Unsupervised anomaly detection with generative adversarial networks to guide marker discovery, in: The 25th International Conference on Information Processing in Medical Imaging (IPMI), pp. 146–157. URL: https://link.springer.com/chapter/10.1007/978-3-319-59050-9_12.
  • Shattuck and Leahy (2002) Shattuck, D.W., Leahy, R.M., 2002. BrainSuite: An automated cortical surface identification tool. Med. Image Anal. 6, 129–142. URL: https://doi.org/10.1016/s1361-8415(02)00054-3, doi:10.1016/s1361-8415(02)00054-3.
  • Shirahatti and Barnard (2005) Shirahatti, N.V., Barnard, K., 2005. Evaluating image retrieval, in: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), IEEE. pp. 955–961. URL: https://doi.org/10.1109/CVPR.2005.147, doi:10.1109/CVPR.2005.147.
  • Simonyan and Zisserman (2015) Simonyan, K., Zisserman, A., 2015. Very deep convolutional networks for large-scale image recognition, in: The 3rd International Conference on Learning Representations (ICLR), pp. 1–14. URL: https://arxiv.org/abs/1409.1556.
  • The Cancer Genome Atlas Research Network (2015) The Cancer Genome Atlas Research Network, 2015. Comprehensive, integrative genomic analysis of diffuse lower-grade gliomas. New Engl. J. Med. 372, 2481–2498. URL: https://doi.org/10.1056/nejmoa1402121, doi:10.1056/nejmoa1402121.
  • Tschandl et al. (2020) Tschandl, P., Rinner, C., Apalla, Z., Argenziano, G., Codella, N., Halpern, A., Janda, M., Lallas, A., Longo, C., Malvehy, J., Paoli, J., Puig, S., Rosendahl, C., Soyer, H.P., Zalaudek, I., Kittler, H., 2020. Human–computer collaboration for skin cancer recognition. Nat. Med. 26, 1229–1234. URL: https://doi.org/10.1038/s41591-020-0942-0, doi:10.1038/s41591-020-0942-0.
  • Turro et al. (2020) Turro, E., Astle, W.J., Megy, K., Gräf, S., Greene, D., Shamardina, O., Allen, H.L., Sanchis-Juan, A., Frontini, M., Thys, C., Stephens, J., Mapeta, R., Burren, O.S., Downes, K., Haimel, M., Tuna, S., Deevi, S.V., Aitman, T.J., Bennett, D.L., Calleja, P., Carss, K., Caulfield, M.J., Chinnery, P.F., Dixon, P.H., Gale, D.P., James, R., Koziell, A., Laffan, M.A., Levine, A.P., Maher, E.R., Markus, H.S., Morales, J., Morrell, N.W., Mumford, A.D., Ormondroyd, E., Rankin, S., Rendon, A., Richardson, S., Roberts, I., Roy, N.B., Saleem, M.A., Smith, K.G., Stark, H., Tan, R.Y., Themistocleous, A.C., Thrasher, A.J., Watkins, H., Webster, A.R., Wilkins, M.R., Williamson, C., Whitworth, J., Humphray, S., Bentley, D.R., Abbs, S., Abulhoul, L., Adlard, J., Ahmed, M., Alachkar, H., Allsup, D.J., Almeida-King, J., Ancliff, P., Antrobus, R., Armstrong, R., Arno, G., Ashford, S., Attwood, A., Aurora, P., Babbs, C., Bacchelli, C., Bakchoul, T., Banka, S., Bariana, T., Barwell, J., Batista, J., Baxendale, H.E., Beales, P.L., Bentley, D.R., Bierzynska, A., Biss, T., Bitner-Glindzicz, M.A., Black, G.C., Bleda, M., Blesneac, I., Bockenhauer, D., Bogaard, H., Bourne, C.J., Boyce, S., Bradley, J.R., Bragin, E., Breen, G., Brennan, P., Brewer, C., Brown, M., Browning, A.C., Browning, M.J., Buchan, R.J., Buckland, M.S., Bueser, T., Diz, C.B., Burn, J., Burns, S.O., Burren, O.S., Burrows, N., Campbell, C., Carr-White, G., Carss, K., Casey, R., Chambers, J., Chambers, J., Chan, M.M., Cheah, C., Cheng, F., Chinnery, P.F., Chitre, M., Christian, M.T., Church, C., Clayton-Smith, J., Cleary, M., Brod, N.C., Coghlan, G., Colby, E., Cole, T.R., Collins, J., Collins, P.W., Colombo, C., Compton, C.J., Condliffe, R., Cook, S., Cook, H.T., Cooper, N., Corris, P.A., Furnell, A., Cunningham, F., Curry, N.S., Cutler, A.J., Daniels, M.J., Dattani, M., Daugherty, L.C., Davis, J., Soyza, A.D., Deevi, S.V., Dent, T., Deshpande, C., Dewhurst, E.F., Dixon, P.H., Douzgou, S., Downes, K., Drazyk, A.M., Drewe, E., Duarte, D., Dutt, T., Edgar, J.D.M., Edwards, K., Egner, W., Ekani, M.N., Elliott, P., Erber, W.N., Erwood, M., Estiu, M.C., Evans, D.G., Evans, G., Everington, T., Eyries, M., Fassihi, H., Favier, R., Findhammer, J., Fletcher, D., Flinter, F.A., Floto, R.A., Fowler, T., Fox, J., Frary, A.J., French, C.E., Freson, K., Frontini, M., Gale, D.P., Gall, H., Ganesan, V., Gattens, M., Geoghegan, C., Gerighty, T.S., Gharavi, A.G., Ghio, S., Ghofrani, H.A., Gibbs, J.S.R., Gibson, K., Gilmour, K.C., Girerd, B., Gleadall, N.S., Goddard, S., Goldstein, D.B., Gomez, K., Gordins, P., Gosal, D., Gräf, S., Graham, J., Grassi, L., Greene, D., Greenhalgh, L., Greinacher, A., Gresele, P., Griffiths, P., Grigoriadou, S., Grocock, R.J., Grozeva, D., Gurnell, M., Hackett, S., Hadinnapola, C., Hague, W.M., Hague, R., Haimel, M., Hall, M., Hanson, H.L., Haque, E., Harkness, K., Harper, A.R., Harris, C.L., Hart, D., Hassan, A., Hayman, G., Henderson, A., Herwadkar, A., Hoffman, J., Holden, S., Horvath, R., Houlden, H., Houweling, A.C., Howard, L.S., Hu, F., Hudson, G., Hughes, J., Huissoon, A.P., Humbert, M., Humphray, S., Hunter, S., Hurles, M., Irving, M., Izatt, L., Johnson, S.A., Jolles, S., Jolley, J., Josifova, D., Jurkute, N., Karten, T., Karten, J., Kasanicki, M.A., Kazkaz, H., Kazmi, R., Kelleher, P., Kelly, A.M., Kelsall, W., Kempster, C., Kiely, D.G., Kingston, N., Klima, R., Koelling, N., Kostadima, M., Kovacs, G., Koziell, A., Kreuzhuber, R., Kuijpers, T.W., Kumar, A., Kumararatne, D., Kurian, M.A., Laffan, M.A., 
Lalloo, F., Lambert, M., Lawrie, A., Layton, D.M., Lench, N., Lentaigne, C., Lester, T., Levine, A.P., Linger, R., Longhurst, H., Lorenzo, L.E., Louka, E., Lyons, P.A., Machado, R.D., Ross, R.V.M., Madan, B., Maher, E.R., Maimaris, J., Malka, S., Mangles, S., Mapeta, R., Marchbank, K.J., Marks, S., Marschall, H.U., Marshall, A., Martin, J., Mathias, M., Matthews, E., Maxwell, H., McAlinden, P., McCarthy, M.I., McKinney, H., McMahon, A., Meacham, S., Mead, A.J., Castello, I.M., Megy, K., Mehta, S.G., Michaelides, M., Millar, C., Mohammed, S.N., Moledina, S., Montani, D., Moore, A.T., Morales, J., Morrell, N.W., Mozere, M., Muir, K.W., Mumford, A.D., Nemeth, A.H., Newman, W.G., Newnham, M., Noorani, S., Nurden, P., O’Sullivan, J., Obaji, S., Odhams, C., Okoli, S., Olschewski, A., Olschewski, H., Ong, K.R., Oram, S.H., Ormondroyd, E., Ouwehand, W.H., Palles, C., Papadia, S., Park, S.M., Parry, D., Patel, S., Paterson, J., Peacock, A., Pearce, S.H., Peden, J., Peerlinck, K., Penkett, C.J., Pepke-Zaba, J., Petersen, R., Pilkington, C., Poole, K.E., Prathalingam, R., Psaila, B., Pyle, A., Quinton, R., Rahman, S., Rankin, S., Rao, A., Raymond, F.L., Rayner-Matthews, P.J., Rees, C., Rendon, A., Renton, T., Rhodes, C.J., Rice, A.S., Richardson, S., Richter, A., Robert, L., Roberts, I., Rogers, A., Rose, S.J., Ross-Russell, R., Roughley, C., Roy, N.B., Ruddy, D.M., Sadeghi-Alavijeh, O., Saleem, M.A., Samani, N., Samarghitean, C., Sanchis-Juan, A., Sargur, R.B., Sarkany, R.N., Satchell, S., Savic, S., Sayer, J.A., Sayer, G., Scelsi, L., Schaefer, A.M., Schulman, S., Scott, R., Scully, M., Searle, C., Seeger, W., Sen, A., Sewell, W.A., Seyres, D., Shah, N., Shamardina, O., Shapiro, S.E., Shaw, A.C., Short, P.J., Sibson, K., Side, L., Simeoni, I., Simpson, M.A., Sims, M.C., Sivapalaratnam, S., Smedley, D., Smith, K.R., Smith, K.G., Snape, K., Soranzo, N., Soubrier, F., Southgate, L., Spasic-Boskovic, O., Staines, S., Staples, E., Stark, H., Stephens, J., Steward, C., Stirrups, K.E., Stuckey, A., Suntharalingam, J., Swietlik, E.M., Syrris, P., Tait, R.C., Talks, K., Tan, R.Y., Tate, K., Taylor, J.M., Taylor, J.C., Thaventhiran, J.E., Themistocleous, A.C., Thomas, E., Thomas, D., Thomas, M.J., Thomas, P., Thomson, K., Thrasher, A.J., Threadgold, G., Thys, C., Tilly, T., Tischkowitz, M., Titterton, C., Todd, J.A., Toh, C.H., Tolhuis, B., Tomlinson, I.P., Toshner, M., Traylor, M., Treacy, C., Treadaway, P., Trembath, R., Tuna, S., Turek, W., Turro, E., Twiss, P., Vale, T., Geet, C.V., van Zuydam, N., Vandekuilen, M., Vandersteen, A.M., Vazquez-Lopez, M., von Ziegenweidt, J., Noordegraaf, A.V., Wagner, A., Waisfisz, Q., Walker, S.M., Walker, N., Walter, K., Ware, J.S., Watkins, H., Watt, C., Webster, A.R., Wedderburn, L., Wei, W., Welch, S.B., Wessels, J., Westbury, S.K., Westwood, J.P., Wharton, J., Whitehorn, D., Whitworth, J., Wilkie, A.O., Wilkins, M.R., Williamson, C., Wilson, B.T., Wong, E.K., Wood, N., Wood, Y., Woods, C.G., Woodward, E.R., Wort, S.J., Worth, A., Wright, M., Yates, K., Yong, P.F., Young, T., Yu, P., Yu-Wai-Man, P., Zlamalova, E., Kingston, N., Walker, N., Penkett, C.J., Freson, K., Stirrups, K.E., Raymond, F.L., 2020. Whole-genome sequencing of patients with rare diseases in a national health system. Nature 583, 96–102. URL: https://www.nature.com/articles/s41586-020-2434-2, doi:10.1038/s41586-020-2434-2.
  • Vinker et al. (2022) Vinker, Y., Pajouheshgar, E., Bo, J.Y., Bachmann, R.C., Bermano, A.H., Cohen-Or, D., Zamir, A., Shamir, A., 2022. CLIPasso. ACM Trans. Graphics 41, 1–11. URL: https://doi.org/10.1145/3528223.3530068, doi:10.1145/3528223.3530068.
  • Wang et al. (2013) Wang, Y., Wang, L., Li, Y., He, D., Liu, T.Y., 2013. A theoretical analysis of NDCG type ranking measures, in: Shalev-Shwartz, S., Steinwart, I. (Eds.), Proceedings of the 26th Annual Conference on Learning Theory, PMLR, Princeton, NJ, USA. pp. 25–54. URL: https://proceedings.mlr.press/v30/Wang13.html.
  • Zhang et al. (2020) Zhang, Z., Zhang, Y., Feng, R., Zhang, T., Fan, W., 2020. Zero-shot sketch-based image retrieval via graph convolution network, in: Proceedings of the AAAI Conference on Artificial Intelligence, AAAI. pp. 12943–12950. URL: https://doi.org/10.1609/aaai.v34i07.6993, doi:10.1609/aaai.v34i07.6993.
  • Zheng et al. (2018) Zheng, L., Yang, Y., Tian, Q., 2018. SIFT meets CNN: A decade survey of instance retrieval. IEEE Trans. Pattern Anal. Mach. Intell. 40, 1224–1244. URL: https://doi.org/10.1109/tpami.2017.2709749, doi:10.1109/tpami.2017.2709749.
  • Zhong et al. (2021) Zhong, A., Li, X., Wu, D., Ren, H., Kim, K., Kim, Y., Buch, V., Neumark, N., Bizzo, B., Tak, W.Y., Park, S.Y., Lee, Y.R., Kang, M.K., Park, J.G., Kim, B.S., Chung, W.J., Guo, N., Dayan, I., Kalra, M.K., Li, Q., 2021. Deep metric learning-based image retrieval system for chest radiograph and its clinical applications in covid-19. Med. Image Anal. 70, 101993. URL: https://doi.org/10.1016/J.MEDIA.2021.101993, doi:10.1016/J.MEDIA.2021.101993.
Figure 1: Example results of Test-1 for gliomas. a The presented image in Q.2 of Test-1 showed the highest recall@5 for gliomas. Three example user queries and corresponding top-5 retrieved images are shown. The retrieved images are arranged from left to right, starting with the most similar image. The letters S, L, and P in the upper left corner of each retrieved image indicate the judged consistencies of the size (S), location (L), and pattern of contrast enhancement (P), respectively. The number in the lower right corner indicates the given similarity score, with a maximum score of 3/3. The retrieved images highlighted with red boxes are the images belonging to the same volume as the presented images (i.e., the same-volume images). The corresponding retrieval result by the query-by-example approach is shown at the bottom. b The presented image in Q.3 of Test-1 showed the lowest recall@5 for gliomas. A failed case is shown in the middle row, where none of the retrieved images are highlighted in a red box. This is presumably the effect of overdiagnosis of the point-like contrast-enhanced region in the tumor drawn only in the middle user query (see the arrow), consistent with a skill-based limitation. T1, T1-weighted sequence; T1CE, T1-weighted contrast-enhanced sequence; FLAIR, fluid-attenuated inversion recovery sequence.
Figure 2: Example results of Test-1 for lung cancers. a The presented image in Q.4 of Test-1 showed the highest recall@5 for lung cancers. Three example user queries and the corresponding top-5 retrieved images are shown. The retrieved images are arranged from left to right, starting with the most similar image. The letters S and L in the upper left corner of each retrieved image indicate the judged consistencies of the size (S) and location (L), respectively. The number in the lower right corner indicates the given similarity score, with a maximum score of 2/2. The retrieved images highlighted with red boxes are the images belonging to the same volume as the presented images (i.e., the same-volume images). The corresponding retrieval result by the query-by-example approach is shown at the bottom. b The presented image in Q.1 of Test-1 showed the lowest recall@5 for lung cancers. All three cases shown here failed to retrieve the presented images, which may have been caused by a sketch-based limitation, meaning that the single class of tumor-associated labels was insufficient to express the variable internal characteristics of lung cancers. CT, computed tomography.
Figure 3: Example results of Test-2 for gliomas. a The presented image in Q.3 of Test-2 showed the highest precision@5 for gliomas. Three example user queries and corresponding top-5 retrieved images are shown. The retrieved images are arranged from left to right, starting with the most similar image. The letters S, L, and P in the upper left corner of each retrieved image indicate the judged consistencies of the size (S), location (L), and pattern of contrast enhancement (P), respectively. The number in the lower right corner indicates the given similarity score, with a maximum score of 3/3. Note that even though the evaluators were presented with the same sentence independently, the user queries resembled each other in terms of their successful retrieval of clinically similar images, as shown by the similarity scores of 3/3. b The presented image in Q.5 of Test-2 showed the lowest precision@5 for gliomas. It can be seen that the semantic sketches differed in terms of how each evaluator expressed the tumor extension beyond the corpus callosum, which could represent a skill-based limitation if this difference had influenced the search results. T1CE, T1-weighted contrast-enhanced sequence.
Figure 4: Example results of Test-2 for lung cancers. a The presented image in Q.3 of Test-2 showed the highest precision@5 for lung cancers. Three example user queries and the corresponding top-5 retrieved images are shown. The retrieved images are arranged from left to right, starting with the most similar image. The letters S and L in the upper left corner of each retrieved image indicate the judged consistencies of the size (S) and location (L), respectively. The number in the lower right corner indicates the given similarity score, with a maximum score of 2/2. Note that even though the evaluators were presented with the same sentence independently, the user queries resembled each other and achieved successful retrieval of clinically similar images, as shown by the similarity scores of 2/2. b The presented image in Q.4 of Test-2 showed the lowest precision@5 for lung cancers. The low precision@5 may have been caused by a template-image-based limitation, suggesting that detailed anatomical locations, such as the pulmonary hilar region, were not completely learned by the model, which was trained in a self-supervised manner to learn the normal anatomy.
Figure 5: Example results of Test-3 for gliomas. a The presented image, which was an isolated sample, in Q.1 of Test-3 showed the highest recall@5 for gliomas. Three example user queries and the corresponding top-5 retrieved images are shown. The retrieved images are arranged from left to right, starting with the most similar image. Retrieved images highlighted with red boxes are the images belonging to the same volume as the presented isolated sample (i.e., the same-volume images). Note that three different user queries successfully retrieved the same-volume images. b The presented image in Q.2 of Test-3 showed the lowest recall@5 for gliomas. The failed case is shown in the middle row, where none of the retrieved images are highlighted in a red box. This failure may be because the peritumoral edema was not sketched in the user query (see the arrow), in contrast to the other user queries, possibly indicating a skill-based limitation. T1, T1-weighted sequence; T1CE, T1-weighted contrast-enhanced sequence; FLAIR, fluid-attenuated inversion recovery sequence.
Figure 6: Example results of Test-3 for lung cancers. a The presented image, which was an isolated sample, in Q.4 of Test-3 showed the highest recall@5 for lung cancers. Three example user queries and corresponding top-5 retrieved images are shown. The retrieved images are arranged from left to right, starting with the most similar image. Retrieved images highlighted with red boxes are the images belonging to the same volume as the presented isolated sample (i.e., the same-volume images). Note that three different user queries successfully retrieved the same-volume images. b The presented image in Q.2 of Test-3 showed the lowest recall@5 for lung cancers. Even though the three user queries were relatively similar, only the query in the middle row successfully retrieved the same-volume image. This may be attributed to a template-image-based limitation owing to the large difference in the body size between the presented image and the template image. CT, computed tomography.