Sketch-based Medical Image Retrieval

Kazuma Kobayashi [email protected] Lin Gu [email protected] Ryuichiro Hataya [email protected] Takaaki Mizuno [email protected]
Mototaka Miyake
[email protected]
Hirokazu Watanabe [email protected] Masamichi Takahashi [email protected] Yasuyuki Takamizawa [email protected]
Yukihiro Yoshida
[email protected]
Satoshi Nakamura [email protected] Nobuji Kouno [email protected] Amina Bolatkan [email protected] Yusuke Kurose [email protected]
Tatsuya Harada
[email protected]
Ryuji Hamamoto [email protected] Division of Medical AI Research and Development, National Cancer Center Research Institute,
5-1-1 Tsukiji, Chuo-ku, Tokyo 104-0045, Japan
Cancer Translational Research Team, RIKEN Center for Advanced Intelligence Project,
1-4-1 Nihonbashi, Chuo-ku, Tokyo 103-0027, Japan
Machine Intelligence for Medical Engineering Team, RIKEN Center for Advanced Intelligence Project,
1-4-1 Nihonbashi, Chuo-ku, Tokyo 103-0027, Japan
Research Center for Advanced Science and Technology, The University of Tokyo,
4-6-1 Komaba, Meguro-ku, Tokyo 153-8904, Japan
Medical Data Deep Learning Team, Advanced Data Science Project, RIKEN Information R&D and Strategy Headquarters,
1-4-1 Nihonbashi, Chuo-ku, Tokyo 103-0027, Japan
Department of Experimental Therapeutics, National Cancer Center Hospital,
5-1-1 Tsukiji, Chuo-ku, Tokyo 104-0045, Japan
Department of Diagnostic Radiology, National Cancer Center Hospital,
5-1-1 Tsukiji, Chuo-ku, Tokyo 104-0045, Japan
Department of Neurosurgery and Neuro-Oncology, National Cancer Center Hospital,
5-1-1 Tsukiji, Chuo-ku, Tokyo 104-0045, Japan
Department of Colorectal Surgery, National Cancer Center Hospital,
5-1-1 Tsukiji, Chuo-ku, Tokyo 104-0045, Japan
Department of Thoracic Surgery, National Cancer Center Hospital,
5-1-1 Tsukiji, Chuo-ku, Tokyo 104-0045, Japan
Radiation Safety and Quality Assurance Division, National Cancer Center Hospital,
5-1-1 Tsukiji, Chuo-ku, Tokyo 104-0045, Japan
Division of Research and Development for Boron Neutron Capture Therapy, National Cancer Center, Exploratory Oncology Research & Clinical Trial Center,
5-1-1 Tsukiji, Chuo-ku, Tokyo 104-0045, Japan
Medical Physics Laboratory, Division of Health Science, Graduate School of Medicine, Osaka University
Yamadaoka 1-7, Suita-shi, Osaka 565-0871, Japan
Department of Surgery, Kyoto University Graduate School of Medicine
54 Shogoin Kawahara-cho, Sakyo-ku, Kyoto 606-8507, Japan
(December 2022)
Abstract

The amount of medical images stored in hospitals is increasing faster than ever; however, utilization of the accumulated medical images has been limited. This is because existing content-based medical image retrieval (CBMIR) systems usually require example images to construct query vectors, yet example images cannot always be prepared. Moreover, some images have rare characteristics that make it difficult to find similar example images; we call these isolated samples. Here, we introduce a novel sketch-based medical image retrieval (SBMIR) system that enables users to find images of interest without example images. The key idea lies in the feature decomposition of medical images, whereby the entire feature of a medical image can be decomposed into and reconstructed from normal and abnormal features. Extending this idea, our SBMIR system provides an easy-to-use two-step graphical user interface: users first select a template image to specify a normal feature and then draw a semantic sketch of the disease on the template image to represent an abnormal feature. The system then integrates the two kinds of input into a query vector and retrieves reference images with the closest reference vectors. Ten healthcare professionals with various clinical backgrounds participated in user tests on two datasets. As a result, our SBMIR system enabled users to overcome previous challenges, including image retrieval based on fine-grained image characteristics, image retrieval without example images, and image retrieval for isolated samples. Our SBMIR system achieves flexible medical image retrieval on demand, thereby expanding the utility of medical image databases.

keywords:
Sketch-based image retrieval, content-based image retrieval, feature decomposition, query by sketch, query by example

Abbreviations: 2D, two-dimensional; 3D, three-dimensional; AC, anatomy code; CBMIR, content-based medical image retrieval; CT, computed tomography; ED, peritumoral edema; ET, gadolinium-enhancing tumor; FLAIR, fluid-attenuated inversion recovery sequence; GAN, generative adversarial network; Gd, gadolinium; GUI, graphical user interface; KL, Kullback–Leibler; MRI, magnetic resonance imaging; nDCG, normalized discounted cumulative gain; NET, necrotic and non-enhancing tumor core; NN, nearest neighbor; PT, primary tumor; SBMIR, sketch-based medical image retrieval; SVM, support vector machine; T1, T1-weighted sequence; T1CE, T1-weighted contrast-enhanced sequence; VAE, variational autoencoder

1 Introduction

As the amount of medical images stored in hospital databases is increasing much faster than healthcare professionals can manage, the need for effective medical image retrieval has become greater owing to its potential in patient care, research, and development (Quellec et al., 2011; Allan et al., 2012; Tschandl et al., 2020; Chen et al., 2022). Indeed, healthcare professionals refer not only to evidence in the literature but also to the accumulated past cases because they often provide valuable insights into what differentiates individual patients from common characteristics in the patient population. Because the radiographic phenotype of a disease is closely related to its diagnosis, treatment response, and prognosis (Aerts et al., 2014), content-based medical image retrieval (CBMIR), which can calculate the similarity of medical images based on image contents, has been the mainstay for the development of medical image retrieval.

To date, CBMIR has been treated as a computer science problem to devise how to measure the similarity between medical images by capturing the unstructured nature of clinical findings. Based on a query-by-example approach (Pinho et al., 2019), most CBMIR systems exploit an example image as a query image, from which a query vector representing requested information is extracted (see the left part of Fig. 1a). As a search result, reference images with the closest reference vectors to the query vector are retrieved from a database. CBMIR emerged as a method to calculate the similarity based on visual features such as color, texture, shape, and spatial relationships among regions of interest (Long et al., 2003; Li et al., 2018). More recently, it has incorporated deep learning (LeCun et al., 2015) to efficiently extract semantic features from example images in order to create query vectors (Hosny et al., 2018). This achieves better results by minimizing the semantic gap between high-level semantic concepts and low-level visual features in medical images (Zheng et al., 2018).

However, the information-seeking objectives of healthcare professionals can be more diverse than what simply calculating the similarity between the query and reference images can handle. For example, to validate the clinical decision for a present case, healthcare professionals may need not only past cases with exactly the same clinical findings in the same anatomical location but also cases with similar findings in a different location or with different findings in the same location. In addition, suppose that one needs to search for images with particular clinical findings in medical image repositories on the Internet, which have recently become popular (Prior et al., 2017). In such cases, it is usually difficult for users to find the right image to initiate the search because they cannot prepare an example image in advance. Furthermore, clinical medicine often places high reference value on rare cases, including rare diseases and rare clinical findings (Turro et al., 2020). Nevertheless, retrieving such relevant images from a database can be challenging for healthcare professionals because similar example images are hardly available due to their rarity.

Based on these considerations, we focus on two potential limitations of conventional CBMIR systems that use the query-by-example approach: the usability problem and the searchability problem. Usability refers to a qualitative attribute representing how easily users can satisfy their information-seeking objectives whenever necessary. Because query-by-example approaches depend on example images, searching is challenging in situations where no example image is available in advance. In addition, adding or subtracting arbitrary features to or from a query vector extracted from an example image is difficult, hindering users from refining their queries based on the search results. Searchability, in turn, indicates the scope of potentially retrievable images among all images stored in a database. Consider a latent space in which feature vectors corresponding to image characteristics are distributed. Whereas a typical image is surrounded by other typical images in its vicinity, a rare image may be located far away from the others owing to its unique characteristics; we call such an image an isolated sample in a database. An isolated sample is unlikely to appear in the nearest neighborhood of commonly available example images, implying that the rarer the case, the more difficult its retrieval can be. Using ResNet-101 features (He et al., 2016) (see A), we demonstrate in B that medical image datasets contain a substantial number of isolated samples.

To mitigate these problems, one alternative could be a query-by-sketch approach (Sangkloy et al., 2016; Li and Li, 2018; Dutta and Akata, 2020; Zhang et al., 2020; Vinker et al., 2022; Bhunia et al., 2022), which has achieved notable success in the field of computer vision. A user’s query sketch can convey the shape, orientation, and fine-grained details of objects without preparing example images. Nevertheless, sketch-based queries have not been demonstrated in medical image retrieval, perhaps because sketching all anatomical information seems too laborious for a real-world application. Therefore, a practical CBMIR system that does not require users to provide example images or to comprehensively sketch anatomical features is needed.

Figure 1: Overview of the algorithm for the sketch-based medical image retrieval (SBMIR) system. a The conventional query-by-example approach in content-based medical image retrieval extracts a query vector from a query image as an example; nevertheless, it is not always easy to prepare an example image when users need to refer to images with specific findings. Here, we introduce a feasible query-by-sketch approach in which users can specify a query vector by selecting a template image and by sketching the findings of interest onto the template image. b The technical basis is feature decomposition of medical images, where the entire feature (the entire anatomy code [AC]) of a medical image can be decomposed into and reconstructed from a normal feature (a normal AC) and an abnormal feature (an abnormal AC). c In our SBMIR system, a query vector is composed of a normal AC extracted from a selected template image and an abnormal AC approximated from a sketched label, while reference vectors are automatically computed from reference images. The reference images with the reference vector closest to the query vector are obtained as search results. AC, anatomy code.

1.1 SBMIR: sketch-based medical image retrieval system

Here, we introduce a feasible query-by-sketch approach that minimizes the effort required for sketching (see the right part of Fig. 1a), which we used to establish the SBMIR system. The underlying assumption is as follows. Individual disease phenotypes are diverse and indefinite, whereas the surrounding normal anatomy shares many common features within a given population. Hence, if we can specify the semantic features of a disease with sketches and those of the surrounding normal anatomy by selecting a template image that approximates the normal anatomical features, we can construct a query vector conveying the target image content with its desired anatomical location. Based on this assumption, our SBMIR system uses two different modalities–a template image and a semantic sketch of the disease–to construct a query vector. The technological basis for constructing a query vector from these different modalities is the feature decomposition of medical images (Kobayashi et al., 2021), whereby the entire feature vector of a medical image, which is referred to as an entire anatomy code (AC), can be decomposed into and reconstructed from two semantically different feature vectors, a normal AC and an abnormal AC (see Fig. 1b). The relationship among the three ACs can be formulated as follows:

\mathrm{Entire\,AC}=\mathrm{Normal\,AC}+\mathrm{Abnormal\,AC}. \quad (1)

As demonstrated in Fig. 1b, the normal AC represents the counterfactual normal anatomy that should have existed in the absence of an abnormality, and the abnormal AC represents any abnormality as a residual from the normal baseline. By extending this concept, we devised a deep-learning-based algorithm to extract a normal AC from a template image and an abnormal AC from a semantic sketch of the disease to construct a query vector by adding them together.

Our SBMIR system consists of a feature extraction module and a similarity calculation module (see Fig. 1c). A query vector is calculated through the following two-step user operation (see the left part of Fig. 1c). First, users select a two-dimensional (2D) template image by slicing through a three-dimensional (3D) template volume (see “Step 1” in Fig. 2a for an example). The 3D template volume comprises a series of 2D slices of a specific organ (e.g., brain) or anatomical region (e.g., chest). We assume that normal anatomy varies only within a small range in a given population, so users can specify the location of a disease by selecting, as the template image, the slice that matches the area where the clinical findings should be present. Second, users sketch semantic segmentation labels representing the disease on the selected template image (see “Step 2” in Fig. 2a for an example). These semantic segmentation labels are predefined and then learned for each specific clinical finding in medical images (e.g., a segmentation label representing a tumor region or a particular component of a disease, such as a necrotic region) as the image content that should be located therein. The feature extraction module then extracts a normal AC from the template image and an abnormal AC from the semantic sketch of the disease, and the two are summed to give the query vector according to Eq. 1. As a result, users obtain the reference images with the reference vectors closest to the query vector, as computed by the similarity calculation module (see the middle part of Fig. 1c). These reference vectors are calculated in advance from the reference images (see the right part of Fig. 1c). Note that each of the retrieved top-$K$ similar images belongs to a different reference volume (e.g., an individual magnetic resonance imaging [MRI] or computed tomography [CT] scan) to avoid redundancy between consecutive 2D slices within a 3D volume.
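
A minimal sketch of this two-step query construction and volume-wise retrieval, assuming trained encoder functions with the interfaces described above (the function and argument names, and the use of the posterior mean as the normal AC, are illustrative assumptions, not the released API):

```python
import torch

def build_query_vector(template_image: torch.Tensor,
                       sketch_label: torch.Tensor,
                       normal_ac_encoder,
                       label_encoder) -> torch.Tensor:
    """Construct a query vector from a template image and a semantic sketch.

    Following Eq. 1, the query is the sum of a normal AC extracted from the
    selected template slice and an abnormal AC estimated from the sketched
    segmentation label.
    """
    with torch.no_grad():
        mu, _ = normal_ac_encoder(template_image.unsqueeze(0))  # posterior mean as normal AC (assumption)
        abnormal_ac = label_encoder(sketch_label.unsqueeze(0))  # abnormal AC from the sketch
    return (mu + abnormal_ac).squeeze(0)                        # entire AC used as the query vector

def retrieve_top_k(query_vec: torch.Tensor,
                   reference_vecs: torch.Tensor,   # (num_slices, dim), precomputed
                   volume_ids: list,               # reference volume of each slice
                   k: int = 5):
    """Return the k closest slices, keeping at most one slice per reference volume."""
    dists = torch.cdist(query_vec.unsqueeze(0), reference_vecs).squeeze(0)
    best_per_volume = {}
    for idx in torch.argsort(dists).tolist():
        vol = volume_ids[idx]
        if vol not in best_per_volume:              # keep only the closest slice of each volume
            best_per_volume[vol] = (idx, dists[idx].item())
        if len(best_per_volume) == k:
            break
    return list(best_per_volume.values())
```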

For model training and evaluation, a dataset of brain MRI scans with gliomas and a dataset of chest CT scans with lung cancers were used. To show that our SBMIR system is easy for healthcare professionals to use and can mitigate the usability and searchability problems of conventional CBMIR systems, 10 healthcare professionals with various clinical backgrounds participated in user tests based on a dedicated graphical user interface (GUI) (see Fig. 2b). The evaluators completed a practice stage followed by three testing stages: Test-1 assessed the image retrieval performance when example images were available, Test-2 assessed the image retrieval performance without example images, and Test-3 assessed the image retrieval performance for isolated samples. Supplementary Video 1 demonstrates our SBMIR system in action. In addition, we will soon release the source code and materials to show researchers how our SBMIR system works.

The main contributions of this study can be summarized into the following:

  • 1.

    By extending the concept of the feature decomposition of medical images, we devised an algorithm that realizes a feasible query-by-sketch approach for constructing a query vector, requiring neither an example image nor a detailed sketch of all anatomical structures.

  • 2.

    We implemented the first SBMIR system that achieves flexible medical image retrieval on demand through an easy-to-use two-step user operation, the search results of which can change according to which template image is selected and how the disease is sketched.

  • 3.

    The user test showed that our SBMIR system could overcome the usability and searchability problems of conventional CBMIR systems through better image retrieval according to fine-grained image characteristics (Test-1), image retrieval without example images (Test-2), and image retrieval for isolated samples (Test-3).

Figure 2: Illustration of the software demonstrating our sketch-based medical image retrieval system. a A user query is constructed through the following two-step user operation. First, a template image is selected to specify the location where a disease should exist, and second, a semantic segmentation label of the disease is sketched therein. b An easy-to-use graphical user interface was developed for the user tests. In the left field, an image is presented as a question item in Test-1. In the middle field, a ring-enhancing brain tumor surrounded by peritumoral edema located in the left temporal pole is drawn on a selected template image by an evaluator, the intention of which is to correspond with the presented image. The resulting top-5 retrieved images showing the corresponding reference images extracted from the testing dataset are listed in the right field. Each evaluator judges the consistency of the given criteria by checking the checkboxes. The orange highlighting of the retrieved images indicates that the image belongs to the same volume as the presented image (i.e., the same-volume image). CE, contrast enhancement; Gd, gadolinium; T1, T1-weighted sequence; T1CE, T1-weighted contrast-enhanced sequence; FLAIR, fluid-attenuated inversion recovery sequence; SEG, tumor-associated segmentation labels.

2 Algorithm

Our SBMIR system has a unique feature extraction module at its core, which is based on the feature decomposition of medical images extended with a semantically organized latent space. This extension is critical to achieving a practical SBMIR system, as we demonstrate in E. Here, we describe the training algorithm of the feature extraction module, which is combined with the similarity calculation module to realize our real-time, large-scale image retrieval system.

Figure 3: Basic components of the feature extraction module. The deep-learning framework for feature extraction consists of three parts. a A variational autoencoder, consisting of a normal anatomy code (AC) encoder and an image decoder, learns normal ACs using sub-batches comprising only healthy images. b An encoder–decoder type of segmentation network, consisting of an abnormal AC encoder and a label decoder, acquires abnormal ACs through the prediction of segmentation labels for the abnormal lesions. c An encoding network, consisting of a label encoder, maps ground-truth segmentation labels representing abnormal lesions in order to estimate the corresponding abnormal ACs that are output by the abnormal AC encoder using a corresponding diseased image as input.

2.1 Network architecture for feature decomposition

The feature extraction module of our SBMIR system is a deep-learning framework that performs feature decomposition of medical images (Kobayashi et al., 2021) with extended mapping functions between the image space $\mathcal{X}$ and the label space $\mathcal{L}$ via the latent space $\mathcal{Z}$ ($\mathcal{X}\leftrightarrow\mathcal{Z}\leftrightarrow\mathcal{L}$). Training the neural networks requires a dataset consisting of pairs of an image and a corresponding segmentation label of tumor-associated regions, $(\bm{x},\bm{l})\in\mathcal{X}\times\mathcal{L}$. Below, we assume that the image space $\mathcal{X}$ has a subspace of healthy images $\mathcal{X}^{\mathrm{h}}$ and a subspace of diseased images $\mathcal{X}^{\mathrm{d}}$. Although feature decomposition of medical images often involves elaborate algorithms such as conditional generation using generative adversarial networks (GANs) (Goodfellow et al., 2014; Liu et al., 2022), our SBMIR system achieves this objective by combining three simple components without adversarial training, which stabilizes the training process.

2.1.1 Variational auto-encoder for learning normal ACs

As illustrated in Fig. 3a, the first component of our feature extraction module is a variational autoencoder (VAE) (Kingma and Welling, 2014; Rezende et al., 2014) that learns the mapping function of the normal ACs $\bm{n}$ between the image space and the latent space ($\mathcal{X}\xrightarrow{\bm{n}}\mathcal{Z}\xrightarrow{\bm{n}}\mathcal{X}$). This VAE consists of a pair of a normal AC encoder $E_{\mathrm{NAC}}$ and an image decoder $D_{\mathrm{Img}}$, which together take a healthy image $\bm{x}^{\mathrm{h}}\sim\mathcal{X}^{\mathrm{h}}$ as input and output a reconstructed image $\hat{\bm{x}}^{\mathrm{h}}$ by simultaneously producing a normal AC $\bm{n}$ in the latent space ($D_{\mathrm{Img}}(E_{\mathrm{NAC}}(\bm{x}^{\mathrm{h}}))=D_{\mathrm{Img}}(\bm{n})=\hat{\bm{x}}^{\mathrm{h}}$). Based on the idea that the distribution of normal ACs $\bm{n}$ should lie within a certain range, reflecting the limited range of normal anatomic variation within a population, we impose an isotropic multivariate Gaussian $\mathcal{N}(\bm{n};\bm{0},\bm{I})$ as a prior distribution over the normal ACs ($p(\bm{n})=\mathcal{N}(\bm{0},\bm{I})$). As such, using the two output variables $\bm{\mu}$ and $\bm{\sigma}$, the posterior distribution estimated by the encoder, $E_{\mathrm{NAC}}(\bm{n}\,|\,\bm{x}^{\mathrm{h}})=\mathcal{N}(\bm{\mu}(\bm{x}^{\mathrm{h}}),\bm{\sigma}(\bm{x}^{\mathrm{h}}))$, is forced to be close to the prior by minimizing the Kullback–Leibler (KL) divergence between the prior and posterior distributions.
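
A compact sketch of such a VAE component follows; layer sizes, channel counts, and the latent dimension are placeholders rather than the values used in the paper:

```python
import torch
import torch.nn as nn

class NormalACEncoder(nn.Module):
    """Encodes an image into the parameters of the normal-AC posterior N(mu, sigma)."""
    def __init__(self, in_ch=3, latent_dim=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(in_ch, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.fc_mu = nn.Linear(64, latent_dim)
        self.fc_logvar = nn.Linear(64, latent_dim)

    def forward(self, x):
        h = self.backbone(x)
        return self.fc_mu(h), self.fc_logvar(h)

def sample_normal_ac(mu, logvar):
    """Reparameterization trick: n = mu + sigma * eps."""
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * logvar) * eps

def kl_regularization(mu, logvar):
    """KL divergence between N(mu, sigma^2) and the standard Gaussian prior (cf. Eq. 3)."""
    return -0.5 * torch.mean(torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1))
```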

2.1.2 Segmentation network for learning abnormal ACs

Figure 3b shows the second component of our feature extraction module, an encoder–decoder type of segmentation network that learns the mapping function for the abnormal ACs $\bm{a}$ along the sequence $\mathcal{X}\xrightarrow{\bm{a}}\mathcal{Z}\xrightarrow{\bm{a}}\mathcal{L}$, based on a pair of a diseased image $\bm{x}^{\mathrm{d}}\sim\mathcal{X}^{\mathrm{d}}$ and a segmentation label of tumor-associated regions $\bm{l}^{\mathrm{d}}$. The abnormal ACs $\bm{a}$ are acquired as the outputs of an abnormal AC encoder $E_{\mathrm{AAC}}$ and are decoded into a semantic segmentation label $\hat{\bm{l}}^{\mathrm{d}}$ through a label decoder $D_{\mathrm{Lbl}}$, as follows: $D_{\mathrm{Lbl}}(E_{\mathrm{AAC}}(\bm{x}^{\mathrm{d}}))=D_{\mathrm{Lbl}}(\bm{a})=\hat{\bm{l}}^{\mathrm{d}}$. Note that there is no skip connection between the abnormal AC encoder $E_{\mathrm{AAC}}$ and the label decoder $D_{\mathrm{Lbl}}$. Therefore, through training with segmentation losses between $\hat{\bm{l}}^{\mathrm{d}}$ and $\bm{l}^{\mathrm{d}}$, the abnormal ACs $\bm{a}$ can be optimized to encode the semantic features particularly relevant to the tumor-associated regions.
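
A minimal sketch of this bottleneck segmentation network (no skip connections between encoder and decoder; channel counts, the latent dimension, and the number of label classes are illustrative assumptions):

```python
import torch.nn as nn

class AbnormalACEncoder(nn.Module):
    """Maps an image to an abnormal AC (a single latent vector)."""
    def __init__(self, in_ch=3, latent_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, latent_dim))

    def forward(self, x):
        return self.net(x)

class LabelDecoder(nn.Module):
    """Decodes an abnormal AC into a semantic segmentation label of tumor-associated regions."""
    def __init__(self, latent_dim=128, n_classes=4, out_size=256):
        super().__init__()
        self.fc = nn.Linear(latent_dim, 64 * 8 * 8)
        self.up = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, n_classes, 4, stride=2, padding=1),
            nn.Upsample(size=(out_size, out_size), mode='bilinear', align_corners=False))

    def forward(self, a):
        h = self.fc(a).view(-1, 64, 8, 8)
        return self.up(h)   # per-class logits; no skip connections from the encoder
```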

2.1.3 Encoding network to estimate abnormal ACs from semantic segmentation labels

Figure 3c describes the third component of our feature extraction module, a label encoder $E_{\mathrm{Lbl}}$ that enables the mapping function from the label space to the latent space with respect to the abnormal ACs $\bm{a}$ ($\mathcal{L}\xrightarrow{\bm{a}}\mathcal{Z}$). Specifically, the label encoder $E_{\mathrm{Lbl}}$ estimates an abnormal AC $\hat{\bm{a}}$ from a ground-truth segmentation label $\bm{l}^{\mathrm{d}}$ as input ($E_{\mathrm{Lbl}}(\bm{l}^{\mathrm{d}})=\hat{\bm{a}}$), acting as an inverse function of the label decoder ($D_{\mathrm{Lbl}}(\bm{a})=\hat{\bm{l}}^{\mathrm{d}}$). In the training of the label encoder $E_{\mathrm{Lbl}}$, the corresponding abnormal AC $\bm{a}$ that gives the segmentation prediction closest to the ground-truth label through the label decoder $D_{\mathrm{Lbl}}$ serves as the ground truth for the estimated abnormal AC $\hat{\bm{a}}$.
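
A sketch of the label encoder and its consistency objective; detaching the target abnormal AC mirrors the stop-gradient in Algorithm 1, and all names are illustrative:

```python
import torch.nn as nn
import torch.nn.functional as F

class LabelEncoder(nn.Module):
    """Estimates an abnormal AC directly from a (sketched or ground-truth) segmentation label."""
    def __init__(self, n_classes=4, latent_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(n_classes, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, latent_dim))

    def forward(self, label):
        return self.net(label)

def consistency_loss(estimated_a, target_a):
    """L1 distance between the estimated abnormal AC and the encoder-derived target.
    The target is detached so the label encoder is pulled toward the abnormal AC,
    not the other way around."""
    return F.l1_loss(estimated_a, target_a.detach())
```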

2.1.4 The whole network architecture

Figure 4: Overall architecture of the feature extraction module for our sketch-based medical image retrieval system. The deep-learning framework can learn normal anatomy codes (ACs), abnormal ACs, and entire ACs according to each semantic concept and mapping function between the image space and the label space via the latent space. The training is performed differently, depending on whether the input images are healthy or diseased. Particularly, the VAE component is switched to be trainable when healthy images are given, whereas its inference results are utilized when diseased images are given. This is reasonable because there are no ground-truth pseudo-normal images corresponding to the diseased images.

By combining these components, the deep-learning framework for the feature extraction module in our SBMIR system is trained as illustrated in Fig. 4. The training process changes according to whether the input image is healthy ($\bm{x}^{\mathrm{h}}\sim\mathcal{X}^{\mathrm{h}}$) or diseased ($\bm{x}^{\mathrm{d}}\sim\mathcal{X}^{\mathrm{d}}$). Notably, the VAE component is trained only when healthy images $\bm{x}^{\mathrm{h}}$ are given as input, whereas only its inference results are used when diseased images $\bm{x}^{\mathrm{d}}$ are given.

When a healthy image $\bm{x}^{\mathrm{h}}$ is given as input, the VAE component is switched to be trainable (see the upper part of Fig. 4). At the encoding step, from the healthy image $\bm{x}^{\mathrm{h}}$, the normal AC encoder $E_{\mathrm{NAC}}$ estimates the posterior distribution of the normal AC ($E_{\mathrm{NAC}}(\bm{n}\,|\,\bm{x}^{\mathrm{h}})=\mathcal{N}(\bm{\mu}(\bm{x}^{\mathrm{h}}),\bm{\sigma}(\bm{x}^{\mathrm{h}}))$), and the abnormal AC encoder $E_{\mathrm{AAC}}$ outputs the abnormal AC $\bm{a}$ in the latent space ($E_{\mathrm{AAC}}(\bm{x}^{\mathrm{h}})=\bm{a}$). A normal AC $\bm{n}$ can be sampled from $\mathcal{N}(\bm{n};\bm{\mu}(\bm{x}^{\mathrm{h}}),\bm{\sigma}(\bm{x}^{\mathrm{h}}))$ using the reparameterization trick (Kingma and Welling, 2014; Rezende et al., 2014), that is, $\bm{n}=\bm{\mu}+\bm{\sigma}\odot\bm{\epsilon}$, where $\bm{\epsilon}\sim\mathcal{N}(\bm{0},\bm{I})$ and $\odot$ denotes the Hadamard product. Then, an entire AC $\bm{e}$ is calculated as the sum of the normal AC $\bm{n}$ and the abnormal AC $\bm{a}$ ($\bm{e}=\bm{n}+\bm{a}$). Note that, because the input image does not include any abnormality, the abnormal AC $\bm{a}$ is trained to be the zero vector ($\bm{a}\approx\bm{0}$) so as not to convey any abnormal information about the image, making the entire AC $\bm{e}$ and the normal AC $\bm{n}$ effectively identical ($\bm{e}\approx\bm{n}$). At the decoding step, the normal AC $\bm{n}$ and the entire AC $\bm{e}$ are independently fed into the image decoder $D_{\mathrm{Img}}$ to reconstruct the same healthy input image $\hat{\bm{x}}^{\mathrm{h}}$ ($D_{\mathrm{Img}}(\bm{n})=\hat{\bm{x}}^{\mathrm{h}}$ and $D_{\mathrm{Img}}(\bm{e})=\hat{\bm{x}}^{\mathrm{h}}$). Additionally, the abnormal AC $\bm{a}$, which is trained to be the zero vector, is fed into the label decoder $D_{\mathrm{Lbl}}$ to generate a segmentation label $\hat{\bm{l}}^{\mathrm{h}}$ that is encouraged to match a ground-truth label $\bm{l}^{\mathrm{h}}$ filled with zeros ($D_{\mathrm{Lbl}}(\bm{a})=\hat{\bm{l}}^{\mathrm{h}}$). Finally, the ground-truth label $\bm{l}^{\mathrm{h}}$ is fed into the label encoder $E_{\mathrm{Lbl}}$, which is trained to estimate the corresponding abnormal AC $\hat{\bm{a}}$, which should likewise be the zero vector ($E_{\mathrm{Lbl}}(\bm{l}^{\mathrm{h}})=\hat{\bm{a}}$).

Conversely, when a diseased image $\bm{x}^{\mathrm{d}}$ is given as input, the VAE component is not used for learning, and only its inference results are utilized (see the lower part of Fig. 4). The assumption is that the normal ACs $\bm{n}$, trained only on healthy images $\bm{x}^{\mathrm{h}}$ to encode normal anatomical information, should be incapable of reconstructing abnormal lesions in diseased images $\bm{x}^{\mathrm{d}}$ (Schlegl et al., 2017). Thus, the inference result from the diseased image $\bm{x}^{\mathrm{d}}$ can serve as the normal AC $\bm{n}$ that matches a pseudo-normal image corresponding to the input image. The rest of the training process is essentially analogous to the process starting with a healthy image. At the encoding step, from the diseased image $\bm{x}^{\mathrm{d}}$, the normal AC encoder $E_{\mathrm{NAC}}$ infers the normal AC $\bm{n}$ ($E_{\mathrm{NAC}}(\bm{x}^{\mathrm{d}})=\bm{n}$), and the abnormal AC encoder $E_{\mathrm{AAC}}$ outputs the abnormal AC $\bm{a}$ ($E_{\mathrm{AAC}}(\bm{x}^{\mathrm{d}})=\bm{a}$). Then, the entire AC $\bm{e}$ is calculated as the sum of the normal AC $\bm{n}$ and the abnormal AC $\bm{a}$ ($\bm{e}=\bm{n}+\bm{a}$). At the decoding step, the entire AC $\bm{e}$ is fed into the image decoder $D_{\mathrm{Img}}$ to reconstruct the whole input image with its abnormal findings $\hat{\bm{x}}^{\mathrm{d}}$ ($D_{\mathrm{Img}}(\bm{e})=\hat{\bm{x}}^{\mathrm{d}}$). Note that reconstruction from the normal AC $\bm{n}$ is not performed because there is no ground truth for the pseudo-normal image corresponding to the input image. Then, the abnormal AC $\bm{a}$ is fed into the label decoder $D_{\mathrm{Lbl}}$ to predict the segmentation label of the abnormalities $\hat{\bm{l}}^{\mathrm{d}}$, which should match the ground-truth segmentation label $\bm{l}^{\mathrm{d}}$ ($D_{\mathrm{Lbl}}(\bm{a})=\hat{\bm{l}}^{\mathrm{d}}$). Finally, the ground-truth segmentation label $\bm{l}^{\mathrm{d}}$ is taken as input by the label encoder $E_{\mathrm{Lbl}}$ to estimate the corresponding abnormal AC $\hat{\bm{a}}$ ($E_{\mathrm{Lbl}}(\bm{l}^{\mathrm{d}})=\hat{\bm{a}}$).

2.2 Creating the semantically organized latent space

Figure 5: Semantically organized latent space. a The abnormal anatomy codes (ACs) extracted from healthy images should be close to zero vectors, such that the normal ACs and entire ACs should be identical. On the other hand, the abnormal ACs extracted from diseased images should contain meaningful information, such that the normal ACs and the entire ACs are distributed separately. b The normal ACs from healthy images, the entire ACs from healthy images, and the normal ACs from diseased images should overlap with each other, whereas the entire ACs from diseased images should be far away from the others by introducing a margin parameter.

The semantically organized latent space is critical for our SBMIR system to perform image retrieval based on semantics (i.e., whether an image is healthy or diseased). That is, a query vector conveying the information of a diseased region should retrieve only diseased images $\bm{x}^{\mathrm{d}}$, and one conveying no disease-region information should retrieve only healthy images $\bm{x}^{\mathrm{h}}$. We call this semantic consistency, which will be quantitatively evaluated later in E.1. To achieve this, the latent space $\mathcal{Z}$ needs to be separated into a subspace representing healthy images $\mathcal{Z}^{\mathrm{h}}$ (i.e., the healthy subspace) and another subspace representing diseased images $\mathcal{Z}^{\mathrm{d}}$ (i.e., the diseased subspace). In other words, our SBMIR system should enable the corresponding semantic mapping between the image space and the latent space as follows: $\mathcal{X}^{\mathrm{h}}\leftrightarrow\mathcal{Z}^{\mathrm{h}}$ and $\mathcal{X}^{\mathrm{d}}\leftrightarrow\mathcal{Z}^{\mathrm{d}}$. Here, we explain how the semantically organized latent space is configured.

2.2.1 Four latent distributions in the latent space

How the information conveyed by an abnormal AC $\bm{a}$ changes with the semantics is essential for configuring the semantically organized latent space. As shown in Fig. 5a, the entire AC $\bm{e}$ of a diseased image $\bm{x}^{\mathrm{d}}$ is represented as the sum of the normal AC $\bm{n}$ and the abnormal AC $\bm{a}$ ($\bm{e}=\bm{n}+\bm{a}$); in contrast, that of a healthy image $\bm{x}^{\mathrm{h}}$ can be approximated by the normal AC $\bm{n}$ alone ($\bm{e}\approx\bm{n}$), because the abnormal AC $\bm{a}$ should be the zero vector, reflecting the absence of abnormality ($\bm{a}\approx\bm{0}$). Therefore, there are four latent distributions in the latent space: the distribution of entire ACs from healthy images $\mathcal{D}(\bm{e}\,|\,\bm{x}^{\mathrm{h}})$, that of normal ACs from healthy images $\mathcal{D}(\bm{n}\,|\,\bm{x}^{\mathrm{h}})$, that of entire ACs from diseased images $\mathcal{D}(\bm{e}\,|\,\bm{x}^{\mathrm{d}})$, and that of normal ACs from diseased images $\mathcal{D}(\bm{n}\,|\,\bm{x}^{\mathrm{d}})$. Considering the semantics, the healthy subspace $\mathcal{Z}^{\mathrm{h}}$ should enclose the distribution of entire ACs from healthy images, that of normal ACs from healthy images, and that of normal ACs from diseased images ($\{\mathcal{D}(\bm{e}\,|\,\bm{x}^{\mathrm{h}}),\mathcal{D}(\bm{n}\,|\,\bm{x}^{\mathrm{h}}),\mathcal{D}(\bm{n}\,|\,\bm{x}^{\mathrm{d}})\}\subset\mathcal{Z}^{\mathrm{h}}$), because all of these distributions represent healthy images. On the other hand, the diseased subspace $\mathcal{Z}^{\mathrm{d}}$ includes the remaining distribution of entire ACs from diseased images ($\{\mathcal{D}(\bm{e}\,|\,\bm{x}^{\mathrm{d}})\}\subset\mathcal{Z}^{\mathrm{d}}$) because only this distribution represents diseased images. Note that, since the abnormal ACs $\bm{a}$ express only the amount of change from the healthy subspace $\mathcal{Z}^{\mathrm{h}}$ to the diseased subspace $\mathcal{Z}^{\mathrm{d}}$ (i.e., as shown in Fig. 1b, an abnormal AC does not carry enough information to reconstruct a whole image), the distributions of the abnormal ACs $\bm{a}$ are not taken into account.

2.2.2 Configuration of the healthy subspace

Figure 5b illustrates how the three distributions in the healthy subspace ($\{\mathcal{D}(\bm{e}\,|\,\bm{x}^{\mathrm{h}}),\mathcal{D}(\bm{n}\,|\,\bm{x}^{\mathrm{h}}),\mathcal{D}(\bm{n}\,|\,\bm{x}^{\mathrm{d}})\}\subset\mathcal{Z}^{\mathrm{h}}$) are trained to overlap with each other. During model training, the three distributions in the healthy subspace $\mathcal{Z}^{\mathrm{h}}$ are forced to follow the isotropic multivariate Gaussian distribution $\mathcal{N}(\bm{n};\bm{0},\bm{I})$ that is formulated by the VAE component to learn normal ACs $\bm{n}$. In particular, the distribution of normal ACs from healthy images $\mathcal{D}(\bm{n}\,|\,\bm{x}^{\mathrm{h}})$ is directly optimized to follow the Gaussian distribution during the training of the VAE component (see the process starting with a healthy image in Fig. 4). Additionally, the distribution of normal ACs from diseased images $\mathcal{D}(\bm{n}\,|\,\bm{x}^{\mathrm{d}})$ is ensured to follow the Gaussian distribution because each normal AC $\bm{n}$ from a diseased image $\bm{x}^{\mathrm{d}}$ is sampled from the posterior distribution of the VAE component as an inference result (see the process starting with a diseased image in Fig. 4). The distribution of entire ACs from healthy images $\mathcal{D}(\bm{e}\,|\,\bm{x}^{\mathrm{h}})$ can be indirectly optimized to follow the Gaussian distribution by forcing the abnormal ACs extracted from healthy images toward the zero vector ($\bm{a}\approx\bm{0}$), so that $\mathcal{D}(\bm{e}\,|\,\bm{x}^{\mathrm{h}})=\mathcal{D}(\bm{n}+\bm{a}\,|\,\bm{x}^{\mathrm{h}})\approx\mathcal{D}(\bm{n}\,|\,\bm{x}^{\mathrm{h}})$, where $\mathcal{D}(\bm{n}\,|\,\bm{x}^{\mathrm{h}})$ is trained to be the Gaussian distribution by the learning objective of the VAE component.
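
A minimal sketch of the abnormality constraint described above, which pushes abnormal ACs from healthy images toward the zero vector so that the entire ACs of healthy images collapse onto their normal ACs (the specific choice of norm here is an assumption; the text only states that the norm is driven to zero):

```python
import torch

def abnormality_loss(abnormal_ac: torch.Tensor) -> torch.Tensor:
    """Drive abnormal ACs from healthy images toward zero (a ~ 0), so that for healthy
    inputs the entire AC e = n + a effectively equals the normal AC n."""
    # Mean absolute value per batch; any vector norm that vanishes at zero would serve.
    return abnormal_ac.abs().mean()
```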

2.2.3 Configuration of the diseased subspace

Figure 5b also depicts how the distribution of entire ACs from diseased images in the diseased subspace ($\{\mathcal{D}(\bm{e}\,|\,\bm{x}^{\mathrm{d}})\}\subset\mathcal{Z}^{\mathrm{d}}$) should be separated from the Gaussian distribution $\mathcal{N}(\bm{0},\bm{I})$ that represents healthy images $\bm{x}^{\mathrm{h}}$. To achieve this, a margin parameter is provided as a hyperparameter that determines how far apart the two subspaces should be, indicated by the arrow labeled “Margin” in Fig. 5b. The distance between the distribution of entire ACs from diseased images $\mathcal{D}(\bm{e}\,|\,\bm{x}^{\mathrm{d}})$ and the Gaussian distribution $\mathcal{N}(\bm{0},\bm{I})$ is measured by the KL divergence and is optimized to exceed the margin parameter, which is formulated as a margin loss as follows:

L_{\mathrm{margin}}=\max\bigl(0,\,m-\mathrm{KL}(\mathcal{D}(\bm{e}\,|\,\bm{x}^{\mathrm{d}})\,\|\,\mathcal{N}(\bm{0},\bm{I}))\bigr). \quad (2)

Here, $m$ is the margin parameter, and $\mathrm{KL}(\cdot\|\cdot)$ denotes the KL divergence between two distributions. As the margin parameter increases, the two distributions are pushed farther apart in terms of KL divergence.
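
A sketch of the margin loss in Eq. 2, under the assumption that the batch of entire ACs from diseased images is summarized by a diagonal Gaussian before the KL divergence to $\mathcal{N}(\bm{0},\bm{I})$ is taken (this moment-matching step is one reasonable implementation, not necessarily the paper's exact one):

```python
import torch

def margin_loss(entire_acs_diseased: torch.Tensor, margin: float) -> torch.Tensor:
    """Hinge on the KL divergence between the (diagonal-Gaussian) batch distribution
    of entire ACs from diseased images and the standard Gaussian prior (Eq. 2)."""
    mu = entire_acs_diseased.mean(dim=0)
    var = entire_acs_diseased.var(dim=0, unbiased=False) + 1e-8
    # KL( N(mu, diag(var)) || N(0, I) ) for a diagonal Gaussian
    kl = 0.5 * torch.sum(var + mu.pow(2) - 1.0 - torch.log(var))
    return torch.clamp(margin - kl, min=0.0)
```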

2.3 Learning objectives

To train the deep-learning framework, several loss functions are defined. The reconstruction loss $L_{\mathrm{recon}}$ is a composite of a perceptual loss using the VGG network (Simonyan and Zisserman, 2015) and a mean squared error loss, which forces the reconstructed images $\hat{\bm{x}}$ to be similar to the input images $\bm{x}$. When necessary, a reconstruction loss focusing on the tumor-associated regions is additionally calculated for the entire reconstruction to force the model to generate the abnormal regions more precisely (see Section 3.2.4). The segmentation loss $L_{\mathrm{seg}}$ is a composite of a Dice loss (Dice, 1945) and a focal loss (Lin et al., 2020) between the predicted segmentation label for abnormal areas $\hat{\bm{l}}$ and its ground-truth label $\bm{l}$. The consistency loss $L_{\mathrm{cons}}$ calculates the L1 distance between the abnormal ACs $\hat{\bm{a}}$ estimated by the label encoder $E_{\mathrm{Lbl}}$ and the corresponding abnormal ACs $\bm{a}$, forcing the estimated abnormal ACs $\hat{\bm{a}}$ to be close to the corresponding abnormal ACs $\bm{a}$. The abnormality loss $L_{\mathrm{abn}}$ applies only when a healthy image is given and forces the norm of the abnormal ACs $\bm{a}$ to zero. The regularization loss $L_{\mathrm{reg}}$ also applies only when a healthy image is given, matching the posterior distribution estimated by the encoder, $E_{\mathrm{NAC}}(\bm{n}\,|\,\bm{x}^{\mathrm{h}})=\mathcal{N}(\bm{\mu}(\bm{x}^{\mathrm{h}}),\bm{\sigma}(\bm{x}^{\mathrm{h}}))$, to the prior distribution $\mathcal{N}(\bm{0},\bm{I})$ by minimizing the KL divergence between them, as follows:

L_{\mathrm{reg}}=\mathrm{KL}\bigl(\mathcal{N}(\bm{\mu}(\bm{x}^{\mathrm{h}}),\bm{\sigma}(\bm{x}^{\mathrm{h}}))\,\|\,\mathcal{N}(\bm{0},\bm{I})\bigr). \quad (3)

Finally, the margin loss $L_{\mathrm{margin}}$ applies only when a diseased image is given, imposing a distance between the distribution of entire ACs from diseased images $\mathcal{D}(\bm{e}\,|\,\bm{x}^{\mathrm{d}})$ and the Gaussian distribution $\mathcal{N}(\bm{0},\bm{I})$, as formulated in Eq. 2. In the training process, the weighted sum of the abovementioned loss functions is minimized:

L_{\mathrm{total}}=w_{\mathrm{recon}}L_{\mathrm{recon}}+w_{\mathrm{seg}}L_{\mathrm{seg}}+w_{\mathrm{cons}}L_{\mathrm{cons}}+w_{\mathrm{abn}}L_{\mathrm{abn}}+w_{\mathrm{reg}}L_{\mathrm{reg}}+w_{\mathrm{margin}}L_{\mathrm{margin}}. \quad (4)

Here, $w_{\mathrm{recon}}$, $w_{\mathrm{seg}}$, $w_{\mathrm{cons}}$, $w_{\mathrm{abn}}$, $w_{\mathrm{reg}}$, and $w_{\mathrm{margin}}$ are the weights for the reconstruction loss $L_{\mathrm{recon}}$, segmentation loss $L_{\mathrm{seg}}$, consistency loss $L_{\mathrm{cons}}$, abnormality loss $L_{\mathrm{abn}}$, regularization loss $L_{\mathrm{reg}}$, and margin loss $L_{\mathrm{margin}}$, respectively. The full algorithm for training the feature extraction module of our SBMIR system is summarized in Algorithm 1.

Algorithm 1: Training of the feature extraction module of our SBMIR system
sg: a stop-gradient operator
while not converged do
       /* Forward path for healthy images */
       Sample a batch of healthy images $\bm{x}^{\mathrm{h}}$ and corresponding segmentation labels $\bm{l}^{\mathrm{h}}$ from the training dataset.
       $\bm{\mu},\bm{\sigma}\leftarrow E_{\mathrm{NAC}}(\bm{x}^{\mathrm{h}})$
       $\bm{n}=\bm{\mu}+\bm{\sigma}\odot\bm{\epsilon},\ \bm{\epsilon}\sim\mathcal{N}(\bm{0},\bm{I})$
       $\bm{a}=E_{\mathrm{AAC}}(\bm{x}^{\mathrm{h}})$
       $\bm{e}=\bm{n}+\bm{a}$
       $\hat{\bm{x}}^{\mathrm{h}}=D_{\mathrm{Img}}(\bm{e})$
       $\hat{\bm{x}}^{\mathrm{h}}=D_{\mathrm{Img}}(\bm{n})$
       $\hat{\bm{l}}^{\mathrm{h}}=D_{\mathrm{Lbl}}(\bm{a})$
       $\hat{\bm{a}}=E_{\mathrm{Lbl}}(\bm{l}^{\mathrm{h}})$
       Compute $L_{\mathrm{recon}}(\hat{\bm{x}}^{\mathrm{h}},\bm{x}^{\mathrm{h}})$, $L_{\mathrm{seg}}(\hat{\bm{l}}^{\mathrm{h}},\bm{l}^{\mathrm{h}})$, $L_{\mathrm{cons}}(\hat{\bm{a}},\mathrm{sg}(\bm{a}))$, $L_{\mathrm{abn}}(\bm{a})$, and $L_{\mathrm{reg}}(\bm{\mu},\bm{\sigma})$.
       Update the parameters of $E_{\mathrm{NAC}}$, $E_{\mathrm{AAC}}$, $E_{\mathrm{Lbl}}$, $D_{\mathrm{Img}}$, and $D_{\mathrm{Lbl}}$ to minimize $w_{\mathrm{recon}}L_{\mathrm{recon}}+w_{\mathrm{seg}}L_{\mathrm{seg}}+w_{\mathrm{cons}}L_{\mathrm{cons}}+w_{\mathrm{abn}}L_{\mathrm{abn}}+w_{\mathrm{reg}}L_{\mathrm{reg}}$ using stochastic gradient descent (e.g., Adam).

       /* Forward path for diseased images */
       Sample a batch of diseased images $\bm{x}^{\mathrm{d}}$ and corresponding segmentation labels $\bm{l}^{\mathrm{d}}$ from the training dataset.
       $\bm{\mu}\leftarrow E_{\mathrm{NAC}}(\bm{x}^{\mathrm{d}})$
       $\bm{n}\leftarrow\mathrm{sg}(\bm{\mu})$
       $\bm{a}=E_{\mathrm{AAC}}(\bm{x}^{\mathrm{d}})$
       $\bm{e}=\bm{n}+\bm{a}$
       $\hat{\bm{x}}^{\mathrm{d}}=D_{\mathrm{Img}}(\bm{e})$
       $\hat{\bm{l}}^{\mathrm{d}}=D_{\mathrm{Lbl}}(\bm{a})$
       $\hat{\bm{a}}=E_{\mathrm{Lbl}}(\bm{l}^{\mathrm{d}})$
       Compute $L_{\mathrm{recon}}(\hat{\bm{x}}^{\mathrm{d}},\bm{x}^{\mathrm{d}})$, $L_{\mathrm{seg}}(\hat{\bm{l}}^{\mathrm{d}},\bm{l}^{\mathrm{d}})$, $L_{\mathrm{cons}}(\hat{\bm{a}},\mathrm{sg}(\bm{a}))$, and $L_{\mathrm{margin}}(\bm{e})$.
       Update the parameters of $E_{\mathrm{AAC}}$, $E_{\mathrm{Lbl}}$, $D_{\mathrm{Img}}$, and $D_{\mathrm{Lbl}}$ to minimize $w_{\mathrm{recon}}L_{\mathrm{recon}}+w_{\mathrm{seg}}L_{\mathrm{seg}}+w_{\mathrm{cons}}L_{\mathrm{cons}}+w_{\mathrm{margin}}L_{\mathrm{margin}}$ using stochastic gradient descent (e.g., Adam).
end while
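
A rough PyTorch-style rendering of one pass of Algorithm 1 follows. It is a sketch under assumptions: the module, loss, and weight containers are illustrative, and the two separate parameter updates of Algorithm 1 are merged here into a single optimizer step for brevity.

```python
import torch

def training_step(batch_h, batch_d, modules, losses, weights, optimizer):
    """One combined update over a healthy sub-batch and a diseased sub-batch.
    `modules` bundles E_NAC, E_AAC, E_Lbl, D_Img, D_Lbl; `losses`/`weights` follow Eq. 4."""
    x_h, l_h = batch_h            # healthy images and all-zero labels
    x_d, l_d = batch_d            # diseased images and tumor-associated labels

    # --- Healthy path: the VAE component is trainable ---
    mu, logvar = modules.e_nac(x_h)
    n = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterization
    a = modules.e_aac(x_h)
    e = n + a
    loss_h = (weights.recon * (losses.recon(modules.d_img(e), x_h)
                               + losses.recon(modules.d_img(n), x_h))
              + weights.seg * losses.seg(modules.d_lbl(a), l_h)
              + weights.cons * losses.cons(modules.e_lbl(l_h), a.detach())
              + weights.abn * losses.abn(a)
              + weights.reg * losses.reg(mu, logvar))

    # --- Diseased path: the VAE component is used for inference only ---
    with torch.no_grad():
        mu_d, _ = modules.e_nac(x_d)
    n_d = mu_d                    # stop-gradient normal AC (pseudo-normal baseline)
    a_d = modules.e_aac(x_d)
    e_d = n_d + a_d
    loss_d = (weights.recon * losses.recon(modules.d_img(e_d), x_d)
              + weights.seg * losses.seg(modules.d_lbl(a_d), l_d)
              + weights.cons * losses.cons(modules.e_lbl(l_d), a_d.detach())
              + weights.margin * losses.margin(e_d))

    optimizer.zero_grad()
    (loss_h + loss_d).backward()
    optimizer.step()
    return loss_h.item(), loss_d.item()
```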

3 Implementation

3.1 Datasets

Figure 6: Examples in the datasets. a Example images with tumor-associated labels and normal-anatomy-associated labels in the glioma testing dataset are shown. b Example images with tumor-associated labels and normal-anatomy-associated labels in the lung cancer testing dataset are shown. Note that in the normal-anatomy-associated labels, regions annotated with tumor-associated labels were assigned normal anatomical classes that should be present therein and vice versa. ET, gadolinium-enhancing tumor; ED, peritumoral edema; NET, necrotic and non-enhancing tumor-core; T1CE, T1-weighted contrast-enhanced sequence.

Two types of datasets, a glioma dataset and a lung cancer dataset, were used in the model training and evaluation, each of which was split into a training dataset and a testing dataset.

3.1.1 Glioma dataset

The glioma dataset initially aggregated three sets of brain MRI scans with glioma obtained from the MICCAI 2019 BraTS Challenge (Menze et al., 2015; Bakas et al., 2017c, a, b): a dataset of 51,925 slices from 335 patients (MICCAI_BraTS_Training), a dataset of 19,375 slices from 125 patients (MICCAI_BraTS_Validation), and a dataset of 25,730 slices from 166 patients (MICCAI_BraTS_Testing). The patients were from multiple hospitals. Each MRI scan consists of T1-weighted (T1), T1-weighted contrast-enhanced (T1CE), T2-weighted, and fluid-attenuated inversion recovery (FLAIR) sequences. As shown in Fig. 6a, the tumor-associated segmentation labels include three classes: gadolinium (Gd)-enhancing tumor (ET), peritumoral edema (ED), and necrotic and non-enhancing tumor core (NET), while the normal-anatomy-associated labels include six classes: left cerebrum, right cerebrum, left cerebellum, right cerebellum, left ventricle, and right ventricle. Except for the tumor-associated labels in MICCAI_BraTS_Training, we supplemented each dataset with tumor-associated labels according to the procedure described in a previous study (Kobayashi et al., 2021). Normal-anatomy-associated labels were generated using the software BrainSuite (version 19a) (Shattuck and Leahy, 2002). We then assigned the datasets for training and testing our SBMIR system as follows: the glioma training dataset ($N=$ 45,105 slices from 291 patients) consists of MICCAI_BraTS_Validation and MICCAI_BraTS_Testing, and the glioma testing dataset ($N=$ 51,925 slices from 335 patients) consists of MICCAI_BraTS_Training. Note that the original names of the datasets in the 2019 BraTS Challenge differ from the notations used here, which reflect their purpose in this study. Following the previous study (Kobayashi et al., 2021), this assignment was deemed appropriate because the performance of our SBMIR system is then evaluated on data with widely accepted, publicly available ground-truth tumor-associated labels (MICCAI_BraTS_Training), ensuring objectivity and reproducibility.

3.1.2 Lung cancer dataset

The lung cancer dataset consists of chest CT scans from 1,000 patients with lung cancer collected from a single hospital. The study, data use, and data protection procedures were approved by the Ethics Committee of the National Cancer Center, Tokyo, Japan (protocol number 2016-496). The requirement for informed consent was waived in view of the retrospective nature of the study. All procedures followed applicable laws and regulations and the Declaration of Helsinki. As shown in Fig. 6b, the tumor-associated segmentation labels included one class, primary tumor (PT), the region of which was segmented by an expert radiation oncologist (K.K.). Other potential tumor-associated regions, such as lymph node metastases, were not annotated because the model training was conducted in the lung window, making it difficult to identify diseased areas in soft tissues such as the mediastinum. The normal-anatomy-associated labels included five classes: right upper lobe, right middle lobe, right lower lobe, left upper lobe, and left lower lobe. These five labels were generated by an off-the-shelf deep-learning model available from a public repository (Hofmanninger et al., 2020). We then split the patients randomly into the lung cancer training dataset ($N=$ 49,696 slices from 600 patients) and the lung cancer testing dataset ($N=$ 33,572 slices from 400 patients). Notably, the large testing dataset was beneficial for assessing the effectiveness of image search because it helps ensure that corresponding images exist for individual user queries.

3.2 Training settings

The detailed training settings of the models are described here. We first determined the hyperparameters using the glioma training dataset and then applied most of them to the lung cancer training dataset, except for the margin parameter. Note that the two datasets differ in the spatial resolution of the input images; the image size was set to $256\times 256$ for the glioma training dataset and $512\times 512$ for the lung cancer training dataset. It is also important to note that the average voxel volume of the tumor-associated regions in the lung cancer testing dataset ($1.0\times 10^{4}$) was much smaller than that in the glioma testing dataset ($4.1\times 10^{4}$).

3.2.1 Preprocessing of the datasets

For the glioma dataset (i.e., the glioma training dataset and the glioma testing dataset), the T1, T1CE, and FLAIR sequences were concatenated into a three-channel MR volume. Then, $Z$-score normalization was applied channel-wise, restricted to the area inside the body. Subsequently, each 3D MR volume was decomposed into a collection of three-channel 2D axial slices, which were resized to $3\times 256\times 256$. For the lung cancer dataset (i.e., the lung cancer training dataset and the lung cancer testing dataset), the voxel values were normalized into a lung window with a window width of 1500 and a window center of -550. Then, to set the number of channels to three, as in the glioma dataset, each CT volume was decomposed into a collection of three-channel 2D axial slices by concatenating each slice with its adjacent upper and lower slices, giving an input tensor of size $3\times 512\times 512$. This 2.5-dimensional approach for the CT slices is valid because the diagnosis of lung nodules usually requires inspecting adjacent slices to distinguish abnormal structures from normal structures such as vessels. Random rotation and random horizontal flipping were applied as data augmentation for training the models.
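
A rough sketch of the CT preprocessing described above; only the window width/center and the 2.5D stacking follow the text, while the exact clipping and scaling convention and the edge-slice handling are assumptions:

```python
import numpy as np

def apply_lung_window(volume_hu: np.ndarray, center: float = -550.0, width: float = 1500.0) -> np.ndarray:
    """Normalize CT voxel values (HU) into the lung window and scale to [0, 1]."""
    lo, hi = center - width / 2.0, center + width / 2.0
    return (np.clip(volume_hu, lo, hi) - lo) / (hi - lo)

def to_25d_slices(volume: np.ndarray) -> np.ndarray:
    """Turn a (Z, H, W) volume into (Z, 3, H, W) by stacking each slice with its
    upper and lower neighbors (edge slices reuse themselves here)."""
    z = volume.shape[0]
    upper = volume[np.clip(np.arange(z) - 1, 0, z - 1)]
    lower = volume[np.clip(np.arange(z) + 1, 0, z - 1)]
    return np.stack([upper, volume, lower], axis=1)
```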

3.2.2 Implementation of the neural networks

All neural networks were implemented in Python 3.8 with PyTorch library 1.10.0 (Paszke et al., 2019) using an NVIDIA Tesla V100 graphics processing unit with CUDA 10.2. We implemented almost the same neural network architecture for the two training datasets (i.e., the glioma training dataset and the lung cancer training dataset), the purpose of which was to maintain the same compression ratio from the input image to the ACs as latent representations. See C for the detailed network architectures.

3.2.3 Hyperparameters for the glioma dataset

For the glioma training dataset, the hyperparameters shared across the configurations were as follows: batch size = 200, number of training epochs = 300, learning rate = $1.0\times 10^{-4}$, weight decay = $1.0\times 10^{-5}$, $w_{\mathrm{recon}}=1.0$, $w_{\mathrm{seg}}=10.0$, $w_{\mathrm{cons}}=1.0$, $w_{\mathrm{abn}}=0.1$, $w_{\mathrm{reg}}=0.1$, $w_{\mathrm{margin}}=0.1$, and $m=10$. We determined these hyperparameters by grid search over the following candidate values: $w_{\mathrm{recon}}\in\{1.0,10.0,100.0\}$, $w_{\mathrm{seg}}\in\{1.0,10.0\}$, $w_{\mathrm{cons}}\in\{0.1,1.0\}$, $w_{\mathrm{abn}}\in\{0.1,1.0\}$, $w_{\mathrm{reg}}\in\{0.1,1.0\}$, $w_{\mathrm{margin}}\in\{0.1,1.0\}$, and $m\in\{0,5,10,20,40\}$.

3.2.4 Hyperparameters for the lung cancer dataset

For the lung cancer training dataset, the hyperparameters shared across the configurations were as follows: batch size = 144, number of training epochs = 50, learning rate = $1.0\times 10^{-4}$, weight decay = $1.0\times 10^{-5}$, $w_{\mathrm{recon}}=1.0$, $w_{\mathrm{seg}}=10.0$, $w_{\mathrm{cons}}=1.0$, $w_{\mathrm{abn}}=0.1$, $w_{\mathrm{reg}}=0.1$, $w_{\mathrm{margin}}=0.1$, and $m=40$. Owing to the relatively small areas of the tumor-associated regions, the reconstruction loss focusing on the tumor-associated regions was additionally calculated on top of the entire reconstruction. Almost all of the above hyperparameters were carried over from the model trained on the glioma training dataset, except for the margin parameter $m$, which was determined by grid search over the candidate values $\{0,20,40,60,80\}$.

3.3 Image retrieval pipeline

Figure 7: Similarity calculation module of our sketch-based medical image retrieval (SBMIR) system. The similarity calculation module of our SBMIR system is staged in five steps. First, a query vector extracted from a user query, which is the combination of a template image and a semantic sketch, is obtained from the feature extraction module. Second, the approximate Euclidean distances between the query vector and the reference vectors of every slice in the reference volumes are computed. Third, the slices with reference vectors close to the query vector are identified irrespective of the reference volumes. Fourth, the extracted slices are rearranged to select the most similar slice in each reference volume. Fifth, the top-$K$ most similar images belonging to different reference volumes are presented to the user as the search results.

After the model is trained, the whole SBMIR system combining the feature extraction module and the similarity calculation module can be implemented. Using the feature extraction module, reference images in a database were converted into reference vectors before user operation (see the right part of Fig. 1c). Each reference vector was constructed from a reference image as the sum of a normal AC (through the normal AC encoder) and an abnormal AC (through the abnormal AC encoder). At the time of the image retrieval, the two-step user operation constructs a query vector to meet the information needs (see the left part of Fig. 1c): first, selecting a template image to specify the location where the target image content should exist, and second, sketching the semantic segmentation label of the disease to express the image content therein. Then, the normal AC encoder extracts a normal AC from the template image, and the label encoder extracts an abnormal AC from the semantic sketch of the disease, both of which are summed as the query vector according to Eq. 1.

In the similarity calculation module (see the middle part of Fig. 1c), the distances between the reference vectors and the query vector are computed as approximate Euclidean distances using the Annoy library (Bernhardsson, 2022). The reference images whose reference vectors are close to the query vector are then rearranged according to their reference volumes. Finally, the top-$K$ similar images, each belonging to a different reference volume, are obtained. Note that this volume-wise similarity calculation is essential to avoid the redundancy caused by consecutive slices with similar appearances within a single 3D volume. Figure 7 details the similarity calculation module that realizes volume-wise image retrieval in our SBMIR system.
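
A sketch of the similarity calculation using the Annoy library; the index parameters (number of trees, oversampling factor) and the deduplication details are illustrative choices, while the Euclidean metric and volume-wise selection follow the text and Fig. 7:

```python
from annoy import AnnoyIndex

def build_reference_index(reference_vecs, dim, n_trees=50):
    """Build an approximate nearest-neighbor index over all reference slice vectors."""
    index = AnnoyIndex(dim, 'euclidean')
    for i, vec in enumerate(reference_vecs):
        index.add_item(i, vec)
    index.build(n_trees)
    return index

def search_volume_wise(index, query_vec, slice_to_volume, k=5, oversample=20):
    """Retrieve the top-k most similar slices, keeping only the best slice per volume."""
    ids, dists = index.get_nns_by_vector(query_vec, k * oversample, include_distances=True)
    best = {}
    for i, d in zip(ids, dists):
        vol = slice_to_volume[i]
        if vol not in best:          # keep the closest slice of each reference volume
            best[vol] = (i, d)
        if len(best) == k:
            break
    return list(best.values())
```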

3.4 Implementation of our SBMIR system on the glioma dataset

The feature extraction module was trained on the glioma training dataset. Then, the image retrieval pipeline was implemented on the glioma testing dataset for evaluation. The template volume was selected as the image series with the minimum volume of tumor-associated regions in the glioma testing dataset, because the smaller the tumor volume, the smaller the expected deviation from normal structures. An ideal template volume would be a totally healthy image series; however, the glioma dataset did not contain an image series without abnormal findings, so this choice can be regarded as the second-best setting for the template volume. For gliomas, we found that users can specify the following types of information regarding the target images to be retrieved: the location, shape, size, and internal characteristics of the tumor (see Section 5.1), which we call fine-grained characteristics of the disease. Among these, the location, shape, and size of the target diseased area can be defined by selecting a template image and sketching the outer edge of the tumor. Furthermore, depending on which combination of tumor-associated labels (i.e., ET, ED, and NET) is used to sketch the tumor, the internal characteristics of the tumor (e.g., contrast enhancement effect, necrosis, presence of edema, etc.) can be expressed.

Table 1: Backgrounds of the evaluators and participation in the user test. Ten and nine healthcare professionals participated in the evaluations of gliomas and lung cancers, respectively.
Evaluators | Background | Years of experience
#1 | Radiation oncologist | 10 - 19
#2 | Medical oncologist | 10 - 19
#3 | Diagnostic radiologist | 20 - 29
#4 | Diagnostic radiologist | 20 - 29
#5 | Neurosurgeon | 20 - 29
#6 | Colorectal surgeon | 10 - 19
#7 | Thoracic surgeon | 20 - 29
#8 | Medical physicist | 10 - 19
#9 | General surgeon | 1 - 9
#10 | Medical researcher | 1 - 9

3.5 Implementation of our SBMIR system on the lung cancer dataset

After training the feature extraction module using the lung cancer training dataset, the image retrieval pipeline of our SBMIR system was implemented on the lung cancer testing dataset for evaluation. The template image series was also selected to be the one with the minimum volume of tumor-associated regions in the lung cancer testing dataset, similar to the glioma dataset. For lung cancers, we found that users can identify fine-grained characteristics of the disease, including the location, shape, and size of a tumor (see Section 5.1); however, in contrast with gliomas, the internal characteristics of the tumor (e.g., ground-glass opacity and solid tumor components) are not explicitly expressed by the model because only a single-class tumor-associated segmentation label (i.e., PT) was given in the training dataset.

4 Evaluation

The present study comprehensively evaluates our SBMIR system from technological and clinical standpoints. As for technical evaluations, the training results of the feature decomposition, hyperparameter optimization focusing on the image retrieval performance, and ablation studies are described in D, E, and F, respectively. In this section, we explain the details of the clinical evaluation, which focuses on how our SBMIR system can help the information-seeking objectives of healthcare professionals.

The flexibility of the query-by-sketch approach comes at a cost: user queries cannot be standardized, and ground truth for the retrieved images cannot be prepared in advance. Therefore, user testing is the most valid evaluation scheme for assessing the image retrieval performance of our SBMIR system. A group of healthcare professionals with various clinical backgrounds, including radiologists, physicians, surgeons, medical physicists, and researchers, participated in the user tests as the evaluators (see Table 1). The user tests were conducted using a dedicated GUI (see Fig. 2b).

4.1 Definition of clinical similarity

We developed criteria for the evaluators to determine whether the retrieved image was clinically similar to the features specified by the user query for each dataset.

4.1.1 Clinical similarity of glioma images

For gliomas, our SBMIR system can specify the location, shape, size, and internal characteristics of the tumor, as demonstrated in Section 5.1. Among these characteristics, we considered that determining whether the shape is consistent or not can be too subjective to standardize among evaluators. Hence, the clinically similar images were defined to be the images that met the following three criteria: (1) a difference in the maximum tumor diameter of within 2 cm; (2) the same location of the tumor according to the brain lobe (i.e., right frontal lobe, right parietal lobe, right occipital lobe, right temporal lobe, left frontal lobe, left parietal lobe, left occipital lobe, and left temporal lobe); (3) the same pattern of the contrast enhancement (e.g., the presence of contrast-enhancement and tumor necrosis). When a user intention included a more detailed location (e.g., the relative position in a brain lobe), the evaluators were required to judge whether the retrieved images matched the detailed location. Each criterion was judged with a score of 0 or 1, and a maximum score of 3 points was possible for each image, which we call the similarity score. When the similarity score was 3/3, the retrieved image was considered clinically similar to the query.

4.1.2 Clinical similarity of lung cancer images

For lung cancers, our SBMIR system can specify the location, shape, and size of the tumor, as demonstrated in Section 5.1. Note that the internal characteristics of the tumor are not explicitly expressed, as only a single class of tumor-associated segmentation labels (i.e., PT) was applied to the datasets. We also concluded that determining the concordance in the shape of lung cancer can be too subjective. Therefore, the clinical similarity was defined according to the following two criteria: (1) a difference in the maximum tumor diameter of within 2 cm; (2) the same location of the tumor according to the lung lobe (i.e., right upper lobe, right middle lobe, right lower lobe, left upper lobe, left lower lobe). When a user intention included a more detailed location (e.g., the lung segment, the apex of the lung, the pleural contact, and the hilar region), the evaluators were asked to judge whether the retrieved images matched the detailed location. Each criterion was evaluated with a score of 0 or 1, and a maximum similarity score of 2 points was possible for each image. When the similarity score was 2/2, the retrieved image was considered clinically similar to the query.

Figure 8: Evaluation flow for our sketch-based medical image retrieval (SBMIR) system. a To prepare question items for each test, we conducted the following steps. First, a set of representative images was identified in a testing dataset. Second, isolated samples were identified and excluded from the set, and five of them were randomly assigned to Test-3. Third, from the remaining representative images, those with at least one clinically similar image were extracted in the pre-evaluation process based on the direct comparison of Dice similarities. Fourth, expert radiologists selected representative images for the question items of the practice stage, Test-1, and Test-2, in order to vary the clinical features as much as possible. b In the user tests, the healthcare professionals first learned how to operate our SBMIR system in the practice stage. After that, retrieval tests for presented images (Test-1), presented sentences (Test-2), and isolated samples (Test-3) were conducted sequentially.

4.2 Evaluation metrics for the retrieved images

We devised two types of evaluation metrics for the retrieved images – user-oriented and objective metrics. Every evaluation metric was averaged among the evaluators (N = 10 for gliomas and N = 9 for lung cancers, as shown in Table 1), and the mean ± standard deviation of each metric was reported.

4.2.1 User-oriented evaluation metrics

The user-oriented metrics included precision@K, reciprocal rank, and normalized discounted cumulative gain (nDCG), which reflect how clinically similar the retrieved results were to the intent of the query based on the judgment of the evaluators. These were evaluated on a 2D slice basis; that is, when assigning the similarity score used to calculate these metrics, each evaluator interpreted the consistency between the intention of the query and the retrieved images without considering the adjacent slices of each retrieved image. Because the user tests are characterized by the large size of the datasets (N = 51,925 images in the glioma testing dataset and N = 33,572 images in the lung cancer testing dataset) and by the fine-grained attributes that user queries can specify, the user-oriented metrics provide a straightforward assessment of how many of the top-K retrieved images satisfy the intent of the query. We defined the following three user-oriented metrics.

Precision@K (Shirahatti and Barnard, 2005) is the ratio between the number of clinically similar images in the top-K retrieved images and the number of retrieved images K, which is formulated as

\mathrm{Precision@K}=\frac{|\mathcal{S}\cap\mathcal{L}_{K}|}{K}, (5)

where $\mathcal{S}$ is the set of clinically similar images judged by an evaluator, $\mathcal{L}_{K}$ is the list of the top-K retrieved images, and K is the number of retrieved images.

The reciprocal rank (Pedronette and Torres, 2015) is calculated from the inverse of the rank position of the first clinically similar image in the top-K retrieved images, which is defined as

\mathrm{Reciprocal\,rank}=\frac{1}{k}, (6)

where $k$ is the rank position of the first clinically similar image in the top-K retrieved images. When there is no clinically similar image in the retrieval list, the reciprocal rank is set to 0.

The nDCG (Wang et al., 2013) aggregates the similarity scores of the retrieved images with a rank-dependent discount, defined as

\mathrm{nDCG}=\frac{1}{\mathrm{DCG}_{\mathrm{perfect}}}\left(s_{1}+\sum^{K}_{k=2}\frac{s_{k}}{\log_{2}k}\right), (7)

where $s_{k}$ indicates the similarity score of the $k$-th ranked image, $k$ is the rank position, and $\mathrm{DCG}_{\mathrm{perfect}}$ represents the maximum DCG of an ideal retrieval result, normalizing the value to the range from 0 to 1. The role of the logarithmic terms is to discount the scores of images retrieved at lower ranks.
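A minimal sketch of the three user-oriented metrics is shown below; it assumes the per-image judgments have already been collected, and the ideal result used for DCG_perfect is assumed to assign the maximum similarity score to all K images.

```python
import math

def precision_at_k(is_similar):
    # is_similar: booleans for the top-K retrieved images, in retrieved order (Eq. 5).
    return sum(is_similar) / len(is_similar)

def reciprocal_rank(is_similar):
    # Inverse rank of the first clinically similar image (Eq. 6); 0 if none is found.
    for rank, hit in enumerate(is_similar, start=1):
        if hit:
            return 1.0 / rank
    return 0.0

def ndcg(scores, max_score):
    # scores: similarity scores s_k of the top-K images in retrieved order (Eq. 7).
    # max_score: maximum per-image score (3 for gliomas, 2 for lung cancers);
    # the ideal result is assumed to assign max_score to every one of the K images.
    dcg = scores[0] + sum(s / math.log2(k) for k, s in enumerate(scores[1:], start=2))
    dcg_perfect = max_score * (1 + sum(1 / math.log2(k) for k in range(2, len(scores) + 1)))
    return dcg / dcg_perfect

# Example: top-5 glioma retrievals with similarity scores out of 3.
scores = [3, 2, 3, 1, 0]
hits = [s == 3 for s in scores]
print(precision_at_k(hits), reciprocal_rank(hits), ndcg(scores, max_score=3))
```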

Note that precision@5 can be influenced by how many clinically similar images for each user query were originally included in the testing dataset. As such, precision@5 for rare images can be small even when the image retrieval works successfully. Hence, we prepared the user tests to guarantee that the lower bound of precision@5 will be 0.2 when the image retrieval works properly, which will be described in detail in Section 4.3.

4.2.2 Objective evaluation metrics

Recall@K was defined as an objective metric that can be assessed independently of the evaluator’s interpretation of the retrieval results. This objective metric was determined on a 3D volume basis, reflecting whether images belonging to the same volume as the presented image, which we call the same-volume images, are listed among the top-K retrieved images.

Recall@K (Shirahatti and Barnard, 2005) is automatically calculated based on whether the same-volume images are acquired in the top-K retrieved images, which is formulated as

\mathrm{Recall@K}=\begin{cases}1&(i\in\mathcal{L}_{K})\\ 0&(i\notin\mathcal{L}_{K}),\end{cases} (8)

where $i$ is the same-volume image, and $\mathcal{L}_{K}$ is the list of the top-K retrieved images.
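A minimal sketch of this check, assuming the retrieved list is represented by the volume IDs of the top-K slices (an illustrative convention, not the released interface):

```python
def recall_at_k(presented_volume_id, retrieved_volume_ids):
    # 1 if a slice from the same volume as the presented image appears among the
    # top-K retrieved images, and 0 otherwise (Eq. 8).
    return int(presented_volume_id in retrieved_volume_ids)

print(recall_at_k(12, [3, 12, 40, 7, 25]))  # -> 1
```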

4.3 Preparation for the user tests

As illustrated in Fig. 8a, we prepared question items for the practice stage and the three stages of user tests, including Test-1, Test-2, and Test-3. Each stage included five question items, consisting of either images or text descriptions, presented to the evaluators to retrieve medical images consistent with the clinical characteristics in each question item. To prepare question items, we focused on representative images. A representative image is defined as the 2D axial slice containing the largest tumor-associated region in each 3D volume, which is usually considered to be the image that best characterizes the clinical meaning of the volume.

Each question item was developed based on a specific representative image assigned in the following steps. First, from a set of representative images in each testing dataset, we identified isolated samples that were “not” included in the 5-nearest neighbors (NN) groups of any other images using a fine-tuned ResNet-101 (see B) for exclusion from Test-1 and Test-2. Five of these isolated samples were randomly selected for Test-3. Then, pre-evaluation was conducted on the remaining representative images. The purpose of the pre-evaluation was to identify a set of representative images that were certain to have at least one clinically similar image in the testing dataset. To confirm this, we directly compared Dice similarities (Dice, 1945) to maximize the average overlap of the tumor-associated labels and the normal-anatomy-associated labels (see Fig. 6). As described previously (Kobayashi et al., 2021), a CBMIR based on the direct comparison of Dice similarities can be considered an oracle for retrieving images similar to a query image. The Dice similarities were computed between each representative image and all the other images in the dataset, and the similar images were then rearranged volume-wise, in the same manner as shown in Fig. 7. Subsequently, whether the top-5 retrieved images based on the Dice similarities included at least one clinically similar image according to the similarity score was evaluated from a clinical perspective, and the representative images that did not meet this criterion were excluded (i.e., the lower bound of precision@5 can be 0.2). Lastly, the manual selection was performed to assign as diverse a selection of disease phenotypes as possible in each testing stage, considering the tumor location and other disease characteristics.
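A minimal sketch of the Dice similarity underlying this pre-evaluation, assuming binary NumPy masks per label class; the averaging over the tumor-associated and normal-anatomy-associated labels is shown schematically.

```python
import numpy as np

def dice_similarity(mask_a, mask_b, eps=1e-8):
    # Dice coefficient (Dice, 1945) between two binary masks of the same shape.
    mask_a = mask_a.astype(bool)
    mask_b = mask_b.astype(bool)
    intersection = np.logical_and(mask_a, mask_b).sum()
    return 2.0 * intersection / (mask_a.sum() + mask_b.sum() + eps)

def average_dice(masks_a, masks_b):
    # Mean Dice over the per-class masks (e.g., tumor-associated and
    # normal-anatomy-associated labels) of two images, used to rank candidates.
    return float(np.mean([dice_similarity(a, b) for a, b in zip(masks_a, masks_b)]))

# Example with two classes of 8x8 placeholder masks.
rng = np.random.default_rng(0)
image_a = [rng.integers(0, 2, (8, 8)) for _ in range(2)]
image_b = [rng.integers(0, 2, (8, 8)) for _ in range(2)]
print(average_dice(image_a, image_b))
```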

After assigning five representative images to each stage, the practice stage, Test-1, and Test-3 used the assigned images as question items presented to the evaluators. For Test-2, text descriptions expressing specific clinical findings, written to simulate radiology reports, were prepared based on the assigned images and used as question items. Because it is generally difficult to fully communicate the image characteristics of a medical image by text, the originally assigned representative images were only used as references for writing the text descriptions and were not used in the evaluation process.

4.4 Flow of the user tests

Figure 8b illustrates the flow of the user tests, consisting of the practice stage, Test-1, Test-2, and Test-3. For each of the testing datasets, the evaluators first underwent the practice stage. In the practice stage, five sets of a representative image and a ground-truth tumor-associated label were consecutively presented to the evaluators. By referring to the ground-truth tumor-associated labels as guidance for sketching the diseases, they were able to learn how to construct a user query to specify the content of medical images. Also, they received automated feedback on whether the same-volume image was listed in the top-5 retrieved images to confirm the validity of the user query. This practice stage is important to demonstrate that our SBMIR system can be learned easily with minimum practice. Then, three types of user tests were conducted as follows. Test-1 demonstrated the image retrieval performance when example images were available (see Section 5.2). Test-2 revealed the image retrieval performance without example images (see Section 5.3). Test-3 investigated the image retrieval performance for isolated samples (see Section 5.4). Each test stage contained five question items indicated by images or text descriptions, which is described in detail in G. User-oriented metrics were independently evaluated by each evaluator in Test-1 and Test-2, while an objective metric was automatically assessed in Test-1 and Test-3.

4.5 Comparison with conventional CBMIR methods

A conventional CBMIR system using the query-by-example approach was implemented for comparison. The fine-tuned ResNet-101 was used as a feature extractor (see A), and the image retrieval pipeline was built in a similar manner to our SBMIR system (see Fig. 7). In Test-1, where example images were available, the image retrieval results were evaluated based on the user-oriented metrics. An expert diagnostic radiologist (M.M.) and an expert radiation oncologist (K.K.) were responsible for the aforementioned processes that required clinical perspectives, including the pre-evaluation and manual selection in Section 4.3.

5 Results and discussion

Figure 9: Image retrieval performed according to fine-grained characteristics. a Example search results from our sketch-based medical image retrieval (SBMIR) system on the glioma testing dataset. b Example search results of our SBMIR system on the lung cancer testing dataset. These results show that the fine-grained characteristics of the retrieved images change depending on how the tumor is sketched and which template image is selected. Gd, gadolinium; T1, T1-weighted sequence; T1CE, T1-weighted contrast-enhanced sequence; FLAIR, fluid-attenuated inversion recovery sequence.

5.1 SBMIR retrieved medical images according to fine-grained imaging characteristics

Here, we demonstrate that our SBMIR system enables medical image retrieval to be performed according to fine-grained characteristics, including the location, shape, size, and internal characteristics of a disease. Because a query vector can be specified based on which 2D template image is selected from the 3D template volume and how the disease is sketched, we observed how the search results changed when each piece of information in a user query was varied.

Figure 9a demonstrates medical image retrieval performed according to the fine-grained characteristics of gliomas. The radiological components of gliomas were categorized into three classes using tumor-associated segmentation labels (ET, ED, and NET), all of which can be sketched on a selected template image. We started with a user query consisting of only a template image without a sketch, to retrieve a normal image with similar anatomy to the template image (see the first row). This result was consistent with the assumption that normal ACs convey only information characterizing normal anatomy. We then drew a semantic sketch representing a non-enhancing tumor surrounded by mild peritumoral edema in the left frontal lobe on the same template image, which successfully retrieved an image containing the intended characteristics (see the second row). When we changed the internal characteristics of the tumor to exhibit ring enhancement, the retrieval results again changed according to the intention of the query (see the third row). Finally, we sketched a similar disease at a different location, the left temporal lobe, on a different template image, whereby the retrieved image also contained a tumor with the specified features at the intended location (see the fourth row). The retrieved image in the second row suggests a low-grade glioma, and those in the third and fourth rows suggest high-grade gliomas (The Cancer Genome Atlas Research Network, 2015). Hence, by configuring the three types of tumor-associated segmentation labels to specify the image content, even an image with a particular differential diagnosis can be retrieved.

Figure 9b shows that flexible medical image retrieval can also be successfully performed for lung cancers. Only a single class of tumor-associated segmentation label representing PT can be sketched on a selected template image because, in contrast with gliomas, the internal characteristics of lung cancers were not explicitly modeled. We started by sketching a tumor with a triangular shape in the left upper lobe on a template image, successfully retrieving an image with a tumor exhibiting the corresponding shape and location (see the first row). Subsequently, we altered the shape of the tumor to round, changing the retrieval results to include an image with a relatively round tumor located in the same region (see the second row). The detailed anatomical location of the semantic sketch in the same template image can also affect the search results. To demonstrate this, we sketched a round tumor at a different position on the same template image to contact the pleura, thus retrieving a tumor with pleural contact (see the third row). Lastly, we sketched a similar tumor in the right lower lobe on a different template image, changing the location as intended (see the fourth row). Hence, the detailed characteristics regarding the size, shape, and anatomical location of lung cancers can be reflected in the search results.

Figure 10: Five question items and their evaluation metrics in Test-1. a Based on the glioma testing dataset, question items Q.1 to Q.5 were presented to the evaluators in Test-1 to reproduce the image characteristics of the presented images in user queries for retrieving clinically similar images, including the presented one. The means ± standard deviations of the evaluation metrics, including precision@5, reciprocal rank, normalized discounted cumulative gain (nDCG), and recall@5, were calculated among the evaluators (N = 10). Example queries that successfully retrieved at least one clinically similar image are shown. b The five question items in Test-1 using the lung cancer testing dataset are shown with example queries. The means ± standard deviations of the same evaluation metrics were calculated among the evaluators (N = 9). Image retrieval based on our sketch-based medical image retrieval system was successful in terms of clinical similarity according to the fine-grained characteristics in most cases for both gliomas and lung cancers. Gd, gadolinium; T1, T1-weighted sequence; T1CE, T1-weighted contrast-enhanced sequence; FLAIR, fluid-attenuated inversion recovery sequence.
Figure 11: Five question items and their evaluation metrics in Test-2. a The five question items in Test-2 on the glioma testing dataset are shown with example user queries and retrieved images. The retrieved images are shown as T1-weighted contrast-enhanced sequences. The means ± standard deviations of the evaluation metrics, including precision@5, reciprocal rank, and normalized discounted cumulative gain (nDCG), were calculated among the evaluators (N = 10). b The five question items in Test-2 on the lung cancer testing dataset are shown with example user queries and retrieved images. The means ± standard deviations of the same evaluation metrics were calculated among the evaluators (N = 9). At least one or more clinically similar images were obtained for each statement in both datasets. Hence, our sketch-based medical image retrieval system enables medical image retrieval even without example images. Gd, gadolinium.
Figure 12: Five question items and their evaluation metrics in Test-3. a The five question items, which present isolated samples to retrieve in Test-3 on the glioma testing dataset, and their evaluation metrics using recall@5 are shown. The means ± standard deviations of recall@5 were calculated among the evaluators (N = 10). b The five question items in Test-3 on the lung cancer testing dataset and their evaluation metrics are shown. The means ± standard deviations of recall@5 were calculated among the evaluators (N = 9). For both datasets, images belonging to the same volume as the isolated samples (i.e., same-volume images) were successfully retrieved by our sketch-based medical image retrieval system. Gd, gadolinium; T1, T1-weighted sequence; T1CE, T1-weighted contrast-enhanced sequence; FLAIR, fluid-attenuated inversion recovery sequence.

5.2 SBMIR outperformed the existing CBMIR even when example images were available

In Test-1, the evaluators were asked to translate the characteristics of a presented image into a user query to retrieve clinically similar images, including the images belonging to the same volume as the presented one, which we call the same-volume images. There were two primary purposes of this test. First, by observing whether the same-volume images were retrievable or not, the representation power of user queries was assessed. Second, the image retrieval performance of our SBMIR system was compared with that of the existing CBMIR system based on the query-by-example approach, which used the presented images as example images (see Section 4.5).

For gliomas, five images with various sizes, locations, and patterns of contrast enhancement were presented to the evaluators (see Fig. 10a and G.1). The average recall@5 among the evaluators was consistently high, above 0.9. Such high recall@5 values indicate that the user queries could capture most of the characteristics of the presented images and thereby successfully retrieve the same-volume images. Before the user tests, we had thought that it might be difficult for some evaluators to construct appropriate user queries based on their understanding of the presented image; however, such a skill-based limitation hindering image retrieval performance was not substantial for gliomas. Furthermore, all presented images had a precision@5 averaging above 0.6, indicating that at least three clinically similar images were listed on average among the top-5 retrieved images (see Section 4.1 for the definition of clinical similarity). For comparison, the means ± standard deviations of precision@5, reciprocal rank, nDCG, and recall@5 for the retrieval results among the question items based on the query-by-example approach were 0.32 ± 0.18, 1.00 ± 0.00, 0.68 ± 0.07, and 1.00 ± 0.00, respectively. As the presented images were also subject to image retrieval, it is natural that the recall@5 and reciprocal rank values were as high as 1.00, reflecting the fact that all the presented images were the top-ranked images. Nevertheless, the consistently higher precision@5 of our SBMIR system (i.e., all precision@5 values in Fig. 10a are above 0.32) supports its superior image retrieval performance according to fine-grained image characteristics even when example images are available. Figure 1a shows the results for Q.2, whose recall@5 (1.00 ± 0.00) was the highest, while Fig. 1b shows the results for Q.3, whose recall@5 (0.90 ± 0.32) was the lowest. These examples demonstrate the robustness of our system, as most of the retrieved images were judged to be clinically similar, even though each evaluator sketched the disease with unique details on the same or slightly different template images.

For lung cancers, five images with different sizes and locations were presented to the evaluators (see Fig. 10b and G.1). Notably, recall@5 varied among the question items. A low recall@5 suggests that the representation power of the user query was insufficient to retrieve the same-volume images. To examine the causes of this, we compared the user queries in Q.4, which exhibited the highest recall@5 (0.90 ± 0.32) (see Fig. 2a), with those in Q.1, which showed the lowest recall@5 (0.10 ± 0.32) (see Fig. 2b). Because the user queries resembled each other among the evaluators for each question item, we concluded that skill-based limitations were not the primary cause. Instead, a sketch-based limitation, which arises when a semantic sketch has insufficient representation power to specify the image characteristics of the disease phenotype, might have caused the variability of recall@5. This was because the single class of tumor-associated labels (i.e., PT) seemed insufficient for differentiating the internal characteristics of lung cancers, including ground-glass opacity and solid components. Meanwhile, the means ± standard deviations of precision@5, reciprocal rank, nDCG, and recall@5 of the query-by-example approach were 0.28 ± 0.11, 1.00 ± 0.00, 0.66 ± 0.03, and 1.00 ± 0.00, respectively. The consistently higher precision@5 (i.e., all precision@5 values in Fig. 10b are above 0.28) suggests that our SBMIR system’s overall performance for retrieving clinically similar images was still superior to that of the conventional CBMIR system despite the possible sketch-based limitation for lung cancers.

5.3 SBMIR enabled image retrieval without example images

Test-2 demonstrated that our SBMIR system could retrieve clinically similar images even without example images. This would be impossible using a conventional CBMIR system based on the query-by-example approach, posing the usability problem. Five descriptions of clinical findings, which simulate radiology reports, were presented to the evaluators. Each evaluator constructed a user query according to clinical findings recalled by interpreting the description and evaluated the clinical similarity of retrieved images to the intention of their query using the user-oriented metrics.

For gliomas, the five descriptions included various types of tumors (see Fig. 11a and G.2). The mean precision@5 ranged from 0.48 in Q.5 to 0.86 in Q.3. The mean reciprocal rank and mean nDCG were intermediate-to-high, exceeding 0.6 for all question items. Thus, most top-5 retrieved images were given high similarity scores, including more than two clinically similar images ranked high on average (see Section 4.1 for the definitions of the similarity scores). As can be seen in the example results of Q.3 (see Fig. 3a), even though each evaluator was presented with the description independently, the user queries resembled each other, retrieving clinically similar images as intended. Notably, our SBMIR system was effective in retrieving images with distinctive yet rare characteristics, for example, the so-called butterfly glioma in Q.5 (see Fig. 3b). These promising results were obtained in a situation that required sufficiently detailed descriptions for the evaluators to recall common clinical findings, which makes the evaluation stricter and more complex than in Test-1.

For lung cancers, in a more challenging task, we presented descriptions that specify the detailed anatomical location of the disease (see Fig. 11b and G.2), including the lung segment, the apex of the lung, the pleural contact, and the hilar region. This is because the radiological phenotypes of lung cancers are so diverse that the information about the size and lung lobes alone would be insufficient for the evaluators to specify their user queries. Except for Q.4, the mean precision@5 (ranging from 0.46 in Q.1 to 0.84 in Q.3) and mean reciprocal rank (ranging from 0.77 in Q.5 to 1.00 in Q.3) showed intermediate-to-high values, suggesting that at least two clinically similar images were successfully ranked higher on average. See Fig. 4a for the results of successful image retrieval with relatively similar user queries in Q.3. In contrast, for particular cases, such as the central type of lung cancer in Q.4 (see Fig. 4b), even though the mean precision@5 of 0.24 exceeded 0.2 (i.e., as described in Section 4.2.1, the lower bound of precision@5 will be 0.2 when image retrieval works properly), the mean reciprocal rank of 0.47 was still low. The detailed anatomical location, such as the hilar region, may not have been fully learned by the model, impeding localization of the area by normal ACs from the template images. We call this a template-image-based limitation, which can hinder the image retrieval performance of our SBMIR system.

5.4 SBMIR enabled image retrieval for isolated samples

Test-3 confirmed that our SBMIR system could obtain the same-volume images even when the isolated samples were presented, overcoming the searchability problem. See B for the details of how we identified the isolated samples, which were defined as images that are “not” included within the k-NN of any other images in the database. Five randomly selected isolated samples, identified based on the fine-tuned ResNet-101 features, were presented to the evaluators. Recall@5 was automatically evaluated for each isolated sample.

For gliomas, the five isolated samples were presented to the evaluators (see Fig. 12a and G.3). The mean recall@5 was greater than or equal to 0.7, implying that most evaluators succeeded in retrieving the same-volume images. Example results in Q.1 with the highest recall@5 (1.00 ± 0.00) and those in Q.2 with the lowest recall@5 (0.70 ± 0.48) are presented in Fig. 5a and Fig. 5b, respectively. Despite the slightly different user queries, the same-volume images were effectively retrieved within the top-5 retrieved images, as highlighted by the red boxes.

For lung cancers, we presented the five isolated samples to the evaluators (see Fig. 12b and G.3). Recall@5 exceeded 0.5 for all question items except Q.2, indicating that more than half of the evaluators successfully retrieved the same-volume images. The example results in Q.4, showing the highest recall@5 (0.89 ± 0.33), are shown in Fig. 6a. In contrast, the isolated sample in Q.2 was difficult to retrieve, as shown by the recall@5 of 0.44 ± 0.53 (see Fig. 6b). The low recall@5 for Q.2 may be related to a template-image-based limitation: the body and chest wall of the presented image appear much larger than those of the corresponding template image, possibly limiting the ability of the template image to represent such individual anatomical differences.

6 Conclusion

Despite tremendous advancements in CBMIR (Haq et al., 2021; Zhong et al., 2021; Fang et al., 2021; Rossi et al., 2021), little attention has been paid to the practical information-seeking objectives of healthcare professionals, overlooking the usability and searchability limitations. To overcome this, we developed the SBMIR system, which requires neither preparing example images nor sketching the entire anatomical appearance. Its most innovative aspect is the simple two-step user operation (see Fig. 2a), which can specify fine-grained characteristics of image content. The user test showed that our SBMIR system could overcome previous limitations through better image retrieval performance according to fine-grained image characteristics (see Fig. 10), image retrieval without example images (see Fig. 11), and image retrieval for isolated samples (see Fig. 12). Our SBMIR system enables users to retrieve images of interest on demand, expanding the utility of medical image databases.

Three possible sources of limitations of our SBMIR system were observed. Skill-based limitations seemed to be minimized by the practice stage. Sketch-based and template-image-based limitations are associated with the algorithm, indicating room for improvement. As malignant tumors are characterized not only by shape and size but also by internal characteristics (Aerts et al., 2014), learning internal characteristics may mitigate sketch-based limitations, which were particularly evident for lung cancers. Besides, modeling normal anatomy using detailed segmentation labels could reduce template-image-based limitations, whereas in the present study information about normal anatomy was learned in a self-supervised manner. Because our SBMIR system can be applied to other diseases as long as lesions are segmentable, the development of medical image retrieval technologies will hopefully be accelerated by addressing these issues.

The present study reminds us that database searching is an inherently interactive process. Indeed, the bidirectional communication between the user and the system to refine user queries for better results has been a fundamental concern for practical information access (Lamine, 2008; Miao et al., 2021). However, in conventional query-by-example approaches, there has been no room for such interaction in the whole process. In our SBMIR system, on the other hand, users can seek a better way to express their user intention by observing how the search results change according to which template image is selected and how the disease is sketched (see Fig. 9 and the Supplementary Video 1). Such human-machine interaction has the potential to establish reliable and trustworthy data-driven applications in medicine (Cutillo et al., 2020; Liang et al., 2022).

Acknowledgement

The authors thank the members of the Division of Medical AI Research and Development of the National Cancer Center Research Institute for their kind support. The RIKEN AIP Deep Learning Environment (RAIDEN) supercomputer system was used in this study to perform computations.

Funding

This work was supported by JST CREST (Grant Number JPMJCR1689), JST AIP-PRISM (Grant Number JPMJCR18Y4), JSPS Grant-in-Aid for Scientific Research on Innovative Areas (Grant Number JP18H04908), and JSPS KAKENHI (Grant Number JP22K07681).

Competing interests

Kazuma Kobayashi and Ryuji Hamamoto have received research funding from Fujifilm Corporation.

Contributions

K.K. conceived the study, devised the algorithms, developed the software, coordinated the user tests, performed the technical evaluation, and analyzed the results. K.K., L.G., and R. Hataya prepared the manuscript. K.K. and M.M. prepared the datasets and the question items for the user tests. K.K., T.M., M.M., and M.T. designed the framework of the user tests. K.K., T.M., M.M., H.W., M.T., Y.T., Y.Y., S.N., N.K., and A.B. participated in the user tests. Y.K. provided technical advice. T.H. and R. Hamamoto supervised the research.

Data availability

The glioma dataset is available on the website of the MICCAI BraTS Challenge (Menze et al., 2015; Bakas et al., 2017c, a, b). Note that the chest CT scans collected from the hospital and used in the lung cancer dataset remain under the custody of the hospital.

Code availability

All source codes for the training and evaluation of the present study will be publicly available on GitHub (https://github.com/Kaz-K/sketch-based-medical-image-retrieval).

Appendix A Preparation of ResNet-101 to extract image-level features

We prepared two types of feature extractors that represent the characteristics of entire images: an ImageNet-trained ResNet-101 and a fine-tuned ResNet-101. ResNet-101 is a 101-layer deep neural network that produces a 2,048-dimensional feature vector from the layer just before the final fully connected layer (He et al., 2016). Because ImageNet contains 1.28 million natural training images spanning 1,000 classes (Russakovsky et al., 2015), the ImageNet-trained ResNet-101 can be expected to capture general image features; however, there are substantial differences between ImageNet classification and medical image diagnosis (Raghu et al., 2019). Hence, we fine-tuned the ImageNet-trained ResNet-101 using the training datasets (see Section 3.1). For fine-tuning, the final 1,000-node classification layer of the ResNet-101 was removed and replaced by a one-node layer that outputs the probability of whether the input image contains abnormal findings. As our datasets contain tumor-associated labels corresponding to a relatively small portion of a large image covering a body region (see Fig. 6), we expected that fine-tuning would force the model to focus on the local pathological change in the image rather than on the global subject of the image. All the model parameters were optimized according to the training settings described previously (Kobayashi et al., 2021). For each training dataset, the models were trained for 100 epochs. The resulting image-wise abnormality classification performance of the fine-tuned ResNet-101 was as follows: the means ± standard deviations of accuracy, precision, recall, and specificity were 0.94 ± 0.07, 0.94 ± 0.20, 0.84 ± 0.24, and 0.97 ± 0.08 for the glioma testing dataset and 0.97 ± 0.04, 0.81 ± 0.30, 0.86 ± 0.29, and 0.98 ± 0.04 for the lung cancer testing dataset.
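A minimal sketch of this fine-tuning setup using torchvision is shown below; the loss function, optimizer, and learning rate are illustrative placeholders, as the actual training settings follow Kobayashi et al. (2021).

```python
import torch
import torch.nn as nn
from torchvision import models

# Start from ImageNet-pretrained weights (torchvision >= 0.13 weights API) and
# replace the 1,000-class head with a single-node output for abnormality classification.
model = models.resnet101(weights=models.ResNet101_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, 1)

criterion = nn.BCEWithLogitsLoss()                          # placeholder loss
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # placeholder optimizer

def training_step(images, has_abnormality):
    # images: (B, 3, H, W) tensor; has_abnormality: (B,) float tensor of 0/1 targets.
    logits = model(images).squeeze(1)
    loss = criterion(logits, has_abnormality)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```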

Appendix B Identification of isolated samples in the conventional query-by-example approach

Figure 1: Isolated samples in a database as a fundamental limitation of conventional query-by-example approaches. a In a latent space, an ordinary sample can be surrounded by other ordinary samples, while an isolated sample with unique image characteristics is considered to be located far away from the others. Thus, when such isolated samples exist in a database, one may not be able to find them using the query-by-example approach because it is challenging to prepare an example image with similar unique characteristics. This can be quantitatively evaluated by counting the number of samples that do not appear in the k-nearest neighbors (k-NN) of any other samples in the latent space. b The number of isolated samples in the glioma testing dataset among k-NN samples. The fine-tuning of ResNet-101 reduced the number of isolated samples. For example, the number of isolated samples in the 5-NN group was reduced from 21 to 10 (indicated by the arrow). c The number of isolated samples in the lung cancer testing dataset among k-NN samples. Similarly, the number of isolated samples in the 5-NN group was reduced from 48 to 13 by fine-tuning (indicated by the arrow). Nevertheless, a substantial number of isolated samples existed in both databases, even when the fine-tuned feature extractor was used.

We defined an isolated sample as an image that is “not” included among the k-nearest neighbors (k-NN) of any other images (see Fig. 1a). To count the isolated samples in the testing datasets, we focused on representative images. A representative image is defined as the 2D axial slice containing the largest tumor-associated region in each 3D volume. This is because 3D volume data, such as CT and MRI scans, usually contain similar abnormal findings in consecutive slices, which can raise a concern about duplicative counting of similar images. We decided that the evaluation using the representative images is valid because the slice that best characterizes the clinical significance of 3D volume data is often the one in which a lesion appears largest. After identifying the representative images from each 3D volume, a feature extractor (i.e., the ImageNet-trained ResNet-101 or the fine-tuned ResNet-101, as described in A) extracted an image-level feature from each representative image. Then, the L2 distances to all the other representative images were calculated based on the feature vectors to obtain the k-NN samples of each representative image. The number of representative images that were “not” included in the k-NN of any other representative image was counted. The numbers of isolated samples in the glioma and lung cancer testing datasets are shown in Fig. 1b and Fig. 1c, respectively.
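A minimal sketch of this counting procedure, assuming a matrix of image-level feature vectors (one row per representative image); the function and variable names are illustrative.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def count_isolated_samples(features, k=5):
    # features: (N, D) array of image-level features of the representative images.
    # An image is isolated if it never appears among the k-NN of any other image.
    nn = NearestNeighbors(n_neighbors=k + 1, metric="euclidean").fit(features)
    _, indices = nn.kneighbors(features)   # column 0 is each query's self-match
    neighbor_ids = indices[:, 1:]          # k-NN of every image, excluding itself
    appears_as_neighbor = np.zeros(len(features), dtype=bool)
    appears_as_neighbor[np.unique(neighbor_ids)] = True
    return int((~appears_as_neighbor).sum())

# Example with random placeholder features.
rng = np.random.default_rng(0)
print(count_isolated_samples(rng.random((200, 2048)), k=5))
```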

Appendix C Detailed network architecture

For the glioma training dataset, the input image size was 3 × 256 × 256 (= 196,608) and the segmentation label size was 4 × 256 × 256, corresponding to three classes of tumor-associated regions (i.e., ET, ED, and NET) plus a background class. Table 1 shows the detailed implementation of the normal AC encoder, which mainly consists of a repeated structure with residual blocks (He et al., 2016) and a strided convolution (ResBlock + StridedConv). For the abnormal AC encoder, the network architecture demonstrated in Table 2 was employed. Almost the same network architecture as in Table 2 was used for the label encoder, except for its input size of 4 × 256 × 256, which is consistent with the segmentation label size. The image decoder employed the neural network architecture shown in Table 3. For the label decoder, almost the same architecture as in Table 3 was used, except for its final output size, which was adjusted to be 4 × 256 × 256.

For the lung cancer training dataset, the network architectures were the same, but the spatial resolutions of the output shapes in Table 1, Table 2, and Table 3 were doubled because the spatial resolution of the input image was doubled, to 3 × 512 × 512 (= 786,432). Additionally, the segmentation label size was 2 × 512 × 512, reflecting one class of tumor-associated region (i.e., PT) plus a background class. Because the input image was concatenated with the adjacent upper and lower slices to reach a channel size of 3, the segmentation label corresponding to the center slice was set as the learning objective. Consequently, the sizes of the ACs were 512 × 2 × 2 = 2,048 for the glioma dataset and 512 × 4 × 4 = 8,192 for the lung cancer dataset, maintaining the same compression ratio (2,048/196,608 = 8,192/786,432 = 1/96) across the datasets irrespective of the difference in the spatial resolution of the images.

Table 1: Basic architecture of the encoder networks of the variational autoencoder component
Module | Activation | Output shape
Input image | - | 3 × 256 × 256
Conv | [3 × 3, 32] | 32 × 256 × 256
StridedConv | [3 × 3, 64] | 64 × 128 × 128
ResBlock | [3 × 3, 64; 3 × 3, 64] | 64 × 128 × 128
StridedConv | [3 × 3, 128] | 128 × 64 × 64
ResBlock | [3 × 3, 128; 3 × 3, 128] | 128 × 64 × 64
StridedConv | [3 × 3, 256] | 256 × 32 × 32
ResBlock | [3 × 3, 256; 3 × 3, 256] | 256 × 32 × 32
StridedConv | [3 × 3, 512] | 512 × 16 × 16
ResBlock | [3 × 3, 512; 3 × 3, 512] | 512 × 16 × 16
StridedConv | [3 × 3, 512] | 512 × 8 × 8
ResBlock | [3 × 3, 512; 3 × 3, 512] | 512 × 8 × 8
StridedConv | [3 × 3, 512] | 512 × 4 × 4
ResBlock | [3 × 3, 512; 3 × 3, 512] | 512 × 4 × 4
StridedConv | [3 × 3, 512] | 512 × 2 × 2
Split | - | -
Conv | [1 × 1, 512], [1 × 1, 512] | 512 × 2 × 2, 512 × 2 × 2
Table 2: Basic architecture of the encoder networks
Module | Activation | Output shape
Input image | - | 3 × 256 × 256
Conv | [3 × 3, 32] | 32 × 256 × 256
StridedConv | [3 × 3, 64] | 64 × 128 × 128
ResBlock | [3 × 3, 64; 3 × 3, 64] | 64 × 128 × 128
StridedConv | [3 × 3, 128] | 128 × 64 × 64
ResBlock | [3 × 3, 128; 3 × 3, 128] | 128 × 64 × 64
StridedConv | [3 × 3, 256] | 256 × 32 × 32
ResBlock | [3 × 3, 256; 3 × 3, 256] | 256 × 32 × 32
StridedConv | [3 × 3, 512] | 512 × 16 × 16
ResBlock | [3 × 3, 512; 3 × 3, 512] | 512 × 16 × 16
StridedConv | [3 × 3, 512] | 512 × 8 × 8
ResBlock | [3 × 3, 512; 3 × 3, 512] | 512 × 8 × 8
StridedConv | [3 × 3, 512] | 512 × 4 × 4
ResBlock | [3 × 3, 512; 3 × 3, 512] | 512 × 4 × 4
StridedConv | [3 × 3, 512] | 512 × 2 × 2
Table 3: Basic architecture of the decoder networks
Module | Activation | Output shape
Latent representation | - | 512 × 2 × 2
Conv | [3 × 3, 512] | 512 × 2 × 2
ResBlock | [3 × 3, 512; 3 × 3, 512] | 512 × 2 × 2
Conv | [1 × 1, 512] | 512 × 2 × 2
Upsample | - | 512 × 4 × 4
ResBlock | [3 × 3, 512; 3 × 3, 512] | 512 × 4 × 4
Conv | [1 × 1, 512] | 512 × 4 × 4
Upsample | - | 512 × 8 × 8
ResBlock | [3 × 3, 512; 3 × 3, 512] | 512 × 8 × 8
Conv | [1 × 1, 256] | 256 × 8 × 8
Upsample | - | 256 × 16 × 16
ResBlock | [3 × 3, 256; 3 × 3, 256] | 256 × 16 × 16
Conv | [1 × 1, 128] | 128 × 16 × 16
Upsample | - | 128 × 32 × 32
ResBlock | [3 × 3, 128; 3 × 3, 128] | 128 × 32 × 32
Conv | [1 × 1, 64] | 64 × 32 × 32
Upsample | - | 64 × 64 × 64
ResBlock | [3 × 3, 64; 3 × 3, 64] | 64 × 64 × 64
Conv | [1 × 1, 32] | 32 × 64 × 64
Upsample | - | 32 × 128 × 128
ResBlock | [3 × 3, 32; 3 × 3, 32] | 32 × 128 × 128
Conv | [1 × 1, 3] | 3 × 128 × 128
Upsample | - | 3 × 256 × 256
ResBlock | [3 × 3, 3; 3 × 3, 3] | 3 × 256 × 256
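As a rough PyTorch sketch of the repeated ResBlock + StridedConv pattern summarized in Tables 1 and 2; normalization layers, activation functions, and other implementation details are guesses made for illustration, not the released implementation.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    # Two 3x3 convolutions with an identity shortcut (channel count unchanged).
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
    def forward(self, x):
        return torch.relu(x + self.body(x))

def strided_conv(in_ch, out_ch):
    # 3x3 convolution with stride 2: halves the spatial resolution.
    return nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1)

# Encoder backbone following Table 2: 3x256x256 input -> 512x2x2 anatomy code.
encoder = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1),            # 32 x 256 x 256
    strided_conv(32, 64),                      # 64 x 128 x 128
    ResBlock(64),  strided_conv(64, 128),      # 128 x 64 x 64
    ResBlock(128), strided_conv(128, 256),     # 256 x 32 x 32
    ResBlock(256), strided_conv(256, 512),     # 512 x 16 x 16
    ResBlock(512), strided_conv(512, 512),     # 512 x 8 x 8
    ResBlock(512), strided_conv(512, 512),     # 512 x 4 x 4
    ResBlock(512), strided_conv(512, 512),     # 512 x 2 x 2
)

print(encoder(torch.zeros(1, 3, 256, 256)).shape)  # torch.Size([1, 512, 2, 2])
```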

Appendix D Training results of the feature extraction module of our SBMIR system

We trained the feature extraction module of our SBMIR system using the glioma training dataset and the lung cancer training dataset independently (see Section 3.2.3 and Section 3.2.4). We qualitatively and quantitatively evaluated its training results to verify the algorithm from technical viewpoints.

D.1 Qualitative evaluation of the image reconstruction

Figure 1: Training results of the image reconstruction and segmentation. The input images, the entire reconstructions from entire anatomy codes (ACs), and the pseudo-normal reconstructions from normal ACs are presented in the first, second, and third rows, respectively. The ground-truth and predicted segmentation labels are shown in the fourth and fifth rows, respectively. Arrows indicate the diseased areas in the entire reconstructions, which are diminished in the pseudo-normal reconstructions. Gd, gadolinium.

How the feature decomposition was achieved can be assessed by visualizing images generated by the image decoder and the label decoder. Recall that the reconstructed images from the normal ACs should be pseudo-normal images, while those from the entire ACs should be entire images with some abnormalities if they exist (see Fig. 1b). Additionally, the abnormal ACs produce segmentation labels for the tumor-associated regions when they are inputted into the label decoder.

Figure 1 presents the images and segmentation labels decoded from the ACs using the glioma testing dataset and the lung cancer testing dataset. The first row shows the input images. The second row demonstrates the entire reconstructions of the input images, which were decoded by the image decoder taking the entire ACs as input. The third row indicates the pseudo-normal reconstructions, which were decoded by the image decoder taking the normal ACs as input. Note that the abnormal imaging features that appear both in the input images and in the entire reconstructions (see the arrows in Fig. 1) are diminished in the pseudo-normal images, recovering the normal anatomy that should have existed therein. These results suggest successful feature decomposition. Moreover, the fourth row presents the ground-truth segmentation labels for the tumor-associated regions, and the fifth row shows the predicted segmentation labels that were decoded by the label decoder taking the abnormal ACs as input. Note that the segmentation was trained only for the tumor-associated regions and not for the normal anatomy-associated regions, as shown in the results.

One may argue that the quality of the reconstructed images and segmentation labels was insufficient, as observed from the blurred and rounded appearance that did not recover the detailed image characteristics. These tendencies are reasonable because the image information is compressed into a latent representation of limited size, which imposes a trade-off between reconstruction quality and latent size (Razavi et al., 2019; Kobayashi et al., 2021). Indeed, we did not pursue the generation quality of the reconstructions as a primary purpose because the lower dimensionality of the latent representations is advantageous for computational efficiency in similarity search. Besides, although fine image details were not perfectly reproduced, the reconstructions were still sufficient for recognizing the anatomical location and the presence of abnormalities.

D.2 Qualitative evaluation of the semantically organized latent space

Figure 2: Relationship between the margin parameter and cluster formation in the latent space. Four latent distributions, the entire anatomy codes (ACs) from healthy images (green dots), the normal ACs from healthy images (blue dots), the normal ACs from diseased images (orange dots), and the entire ACs from diseased images (red dots) are visualized using t-distributed stochastic neighbor embedding (t-SNE) plotting. a In the glioma testing dataset, two clusters emerged with a margin parameter of more than 5. b In the lung cancer testing dataset, two clusters emerged when the margin parameter was more than 40. Note that the entire ACs from diseased images (red dots) were intermingled with the other data points, particularly with a margin parameter of 0, possibly reflecting the relatively small area of each lung cancer.

We applied t-distributed stochastic neighbor embedding (t-SNE) analysis (van der Maaten and Hinton, 2011) to evaluate how the latent space was organized according to the semantics by changing the margin parameter (see Section 2.2.3). We randomly selected 50 individual volumes from each dataset and extracted entire ACs and normal ACs in an image-wise manner. The same randomly selected samples were reused in the subsequent t-SNE analyses for comparison, particularly in F.

The results using the glioma testing dataset and the lung cancer testing dataset are shown in Fig. 2a and Fig. 2b, respectively. When the margin parameter was set to 0, there was no visible cluster in the latent space, where the different ACs intermingled with each other (see the leftmost images in Fig. 2a and Fig. 2b). This is particularly evident in Fig. 2b, possibly reflecting that the primary site of lung cancer usually occupies such a small region that a latent feature representing an abnormality does not convey meaningful information when no margin parameter is assigned. On the other hand, when the margin parameter was increased, two clusters appeared and moved away from each other, particularly when the margin parameter was more than 10 for the glioma testing dataset and more than 40 for the lung cancer testing dataset. One cluster consisting of the entire ACs from healthy images (green dots), the normal ACs from healthy images (blue dots), and the normal ACs from diseased images (orange dots) can be interpreted as corresponding to the healthy subspace. In contrast, the other cluster consisting of the entire ACs from diseased images (red dots) can be considered the diseased distribution. Therefore, sufficiently large margin parameters are necessary for configuring the semantically organized latent space (see Fig. 5b).
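A minimal sketch of this visualization, assuming the four groups of ACs have already been collected into arrays; the group sizes, t-SNE settings, and plotting details are illustrative rather than those used in the paper.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Placeholder AC arrays; each row is one flattened anatomy code (AC).
rng = np.random.default_rng(0)
groups = {
    "entire AC, healthy": rng.random((200, 2048)),
    "normal AC, healthy": rng.random((200, 2048)),
    "normal AC, diseased": rng.random((200, 2048)),
    "entire AC, diseased": rng.random((200, 2048)),
}

features = np.concatenate(list(groups.values()), axis=0)
embedded = TSNE(n_components=2, random_state=0).fit_transform(features)

# Plot each group of ACs with its own color to inspect cluster formation.
start = 0
for name, arr in groups.items():
    end = start + len(arr)
    plt.scatter(embedded[start:end, 0], embedded[start:end, 1], s=4, label=name)
    start = end
plt.legend()
plt.show()
```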

D.3 Quantitative evaluation of the image reconstruction

Given entire ACs, the image decoder was trained to generate entire reconstructions $\hat{\bm{x}}$ of input images $\bm{x}$. The reconstruction error between the entire reconstructions $\hat{\bm{x}}$ and the input images $\bm{x}$ was evaluated as $\|\hat{\bm{x}}-\bm{x}\|_{2}^{2}$. The means $\pm$ standard deviations of the reconstruction error were $0.22 \pm 0.23$ and $0.16 \pm 0.05$ in the glioma testing dataset and the lung cancer testing dataset, respectively.
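For concreteness, a per-image version of this error could be computed as in the following sketch; the pixel-wise reduction (summation here) is an assumption, as the exact convention is not restated in this appendix.

```python
import torch

def reconstruction_error(x_hat: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    """Per-image squared L2 reconstruction error ||x_hat - x||_2^2.

    Both tensors are expected to have shape (batch, channels, height, width);
    the error is summed over all non-batch dimensions (reduction is assumed).
    """
    return (x_hat - x).pow(2).flatten(start_dim=1).sum(dim=1)
```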

D.4 Quantitative evaluation of the segmentation

The label decoder predicts segmentation labels of the tumor-associated regions $\hat{\bm{l}}$, which should be close to the ground-truth labels $\bm{l}$. The segmentation performance was evaluated based on the Dice score with respect to the tumor-associated labels. To calculate the Dice score for each volume, the segmentation outputs of the 2D axial images were concatenated into a 3D volume. The means $\pm$ standard deviations of the Dice score were $0.37 \pm 0.21$, $0.60 \pm 0.14$, and $0.44 \pm 0.21$ for NET, ED, and ET, respectively, in the glioma testing dataset. The mean $\pm$ standard deviation of the Dice score was $0.64 \pm 0.18$ for PT in the lung cancer testing dataset. These intermediate Dice scores are reasonable because the model was built without skip connections, which are essential for precise medical image segmentation (Drozdzal et al., 2016), in order to concentrate the information about the diseased regions in the abnormal ACs.
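A minimal sketch of this volume-wise Dice evaluation is given below, assuming the per-slice predictions and ground-truth labels are binary 2D NumPy arrays stacked along the axial axis; the function names are illustrative.

```python
import numpy as np

def dice_score(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8) -> float:
    """Dice score between two binary volumes: 2|P and G| / (|P| + |G|)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    return float(2.0 * np.logical_and(pred, gt).sum() / (pred.sum() + gt.sum() + eps))

def volume_dice(pred_slices, gt_slices) -> float:
    """Stack per-slice 2D predictions into a 3D volume before scoring,
    as done for the volume-wise evaluation."""
    return dice_score(np.stack(pred_slices, axis=0), np.stack(gt_slices, axis=0))
```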

D.5 Quantitative evaluation of the mapping function from the label space to the latent space

We evaluated the performance of the label encoder, which takes the ground-truth labels as input and estimates the corresponding abnormal ACs $\hat{\bm{a}}$; these should be similar to the output $\bm{a}$ of the abnormal AC encoder given the corresponding image. The means $\pm$ standard deviations of the squared L2 distance between the output of the label encoder and the corresponding abnormal ACs, $\|\hat{\bm{a}}-\bm{a}\|_{2}^{2}$, were $3.1\times 10^{-2} \pm 4.5\times 10^{-2}$ and $0.7\times 10^{-2} \pm 3.6\times 10^{-2}$ in the glioma testing dataset and the lung cancer testing dataset, respectively. Because these distances are small, the inverse mapping from the semantic labels to the corresponding abnormal ACs is sufficiently precise, enabling users to specify the characteristics of abnormal findings of interest by drawing semantic sketches.

D.6 Quantitative evaluation of the separability in latent space

To quantitatively evaluate how well the latent space $\mathcal{Z}$ can be separated into the healthy subspace $\mathcal{Z}^{\mathrm{h}}$ and the diseased subspace $\mathcal{Z}^{\mathrm{d}}$ (see Fig. 5b and Fig. 2), we trained support vector machines (SVMs) to discriminate between the entire ACs from healthy images $\mathcal{D}(\bm{e}\rvert\bm{x}^{\mathrm{h}})$ and the entire ACs from diseased images $\mathcal{D}(\bm{e}\rvert\bm{x}^{\mathrm{d}})$ using the training datasets (i.e., the glioma training dataset and the lung cancer training dataset). Because the SVM learns the hyperplane with the maximum margin between the two distributions, its classification performance reflects how well they are separated. In the glioma testing dataset, the SVM achieved an accuracy of 0.94, a precision of 0.96, a recall of 0.89, and an F-score of 0.92. In the lung cancer testing dataset, it achieved an accuracy of 0.97, a precision of 0.90, a recall of 0.82, and an F-score of 0.86. The high classification performance of the SVMs indicates that the semantically organized latent space was successfully configured.
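The following scikit-learn sketch illustrates this separability analysis; the variable names, the linear kernel, and the default hyperparameters are assumptions rather than the exact configuration used.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def evaluate_separability(train_healthy, train_diseased, test_healthy, test_diseased):
    """Train an SVM on entire ACs (healthy vs. diseased) and report test metrics."""
    X_train = np.concatenate([train_healthy, train_diseased])
    y_train = np.concatenate([np.zeros(len(train_healthy), dtype=int),
                              np.ones(len(train_diseased), dtype=int)])
    X_test = np.concatenate([test_healthy, test_diseased])
    y_test = np.concatenate([np.zeros(len(test_healthy), dtype=int),
                             np.ones(len(test_diseased), dtype=int)])

    clf = SVC(kernel="linear")  # linear kernel assumed; learns a max-margin hyperplane
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
    }
```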

Appendix E Technological evaluation focusing on the image retrieval performance

Here, we demonstrate that extending the feature decomposition of medical images by imposing a semantically organized latent space is critical for achieving a practical SBMIR system. A practical SBMIR system should retrieve images according to their similarity while simultaneously reflecting their semantics. That is, a query vector that conveys information about a diseased region should retrieve only diseased images, whereas one that does not should retrieve only healthy images. From this perspective, we devised two evaluation methods to objectively demonstrate how well the retrieved image features preserve information about both the semantics and the similarity of images. The image retrieval performance was affected by the margin parameter used to configure the semantically organized latent space (see Fig. 5b).

E.1 Semantic consistency in image retrieval

Figure 1: Evaluation of the semantic consistency in image retrieval. a When image retrieval can be realized according to the semantics of an image, healthy or diseased, a query vector as an entire anatomy code (AC) from a healthy or diseased image should retrieve healthy or diseased images, respectively. b To achieve semantic consistency, the latent space should be semantically organized, with data points with similar semantics located close to each other. c The semantic consistency in the glioma testing dataset. d The semantic consistency in the lung cancer testing dataset. Note that a higher margin parameter is essential for maintaining semantic consistency, as is particularly evident in the lung cancer testing dataset.
Figure 2: Evaluation of latent proximity in image retrieval. Efficient medical image retrieval requires similar images to be near each other in latent space. a To quantitatively evaluate latent proximity, we noted that the contiguous slices in an image volume are similar to each other. Then, whether a query vector as an entire anatomy code (AC) from a center image can retrieve the slices that were cranially or caudally consecutive to the center image was evaluated. b Low latent proximity indicates that images that are originally similar to each other are mixed in the latent space with dissimilar images, whereas high latent proximity suggests the similarity in the image space is also preserved in the latent space. c The latent proximity was evaluated using the glioma testing dataset. d The latent proximity was evaluated using the lung cancer testing dataset. Too large a margin parameter was found to adversely affect latent proximity, especially for the lung cancer testing dataset.

To evaluate how well image retrieval reflects semantics, we developed an evaluation method that quantifies semantic consistency. As shown in Fig. 1a, when an entire AC is used as the query vector, image retrieval reflecting semantics should return healthy images for an entire AC from a healthy image and diseased images for an entire AC from a diseased image. To achieve this, each sample in the latent space should be grouped with samples sharing the same semantics rather than with samples of different semantics (see Fig. 1b), which is what the margin parameter is expected to enforce. We therefore evaluated semantic consistency with the following procedure. First, each image in a testing dataset was used as a query image, and the corresponding entire AC was computed as the query vector. For each query vector, the five closest slices from volumes other than the one the query belongs to were retrieved. Then, the ratio of retrieved images whose semantics were consistent with the query vector was computed. This ratio is reported as the semantic consistency value, calculated separately according to the semantics of the query images (i.e., healthy images or diseased images).
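A compact sketch of this procedure is given below; it assumes the entire ACs, healthy/diseased labels, and volume identifiers of a testing dataset are available as NumPy arrays, and it uses a brute-force nearest-neighbor search instead of whichever similarity-search index the actual system uses.

```python
import numpy as np

def semantic_consistency(acs, labels, volume_ids, k=5):
    """For every slice used as a query, retrieve the k closest slices from
    *other* volumes and return the fraction whose healthy/diseased label
    matches the query, averaged separately for healthy and diseased queries."""
    acs = acs.reshape(len(acs), -1)
    results = {0: [], 1: []}  # 0: healthy queries, 1: diseased queries
    for i in range(len(acs)):
        mask = volume_ids != volume_ids[i]        # exclude the query's own volume
        dists = np.linalg.norm(acs[mask] - acs[i], axis=1)
        top_k = np.argsort(dists)[:k]
        results[int(labels[i])].append(np.mean(labels[mask][top_k] == labels[i]))
    return {sem: float(np.mean(v)) for sem, v in results.items() if v}
```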

In the glioma testing dataset (see Fig. 1c), the semantic consistency of diseased images increased from 0.86 to 0.88 as the margin parameter increased from 0 to 10. More remarkable effects were observed in the lung cancer testing dataset (see Fig. 1d). When the margin parameter was set to 0, the semantic consistency of diseased images was as low as 0.25, indicating that even when the user intends to retrieve images with abnormalities, the results can be contaminated with healthy images. In contrast, the semantic consistency improved to 0.72 when the margin parameter was increased to 40, showing that the margin parameter had a substantial effect on the semantic consistency of diseased images. Because the average tumor volume in the lung cancer testing dataset was significantly smaller than that in the glioma testing dataset (see Section 3.2), we infer that the latent features representing such small diseased areas can be overwhelmed by those representing the much larger surrounding body region. Consequently, the margin parameter is essential for semantically consistent image retrieval, as it ensures that imaging features relevant to abnormal findings are effectively conveyed.

E.2 Latent proximity in image retrieval

To evaluate how well the similarity in the image space is preserved in the latent space, we devised an index called latent proximity, based on the notion that similar images should also be near each other in the latent space; it was evaluated as follows (see Fig. 2a). First, each image in a testing dataset was used as a query image, and the corresponding entire AC was computed as the query vector. For each query vector, the closest slice was retrieved from all volumes, including the volume the query belongs to. Then, whether the retrieved image was among the five slices cranially or caudally consecutive to the query image in the original volume was assessed, separately according to the semantics of the query image. Finally, the proportion of retrieved images that fell within these consecutive slices was reported as the latent proximity value. As shown in Fig. 2b, low latent proximity indicates that even similar images are not close to each other in the latent space, which can cause unintended search results containing dissimilar images. In contrast, high latent proximity guarantees that similarity in the image space is also reproduced in the latent space, leading to retrieval results that are faithful to the user's intention. The evaluation was repeated with varying margin parameters for both the glioma and lung cancer testing datasets.
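The latent proximity index can be sketched analogously; again the array names are hypothetical, and a brute-force nearest-neighbor search stands in for the system's similarity-search backend.

```python
import numpy as np

def latent_proximity(acs, volume_ids, slice_indices, window=5):
    """Fraction of queries whose single nearest neighbour (excluding the query
    slice itself) lies within `window` cranially or caudally adjacent slices
    of the same volume."""
    acs = acs.reshape(len(acs), -1)
    hits = 0
    for i in range(len(acs)):
        dists = np.linalg.norm(acs - acs[i], axis=1)
        dists[i] = np.inf                          # do not retrieve the query itself
        j = int(np.argmin(dists))
        adjacent = 0 < abs(int(slice_indices[j]) - int(slice_indices[i])) <= window
        hits += int(volume_ids[j] == volume_ids[i] and adjacent)
    return hits / len(acs)
```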

As shown in Fig. 2c, the latent proximity in the glioma testing dataset did not differ dramatically with the magnitude of the margin parameter. In contrast, the latent proximity of diseased images in the lung cancer testing dataset changed clearly with the margin parameter (see Fig. 2d). Therefore, an excessively large margin parameter may weaken the correspondence between the image space and the latent space, hindering retrieval according to image similarity.

Appendix F Ablation studies of the feature extraction module of our SBMIR system

Among the loss functions used to train the feature extraction module (i.e., the reconstruction, segmentation, consistency, abnormality, regularization, and margin losses; see Section 2.3), the necessity and suitable values of the margin parameter were demonstrated in the previous sections. Here, we confirm that the abnormality loss and the regularization loss are also essential for the semantically organized latent space. For simplicity, only the results based on the glioma testing dataset are shown.

F.1 Ablation study of abnormality loss

Figure 1: Effects of abnormality loss on the organization of latent space. Four latent distributions–namely, the entire anatomy codes (ACs) from healthy images (green dots), the normal ACs from healthy images (blue dots), the normal ACs from diseased images (orange dots), and the entire ACs from diseased images (red dots)–are visualized using t-distributed stochastic neighbor embedding (t-SNE) plotting. Without abnormality loss, the entire ACs from healthy images (green dots) moved away from the Gaussian distribution representing normal anatomy, rendering the latent space disorganized according to the image semantics.

The abnormality loss forces the abnormal AC encoder to output zero vectors for healthy images, as the abnormal ACs from healthy images should not contain any relevant information (see Fig. 5a). Because the segmentation output from the subsequent label decoder is also trained to be all zeros, reflecting the absence of tumor-associated regions, one could argue that the abnormality loss is redundant. However, when the model was trained without the abnormality loss, the entire ACs from healthy images (green dots) moved away from the Gaussian distribution, as shown in Fig. 1. This may be because the abnormal ACs, even those from healthy images, had significant norms when trained without the abnormality loss. In this case, a discrepancy emerged between the entire ACs from healthy images and the normal ACs from healthy images, which weakened the rationale that the former distribution represents the normal anatomy and undermined the assumption of the semantically organized latent space (see Fig. 5b).
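A minimal sketch of such a penalty is shown below; the exact form of the abnormality loss (a squared L2 norm here) is an assumption made for illustration.

```python
import torch

def abnormality_loss(abnormal_ac_healthy: torch.Tensor) -> torch.Tensor:
    """Push abnormal ACs extracted from healthy images toward the zero vector
    by penalizing their squared L2 norm (the exact penalty form is assumed)."""
    return abnormal_ac_healthy.pow(2).flatten(start_dim=1).sum(dim=1).mean()
```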

F.2 Ablation study of regularization loss

Figure 2: Effects of regularization loss. a Four latent distributions, namely the entire anatomy codes (ACs) from healthy images (green dots), the normal ACs from healthy images (blue dots), the normal ACs from diseased images (orange dots), and the entire ACs from diseased images (red dots), are visualized using t-distributed stochastic neighbor embedding (t-SNE) plotting. Without the regularization loss forcing the distribution of normal ACs from healthy images to follow the Gaussian distribution, the latent space lost its organization according to the image semantics. b Owing to the lack of the prior distribution, the pair of the normal AC encoder and the image decoder acted like an identity function. As a result, the reconstructed image, even from the normal ACs, contained abnormal imaging features (see the arrowhead).

The regularization loss constrains the distribution of the normal ACs from healthy images to follow a Gaussian distribution, which is essential for training the VAE component (see Fig. 3a). An underlying assumption of the VAE component is that normal anatomical variation can be modeled by a Gaussian distribution, analogous to the fact that many medical indicators, such as body height, follow a Gaussian distribution. Here, we show that this assumption is also essential for the semantically organized latent space (see Fig. 5b). When the regularization loss was not imposed on the distribution of normal ACs, no clusters formed in the latent space, as demonstrated in Fig. 2a. Moreover, owing to the lack of a prior distribution, the pair of the normal AC encoder and the image decoder (i.e., the neural networks trained as the VAE component) acted like an identity function, resulting in image reconstructions that contained abnormal findings even from the normal ACs (see the arrowhead in Fig. 2b). In summary, the Gaussian assumption for the normal ACs is critical in our implementation not only for the semantically organized latent space but also for the feature decomposition of medical images.
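As an illustration, the regularization term could take the standard VAE form of a KL divergence toward a standard Gaussian, as sketched below; the variable names and the use of this exact KL formulation are assumptions.

```python
import torch

def kl_regularization(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """Standard VAE KL divergence KL(N(mu, sigma^2) || N(0, I)), applied to the
    normal ACs from healthy images; mu and logvar come from the normal AC encoder."""
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp())
    return kl.flatten(start_dim=1).sum(dim=1).mean()
```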

Appendix G Description of question items for the user tests

G.1 Question items in Test-1

For gliomas, the clinical characteristics of the five presented images are as follows (see Fig. 10a):

  • Q.1: A 95-mm non-enhancing tumor with peritumoral edema in the left frontal lobe.
  • Q.2: A 90-mm ring-enhancing tumor with multiple cores and peritumoral edema in the left frontal lobe.
  • Q.3: A 70-mm non-enhancing tumor in the left temporal lobe.
  • Q.4: A 50-mm ring-enhancing tumor with peritumoral edema in the right parietal lobe.
  • Q.5: An 85-mm ring-enhancing tumor with peritumoral edema in the right temporal lobe.

For lung cancers, the clinical characteristics of the five presented images are as follows (see Fig. 10b):

  • Q.1: A 25-mm part-solid nodule in the left upper lobe.
  • Q.2: A 30-mm tumor in the left lower lobe.
  • Q.3: A 10-mm solid nodule in the right upper lobe.
  • Q.4: A 45-mm tumor in the right middle lobe.
  • Q.5: A 35-mm tumor in the right lower lobe.

G.2 Question items in Test-2

For gliomas, the five descriptions presented to the evaluators are as follows (see Fig. 11a):

  • Q.1: A 60-mm ring-enhancing tumor is located primarily in the left temporal lobe. It is associated with massive peritumoral edema (about 100 mm in maximum length) extending through the left temporal lobe.
  • Q.2: A 50-mm non-enhancing tumor is located in the right frontal lobe. It is associated with mild peritumoral edema (about 70 mm in maximum length).
  • Q.3: A 25-mm ring-enhancing tumor is located in the left temporal pole (the tip of the left temporal lobe). It is associated with mild peritumoral edema (about 40 mm in maximum length).
  • Q.4: A 30-mm ring-enhancing tumor is localized in the right occipital lobe. It is associated with peritumoral edema extending anteriorly (about 60 mm in maximum length).
  • Q.5: A 60-mm ring-enhancing tumor is located in the midline of the bilateral frontal lobes. It is associated with extensive peritumoral edema in the bilateral frontal lobes (about 90 mm in maximum length).

For lung cancers, the five descriptions presented to the evaluators are as follows (see Fig. 11b):

  • Q.1: A 20-mm nodule in the left upper lobe that contacts the pleura on the mediastinal side.
  • Q.2: A 40-mm tumor in the posterior-basal segment of the left lower lobe in contact with the pleura.
  • Q.3: A 50-mm tumor in the apex of the right lung.
  • Q.4: A 70-mm tumor in the right pulmonary hilar region, possibly invading the mediastinum.
  • Q.5: A 20-mm peripheral nodule in the right lateral-basal segment in contact with the chest wall pleura.

G.3 Question items in Test-3

For gliomas, the clinical characteristics of the five isolated samples presented to the evaluators are as follows (see Fig. 12a):

  • Q.1: A 65-mm ring-enhancing tumor with peritumoral edema located in the deep white matter in the right parietal lobe.
  • Q.2: A 90-mm ring-enhancing tumor with extensive edema located primarily in the right frontal lobe.
  • Q.3: A 45-mm enhancing tumor with peritumoral edema located primarily in the left insular lobe.
  • Q.4: A 70-mm non-enhancing tumor with peritumoral edema in the left frontal lobe.
  • Q.5: A 35-mm ring-enhancing tumor with peritumoral edema located at the anterior edge of the left temporal lobe.

For lung cancers, the clinical characteristics of the five isolated samples presented to the evaluators are as follows (see Fig. 12b):

  • Q.1: A 25-mm part-solid nodule in the left upper lobe.
  • Q.2: A 65-mm tumor in the left lower lobe.
  • Q.3: A 65-mm tumor in the right lower lobe.
  • Q.4: A 70-mm tumor in the left lower lobe.
  • Q.5: A 55-mm tumor in the right lower lobe.

Appendix H Example results of the user tests

H.1 Example results of Test-1

Example results of Test-1 for gliomas and those for lung cancers are shown in Fig. 1 and Fig. 2, respectively.

H.2 Example results of Test-2

Example results of Test-2 for gliomas and those for lung cancers are shown in Fig. 3 and Fig. 4, respectively.

H.3 Example results of Test-3

Example results of Test-3 for gliomas and those for lung cancers are shown in Fig. 5 and Fig. 6, respectively.

References

  • Aerts et al. (2014) Aerts, H.J.W.L., Velazquez, E.R., Leijenaar, R.T.H., Parmar, C., Grossmann, P., Carvalho, S., Bussink, J., Monshouwer, R., Haibe-Kains, B., Rietveld, D., Hoebers, F., Rietbergen, M.M., Leemans, C.R., Dekker, A., Quackenbush, J., Gillies, R.J., Lambin, P., 2014. Decoding tumour phenotype by noninvasive imaging using a quantitative radiomics approach. Nat. Commun. 5, 4006. URL: https://doi.org/10.1038/ncomms5006, doi:10.1038/ncomms5006.
  • Allan et al. (2012) Allan, C., Burel, J.M., Moore, J., Blackburn, C., Linkert, M., Loynton, S., MacDonald, D., Moore, W.J., Neves, C., Patterson, A., Porter, M., Tarkowska, A., Loranger, B., Avondo, J., Lagerstedt, I., Lianas, L., Leo, S., Hands, K., Hay, R.T., Patwardhan, A., Best, C., Kleywegt, G.J., Zanetti, G., Swedlow, J.R., 2012. OMERO: Flexible, model-driven data management for experimental biology. Nat. Methods 9, 245–253. URL: https://doi.org/10.1038/nmeth.1896, doi:10.1038/nmeth.1896.
  • Bakas et al. (2017a) Bakas, S., Akbari, H., Sotiras, A., Bilello, M., Rozycki, M., Kirby, J., Freymann, J., Farahani, K., Davatzikos, C., 2017a. Segmentation labels and radiomic features for the pre-operative scans of the TCGA-GBM collection. The Cancer Imaging Archive URL: https://doi.org/10.7937/K9/TCIA.2017.KLXWJJ1Q, doi:10.7937/K9/TCIA.2017.KLXWJJ1Q.
  • Bakas et al. (2017b) Bakas, S., Akbari, H., Sotiras, A., Bilello, M., Rozycki, M., Kirby, J., Freymann, J., Farahani, K., Davatzikos, C., 2017b. Segmentation labels and radiomic features for the pre-operative scans of the TCGA-LGG collection. The Cancer Imaging Archive URL: https://doi.org/10.7937/K9/TCIA.2017.GJQ7R0EF, doi:10.7937/K9/TCIA.2017.GJQ7R0EF.
  • Bakas et al. (2017c) Bakas, S., Akbari, H., Sotiras, A., Bilello, M., Rozycki, M., Kirby, J.S., Freymann, J.B., Farahani, K., Davatzikos, C., 2017c. Advancing the cancer genome atlas glioma MRI collections with expert segmentation labels and radiomic features. Sci. Data 4, 1–13. URL: https://doi.org/10.1038/sdata.2017.117, doi:10.1038/sdata.2017.117.
  • Bernhardsson (2022) Bernhardsson, E., 2022. Approximate nearest neighbors oh yeah. https://github.com/spotify/annoy.
  • Bhunia et al. (2022) Bhunia, A.K., Koley, S., Khilji, A.F.U.R., Sain, A., Chowdhury, P.N., Xiang, T., Song, Y.Z., 2022. Sketching without worrying: Noise-tolerant sketch-based image retrieval, in: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE. pp. 999–1008. URL: https://doi.org/10.1109/cvpr52688.2022.00107, doi:10.1109/cvpr52688.2022.00107.
  • Chen et al. (2022) Chen, C., Lu, M.Y., Williamson, D.F.K., Chen, T.Y., Schaumberg, A.J., Mahmood, F., 2022. Fast and scalable search of whole-slide images via self-supervised deep learning. Nat. Biomed. Eng. , 1–15URL: https://doi.org/10.1038/s41551-022-00929-8, doi:10.1038/s41551-022-00929-8.
  • Cutillo et al. (2020) Cutillo, C.M., Sharma, K.R., Foschini, L., Kundu, S., Mackintosh, M., Mandl, K.D., Beck, T., Collier, E., Colvis, C., Gersing, K., Gordon, V., Jensen, R., Shabestari, B., Southall, N., 2020. Machine intelligence in healthcare—perspectives on trustworthiness, explainability, usability, and transparency. npj Digit. Med. 3, 1–5. URL: https://doi.org/10.1038/s41746-020-0254-2, doi:10.1038/s41746-020-0254-2.
  • Dice (1945) Dice, L.R., 1945. Measures of the amount of ecologic association between species. Ecology 26, 297–302. URL: https://doi.org/10.2307/1932409, doi:10.2307/1932409.
  • Drozdzal et al. (2016) Drozdzal, M., Vorontsov, E., Chartrand, G., Kadoury, S., Pal, C., 2016. The importance of skip connections in biomedical image segmentation, in: Carneiro, G., Mateus, D., Peter, L., Bradley, A., Tavares, J.a.M.R.S., Belagiannis, V., Papa, J.a.P., Nascimento, J.C., Loog, M., Lu, Z., Cardoso, J.S., Cornebise, J. (Eds.), Deep Learning and Data Labeling for Medical Applications, Springer International Publishing, Cham. pp. 179–187. URL: https://link.springer.com/chapter/10.1007/978-3-319-46976-8_19.
  • Dutta and Akata (2020) Dutta, A., Akata, Z., 2020. Semantically tied paired cycle consistency for any-shot sketch-based image retrieval. Int. J. Comput. Vision 128, 2684–2703. URL: https://doi.org/10.1007/s11263-020-01350-x, doi:10.1007/s11263-020-01350-x.
  • Fang et al. (2021) Fang, J., Fu, H., Liu, J., 2021. Deep triplet hashing network for case-based medical image retrieval. Med. Image Anal. 69, 101981. URL: https://doi.org/10.1016/J.MEDIA.2021.101981, doi:10.1016/J.MEDIA.2021.101981.
  • Goodfellow et al. (2014) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y., 2014. Generative adversarial nets, in: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N., Weinberger, K. (Eds.), Advances in Neural Information Processing Systems 27 (NIPS), Curran Associates, Inc.. pp. 1–9. URL: https://proceedings.neurips.cc/paper/2014/file/5ca3e9b122f61f8f06494c97b1afccf3-Paper.pdf.
  • Haq et al. (2021) Haq, N.F., Moradi, M., Wang, Z.J., 2021. A deep community based approach for large scale content based x-ray image retrieval. Med. Image Anal. 68, 101847. URL: https://doi.org/10.1016/J.MEDIA.2020.101847, doi:10.1016/J.MEDIA.2020.101847.
  • He et al. (2016) He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE. pp. 770–778. URL: https://doi.org/10.1109/CVPR.2016.90, doi:10.1109/CVPR.2016.90.
  • Hofmanninger et al. (2020) Hofmanninger, J., Prayer, F., Pan, J., Röhrich, S., Prosch, H., Langs, G., 2020. Automatic lung segmentation in routine imaging is primarily a data diversity problem, not a methodology problem. Eur. Radiol. Exp. 4, 1–13. URL: https://doi.org/10.1186/s41747-020-00173-2, doi:10.1186/s41747-020-00173-2.
  • Hosny et al. (2018) Hosny, A., Parmar, C., Quackenbush, J., Schwartz, L.H., Aerts, H.J.W.L., 2018. Artificial intelligence in radiology. Nat. Rev. Cancer 18, 500–510. URL: https://doi.org/10.1038/s41568-018-0016-5, doi:10.1038/s41568-018-0016-5.
  • Kingma and Welling (2014) Kingma, D.P., Welling, M., 2014. Auto-encoding variational bayes, in: The 2nd International Conference on Learning Representations (ICLR), pp. 1–14. URL: https://arxiv.org/abs/1312.6114.
  • Kobayashi et al. (2021) Kobayashi, K., Hataya, R., Kurose, Y., Miyake, M., Takahashi, M., Nakagawa, A., Harada, T., Hamamoto, R., 2021. Decomposing normal and abnormal features of medical images for content-based image retrieval of glioma imaging. Med. Image Anal. 74, 102227. URL: https://doi.org/10.1016/j.media.2021.102227, doi:10.1016/j.media.2021.102227.
  • Lamine (2008) Lamine, M., 2008. Review of human-computer interaction issues in image retrieval, in: Pinder, S. (Ed.), Advances in Human Computer Interaction. InTech, Rijeka. chapter 14, pp. 215–240. URL: https://doi.org/10.5772/5929, doi:10.5772/5929.
  • LeCun et al. (2015) LeCun, Y., Bengio, Y., Hinton, G., 2015. Deep learning. Nature 521, 436–444. URL: https://doi.org/10.1038/nature14539, doi:10.1038/nature14539.
  • Li and Li (2018) Li, Y., Li, W., 2018. A survey of sketch-based image retrieval. Mach. Vision Appl. 29, 1083–1100. URL: https://doi.org/10.1007/s00138-018-0953-8, doi:10.1007/s00138-018-0953-8.
  • Li et al. (2018) Li, Z., Zhang, X., Müller, H., Zhang, S., 2018. Large-scale retrieval for medical image analytics: A comprehensive review. Med. Image Anal. 43, 66–84. URL: https://doi.org/10.1016/j.media.2017.09.007, doi:10.1016/j.media.2017.09.007.
  • Liang et al. (2022) Liang, W., Tadesse, G.A., Ho, D., Fei-Fei, L., Zaharia, M., Zhang, C., Zou, J., 2022. Advances, challenges and opportunities in creating data for trustworthy AI. Nat. Mach. Intell. 4, 669–677. URL: https://doi.org/10.1038/s42256-022-00516-1, doi:10.1038/s42256-022-00516-1.
  • Lin et al. (2020) Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollar, P., 2020. Focal loss for dense object detection. IEEE Trans. Pattern Anal. Mach. Intell. 42, 318–327. URL: https://doi.org/10.1109/tpami.2018.2858826, doi:10.1109/tpami.2018.2858826.
  • Liu et al. (2022) Liu, X., Sanchez, P., Thermos, S., O’Neil, A.Q., Tsaftaris, S.A., 2022. Learning disentangled representations in the imaging domain. Med. Image Anal. 80, 102516. URL: https://doi.org/10.1016/j.media.2022.102516, doi:10.1016/j.media.2022.102516.
  • Long et al. (2003) Long, F., Zhang, H., Feng, D.D., 2003. Fundamentals of content-based image retrieval, in: Feng, D.D., Siu, W.C., Zhang, H.J. (Eds.), Multimedia Information Retrieval and Management. Springer, Berlin, Heidelberg, pp. 1–26. URL: https://doi.org/10.1007/978-3-662-05300-3_1, doi:10.1007/978-3-662-05300-3_1.
  • van der Maaten and Hinton (2011) van der Maaten, L., Hinton, G., 2011. Visualizing non-metric similarities in multiple maps. Mach. Learn. 87, 33–55. URL: https://doi.org/10.1007/s10994-011-5273-4, doi:10.1007/s10994-011-5273-4.
  • Menze et al. (2015) Menze, B.H., Jakab, A., Bauer, S., Kalpathy-Cramer, J., Farahani, K., Kirby, J., Burren, Y., Porz, N., Slotboom, J., Wiest, R., Lanczi, L., Gerstner, E., Weber, M.A., Arbel, T., Avants, B.B., Ayache, N., Buendia, P., Collins, D.L., Cordier, N., Corso, J.J., Criminisi, A., Das, T., Delingette, H., Demiralp, C., Durst, C.R., Dojat, M., Doyle, S., Festa, J., Forbes, F., Geremia, E., Glocker, B., Golland, P., Guo, X., Hamamci, A., Iftekharuddin, K.M., Jena, R., John, N.M., Konukoglu, E., Lashkari, D., Mariz, J.A., Meier, R., Pereira, S., Precup, D., Price, S.J., Raviv, T.R., Reza, S.M.S., Ryan, M., Sarikaya, D., Schwartz, L., Shin, H.C., Shotton, J., Silva, C.A., Sousa, N., Subbanna, N.K., Szekely, G., Taylor, T.J., Thomas, O.M., Tustison, N.J., Unal, G., Vasseur, F., Wintermark, M., Ye, D.H., Zhao, L., Zhao, B., Zikic, D., Prastawa, M., Reyes, M., Van Leemput, K., 2015. The multimodal brain tumor image segmentation benchmark (BRATS). IEEE Trans. Med. Imaging 34, 1993–2024. URL: https://doi.org/10.1109/tmi.2014.2377694, doi:10.1109/tmi.2014.2377694.
  • Miao et al. (2021) Miao, Z., Liu, Z., Gaynor, K.M., Palmer, M.S., Yu, S.X., Getz, W.M., 2021. Iterative human and automated identification of wildlife images. Nat. Mach. Intell. 3, 885–895. URL: https://doi.org/10.1038/s42256-021-00393-0, doi:10.1038/s42256-021-00393-0.
  • Paszke et al. (2019) Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., Chintala, S., 2019. PyTorch: An imperative style, high-performance deep learning library, in: Advances in Neural Information Processing Systems 32 (NeurIPS), pp. 8024–8035. URL: https://proceedings.neurips.cc/paper/2019/hash/bdbca288fee7f92f2bfa9f7012727740-Abstract.html.
  • Pedronette and Torres (2015) Pedronette, D.C.G., Torres, R.D.S., 2015. Unsupervised effectiveness estimation for image retrieval using reciprocal rank information, in: 2015 28th SIBGRAPI Conference on Graphics, Patterns and Images, IEEE. pp. 321–328. URL: https://doi.org/10.1109/SIBGRAPI.2015.28, doi:10.1109/SIBGRAPI.2015.28.
  • Pinho et al. (2019) Pinho, E., Silva, J.F., Costa, C., 2019. Volumetric feature learning for query-by-example in medical imaging archives, in: 2019 IEEE 32nd International Symposium on Computer-Based Medical Systems (CBMS), IEEE. pp. 138–143. URL: https://doi.org/10.1109/CBMS.2019.00038, doi:10.1109/CBMS.2019.00038.
  • Prior et al. (2017) Prior, F., Smith, K., Sharma, A., Kirby, J., Tarbox, L., Clark, K., Bennett, W., Nolan, T., Freymann, J., 2017. The public cancer radiology imaging collections of the cancer imaging archive. Sci. data 4. URL: https://pubmed.ncbi.nlm.nih.gov/28925987/, doi:10.1038/SDATA.2017.124.
  • Quellec et al. (2011) Quellec, G., Lamard, M., Cazuguel, G., Roux, C., Cochener, B., 2011. Case retrieval in medical databases by fusing heterogeneous information. IEEE Trans. Med. Imaging 30, 108–118. URL: https://doi.org/10.1109/tmi.2010.2063711, doi:10.1109/tmi.2010.2063711.
  • Raghu et al. (2019) Raghu, M., Zhang, C., Kleinberg, J., Bengio, S., 2019. Transfusion: Understanding transfer learning for medical imaging, in: Advances in Neural Information Processing Systems 32 (NeurIPS), Curran Associates Inc.. pp. 1–11. URL: https://proceedings.neurips.cc/paper/2019/file/eb1e78328c46506b46a4ac4a1e378b91-Paper.pdf.
  • Razavi et al. (2019) Razavi, A., van den Oord, A., Vinyals, O., 2019. Generating diverse high-fidelity images with VQ-VAE-2, in: Advances in Neural Information Processing Systems 32 (NeurIPS), pp. 1–11. URL: http://papers.nips.cc/paper/9625-generating-diverse-high-fidelity-images-with-vq-vae-2.
  • Rezende et al. (2014) Rezende, D.J., Mohamed, S., Wierstra, D., 2014. Stochastic backpropagation and approximate inference in deep generative models, in: Proceedings of the 31st International Conference on Machine Learning (ICML), pp. 1278–1286. URL: https://proceedings.mlr.press/v32/rezende14.html.
  • Rossi et al. (2021) Rossi, A., Hosseinzadeh, M., Bianchini, M., Scarselli, F., Huisman, H., 2021. Multi-modal siamese network for diagnostically similar lesion retrieval in prostate mri. IEEE Trans. Med. Imaging 40, 986–995. URL: https://doi.org/10.1109/TMI.2020.3043641, doi:10.1109/TMI.2020.3043641.
  • Russakovsky et al. (2015) Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L., 2015. ImageNet large scale visual recognition challenge. Int. J. Comput. Vision 115, 211–252. URL: https://doi.org/10.1007/s11263-015-0816-y, doi:10.1007/s11263-015-0816-y.
  • Sangkloy et al. (2016) Sangkloy, P., Burnell, N., Ham, C., Hays, J., 2016. The sketchy database. ACM Trans. Graphics 35, 1–12. URL: https://doi.org/10.1145/2897824.2925954, doi:10.1145/2897824.2925954.
  • Schlegl et al. (2017) Schlegl, T., Seeböck, P., Waldstein, S.M., Schmidt-Erfurth, U., Langs, G., 2017. Unsupervised anomaly detection with generative adversarial networks to guide marker discovery, in: The 25th International Conference on Information Processing in Medical Imaging (IPMI), pp. 146–157. URL: https://link.springer.com/chapter/10.1007/978-3-319-59050-9_12.
  • Shattuck and Leahy (2002) Shattuck, D.W., Leahy, R.M., 2002. BrainSuite: An automated cortical surface identification tool. Med. Image Anal. 6, 129–142. URL: https://doi.org/10.1016/s1361-8415(02)00054-3, doi:10.1016/s1361-8415(02)00054-3.
  • Shirahatti and Barnard (2005) Shirahatti, N.V., Barnard, K., 2005. Evaluating image retrieval, in: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), IEEE. pp. 955–961. URL: https://doi.org/10.1109/CVPR.2005.147, doi:10.1109/CVPR.2005.147.
  • Simonyan and Zisserman (2015) Simonyan, K., Zisserman, A., 2015. Very deep convolutional networks for large-scale image recognition, in: The 3rd International Conference on Learning Representations (ICLR), pp. 1–14. URL: https://arxiv.org/abs/1409.1556.
  • The Cancer Genome Atlas Research Network (2015) The Cancer Genome Atlas Research Network, 2015. Comprehensive, integrative genomic analysis of diffuse lower-grade gliomas. New Engl. J. Med. 372, 2481–2498. URL: https://doi.org/10.1056/nejmoa1402121, doi:10.1056/nejmoa1402121.
  • Tschandl et al. (2020) Tschandl, P., Rinner, C., Apalla, Z., Argenziano, G., Codella, N., Halpern, A., Janda, M., Lallas, A., Longo, C., Malvehy, J., Paoli, J., Puig, S., Rosendahl, C., Soyer, H.P., Zalaudek, I., Kittler, H., 2020. Human–computer collaboration for skin cancer recognition. Nat. Med. 26, 1229–1234. URL: https://doi.org/10.1038/s41591-020-0942-0, doi:10.1038/s41591-020-0942-0.
  • Turro et al. (2020) Turro, E., Astle, W.J., Megy, K., Gräf, S., Greene, D., Shamardina, O., Allen, H.L., Sanchis-Juan, A., Frontini, M., Thys, C., Stephens, J., Mapeta, R., Burren, O.S., Downes, K., Haimel, M., Tuna, S., Deevi, S.V., Aitman, T.J., Bennett, D.L., Calleja, P., Carss, K., Caulfield, M.J., Chinnery, P.F., Dixon, P.H., Gale, D.P., James, R., Koziell, A., Laffan, M.A., Levine, A.P., Maher, E.R., Markus, H.S., Morales, J., Morrell, N.W., Mumford, A.D., Ormondroyd, E., Rankin, S., Rendon, A., Richardson, S., Roberts, I., Roy, N.B., Saleem, M.A., Smith, K.G., Stark, H., Tan, R.Y., Themistocleous, A.C., Thrasher, A.J., Watkins, H., Webster, A.R., Wilkins, M.R., Williamson, C., Whitworth, J., Humphray, S., Bentley, D.R., Abbs, S., Abulhoul, L., Adlard, J., Ahmed, M., Alachkar, H., Allsup, D.J., Almeida-King, J., Ancliff, P., Antrobus, R., Armstrong, R., Arno, G., Ashford, S., Attwood, A., Aurora, P., Babbs, C., Bacchelli, C., Bakchoul, T., Banka, S., Bariana, T., Barwell, J., Batista, J., Baxendale, H.E., Beales, P.L., Bentley, D.R., Bierzynska, A., Biss, T., Bitner-Glindzicz, M.A., Black, G.C., Bleda, M., Blesneac, I., Bockenhauer, D., Bogaard, H., Bourne, C.J., Boyce, S., Bradley, J.R., Bragin, E., Breen, G., Brennan, P., Brewer, C., Brown, M., Browning, A.C., Browning, M.J., Buchan, R.J., Buckland, M.S., Bueser, T., Diz, C.B., Burn, J., Burns, S.O., Burren, O.S., Burrows, N., Campbell, C., Carr-White, G., Carss, K., Casey, R., Chambers, J., Chambers, J., Chan, M.M., Cheah, C., Cheng, F., Chinnery, P.F., Chitre, M., Christian, M.T., Church, C., Clayton-Smith, J., Cleary, M., Brod, N.C., Coghlan, G., Colby, E., Cole, T.R., Collins, J., Collins, P.W., Colombo, C., Compton, C.J., Condliffe, R., Cook, S., Cook, H.T., Cooper, N., Corris, P.A., Furnell, A., Cunningham, F., Curry, N.S., Cutler, A.J., Daniels, M.J., Dattani, M., Daugherty, L.C., Davis, J., Soyza, A.D., Deevi, S.V., Dent, T., Deshpande, C., Dewhurst, E.F., Dixon, P.H., Douzgou, S., Downes, K., Drazyk, A.M., Drewe, E., Duarte, D., Dutt, T., Edgar, J.D.M., Edwards, K., Egner, W., Ekani, M.N., Elliott, P., Erber, W.N., Erwood, M., Estiu, M.C., Evans, D.G., Evans, G., Everington, T., Eyries, M., Fassihi, H., Favier, R., Findhammer, J., Fletcher, D., Flinter, F.A., Floto, R.A., Fowler, T., Fox, J., Frary, A.J., French, C.E., Freson, K., Frontini, M., Gale, D.P., Gall, H., Ganesan, V., Gattens, M., Geoghegan, C., Gerighty, T.S., Gharavi, A.G., Ghio, S., Ghofrani, H.A., Gibbs, J.S.R., Gibson, K., Gilmour, K.C., Girerd, B., Gleadall, N.S., Goddard, S., Goldstein, D.B., Gomez, K., Gordins, P., Gosal, D., Gräf, S., Graham, J., Grassi, L., Greene, D., Greenhalgh, L., Greinacher, A., Gresele, P., Griffiths, P., Grigoriadou, S., Grocock, R.J., Grozeva, D., Gurnell, M., Hackett, S., Hadinnapola, C., Hague, W.M., Hague, R., Haimel, M., Hall, M., Hanson, H.L., Haque, E., Harkness, K., Harper, A.R., Harris, C.L., Hart, D., Hassan, A., Hayman, G., Henderson, A., Herwadkar, A., Hoffman, J., Holden, S., Horvath, R., Houlden, H., Houweling, A.C., Howard, L.S., Hu, F., Hudson, G., Hughes, J., Huissoon, A.P., Humbert, M., Humphray, S., Hunter, S., Hurles, M., Irving, M., Izatt, L., Johnson, S.A., Jolles, S., Jolley, J., Josifova, D., Jurkute, N., Karten, T., Karten, J., Kasanicki, M.A., Kazkaz, H., Kazmi, R., Kelleher, P., Kelly, A.M., Kelsall, W., Kempster, C., Kiely, D.G., Kingston, N., Klima, R., Koelling, N., Kostadima, M., Kovacs, G., Koziell, A., Kreuzhuber, R., Kuijpers, T.W., Kumar, A., Kumararatne, D., Kurian, M.A., Laffan, M.A., 
Lalloo, F., Lambert, M., Lawrie, A., Layton, D.M., Lench, N., Lentaigne, C., Lester, T., Levine, A.P., Linger, R., Longhurst, H., Lorenzo, L.E., Louka, E., Lyons, P.A., Machado, R.D., Ross, R.V.M., Madan, B., Maher, E.R., Maimaris, J., Malka, S., Mangles, S., Mapeta, R., Marchbank, K.J., Marks, S., Marschall, H.U., Marshall, A., Martin, J., Mathias, M., Matthews, E., Maxwell, H., McAlinden, P., McCarthy, M.I., McKinney, H., McMahon, A., Meacham, S., Mead, A.J., Castello, I.M., Megy, K., Mehta, S.G., Michaelides, M., Millar, C., Mohammed, S.N., Moledina, S., Montani, D., Moore, A.T., Morales, J., Morrell, N.W., Mozere, M., Muir, K.W., Mumford, A.D., Nemeth, A.H., Newman, W.G., Newnham, M., Noorani, S., Nurden, P., O’Sullivan, J., Obaji, S., Odhams, C., Okoli, S., Olschewski, A., Olschewski, H., Ong, K.R., Oram, S.H., Ormondroyd, E., Ouwehand, W.H., Palles, C., Papadia, S., Park, S.M., Parry, D., Patel, S., Paterson, J., Peacock, A., Pearce, S.H., Peden, J., Peerlinck, K., Penkett, C.J., Pepke-Zaba, J., Petersen, R., Pilkington, C., Poole, K.E., Prathalingam, R., Psaila, B., Pyle, A., Quinton, R., Rahman, S., Rankin, S., Rao, A., Raymond, F.L., Rayner-Matthews, P.J., Rees, C., Rendon, A., Renton, T., Rhodes, C.J., Rice, A.S., Richardson, S., Richter, A., Robert, L., Roberts, I., Rogers, A., Rose, S.J., Ross-Russell, R., Roughley, C., Roy, N.B., Ruddy, D.M., Sadeghi-Alavijeh, O., Saleem, M.A., Samani, N., Samarghitean, C., Sanchis-Juan, A., Sargur, R.B., Sarkany, R.N., Satchell, S., Savic, S., Sayer, J.A., Sayer, G., Scelsi, L., Schaefer, A.M., Schulman, S., Scott, R., Scully, M., Searle, C., Seeger, W., Sen, A., Sewell, W.A., Seyres, D., Shah, N., Shamardina, O., Shapiro, S.E., Shaw, A.C., Short, P.J., Sibson, K., Side, L., Simeoni, I., Simpson, M.A., Sims, M.C., Sivapalaratnam, S., Smedley, D., Smith, K.R., Smith, K.G., Snape, K., Soranzo, N., Soubrier, F., Southgate, L., Spasic-Boskovic, O., Staines, S., Staples, E., Stark, H., Stephens, J., Steward, C., Stirrups, K.E., Stuckey, A., Suntharalingam, J., Swietlik, E.M., Syrris, P., Tait, R.C., Talks, K., Tan, R.Y., Tate, K., Taylor, J.M., Taylor, J.C., Thaventhiran, J.E., Themistocleous, A.C., Thomas, E., Thomas, D., Thomas, M.J., Thomas, P., Thomson, K., Thrasher, A.J., Threadgold, G., Thys, C., Tilly, T., Tischkowitz, M., Titterton, C., Todd, J.A., Toh, C.H., Tolhuis, B., Tomlinson, I.P., Toshner, M., Traylor, M., Treacy, C., Treadaway, P., Trembath, R., Tuna, S., Turek, W., Turro, E., Twiss, P., Vale, T., Geet, C.V., van Zuydam, N., Vandekuilen, M., Vandersteen, A.M., Vazquez-Lopez, M., von Ziegenweidt, J., Noordegraaf, A.V., Wagner, A., Waisfisz, Q., Walker, S.M., Walker, N., Walter, K., Ware, J.S., Watkins, H., Watt, C., Webster, A.R., Wedderburn, L., Wei, W., Welch, S.B., Wessels, J., Westbury, S.K., Westwood, J.P., Wharton, J., Whitehorn, D., Whitworth, J., Wilkie, A.O., Wilkins, M.R., Williamson, C., Wilson, B.T., Wong, E.K., Wood, N., Wood, Y., Woods, C.G., Woodward, E.R., Wort, S.J., Worth, A., Wright, M., Yates, K., Yong, P.F., Young, T., Yu, P., Yu-Wai-Man, P., Zlamalova, E., Kingston, N., Walker, N., Penkett, C.J., Freson, K., Stirrups, K.E., Raymond, F.L., 2020. Whole-genome sequencing of patients with rare diseases in a national health system. Nature 583, 96–102. URL: https://www.nature.com/articles/s41586-020-2434-2, doi:10.1038/s41586-020-2434-2.
  • Vinker et al. (2022) Vinker, Y., Pajouheshgar, E., Bo, J.Y., Bachmann, R.C., Bermano, A.H., Cohen-Or, D., Zamir, A., Shamir, A., 2022. CLIPasso. ACM Trans. Graphics 41, 1–11. URL: https://doi.org/10.1145/3528223.3530068, doi:10.1145/3528223.3530068.
  • Wang et al. (2013) Wang, Y., Wang, L., Li, Y., He, D., Liu, T.Y., 2013. A theoretical analysis of NDCG type ranking measures, in: Shalev-Shwartz, S., Steinwart, I. (Eds.), Proceedings of the 26th Annual Conference on Learning Theory, PMLR, Princeton, NJ, USA. pp. 25–54. URL: https://proceedings.mlr.press/v30/Wang13.html.
  • Zhang et al. (2020) Zhang, Z., Zhang, Y., Feng, R., Zhang, T., Fan, W., 2020. Zero-shot sketch-based image retrieval via graph convolution network, in: Proceedings of the AAAI Conference on Artificial Intelligence, AAAI. pp. 12943–12950. URL: https://doi.org/10.1609/aaai.v34i07.6993, doi:10.1609/aaai.v34i07.6993.
  • Zheng et al. (2018) Zheng, L., Yang, Y., Tian, Q., 2018. SIFT meets CNN: A decade survey of instance retrieval. IEEE Trans. Pattern Anal. Mach. Intell. 40, 1224–1244. URL: https://doi.org/10.1109/tpami.2017.2709749, doi:10.1109/tpami.2017.2709749.
  • Zhong et al. (2021) Zhong, A., Li, X., Wu, D., Ren, H., Kim, K., Kim, Y., Buch, V., Neumark, N., Bizzo, B., Tak, W.Y., Park, S.Y., Lee, Y.R., Kang, M.K., Park, J.G., Kim, B.S., Chung, W.J., Guo, N., Dayan, I., Kalra, M.K., Li, Q., 2021. Deep metric learning-based image retrieval system for chest radiograph and its clinical applications in covid-19. Med. Image Anal. 70, 101993. URL: https://doi.org/10.1016/J.MEDIA.2021.101993, doi:10.1016/J.MEDIA.2021.101993.
Figure 1: Example results of Test-1 for gliomas. a The presented image in Q.2 of Test-1 showed the highest recall@5 for gliomas. Three example user queries and corresponding top-5 retrieved images are shown. The retrieved images are arranged from left to right, starting with the most similar image. The letters S, L, and P in the upper left corner of each retrieved image indicate the judged consistencies of the size (S), location (L), and pattern of contrast enhancement (P), respectively. The number in the lower right corner indicates the given similarity score, with a maximum score of 3/3. The retrieved images highlighted with red boxes are the images belonging to the same volume as the presented images (i.e., the same-volume images). The corresponding retrieval result by the query-by-example approach is shown at the bottom. b The presented image in Q.3 of Test-1 showed the lowest recall@5 for gliomas. A failed case is shown in the middle row, where none of the retrieved images are highlighted in a red box. This is presumably the effect of overdiagnosis of the point-like contrast-enhanced region in the tumor drawn only in the middle user query (see the arrow), consistent with a skill-based limitation. T1, T1-weighted sequence; T1CE, T1-weighted contrast-enhanced sequence; FLAIR, fluid-attenuated inversion recovery sequence.
Figure 2: Example results of Test-1 for lung cancers. a The presented image in Q.4 of Test-1 showed the highest recall@5 for lung cancers. Three example user queries and the corresponding top-5 retrieved images are shown. The retrieved images are arranged from left to right, starting with the most similar image. The letters S and L in the upper left corner of each retrieved image indicate the judged consistencies of the size (S) and location (L), respectively. The number in the lower right corner indicates the given similarity score, with a maximum score of 2/2. The retrieved images highlighted with red boxes are the images belonging to the same volume as the presented images (i.e., the same-volume images). The corresponding retrieval result by the query-by-example approach is shown at the bottom. b The presented image in Q.1 of Test-1 showed the lowest recall@5 for lung cancers. All three cases shown here failed to retrieve the presented images, which may have been caused by a sketch-based limitation, meaning that the single class of tumor-associated labels was insufficient to express the variable internal characteristics of lung cancers. CT, computed tomography.
Figure 3: Example results of Test-2 for gliomas. a The presented image in Q.3 of Test-2 showed the highest precision@5 for gliomas. Three example user queries and corresponding top-5 retrieved images are shown. The retrieved images are arranged from left to right, starting with the most similar image. The letters S, L, and P in the upper left corner of each retrieved image indicate the judged consistencies of the size (S), location (L), and pattern of contrast enhancement (P), respectively. The number in the lower right corner indicates the given similarity score, with a maximum score of 3/3. Note that even though the evaluators were presented with the same sentence independently, the user queries resembled each other in terms of their successful retrieval of clinically similar images, as shown by the similarity scores of 3/3. b The presented image in Q.5 of Test-2 showed the lowest precision@5 for gliomas. It can be seen that the semantic sketches differed in terms of how each evaluator expressed the tumor extension beyond the corpus callosum, which could represent a skill-based limitation if this difference had influenced the search results. T1CE, T1-weighted contrast-enhanced sequence.
Figure 4: Example results of Test-2 for lung cancers. a The presented image in Q.3 of Test-2 showed the highest precision@5 for lung cancers. Three example user queries and the corresponding top-5 retrieved images are shown. The retrieved images are arranged from left to right, starting with the most similar image. The letters S and L in the upper left corner of each retrieved image indicate the judged consistencies of the size (S) and location (L), respectively. The number in the lower right corner indicates the given similarity score, with a maximum score of 2/2. Note that even though the evaluators were presented with the same sentence independently, the user queries resembled each other and achieved successful retrieval of clinically similar images, as shown by the similarity scores of 2/2. b The presented image in Q.4 of Test-2 showed the lowest precision@5 for lung cancers. The low precision@5 may have been caused by a template-image-based limitation, suggesting that detailed anatomical locations, such as the pulmonary hilar region, were not completely learned by the model, which was trained in a self-supervised manner to learn the normal anatomy.
Figure 5: Example results of Test-3 for gliomas. a The presented image, which was an isolated sample, in Q.1 of Test-3 showed the highest recall@5 for gliomas. Three example user queries and the corresponding top-5 retrieved images are shown. The retrieved images are arranged from left to right, starting with the most similar image. Retrieved images highlighted with red boxes are the images belonging to the same volume as the presented isolated sample (i.e., the same-volume images). Note that three different user queries successfully retrieved the same-volume images. b The presented image in Q.2 of Test-3 showed the lowest recall@5 for gliomas. The failed case is shown in the middle row, where none of the retrieved images are highlighted in a red box. This failure may be because the peritumoral edema was not sketched in the user query (see the arrow), in contrast to the other user queries, possibly indicating a skill-based limitation. T1, T1-weighted sequence; T1CE, T1-weighted contrast-enhanced sequence; FLAIR, fluid-attenuated inversion recovery sequence.
Figure 6: Example results of Test-3 for lung cancers. a The presented image, which was an isolated sample, in Q.4 of Test-3 showed the highest recall@5 for lung cancers. Three example user queries and corresponding top-5 retrieved images are shown. The retrieved images are arranged from left to right, starting with the most similar image. Retrieved images highlighted with red boxes are the images belonging to the same volume as the presented isolated sample (i.e., the same-volume images). Note that three different user queries successfully retrieved the same-volume images. b The presented image in Q.2 of Test-3 showed the lowest recall@5 for lung cancers. Even though the three user queries were relatively similar, only the query in the middle row successfully retrieved the same-volume image. This may be attributed to a template-image-based limitation owing to the large difference in the body size between the presented image and the template image. CT, computed tomography.