Explainable Artificial Intelligence Architecture for Melanoma Diagnosis Using Indicator Localization and Self-Supervised Learning

Ruitong Sun
University of Southern California
[email protected] Mohammad Rostami
University of Southern California
[email protected]

Abstract

Melanoma is a prevalent lethal type of cancer that is treatable if diagnosed at early stages of development. Skin lesions are a typical indicator for diagnosing melanoma but they often led to delayed diagnosis due to high similarities of cancerous and benign lesions at early stages of melanoma. Deep learning (DL) can be used as a solution to classify skin lesion pictures with a high accuracy, but clinical adoption of deep learning faces a significant challenge. The reason is that the decision processes of deep learning models are often uninterpretable which makes them black boxes that are challenging to trust. We develop an explainable deep learning architecture for melanoma diagnosis which generates clinically interpretable visual explanations for its decisions. Our experiments demonstrate that our proposed architectures matches clinical explanations significantly better than existing architectures.

1 Introduction

Melanoma is a prevalent type of skin cancer that can be highly deadly in advanced stages. For this reason, early detection of melanoma is the most important factor for successful treatment. New skin moles or changes in existing models are the most distinct symptoms of melanoma. However, due to similarity of benign and cancerous moles, melanoma diagnosis is a sensitive task that can be preformed by trained dermatologist. If skin moles are not screened and graded on time, melanoma maybe detected by patients very late. Unfortunately, this is the case with low-income populations with limited access to healthcare. Advances in deep learning along with accessibility of smartphones have led to emergence of automatic diagnosis of melanoma using skin lesion ordinary photographs [5, 34, 11, 1, 10, 15, 9]. When evaluated only in terms of diagnosis of melanoma, deep models have accuracy rates close to those of dermatologists. Despite this success, adoption of these models in clinical setting has been limited.

Refer to caption — Figure 1: Left: a skin mole photograph; right: explainability heatmap generated using Grad-Cam [26] for a trained DNN.

A primary challenge for adopting deep learning in clinical tasks is the challenge of interpretability. Deep neural networks (DNNs) sometimes are called “black boxes” because their internal decision-making process is opaque. Existing explainability methods [30, 26, 38, 18] try to clarify decisions of these black boxes to help users or developers understand the most important areas of the image in making the classification, in the form of a heatmap. However, the area alone is not particularly helpful, e.g., Grad-Cam [26] simply highlights the entire mole in the melanoma image in Figure 1. In other words, the highlighted regions are often too large to show the shape of an interpretable region, or extremely deviated from the regions of interest to dermatologists. A reason behind this deficiency is that many explainability methods primarily consider the last DNN layer for heatmap generation, whereas some interpretable features maybe encoded at earlier layers. Hence, improving DL explanation methods may be helpful.

More importantly, there is no guarantee that a trained DNN uses human interpretable indicators for decision-making [3], irrespective of improving DL explainability algorithms. We argue existing explainability methods may not be sufficient for explainable DL in high-stakes domain such as medicine due to the end-to-end training pipeline of DL. In other words, a model that is trained only using a high-level abstract label, e.g., cancerous vs benign, may learn to extract indicator features that are totally different compared to the features human experts use. In contrast, dermatologists are trained to perform their diagnosis through identifying intermediate indicator biomarkers [2]. The solution that we propose is to benefit from intermediate-level annotations that denote human-interpretable features in the training pipeline to enforce a DNN learn making decisions similar to clinical experts. However, data annotation, particularly in medical applications, is an expensive and time-consuming task and generating a finely annotated dataset is infeasible. To circumvent this challenge, we use self-supervised learning [4] to train a human-interpretable model using only a small annotated dataset. Our empirical experiments demonstrate that our approach can generate explanations more similar to expert dermatologists.

2 Related Work

Explainability in Deep Learning

Existing explainability methods in deep learning primarily determine which spatial regionx of the input image or combination of regions led to a specific model decision or contribute significantly to the network prediction (see Figure 1). There are two main approaches to identify regions of interest when using deep learning: Model-based methods and model agnostic methods. Model-based methods work based on the details of the specific structures of a deep learning model. They examine the activations or weights of the deep network to find regions of importance. Grad-CAM and Layerwise Relevance propagation [25] are examples of such methods. Attention-based methods [8] similarly identify important image regions. Model agnostic methods ethods separate the explanations from the model which offers wide applicability. These methods (e.g., LIME [20]) manipulate inputs (e.g., pixels, regions or superpixels) and measure how changes in input affect output. If an input perturbation has no effect, it is not relevant to decision-making. In contrast, if a change has a major impact (e.g., changing the classification from glaucoma to normal), then the region is important to the classification. SHapley Additive exPlanations (SHAP) [12] can assign each feature or region an importance value for a particular prediction. Note, however, the regions found by these algorithms do not necessarily correspond to intermediate concepts or diagnostic features that are known to experts or novices. Hence, while these algorithms are helpful to explain classifications of DNNs, they do not help training models that mimic humans when making predictions.

Identifying regions of interest is also related to semantic segmentation [17, 32] which divides an image into segments that are semantically meaningful (e.g., separating moles from background skin in diagnosing melanoma). However, these methods mostly segment based on spatial similarities and do not offer any explanation how these segments that are generated can be used for classification of the input image. UNets [21] also identify regions within images but do not indicate the importance of regions to overall classification, a key step in explaining model decision.

Deep Learning for Melanoma Diagnosis

Dermatology is one of the most common use cases of DL in medicine, with many existing works in the literature [5, 34, 11, 1, 10, 15, 9]. Despite significant progress in DL, these methods simply train a DL on a labeled binary dataset using supervised learning. Despite being naive in terms the artificial intelligence (AI) algorithms they use, these works lead to decent performances, comparable with expert clinicians. There is still room for improving explainability characteristics of these methods to convince clinicians adopting AI for melanoma diagnosis in practice. However, only a few works have explored explainability of AI models for melanoma diagnosis. Murabayashi et al. [14] use clinical indicators benefit from virtual adversarial training [13] and multitask learning to train a model that predicts the clinical indicators in addition to the binary label to improve explainability. Nigar et al. [16] simply use LIME to study interpretability of their algorithm. Stieler et al. [33] use the ABCD-rule, a diagnostic approach of dermatologists, while training a model to improve interpretability. Shorfuzzaman [29] used meta-learning to train an ensemble of DNNs, each predicting an indicator, to use indicators to explain decisions. These existing works, however, do not spatially locate the indicators. We develop and architecture that generates spatial masks on the input image to locate clinical indicators spatially.

3 Problem Formulation

Due to the privacy concerns [31] and high annotation costs [23] associated with medical images, available and well-annotated medical data is extremely scarce. However, there are many unannotated datasets available. Therefore, we aim to leverage these large amounts of unannotated data to improve our models. We refer to such a dataset as $D^{UL}=(\bm{x}^{\prime}i)^{N}_{i=1}$ , where $x^{\prime}_{i}$ denotes the images. We refer to this dataset as Dataset B.

Most works based on using AI for melanoma diagnosis, consider that we have access to a dataset that includes skin lesion images along with corresponding binary labels for cancerous vs benign cases. The standard supervised learning is then use to train a suitable binary classifier, primarily based on convolutional neural networks (CNNs). In our work, we also consider such a dataset is accessible, referred to Dataset A in our formulation. Despite being a simple procedure for AI, it has been used extensively in the literature [19, 39, 5, 34, 11, 1, 10, 15, 9] due to high accuracy rates. However, as explained, this simple baseline does not lead to a human-centered explainable model. We cannot benefit from unsupervised domain adaptation [22] because we want to transfer knowledge from the unannotated domain to the annotated domain.

Fortunately, there are a few labeled datasets available where images are annotated with clinically plausible indicators. These indicator commonly are used by dermatologists and residents are trained to diagnose melanoma based on identifying them. We try to implement a similar approach, where the model is trained in end-to-end scheme to first predict the indicators as intermediate-level labels and then use them for diagnosis label prediction. Let $D^{L}=(\bm{x}_{i},y_{i},(\bm{z}_{ij})_{j=1}^{d})^{M}_{i=1}$ denotes this dataset, where $x_{i}$ and $y_{i}$ denote the images and their binary diagnostic labels. Additionally, $z_{ij}$ denotes a feature mask array with the same size as the input image, where for each $j$ , the mask denotes the spatial location of a clinically interpretable indicator, e.g., pigmented network, on the input image in the form of a a binary segmentation map. We refer to this dataset as Dataset A. Clearly, preparing Dataset A is a significantly more challenging task that Dataset B. It suffices to go though existing medical records to prepare Dataset B according to diagnosis. In contrast, existing medical records rarely include instances for Dataset A and hence, a dermatologist should determine the absence and presence of each indicator and locate then on the image in addition to a simple binary label. Even if can annotate some images to generate Dataset A, the size of Dataset A will be significantly smaller than Dataset B ( $M<<N$ ) due to scarcity of dermatologists who would accept serving as data annotators. Our goal is to benefit from both Dataset A and Dataset B to train an architecture that can be used for melanoma diagnosis with interpretable explanations.

A naive idea to train an explainable model is to use a suitable architecture and train one segmentation model, e.g., U-Net [21] to predict indicator masks. Previously, this idea has been used for training explainable AI models for medicine [28]. In our problem, we can use one U-Net for each of the $d$ indicators and train them using Dataset A. Hence, we will have $d$ image segmentation models that determine spatial location of each indicator for input images. However, there are two shortcomings. First, we will still need a secondary classification model to determine the diagnosis label from the indicators [14]. More importantly, the size of the Dataset A may not be sufficient for this purpose, particularly because only a subset of instances will contain a particular type of indicator and semantic segmentation is a complex task compared to classification. Since we likely will encounter the challenge of attribute sparsity, we likely will face overfitting during model execution.

The idea we will explore is to benefit from information encoded in Dataset B. Although Dataset B is not attributed coarsely, it is similar to Dataset A and transferring knowledge between these two datasets might be feasible. We formulate a weakly supervised learning problem for this purpose. Specifically, we rely on self-supervised learning using Dataset B to train an encoder that can better represent input images, enabling the model to generalize and locate biomarkers. Additionally, we propose to train multiple encoders to separately learn each feature, so that we can apply unique operations to each encoder and improve performance. Therefore, we can benefit from transfer learning to address the challenge of data sparsity. Finally, we connect the output vectors of multiple encoders to predict labels $\bm{y}_{i}$ , making the model’s predictions similar to those of an expert.

4 Proposed Algorithm

We have developed an explainable architecture for melanoma diagnosis. The network utilizes self-supervised learning to understand the inherent data structure. Then, we use U-Net as the basic backbone, and by letting the encoder perform the downsampling task, the Resnet backbone focuses more on the important areas identified by self-supervised learning, rather than providing too general features. In this section, we will describe the network architecture components and explain why we use each part. Additionally, we will discuss the training procedure used.

4.1 Overview of the network architecture

Figure 2, Visualize the architecture of our proposed model, which aims to generate a heatmap for the final convolution layer that closely resembles the regions expected by human experts when the network is trained for diagnostic classification. Our architecture consists of three key subnetworks:

(1) A pretrained Resnet50 network for attributes classification; we denote the subnetwork (1) as $f_{Res}$ where $f_{Res}(\bm{x}_{i})=E_{1}(\bm{x}_{i})+fc_{1}(\bm{x}_{i})$ , $\bm{x}_{i}\in D^{L}$ and $E_{1}(\cdot)$ represents the encoder, which consists of four consecutive blocks, each block consists of a downsampling and two ResNet residual blocks, $fc_{1}(\cdot)$ represents a fully connected layer which produces probabilistic outputs.
(2) An architecture based on U-Net for biomarker indicator locations. We replaced the U-Net encoder with the encoder $E_{1}(\cdot)$ of the $f_{Res}(\cdot)$ ; we denote the subnetwork (2) as $f_{Seg}(\cdot)$ where $f_{Seg}(\bm{h}_{i})=E_{1}(\bm{h}_{i})+D_{1}(\bm{h}_{i})$ , where $\bm{h}_{i}$ denotes the heatmaps that are generated using Grad-Cam, $D_{1}(\cdot)$ denotes the decoder which outputs biomarker indicator location.
(3) We add a ResNet50-based projection head for self-supervised learning; we denote the subnetwork (3) as $f_{CLR}(\cdot)$ , where $f_{CLR}(\bm{x}_{i}^{\prime})=E_{1}(\bm{x}_{i}^{\prime})+P_{1}(\bm{x}_{i}^{\prime})$ and $\bm{x}_{i}^{\prime}\in D^{UL}$ , $P_{1}(\cdot)$ is a projection head consisting of two fully connected layers that outputs a feature vector.

In summary, our architecture learns to classify the input images. Then, we use Grad-Cam to generate attention heatmaps. These heatmaps do not necessarily show interpretable indicators, but are useful for classification which means that they should have correlations with features expert clinicians use. Our idea is to feed these heatmaps to the segmentation subnetwork and generate the indicator biomarker localization maps using the Grad-Cam heatmaps. As a result, the architecture learns to generate explanations along with diagnosis labels. We can see that our full architecture can only be trained using Dataset A.

4.2 Bio-UNet Baseline Training

Our Bio-Unet baseline consists of networks $f_{Res}(\cdot)$ , $f_{Seg}(\cdot)$ . At first, We train the network $f_{Res}(\cdot)$ to do the skin lesion classification task using simple supervised learning. After this stage, the classification network can predict diagnosis labels with high accuracy. We then apply Grad-CAM on the network to generate attention heatmaps. Grad-CAM is generally used by computing the gradient of the classification score with respect to the last convolutional feature map to identify the classification score of the selected target in the most influential part of the input image. However, the final convolutional feature map primarily provides a high-level region of the input image which corresponds to the area of interest. Our experiments demonstrate that this mostly will just generate the full lesion mole similar to Figure 1 which is usually much larger than the region delineated by experts for a specific indicator and usually cannot reflect the location and area of an indicator. However, the earlier convolutional feature maps contain low-level information, e.g., the boundary of a region of interest. Hence, combining attention maps at all convolutional layers looks like an option that may generate a good estimation for the location of an indicator, but we empirically observed that simply averaging all heatmaps will result in poor output. Therefore, we would like to benefit from binary masks provided by experts and reconstruct these heatmaps using the segmentation network $f_{Seg}(\cdot)$ . We learn to use appropriate weights for each heatmap so that all bottleneck blocks of $E_{1}(\cdot)$ contribute to reconstruction of the final heatmap. Because the encoder $E_{1}(\cdot)$ has been used for training, in order not to affect its parameters, we created an encoder $E_{3}(\cdot)$ with the same architecture as $E_{1}(\cdot)$ . Specifically, the Resnet encoder $E_{3}(\cdot)$ consists of 12 bottleneck blocks, each loaded with optimal checkpoint parameters.Then we input a training image and a bottleneck block in Grad-CAM, get a heatmap, and then replace the bottleneck block in turn, getting a total of 12 localization masks. We denote $f_{Grad\_CAM}(b_{i},j)$ where $b_{i}$ is the bottleneck block and j is the index of the selected label.

For the segmentation task, dice coefficient is commonly used with a value ranging from 0 to 1 as the loss function. The larger the value, the more similar two binary masks would be. However, the Dice loss is only designed for binary data. In order to avoid adding the artificial factor of threshold during the model training, We adopt the soft dice loss as the segmentation loss in Eq. 1. The soft dice loss directly uses the predicted probability instead of using the threshold to change the output to 0 or 1. It is defined as:

L_{SoftDice}=1-\frac{2\sum_{Pixels}y_{true}y_{predict}}{\sum_{Pixels}y_{true}^{2}+\sum_{Pixels}y_{pred}^{2}}

(1)

By performing downsampling on the input heatmaps, the encoder will focus more on important features. During training, we add only the heatmap of the currently predicted feature by the encoder as input to $f_{Seg}(\cdot)$ , making the classification model $f_{Res}(\cdot)$ more accurate for the current attribute’s region of interest. Therefore, explanations can improve the diagnostic accuracy.

4.3 Self-Supervised Learning for Bio-UNet

The proposed Bio-Unet baseline architecture can learn boundaries efficiently, but this is only effective when the number of annotated images is large for biomarker indicators, e.g., pigment network. or the area corresponding to the indicators on a lesion is contiguous and large. To enable classification models to generate accurate heatmaps when Dataset A is small or when the area for an indicator is very small or is scatterred on the image, we benefit from self-supervised learning on Dataset B to improve the baseline of our proposed network architecture that is obtained by training on Dataset B. Specifically, we use SimCLR [4] which uses contrastive learning for improved visual representation. As shown in Figure 2, there are two independent data augmenters $T_{1}(\cdot)$ and $T_{2}(\cdot)$ which are randomly selected from rotation, scaling, cropping, brightness, contrast, saturation, and flipping transforms to generate an augmented version samples of Dataset B so that we can compute the contrastive learning loss. Each training image $\bm{x}^{\prime}_{i}\in D^{UL}$ is passed through two data augmenters to produce two augmented images. The two augmented images will then pass through our shared weight encoder $E_{1}(\cdot)$ and projection head $P_{1}(\cdot)$ , resulting in two 128-length features. In a minibatch of $N$ input images, $2N$ augmented images will be produced, for each pair of augmented images we treat as positive pairs, the other $2(N-1)$ are negative examples.We adopt contrastive loss as semi-supervised loss in Eq. (2)

\small L_{CLR}=-\log\frac{\exp(sim(f_{CLR}(T_{1}(x_{i})),f_{CLR}(T_{2}(x_{j})))/\tau)}{\sum_{k=1}^{2N}\exp(sim(f_{CLR}(T_{1}(x_{i})),f_{CLR}(T_{2}(x_{k})))/\tau)},

(2)

where $sim(u,v)=\frac{u^{T}v}{\|u\|\|v\|}$ and k $\not=$ i and $\tau$ denotes a temperature parameter. Upon training the encoders on Dataset B using self-supervised learning, we can benefit from transferring obtained knowledge across Dataset A and Dataset B. Due to the space limit, the full training procedure for our architecture is described in Algorithm 1.

Algorithm 1 Proposed Architecture Training Approach

Input1: $(\bm{x}_{i},y_{i})^{N}_{i=1}\in D^{L}$
      Input2: $(\bm{x}^{\prime}_{i},y^{\prime}_{i})^{M}_{i=1}\in D^{UL}$
      Input3: $(b_{i})^{12}_{i=1}$
      Output1: parameter $\theta$ for encoder $E_{1}$
      Output2: the final biomarker indicator location

f_{CLR}(x_{i}^{\prime})=E_{1}(x_{i}^{\prime})+P_{1}(x_{i}^{\prime})

;

f_{Res}(x_{i})==E_{1}(x_{i})+fc_{1}(x_{i})

;

f_{Seg}(H)=E_{1}(H)+D_{1}(H)

f_{Grad\_CAM}(b_{i},j)

where

b_{i}\in E_{3}

T_{1},T_{2}

: two separate data augmentation operators

7:while j

<

5 do

\triangleright

For 5 attributes

8: while Not stop do

9: Sample batch

B_{1}

x^{\prime}_{i}\in D^{UL}

10: Generating

f_{CLR}(T_{1}{(B_{1})})

and

f_{CLR}(T_{2}{(B_{1})})

11: Calculating loss

L_{CLR}

as equation(2)

12: Computing gradient of

L_{CLR}

and update

E_{1}

parameters

\theta

and

P_{1}

parameters

\theta_{1}

13: Sample batch

B_{2}

\{(x_{i},y_{i})\in D^{L}\}

14: Generating

f_{Res}(B_{2})

15: Calculating loss

L_{BCE}

16: Computing gradient of

L_{BCE}

and update

E_{1}

parameters

\theta

and

fc_{1}

parameters

\theta_{2}

17: Load the optimal checkpoint parameter

\theta

E_{3}

, use

f_{Grad\_CAM}(b_{i},j)

where

b_{i}\in E_{3}

to get the heatmap

h_{i}

18:

H=(h_{i})^{12}_{i=1}

19: Generating

f_{Seg}(H)

20: Calculating loss

L_{SoftDice}

as equation(1)

21: Computing gradient of

L_{SoftDice}

and update

E_{1}

parameters

\theta

and

D_{1}

parameters

\theta_{3}

22: EndWhile

23: return

E_{1}

parameters

\theta

24:EndWhile

25:Load 5 returned encoder parameter on five enocoders

26:Connect the outputs of 5 encoders and use the connected outputs as the input for the logistic regression model

5 Experimental Validation

Our implementation code is publicly available.

5.1 Experimental Setup

Datasets

We used the ISIC 2018 dataset [6, 35] to simulate our semi-supervised learning framework. It is a large collection of dermatoscopic images of common pigmented skin lesions with several prediction tasks. We used its Task 2 data as Dataset A and ISIC 2019 dataset [36, 37, 7] as Dataset B.

Dataset A: Task 2 of the ISIC 2018 dataset poses a challenge for melanoma clinical indicator detection. The task is to detect the following five dermoscopic attributes that are melanoma indicators: pigment network, negative network, streaks, mila-like cysts and globules. For descriptions of these clinical indicator biomarkers, please refer to the ISIC 2018 documentation [6, 35]. There are 2594 images in this task with binary labels for melanoma diagnosis. Table 1 shows the statistics for these indicators. As it can be guessed from our previous discussion, the dataset is sparse.

	Nonempty	empty	Skin images total
globules	603	1991	2594
milia_like_cyst	682	1912	2594
negative network	190	2404	2594
pigment network	1523	881	2594
streaks	100	2494	2594

Table 1: Comparison of the number of non-empty masks and the number of empty masks for each attribute

Dataset B: Task 3 in the ISIC 2019 dataset consists of 25331 images with only binary diagnosis labels. As it can be seen this dataset is much larger than Dataset A. However, the images are not annotated with the indicator biomarkers.

Baseline for Comparison:

Bio-Unet was compared against using variants of CAM when applied on the classification subnetwork for generating heatmap explanations for each of the indicators. We included Layer-CAM, Grad-CAM, Grad-CAM++ in our experiments. Our goal is to demonstrate that despite having a good accuracy, the ability to classify images with high accuracy does not lead to human-interpretable explanations, irrespective of the particular algorithm that we use for generating heatmaps.

Evaluation metrics

Our primary goal is localize the melanoma indicators on the input image. For this goal, we use the Continuous Dice metric [27]. We computed the Continuous Dice metric between the generated mask for each indicator and the provided ground truth map. For melanoma diagnosis, we used classification accuracy as the evaluation metric.

Implementation Details

For details about the algorithm implementation, optimizations hyperparameters, and the used hardware, please refer to the supplementary material.

5.2 Performance Results

After training the ResNet50 subnetwork and the Bio-UNet architecture for classification, we obtained $76\%$ and $82\%$ classification accuracy rates, respectively. We observe that both architecture have high melanoma diagnosis rates and using the additional subnetworks for segmentation and localization has led to a significant diagnosis performance boost. Table 2 presents results for localizing the five indicators when measured using the Continuous DICE metric. As it can be seen, utilizing Bio-UNet architecture along with our proposed training scheme enhances localization results by 0.95%, 2.33%, 5.11%, -8.24%, 9.01% over the standard architecture for the five indicators listed in Table 1, respectively. Each feature is critical, and Bio-UNet enables the encoder to better capture three features that are difficult for ResNet to detect. Every feature is vital for melanoma diagnosis, so when each feature can be localized uniformly, the accuracy of cancer diagnosis increases significantly. Table 2 presents results for localizing the five indicators when measured using the Continuous DICE metric.

We find that when the model’s decisions are close to what is expected by humans, the accuracy would increase. We conclude from our results that in order to make AI explainable, we may need to incorporate intermediate-level human-interpretable annotations in the deep learning end-to-end training pipelines and design DNN architectures that learn to perform a downstream task using human-interpretable intermediate indicators.

globules milia_like_cyst negative pigment streaks $f_{Res}(\cdot)$ + Grad-Cam 14.21 0.0 15.78 41.03 5.16 $f_{Res}(\cdot)$ + Grad-Cam++ 7.95 0.67 13.67 26.85 3.5 $f_{Res}(\cdot)$ + Layer-Cam 14.53 0.0 15.89 40.67 5.83 Bio-Unet 15.16 2.33 20.89 32.79 14.17

Table 2: Evaluating the localization accuracy for the five clinical indicators. Continuous Dice coefficient in percentage is reported.

To provide an intuition behind the quality of results presented in Table 2, we have visualized samples of heatmaps that are generated using different methods in Figure 3 for visual inspection. In this figure, we have selected an example image such that it is annotated with a corresponding indicator biomarker and presented the localization map that the AI pipelines generate. We can observe that the three CAM-based techniques generate features maps that are far larger than the ground truth mask and pretty much include the majority of the input mole. This means that the generated maps are not interpretable because they point to the whole mole which even a beginner knows that it should be the primary area of attention. In contrast, a close visualize comparison between columns two and six, demonstrates that our method generate binary feature maps that are quite similar to the ground truth, focusing on a subarea of the mole that in reality pertains to the clinical indicator. We also conclude that although the DICE metric is the predominant metric for segmentation, it is a sensitive metric when the semantic classes in the images are imbalanced.

5.3 Ablative Experiments

Experiments on the importance of the subnetworks:

$f_{Seg}$ $f_{CLR}$ globules milia_like_cyst negative pigment streaks Mel $f_{Res}(\cdot)$ 14.21 0.0 15.78 41.03 5.16 0.76 $f_{Res}(\cdot)$ + $f_{Seg}(\cdot)$ ✓ 15.21 0.5 17.78 39.58 3.83 0.76 $f_{Res}(\cdot)$ + $f_{CLR}(\cdot)$ ✓ 15.11 1.0 17.44 39.53 12.67 0.80 Bio-Unet ✓ ✓ 15.16 2.33 20.89 32.79 14.17 0.82

Table 3: Localization performance results for the ablative study on the importance of subnetworks of Bio-Unet.

We conducted an ablation study to investigate the contribution of each component of Bio-UNet. Table 3 presents the results of our ablative study. We observed that when the subnetwork $f_{CLR}(\cdot)$ was removed, the localization results for the ”streaks” and ”negative network” indicators were reduced. This observation was expected because ”streaks” and ”negative network” exist in only a very small number of samples and appear in a scattered and discontinuous manner in the input images. We concluded that self-supervised learning is extremely helpful for localizing infrequent indicators that appear in a scattered and discontinuous manner in the input images.

Figure 4 presents samples of generated masks using our ablative experiment. The second row presents a case of the ”milia-like cyst” indicator. It can be seen that this indicator corresponds to a single dot-like region in the ground truth. When we removed $f_{CLR}(\cdot)$ , the encoder was unable to capture the accurate location of this region. However, when we added $f_{Seg}(\cdot)$ after $f_{CLR}(\cdot)$ , the neural network was able to effectively remove irrelevant points and increase the importance of the region close to the ground truth. Therefore, we concluded that self-supervised learning can help provide more possible localization areas in some cases. From the last row of Figure 4, it can be seen that $f_{Seg}(\cdot)$ focuses on smaller regions, while $f_{CLR}(\cdot)$ provides a general region. When the two are combined, $f_{Seg}(\cdot)$ effectively refines the large but not confident region provided by $f_{CLR}(\cdot)$ . We concluded that all the new aspects of our architecture are critical for improved performance.

Impact of repeating $f_{Seg}(\cdot)$ within one epoch:

globules milia_like_cyst negative pigment streaks Melanoma $f_{Res}(\cdot)$ 14.21 0.0 15.78 41.03 5.16 0.76 Bio-Unet 15.16 2.33 20.89 32.79 14.17 0.82 Repeat 2 14.89 2.67 21.22 37.92 15.50 0.80 Repeat 3 14.84 0.16 16.33 39.01 12.83 0.82 Repeat 4 13.68 1.00 19.00 43.19 7.93 0.81 Repeat 5 13.26 1.83 16.89 38.04 11.5 0.80

Table 4: Evaluating the localization accuracy for the five clinical indicators. Continuous Dice Coefficient in percentage is reported.

Finally, we conducted an experiment to investigate whether repeating $f_{Seg}(\cdot)$ within each epoch is helpful. Figure 5 shows the results of repeating $f_{Seg}(\cdot)$ for all attributes. We can observe the second and third columns (last row) of the ”streaks” indicator, which correspond to no repetition and repetition once, respectively. We observed that when we repeated $f_{Seg}(\cdot)$ once in the loop, the resulting localization mask successfully focused on the two discontinuous parts and abandoned the wrong attention area on the left, compared to the Bio-UNet without repetition. Additionally, from Table 4, we can see that four out of five features increased, while the diagnostic rate for melanoma decreased. However, it can be seen from Table 4 that when repeating for the fourth time, the localization score of ”pigment network” reached its highest value, but the localization score of ”streaks” decreased. Therefore, I believe that each feature should use a different number of repetitions of $f_{Seg}(\cdot)$ . But for consistency in the paper, we ultimately used only one repetition of $f_{Seg}(\cdot)$ for all attributes.

Five encoders instead of multi-task learning

globules milia_like_cyst negative pigment streaks Melanoma $f_{Res}(\cdot)$ 14.21 0.0 15.78 41.03 5.16 0.76 Bio-Unet 15.16 2.33 20.89 32.79 14.17 0.82 Bio-Unet-(Two-tasks) 14.16 1.16 20.22 37.88 14.33 0.84 Bio-Unet-(Five-tasks) 15.89 0.5 19.44 15.64 10.67 0.82 Bio-Unet-(Six-tasks) 14.68 1.5 19.33 23.21 11.83 0.83

Table 5: Evaluate the localization accuracy of five clinical indicators in different mlutitask learning. Continuous Dice Coefficient in percentage is reported.

Table 5 represents different multi-task learning approaches. Bio-Unet-(Two-tasks) denotes that the encoder predicts both one intermediate feature and Melnoma label, while Bio-Unet-(Five-tasks) denotes that the encoder predicts all five intermediate features simultaneously. Bio-Unet-(Six-tasks) denotes that the encoder predicts all five intermediate features as well as the Melnoma label. We can see that adding the Melnoma label significantly improves the AUC, reaching a maximum of 84%, and the localization scores for ”pigment” and ”streaks” also increase. This shows that multi-task learning with Melnoma plays a significant role in improving performance. However, as the number of tasks increases to five or six, the continual dice coefficient for most attributes decreases, although the AUC remains sufficiently high.

I believe that performing $f_{Seg}(\cdot)$ for each individual attribute may conflict with the multi-task learning of Five or Six tasks. Therefore, I chose to use five encoders for each intermediate feature, which resulted in better cDC.

6 Conclusions

We developed an architecture for explainable diagnosis of melanoma using skin lesion images. Our architecture is designed to localize melanoma clinical indicators spatially and use them to predict the diagnosis label. As a result, it performs the task similar to a clinician, leading to interpretable decisions. We benefited from contrastive learning and self-supervised learning to address the challenge of annotated data scarcity for our task that requires coarse annotations with respect to clinical indicators. Experimental results demonstrate that our model is able to generate localization masks for identifying clinical biomarkers and generates more plausible explanations compared to existing classification architectures. Future works include extensions to learning settings with distributed data [24].

References

[1] Adekanmi A Adegun and Serestina Viriri. Deep learning-based system for automatic melanoma detection. IEEE Access, 8:7160–7172, 2019.
[2] G Argenziano, G Fabbrocini, P Carli, V De Giorgi, E Sammarco, and M Delfino. Epiluminescence microscopy for the diagnosis of abcd rule of dermatoscopy and a new 7-point checklist based on pattern analysis. Archives of Dermatology, (134):1536–1570, 1998.
[3] Kleanthis Avramidis, Mohammad Rostami, Melinda Chang, and Shrikanth Narayanan. Automating detection of papilledema in pediatric fundus images with explainable machine learning. In 2022 International Conference on Image Processing, 2022. ICIP’22. IEEE, 2022.
[4] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pages 1597–1607. PMLR, 2020.
[5] Noel CF Codella, Q-B Nguyen, Sharath Pankanti, David A Gutman, Brian Helba, Allan C Halpern, and John R Smith. Deep learning ensembles for melanoma recognition in dermoscopy images. IBM Journal of Research and Development, 61(4/5):5–1, 2017.
[6] Noel C. F. Codella, Veronica Rotemberg, Philipp Tschandl, M. Emre Celebi, Stephen W. Dusza, David A. Gutman, Brian Helba, Aadi Kalloo, Konstantinos Liopyris, Michael A. Marchetti, Harald Kittler, and Allan Halpern. Skin lesion analysis toward melanoma detection 2018: A challenge hosted by the international skin imaging collaboration (ISIC). CoRR, abs/1902.03368, 2019.
[7] Marc Combalia, Noel C F Codella, Veronica Rotemberg, Brian Helba, Veronica Vilaplana, Ofer Reiter, Cristina Carrera, Alicia Barreiro, Allan C Halpern, Susana Puig, and Josep Malvehy. BCN20000: Dermoscopic lesions in the wild. Aug. 2019.
[8] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[9] Mario Fernando Jojoa Acosta, Liesle Yail Caballero Tovar, Maria Begonya Garcia-Zapirain, and Winston Spencer Percybrooks. Melanoma diagnosis using deep learning techniques on dermatoscopic images. BMC Medical Imaging, 21(1):1–11, 2021.
[10] Sara Hosseinzadeh Kassani and Peyman Hosseinzadeh Kassani. A comparative study of deep learning architectures on melanoma detection. Tissue and Cell, 58:76–83.
[11] Yuexiang Li and Linlin Shen. Skin lesion analysis towards melanoma detection using deep learning network. Sensors, 18(2):556, 2018.
[12] Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In Proceedings of the 31st international conference on neural information processing systems, pages 4768–4777, 2017.
[13] Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, and Shin Ishii. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE transactions on pattern analysis and machine intelligence, 41(8):1979–1993, 2018.
[14] Seiya Murabayashi and Hitoshi Iyatomi. Towards explainable melanoma diagnosis: prediction of clinical indicators using semi-supervised and multi-task learning. In 2019 IEEE International Conference on Big Data (Big Data), pages 4853–4857. IEEE, 2019.
[15] Ahmad Naeem, Muhammad Shoaib Farooq, Adel Khelifi, and Adnan Abid. Malignant melanoma classification using deep learning: datasets, performance measurements, challenges and opportunities. IEEE Access, 8:110575–110597, 2020.
[16] Natasha Nigar, Muhammad Umar, Muhammad Kashif Shahzad, Shahid Islam, and Douhadji Abalo. A deep learning approach based on explainable artificial intelligence for skin lesion classification. IEEE Access, 2022.
[17] Hyeonwoo Noh, Seunghoon Hong, and Bohyung Han. Learning deconvolution network for semantic segmentation. In Proceedings of the IEEE international conference on computer vision, pages 1520–1528, 2015.
[18] Phillip E Pope, Soheil Kolouri, Mohammad Rostami, Charles E Martin, and Heiko Hoffmann. Explainability methods for graph convolutional neural networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10772–10781, 2019.
[19] J Premaladha and KS Ravichandran. Novel approaches for diagnosing melanoma skin lesions through supervised and deep learning algorithms. Journal of medical systems, 40(4):1–12, 2016.
[20] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. ” why should i trust you?” explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pages 1135–1144, 2016.
[21] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015.
[22] M. Rostami and A. Galstyan. Overcoming concept shift in domain-aware settings through consolidated internal distributions. In Thirty-Seventh AAAI Conference on Artificial Intelligence, 2023.
[23] M. Rostami, D. Huber, and T. Lu. A crowdsourcing triage algorithm for geopolitical event forecasting. In ACM RecSys conference, pages 377–381, 2018.
[24] M. Rostami, S. Kolouri, K. Kim, and E. Eaton. Multi-agent distributed lifelong learning for collective knowledge acquisition. In International Conference on Autonomous Agents and Multiagent Systems, pages 712–720, 2018.
[25] Wojciech Samek, Alexander Binder, Grégoire Montavon, Sebastian Lapuschkin, and Klaus-Robert Müller. Evaluating the visualization of what a deep neural network has learned. IEEE transactions on neural networks and learning systems, 28(11):2660–2673, 2016.
[26] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision, pages 618–626, 2017.
[27] Reuben R Shamir, Yuval Duchin, Jinyoung Kim, Guillermo Sapiro, and Noam Harel. Continuous dice coefficient: a method for evaluating probabilistic segmentations, 2019.
[28] Neeraj Sharma, Luca Saba, Narendra N Khanna, Mannudeep K Kalra, Mostafa M Fouda, and Jasjit S Suri. Segmentation-based classification deep learning model embedded with explainable ai for covid-19 detection in chest x-ray scans. Diagnostics, 12(9):2132, 2022.
[29] Mohammad Shorfuzzaman. An explainable stacked ensemble of deep learning models for improved melanoma skin cancer detection. Multimedia Systems, 28(4):1309–1323, 2022.
[30] Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning important features through propagating activation differences. In International conference on machine learning, pages 3145–3153. PMLR, 2017.
[31] Serban Stan and Mohammad Rostami. Secure domain adaptation with multiple sources. Transactions on Machine Learning Research.
[32] Serban Stan and Mohammad Rostami. Domain adaptation for the segmentation of confidential medical images. In Proceedings of the British Machine Vision Conference, 2022.
[33] Fabian Stieler, Fabian Rabe, and Bernhard Bauer. Towards domain-specific explainable ai: model interpretation of a skin image classifier using a human approach. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1802–1809, 2021.
[34] Nazneen N Sultana and Niladri B Puhan. Recent deep learning methods for melanoma detection: a review. In International Conference on Mathematics and Computing, pages 118–132. Springer, 2018.
[35] Philipp Tschandl, Cliff Rosendahl, and Harald Kittler. The HAM10000 dataset: A large collection of multi-source dermatoscopic images of common pigmented skin lesions. CoRR, abs/1803.10417, 2018.
[36] Philipp Tschandl, Cliff Rosendahl, and Harald Kittler. The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Sci. Data, 5(1):180161, Aug. 2018.
[37] Philipp Tschandl, Cliff Rosendahl, and Harald Kittler. The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Sci. Data, 5(1):180161, Aug. 2018.
[38] Jianming Zhang, Sarah Adel Bargal, Zhe Lin, Jonathan Brandt, Xiaohui Shen, and Stan Sclaroff. Top-down neural attention by excitation backprop. International Journal of Computer Vision, 126(10):1084–1102, 2018.
[39] Xiaoqing Zhang. Melanoma segmentation based on deep learning. Computer assisted surgery, 22(sup1):267–277, 2017.

Appendix A Implementation Details

We provide details of our experimental implementation.

Hardware and Optimizer:

The entire framework is implemented using PyTorch and trained on four NVIDIA RTX 2080Ti GPUs with 11GB of memory. We use ResNet50 with pretrained weights on ImageNet as the encoder for our architecture Bio-Unet. The Adam optimizer with one set of hyperparameters (lr = $1e^{-4}$ , weight decay = $1e^{-4}$ ) is used for all tasks during the training stage.

A.1 Optimization Implementation Details

Preprocessing:

We start by training the subnetwork $f_{CLR}(\cdot)$ . Dataset B serves as the input for $f_{CLR}(\cdot)$ . Each image is first resizes to $512\times 512$ , then the image is normalized to have a zero mean and unit variance. Finally, data augmentation is carried out to improve the model generalization. Operations for data augmentation include random rotation, cropping, brightness, contrast, saturation, and flipping. Each mini-batch contains one positive example and 22 negative examples, with the batch size set to 24. The temperature parameter $\tau$ is set to 0.5.

We then perform the classification task using all of dataset A’s data, each resized to $512\times 512$ , and normalize images with zero mean and unit variance without performing any data augmentation. Due to the large size of the images and memory cap, the batch size is set to 16.

Segmentation training:

Once the classification task is complete, we use Grad-CAM to create an attention heatmap with a bottleneck block and the desired target attribute index as input. We then use this method to replace the bottlenecks one at a time until every bottleneck has been tried. After gathering all heatmaps, we input those heatmaps into the $f_{Seg}(\cdot)$ subnetwork for training it as the reconstruction function. We used a batch size of 8.

After the network training is complete, we create heatmaps using Grad-CAM as $f_{Grad\_CAM}(b_{i},j)$ , where $b_{i}\in E_{1}(\cdot)$ , using the bottleneck block $b_{i}$ in the encoder $E_{1}(\cdot)$ loaded with the optimum checkpoint parameters and the index of the selected attribute as input. We would select the heatmap of the last block as the final localization mask for the biomarker indicators.

Explainable Artificial Intelligence Architecture for Melanoma Diagnosis Using Indicator Localization and Self-Supervised Learning

Abstract

1 Introduction

2 Related Work

Explainability in Deep Learning

Deep Learning for Melanoma Diagnosis

3 Problem Formulation

4 Proposed Algorithm

4.1 Overview of the network architecture

4.2 Bio-UNet Baseline Training

4.3 Self-Supervised Learning for Bio-UNet

5 Experimental Validation

5.1 Experimental Setup

Datasets

Baseline for Comparison:

Evaluation metrics

Implementation Details

5.2 Performance Results

5.3 Ablative Experiments

Experiments on the importance of the subnetworks:

Impact of repeating fS​e​g​(⋅)f_{Seg}(\cdot) within one epoch:

Five encoders instead of multi-task learning

6 Conclusions

References

Appendix A Implementation Details

Hardware and Optimizer:

A.1 Optimization Implementation Details

Preprocessing:

Segmentation training:

Impact of repeating $f_{Seg}(\cdot)$ within one epoch: