\jmlrvolume

222 \jmlryear2023 \jmlrworkshopACML 2023

TLMCM Network for Medical Image Hierarchical Multi-Label Classification

\NameMeng Wu^∗ \Email[email protected]
\NameSiyan Luo^∗ \Email[email protected]
\NameQiyu Wu^∗ \Email[email protected]
\addrCollege of Computing Georgia Tech \NameWenbin Ouyang \Email[email protected]
\addrRedmond WA US

Abstract

Medical Image Hierarchical Multi-Label Classification (MI-HMC) is of paramount importance in modern healthcare, presenting two significant challenges: data imbalance and hierarchy constraint. Existing solutions involve complex model architecture design or domain-specific preprocessing, demanding considerable expertise or effort in implementation. To address these limitations, this paper proposes Transfer Learning with Maximum Constraint Module (TLMCM) network for the MI-HMC task. The TLMCM network offers a novel approach to overcome the aforementioned challenges, outperforming existing methods based on the $AU\overline{(PRC)}$ (Area Under the Average Precision and Recall Curve) metric. In addition, this research proposes two novel accuracy metrics, $EMR$ (Exact Match Ratio) and $HammingAccuracy$ , which have not been extensively explored in the context of the MI-HMC task. Experimental results demonstrate that the TLMCM network achieves high multi-label prediction accuracy( $80\%$ - $90\%$ ) for MI-HMC tasks, making it a valuable contribution to healthcare domain applications.

keywords:

hierarchical multi-label classification; medical image; transfer learning

^†^†editors: Berrin Yanıkoğlu and Wray Buntine^*^*footnotetext: These authors contributed equally to this work.

1 Introduction

Hierarchical Multi-label Classification (HMC) is a classification task that involves hierarchically organized classes. In the domain of healthcare, the Medical Image Hierarchical Multi-label Classification (MI-HMC) is important for efficient image interpretation, retrieval, and diagnosis Cai et al. (2020); Kim et al. (2022). The MI-HMC problem naturally arises in the medical industry and academia, given that X-ray images Chen et al. (2018), and microscope images Dimitrovski et al. (2011) can incorporate tree-structured sub-categories. However, MI-HMC faces two key challenges: data imbalance and hierarchy constraint Giunchiglia and Lukasiewicz (2020). Existing solutions involve complex model architectures Wehrmann et al. (2018a); Noor et al. (2022) or domain-specific preprocessing Dimitrovski et al. (2012); Quan et al. (2013); Pelka et al. (2018).

In prior research, the emphasis has predominantly leaned towards generic solutions, often overlooking the specific intricacies of MI-HMC tasks. In our study, we introduce a novel approach, the Transfer Learning with Maximum Constraint Module (TLMCM) network, which squarely tackles the challenges inherent to the MI-HMC domain.

The TLMCM network combines a pretrained deep learning CNN model with a Maximum Constraint Module (MCM) as proposed by Giunchiglia and Lukasiewicz (2020). It effectively addresses the issue of data imbalance by harnessing the power of transfer learning techniques, which have previously demonstrated their efficacy on small image datasets. The MCM method we employ is meticulously designed to ensure the satisfaction of the ”hierarchy constraint” in multi-label prediction results, and it boasts a straightforward implementation. One of the key advantages of the TLMCM network is that it obviates the need for extensive image preprocessing or domain-specific knowledge for feature extraction prior to model training.

For generic HMC tasks, Area Under Precision-Recall Curve ( $AU\overline{(PRC)}$ ) is the typical evaluation metric Giunchiglia and Lukasiewicz (2020); Wehrmann et al. (2018a). In specific MI-HMC tasks, where each prediction follows a distinct path in a hierarchical label structure, we introduce two new accuracy metrics: $EMR$ and $HammingAccuracy$ Sorower (2010), for a comprehensive evaluation. We thoroughly assessed the TLMCM network using these three metrics on two MI-HMC tasks with X-ray image datasets (ImageCLEF09A and ImageCLEF09D) Thomas and B. (2009). Our experiments demonstrate the superior performance of the TLMCM network compared to the current state-of-the-art methods Giunchiglia and Lukasiewicz (2020). Moreover, it achieves exceptionally high accuracy in multi-label predictions across both tasks, highlighting the practical significance of this research.

The key contributions of this work are: 1) the proposal of the compact and highly effective TLMCM network, which adeptly addresses common MI-HMC challenges and outperforms state-of-the-art methods in our experimental tasks; 2) the introduction of two novel evaluation metrics, $EMR$ and $HammingAccuracy$ , facilitating intuitive accuracy assessment in the MI-HMC domain, an area where such metrics have been largely unexplored.

2 Related Work

Current Hierarchical Multi-Label Classification (HMC) methods use local, global, or hybrid approaches. Local methods build separate classifiers for each node, while global methods use one classifier for the full hierarchy. Hybrids combine both Wehrmann et al. (2018b). Recent work shows directly incorporating hierarchical information in model improves performance Giunchiglia and Lukasiewicz (2020). For medical images, Hierarchical Medical Image Classification uses stacked deep learning models Kowsari et al. (2020), and deep Hierarchical Multi-Label Classification targets chest x-ray diagnosis Chen et al. (2018). Other work focuses on specific diseases Gour and Khanna (2020) or body parts Hou et al. (2021).

3 Methodology

We integrate the Maximum Constraint Module(MCM) as proposed in Giunchiglia and Lukasiewicz (2020), with the transfer learning process of a pretrained Convolutionnal Neural Network model ResNet50 He et al. (2015) as the backbone. We make careful and essential adaptations to the architecture of ResNet50, for the purpose of addressing the unique challenges posed by the MI-HMC task, and further augmenting the model’s capabilities in handling hierarchical multi-label classification.

3.1 MCM Method

A generic HMC task must respect the hierarchy constraint, i.e. for each label that is predicted to be true, all the ancestor labels as pre-defined in the hierarchy structure must also be predicted to be true Giunchiglia and Lukasiewicz (2020). In this regard, we adopted the two key concepts proposed in Giunchiglia and Lukasiewicz (2020): Max Constraint Module (MCM) and Maximum Constraint Loss (MCLoss).

Formally, for a generic HMC task with a set $S$ of $n$ labels in total, given a label $A$ , let $D_{A}$ be the set of all labels which are the descendants of the label $A$ in the hierarchical structure, and a machine learning model $h$ predicts the label $A$ to be true with the probability of $h_{A}$ , we then impose the MCM module on top of the output of the model $h$ , such that the output of MCM for label $A$ is:

\operatorname{MCM}_{A}=\max_{B\in\mathcal{D}_{A}}\left(h_{B}\right)

(1)

In addition, we can also guide the training process by incorporating the MCM constraint into the loss function of the underlying model $h$ . Formally, let $y_{A}$ and $y_{B}$ be the ground truth value of label $A$ and $B$ , the loss for label A is

\begin{split}\operatorname{MCLoss}_{A}=&-y_{A}\ln\left(\max_{B\in\mathcal{D}_{A}}\left(y_{B}h_{B}\right)\right)\\ &-\left(1-y_{A}\right)\ln\left(1-\mathbf{MCM}_{A}\right)\end{split}

(2)

Then the final loss function is defined as

\mathrm{MCLoss}=\sum_{A\in\mathcal{S}}\mathrm{MCLoss}_{A}

(3)

According to Giunchiglia and Lukasiewicz (2020), the novel $\mathrm{MCLoss}$ function could bring benefits to the gradient backpropagation to achieve a lower loss than the standard binary cross-entropy loss function.

3.2 Transfer Learning

For the backbone model, we retain the ResNet50 convolution blocks unchanged and focus on adapting the last fully-connected linear layer. In particular, we replace the original last layer with two fully-connected linear layers with a ReLU activation in between and a Sigmoid after. The hidden dimension is set to 256, and the output dimension is equal to the total number of labels. The Sigmoid layer converts the output scores into probabilities, as required by the MCM module.

The modified ResNet50 output is then fed into the MCM to predict the probability of truth for each label. We use the adapted loss function, $\mathrm{MCLoss}$ , as defined in Equation (3). The output are the probabilities for each label to be true. In the specific task of MI-HMC, since the label structure is mono-hierarchical(fully explained in the dataset description section 4.1), we just make our prediction that all labels corresponding to the maximum probability of the MCM output are predicted to be true. The whole architecture design is shown in Figure 1.

There are two common approaches to transfer learning: (1)freezing the convolution module as the fixed feature extractor, and only training the linear classifier(ResNet50 as Fix Feature Extractor, hereafter referred to as RNFFE); (2) fine-tuning the convolution module with the pre-trained weights of the convolution module as well as the linear classifier module(ResNet50 Fine-Tuning, hereafter referred to as RNFT). In the experiment section 4.3, we will explore whether RNFFE or RNFT is more suitable for the MI-HMC task.

Refer to caption — Figure 1: TLMCM network architecture. The predicted labels are represented as dark blue solid circles, while the other non-predicted labels are shown as hollow circles.

4 Experiments

4.1 Datasets

The main dataset that we used for this research is the 2009 ImageCLEF edition of the IRMA X-ray dataset Thomas and B. (2009). Each image in the dataset was classified based on the IRMA code Lehmann et al. (2003). A classification code may consist of three or four digits, representing a mono-hierarchical classification structure for the corresponding medical image. A mono-hierarchical label means that the classification hierarchy is a tree structure, thus each child node can only have one parent node.

In this study, our focus was directed toward evaluating the effectiveness of our model on two specific classification codes: anatomical (A) of 110 labels, and directional (D) of 36 labels. The dataset consists of 14410 images in total, and we split it into the train/validation/test set with the ratio of 70:15:15. This dataset choice of the (A) and (D) codes also allowed for a direct comparison with the baseline results^*^**The ImageCLEF07 datasets from their research are an older version no longer available. However, the ImageCLEF09 datasets share the same chracteristics..

4.2 Evaluation Metrics

Area Under the average Precision and Recall Curve( $AU\overline{(PRC)}$ ) Giunchiglia and Lukasiewicz (2020); Sorower (2010) is most commonly used in multi-label classification and tasks, and is also the metric that we used to compare with the state-of-the-art baseline results. Additionally, Exact Match Ratio( $EMR$ ) Sorower (2010) is a strict accuracy metric in that the prediction is considered correct only when the set of labels of prediction exactly matches the corresponding set of labels of ground truth. Formally, the $EMR$ is computed as follows:

EMR=\frac{1}{n}\sum_{i=1}^{n}I\left(Y_{i}=Z_{i}\right)

(4)

where, $I$ is the indicator function, $n$ is total number of all labels, $Y_{i}$ is the set of ground true labels for sample $i$ , and $Z_{i}$ is the set of predicted labels for sample $i$ .

From the application perspective of the MI-HMC task, if a prediction is partially correct, it still deserves some credit. In this respect, we utilize the multi-label classification $HammingAccuracy$ Sorower (2010), which is defined as the following with the same meaning as all notations defined in Eq. 4:

HammingAccuracy=\frac{1}{n}\sum_{i=1}^{n}\frac{\left|Y_{i}\cap Z_{i}\right|}{\left|Y_{i}\cup Z_{i}\right|}

(5)

4.3 Experiment setup

We performed experiments with both RNFFE and RNFT on the MI-HMC tasks of ImageCLEF09A and ImageCLEF09D. We maintained the same learning rate of $5e-6$ , weight decay of $1e-6$ , batch size of 32 and the total number of epoches of 120 for each task. The ReLU activation was used for the linear classifier and no dropout was applied. Then we trained the model with the Adam optimizer.

The training was completed on the Google Cloud Platform(GCP) virtual machine. We used the regular configuration of the “n1-standard-4” machine type, and 1 NVIDIA T4 GPU. We finished 4 training sessions(2 tasks, 2 approaches) in 10 hours, which is reasonably computationally efficient. The code is at https://github.com/flowing-time/IMAGE-HMC.

4.4 Results and discussions

The $AU\overline{(PRC)}$ , $EMR$ , and $hammingAccuracy$ for the tasks of ImageCLEF09A and ImageCLEF09D are summarized in Table 1. On both tasks, the transfer learning approach RNFT shows a significantly higher $AU\overline{(PRC)}$ score than the baseline results of C-HMCNN(h) in Giunchiglia and Lukasiewicz (2020) and some other recent models with good performance(e.g. Wehrmann et al. (2018a), Pelka et al. (2018)).

Task	Metric	RNFFE	RNFT	Giunchiglia (2020)	Wehrmann (2018)	Pelka (2018)
	$AU\overline{(PRC)}$	0.858	0.984	0.956	0.950	N/A
ImageCLEF09A	$EMR$	0.449	0.789	N/A	N/A	0.603
	$Hamming$	0.642	0.890	N/A	N/A	N/A
	$AU\overline{(PRC)}$	0.928	0.984	0.927	0.920	N/A
ImageCLEF09D	$EMR$	0.451	0.880	N/A	N/A	0.791
	$Hamming$	0.636	0.917	N/A	N/A	N/A

Table 1: Test datasets result in summarization. The best results are in bold.

To the best of our knowledge, the $EMR$ and $hammingAccuracy$ metrics have not been commonly reported on the previous HMC tasks. Nevertheless, our TLMCM network shows 80% - 90% prediction accuracy on the MI-HMC task and demonstrated its great value in real-world healthcare applications.

For a better understanding of fine-turning approaches in our methodology, we plotted the learning curves of both approaches for the ImageCLEF09A task in Figure 2. The logarithmic scale is used for the loss(1st colummn), while linear scale is used for the metrics(2nd, 3rd, 4th column). It is clearly shown that significantly lower training and validation loss can be achieved by the RNFT approach than RNFFE. In the 2nd column, we can see RNFT quickly attains the plateau of the highest $AU\overline{(PRC)}$ score, for both the training and validation set. The 3rd and 4th columns show the prediction accuracy( $EMR$ and $HammingAccuracy$ ) over the training iterations and the RNFT approach is again the winner.

Overall, the experiments show that for the MI-HMC task, our TLMCM network is highly effective, and the transfer learning approach of fine-tuning with pretrained weights(RNFT) should be the prioritized choice, as it does not only perform best in all three metrics but also achieves the best result with the fewest training iterations.

5 Conclusion

We proposed a novel TLMCM network for hierarchical multi-label classification on radiological medical images. To the best of our knowledge, TLMCM network is the first method to incorporate the Maximum Constraint Module (MCM) approach and transfer learning in medical image classification, eliminating the requirement for domain-specific knowledge in medical image pre-training. Our architecture outperformed current state-of-the-art models on the benchmark tasks of ImageCLEF09A, ImageCLEF09D. Furthermore, we introduced the metrics of $EMR$ and $HammingAccuracy$ for evaluating the performance of TLMCM network in the context of the MI-HMC problem. As part of our future work, our proposed architecture can be extended to larger medical image datasets that have deeper levels of hierarchy. Additionally, we plan to evaluate the TLMCM network on HMC image datasets in other domains such as object recognition, fashion and clothing, extending its application beyond radiological medical images to explore its potential in diverse fields.

References

Cai et al. (2020) Lei Cai, Jingyang Gao, and Di Zhao. A review of the application of deep learning in medical image classification and segmentation. Annals of Translational Medicine, 8, 2020.
Chen et al. (2018) Haomin Chen, Shun Miao, Daguang Xu, Gregory Hager, and Adam P. Harrison. Deep hierarchical multi-label classification of chest x-ray images. In International Conference on Medical Imaging with Deep Learning, 2018.
Dimitrovski et al. (2011) Ivica Dimitrovski, Dragi Kocev, and Suzana Loskovska. Hierarchical annotation of medical images. Journal of Foo, 44(10):2436–2449, 2011.
Dimitrovski et al. (2012) Ivica Dimitrovski, Dragi Kocev, Suzana Loskovska, and Saeroski. Hierarchical classification of diatom images using ensembles of predictive clustering trees. Ecol. Informatics, 7:19–29, 2012.
Giunchiglia and Lukasiewicz (2020) Eleonora Giunchiglia and Thomas Lukasiewicz. Coherent hierarchical multi-label classification networks. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS’20, Red Hook, NY, USA, 2020.
Gour and Khanna (2020) Neha Gour and Pritee Khanna. Multi-class multi-label ophthalmological disease detection using transfer learning based convolutional neural network. Biomedical Signal Processing and Control, 66:102329, 2020.
He et al. (2015) Kaiming He, X. Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2015.
Hou et al. (2021) Benjamin Hou, G. Kaissis, Ronald M. Summers, and Bernhard Kainz. Ratchet: Medical transformer for chest x-ray diagnosis and reporting. In International Conference on Medical Image Computing and Computer-Assisted Intervention, 2021.
Kim et al. (2022) Hee E. Kim, Alejandro Cosa-Linan, Nandhini Santhanam, Mahboubeh Jannesari, Máté Elod Maros, and Thomas Ganslandt. Transfer learning for medical image classification: a literature review. BMC Medical Imaging, 22, 2022.
Kowsari et al. (2020) Kamran Kowsari, Rasoul Sali, Lubaina Ehsan, William Adorno, Asad Ali, Sean R. Moore, Beatrice C. Amadi, Paul Kelly, Sana Syed, and Donald Brown. Hmic: Hierarchical medical image classification, a deep learning approach. Information (Basel), 11, 2020.
Lehmann et al. (2003) Thomas Martin Lehmann, Henning Schubert, Daniel Keysers, Michael Kohnen, and Berthold B. Wein. The irma code for unique classification of medical images. In SPIE Medical Imaging, 2003.
Noor et al. (2022) Khondaker Tasrif Noor, Antonio Robles-Kelly, and Brano Kusy. A capsule network for hierarchical multi-label image classification. In International Workshop on Structural and Syntactic Pattern Recognition, 2022.
Pelka et al. (2018) Obioma Pelka, Felix Nensa, and Christoph M. Friedrich. Annotation of enhanced radiographs for medical image retrieval with deep convolutional neural networks. PLOS One, 13(11), 2018.
Quan et al. (2013) Zou Quan, Chen Weicheng, Huang Yong, Liu Xiangrong, and Jiang Yi. Identifying multi-functional enzyme by hierarchical multi-label classifier. Journal of Computational and Theoretical Nanoscience, 10(4):1038-1043, 2013.
Sorower (2010) Mohammad S Sorower. A literature survey on algorithms for multi-label learning, 2010.
Thomas and B. (2009) Deserno Thomas and Ott B. 15.363 irma bilder in 193 kategorien für imageclefmed 2009 = 15,363 irma images of 193 categories for imageclefmed 2009. Available at https://publications.rwth-aachen.de/record/667225, 2009.
Wehrmann et al. (2018a) Jonatas Wehrmann, Ricardo Cerri, and Rodrigo Barros. Hierarchical multi-label classification networks. In International conference on machine learning, pages 5075–5084. PMLR, 2018a.
Wehrmann et al. (2018b) Jonatas Wehrmann, Ricardo Cerri, and Rodrigo Barros. Hierarchical multi-label classification networks. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 5075–5084. PMLR, 10–15 Jul 2018b. URL https://proceedings.mlr.press/v80/wehrmann18a.html.