¹ Ping An Technology, Shanghai, China
² Northwestern Polytechnical University, Xi'an, China
³ PAII Inc., Bethesda, Maryland, USA
⁴ Vanderbilt University, Nashville, Tennessee, USA
⁵ Eastern Hepatobiliary Surgery Hospital, Shanghai, China

Liver Tumor Localization and Characterization from Multi-Phase MR Volumes Using Key-Slice Parsing: A Physician-Inspired Approach

Bolin Lai ¹⋆, Yuhsuan Wu ¹⋆, Xiaoyu Bai ¹,²⋆, Xiao-Yun Zhou ³, Peng Wang ⁵, Jinzheng Cai ³, Yuankai Huo ⁴, Lingyun Huang ¹, Yong Xia ², Jing Xiao ¹, Le Lu ³, Heping Hu ⁵, Adam P. Harrison ³

⋆ B. Lai, Y. Wu and X. Bai: equal contribution. This work was done when X. Bai was an intern at Ping An Technology.

Abstract

Using radiological scans to identify liver tumors is crucial for proper patient treatment. This is highly challenging, as top radiologists only achieve F1 scores of roughly $80\%$ (hepatocellular carcinoma (HCC) vs. others) with only moderate inter-rater agreement, even when using multi-phase magnetic resonance (MR) imagery. Thus, there is great impetus for computer-aided diagnosis (CAD) solutions. A critical challenge is to robustly parse a 3D MR volume to localize diagnosable regions of interest (ROI), especially for edge cases. In this paper, we break down this problem using a key-slice parser (KSP), which emulates physician workflows by first identifying key slices and then localizing their corresponding key ROIs. To achieve robustness, the KSP also uses curve-parsing and detection confidence re-weighting. We evaluate our approach on the largest multi-phase MR liver lesion test dataset to date ($430$ biopsy-confirmed patients). Experiments demonstrate that our KSP can localize diagnosable ROIs with high reliability: $87\%$ of patients have an average 3D overlap of $\geq 40\%$ with the ground truth, compared to only $79\%$ using the best tested detector. When coupled with a classifier, we achieve an HCC vs. others F1 score of $0.801$, providing fully-automated CAD performance comparable to top human physicians.

Keywords: Liver · Tumor localization · Tumor characterization

1 Introduction

Liver cancer is the fifth/eighth most common malignancy in men/women worldwide [4]. During treatment planning, non-invasive diagnostic imaging is preferred, as invasive procedures, i.e., biopsies or surgeries, can lead to hemorrhages, infections, and even death [13]. Multi-phase magnetic resonance (MR) imagery is considered the most informative radiological option [20], with T2-weighted imaging (T2WI) able to reveal tumor edges and aggressiveness [2]. Manual lesion differentiation is workload-heavy and is ideally executed by highly experienced radiologists, who are not always available in every medical center. Studies on human reader performance, which focus on differentiating hepatocellular carcinoma (HCC) from other types, report low specificities [2] and moderate inter-rater agreement [9]. Thus, there is a need for computer-aided diagnosis (CAD) solutions, which is the topic of our work. Unlike other approaches, we propose a physician-inspired workflow to achieve greater reliability and robustness.

A major motivation for CAD is addressing challenging cases that would otherwise be biopsied or even incorrectly operated on. For instance, a 2006 retrospective study discovered that pre-operative imaging misinterpreted $20\%$ of its liver transplant patients as having HCC [10]. While several CAD approaches have been reported, many do not focus on histopathologically-confirmed studies [29, 1, 7, 25], which are the cases most requiring CAD intervention. Prior CAD studies, except for Zhen et al. [32], also only focus on computed tomography (CT), despite the greater promise of MR. Most importantly, apart from two studies [6, 18], CAD works typically assume a manually drawn region of interest (ROI) is available. In doing so, they elide the major challenge of parsing a medical volume to determine diagnosable ROIs. Without this capability, manual intervention remains necessary and the system also remains susceptible to inter-user variations. The most obvious localization strategy, e.g., that of [18], would follow computer vision practices and directly apply a detector. However, detectors aim to find all lesions and their entire 3D extent in a study, whereas the needs for liver lesion characterization are distinct: reliably localize one or more key diagnosable ROI(s). This different goal warrants its own study, which we investigate.

Figure 1: (a) HCC, ICC and metastasis lesion examples on T2WIs, including large, medium, small and low contrast tumors. The “small” metastasis shows an example of a lesion cluster. (b) Different MR sequences of the same patient.

In this work, we develop a robust and fully-automated CAD system to differentiate malignant liver tumors into the HCC, ICC and metastasis subtypes. Fig. 1(a) depicts these three types. To localize key diagnosable ROIs, we use a physician-inspired approach that departs from standard detection frameworks seen in computer vision and used elsewhere [18]. Instead, we propose a key-slice parser (KSP), which breaks down the parsing problem similarly to clinical practice, i.e., first robustly identifying and ranking key slices in the volume and, from each of these, regressing a single diagnosable ROI. This follows, at least in spirit, protocols like the ubiquitous response evaluation criteria in solid tumors (RECIST) [8]. In concrete terms, the KSP comprises multi-sequence classification, detection, and curve parsing. Once localized, each ROI is classified using a standard classifier.

We test our approach on $430$ multi-phase MR studies ($2150$ scans), which is the largest test cohort studied for liver lesion CAD to date. Moreover, all of our patient studies are histopathologically confirmed, well-representing the challenging cases requiring CAD. Using our KSP framework, we achieve very high reliability, with $87\%$ of our predicted ROIs overlapping with the ground truth by $\geq 40\%$, outperforming the best detector alternative (only $79\%$ with an overlap $\geq 40\%$).

2 Methods

2.1 Overview

Fig. 2(a) illustrates our approach, which comprises a key-slice parser (KSP) and a liver lesion classifier. As illustrated and defined by Fig. 1(b), we assume we are given a dataset of MR volumes with five sequences/phases. Formally, assuming $\rm N$ studies, we define our dataset as $\mathcal{D}=\{\mathcal{X}_{i},\,\mathcal{B}_{i},\,y_{i}\}_{i=1}^{\rm N}$, where $\mathcal{X}_{i}=\{X_{i,j}\}_{j=1}^{5}$ are the MR sequences, $\mathcal{B}_{i}$ are the lesion bounding-box annotations, and $y_{i}$ is a study-level lesion-type label. Lesions are either (1) annotated individually by 2D bounding boxes (bboxes) using RECIST-style marks [8] or (2), when they are too numerous to be individually annotated, covered by a bbox over each cluster. See Fig. 1(a) for an illustration of the two types. Given the extreme care and multiple readers needed for lesion masks [3], bbox labels are much more practical to generate. We use $m$ to represent individual slices, e.g., $X_{i,j,m}$, which also selects any corresponding bboxes, $\mathcal{B}_{i,m}$. From the bboxes, we can also define slices as being “key”, “marginal”, and “non-key”. Marginal slices are slices within a buffer of one slice from the beginning or end of any lesion; see our supplementary for examples. We will drop the $i$ when appropriate.

2.2 Key-Slice Parser

Because any popular classifier can be used for lesion classification, our methodological focus is on the KSP. Illustrated in Fig. 2(b), the KSP decomposes localization into the simpler problems of key-slice ranking followed by key ROI regression.

Slice ranking identifies whether each MR slice is a key slice or not. Any state-of-the-art classifier can be used, trained on “key” and “non-key” slices, with “marginal” ones ignored. However, care must be taken to handle multi-sequence MR data. In short, the MR sequences or phases where lesions are visible vary, somewhat unpredictably, from lesion to lesion. Early fusion (EF) models, i.e., inputting a five-channel slice, can be susceptible to overfitting to the specific sequence behavior seen in the training set. Examining each MR sequence more independently mitigates this risk. Thus, we perform late fusion (LF). More specifically, because T2WI is the most informative sequence for liver tumors [2], we use T2WI as an anchor (T2-anchor) and pair each of the remaining sequences with it (and also one T2WI-only sequence), training a separate model for each. Unlike standard LF, the T2-anchor LF approach does boost performance over EF; see our ablation study in the supplementary. Under the T2-anchor LF approach, we obtain confidence scores for each of the five T2-anchor sequences, $j$, and for each slice, $m$: $s^{\mathrm{cls}}_{j,m}$. We average the confidence score across all sequences to compute a slice-wise classification confidence:

$$s^{\mathrm{cls}}_{m}=\frac{1}{5}\sum_{j=1}^{5}s^{\mathrm{cls}}_{j,m}\,. \tag{1}$$

Our decomposition strategy means that selected key slices should contain at least one lesion. We take advantage of this prior knowledge to regress a single key ROI from each prospective key slice. To do this, we train any state-of-the-art detector on each T2-anchor sequence, producing a set of bbox confidence values and locations for each slice and sequence, $\mathcal{S}_{j,m}^{\mathrm{det}}$ and $\hat{\mathcal{B}}_{j,m}$, respectively, and we group the outputs across all sequences together: $\mathcal{S}_{m}^{\mathrm{det}},\,\hat{\mathcal{B}}_{m}=\{\mathcal{S}_{j,m}^{\mathrm{det}},\,\hat{\mathcal{B}}_{j,m}\}_{j=1}^{5}$. To produce a single ROI, we use a voting scheme where, for every possible pixel location, we sum up the detection confidences, $s^{\mathrm{det}}_{m,k}\in\mathcal{S}_{m}^{\mathrm{det}}$, of any bbox in $\hat{\mathcal{B}}_{m}$ that overlaps with it. We then choose the pixel location with the highest detection confidence sum. For all the bboxes that overlap the chosen pixel location, we take their mean location and size as the final slice-wise ROI. Importantly, we filter out low-confidence ROIs using a threshold $t$, which is determined by examining the overlap of resulting slice-wise ROIs with ground truth bboxes in validation. We choose the $t$ that provides the best empirical cumulative distribution function of overlaps, and thus the best balance between false negatives and false positives.
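The voting scheme can be illustrated with a minimal sketch. It assumes integer pixel boxes given as `(x1, y1, x2, y2)` with exclusive upper bounds, breaks ties by taking the first maximal pixel in raster order, and leaves out the threshold $t$; the function name `vote_single_roi` and these conventions are illustrative assumptions, not details taken from the paper.

```python
def vote_single_roi(boxes, confidences, height, width):
    """Fuse the candidate boxes of one slice into a single ROI by pixel voting.

    boxes: list of integer (x1, y1, x2, y2) tuples, x2/y2 exclusive (assumed).
    confidences: one detection confidence per box.
    Returns (roi, vote_sum); roi is None when no box covers any pixel.
    """
    # Each box adds its confidence to every pixel it covers.
    votes = [[0.0] * width for _ in range(height)]
    for (x1, y1, x2, y2), conf in zip(boxes, confidences):
        for y in range(max(0, y1), min(height, y2)):
            row = votes[y]
            for x in range(max(0, x1), min(width, x2)):
                row[x] += conf

    # Pick the pixel with the largest confidence sum (first one on ties).
    best_sum, best_xy = 0.0, None
    for y in range(height):
        for x in range(width):
            if votes[y][x] > best_sum:
                best_sum, best_xy = votes[y][x], (x, y)
    if best_xy is None:
        return None, 0.0

    # Average the boxes covering that pixel. Averaging corners is equivalent
    # to averaging box centers and sizes, since both are linear in the corners.
    px, py = best_xy
    winners = [b for b in boxes if b[0] <= px < b[2] and b[1] <= py < b[3]]
    roi = tuple(sum(b[i] for b in winners) / len(winners) for i in range(4))
    return roi, best_sum


# Two overlapping candidates outvote one isolated, more confident candidate.
roi, vote = vote_single_roi(
    [(2, 2, 8, 8), (4, 4, 10, 10), (20, 20, 24, 24)], [0.6, 0.5, 0.9], 32, 32
)
```

In this toy case the two overlapping boxes win because their summed confidence exceeds that of the single box, which is the redundancy the voting step is meant to exploit.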

The final step is to rank slices and their corresponding ROIs, as illustrated in Fig. 2(c). We start by producing a confidence curve across all slices using $s^{\mathrm{cls}}_{m}$. From this curve we identify peaks, which ideally should each correspond to the presence of a true lesion. Each peak defines a key-slice zone, which is the adjoining region where confidence values are within $1/2$ of the “peak”. Only key slices in key-slice zones will be ranked and selected, and we only admit slices that contain at least one bbox with a confidence score $>t$.
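One way to read the curve-parsing step is sketched below. It treats a peak as a simple local maximum and interprets "within $1/2$ of the peak" as a confidence of at least half the peak value; both choices, the `ratio` parameter and the function name `key_slice_zones` are assumptions made for illustration, and no minimum peak height or smoothing is applied.

```python
def key_slice_zones(curve, ratio=0.5):
    """Return the sorted slice indices that fall inside any key-slice zone.

    curve: per-slice classification confidences, ordered by slice index.
    ratio: fraction of the peak value a neighbor must reach to stay in the
           zone (0.5 is an assumed reading of the text).
    """
    n = len(curve)
    in_zone = [False] * n
    for m in range(n):
        left = curve[m - 1] if m > 0 else float("-inf")
        right = curve[m + 1] if m < n - 1 else float("-inf")
        if not (curve[m] > left and curve[m] >= right):
            continue  # slice m is not a local maximum

        # Grow the zone outward while neighbors stay above the floor.
        floor = ratio * curve[m]
        lo = m
        while lo > 0 and curve[lo - 1] >= floor:
            lo -= 1
        hi = m
        while hi < n - 1 and curve[hi + 1] >= floor:
            hi += 1
        for k in range(lo, hi + 1):
            in_zone[k] = True
    return [m for m in range(n) if in_zone[m]]


# A dominant peak at slice 3 and a weak isolated peak at slice 7.
zones = key_slice_zones([0.05, 0.1, 0.5, 0.9, 0.6, 0.2, 0.05, 0.3, 0.1])
```

The weak peak at slice 7 still forms its own one-slice zone here, which is why the text additionally requires a bbox with confidence above $t$ before a slice is admitted.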

Since good performance relies on selecting the correct slices, we use detection to build in redundancy and to better rank slices in the key-slice zone. Specifically, from all bbox confidences in a slice, $\mathcal{S}_{m}^{\mathrm{det}}$ , we compute a slice-wise confidence by choosing the maximum bbox confidence. We bias these confidences toward larger bboxes, based on the assumption that they are more diagnosable:

$$s^{\mathrm{det}}_{m}=\max\left(\left\{(a_{m,k}+s^{\mathrm{det}}_{m,k})/2\right\}_{k=1}^{\hat{\mathrm{K}}_{m}}\right), \tag{2}$$

where $k$ indexes the $\hat{\mathrm{K}}_{m}$ predicted bboxes and confidences in $\mathcal{S}_{m}^{\mathrm{det}}$ and $\hat{\mathcal{B}}_{m}$, and $a_{m,k}\in(0,1]$ is the normalized bbox area across all slices. Next, we combine classification and detection scores:

$$s^{\mathrm{cls+det}}_{m}=(s^{\mathrm{cls}}_{m}+s^{\mathrm{det}}_{m})/2\,. \tag{3}$$

We rank slices using (3) and select the top $T\%$ of slices. We choose $T$ by examining the distribution of within-study precisions and recalls across all validation studies. We choose the $T$ giving an across-study average recall $\geq 0.5$ and a first quartile (Q1) precision of $\geq 0.6$ (see Results). This strategy is applied for all evaluated detectors. The corresponding slice-wise ROIs comprise the key ROIs.
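Eqs. (2) and (3) and the top-$T\%$ selection can be sketched as follows. The sketch assumes that box areas are already normalized to $(0,1]$, that a slice without boxes receives a detection score of 0, and that the number of kept slices is rounded up; these conventions and all function names are illustrative assumptions rather than the authors' implementation.

```python
import math


def slice_det_confidence(box_confs, box_areas):
    """Eq. (2): largest mean of normalized area and confidence over the boxes."""
    if not box_confs:
        return 0.0  # assumed score for a slice with no predicted box
    return max((a + s) / 2 for a, s in zip(box_areas, box_confs))


def combined_confidence(s_cls, s_det):
    """Eq. (3): average of the classification and detection confidences."""
    return (s_cls + s_det) / 2


def select_top_slices(scores, top_percent):
    """Keep the top T% of slices by combined score.

    scores: dict mapping slice index -> combined confidence.
    Rounds the count up and keeps at least one slice (assumed conventions).
    """
    n_keep = max(1, math.ceil(len(scores) * top_percent / 100))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n_keep]


# A large, moderately confident box outranks a small, confident one.
s_det = slice_det_confidence(box_confs=[0.9, 0.6], box_areas=[0.2, 1.0])
s_both = combined_confidence(0.7, s_det)
kept = select_top_slices({3: s_both, 4: 0.9, 5: 0.4, 6: 0.1}, top_percent=50)
```

Averaging area and confidence in Eq. (2) is what lets the larger box win in this example, matching the stated bias toward larger, more diagnosable boxes.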

3 Results

3.0.1 Setup

We collected $430$ multi-phase multi-sequence MR studies ($2150$ volumes) from Anonymized Hospital. The selection criterion was any patient who had surgical resection or biopsy in the period between 2006 and 2019 where T1WI, T2WI, T1WI-V, T1WI-A, and DWI sequences are available. Lesion distribution was $207$, $113$, and $110$ patients with HCC, ICC, and metastasis, respectively. The data was then split patient-wise using five-fold cross validation, with $70\%$, $10\%$, and $20\%$ used for training, validation, and testing, respectively. Data splitting was executed on HCC, ICC and metastasis independently to avoid imbalanced distributions. RECIST marks were labeled on each slice, under the supervision of a hepatic physician with $>10$ years of experience. From there, a bbox was generated. For clusters of lesions too numerous to individually mark, a bbox over each cluster was drawn, as shown in Fig. 1(a). As the key-slice classifier we use DenseNet121 [17]. As detectors we evaluated three KSP options: CenterNet [33] has been used to achieve state-of-the-art results in DeepLesion [5]; ATSS achieves the best performance on COCO [31]; and 3DCE is a powerful detector specifically designed for lesion localization [27]. All detectors were trained using the T2-anchor LF approach. Standard preprocessing and hyper-parameter settings were used for all modules, which are outlined in the supplementary.

3.0.2 Key-Slice Selection

We measure the impact of detection-based reweighting and curve parsing (CP). To do this, we rank key slices based on a) directly using detection output, i.e., $s^{\rm{det}}_{m}$, b) directly using classification output, i.e., $s^{\rm{cls}}_{m}$, c) using classification and detection confidences, i.e., Eq. (3), and d) including CP in key-slice selection. As metrics, we select the top $T\%$ of ranked slices across all studies. For each choice of $T$, we calculate the within-study precision and recall, giving us a distribution of precisions and recalls across studies. Thus, we graph the corresponding average recall (across patients) along with the median, first quartile (Q1) and third quartile (Q3) precision, providing typical, lower-, and upper-bound performances. In Fig. 3, detection-based re-weighting significantly outperforms detection and classification, boosting the Q1 and median precision, respectively. CP provides additional boosts in precision and recall, with notable boosts in lower-bound performance (robustness). To choose $T$ we select the value corresponding to an average recall $\geq 0.5$ and a Q1 precision $\geq 0.6$, which balances between finding all slices while keeping good precision. For CenterNet, ATSS, and 3DCE this corresponds to keeping $48\%$, $54\%$, and $50\%$ of the top slices, respectively.
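The within-study metrics described above can be computed along the following lines. This is a sketch under assumptions: the set of "true" key slices is passed in directly (how marginal slices are counted is not specified here), quartiles use the inclusive interpolation of Python's `statistics.quantiles`, and the function names are hypothetical.

```python
from statistics import mean, quantiles


def study_precision_recall(selected, key_slices):
    """Precision and recall of the selected slices within one study."""
    selected, key_slices = set(selected), set(key_slices)
    hits = len(selected & key_slices)
    precision = hits / len(selected) if selected else 0.0
    recall = hits / len(key_slices) if key_slices else 0.0
    return precision, recall


def summarize_studies(per_study):
    """Aggregate (precision, recall) pairs, one per study, across studies."""
    precisions = [p for p, _ in per_study]
    recalls = [r for _, r in per_study]
    q1, median, q3 = quantiles(precisions, n=4, method="inclusive")
    return {"avg_recall": mean(recalls), "q1": q1, "median": median, "q3": q3}


def meets_selection_rule(summary, min_recall=0.5, min_q1_precision=0.6):
    """Check the rule used to choose T: average recall and Q1 precision."""
    return (summary["avg_recall"] >= min_recall
            and summary["q1"] >= min_q1_precision)


# Five toy studies; in practice there is one pair per validation study.
pairs = [(0.5, 0.6), (1.0, 0.4), (0.75, 0.8), (0.25, 0.5), (1.0, 0.7)]
summary = summarize_studies(pairs)
```

Sweeping $T$ and keeping a value for which `meets_selection_rule` holds would mirror the selection procedure described in the text.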

3.0.3 Localization

Unlike standard detection setups, for CAD we are not interested in free response operating characteristics with arbitrary overlap cutoffs. Instead, we are only interested in whether we can select high-quality ROIs. Thus, we measure the overlap of selected ROIs against any ground truth bbox using the intersection over union (IoU). When ground truth bboxes are drawn over lesion clusters, we use the intersection over bounding box (IoBB) [24] as an IoU proxy. For each patient, we examine the average overlap across all selected ROIs and also the worst case, i.e., lower bound (LB) overlap. We then directly observe the empirical cumulative distribution function (CDF) of these overlaps across all patients. We evaluate whether the KSP can enhance the performance of the three tested detectors, which would otherwise directly output key ROIs according to their bbox confidence scores.
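The overlap measures can be sketched as below. IoU is standard; IoBB is taken here as the intersection divided by the area of the predicted box, which is our reading of [24] and should be treated as an assumption, as should the `(x1, y1, x2, y2)` box convention and the function names.

```python
def box_area(box):
    """Area of an (x1, y1, x2, y2) box; degenerate boxes have zero area."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def intersection_area(a, b):
    """Area shared by two axis-aligned boxes."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(0.0, w) * max(0.0, h)


def iou(pred, gt):
    """Intersection over union, used for individually annotated lesions."""
    inter = intersection_area(pred, gt)
    union = box_area(pred) + box_area(gt) - inter
    return inter / union if union > 0 else 0.0


def iobb(pred, gt_cluster):
    """Intersection over the predicted box, used for cluster annotations."""
    area = box_area(pred)
    return intersection_area(pred, gt_cluster) / area if area > 0 else 0.0


def patient_overlaps(overlaps):
    """Average and worst-case (lower-bound) overlap over a patient's ROIs."""
    return sum(overlaps) / len(overlaps), min(overlaps)


# A small ROI fully inside a cluster box scores 1.0 under IoBB, whereas IoU
# would penalize it for not covering the whole cluster.
avg, lower_bound = patient_overlaps([
    iou((0, 0, 10, 10), (5, 0, 15, 10)),
    iobb((2, 2, 6, 6), (0, 0, 20, 20)),
])
```

Using IoBB for clusters avoids penalizing an ROI that isolates one lesion inside a cluster-level box, which is consistent with its role as an IoU proxy in the text.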

From Fig. 4, the improvements provided by KSP are apparent on all detectors. When examining the mean overlap for each patient, the percentage of patients with low overlap ($\leq 25\%$ IoU-IoBB) is decreased by $0.5\%\sim 3.3\%$. Much more significantly, the LB performance indicates that KSP results in roughly $13\%$ fewer patients with zero overlap and $12\%$ fewer patients with low overlap. Thus, the KSP better ensures that no poor ROIs get selected and passed on to classification. It should be noted that the LB metrics directly measure robustness, which is the main motivating reason for the KSP framework. Hence, the corresponding LB improvements validate the KSP approach of hierarchically decomposing the problem into key-slice classification and ROI regression.

Table 1: Lesion characterization performance. Radiomics is implemented based on the manual localization and SaDT[18] cannot be used without ROIs. Results when using DenseNet121 with ground truth bboxes are reported as the upper bound. The bold numbers indicate the best performance under each metric except the upper bound.

Methods	Accuracy	mean F1	F1(HCC)	F1(ICC)	F1(Meta.)
Radiomics[23]	$58.65\pm 4.51$	$55.07\pm 6.34$	$71.37\pm 4.05$	$39.50\pm 10.94$	$54.36\pm 9.26$
ResNet101[15]	$59.54\pm 3.53$	$50.03\pm 3.55$	$76.85\pm 4.55$	$12.74\pm 8.58$	$60.51\pm 4.55$
DenseNet121[17]	$61.68\pm 5.43$	$51.24\pm 3.73$	$75.91\pm 10.37$	$17.54\pm 14.51$	$50.15\pm 7.57$
ResNeXt101[26]	$60.51\pm 5.32$	$54.96\pm 5.60$	$78.15\pm 6.01$	$27.62\pm 11.89$	$59.12\pm 6.79$
DeepTEN[30]	$53.97\pm 3.38$	$54.74\pm 3.12$	$65.03\pm 6.60$	$41.39\pm 5.63$	$57.80\pm 6.40$
KSP+ResNet101	$64.91\pm 6.42$	$61.32\pm 5.91$	$77.00\pm 6.09$	$45.73\pm 6.72$	$61.26\pm 6.44$
KSP+DenseNet121	$\mathbf{69.62\pm 3.13}$	$\mathbf{66.49\pm 2.78}$	$80.12\pm 3.54$	$\mathbf{55.34\pm 4.88}$	$\mathbf{64.02\pm 6.81}$
KSP+ResNeXt101	$67.26\pm 4.18$	$62.82\pm 3.98$	$\mathbf{80.33\pm 3.54}$	$45.86\pm 4.15$	$62.27\pm 7.98$
KSP+DeepTEN	$67.20\pm 2.79$	$64.14\pm 3.95$	$77.07\pm 3.26$	$51.61\pm 11.88$	$63.74\pm 8.48$
KSP+SaDT	$67.26\pm 3.91$	$63.25\pm 3.63$	$79.69\pm 3.62$	$49.26\pm 8.51$	$60.80\pm 8.11$
Upper bound	$70.68\pm 2.97$	$68.02\pm 3.83$	$78.68\pm 3.61$	$57.59\pm 7.41$	$67.79\pm 7.78$

3.0.4 Characterization

Finally, for the overall lesion characterization performance, we measure patient-wise accuracy, one-vs-all and mean F1 score(s) of the three tumor types, with emphasis on HCC-vs-others given its prominence in clinical work [2]. According to Fig. 4, CenterNet with KSP surpasses 3DCE and ATSS in average and LB overlap, respectively. Therefore, we choose it as our KSP detector and train and test various classifiers on its ROIs. Patient-wise diagnoses are produced by averaging classifications from detected ROIs weighted by confidence. As demonstrated in Table 1, compared with using classifiers alone, KSP significantly improves accuracy (+5% $\sim$ +8%), mean F1 (+8% $\sim$ +15%) and HCC F1 scores (+0.15% $\sim$ +12%). DenseNet121 [17] performs best, garnering an HCC vs. others F1 score of 0.801, which is comparable to reported physician performance (0.791) [2]. In addition, we also produced an upper bound by testing DenseNet121 on oracle tumor locations, and there is only a marginal gap between it and our best results: $1\%$ in accuracy and $1.5\%$ in mean F1 score. This further validates the effectiveness of KSP, suggesting that performance bottlenecks may now be due to classifier limitations, which we leave for future work.
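The confidence-weighted patient-level decision can be sketched as follows. The text does not state which confidence serves as the weight, so the sketch simply accepts one weight per ROI; the class order, the function name `patient_diagnosis` and the normalization by the weight sum are illustrative assumptions.

```python
def patient_diagnosis(roi_probs, roi_confs,
                      classes=("HCC", "ICC", "metastasis")):
    """Fuse per-ROI class probabilities into one patient-level label.

    roi_probs: one probability vector per key ROI, ordered as `classes`.
    roi_confs: one non-negative weight per ROI (assumed to be the ROI's
               confidence from the localization stage).
    """
    total = sum(roi_confs)
    if total <= 0:
        raise ValueError("at least one ROI needs a positive confidence")
    fused = [
        sum(conf * probs[k] for probs, conf in zip(roi_probs, roi_confs)) / total
        for k in range(len(classes))
    ]
    best = max(range(len(classes)), key=fused.__getitem__)
    return classes[best], fused


# The low-confidence ROI voting for ICC is outweighed by the other two.
label, fused = patient_diagnosis(
    roi_probs=[[0.6, 0.3, 0.1], [0.2, 0.7, 0.1], [0.5, 0.4, 0.1]],
    roi_confs=[0.9, 0.3, 0.8],
)
```

Weighting by confidence means a poorly localized ROI contributes less to the final diagnosis than it would under a plain average.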

4 Discussion and Conclusion

As the medical image field progresses toward more clinically viable CAD, research efforts will likely increasingly focus on ensuring true robustness. We contribute toward this goal for liver lesion characterization. Specifically, we articulate a physician-inspired decompositional approach toward ROI localization that breaks down the complex problem into key-slice identification and then ROI regression. Using our proposed framework, our KSP realization can achieve very high robustness, with $87\%$ of its ROIs having an overlap of $\geq 40\%$. Overall, our fully automated CAD solution can achieve an HCC-vs-others F1 score of $80.1\%$. Importantly, this performance is reported on histopathologically-confirmed cases, which selects for the most challenging cases requiring CAD intervention. Even so, this matches reported clinical performances of $79.1\%$ [2], despite such studies including both radiologically and histopathologically-confirmed cases. Given the challenging nature of liver lesion characterization, our proposed CAD system represents a step forward toward more clinically practical solutions.

References

[1] Adcock, A., Rubin, D., Carlsson, G.: Classification of hepatic lesions using the matching metric. Comput Vis Image Underst 121, 36–42 (2014)
[2] Aubé, C., Oberti, F., Lonjon, J., Pageaux, G., Seror, O., N’Kontchou, G., Rode, A., Radenne, S., Cassinotto, C., Vergniol, J., et al.: EASL and AASLD recommendations for the diagnosis of HCC to the test of daily practice. Liver Int 37(10), 1515–1525 (2017)
[3] Bilic, P., Christ, P.F., Vorontsov, E., Chlebus, G., Chen, H., Dou, Q., Fu, C.W., Han, X., Heng, P.A., Hesser, J., Kadoury, S., Konopczynski, T., Le, M., Li, C., Li, X., Lipkovà, J., Lowengrub, J., Meine, H., Moltz, J.H., Pal, C., Piraud, M., Qi, X., Qi, J., Rempfler, M., Roth, K., Schenk, A., Sekuboyina, A., Vorontsov, E., Zhou, P., Hülsemeyer, C., Beetz, M., Ettlinger, F., Gruen, F., Kaissis, G., Lohöfer, F., Braren, R., Holch, J., Hofmann, F., Sommer, W., Heinemann, V., Jacobs, C., Mamani, G.E.H., van Ginneken, B., Chartrand, G., Tang, A., Drozdzal, M., Ben-Cohen, A., Klang, E., Amitai, M.M., Konen, E., Greenspan, H., Moreau, J., Hostettler, A., Soler, L., Vivanti, R., Szeskin, A., Lev-Cohain, N., Sosna, J., Joskowicz, L., Menze, B.H.: The liver tumor segmentation benchmark (lits) (2019)
[4] Bosch, F.X., Ribes, J., Díaz, M., Cléries, R.: Primary liver cancer: worldwide incidence and trends. Gastroenterology 127(5), S5–S16 (2004)
[5] Cai, J., Harrison, A.P., Zheng, Y., Yan, K., Huo, Y., Xiao, J., Yang, L., Lu, L.: Lesion harvester: Iteratively mining unlabeled lesions and hard-negative examples at scale. TMI [Accepted] (2020)
[6] Chen, X., Lin, L., Hu, H., Zhang, Q., Iwamoto, Y., Han, X., Chen, Y.W., Tong, R., Wu, J.: A cascade attention network for liver lesion classification in weakly-labeled multi-phase ct images. In: DART/MIL3ID workshop, pp. 129–138. Springer (2019)
[7] Diamant, I., Hoogi, A., Beaulieu, C.F., Safdari, M., Klang, E., Amitai, M., Greenspan, H., Rubin, D.L.: Improved patch-based automated liver lesion classification by separate analysis of the interior and boundary regions. JBHI 20(6), 1585–1594 (2015)
[8] Eisenhauer, E., Therasse, P., Bogaerts, J., et al.: New response evaluation criteria in solid tumours: revised RECIST guideline (v1.1). EJC 45(2), 228–247 (2009)
[9] Fowler, K.J., Tang, A., Santillan, C., Bhargavan-Chatfield, M., Heiken, J., Jha, R.C., Weinreb, J., Hussain, H., Mitchell, D.G., Bashir, M.R., Costa, E.A.C., Cunha, G.M., Coombs, L., Wolfson, T., Gamst, A.C., Brancatelli, G., Yeh, B., Sirlin, C.B.: Interreader reliability of LI-RADS version 2014 algorithm and imaging features for diagnosis of hepatocellular carcinoma: A large international multireader study. Radiology 286(1), 173–185 (2018)
[10] Freeman, R.B., Mithoefer, A., Ruthazer, R., Nguyen, K., Schore, A., Harper, A., Edwards, E.: Optimizing staging for hepatocellular carcinoma before liver transplantation: a retrospective analysis of the unos/optn database. Liver Transpl 12(10), 1504–1511 (2006)
[11] Galloway, M.M.: Texture analysis using gray level run lengths. Computer graphics and image processing 4(2), 172–179 (1975)
[12] Goyal, P., Dollár, P., Girshick, R.B., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., He, K.: Accurate, large minibatch sgd: Training imagenet in 1 hour. ArXiv abs/1706.02677 (2017)
[13] Grant, A., Neuberger, J.: Guidelines on the use of liver biopsy in clinical practice. Gut 45(suppl 4), IV1–IV11 (1999)
[14] Haralick, R.M., Shanmugam, K., Dinstein, I.H.: Textural features for image classification. IEEE Transactions on systems, man, and cybernetics pp. 610–621 (1973)
[15] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR. pp. 770–778 (2016)
[16] Heinrich, M.P., Jenkinson, M., Brady, M., Schnabel, J.A.: MRF-based deformable registration and ventilation estimation of lung CT. TMI 32(7), 1239–1248 (2013)
[17] Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: CVPR. pp. 4700–4708 (2017)
[18] Huo, Y., Cai, J., Cheng, C.T., Raju, A., Yan, K., Landman, B.A., Xiao, J., Lu, L., Liao, C.H., Harrison, A.: Harvesting, detecting, and characterizing liver lesions from large-scale multi-phase CT data via deep dynamic texture learning. arXiv:2006.15691 (2020)
[19] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
[20] Oliva, M.R., Saini, S.: Liver cancer imaging: role of CT, MRI, US and PET. Cancer imaging 4(Spec No A), S42 (2004)
[21] Sun, C., Wee, W.G.: Neighboring gray level dependence matrix for texture classification. Computer Vision, Graphics, and Image Processing 23(3), 341–352 (1983)
[22] Thibault, G., Fertil, B., Navarro, C., Pereira, S., Cau, P., Levy, N., Sequeira, J., Mari, J.L.: Shape and texture indexes application to cell nuclei classification. International Journal of Pattern Recognition and Artificial Intelligence 27(01), 1357002 (2013)
[23] Van Griethuysen, J.J., Fedorov, A., Parmar, C., Hosny, A., Aucoin, N., Narayan, V., Beets-Tan, R.G., Fillion-Robin, J.C., Pieper, S., Aerts, H.J.: Computational radiomics system to decode the radiographic phenotype. Cancer research 77(21), e104–e107 (2017)
[24] Wang, X., Peng, Y., Lu, L., Lu, Z., Bagheri, M., Summers, R.M.: Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In: CVPR. pp. 2097–2106 (2017)
[25] Wu, J., Liu, A., Cui, J., Chen, A., Song, Q., Xie, L.: Radiomics-based classification of hepatocellular carcinoma and hepatic haemangioma on precontrast magnetic resonance images. BMC medical imaging 19(1), 23 (2019)
[26] Xie, S., Girshick, R., Dollár, P., Tu, Z., He, K.: Aggregated residual transformations for deep neural networks. In: CVPR. pp. 1492–1500 (2017)
[27] Yan, K., Bagheri, M., Summers, R.M.: 3D context enhanced region-based convolutional neural network for end-to-end lesion detection. In: MICCAI. pp. 511–519. Springer (2018)
[28] Yan, K., Lu, L., Summers, R.M.: Unsupervised body part regression via spatially self-ordering convolutional neural networks. In: ISBI. pp. 1022–1025. IEEE (2018)
[29] Yang, W., Lu, Z., Yu, M., Huang, M., Feng, Q., Chen, W.: Content-based retrieval of focal liver lesions using bag-of-visual-words representations of single-and multiphase contrast-enhanced ct images. JDI 25(6), 708–719 (2012)
[30] Zhang, H., Xue, J., Dana, K.: Deep ten: Texture encoding network. In: CVPR. pp. 708–717 (2017)
[31] Zhang, S., Chi, C., Yao, Y., Lei, Z., Li, S.Z.: Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In: CVPR. pp. 9759–9768 (2020)
[32] Zhen, S.h., Cheng, M., Tao, Y.b., Wang, Y.f., Juengpanich, S., Jiang, Z.y., Jiang, Y.k., Yan, Y.y., Lu, W., Lue, J.m., et al.: Deep learning for accurate diagnosis of liver tumor based on magnetic resonance imaging and clinical data. Front Oncol 10, 680 (2020)
[33] Zhou, X., Wang, D., Krähenbühl, P.: Objects as points. arXiv:1904.07850 (2019)

Supplementary Material

1 Definition of Key Slices

We label all slices as “key”, “marginal”, and “non-key” using the scheme illustrated in Fig. 5. For cases with a single tumor in the liver, “marginal” slices are those within a buffer of one slice from the beginning or end of a lesion. Then slices with tumors between two “marginal” regions are defined as “key” slices and the remaining slices are defined as “non-key” slices. For cases with more than one tumor, we merge the designations of each individual lesion together to create a slice-wise label. Under this protocol, a “key” slice is any slice that captures one or more individual lesion “key” slices. In the remaining slices, a “marginal” slice is any slice that captures one or more individual lesion “marginal” slices. Finally, for those that are not defined as “key” or “marginal”, they are treated as “non-key” slices.

2 Implementation Details

Preprocessing. We resampled all MR volumes and aligned them using the DEEDS algorithm [16]. All volumes were preprocessed by clipping within the $0.1\%$ and $99.9\%$ percentile values. For all experiments we augmented the data by random rotations and gamma intensity transforms.

Slice classification in KSP used a DenseNet121 [17] backbone. Following [28], we add an additional $1\times 1$ convolutional layer before global pooling and use log-sum-exp (LSE) pooling, finding that it outperforms standard average pooling. Three adjacent slices are inputted to provide some 3D context. The batch size is set to $20$, and an Adam optimizer [19] with an initial learning rate of $1\times 10^{-4}$ is used. The learning rate is decayed by $0.01$ after every 1000 iterations.

For the detection in KSP, we demonstrate the benefits of our framework on three advanced detectors: 3DCE [27], CenterNet [33] and ATSS [31]. To avoid overly tuning hyper-parameters, the network structure and loss use the same settings as the original papers. The batch size and learning rate are set to 30 and $1\times 10^{-4}$, respectively, following the linear learning rate scaling rule [12]. Each model is trained for 50 epochs. Random scaling and cropping were added as data augmentation for all tested detectors.

For lesion characterization, we test radiomics [23], three standard classifiers, ResNet101 [15], DenseNet121 [17] and ResNeXt101 [26], as well as two texture-based classifiers, DeepTEN [30] and SaDT [18]. In radiomics, the support vector machine (SVM) classifier is implemented with features extracted from manually localized tumors, including shape, first order statistics, neighboring gray level dependence method (NGLDM) [21], gray level size zone matrix (GLSZM) [22], gray level run length matrix (GLRLM) [11] and gray-level co-occurrence matrix (GLCM) [14]. For the deep-learning-based classifiers, the batch size and learning rate are set to 8 and $1\times 10^{-4}$, respectively. Networks are trained on ground truth bounding boxes (bboxes) for 70 epochs. Random rotation, scaling and cropping are adopted as augmentation. In addition, ground truth bboxes are randomly shifted and resized to simulate imperfect localization.

3 Evaluation of Different Fusion Methods

In Fig. 6(a), we measure three fusion approaches on key-slice classification: standard late fusion (LF), early fusion (EF), and T2-anchor LF. As the metrics show, at the same average recall, T2-anchor LF outperforms both LF and EF in the first quartile (Q1), median (M) and third quartile (Q3) precision, showing its superiority in localizing key slices. Fig. 6(b) shows the cumulative probability curves of CenterNet [33] with the three fusion methods. To evaluate the detection performance independently, the curves show results directly from the detector without key-slice classification. The curve of T2-anchor LF is lower than the other two, especially when IoU-IoBB is smaller than 0.5. This indicates that T2-anchor LF surpasses LF and EF by decreasing the ratio of bboxes with low overlap with the ground truth.