Semi-supervised object detection based on single-stage detector for thighbone fracture localization

Jinman Wei [email protected] Jinkun Yao [email protected] Guoshan Zhang [email protected] Bin Guan [email protected] Yueming Zhang [email protected] Shaoquan Wang [email protected] School of Electrical and Information Engineering, Tianjin University, Tianjin, 300072, China. Department of Radiology,Linyi People’s Hosptial,276000 Linyi,China.

Abstract

The thighbone is the largest bone supporting the lower body. If the thighbone fracture is not treated in time, it will lead to lifelong inability to walk. Correct diagnosis of thighbone disease is very important in orthopedic medicine. Deep learning is promoting the development of fracture detection technology. However, the existing computer aided diagnosis (CAD) methods baesd on deep learning rely on a large number of manually labeled data, and labeling these data costs a lot of time and energy. Therefore, we develop a object detection method with limited labeled image quantity and apply it to the thighbone fracture localization. In this work, we build a semi-supervised object detection(SSOD) framework based on single-stage detector, which including three modules: adaptive difficult sample oriented (ADSO) module, Fusion Box and deformable expand encoder (Dex encoder). ADSO module takes the classification score as the label reliability evaluation criterion by weighting, Fusion Box is designed to merge similar pseudo boxes into a reliable box for box regression and Dex encoder is proposed to enhance the adaptability of image augmentation. The experiment is conducted on the thighbone fracture dataset, which includes 3484 training thigh fracture images and 358 testing thigh fracture images. The experimental results show that the proposed method achieves the state-of the-art AP in thighbone fracture detection at different labeled data rates, i.e. 1%, 5% and 10%. Besides, we use full data to achieve knowledge distillation, our method achieves 86.2% AP50 and 52.6% AP75.

keywords:

Semi-supervised Learning; Object Detection; Single-stage; Tighbone Fracture Detection

^†^†journal: Applied Soft Computing

1 Introduction

The thighbone is located below the pelvis. Thighbone and acetabulum constitute the hip joint and play a role in supporting the whole body. Various activities of the human body depend on the thighbone, so it is one of the most vulnerable part. The diagnosis of ordinary fracture and comminuted fracture is a significant part of surgical diagnosis[1]. However, compared with the huge number of patients, there is a lack of excellent surgeons. Therefore, surgeons urgently need an assistant to relieve their work pressure. In order to solve this problem, many computer-aided detection and diagnosis methods[2] have been proposed. In recent years, substantial progress has been made in developing deep learning-based CAD systems to fracture diagnosis. Guan et al. proposed a convolutional neural network for thighbone fracture detection that can balance the information of each feature map in ResNeXt’s feature pyramid.[3]. Firat et al. designed an integrated object detection model for wrist X-ray image fracture detection[4]. At present, the state-of-the-art fracture detection methods are usually developed based on large-scale expert annotations such as 5134 labeled CT images for spinal fracture detection[5], 7356 wrist radiographic images[6], 9040 labeled hand, wrist, knee, ankle, foot and ankle radiographs for multiple fracture detection[7].

Compared with the above-mentioned methods, semi-supervised learning (SSL) uses both labeled data and unlabeled data when training the model, and uses unlabeled data to assist in optimizing the model, so as to save training cost. The state-of-the-art semi-supervised methods are the pseudo-label based approaches[8]. Specifically, the model is trained on labeled data, and then the trained model is used to predict the pseudo labels on unlabeled images. Teacher-student model[9] is a common method to generate pseudo labels in semi-supervised learning in which key idea is to train two independent models, namely teacher model and student model. The teacher model is trained on the labeled images to label the unlabeled images and then mix these pseudo labeled images with the labeled images to train the student model.

Most research on SSOD has focused on the two-stage detectors[10, 11, 12, 13]. But basing on the single-stage detectors (such as FCOS[14], YOLOF[15], RestinaNet[16]) has more attractive and practical, because they can be easily deployed on devices with limited resources, eliminate cumbersome preprocessing and post-processing except for NMS[17]. The main difference between the single-stage detectors and the two-stage detectors is that Region proposal network (RPN)[18] of the two-stage detector can filter most of the background samples, and in the next stage, the remaining candidate boxes are further predicted the detailed categories. The single-stage detectors make dense prediction for all areas of the image at one time, as long as few bounding boxes can be predicted as positive samples. Because in the single-stage detectors, the generation and judgment of the proposal are integrated, this lead to that the detection speed is faster but the classification score of one-stage detectors is lower than two-stage detectors. And directly sending pseudo labels with low classification score into student model will bring a lot of noise and affect the training accuracy of the model. Therefore, how to deal with a large number of low-quality pseudo labels in dense prediction is still an important problem.

Regression branch is another component of object detection task. The regression quality of pseudo box is another important factor that determines the performance of semi supervised target detection model. Xu et al.[20] find that the accuracy of the regression is related to the uncertainty calculated by the BoxJitter module, but the BoxJitter module relies on Regions with CNN features(RCNN) to process the proposal, so it is not applicable to the single-stage detector. To address this issue, we propose the Fusion Box module in the regression branch for SSOD based on single-stage detector.

In summary, we develop a semi-supervised framework based on the single-stage detector for the thighbone fracture detection. In this framework, the adaptive difficult sample oriented(ADSO) module and the Fusion Box module are developed to reduce the impact of inaccurate pseudo label prediction. In addition, The Single-in-Single-out (SISO) encoder called Dex encoder is proposed to improve the adaptability of the augmented input images. The main contributions of this paper can be summarized as follows:

1. We developed the semi-supervised object detection framework based on single-stage detector for thighbone fracture detection with limited annotations. Compared with previous work, it has fewer parameters and faster detection speed.

2. The adaptive difficult sample oriented (ADSO) module is proposed to take the classification score of teacher model as the criterion of pseudo labels reliability.

3. The Fusion Box module is proposed to reduce the impact of multiple pseudo boxes regression in the same position on model performance.

4. We design a Single-in-Single-out encoder named Deformable expand encoder (Dex encoder) for enhancing the learning ability of of deformed features.

5. The experimental results show that compared with supervised and semi-superviesed methods, our method is better than other methods in thighbone fracture detection.

2 Related work

2.1 Deep learning for medical detection

CAD has been extensively studied in the past decade[21, 22], and CAD system based on deep learning has been developed to diagnose a wide range of Pathology such as detection of covid-19[23, 24], mass and calcification features in mammography[25] and brain tumor diagnosis[26]. In the fracture detection method based on deep learning[27], FAMO[7] constructed the Feature Ambiguity Mitigate Operator model to mitigate feature ambiguity in bone fracture detection on radiographs of various body parts. Due to the requirements of medical professional knowledge, the labor cost of large-scale annotations is expensive which hinders the development of CAD solutions based on deep learning. Computer aided detection using SSL method is an emerging task in recent years, such as Yirui Wang et al. proposed the adaptive asymmetric label sharpening (AALS) algorithm using the teacher-student model paradigm, which solves the label imbalance problem unique to the medical field[28].

2.2 object detection

Object detection is one of the core tasks in computer vision. At present, the object detector based on CNN can be divided into single-stage and two-stage detectors. FasterRcnn[18] is the representative two-stage detector, which uses RPN network for proposal extraction and RCNN head for regional prediction and extraction of objects. The single-stage detector only uses the features extracted by the feature extraction network for regression and classification. For example, SSD[29] uses the feature pyramid method to complete target regression and classification on different scale features at the same time. Chen et al. developed the YOLOF that only uses C5 feature for detection as shown in Figure 1 in which the complex Multiple-in-Multiple-out encoder is replaced by the simple Single-in-Single-out encoder, YOLOF containing two key components: dilated encoder and uniform decoder.

2.3 Semi-supervised learning in object detection

SSL method plays a leading role in image classification[31, 32, 33, 34, 35]. Because the object detector has complex architecture design and multi task learning (classification and regression), it is not a simple work to transfer the SSL method to the object detection task. The current SSOD method mainly has two directions: Consistency Regularization[36] and Pesudo Label[8]. The former uses two deep convolution neural networks to learn the consistency between different data augmentation[37] (horizontal flip, different contrast, brightness, etc.) of the same unlabeled image, and make the image prediction to small disturbance the same. The latter uses the pre-training model learned on labeled data to infer the unlabeled data. In recent years, semi-supervised object detection method has attracted people’s attention[38, 39, 40]. STAC[19] first applies pseudo label method to SSOD, it apply weak data augmentation to unlabeled data, and uses the trained teacher model to generate pseudo labels of unlabeled images. Unbiased teacher[41] uses focal loss[16] to solve the imbalance between positive and negative samples. Instant teaching[42] trains two models at the same time to check and correct pseudo labels for each other, so as to effectively suppress the accumulation of false predictions. Almost all the above work is based on the two-stage detector, such as FasterRcnn, which is not convenient to develop in the medical field with limited resources. Inspired by the above works, we designed a fast semi-supervised detection model based on the single-stage detector.

Refer to caption — Figure 1: The structure of YOLOF.

3 Methology

Our method adopts the teacher-student mutual learning mode in which the student model learns from the detection loss of labeled and unlabeled images. The unlabeled images have two groups of pseudo boxes, which are used for classification branch and regression branch training, respectively. The teacher model is updated by using the student model with exponential moving average (EMA). The pseudo boxes predicted by the teacher model will be filtered by confidence at first, and then the pseudo labels with classification scores higher than the confidence threshold $\sigma$ will be retained. The remaining pseudo boxes will be sent to the classification branch and regression branch. In this SSOD framework, There are two critical designs: ADSO and Fusion Box. Figure 2 shows the description of our SSOD framework.

3.1 Semi-supervised learning framework

In each training iteration, unlabeled images and labeled images are extracted according to a certain data sampling ratio. The data are preprocessed by two different preprocessing methods to obtain strong augmented labeled images, weak augmented and strong augmented unlabeled images. The student network is trained with the pseudo boxes generated by teacher model and ground truth boxes in labeled images. The total loss function can be expressed by (1) :

L=L_{\text{sup }}+\lambda L_{\text{unsup }}

(1)

where $L_{\text{sup }}$ and $L_{\text{unsup}}$ represent the loss function of labeled images and the unlabeled images respectively, and $\lambda$ represent the weight of unsupervised loss in the total loss function.

At the beginning of training, the student model and teacher model adopt random initialization. For labeled data, the student network uses its ground truth to calculate the loss $L_{\text{sup }}$ as (2) and update the student model parameters by the gradient descent method. For unlabeled data, the teacher network first deduces the result of weakly augmented unlabeled data, then filtering out the low-quality results according to the confidence threshold $\sigma$ . The retained high-quality predictions is considered to be the pseudo labels for unlabeled data. Then, the student calculates the unsupervised loss $L_{\text{unsup }}$ of unlabeled data as (3), and updates the parameters by the gradient descent method. Finally, the teacher network is updated with the exponential moving average (EMA) for the student model.

L_{\text{sup }}=\frac{1}{N_{\text{label }}}\sum_{t=1}^{N_{\text{label }}}L_{cls}\left(v_{t},\hat{v}_{t}\right)+\sum_{t=1}^{N_{\text{label }}}L_{reg}\left(\theta_{t},\hat{\theta}_{t}\right)

(2)

L_{\text{unsup }}=\frac{1}{N_{\text{unlabel }}}\sum_{t=1}^{N_{\text{unlabel }}}L_{cls}\left(v_{t},v_{t}^{*}\right)+\sum_{t=1}^{N_{\text{unlabel }}}L_{reg}\left(\theta_{t},\theta_{t}^{*}\right)

(3)

where $N_{\text{label }}$ indicates the number of positive samples in the labeled data, $N_{\text{unlabel }}$ indicates the number of positive samples in pseudo data with retained classification scores above the threshold $\sigma$ . $v_{t}$ and $\hat{v}_{t}$ represent the student predicted category and ground truth of the t-th positive sample, $\theta_{t}$ and $\hat{\theta}_{t}$ represent the student predicted regression result and corresponding target of the t-th positive sample. $v_{t}^{*}$ and $\theta_{t}^{*}$ represent the category and box of the positive sample whose classification score from pesudo labels is higher than the threshold $\sigma$ .

3.2 Adaptive difficult sample oriented method (ADSO)

The quality of pseudo labels determines the performance of SSL model, in our framework, pseudo labels with low classification scores will be filtered out with confidence threshold $\sigma$ to ensure the quality of pseudo labels. Neither the high confidence threshold setting nor low confidence threshold setting is suitable for the semi-supervised framework based on single-stage detectors. For the SSOD method based on single-stage detector, directly using the same setting as the two-stage detector (confidence threshold = 0.7 in [41]) will greatly increase the number of negative samples, and only few part can be predicted in positive samples[40]. This imbalance will result in not getting enough pseudo labels, which limits the performance of the semi-supervised single-stage detector. On the contrary, low threshold threshold will easily bring in more low-quality pseudo labels with noise. When we adopt confidence $\sigma$ = 0.5, as shown in Figure 4, AP50 decreases to a great extent with the increase of training iterations. We believe with training iterations increasing, Teacher have been able to generate enough pseudo labels with confidence greater than $\sigma$ . The number of pseudo labels has been much more than that in the early stage of training, but limited to low threshold settings, the classification score of pseudo labels has not been greatly increased with the increase of training iterations. We need to adopt a more strict confidence screening strategy in the middle stage of training and provide higher score pseudo labels.

We propose the adaptive difficult sample oriented (ADSO) moudle, which inherits the advantages of the flexibility of the end-to-end framework and fully utilizes the information provided by the teacher model. Specifically, we traverse the pseudo labels after confidence filtering to evaluate the reliability of each label generated by teacher. This labels generated by teacher are regarded as real pseudo labels. In this process, deal with the confidence of each pseudo box without considering background boxes. The processing function is shown in the Figure 3:

when the confidence is close to the confidence threshold setting, the confidence is reduced to a lower value through the function, at this time when the confidence is already high, we take the current confidence as the evaluation standard of prospect reliability in (4), so that the classification loss function of unsupervised part is as (5). Liu et al.[41] have proved by a large number of experiments that selecting 0.7 as the confidence threshold in the two-stage network will achieve better results. Therefore, we select $r_{i}$ =0.7 to make the pseudo boxes’ classification score approach the two-stage network. When the confidence is low, a penalty item will be added to the loss calculation of the pseudo box which is similar to the hard negative sample mining method. We find that this method makes the classification score after a long iteration higher than without ADSO as shown in Figure 5.

\omega_{i}=y(x)=\begin{cases}r_{i}\sin\left(\frac{\pi}{2}\frac{x-r_{i}}{r_{i}}\right)+r_{i}&:x<r_{i}\\ x&:r_{i}<x<1\end{cases}

(4)

L_{\mathrm{unsup}}^{cls}=\frac{1}{N_{\text{unlabel }}^{fg}\omega_{i}}\sum_{i=1}^{N_{i}^{fg}}L_{cls}\left(v_{t},v_{t}^{*}\right)

(5)

where $L_{\mathrm{unsup}}^{cls}$ is unsupervised classification loss, $x$ indicates the classification score of the i-th box, $\omega_{i}$ indicates the classification score after function transformation as (4).

3.3 Fusion Box

Different from the classification quality evaluation of pseudo labels, the regression quality of pseudo boxes is a difficult index to evaluate. We visualized the pseudo labeled images of the teacher model in the training process, As shown in Figure 6. Because the prediction accuracy of the teacher model is not high enough, multiple pseudo boxes will be predicted around ground truth boxes of the unlabeled images. Most of these pseudo boxes can not provide reliable and accurate positioning information for student but after merging these pseudo boxes, the new box location is closer to the ground truth box. Therefore, we introduce Fusion Box module to reduce the inaccurate location impact of pseudo boxes. Specifically, for an unlabeled image predicted by teacher model, after confidence filtering, Fusion Box module select whether to synthesize each other through the similarity $\xi$ and the pseudo algorithm of Fusion Box is as Algorithm 1.

Fusion Box collects the euclidean distance of the center point between each other for all the pseudo boxes generated in each iteration, and then calculates the similarity between each other. The two boxes whose similarity is less than the threshold $\mu$ are combined into a new pseudo box. The calculation process of the similarity $\xi$ of these pseudo boxes can be expressed by (6).

Algorithm 1 Fusion Box

1:Prediction

(Box_{(x,y)},Score)

, confidence threshold

\sigma

, Fusion Box threshold

\mu

2:Fusion prediction

3:Box

{}_{(x,y)}\leftarrow

Box

{}_{(x,y)}[

Score

\geq\sigma]

, Score

\leftarrow

Score

[

Score

\geq\sigma]

4:if

\operatorname{Size}\left(\operatorname{Box}_{(x,y)}\right)>1

then

5: for

n=0

\operatorname{Size}\left(\operatorname{Box}_{(\mathrm{x},\mathrm{y})}\right)

\text{Centre}[n]=\operatorname{mean}\left[\operatorname{Box}_{(x,y)}[n]\right]

7: end for

8: for

m=0

\operatorname{Size}\left(\operatorname{Box}_{(\mathrm{x},\mathrm{y})}\right)

9: for

n=0

\operatorname{Size}(

Center

)

10:

\text{$\xi$}=\operatorname{Center}[m]-H[n]

11: if

\text{ $\xi$ }<\mu

then

12:

Box_{(x,y)}[m]=Box_{(x,y)}[n]\cup Box_{(x,y)}[m]

13:

Del\ Bbox_{(x,y)}[n]

14:

Del\ Center[n]

15: end if

16: end for

17: end for

18:end if

\text{ $\xi$ }=\frac{|\operatorname{mean}(\mathrm{Bbox}[\mathrm{n}])-\operatorname{mean}(\mathrm{Bbox}[\mathrm{m}])|^{2}}{\operatorname{scale}(w)+\operatorname{scale}(h)}

(6)

where $\text{ mean(Bbox }[\mathrm{n}])$ indicates that the center coordinates of each pseudo box. $\operatorname{scale}(w)$ and $\operatorname{scale}(h)$ represent the width and height of the whole image scaled after data augmentation. After Fusion Box module, the unsupervised regression loss function is calculated as follows:

L_{\mathrm{unsup}}^{reg}=\frac{1}{N_{pos_{\text{fusion }}}}\sum_{t=1}^{N_{pos_{\text{fusion }}}}L_{reg}\left(\theta_{t},\theta_{t}^{*}\right)

(7)

where $L_{\mathrm{unsup}}^{reg}$ is unsupervised regression loss, $N_{pos_{\text{fusion }}}$ indicates the number of pseudo boxes after confidence filtering and Fusion Box, $\theta_{t}^{*}$ represents the regression target after Fusion Box.

3.4 Dex encoder

As shown in Figure 8, we design the Dex encoder to replace the dilated encoder in YOLOF. Dilated encoder stacks standard convolution and dilated convolution, then merge the original feature to the feature map containing expanded receptive field. For reducing the parameter quantity, we cancel the dilated block with dilation rate = 2 in Dex encoder and choose to add 1×1 convolution after Deformable convolution v2 (DCNv2) module[44], the convolution layer reduces channels to 128 and maintains the reduced number of channels throughout the network.

Data augmentation need to be used for the input images in SSOD. The unlabeled images after different data augmentation will be collected to mark pseudo labels in the unsupervised section. Inputting these deformed images to network will lead to coordinates and angle changes in the ground truth box corresponding to Figure 7 while the deformed box will seriously affect the network training. Compared with the traditional fixed window convolution, deformable convolution [43] can effectively deal with gtbox deformation including box movement, size scaling and rotation, because its local receptive field is learnable and oriented to the whole image. Deformable convolution is to increase the spatial sampling position of additional offset and does not need additional supervision. Therefore, we choose to apply DCNv2 to YOLOF.

As shown in Figure 9, (a) is fixed window convolution, (b) is deformation convolution, and saturation color points is the actual sampling position of convolution kernel, which is offset from the standard position. (c) is dilated convolution which can be regarded as the special form of deformation convolution, has the ability to expand the receptive field. The convolution kernel used to generate the output feature and the deformation convolution kernel used to generate the offset are synchronously learned in DCNv2 while the offset is obtained by back propagation using interpolation algorithm.

4 Experiments

4.1 Dataset and Evaluation protocol

We validate our method on the thighbone fracture dataset[45] which was collected from the radiology department of Linyi people’s hospital. All X-ray images are produced by the latest digital radiography (DR) technology. The dataset consists of 3842 thigh fracture images in 24 bit JPG format. The train set contains 3484 labeled images. In addition, The test set of 358 images is provided for serious performance verification. The following settings are adopted for performance verification:

Referring to the verification method of STAC[19], we use 1%, 5% and 10% of trainset as labeled training data, and unselected images in trainset as unlabeled data. In addition, our knowledge distillation result is compared with the supervised thighbone fracture detection methods proposed by Guan et al.[45] and Wang et al.[46] on this dataset to prove our advantage, following the convention to report the performance of val with mAP and AP50 as evaluation indicators.

4.2 Implementation Details

The experiment was conducted on 4 NVIDIA geforce RTX 1080ti. We use Resnet-50 as feature extraction network to compare with previous methods. The backbone is initialized on ImageNet with pre-trained weight. Our implementation and hyperparameters are based on mmdetection[47], anchors with 5 scales and 1 aspect ratios are used. In the comparison between using partial labeled data sets and being compared with supervised algorithms, the training parameters under the two settings are slightly different because there is a large discrepancy in the amount of labeled training data.

Partial annotation data. the model is trained on 4 GPUs for 60K iterations, and each GPU has 8 images. During SGD training, the initial learning rate is 0.02, and the learning rate of backbone part is set to 0.005, which is divided by 10 in the 1k-th iteration and 1.5k-th iteration. The warmup phase is delayed to 1500iter. The weight attenuation and momentum are set to 0.0001 and 0.9, respectively. The confidence threshold is set to 0.5, the data sampling ratio SR is set to 0.25, and gradually decreases to 0 in the last 1K iterations.

Compared with supervised algorithms. the data sampling ratio SR is set to 1, other settings are the same as the experience of partial annotation data.

In order to estimate the reliability of Fusion Box, the Fusion Box threshold $\mu$ is selected as 0.05 according to the size of the input image and select the fused pseudo box for box regression. In addition, we use weak data augmentation for teacher model, strong data augmentation for student model as shown in Table 1. (Unlabeled (T) and Unlabeled (S) respectively represent the data augmentation used for the unlabeled images input to the teacher and student models while Labeled represents the data augmentation used for labeled images. Weak Aug and Strong Aug respectively represent the weak data augmentation and strong data augmentation.)

Table 1: Data augmentation implementation details for our semi-supervised approach.

	Labeled	Unlabeled (S)	Unlabeled (T)
Weak Aug
Scale jitter	short edge $\in$ (0.5, 1.5)	short edge $\in$ (0.5, 1.5)	short edge $\in$ (0.5, 1.5)
Horizontal Flip	p=0.5	p=0.5	p=0.5
Strong Aug
Contrast jitter	p=0.2, ratio $\in$ (0, 1)	p=0.2, ratio $\in$ (0, 1)	p=0.25, ratio $\in$ (0, 1)
Solarize jitter	p=0.1,, ratio $\in$ (0, 1)	p=0.1,, ratio $\in$ (0, 1)	-
Color Jitter	p=0.1,ratio=(0.4,0.4,0.4,0.1)	P=0.1,ratio=(0.4,0.4,0.4,0.1)	-
Brightness jitter	p=0.1, ratio $\in$ (0, 1)	p=0.1, ratio $\in$ (0, 1)	-
Sharpness jitter	p=0.1, ratio $\in$ (0, 1)	p=0.1, ratio $\in$ (0, 1)	p=0.25, ratio $\in$ (0, 1)
Posterize	p=0.1	p=0.1	-
Equazlize	p=0.1	p=0.1	p=0.25
Rotate	-	p=0.3, angle $\in$ (0, 30◦)	p=0.3, angle $\in$ (0, 30◦)
SHIFT	-	p=0.3, angle $\in$ (0, 30◦)	p=0.3, angle $\in$ (0, 30◦)
Cutout	-	ratio $\in$ (0.05, 0.2)

4.3 System Comparison

In this section, we compare our method with other state-of-the-art methods proposed in recent years. We first evaluate on Partial annotation data setting and compare our method with the results in STAC, Soft teacher and Unbiased teacher. Also the baseline of YOLOF and FasterRcnn is compared with our semi-supervised method in Table 2. We found that our method performed better than other methods in fracture detection. Specifically, the mAP of our method is 15.7%, 22.5% and 11.0% higher than the baseline on the 1%, 5% and 10% data sets, at the same time 2.4%, 0.2% and 2.3% higher than the previous methods. We evaluate the loss of model training and show the results in Figure 10. In order to compare the prediction results more intuitively, we visualized the baseline prediction using YOLOF and the prediction results of our semi-supervised framework in Figure 11.

Then we compared our performance of SSL framework on full dataset and 50% labeled dataset in Table 3 and Table 4. We only use 50% data for semi-supervised training, and the AP exceeds Cascade r-cnn, a two-stage network which parameters are much more than YOLOF. By using Dex encoder, we only need 3 divided blocks to increase the mAP of YOLOF by 1.6% and AP75 by 2.6%, which proves the effectiveness of Dex encoder. In addition, using our semi-supervised framework to distill the knowledge of the single-stage network can further improve 2.6% AP50 and 9.6% AP75. Compared with the algorithm for fracture detection and the current popular object detection network, our semi-supervised framework has higher accuracy. In addition, by introducing our semi-supervised framework, our model achieves the same performance as the fully supervised training of two-stage detectors with simpler network structure and fewer parameters. We also compare the model inference speed on test set and moel complexity with the latest two methods in Table 7.

Table 2: The results compared with baseline and other semi-supervised methods in 1%, 5% and 10% labeled data.

Method

Backbone

10%

ResNet-50

mAP

AP50

mAP

AP50

mAP

AP50

Supervised baseline (FasterRcnn)

6.5

22.3

12.1

38.8

28.4

68.3

Supervised baseline (YOLOF)

6.1

22.6

13.8

39.1

29.6

69.6

STAC

11.6

31.2

19.7

55.4

33.8

72.9

Softteacher

19.2

45.6

36.1

71.6

38.3

75.1

Unbiased teacher

19.8

53.4

31.7

72.7

37.7

78.1

Ours

22.2

53.9

36.3

75.0

40.6

80.1

Table 3: Performance of semi-supervised framework on full labeled datasets.

Current object detection algorithm
Algorithm	Backbone network	AP50	AP75
YOLOF	Resnet-50	82.2	39.6
FCOS	Resnet-101	85.4	42.7
Empirical attention	Resnet-50	86.1	43.2
GHM	Resnet-50	86.5	43.8
GCNet	Resnet-50	85.4	44.6
FPN	Resnet-101	86.3	48.8
Cascade R-CNN	Resnet-50	85.0	49.7
Object detection algorithms in the thighbone fracture dataset
YOLOF(Dex encoder+3 dilated blocks)	Resnet-50	83.8(+1.6)	42.2(+2.6)
framework(M et al,2019)	network(M et al,2019[46])	87.3	47.6
framework(Bin et al,2022)	network(Bin et al,2022[3])	88.9	52.6
Our semi-supervised method	Resnet-50	86.2(+2.6)	52.6(+9.6)

Table 4: Performance of semi-supervised framework on full labeled dataset and 50% labeled datatset.

Algorithm	AP50	AP75
our semi-supervised framework(50% data)	85.0	51.1
our semi-supervised framework(full data)	86.2	52.6

Table 5: Ablation results with different modules.

ADSO	Fusion Box	Dex encoder	mAP	AP50
			37.3	77.7
✓			39.0	78.1
	✓		38.4	78.0
		✓	38.4	78.6
✓		✓	39.9	79.2
✓	✓	✓	40.6	80.1

Table 6: Ablation results of dilated block quantity. The number in () indicates each dilated block’s dilated rate.

method	mAP	AP50
YOLOF(2,4,6,8)	29.4	69.3
Dex encoder+1 dilated block(4)	29.9	69.9
Dex encoder+2 dilated blocks(4,6)	30.0	70.6
Dex encoder+3 dilated blocks(4,6,8)	30.0	71.9
Dex encoder+4 dilated blocks(2,4,6,8)	31.4	72.5
Dex encoder+5 dilated blocks(2,4,6,7,8)	30.8	71.3

Table 7: Computational complexity and inference speed of different models in 358 val images.

Method	FLOPs	FPS
Unbiased teacher	204.13G	22.4
Soft teacher	202.31G	16.0
Ours	101.57G	27.6

Table 8: Ablation results of different confidence thresholds.

Threshold $\sigma$	mAP	AP50
0.4	39.8	79.3
0.5	40.6	80.1
0.55	40.1	79.7
0.6	39.6	79.1

Table 9: Ablation results of Fusion Box thresholds.

Threshold $\xi$	mAP	AP50
0.04	39.7	79.6
0.05	40.6	80.1
0.06	39.8	79.8

4.4 Ablation Studies

In this section, we validate key designs. All ablation experiments are performed on YOLOF using 10% labeled dataset.

The influence of critical design in semi-supervised model. Table 5 shows the impact of different modules on the performance of the model. When all three modules are adopted, the model performance can reach the best. Compared with the original SSL model, mAP is increased by 3.3% and AP50 by 2.4%. The validity of modules is proved.

The impact of confidence filter threshold. We select different confidence thresholds to compare the performance of our semi-supervised model in Table 8. When we select a threshold of 0.5, the performance of the model reaches the best. When we select a threshold of 0.5 ± 0.1, a high threshold will lead to a faster decline in the mAP, indicating that our semi-supervised model is sensitive to a high threshold. However, whether we select a high threshold or a low threshold, the accuracy of the model does not decline much, indicating that ADSO module has a certain adjustment function to the confidence threshold.

The impact of dilated block quantity.We compared the effect of different number of expansion blocks in Table 6. Considering the balance between accuracy and parameter quantity, we finally choose to use 3 dilated blocks on the Dex encoder.

The impact of Fusion Box threshold. In Table 9, we study the box regression Fusion Box threshold. The best performance is achieved when the threshold is set to 0.05.

5 Conclusions

In this paper, a semi-supervised object detection method based on the single-stage network is proposed to train neural networks with limited labeled data and a large number of unlabeled data. In the end-to-end training, we propose three modules: ADSO, Fusion Box and Dex encoder. We improve the object detection network to promote the effective use of teacher model. For the detection of thighbone fracture in clinical application, the model has high accuracy. A large number of experiments on the thighbone fracture dataset show that semi-supervised method has broad application prospects in the field of medical images. We hope that our work can help surgeons improve the efficiency of diagnosing diseases.

6 Reference

References

[1] R. M. Jones, A. Sharma, R. Hotchkiss, J. W. Sperling, R. V. Lindsey, Assessment of a deep-learning system for fracture detection in musculoskeletal radiographs, npj Digital Medicine 3 (1) (2020) 1–6. doi:10.1038/s41746-020-00352-w.
[2] G. L. Georgalis, T. M. Scheyer, Crushed but not lost: a colubriform snake (serpentes) from the miocene swiss molasse, identified through the use of micro-ct scanning technology, Swiss Journal of Geosciences 115 (1) (2022) 1–9.
[3] B. Guan, J. Yao, S. Wang, G. Zhang, Y. Zhang, X. Wang, M. Wang, Automatic detection and localization of thighbone fractures in x-ray based on improved deep learning method, Computer Vision and Image Understanding 216 (2022) 103345. doi:10.1016/j.cviu.2021.103345.
[4] F. Hardalaç, F. Uysal, O. Peker, M. Çiçeklidağ, T. Tolunay, N. Tokgöz, U. Kutbay, B. Demirciler, F. Mert, Fracture detection in wrist x-ray images using deep learning-based object detection models, Sensors 22 (3) (2022) 1285. doi:10.3390/s22031285.
[5] G. Sha, J. Wu, B. Yu, Detection of spinal fracture lesions based on improved yolov2, in: 2020 IEEE International Conference on Artificial Intelligence and Computer Applications (ICAICA), IEEE, 2020, pp. 235–238. doi:10.1109/ICAICA50127.2020.9182582.
[6] Y. L. Thian, Y. Li, P. Jagmohan, D. Sia, V. E. Y. Chan, R. T. Tan, Convolutional neural networks for automated fracture detection and localization on wrist radiographs, Radiology: Artificial Intelligence 1 (1) (2019) e180001. doi:10.1148/ryai.2019180001.
[7] H.-Z. Wu, L.-F. Yan, X.-Q. Liu, Y.-Z. Yu, Z.-J. Geng, W.-J. Wu, C.-Q. Han, Y.-Q. Guo, B.-L. Gao, The feature ambiguity mitigate operator model helps improve bone fracture detection on x-ray radiograph, Scientific Reports 11 (1) (2021) 1–10. doi:10.1038/s41598-021-81236-1.
[8] D.-H. Lee, et al., Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks, in: Workshop on challenges in representation learning, ICML, Vol. 3, 2013, p. 896.
[9] G. Hinton, O. Vinyals, J. Dean, Distilling the knowledge in a neural network, Computer Science 14 (7) (2015) 38–39.
[10] Z. Cai, N. Vasconcelos, Cascade r-cnn: Delving into high quality object detection, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 6154–6162. doi:10.1109/CVPR.2018.00644.
[11] J. Wang, K. Chen, S. Yang, C. C. Loy, D. Lin, Region proposal by guided anchoring, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 2965–2974. doi:10.1109/CVPR.2019.00308.
[12] X. Zhu, D. Cheng, Z. Zhang, S. Lin, J. Dai, An empirical study of spatial attention mechanisms in deep networks, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 6688–6697. doi:10.1109/ICCV.2019.00679.
[13] Y. Cao, J. Xu, S. Lin, F. Wei, H. Hu, Gcnet: Non-local networks meet squeeze-excitation networks and beyond, in: Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, 2019, pp. 1971–1980. doi:10.1109/ICCVW.2019.00246.
[14] Z. Tian, C. Shen, H. Chen, T. He, Fcos: Fully convolutional one-stage object detection, in: Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 9627–9636. doi:10.1109/ICCV.2019.00972.
[15] Q. Chen, Y. Wang, T. Yang, X. Zhang, J. Cheng, J. Sun, You only look one-level feature, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 13039–13048. doi:10.1109/CVPR46437.2021.01284.
[16] T.-Y. Lin, P. Goyal, R. Girshick, K. He, P. Dollár, Focal loss for dense object detection, in: Proceedings of the IEEE international conference on computer vision, 2017, pp. 2980–2988. doi:10.1109/TPAMI.2018.2858826.
[17] J. Redmon, S. Divvala, R. Girshick, A. Farhadi, You only look once: Unified, real-time object detection, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 779–788.
[18] S. Ren, K. He, R. Girshick, J. Sun, Faster r-cnn: Towards real-time object detection with region proposal networks, in: NIPS, Vol. 28, 2016.
[19] K. Sohn, Z. Zhang, C.-L. Li, H. Zhang, C.-Y. Lee, T. Pfister, A simple semi-supervised learning framework for object detection, arXiv preprint arXiv:2005.04757 (2020). doi:10.48550/arXiv.2005.04757.
[20] M. Xu, Z. Zhang, H. Hu, J. Wang, L. Wang, F. Wei, X. Bai, Z. Liu, End-to-end semi-supervised object detection with soft teacher, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 3060–3069.
[21] M. H. Hesamian, W. Jia, X. He, P. Kennedy, Deep learning techniques for medical image segmentation: achievements and challenges, Journal of digital imaging 32 (4) (2019) 582–596. doi:10.1007/s10278-019-00227-x.
[22] H. Fujita, Ai-based computer-aided diagnosis (ai-cad): the latest review to read first, Radiological physics and technology 40 (4) (2020) 140. doi:10.11323/jjmp.40.4_140.
[23] R. Karthik, R. Menaka, M. Hariharan, Learning distinctive filters for covid-19 detection from chest x-ray using shuffled residual cnn, Applied Soft Computing 99 (2021) 106744. doi:10.1016/j.asoc.2020.106744.
[24] P. Gupta, M. K. Siddiqui, X. Huang, R. Morales-Menendez, H. Pawar, H. Terashima-Marin, M. S. Wajid, Covid-widenet—a capsule network for covid-19 detection, Applied Soft Computing 122 (2022) 108780. doi:10.1016/j.asoc.2022.108780.
[25] M. A. Al-Masni, M. A. Al-Antari, J.-M. Park, G. Gi, T.-Y. Kim, P. Rivera, E. Valarezo, M.-T. Choi, S.-M. Han, T.-S. Kim, Simultaneous detection and classification of breast masses in digital mammograms via a deep learning yolo-based cad system, Computer Methods and Programs in Biomedicine: An International Journal Devoted to the Development, Implementation and Exchange of Computing Methodology and Software Systems in Biomedical Research and Medical Practice 157 (2018) 85–94. doi:10.1016/j.cmpb.2018.01.017.
[26] L. Ma, F. Zhang, End-to-end predictive intelligence diagnosis in brain tumor using lightweight neural network, Applied Soft Computing 111 (2021) 107666. doi:10.1016/j.asoc.2021.107666.
[27] X. Zhang, Y. Wang, C.-T. Cheng, L. Lu, J. Xiao, C.-H. Liao, S. Miao, A new window loss function for bone fracture detection and localization in x-ray images with point-based annotation, arXiv preprint arXiv:2012.04066 (2020).
[28] Y. Wang, K. Zheng, C.-T. Cheng, X.-Y. Zhou, Z. Zheng, J. Xiao, L. Lu, C.-H. Liao, S. Miao, Knowledge distillation with adaptive asymmetric label sharpening for semi-supervised fracture detection in chest x-rays, in: International Conference on Information Processing in Medical Imaging, Springer, 2021, pp. 599–610. doi:10.1007/978-3-030-78191-0_46.
[29] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, A. C. Berg, Ssd: Single shot multibox detector, European conference on computer vision (2016) 21–37.
[30] S.-W. Kim, H.-K. Kook, J.-Y. Sun, M.-C. Kang, S.-J. Ko, Parallel feature pyramid network for object detection, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 234–250.
[31] S. Laine, T. Aila, Temporal ensembling for semi-supervised learning (2016). doi:10.48550/arXiv.1610.02242Focustolearnmore.
[32] A. Tarvainen, H. Valpola, Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results, Advances in neural information processing systems 30 (2017).
[33] D. Berthelot, N. Carlini, E. D. Cubuk, A. Kurakin, C. Raffel, Remixmatch: Semi-supervised learning with distribution alignment and augmentation anchoring, arXiv preprint arXiv:1911.09785 (2019). doi:10.48550/arXiv.1911.09785.
[34] K. Sohn, D. Berthelot, N. Carlini, Z. Zhang, H. Zhang, C. A. Raffel, E. D. Cubuk, A. Kurakin, C.-L. Li, Fixmatch: Simplifying semi-supervised learning with consistency and confidence, Advances in Neural Information Processing Systems 33 (2020) 596–608.
[35] F. Pourpanah, D. Wang, R. Wang, C. P. Lim, A semisupervised learning model based on fuzzy min–max neural networks for data classification, Applied Soft Computing 112 (2021) 107856. doi:10.1016/j.asoc.2021.107856.
[36] J. Jeong, S. Lee, J. Kim, N. Kwak, Consistency-based semi-supervised learning for object detection, Advances in neural information processing systems 32 (2019).
[37] B. Zoph, E. D. Cubuk, G. Ghiasi, T.-Y. Lin, J. Shlens, Q. V. Le, Learning data augmentation strategies for object detection, in: European conference on computer vision, Springer, 2020, pp. 566–583. doi:10.1007/978-3-030-58583-9_34.
[38] Q. Yang, X. Wei, B. Wang, X.-S. Hua, L. Zhang, Interactive self-training with mean teachers for semi-supervised object detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 5941–5950. doi:10.1109/CVPR46437.2021.00588.
[39] Z. Wang, Y. Li, Y. Guo, L. Fang, S. Wang, Data-uncertainty guided multi-phase learning for semi-supervised object detection, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 4568–4577. doi:10.1109/CVPR46437.2021.00454.
[40] Y. Zhang, X. Yao, C. Liu, F. Chen, X. Song, T. Xing, R. Hu, H. Chai, P. Xu, G. Zhang, S4od: Semi-supervised learning for single-stage object detection, arXiv preprint arXiv:2204.04492 (2022). doi:10.48550/arXiv.2204.04492.
[41] Y. C. Liu, C. Y. Ma, Z. He, C. W. Kuo, P. Vajda, Unbiased teacher for semi-supervised object detection, arXiv preprint arXiv:2102.09480 (2021). doi:10.48550/arXiv.2102.09480.
[42] Q. Zhou, C. Yu, Z. Wang, Q. Qian, H. Li, Instant-teaching: An end-to-end semi-supervised object detection framework, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 4081–4090. doi:10.1109/CVPR46437.2021.00407.
[43] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, Y. Wei, Deformable convolutional networks, in: Proceedings of the IEEE international conference on computer vision, 2017, pp. 764–773. doi:10.1109/ICCV.2017.89.
[44] X. Zhu, H. Hu, S. Lin, J. Dai, Deformable convnets v2: More deformable, better results, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 9300–9308. doi:10.1109/CVPR.2019.00953.
[45] B. Guan, J. Yao, G. Zhang, X. Wang, Thigh fracture detection using deep learning method based on new dilated convolutional feature pyramid network, Pattern Recognition Letters 125 (2019) 521–526. doi:10.1016/j.patrec.2019.06.015.
[46] M. Wang, J. Yao, G. Zhang, B. Guan, X. Wang, Y. Zhang, Parallelnet: multiple backbone network for detection tasks on thigh bone fracture, Multimedia Systems 27 (6) (2021) 1091–1100. doi:10.1007/s00530-021-00783-9.
[47] K. Chen, J. Wang, J. Pang, Y. Cao, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Xu, et al., Mmdetection: Open mmlab detection toolbox and benchmark, arXiv preprint arXiv:1906.07155 (2019). doi:10.48550/arXiv.1906.07155.