
1 Qingdao Huajing Biotechnology CO., LTD
2 Ocean University of China, Qingdao, China
3 Qingdao University of Science and Technology, Qingdao, China
4 ETH Zürich, Zürich, Switzerland
5 Inception Institute of Artificial Intelligence, Abu Dhabi, UAE

Trichomonas Vaginalis Segmentation in Microscope Images

Lin Li 1,2 · Jingyi Liu 3 · Shuo Wang 4 · Xunkun Wang 1 · Tian-Zhu Xiang 5
Abstract

Trichomoniasis is a common infectious disease with high incidence caused by the parasite Trichomonas vaginalis, which increases the risk of HIV infection if left untreated. Automated detection of Trichomonas vaginalis from microscopic images can provide vital information for the diagnosis of trichomoniasis. However, accurate Trichomonas vaginalis segmentation (TVS) is a challenging task due to the high appearance similarity between Trichomonas and other cells (e.g., leukocytes), the large appearance variation caused by their motility, and, most importantly, the lack of large-scale annotated data for deep model training. To address these challenges, we elaborately collected the first large-scale Microscopic Image dataset of Trichomonas Vaginalis, named TVMI3K, which consists of 3,158 images covering Trichomonas of various appearances in diverse backgrounds, with high-quality annotations including object-level mask labels, object boundaries and challenging attributes. Besides, we propose a simple yet effective baseline, termed TVNet, to automatically segment Trichomonas from microscopic images, built on high-resolution fusion and foreground-background attention modules. Extensive experiments demonstrate that our model achieves superior segmentation performance and outperforms various cutting-edge segmentation models both quantitatively and qualitatively, making it a promising framework to promote future research on the TVS task.

Keywords:
Segmentation · Microscope Images · Trichomoniasis
✉ Co-corresponding authors ([email protected], [email protected]).
First Author: Lin Li ([email protected]).

1 Introduction

Trichomoniasis (or “trich”), caused by infection with a motile, flagellated protozoan parasite called Trichomonas vaginalis (TV), is likely the most common non-viral sexually transmitted infection (STI) worldwide. According to statistics, there are more than 160 million new cases of trichomoniasis in the world each year, with similar incidence in males and females [10, 27]. A number of studies have shown that Trichomonas vaginalis infection is associated with an increased risk of infection with several other STIs, including human papillomavirus (HPV) and human immunodeficiency virus (HIV) [32]. The high prevalence of Trichomonas vaginalis infection globally and the frequency of co-infection with other STIs make trichomoniasis a compelling public health concern.

Automatic Trichomonas vaginalis segmentation (TVS) is crucial to the diagnosis of trichomoniasis. Recently, deep learning methods have been widely used for medical image segmentation [12] and have made significant progress, for example in brain region and tumor segmentation [11, 36], liver and tumor segmentation [38, 26], polyp segmentation [1, 7, 14], lung infection segmentation [8, 19] and cell segmentation [23, 18, 35]. Most of these methods are based on the encoder-decoder framework, such as U-Net [23] and its variants [24] (e.g., U-Net++ [38], UNet 3+ [13]) and PraNet [7], or are inspired by commonly used natural image segmentation models, e.g., fully convolutional networks (FCN) [36] and DeepLab [26]. These works have shown great potential in segmenting various organs and lesions from different medical imaging modalities. To our knowledge, however, deep learning techniques have not yet been well studied or applied to TVS in microscope images, due to three key factors: 1) The large variation in morphology (e.g., size, appearance and shape) of Trichomonas is challenging for detection. Besides, Trichomonas are often captured out of focus (blurred appearance) or under occlusion due to their motility, which aggravates the difficulty of accurate segmentation. 2) The high appearance similarity between Trichomonas and other cells (e.g., leukocytes) makes them easy to confuse with their complex surroundings. Most importantly, 3) the lack of large-scale annotated data restricts the performance of deep models that rely on sufficient training data, thereby hindering further research in this field. These factors also highlight the clear differences between segmenting Trichomonas in our microscope images and segmenting conventional cells (e.g., HeLa cells [23] and blood cells [16]). Furthermore, we note that Wang et al. [28] recently proposed a two-stage model for video-based Trichomonas vaginalis detection, which utilizes video motion cues (e.g., optical flow) to greatly reduce the detection difficulty. In contrast, our work focuses on image-based Trichomonas vaginalis segmentation without motion information, which increases the difficulty of the task; to the best of our knowledge, this setting has not been explored before. Hence, accurate TV segmentation remains a challenging and under-explored task.

To address the above issues, we first elaborately construct a novel large-scale microscope image dataset exclusively designed for Trichomonas vaginalis segmentation, named TVMI3K. Moreover, we develop a simple but effective deep neural network, termed TVNet, for TVS. In a nutshell, our main contributions are threefold: (1) We carefully collect TVMI3K, a large-scale dataset for TVS, which consists of 3,158 microscopic images covering Trichomonas of various appearances in diverse backgrounds, with high-quality annotations of object-level labels, object boundaries and challenging attributes. To our knowledge, this is the first large-scale dataset for TVS, and it can serve as a catalyst for further research in this field in the deep learning era. (2) We propose a novel deep neural network, termed TVNet, which enhances high-level feature representations with edge cues in a high-resolution fusion (HRF) module and then excavates object-critical semantics with a foreground-background attention (FBA) module under the guidance of a coarse location map for accurate prediction. (3) Extensive experiments show that our method achieves superior performance and outperforms various cutting-edge segmentation models both quantitatively and qualitatively, making it a promising solution to the TVS task. The dataset, results and models will be publicly available at: https://github.com/CellRecog/cellRecog.

Figure 1: Examples from our proposed TVMI3K. We provide different annotations, including object-level masks, object edges and challenging attributes. We use red boxes to mark Trichomonas and green boxes to mark leukocytes for better visualization. Leukocytes show high similarity to Trichomonas.

2 Proposed Dataset

To facilitate the research of TVS in the deep learning era, we develop the TVMI3K dataset, which is carefully collected to cover TV of various appearances in diverse challenging surroundings, e.g., large morphological variation, occlusion and background distractions. Examples are shown in Fig. 1.

2.1 Data Collection

To construct the dataset, we first collect 80 videos of Trichomonas samples with a resolution of 3088×2064 from more than 20 cases over seven months. Phase-contrast microscopy is adopted for video collection, which captures samples clearly, even unstained cells, making it more suitable for sample imaging and microscopy evaluation than ordinary light microscopy. We then extract images from these videos to build our TVMI3K dataset, which finally contains 3,158 microscopic images (2,524 Trichomonas and 634 background images). We find that videos are beneficial for data annotation, because objects can be identified and annotated accurately according to the motion of Trichomonas; thus, even unfocused objects can be labeled reliably. Besides, to avoid data selection bias, we collect 634 background images to enhance the generalization ability of models. It should be noted that images are collected independently by the microscope devices, and we do not collect any patient information; thus, the dataset is free from copyright and royalty issues.
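The images are extracted from the collected sample videos. The snippet below is a minimal, hypothetical sketch of such a frame-extraction step using OpenCV; the file layout and the sampling stride are illustrative assumptions, not the authors' actual pipeline.

```python
# Hypothetical frame extraction from a 3088x2064 sample video (sketch only).
import os
import cv2

def extract_frames(video_path: str, out_dir: str, stride: int = 30) -> int:
    """Save every `stride`-th frame of the video as a PNG image; return the count."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    saved, idx = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % stride == 0:
            cv2.imwrite(os.path.join(out_dir, f"frame_{idx:06d}.png"), frame)
            saved += 1
        idx += 1
    cap.release()
    return saved
```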

High-quality annotations are critical for deep model training [17]. During the labeling process, five professional annotators are divided into two groups for annotation, and cross-validation is conducted between the groups to guarantee annotation quality. We label each image with accurate object-level masks, object edges and challenging attributes, e.g., occlusions and complex shapes. Attribute descriptions are given in Tab. 1.

Table 1: Attribute Descriptions of Our TVMI3K Dataset
Level | Attr. | Description
Image-level | MO | Multiple Objects. Number of objects in each image ≥ 2
Image-level | SO | Small Objects. The ratio of object area to image area ≤ 0.1
Image-level | OV | Out of View. Incomplete objects clipped by the image boundary
Object-level | CS | Complex Shape. Diverse shapes with tiny parts (e.g., flagella)
Object-level | OC | Occlusions. Object is partially obscured by its surroundings
Object-level | OF | Out-of-Focus. Ghosting due to poor focus
Object-level | SQ | Squeeze. The object appearance changes when squeezed

2.2 Dataset Features

• Image-level Attributes. As listed in Tab. 1, our data is collected with several image-level attributes, i.e., multiple objects (MO), small objects (SO) and out-of-view (OV), which are mostly caused by Trichomonas size, shooting distance and shooting range. On average, each image contains 3 objects, with up to 17 objects in a single image. The object size distribution ranges from 0.029% to 1.179% of the image area, with an average of 0.188%, indicating that TVMI3K is a dataset for tiny object detection.

• Object-level Attributes. Trichomonas objects show various attributes, including complex shapes (CS), occlusions (OC), out-of-focus (OF) and squeeze (SQ), which are mainly caused by the growth and motion of Trichomonas, high background distraction (e.g., other cells) and the shaking of acquisition devices. These attributes lead to large morphological differences in Trichomonas, which increase the difficulty of detection. In addition, there is a high similarity in appearance between Trichomonas and leukocytes, which can easily confuse detectors into false detections. Examples and details of the object-level attributes are shown in Fig. 1 and Tab. 1, respectively.

• Dataset Splits. To provide a large amount of training data for deep models, we select 60% of the videos as the training set and 40% as the test set. To be consistent with practical applications, we split the data according to the chronological order of the sample videos from each case instead of selecting randomly; that is, the data collected earlier for each case is used for training and the data collected later is used for testing. The dataset is thus split into 2,305 images for training and 853 images for testing. Note that the 290 background images in the test set are not included in our test experiments.
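The following is a minimal sketch of the chronological per-case split described above, assuming the videos are available as (case_id, timestamp, path) tuples; the tuple layout and the 60/40 cut are the only assumptions beyond the text.

```python
# Chronological per-case split: earlier recordings -> train, later -> test (sketch).
from collections import defaultdict

def chronological_split(videos, train_ratio=0.6):
    """videos: list of (case_id, timestamp, video_path); returns (train, test) path lists."""
    by_case = defaultdict(list)
    for case_id, ts, path in videos:
        by_case[case_id].append((ts, path))
    train, test = [], []
    for case_id, items in by_case.items():
        items.sort()                              # chronological order within each case
        cut = int(len(items) * train_ratio)
        train += [p for _, p in items[:cut]]      # earlier-collected data
        test += [p for _, p in items[cut:]]       # later-collected data
    return train, test
```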

3 Method

Figure 2: Overview of our proposed TVNet, which consists of a high-resolution fusion (HRF) module and a foreground-background attention (FBA) module. See Sect. 3 for details.

3.1 Overview

Fig. 2 illustrates the overall architecture of the proposed TVNet. Specifically, for an input Trichomonas image $I$, the backbone network Res2Net [9] is adopted to extract five levels of features $\{f_i\}_{i=1}^{5}$. We then further exploit the high-level features (i.e., $f_3\sim f_5$ in our model) to effectively learn object feature representations and produce object predictions. As shown in Fig. 2, we first design a high-resolution fusion (HRF) module to enhance feature representation by integrating high-resolution edge features, yielding three refined features $\{f_i'\}_{i=3}^{5}$. Next, we introduce the neighbor connection decoder (NCD) [5], a cutting-edge decoder that improves the partial decoder [33] with neighbor connections, to aggregate these three refined features and generate the initial prediction map $P_6$. It captures the relatively coarse location of objects so as to guide the subsequent object prediction. Specifically, $\{P_i\}_{i=3}^{5}$ denotes the prediction map of the $i$-th layer, which corresponds to the $i$-th feature in $\{f_i'\}_{i=3}^{5}$. Finally, we propose a foreground-background attention (FBA) module to excavate object-critical cues by exploring semantic differences between foreground and background, and then accurately predict object masks in a progressive manner.
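For clarity, the data flow described above can be summarized as a schematic PyTorch sketch (not the authors' code); the module interfaces, tensor shapes and the way the Res2Net features are returned are assumptions, and the HRF/FBA internals are sketched in Sects. 3.2 and 3.3.

```python
# Schematic forward pass of TVNet: backbone -> HRF -> NCD (coarse map) -> cascaded FBA.
import torch.nn as nn

class TVNetSketch(nn.Module):
    def __init__(self, backbone, hrf3, hrf4, hrf5, ncd, fba3, fba4, fba5):
        super().__init__()
        self.backbone = backbone                  # Res2Net feature extractor, returns f1..f5
        self.hrf = nn.ModuleList([hrf3, hrf4, hrf5])
        self.ncd = ncd                            # neighbor connection decoder [5]
        self.fba = nn.ModuleList([fba3, fba4, fba5])

    def forward(self, image):
        f1, f2, f3, f4, f5 = self.backbone(image)
        # HRF: refine high-level features with low-level edge cues from f2
        r3, r4, r5 = [h(f, f2) for h, f in zip(self.hrf, (f3, f4, f5))]
        p6 = self.ncd(r3, r4, r5)                 # coarse location map P_6
        # FBA: progressively refine predictions from deep to shallow layers
        p5 = self.fba[2](r5, p6)
        p4 = self.fba[1](r4, p5)
        p3 = self.fba[0](r3, p4)
        return p3, p4, p5, p6                     # all maps are supervised during training
```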

3.2 High-resolution Fusion Module

With the deepening of the neural network, local features such as textures and edges are gradually diluted, which may reduce the model's capability to learn object structure and boundaries. Considering the small receptive field and the large amount of redundant information in the $f_1$ feature, we design a high-resolution fusion (HRF) module that adopts the low-level $f_2$ feature to supplement local details for the high-level semantic features and boost feature extraction for segmentation. To force the model to focus more on object edge information, we first feed the feature $f_2$ into a $3\times 3$ convolutional layer to explicitly model object boundaries under ground-truth edge supervision. Then we integrate the low-level feature $f_2$ with the high-level features ($f_3\sim f_5$) and enhance the representation with channel and spatial attention operations [31, 2], denoted as:

$$
\left\{\begin{aligned}
\tilde{f}_i &= f_i + \mathcal{G}\big(Cat(f_i;\ \mathcal{G}(\delta_{\downarrow}^{2}(f_2)))\big), \quad i\in\{3,\dots,5\},\\
\tilde{f}_i' &= M_c(\tilde{f}_i)\otimes\tilde{f}_i,\\
f_i' &= M_s(\tilde{f}_i')\otimes\tilde{f}_i',
\end{aligned}\right.
\qquad (1)
$$

where $\delta_{\downarrow}^{2}(\cdot)$ denotes a $\times 2$ down-sampling operation, $\mathcal{G}(\cdot)$ is a $3\times 3$ convolution layer, and $Cat(\cdot)$ is the concatenation operation. $M_c(\cdot)$ and $M_s(\cdot)$ represent channel attention and spatial attention, respectively. $\otimes$ is the element-wise multiplication operation. After that, a $1\times 1$ convolution is used to adjust the number of channels so that each HRF produces outputs with consistent channel dimensions.
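A minimal PyTorch sketch of the HRF module following Eq. (1) is given below. The channel sizes, the CBAM-style attention blocks and the omitted one-channel edge prediction head are illustrative assumptions rather than the authors' exact implementation.

```python
# HRF sketch: fuse low-level edge feature f2 into a high-level feature f_i (Eq. 1).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):                 # M_c, CBAM-style [31]
    def __init__(self, ch, r=16):
        super().__init__()
        self.mlp = nn.Sequential(nn.Conv2d(ch, ch // r, 1), nn.ReLU(inplace=True),
                                 nn.Conv2d(ch // r, ch, 1))
    def forward(self, x):
        avg = self.mlp(F.adaptive_avg_pool2d(x, 1))
        mx = self.mlp(F.adaptive_max_pool2d(x, 1))
        return torch.sigmoid(avg + mx)

class SpatialAttention(nn.Module):                 # M_s, CBAM-style [31]
    def __init__(self, k=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, k, padding=k // 2)
    def forward(self, x):
        s = torch.cat([x.mean(1, keepdim=True), x.max(1, keepdim=True)[0]], dim=1)
        return torch.sigmoid(self.conv(s))

class HRF(nn.Module):
    def __init__(self, ch_high, ch_low, ch_out=64):
        super().__init__()
        self.edge_conv = nn.Conv2d(ch_low, ch_low, 3, padding=1)   # 3x3 conv on f2 (edge-supervised branch)
        self.fuse = nn.Conv2d(ch_high + ch_low, ch_high, 3, padding=1)
        self.ca, self.sa = ChannelAttention(ch_high), SpatialAttention()
        self.out = nn.Conv2d(ch_high, ch_out, 1)                   # 1x1 conv to unify channels

    def forward(self, f_i, f_2):
        e = self.edge_conv(f_2)                                    # edge feature (GT-edge supervision head omitted)
        e = F.interpolate(e, size=f_i.shape[2:], mode='bilinear', align_corners=False)
        f = f_i + self.fuse(torch.cat([f_i, e], dim=1))            # first row of Eq. (1)
        f = self.ca(f) * f                                         # channel attention
        f = self.sa(f) * f                                         # spatial attention
        return self.out(f)
```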

3.3 Foreground-Background Attention Module

To produce accurate predictions from the fused features ($f_3'\sim f_5'$) progressively in the decoder, we devise a foreground-background attention (FBA) module to filter and enhance object-related features using region-sensitive maps derived from the coarse prediction of the previous layer. For the fused feature $\{f_i'\}_{i=3}^{5}$, we first decompose the previous prediction $P_{i+1}$ into three regions, i.e., a strong foreground region ($\mathcal{F}_{i+1}^{1}$), a weak foreground region ($\mathcal{F}_{i+1}^{2}$) and a background region ($\mathcal{F}_{i+1}^{3}$); the decomposition process follows [25]. Then each region is normalized into [0, 1] as a region-sensitive map to extract the corresponding features from $f_i'$. Note that $\mathcal{F}^{1}$ provides the location information, $\mathcal{F}^{2}$ contains the object edge/boundary information, and $\mathcal{F}^{3}$ denotes the remaining region.
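Since the paper defers the decomposition details to [25], the sketch below only illustrates the idea with assumed soft thresholds: a coarse prediction is split into strong-foreground, weak-foreground (boundary-like) and background sensitivity maps, each normalized to [0, 1].

```python
# Illustrative tri-region decomposition of a coarse prediction map (assumed thresholds).
import torch

def decompose(pred_logits, lo=0.3, hi=0.7):
    p = torch.sigmoid(pred_logits)                      # coarse prediction in [0, 1]
    strong_fg = torch.clamp((p - hi) / (1 - hi), 0, 1)  # confident object interior
    background = torch.clamp((lo - p) / lo, 0, 1)       # confident background
    weak_fg = 1 - strong_fg - background                # uncertain, boundary-like band
    return strong_fg, weak_fg, background
```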

After that, the region-sensitive features are extracted from the fused feature by element-wise multiplication with the up-sampled region-sensitive maps, followed by a $3\times 3$ convolution layer. Next, these features are aggregated by element-wise summation with a residual connection, denoted as:

$$
f_i^{out} = \sum_{k=1}^{3}\mathcal{G}\big(\delta_{\uparrow}^{2}(\mathcal{F}_{i+1}^{k})\otimes f_i'\big) + f_i',
\qquad (2)
$$

where $\delta_{\uparrow}^{2}(\cdot)$ denotes a $\times 2$ up-sampling operation. Inspired by [5], FBA can be cascaded multiple times to gradually refine the prediction; please refer to the supplementary material for more details. In this way, FBA differentially handles regions with different properties and explores region-sensitive features, thereby strengthening foreground features and reducing background distractions.
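A minimal sketch of the FBA module following Eq. (2) is shown below, reusing the decompose() helper sketched above; the channel count and the prediction head that produces the next-stage map are assumptions for illustration.

```python
# FBA sketch: region-sensitive feature enhancement guided by the coarser prediction (Eq. 2).
import torch.nn as nn
import torch.nn.functional as F

class FBA(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.convs = nn.ModuleList([nn.Conv2d(ch, ch, 3, padding=1) for _ in range(3)])
        self.pred = nn.Conv2d(ch, 1, 3, padding=1)       # assumed head producing P_i for the next stage

    def forward(self, f_i, p_next):
        # upsample the coarser prediction P_{i+1} to the resolution of f_i
        p_next = F.interpolate(p_next, size=f_i.shape[2:], mode='bilinear', align_corners=False)
        regions = decompose(p_next)                      # strong fg / weak fg / background maps
        out = f_i + sum(conv(r * f_i) for conv, r in zip(self.convs, regions))  # Eq. (2)
        return self.pred(out)                            # refined prediction P_i (logits)
```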

4 Experiments and Results

4.1 Experimental Settings

Baselines and Metrics. We compare our TVNet with 9 state-of-the-art medical/natural image segmentation methods, including U-Net++ [39], SCRN [34], U2Net [22], F3Net [30], PraNet [7], SINet [6], MSNet [37], SANet [29] and SINet-v2 [5]. We collect the source codes of these models and re-train them on our proposed dataset. We adopt 7 metrics for quantitative evaluation using the toolboxes provided by [7] and [29]: the structural similarity measure ($S_\alpha$, $\alpha=0.5$) [3], the enhanced alignment measure ($E_\phi^{max}$) [4], the weighted and mean $F_\beta$ measures ($F_\beta^{w}$ and $F_\beta^{mean}$) [20], the mean absolute error (MAE, $\mathcal{M}$) [21], the Sørensen-Dice coefficient (mean Dice, mDice) and the intersection-over-union (mean IoU, mIoU) [7].
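For reference, three of these metrics (MAE, mean Dice, mean IoU) can be computed as in the small sketch below; $S_\alpha$, $E_\phi^{max}$ and the $F_\beta$ variants come from the evaluation toolboxes of [7] and [29] and are not re-implemented here. The binarization threshold is an assumption.

```python
# Per-image MAE, Dice and IoU for a predicted map and a binary ground-truth mask.
import numpy as np

def mae(pred, gt):
    return np.abs(pred.astype(np.float64) - gt.astype(np.float64)).mean()

def dice_iou(pred, gt, thr=0.5, eps=1e-8):
    p = (pred >= thr).astype(np.float64)     # binarize the prediction
    g = (gt >= 0.5).astype(np.float64)
    inter = (p * g).sum()
    dice = (2 * inter + eps) / (p.sum() + g.sum() + eps)
    iou = (inter + eps) / (p.sum() + g.sum() - inter + eps)
    return dice, iou
```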

Training Protocols. We adopt the standard binary cross-entropy (BCE) loss for edge supervision, and the weighted IoU loss [30] and weighted BCE loss [30] for object mask supervision. During training, the batch size is set to 20. The network parameters are optimized with the Adam optimizer [15] using an initial learning rate of 0.05, a momentum of 0.9 and a weight decay of 5e-4. Each image is resized to 352×352 for network input. The whole training takes about 2 hours for 50 epochs on an NVIDIA GeForce RTX 2080Ti GPU.
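The object-mask loss combines the weighted BCE and weighted IoU terms from F3Net [30] (edge maps use plain BCE). The sketch below follows the publicly available F3Net-style formulation and is an assumption about the exact weighting, not code taken from this paper.

```python
# Weighted BCE + weighted IoU ("structure") loss in the style of F3Net [30].
import torch
import torch.nn.functional as F

def structure_loss(pred, mask):
    # boundary-aware weight: larger near object edges (local contrast of the mask)
    weit = 1 + 5 * torch.abs(F.avg_pool2d(mask, 31, stride=1, padding=15) - mask)
    wbce = F.binary_cross_entropy_with_logits(pred, mask, reduction='none')
    wbce = (weit * wbce).sum(dim=(2, 3)) / weit.sum(dim=(2, 3))

    pred = torch.sigmoid(pred)
    inter = ((pred * mask) * weit).sum(dim=(2, 3))
    union = ((pred + mask) * weit).sum(dim=(2, 3))
    wiou = 1 - (inter + 1) / (union - inter + 1)
    return (wbce + wiou).mean()
```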

Table 2: Quantitative comparison on our TVMI3K dataset. “$\uparrow$” indicates that a higher score is better; “$\downarrow$” indicates that a lower score is better.
Methods Pub. $S_\alpha\uparrow$ $E_\phi^{max}\uparrow$ $F_\beta^{w}\uparrow$ $F_\beta^{mean}\uparrow$ $\mathcal{M}\downarrow$ mDice$\uparrow$ mIoU$\uparrow$
UNet++ [39] TMI19 0.524 0.731 0.069 0.053 0.006 0.004 0.003
SCRN [34] ICCV19 0.567 0.789 0.145 0.254 0.011 0.201 0.135
U2Net [22] PR20 0.607 0.845 0.209 0.332 0.013 0.301 0.209
F3Net [30] AAAI20 0.637 0.809 0.320 0.377 0.005 0.369 0.265
PraNet [7] MICCAI20 0.623 0.792 0.300 0.369 0.006 0.328 0.230
SINet [6] CVPR20 0.492 0.752 0.010 0.113 0.104 0.095 0.061
MSNet [37] MICCAI21 0.626 0.786 0.321 0.378 0.005 0.366 0.268
SANet [29] MICCAI21 0.612 0.800 0.289 0.361 0.006 0.338 0.225
SINet-v2 [5] PAMI21 0.621 0.842 0.309 0.375 0.005 0.348 0.245
Ours - 0.635 0.851 0.343 0.401 0.004 0.376 0.276

4.2 Comparison with State-of-the-art

Quantitative comparison. Tab. 2 reports the quantitative comparison between our proposed model and the other competitors on our TVMI3K dataset. Our TVNet outperforms all competing methods on all metrics except $S_\alpha$, on which it is on par with the best competitor. In particular, our method achieves performance gains of 2.2% and 2.3% in terms of $F_\beta^{w}$ and $F_\beta^{mean}$, respectively. This suggests that our model is a strong baseline for TVS.

Qualitative comparison. Fig. 3 shows representative visual results of different methods. We can observe that TVNet accurately locates and segments Trichomonas objects under various challenging scenarios, including cluttered distracting objects, occlusion, varied shapes and similarity to other cells. In contrast, other methods often produce results with a considerable number of missed or false detections, or even fail entirely.

Figure 3: Visual comparison of different methods. Obviously, our method provides more accurate predictions than other competitors in various challenging scenarios.
Table 3: Ablation study for TVNet on the proposed TVMI3K dataset.
No. Backbone HRF FBA $S_\alpha\uparrow$ $F_\beta^{w}\uparrow$ $F_\beta^{mean}\uparrow$ $\mathcal{M}\downarrow$ mDice$\uparrow$ mIoU$\uparrow$
a ✓ – – 0.593 0.234 0.302 0.005 0.251 0.163
b ✓ ✓ – 0.619 0.291 0.357 0.005 0.328 0.228
c ✓ – ✓ 0.623 0.258 0.330 0.004 0.271 0.185
d ✓ ✓ ✓ 0.635 0.343 0.401 0.004 0.376 0.276

4.3 Ablation Study

Effectiveness of HRF. From Tab. 3, we observe that the HRF module brings a significant improvement over the baseline model, e.g., gains of 2.6%, 5.7%, 5.7%, 7.7% and 6.5% in the $S_\alpha$, $F_\beta^{w}$, $F_\beta^{mean}$, mDice and mIoU metrics, respectively. This shows that the fusion of local features is beneficial for object boundary localization and segmentation. Note that the explicit edge supervision encourages the model to focus more on object boundaries and enhances the details of the predictions.

Effectiveness of FBA. We further investigate the contribution of the FBA module. As can be seen in Tab. 3, FBA improves the segmentation performance by 3%, 2.4% and 2.8% in $S_\alpha$, $F_\beta^{w}$ and $F_\beta^{mean}$, respectively. FBA enables our model to excavate object-critical features and reduce background distractions, thus distinguishing TV objects accurately.

Effectiveness of HRF & FBA. From Tab. 3, the integration of HRF and FBA is generally better than the other settings (a–c). Compared with the baseline, the performance gains are 1.2%, 5.2% and 4.4% in $S_\alpha$, $F_\beta^{w}$ and $F_\beta^{mean}$, respectively. Besides, our TVNet outperforms other recently proposed models, making it an effective framework that can help boost future research in TVS.

Model Complexity. The number of parameters and FLOPs of the proposed model are about 155M and 98 GMac, respectively, indicating that there is room for further improvement, which is the focus of our future work.
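One common way to obtain such parameter/MAC figures is a third-party profiler; the sketch below uses the thop package with the 352×352 input size quoted above. The choice of thop is an assumption (the paper does not state which tool was used), and the numbers will vary with the exact model definition.

```python
# Count parameters (M) and multiply-accumulate operations (G) for any nn.Module.
import torch
from thop import profile  # pip install thop

def count_complexity(model: torch.nn.Module, size: int = 352):
    x = torch.randn(1, 3, size, size)
    flops, params = profile(model, inputs=(x,))   # thop reports MACs as "flops"
    return params / 1e6, flops / 1e9              # (M params, GMac)
```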

5 Conclusion

This paper provides the first investigation of deep-neural-network-based segmentation of Trichomonas vaginalis in microscope images. To this end, we collect a novel large-scale, challenging microscope image dataset of TV, called TVMI3K. We then propose a simple but effective baseline, TVNet, to accurately segment Trichomonas from microscope images. Extensive experiments demonstrate that our TVNet outperforms other approaches. We hope our study will offer the community an opportunity to explore more in this field.

References

  • [1] Brandao, P., Mazomenos, E., Ciuti, G., Caliò, R., Bianchi, F., Menciassi, A., et al.: Fully convolutional neural networks for polyp segmentation in colonoscopy. In: Medical Imaging: Computer-Aided Diagnosis. vol. 10134, pp. 101–107 (2017)
  • [2] Chen, L., Zhang, H., Xiao, J., Nie, L., Shao, J., Liu, W., Chua, T.S.: Sca-cnn: Spatial and channel-wise attention in convolutional networks for image captioning. In: IEEE CVPR. pp. 5659–5667 (2017)
  • [3] Fan, D.P., Cheng, M.M., Liu, Y., Li, T., Borji, A.: Structure-measure: A new way to evaluate foreground maps. In: IEEE ICCV. pp. 4548–4557 (2017)
  • [4] Fan, D.P., Gong, C., Cao, Y., Ren, B., Cheng, M.M., Borji, A.: Enhanced-alignment measure for binary foreground map evaluation. In: IJCAI. pp. 698–704 (2018)
  • [5] Fan, D.P., Ji, G.P., Cheng, M.M., Shao, L.: Concealed object detection. IEEE TPAMI pp. 1–1 (2021)
  • [6] Fan, D.P., Ji, G.P., Sun, G., Cheng, M.M., Shen, J., Shao, L.: Camouflaged object detection. In: IEEE CVPR. pp. 2777–2787 (2020)
  • [7] Fan, D.P., Ji, G.P., Zhou, T., Chen, G., Fu, H., Shen, J., Shao, L.: Pranet: Parallel reverse attention network for polyp segmentation. MICCAI pp. 263–273 (2020)
  • [8] Fan, D.P., Zhou, T., Ji, G.P., Zhou, Y., Chen, G., Fu, H., Shen, J., Shao, L.: Inf-net: Automatic covid-19 lung infection segmentation from ct images. IEEE TMI 39(8), 2626–2637 (2020)
  • [9] Gao, S.H., Cheng, M.M., Zhao, K., Zhang, X.Y., Yang, M.H., Torr, P.: Res2net: A new multi-scale backbone architecture. IEEE TPAMI 43(2), 652–662 (2019)
  • [10] Harp, D.F., Chowdhury, I.: Trichomoniasis: Evaluation to execution. Eur. J. Obstet. Gynecol. Reprod. Biol. 157(1),  3–9 (2011)
  • [11] Havaei, M., Davy, A., Warde-Farley, D., Biard, A., et al.: Brain tumor segmentation with deep neural networks. Medical Image Analysis 35, 18–31 (2017)
  • [12] Hesamian, M.H., Jia, W., He, X., Kennedy, P.: Deep learning techniques for medical image segmentation: achievements and challenges. Journal of digital imaging 32(4), 582–596 (2019)
  • [13] Huang, H., Lin, L., Tong, R., Hu, H., Zhang, Q., Iwamoto, Y., Han, X., Chen, Y.W., Wu, J.: Unet 3+: A full-scale connected unet for medical image segmentation. In: ICASSP. pp. 1055–1059 (2020)
  • [14] Ji, G.P., Chou, Y.C., Fan, D.P., Chen, G., Fu, H., Jha, D., Shao, L.: Progressively normalized self-attention network for video polyp segmentation. In: MICCAI. pp. 142–152. Springer (2021)
  • [15] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. In: ICLR (2015)
  • [16] Li, D., Tang, P., Zhang, R., Sun, C., Li, Y., Qian, J., Liang, Y., Yang, J., Zhang, L.: Robust blood cell image segmentation method based on neural ordinary differential equations. Computational and Mathematical Methods in Medicine 2021 (2021)
  • [17] Li, J., Zhu, G., Hua, C., Feng, M., Li, P., Lu, X., Song, J., Shen, P., et al.: A systematic collection of medical image datasets for deep learning. arXiv preprint arXiv:2106.12864 (2021)
  • [18] Li, L., Liu, J., Yu, F., Wang, X., Xiang, T.Z.: Mvdi25k: A large-scale dataset of microscopic vaginal discharge images. BenchCouncil Transactions on Benchmarks, Standards and Evaluations 1(1), 100008 (2021)
  • [19] Liu, J., Dong, B., Wang, S., Cui, H., Fan, D.P., Ma, J., Chen, G.: Covid-19 lung infection segmentation with a novel two-stage cross-domain transfer learning framework. Medical Image Analysis 74, 102205 (2021)
  • [20] Margolin, R., Zelnik-Manor, L., Tal, A.: How to evaluate foreground maps? In: IEEE CVPR. pp. 248–255 (2014)
  • [21] Perazzi, F., Krähenbühl, P., Pritch, Y., Hornung, A.: Saliency filters: Contrast based filtering for salient region detection. In: IEEE CVPR. pp. 733–740. IEEE (2012)
  • [22] Qin, X., Zhang, Z., Huang, C., Dehghan, M., et al.: U2-net: Going deeper with nested u-structure for salient object detection. Pattern Recognition 106, 107404 (2020)
  • [23] Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: MICCAI. pp. 234–241 (2015)
  • [24] Siddique, N., Paheding, S., Elkin, C.P., Devabhaktuni, V.: U-net and its variants for medical image segmentation: A review of theory and applications. IEEE Access pp. 82031–82057 (2021)
  • [25] Sun, P., Zhang, W., Wang, H., Li, S., Li, X.: Deep rgb-d saliency detection with depth-sensitive attention and automatic multi-modal fusion. In: IEEE CVPR. pp. 1407–1417 (2021)
  • [26] Tang, W., Zou, D., Yang, S., Shi, J., Dan, J., Song, G.: A two-stage approach for automatic liver segmentation with faster r-cnn and deeplab. Neural Computing and Applications 32(11), 6769–6778 (2020)
  • [27] Vos, T., Allen, C., Arora, M., Barber, R.M., Bhutta, Z.A., Brown, A., et al.: Global, regional, and national incidence, prevalence, and years lived with disability for 310 diseases and injuries, 1990–2015: a systematic analysis for the global burden of disease study 2015. The Lancet 388(10053), 1545–1602 (2016)
  • [28] Wang, X., Du, X., Liu, L., Ni, G., Zhang, J., Liu, J., Liu, Y.: Trichomonas vaginalis detection using two convolutional neural networks with encoder-decoder architecture. Applied Sciences 11(6),  2738 (2021)
  • [29] Wei, J., Hu, Y., Zhang, R., Li, Z., Zhou, S.K., Cui, S.: Shallow attention network for polyp segmentation. In: MICCAI. pp. 699–708 (2021)
  • [30] Wei, J., Wang, S., Huang, Q.: F3net: Fusion, feedback and focus for salient object detection. In: AAAI. pp. 12321–12328 (2020)
  • [31] Woo, S., Park, J., Lee, J.Y., Kweon, I.S.: Cbam: Convolutional block attention module. In: ECCV. pp. 3–19 (2018)
  • [32] Workowski, K.A.: Sexually transmitted infections and hiv: diagnosis and treatment. Topics in Antiviral Medicine 20(1),  11 (2012)
  • [33] Wu, Z., Su, L., Huang, Q.: Cascaded partial decoder for fast and accurate salient object detection. In: IEEE CVPR. pp. 3907–3916 (2019)
  • [34] Wu, Z., Su, L., Huang, Q.: Stacked cross refinement network for edge-aware salient object detection. In: IEEE ICCV. pp. 7263–7272 (2019)
  • [35] Zhang, Y., Higashita, R., Fu, H., Xu, Y., Zhang, Y., Liu, H., Zhang, J., Liu, J.: A multi-branch hybrid transformer network for corneal endothelial cell segmentation. In: MICCAI. pp. 99–108 (2021)
  • [36] Zhao, X., Wu, Y., Song, G., Li, Z., et al.: A deep learning model integrating fcnns and crfs for brain tumor segmentation. Medical Image Analysis 43, 98–111 (2018)
  • [37] Zhao, X., Zhang, L., Lu, H.: Automatic polyp segmentation via multi-scale subtraction network. In: MICCAI. pp. 120–130 (2021)
  • [38] Zhou, Z., Rahman Siddiquee, M.M., Tajbakhsh, N., Liang, J.: Unet++: A nested u-net architecture for medical image segmentation. In: DLMIA. pp. 3–11 (2018)
  • [39] Zhou, Z., Siddiquee, M.M.R., Tajbakhsh, N., Liang, J.: Unet++: Redesigning skip connections to exploit multiscale features in image segmentation. IEEE TMI pp. 1856–1867 (2019)

6 Appendix

6.1 More Details of TVMI3K Dataset

Figure 4: Examples of Object-level Attributes.
(a) Co-attribute Distribution
(b) Object Number & Size
Figure 5: (a) Co-attribute distribution table (left) and multiple dependencies of these attributes (right). (b) Distribution of Trichomonas population in each image (left) and distribution of the object area ratio (right).

6.2 The Details of FBAs

Figure 6: Cascaded form of multiple FBA modules. Left: single FBA. Right: cascaded FBAs. FBA can be cascaded multiple times to gradually refine predictions. (∼) denotes the Sigmoid function.

6.3 More Visual Comparisons

Figure 7: Visual comparison of different methods. We select image patches (red boxes) from input images to show the results more clearly (data attributes are also marked).