¹¹institutetext: Kyushu University, Fukuoka, Japan
¹¹email: [email protected]
¹¹email: [email protected]

Semi-supervised Cell Detection in Time-lapse Images Using Temporal Consistency

Kazuya Nishimura 11 Hyeonwoo Cho 11 Ryoma Bise 11

Abstract

Cell detection is the task of detecting the approximate positions of cell centroids from microscopy images. Recently, convolutional neural network-based approaches have achieved promising performance. However, these methods require a certain amount of annotation for each imaging condition. This annotation is a time-consuming and labor-intensive task. To overcome this problem, we propose a semi-supervised cell-detection method that effectively uses a time-lapse sequence with one labeled image and the other images unlabeled. First, we train a cell-detection network with a one-labeled image and estimate the unlabeled images with the trained network. We then select high-confidence positions from the estimations by tracking the detected cells from the the labeled frame to those far from it. Next, we generate pseudo-labels from the tracking results and train the network by using pseudo-labels. We evaluated our method for seven conditions of public datasets, and we achieved the best results relative to other semi-supervised methods. Our code is available at https://github.com/naivete5656/SCDTC

Keywords:

semi-supervised learning cell detection microscopy image.

1 Introduction

Non-invasive imaging techniques such as phase-contrast and differential interference contrast microscopy can capture images of cells without staining them. These techniques have been widely used for the long-term monitoring of living cells, in which hundreds of cells are captured as time-lapse images at short time intervals over days. Cell detection that detects the approximate positions of cell centroids from such microscopy images is a fundamental task for biomedical research [2, 8, 4, 32, 15, 17, 20, 8, 19, 5]. Time-lapse images include hundreds of cells, so manual analysis is time consuming. Therefore, there is significant demand for automated cell-detection tools.

Traditionally, image processing-based methods such as thresholding [16], level sets [24] have been proposed for cell detection [2, 3, 31, 27, 26]. Recently, convolutional neural network-based detection methods have performed well with a large amount of labeled data [4, 32, 15, 17, 20, 5]. For cell detection, heatmap-based methods have achieved promising results [15, 17, 20]. However, these methods require a certain amount of annotation for each imaging condition (e.g., type of microscopy, type of cell, and growth conditions). Preparing the necessary amount of annotation for each condition is a time-consuming and labor-intensive task.

To address this annotation problem, semi-supervised object-detection methods have been proposed mainly for bounding-box-based detection [7, 22, 21, 13, 28, 29, 1, 13, 11], and a few methods have been proposed based-on consistency learning [14, 6]. For example, Moskvyak et al. have proposed a consistency loss to train the network so that it consistently estimates the heatmap for data augmented by shifting or rotating [14]. These consistency-based methods assume that the labeled data are randomly sampled, i.e. labeled and unlabeled data have a similar distribution. However, this assumption often does not hold for time-lapse images of cell. In the time-lapse images, the cell density changes due to cell division, or the characteristics of cells may change due to the cultured condition (Fig. 1). As a result, the appearances of cells are different between the early frames and the later frames. Using consistency loss with an image as the label and the entire sequence as unlabeled data shows that the loss has an adversarial effect on training and does not work for the test data.

Refer to caption — Figure 1: Appearance of sequence images.

In this paper, we propose a semi-supervised cell-detection method that utilizes one time-lapse sequence, in which one image is labeled data and the other images are unlabeled data. Our method can improve the performance of the detection network by finding reliable detection positions from the unlabeled data using the labeled image as a clue. First, we train a detection network $f_{d}$ with a labeled image. As shown in Fig. 1, if we capture time-lapse images with short intervals, cells in frames close to the labeled frame have a similar appearance to the labeled-frame cells because the appearances of cells change gradually. Thus, the trained network can perform an accurate estimation for these frames close to the labeled image. In contrast, the estimation for the later frames often does not perform well since the cell appearance changes from the labeled frame due to the increase of the density, and stimulation from culture conditions. On the basis of this observation, we track cells using estimated results from the labeled frame to the further frames, and the consistently tracked cells are considered to be the reliable estimations. We use tracking results as pseudo-labels for retraining the detection CNN. By iteratively performing this process, our method can improve the detection CNN. Our main contributions are summarized as follows:

•

We propose a semi-supervised cell-detection method that can effectively train a detection network using a time-lapse sequence that contains one labeled image with the remaining images unlabeled. We improve the network by gradually adding pseudo-labels from the nearby frames.
•

We propose a pseudo labeling method for heat map-based cell detection by using tracking. We generate pseudo-heatmaps from high-confidence detection results that selected by tracking.
•

We demonstrated the effectiveness of our method for seven different conditions and demonstrated that our method can improve the detection network with one labeled image for various conditions.

1.0.1 Related work:

To address this annotation problem, semi-supervised object-detection methods have been proposed for mainly general images [7, 22, 21, 13, 28, 29, 1, 6, 13]. These methods can be divided into two groups.

The first group is consistency-based semi-supervised object detection [7, 22, 14, 6]. As mentioned in the introduction, the consistency-based methods assume that the labeled image is randomly sampling. Therefore, it does not work in our setting.

The second group is pseudo-labeling-based semi-supervised method [28, 29, 30, 1, 13, 12]. The main approach of pseudo-labeling first trains the model on a small amount of data, and samples with high confidence are selected from the estimation results. Next, the model is trained using the selected samples. The performance of the model is improved by iterative labeling and learning. However, these methods have proposed a semi-supervised object detection method for a bounding box-based detection model. The cost of annotating a bounding box of a cell is expensive since a cell has a deformed shape with blurry boundaries. Therefore, the heatmap prediction is more suitable for cell detection tasks. As semi-supervised learning for heatmap prediction, Gberta et al. [1] have proposed video pose propagation for sparsely annotated sequence. By warping a heatmap from a labeled image to an unlabeled image, the method generates a pseudo label. However, this method requires a certain pair of labeled and unlabeled images for training of the warping network, i.e., it requires a certain labeled images in a sequence.

Unlike these methods, our method improves the detection performance with a single labeled image by gradually increasing the number of reliable pseudo-labels with tracking.

2 Semi-supervised cell detection

An overview of the proposed method is shown in Fig. 2. Given one sequence $\mathcal{X}=\{x_{t}\}_{t=1}^{T}$ , in which there is one labeled image $\mathcal{X}_{l}=\{(x_{l},y_{l})\}$ and the other images are unlabeled, our method improves the detection network $f_{d}$ . $T$ is the number of frames in the sequence. (1) The cell-detection network $f_{d}$ is trained with a labeled image $\mathcal{X}_{l}$ . (2) The prediction results $\mathcal{\hat{Y}}$ are obtained by the trained detection network $f_{d}$ from the whole sequence $\mathcal{X}$ . (3) The detection results are tracked from the labeled frame $l$ to the frames far from it. (4) We generate pseudo-heatmaps $y^{p}_{t}$ and masks $M_{t}$ for each frame based-on the tracking result, and we get pseudo labeled images $\mathcal{X}_{p}=\{(x_{t},y_{t}^{p},m_{t})\}_{t=1}^{b}$ . The mask $M_{t}$ is used on the next training step for mitigating the effect of untracked cells and mitosis cells. (5) The network $f_{d}$ is trained with generated pseudo labels $\mathcal{X}_{p}$ . We improve performance by iteratively performing pseudo-labeling and learning.

Cell detection: For cell detection, we use the heatmap-based detection method [15]. Given an input image $x_{t}$ , the network output a heatmap $\hat{y}_{t}$ that is generated by blurring the approximate cell centroids. The network is trained by the mean squared error loss between the output $\hat{y}_{l}$ and the ground-truth of the heatmap $y_{l}$ . After training, the cell position can be determined by finding the peak of the estimated heatmap. The detected positions for each frame $\mathcal{P}=\{\vec{p_{t}}\}_{t=1}^{T}$ are determined by detecting the peaks of the network output $\hat{\mathcal{Y}}=\{\hat{y}_{t}\}^{T}_{t=1}$ , where $\hat{y}_{t}=f_{d}(x_{t})$ .

Pseudo-labeling with tracking: Next, we select high-confidence positions from detected positions $\mathcal{P}$ based on tracking, and generate pseudo-heatmaps using the selected high-confidence positions. Our key assumption is that reliable estimations can be found by tracking. If detected positions in unlabeled frames can be continuously tracked from the labeled frame, we consider these as reliable estimations. The other assumption is that the additional training using reliable pseudo-labels can improve $f_{d}$ , and the reliable estimations increase in the next iteration by the re-trained network, which can be used as the additional pseudo-labels for further re-training. The network gradually improves by iterating this process.

The detection points of successive frames are associated by using a one-by-one matching [9], which optimizes the assignment among detected points in successive frames. The association is performed bi-directions, from the labeled frame to far frames. To avoid selecting the unconfident results, if the distance between the associated positions is small enough, we associate these. The right images in Fig. 2 show the examples of the estimated heat-maps $\hat{\mathcal{Y}}$ and the tracking results. We can observe that the heat-maps of frames close to the frame $l$ , were more accurately estimated compared to the results of later frames. Thus, the number of successfully tracked cells from $l$ gradually decreases with far from $l$ . If the pseudo-heatmaps contain many unreliable regions, which may be cell regions or background, these affect re-training. Therefore, we generate pseudo-heatmaps at the frames that the ratio of the tracked positions is larger than a threshold $\alpha$ . The range of the tracked frames is defined as $[a,b]$ .

Next, we generate the pseudo-heatmaps from the tracking results in $[a,b]$ . We generate the set of the pseudo-heat-maps $\{y_{t}^{p}\}_{t=a}^{b}$ using the positions of the tracked cells $\{\vec{p}_{t}^{tr}\}_{t=a}^{b}$ in the same manner to the supervised heatmap generation [15]. These frames still contain few unreliable regions, i.e., the regions on un-tracked cells. To mitigate the affect from the un-tracked cell regions, we re-train the detection network using the masked loss that ignores the masked regions in training. There are two types of unreliable regions.

The first is a region that is detected but is not tracked. Most of this type’s region is the region of a daughter cell that newly appears by cell mitosis after frame $l$ . Since cells monotonically increase with time by mitosis, we only consider this type of unreliable region at $t>l$ . These regions can be easily defined using an unassociated detected positions $\{\vec{p}^{o}_{t}\}$ . We define this region as $R(\vec{p}_{t}^{o})$ that is a set of pixels within radius $\beta$ from $\{\vec{p}^{o}_{t}\}$ .

The second is a region that is not detected but there is a cell (miss detection). If detected points in a certain region are continuously tracked from frame $l$ until the previous frame $t_{ut}$ i.e., it is possible that miss-detection occurred at $t_{ut}+1$ . To define the second regions, we use the position and timing when a track was terminated based on the tracking results, in which a terminate point and the frame are denoted as $\{\vec{p}^{ut}_{t^{ut}}\}$ . Since cells move slowly, the positions of the un-tracked cell in the later/earlier frames can be roughly predicted by random work from the terminated position and time. A region at $t$ is defined as $R(\vec{p}^{ut}_{t^{ut}},t)$ that is a set of pixels within radius $\beta+||t^{ut}-t||$ . The mask is defined using these unreliable regions as follows:

\displaystyle M_{t}(\vec{p})=\begin{cases}0&if~{}(\vec{p}\in R(\vec{p}_{t}^{ut},t)\\ 0&if~{}(\vec{p}\in R(\vec{p}_{t}^{o})and(t>l)\\ 1&otherwise.\end{cases}

(1)

We train $f_{d}$ with $\mathcal{X}_{p}=\{(x_{t},y_{t},M_{t})\}_{t=a}^{b}$ . When we train $f_{d}$ with $\mathcal{X}_{p}$ , we use following loss function:

\displaystyle L=\frac{1}{N}\sum_{N}\left(\frac{1}{\sum M_{t}}M_{t}*(y^{p}_{t}-\hat{y}_{t})^{2}\right),

(2)

where $\vec{i}$ is a coordinate, and $N$ is the number of pseudo-labels. We repeat Pseudo-labeling and re-training the network are iteratively performed until we reach $\gamma$ iterations.

3 Experiments

Implementation details: The detection network is trained for 10000 iterations using the Adam optimizer with a learning rate of 0.001 in each step. We set the hyperparameter $\alpha$ , which is the error rate for terminating tracking, to $0.8$ , $\beta$ , which is the radius of the region, to $18$ or $27$ according to cell size, and $\gamma$ , which is the number of iterations of our pseudo-labelings, to $3$ .

Metrics: To evaluate detection performance, we used F1-score $=\frac{2\cdot Precision\cdot Recall}{Recall+Precision}$ , in which TP, FP, and FN are true positive, false positive, and false negative, respectively (Precision $=\frac{TP}{TP+FP}$ , Recall $=\frac{TP}{TP+FN}$ ). We associate detected positions with ground-truth positions. We define true positive as the number of associated positions. We define the number of unassociated detected positions and ground-truth positions as false positive, and false-negative, respectively.

[Uncaptioned image] — Table 1: Quantitative evaluation results for Cell Tracking Challenge datasets on F1-score.

Evaluation on Cell Tracking Challenge datasets: To evaluate our method, we used Cell Tracking Challenge [18, 25], which are well-known cell image datasets. In this dataset, the cell were captured in various conditions as time-lapse images. We use three type of conditions: DIC-C2DH-HeLa, PhC-C2DH-U373, and PhC-C2DL-PSC. Two sequences that include 83 to 113 frames were fully annotated at a resolution ranging from 512x512 to 720x576 pixels. Since the magnification of the images (the size of cells) are different, we resized the images. The examples of the images are shown in Fig. 3 Please to refer to [18, 25] for the detailed information of this dataset. We used one image at the 20th frame as labeled frames and the rest frames as unlabeled sequence on all datasets, and we performed two fold cross validation. We used the twentieth frame as the labeled frame and the other frames as unlabeled sequences on all datasets, and we performed twofold cross-validation.

We compared our method with 3 conventional methods using one labeled image or unlabeled: Baseline, in which Nishimura’s method [15] was trained by one labeled image; Vicar [27], which is an image processing-based method by combining preconditioning of Yin [33] and distance transform [23]; Moskvyak [14], which is a consistency based semi-supervised detection method. In addition, to clarify how the number of the labeled data affect to the detection performance, we also compared several methods using additional labeled data: Supervised, in which Nishimura’s method [15] was trained by fully labeled image ; Moskvyak half, in which the first half of the sequence was additionally labeled and used for training.

Tab. 4 shows a quantitative evaluation of the F1 score on Cell Tracking Challenge. Our method achieves the best performance among the methods that trained with one labeled image. Since the time-lapse images gradually change the appearance of the image, Moskvyak’s performance worsens performance. Even if Moskvyak used 50 % of labeled data, the performance decreases from the baseline on DIC-C2DH-HeLa and PhC-C2DL-PSC. It indicates that the previous method that assumes the labeled images are randomly sampled can not work on our target setting. In contrast, our method can improve the performance with one labeled image. It can observe that our method is suitable for time-lapse sequences and effectively improves performance in a semi-supervised manner.

Fig. 4 shows the average F1 score of the three datasets for each frame in the training data. The horizontal axis indicates the F1 score, and the vertical axis indicates the frame. The performance of the F1 score is gradually decreasing from the twentieth frame. We observe that the performance of our method is improved compared to baseline by adding pseudo-labels from the close frames. Fig. 5 shows the example of result for iteration on DIC-C2DH-HeLa. As shown in Fig. 5, the heatmaps of our method become more clear than the baseline even in frames away from the labeled frame.

Evaluation on C2C12: To further evaluate our method on a more challenging case, we used a public dataset [10] (C2C12), which consists of 48 sequences with 1,013 images respectively. C2C12 cell is captured by phase-contrast microscopy under four different media conditions (Control, FGF2, BMP2, FGF2+BMP2) at a resolution of 1040 $\times$ 1392 pixels. An example of the images is shown in Fig. 3. The appearance of the image changed depend on cultured media conditions. Since only one sequence is fully annotated (BMP2), we additionally annotated three conditions (FGF2, Control, and BMP2 and BMP2+FGF2) to evaluate our method on various media conditions. We annotated 100 frames for the test data between the 600th and 700th frames: the total number of cells of is 27723, 85518, 7764, and 15082 for Control, FGF2, BMP2, and BMP2+FGF2, respectively. We annotated the 400th frame for all conditions to a different sequence of test data as training data: the total number of cells are 116, 28, 99, and 99 for respective conditions. Because cell type changes rapidly on FGF2 in the latter frames, we annotated frames 300 to 400 for test data and the 100th frame as training data in FGF2. Our model is trained on a sequence that includes 1 labeled image while the other 1,012 images are unlabeled. We compared our method with semi-supervised and unsupervised methods that are used in previous experiments.

Table 2 shows the quantitative evaluation results with the F1 score on four cultured conditions. Our method achieves the best performance on whole-culture conditions. Even if the cell shape slightly changes depending on the culture condition, the detection performance can be improved by this method. It indicate that the proposed method is effective for various time-lapse images.

Table 2: Quantitative evaluation results for C2C12 on F1-score

Method	Control	BMP2	FGF2	BMP2+FGF2	Ave.
Vicar [27]	0.731	0.607	0.676	0.820	0.709
Baseline [15]	0.740	0.844	0.651	0.824	0.765
Moskvyak [14]	0.598	0.170	0.478	0.520	0.442
Ours	0.756	0.886	0.830	0.901	0.843

4 Conclusion

We proposed a semi-supervised cell detection method for time-lapse images in which there is one labeled image and the other images are unlabeled. Our method can improve detection network by adding pseudo labels that is selected by tracking from detection results. We demonstrated our method’s effectiveness on seven different conditions, and we demonstrated that our method can improve detection performance on various conditions.

Acknowledgment: This work was supported by JSPS KAKENHI Grant Number JP20H04211.

References

[1] Bertasius, G., Feichtenhofer, C., Tran, D., Shi, J., Torresani, L.: Learning temporal pose estimation from sparsely labeled videos. In: NeurIPS (2019)
[2] Bise, R., Sato, Y.: Cell detection from redundant candidate regions under nonoverlapping constraints. IEEE Transactions on Medical Imaging 34(7), 1417–1427 (2015)
[3] Cosatto, E., Miller, M., Graf, H.P., Meyer, J.S.: Grading nuclear pleomorphism on histological micrographs. In: ICPR. pp. 1–4 (2008)
[4] Cruz-Roa, A.A., Ovalle, J.E.A., Madabhushi, A., Osorio, F.A.G.: A deep learning architecture for image representation, visual interpretability and automated basal-cell carcinoma cancer detection. In: MICCAI. pp. 403–410 (2013)
[5] Fujita, S., Han, X.H.: Cell detection and segmentation in microscopy images with improved mask r-cnn. In: ACCV (2020)
[6] Honari, S., Molchanov, P., Tyree, S., Vincent, P., Pal, C., Kautz, J.: Improving landmark localization with semi-supervised learning. In: CVPR. pp. 1546–1555 (2018)
[7] Jeong, J., Lee, S., Kim, J., Kwak, N.: Consistency-based semi-supervised learning for object detection. In: NeurIPS. vol. 32 (2019)
[8] Kainz, P., Urschler, M., Schulter, S., Wohlhart, P., Lepetit, V.: You should use regression to detect cells. In: MICCAI. pp. 276–283 (2015)
[9] Kanade, T., Yin, Z., Bise, R., Huh, S., Eom, S., Sandbothe, M.F., Chen, M.: Cell image analysis: Algorithms, system and applications. In: WACV. pp. 374–381 (2011)
[10] Ker, D.F.E., Eom, S., Sanami, S., Bise, R., Pascale, C., Yin, Z., Huh, S.i., Osuna-Highley, E., Junkers, S.N., Helfrich, C.J., Liang, P.Y., et al.: Phase contrast time-lapse microscopy datasets with automated and manual cell tracking annotations. Scientific data 5(1), 1–12 (2018)
[11] Kikkawa, R., Sekiguchi, H., Tsuge, I., Saito, S., Bise, R.: Semi-supervised learning with structured knowledge for body hair detection in photoacoustic image. In: ISBI. pp. 1411–1415 (2019)
[12] Li, J., Yang, S., Huang, X., Da, Q., Yang, X., Hu, Z., Duan, Q., Wang, C., Li, H.: Signet ring cell detection with a semi-supervised learning framework. In: IPMI. pp. 842–854 (2019)
[13] Misra, I., Shrivastava, A., Hebert, M.: Watch and learn: Semi-supervised learning for object detectors from video. In: CVPR. pp. 3593–3602 (2015)
[14] Moskvyak, O., Maire, F., Dayoub, F., Baktashmotlagh, M.: Semi-supervised keypoint localization. In: ICLR (2021)
[15] Nishimura, K., Bise, R., et al.: Weakly supervised cell instance segmentation by propagating from detection response. In: MICCAI. pp. 649–657 (2019)
[16] Otsu, N.: A threshold selection method from gray-level histograms. IEEE Transactions on Systems, Man, and Cybernetics 9(1), 62–66 (1979)
[17] Raza, S.E.A., AbdulJabbar, K., Jamal-Hanjani, M., Veeriah, S., Le Quesne, J., Swanton, C., Yuan, Y.: Deconvolving convolutional neural network for cell detection. In: ISBI. pp. 891–894 (2019)
[18] Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: Unified, real-time object detection. In: CVPR. pp. 779–788 (2016)
[19] Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object detection with region proposal networks. In: NeurIPS. pp. 91–99 (2015)
[20] Sirinukunwattana, K., Raza, S.E.A., Tsang, Y., Snead, D.R.J., Cree, I.A., Rajpoot, N.M.: Locality sensitive deep learning for detection and classification of nuclei in routine colon cancer histology images. IEEE Transactions on Medical Imaging 35(5), 1196–1206 (2016)
[21] Sohn, K., Zhang, Z., Li, C.L., Zhang, H., Lee, C.Y., Pfister, T.: A simple semi-supervised learning framework for object detection. In: arXiv:2005.04757 (2020)
[22] Tang, P., Ramaiah, C., Wang, Y., Xu, R., Xiong, C.: Proposal learning for semi-supervised object detection. In: WACV. pp. 2291–2301 (2021)
[23] Thirusittampalam, K., Hossain, M.J., Ghita, O., Whelan, P.F.: A novel framework for cellular tracking and mitosis detection in dense phase contrast microscopy images. IEEE journal of biomedical and health informatics 17(3), 642–653 (2013)
[24] Tse, S., Bradbury, L., Wan, J.W., Djambazian, H., Sladek, R., Hudson, T.: A combined watershed and level set method for segmentation of brightfield cell images. In: Medical Imaging 2009: Image Processing. vol. 7259, p. 72593G (2009)
[25] Ulman, V., Maška, M., Magnusson, K.E., Ronneberger, O., Haubold, C., Harder, N., Matula, P., Matula, P., Svoboda, D., Radojevic, M., et al.: An objective comparison of cell-tracking algorithms. Nature methods 14(12), 1141–1152 (2017)
[26] Veta, M., Van Diest, P.J., Kornegoor, R., Huisman, A., Viergever, M.A., Pluim, J.P.: Automatic nuclei segmentation in h&e stained breast cancer histopathology images. PloS one 8(7), e70221 (2013)
[27] Vicar, T., Balvan, J., Jaros, J., Jug, F., Kolar, R., Masarik, M., Gumulec, J.: Cell segmentation methods for label-free contrast microscopy: review and comprehensive comparison. BMC bioinformatics 20(1), 360 (2019)
[28] Wang, K., Lin, L., Yan, X., Chen, Z., Zhang, D., Zhang, L.: Cost-effective object detection: Active sample mining with switchable selection criteria. IEEE transactions on neural networks and learning systems 30(3), 834–850 (2018)
[29] Wang, K., Yan, X., Zhang, D., Zhang, L., Lin, L.: Towards human-machine cooperation: Self-supervised sample mining for object detection. In: CVPR. pp. 1605–1613 (2018)
[30] Wang, T., Yang, T., Cao, J., Zhang, X.: Co-mining: Self-supervised learning for sparsely annotated object detection. AAAI (2020)
[31] Xu, H., Lu, C., Berendt, R., Jha, N., Mandal, M.: Automatic nuclei detection based on generalized laplacian of gaussian filters. IEEE journal of biomedical and health informatics 21(3), 826–837 (2016)
[32] Xu, J., Xiang, L., Liu, Q., Gilmore, H., Wu, J., Tang, J., Madabhushi, A.: Stacked sparse autoencoder (ssae) for nuclei detection on breast cancer histopathology images. IEEE transactions on medical imaging 35(1), 119–130 (2015)
[33] Yin, Z., Kanade, T., Chen, M.: Understanding the phase contrast optics to restore artifact-free microscopy images for segmentation. Medical image analysis 16(5), 1047–1062 (2012)