
The University of Queensland, Australia
{jiasyuen.lim, y.luo, zhi.chen, tianqi.wei, scott.chapman, helen.huang}@uq.edu.au

Track Any Peppers: Weakly Supervised Sweet Pepper Tracking Using VLMs

Jia Syuen Lim (0009-0008-0003-4805), Yadan Luo (0000-0001-6272-2971), Zhi Chen (0000-0002-9385-144X), Tianqi Wei (0009-0005-0134-6438), Scott Chapman (0000-0003-4732-8452), Zi Huang (0000-0002-9738-4949)
Abstract

In the Detection and Multi-Object Tracking of Sweet Peppers Challenge, we present Track Any Peppers (TAP), a weakly supervised ensemble technique for sweet pepper tracking. TAP leverages the zero-shot detection capabilities of vision-language foundation models such as Grounding DINO to automatically generate pseudo-labels for sweet peppers in video sequences with minimal human intervention. These pseudo-labels, refined when necessary, are used to train a YOLOv8 segmentation network. To enhance detection accuracy under challenging conditions, we incorporate pre-processing techniques such as relighting adjustments and apply depth-based filtering during post-inference. For object tracking, we integrate the Matching Anything by Segmenting Anything (MASA) adapter with the BoT-SORT algorithm. Our approach achieves a HOTA score of 80.4%, MOTA of 66.1%, Recall of 74.0%, and Precision of 90.7%, demonstrating effective tracking of sweet peppers without extensive manual effort. This work highlights the potential of foundation models for efficient and accurate object detection and tracking in agricultural settings.

Keywords: Multiple Object Tracking · Object Detection

1 Introduction

The integration of computer vision technologies into horticulture has become increasingly vital for modern agricultural practices. Accurate detection and tracking of small objects, such as sweet peppers in densely populated fields, are essential for monitoring crop health, identifying diseases [12, 13, 14], assessing harvest readiness, phenotyping [4], and making informed decisions that enhance sustainability and production efficiency [5, 11, 10]. Automating these tasks not only reduces the reliance on manual labor but also improves precision and scalability in crop management.

Traditional object tracking algorithms require extensive training on large, annotated datasets [7]. This process involves manually assigning instance tracking IDs across video frames in addition to labeling bounding boxes—a labor-intensive and time-consuming endeavor that can be prohibitively expensive. The need for such detailed annotations makes it impractical to frequently update models to adapt to the dynamic conditions commonly found in agricultural environments.

Recent advancements in large foundation models have showcased remarkable zero-shot and generalization capabilities, particularly in vision-language models (VLMs) like Grounding DINO [8] and Segment Anything Model (SAM) [6]. These models can perform object detection without task-specific training data, presenting an opportunity to mitigate the extensive manual effort typically required for dataset annotation.

In this work, we propose a novel methodology that leverages the zero-shot detection capabilities of foundation models to generate pseudo-labels for object instances across video sequences. By utilizing off-the-shelf VLMs, we automatically obtain bounding boxes for target objects with minimal human intervention. Human experts are involved only to refine these pseudo-labels when necessary, thereby reducing annotation costs compared to traditional methods. Building upon these pseudo-labels, we train a YOLOv8 segmentation network using a combination of the refined labels and publicly available datasets, as detailed in Section 2. To enhance detection accuracy, especially in challenging conditions like high illumination, we incorporate pre-processing techniques such as relighting adjustments. During post-inference, depth-based filtering is applied to further refine the results. For object tracking, we employ a hybrid approach that integrates the Matching Anything by Segmenting Anything (MASA) [7] adapter with the BoT-SORT [1] algorithm.

Our experimental results demonstrate the effectiveness of the proposed methodology, achieving a HOTA score of 80.4%, MOTA of 66.1%, Recall of 74.0%, and Precision of 90.7%. These metrics indicate that our approach can successfully track sweet peppers without the need for extensive human intervention, highlighting the potential of leveraging foundation models for efficient and accurate object detection and tracking in agricultural settings. Further details of our overall framework and an analysis of key factors are provided in Sections 2 and 3.

2 Methodology

In this section, we present the methodology used to address the challenge of detecting and tracking small objects in a cluttered agricultural environment using weakly labeled data. Our approach consists of four main stages. ① Weak Labels Acquisition leverages a foundation model to generate bounding boxes and segmentation masks, which are refined by human experts. Next, in ② Mask & Box Detection with YOLOv8, we train a YOLOv8 segmentation network for object detection using publicly available datasets combined with our refined pseudo-labels. In the ③ Pre- and Post-processing step, we apply an adaptive relighting strategy prior to inference to handle high-illumination conditions and perform depth filtering after inference to separate foreground from background objects. Finally, in ④ Hybrid Object Tracking with Ensemble Methods, we employ an ensemble tracking method that combines MASA adapter prompts and the BoT-SORT algorithm to assign a unique tracking ID to each detection across frames. The overall methodology is illustrated in Figure 1.

Figure 1: The overall framework of TAP. First, pseudo-labels are generated using a foundation model and refined by experts to ensure quality. SAM is then used to create segmentation masks based on these refined labels. The labels are subsequently employed to train a segmentation model for sweet pepper detection. Before inference, adaptive relighting compensates for illumination variance, while post-inference depth filtering separates object layers. Finally, a hybrid tracking system combines algorithms to assign unique tracking IDs and ensure consistency across frames.

2.0.1 Problem Formulation.

Given a sequence of frames $\mathcal{F}=\{f_1,f_2,\ldots,f_T\}$ from a video, the objective is to track small objects (e.g., sweet peppers) across consecutive frames. In each frame $f_t$, potential object detections $d_i^t$ are identified, where each detection $d_i^t$ is characterized by a bounding box $b(d_i^t)=(x_{\min},y_{\min},x_{\max},y_{\max})$, a detection probability $p(d_i^t)$, and a segmentation mask $m(d_i^t)$. The goal is to associate these detections across frames to form consistent object trajectories $\mathcal{T}_k=\{d_k^1,d_k^2,\ldots,d_k^{N_k}\}$ for each object $k$, where $N_k$ denotes the number of frames in which object $k$ appears. This tracking task must address challenges such as occlusion, clutter, and varying lighting conditions to maintain accurate and continuous trajectories throughout the video sequence.
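For concreteness, the following is a minimal Python sketch of the per-frame detection and trajectory structures implied by this formulation; the class and field names are illustrative and not taken from any released code.

```python
# Minimal data structures mirroring the formulation above (illustrative only).
from dataclasses import dataclass, field
from typing import List, Tuple

import numpy as np


@dataclass
class Detection:
    """One detection d_i^t in frame t."""
    box: Tuple[float, float, float, float]  # b(d_i^t) = (x_min, y_min, x_max, y_max)
    score: float                            # p(d_i^t), detection probability
    mask: np.ndarray                        # m(d_i^t), binary segmentation mask (H x W)
    frame: int                              # timestamp t


@dataclass
class Trajectory:
    """A trajectory T_k = {d_k^1, ..., d_k^{N_k}} for object k."""
    track_id: int
    detections: List[Detection] = field(default_factory=list)

    @property
    def num_frames(self) -> int:
        return len(self.detections)  # N_k
```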

2.0.2 Weak Labels Acquisition with Foundation Models.

We utilize Grounding DINO, a vision-language model, to perform zero-shot object detection using textual queries like "bell_pepper", enabling us to detect objects without prior training on our dataset. Unlike conventional detection models, Grounding DINO uses this semantic information to identify target objects. For each image sequence $\mathcal{F}=\{f_1,f_2,\ldots,f_T\}$, where $f_t$ represents the frame at timestamp $t$, we define a sub-sampling interval $I$ to extract $N=\lfloor T/I \rfloor$ images, minimizing redundancy from visually similar frames. These sub-sampled images are input to Grounding DINO, which generates a set of pseudo-bounding boxes $\mathcal{P}=\{p_1,p_2,\ldots,p_N\}$, with each $p_i$ representing the coordinates of a detected object. To ensure quality, we consider only pseudo-labels with high confidence scores. The confidence score $c_i$ for each detection $p_i$ is computed as $c_i=\sigma(l_i)$, where $\sigma(\cdot)$ is the sigmoid function and $l_i$ is the logit output associated with $p_i$. We select detections satisfying $c_i\geq\tau$, where $\tau$ is a predefined threshold, resulting in a refined set $\mathcal{P}'=\{p_i \mid c_i\geq\tau\}$. Human experts then further refine these selected pseudo-labels to enhance their accuracy, particularly in cluttered environments. Each image is manually inspected, and only those with misaligned boxes or significant detection errors are refined, typically requiring less than 2 minutes per image. The refined bounding boxes serve as prompts for the Segment Anything Model (SAM), which generates precise segmentation masks $\mathcal{M}=\{m_1,m_2,\ldots,m_N\}$ corresponding to the detected objects. These refined labels and masks are subsequently used in later stages for model training and detection.
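The pipeline above can be summarized with the short sketch below. The callables `detect_fn` and `segment_fn` are hypothetical wrappers around Grounding DINO and SAM inference (the exact checkpoints and APIs are not specified here), and the default threshold value is a placeholder; only the sub-sampling and confidence filtering mirror the description directly.

```python
# Sketch of the weak-label pipeline: sub-sample frames, run zero-shot detection,
# keep confident boxes, then prompt SAM with the (expert-refined) boxes.
import math


def generate_pseudo_labels(frames, detect_fn, segment_fn,
                           text_query="bell_pepper", interval=5, tau=0.35):
    """detect_fn(frame, query) -> (boxes, logits); segment_fn(frame, box) -> mask.
    Both are user-supplied wrappers around Grounding DINO and SAM; tau is a placeholder."""
    sampled = frames[::interval]                       # N = floor(T / I) sub-sampled frames
    pseudo_boxes, masks = [], []
    for frame in sampled:
        boxes, logits = detect_fn(frame, text_query)   # zero-shot detection
        keep = []
        for box, logit in zip(boxes, logits):
            c = 1.0 / (1.0 + math.exp(-logit))         # c_i = sigmoid(l_i)
            if c >= tau:                               # retain only confident pseudo-labels
                keep.append(box)
        # (Expert refinement of `keep` happens offline before the SAM step.)
        pseudo_boxes.append(keep)
        masks.append([segment_fn(frame, box) for box in keep])  # box-prompted masks
    return pseudo_boxes, masks
```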

2.0.3 Mask & Box Detection with YOLOv8.

We train a YOLOv8 [9] model using the refined bounding boxes $\mathcal{P}'$ and segmentation masks $\mathcal{M}$ as inputs. While vision-language models like Grounding DINO exhibit strong zero-shot capabilities, we observed that YOLOv8 achieves superior performance due to its ability to generalize effectively with limited training data. This performance gain is attributed to YOLOv8's robustness against overfitting and reduced susceptibility to over-parameterization, both common challenges when fine-tuning large foundational models on small datasets. The YOLOv8 model is trained using a combination of three loss functions: classification loss $\mathcal{L}_{\text{cls}}$, localization loss $\mathcal{L}_{\text{loc}}$, and segmentation loss $\mathcal{L}_{\text{seg}}$. The overall training objective is to minimize the total loss $\mathcal{L}_{\text{overall}}$, defined as:

$\mathcal{L}_{\text{overall}} = \lambda_{\text{cls}}\mathcal{L}_{\text{cls}} + \lambda_{\text{loc}}\mathcal{L}_{\text{loc}} + \lambda_{\text{seg}}\mathcal{L}_{\text{seg}}$,   (1)

where $\lambda_{\text{cls}}$, $\lambda_{\text{loc}}$, and $\lambda_{\text{seg}}$ are hyperparameters that weight the contribution of each loss component.
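As a reference, a minimal training sketch with the Ultralytics API is shown below. Whether this interface was used, as well as the model size and epoch count, are assumptions; the batch size and input resolution follow the implementation details in Section 3, and the weights in Eq. (1) correspond only loosely to the loss-gain hyperparameters Ultralytics exposes.

```python
# Hedged sketch: trains a YOLOv8 segmentation model on the pseudo-labelled data.
# The dataset YAML path, model size, and epoch count are placeholders.
from ultralytics import YOLO

model = YOLO("yolov8s-seg.pt")      # pretrained segmentation checkpoint (size is a guess)
model.train(
    data="sweet_pepper_seg.yaml",   # hypothetical dataset config (pseudo-labels + BUP20 + Kaggle)
    epochs=100,                     # not reported in the paper
    batch=8,                        # batch size reported in Section 3
    imgsz=736,                      # the paper resizes inputs to 736x416
)
```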

2.0.4 Adaptive Relighting Scheme.

To enhance detection quality under varying lighting conditions, we implement an adaptive contrast and luminance adjustment scheme prior to inference. Our approach dynamically adjusts the luminance and contrast of each frame $f_t$ based on its overall luminance $\mathcal{L}(f_t)$. The adjusted luminance $\mathcal{L}'(f_t)$ and contrast $\mathcal{C}'(f_t)$ are computed as:

$\mathcal{L}'(f_t) = \alpha(\mathcal{L}(f_t)) \cdot \mathcal{L}(f_t)$,   (2)
$\mathcal{C}'(f_t) = \beta(\mathcal{L}(f_t)) \cdot \mathcal{C}(f_t)$,   (3)

where $\mathcal{C}(f_t)$ represents the original image contrast. The scaling factors $\alpha(\mathcal{L}(f_t))$ and $\beta(\mathcal{L}(f_t))$ are functions of the frame's luminance, designed to reduce luminance when it is high and boost contrast when it is low. By dynamically adjusting these factors, we enhance detection performance across a range of illumination levels by suppressing overexposed or underexposed areas and emphasizing object boundaries.
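One possible instantiation of Eqs. (2)-(3) is sketched below using OpenCV. The specific schedules for $\alpha$ and $\beta$ (ratios towards a target luminance, clipped to fixed ranges) are an illustrative choice; the exact functions are not specified in this paper.

```python
# Sketch of the adaptive relighting step: scale luminance down when the frame is
# bright and stretch contrast when it is dark. Alpha/beta schedules are illustrative.
import cv2
import numpy as np


def adaptive_relight(frame_bgr: np.ndarray, target_luma: float = 128.0) -> np.ndarray:
    lab = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2LAB).astype(np.float32)
    L = lab[..., 0]
    mean_luma = float(L.mean())                                    # L(f_t): overall luminance

    alpha = np.clip(target_luma / max(mean_luma, 1e-6), 0.6, 1.0)  # dims overexposed frames
    beta = np.clip(target_luma / max(mean_luma, 1e-6), 1.0, 1.4)   # boosts contrast in dark frames

    L_adj = alpha * L                                              # Eq. (2): luminance scaling
    L_adj = (L_adj - L_adj.mean()) * beta + L_adj.mean()           # Eq. (3): contrast about the mean
    lab[..., 0] = np.clip(L_adj, 0, 255)
    return cv2.cvtColor(lab.astype(np.uint8), cv2.COLOR_LAB2BGR)
```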

2.0.5 Detection and Depth-Based Filtering.

After training, we use the YOLOv8 model to detect sweet peppers in each video frame $f_t$. Once detections are obtained, we apply the segmentation masks $\mathcal{M}$, generated by YOLOv8, along with depth maps to further filter the detected objects. Let $d_i$ denote the depth of instance $i$, and let $\tau_d$ be a threshold distinguishing foreground from background objects. Each object is classified as:

$\text{Object}_i = \begin{cases} \mathsf{Foreground}, & \text{if } d_i < \tau_d \\ \mathsf{Background}, & \text{if } d_i \geq \tau_d \end{cases}$   (4)

For instances classified as foreground ($d_i < \tau_d$), we retain the predictions $\hat{y}_i = (\hat{b}_i, \hat{p}_i, \hat{m}_i)$, where $\hat{b}_i$, $\hat{p}_i$, and $\hat{m}_i$ are the predicted bounding box, detection probability, and segmentation mask, respectively. Instances classified as background ($d_i \geq \tau_d$) are discarded ($\hat{y}_i = \varnothing$).
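The depth gate of Eq. (4) reduces to a per-instance comparison, sketched below. How the instance depth $d_i$ is aggregated from the depth map (median over the predicted mask here) is not stated in the paper and is therefore an assumption, as is the unit of the threshold.

```python
# Sketch of depth-based filtering: keep only instances whose in-mask depth is below tau_d.
import numpy as np


def filter_by_depth(detections, masks, depth_map, tau_d=1200.0):
    """detections: per-instance predictions; masks: boolean HxW arrays aligned with them;
    depth_map: HxW depth image aligned with the RGB frame."""
    kept = []
    for det, mask in zip(detections, masks):
        values = depth_map[mask]
        if values.size == 0:
            continue                          # degenerate mask: drop the instance
        d_i = float(np.median(values))        # per-instance depth d_i (aggregation is our choice)
        if d_i < tau_d:                       # foreground per Eq. (4); background is discarded
            kept.append(det)
    return kept
```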

2.0.6 Hybrid Object Tracking with Ensemble Methods.

After detecting the bounding boxes in each frame of the video sequence, we perform object tracking using a hybrid ensemble approach. The first component employs the BoT-SORT [1] algorithm, a multi-object tracking method that associates detections across frames based on bounding box overlap and appearance features. Simultaneously, we utilize the detected bounding boxes as prompts for the Matching Anything by Segmenting Anything (MASA) [7] adapter, built on SAM. MASA enables instance-level correspondence tracking without requiring explicit video annotations by learning instance representations from dense, unlabeled data. This improves the model’s ability to track objects across frames by leveraging prompt-based supervision.
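At a high level, the per-frame tracking loop can be sketched as follows. The `botsort` and `masa` objects are hypothetical wrappers around the published BoT-SORT and MASA implementations, and the ID-fusion rule (prefer BoT-SORT, fall back to MASA) is an illustrative choice rather than the exact ensemble used here.

```python
# Hedged sketch of the hybrid tracking loop over a detected video sequence.
def track_sequence(frames, detector, botsort, masa):
    trajectories = {}                                    # track_id -> list of (frame, detection)
    for t, frame in enumerate(frames):
        detections = detector(frame)                     # depth-filtered YOLOv8 outputs (Sec. 2.0.5)
        ids_botsort = botsort.update(frame, detections)  # IoU + appearance association
        ids_masa = masa.update(frame, detections)        # SAM-based, box-prompted association
        for det, id_b, id_m in zip(detections, ids_botsort, ids_masa):
            track_id = id_b if id_b is not None else id_m   # illustrative fusion rule
            if track_id is None:
                continue                                 # unmatched detection: no ID this frame
            trajectories.setdefault(track_id, []).append((t, det))
    return trajectories
```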

3 Experiments

Dataset and Evaluation Metrics. In this study, we utilize a combination of pseudo-labeled data and publicly available datasets to construct a comprehensive training set. In addition to the pseudo-labels generated and refined as described in Section 2.0.2, we integrate additional datasets, including BUP20 [11] and a publicly available dataset from Kaggle [3], to ensure diversity in object appearances and environments. For evaluation, we adopt the widely used HOTA (Higher Order Tracking Accuracy) metric to assess tracking performance. HOTA evaluates both the detection and association performance of tracking algorithms. In addition to HOTA, we report results for MOTA (Multiple Object Tracking Accuracy) [2], Recall, and Precision to provide a well-rounded assessment of our method's effectiveness.

Implementation Details. For the depth filtering step, the threshold $\tau_d$ is set to 1200, ensuring that objects with a depth value $d_i < \tau_d$ are retained as foreground, while the rest are discarded as background. For sub-sampling the video sequences, we use an interval $I=5$, meaning we select one frame out of every five to minimize overlap and redundancy in the training set, resulting in a total of 380 images. For training the YOLOv8 model, we use the default parameters provided by the Ultralytics project, with the following modifications: a batch size of 8 and all input images resized to 736×416 pixels.
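For reference, the stated hyperparameters can be collected into a small configuration block; values not reported in the paper (e.g., training epochs, the pseudo-label confidence threshold $\tau$) are deliberately omitted.

```python
# Hyperparameters as reported in this section; anything not listed is unspecified in the paper.
CONFIG = {
    "depth_threshold_tau_d": 1200,   # depth units as provided by the sensor (unit not stated)
    "subsample_interval_I": 5,       # keep one frame out of every five (380 images total)
    "batch_size": 8,
    "input_size": (736, 416),        # images resized to 736x416 pixels
    "base_detector": "YOLOv8-seg",   # Ultralytics defaults otherwise
}
```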

Table 1: Tracking Performance. We report the average Identification F1-score (IDF1), Identification Precision (IDP), and Identification Recall (IDR), as well as HOTA, MOTA, Recall (Rcll), and Precision (Prec) across all sequences in the evaluation set. The ↑ indicates that higher values mean better performance.
Method | IDF1[%] ↑ | IDP[%] ↑ | IDR[%] ↑ | HOTA[%] ↑ | MOTA[%] ↑ | Rcll[%] ↑ | Prec[%] ↑
Ours   | 80.5      | 89.5     | 73.1     | 80.4      | 66.1      | 74.0      | 90.7

Main Results. Our model achieves a HOTA score of 80.4%, MOTA of 66.1%, Recall of 74.0%, and Precision of 90.7%, as shown in Table 1. The high Precision demonstrates an effective reduction of false positives, primarily due to our depth filtering step, which eliminates background objects using an empirically selected depth threshold; more systematic tuning of these settings could allow further optimization and potentially improve performance. In contrast, the slightly lower Recall suggests that some true positives were missed, likely due to flickering detections in heavily occluded scenes. This may be attributed to MASA's [7] pre-training on static, unlabeled images, which limits its ability to perform well under heavy occlusion.

Ablation Study. We evaluate the effect of incorporating depth-based filtering into the detection pipeline by comparing results with and without depth filtering. As shown in Table 2, depth filtering significantly improves the quality of the final predictions: it eliminates background objects and, by separating overlapping objects, enhances tracking performance in cluttered environments.

Table 2: Ablation Study. We report the impact of applying depth filtering on tracking performance.
Method        | IDF1[%] ↑ | IDP[%] ↑ | IDR[%] ↑ | HOTA[%] ↑ | MOTA[%] ↑ | Rcll[%] ↑ | Prec[%] ↑
w/o Filtering | 58.6      | 62.2     | 55.4     | 63.4      | 22.4      | 55.8      | 62.7
w/ Filtering  | 80.5      | 89.5     | 73.1     | 80.4      | 66.1      | 74.0      | 90.7

4 Conclusion

In this study, we presented a methodology for detecting and tracking sweet peppers in agricultural environments by leveraging the zero-shot detection capabilities of large foundation models, thus minimizing the need for extensive manual annotation. By generating pseudo-labels with off-the-shelf vision-language models and involving human experts only for necessary refinements, we trained a YOLOv8 segmentation network using these labels alongside publicly available datasets. By employing pre-processing techniques such as relighting adjustments, followed by an ensemble tracking algorithm, we effectively assigned unique tracking IDs across video frames. We hope that our work provides valuable insights for future research in agricultural computer vision.

References

  • [1] Aharon, N., Orfaig, R., Bobrovsky, B.Z.: Bot-sort: Robust associations multi-pedestrian tracking. arXiv preprint arXiv:2206.14651 (2022)
  • [2] Bernardin, K., Stiefelhagen, R.: Evaluating multiple object tracking performance: the clear mot metrics. EURASIP Journal on Image and Video Processing 2008, 1–10 (2008)
  • [3] Cavero, L.E.M.: Sweet pepper and peduncle segmentation dataset (2022), https://www.kaggle.com/datasets/lemontyc/sweet-pepper, accessed: September 20, 2024
  • [4] Chen, Z., Wei, T., Zhao, Z., Lim, J.S., Luo, Y., Zhang, H., Yu, X., Chapman, S., Huang, Z.: Cf-prnet: Coarse-to-fine prototype refining network for point cloud completion and reconstruction. arXiv preprint arXiv:2409.08443 (2024)
  • [5] Halstead, M., Ahmadi, A., Smitt, C., Schmittmann, O., McCool, C.: Crop agnostic monitoring driven by deep learning. Frontiers in plant science 12 (2021)
  • [6] Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., et al.: Segment anything. In: ICCV. pp. 4015–4026 (2023)
  • [7] Li, S., Ke, L., Danelljan, M., Piccinelli, L., Segu, M., Van Gool, L., Yu, F.: Matching anything by segmenting anything. CVPR (2024)
  • [8] Liu, S., Zeng, Z., Ren, T., Li, F., Zhang, H., Yang, J., Li, C., Yang, J., Su, H., Zhu, J., et al.: Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499 (2023)
  • [9] Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: Unified, real-time object detection. In: CVPR. pp. 779–788 (2016)
  • [10] Smitt, C., Halstead, M., Zaenker, T., Bennewitz, M., McCool, C.: Pathobot: A robot for glasshouse crop phenotyping and intervention. In: ICRA. pp. 2324–2330. IEEE (2021)
  • [11] Smitt, C., Halstead, M., Zimmer, P., Läbe, T., Guclu, E., Stachniss, C., McCool, C.: Pag-nerf: Towards fast and efficient end-to-end panoptic 3d representations for agricultural robotics. IEEE Robotics and Automation Letters 9, 907–914 (2024)
  • [12] Wei, T., Chen, Z., Huang, Z., Yu, X.: Benchmarking in-the-wild multimodal disease recognition and a versatile baseline. In: ACMMM (2024)
  • [13] Wei, T., Chen, Z., Yu, X.: Snap and diagnose: An advanced multimodal retrieval system for identifying plant diseases in the wild. arXiv preprint arXiv:2408.14723 (2024)
  • [14] Wei, T., Chen, Z., Yu, X., Chapman, S., Melloy, P., Huang, Z.: Plantseg: A large-scale in-the-wild dataset for plant disease segmentation. arXiv preprint arXiv:2409.04038 (2024)