Enhancing Thermal MOT: A Novel Box Association Method Leveraging Thermal Identity and Motion Similarity
Abstract
Multiple Object Tracking (MOT) in thermal imaging presents unique challenges due to the lack of visual features and the complexity of motion patterns. This paper introduces an innovative approach to improve MOT in the thermal domain by developing a novel box association method that utilizes both thermal object identity and motion similarity. Our method combines sparse thermal appearance cues with dynamic object tracking, enabling more accurate and robust MOT performance. Additionally, we present a new dataset comprising a large-scale collection of thermal and RGB images captured in diverse urban environments, serving as both a benchmark for our method and a new resource for thermal imaging. We conduct extensive experiments to demonstrate the superiority of our approach over existing methods, showing significant improvements in tracking accuracy and robustness under various conditions. Our findings suggest that incorporating thermal identity with motion data enhances MOT performance. The newly collected dataset and source code are available at https://github.com/wassimea/thermalMOT
1 Introduction
Thermal cameras have proven to be robust perception sensors that operate reliably under different weather and lighting conditions for various tasks in computer vision [29, 26, 38, 24, 43, 42, 9, 4, 3]. This characteristic allows vision systems utilizing thermal cameras to take advantage of the unique thermal patterns of objects for more robust and reliable performance. Convolutional neural networks (CNNs) have been used to great effect for a variety of computer vision tasks across different spectrums (RGB, thermal, depth, hyperspectral, etc.). These tasks range from image classification [20, 50, 45, 35] to object detection [27, 13, 47] and multiple object tracking [49, 48, 52, 51].
1.1 Multiple Object Trackers
Multiple Object Tracking (MOT) is the task of detecting individual objects in a video and tracking them over consecutive frames with a unique identifier. The performance of a network is measured by how consistently each object retains its ID as it is tracked over multiple frames.
Solutions to the multiple object tracking (MOT) task can be divided into two main categories: one-stage [15, 53, 44, 36] and two-stage [2, 49, 48, 52, 51, 6]. The former type uses an end-to-end pipeline that tracks directly from the network’s inputs while the latter separates the task into (1) the detection of objects in the scene and (2) the tracking of these detections in subsequent frames. Two-stage approaches have proven to be more versatile and accurate than one-stage trackers [52].
Most two-stage MOT solutions utilize motion association as the main criterion when conducting box association: a Kalman filter [22] is used to predict the locations of objects in the next frame; Intersection-over-Union (IoU) is then calculated between the detected bounding boxes and the Kalman-filter predicted boxes to match the boxes across frames. One main advantage of motion association is that the algorithm can be utilized in systems using any type of sensor (visible and non-visible): as long as it is possible to predict bounding boxes, it is possible to conduct tracking, regardless of the sensor modality.
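To make this concrete, the following is a minimal sketch of IoU-based motion association, assuming boxes in [x1, y1, x2, y2] format and using SciPy's Hungarian solver; the function names and the IoU threshold are illustrative rather than taken from any specific tracker:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou_matrix(pred_boxes, det_boxes):
    """Pairwise IoU between Kalman-predicted track boxes and detected
    boxes, both given as [x1, y1, x2, y2]."""
    ious = np.zeros((len(pred_boxes), len(det_boxes)))
    for i, p in enumerate(pred_boxes):
        for j, d in enumerate(det_boxes):
            xx1, yy1 = max(p[0], d[0]), max(p[1], d[1])
            xx2, yy2 = min(p[2], d[2]), min(p[3], d[3])
            inter = max(0.0, xx2 - xx1) * max(0.0, yy2 - yy1)
            union = ((p[2] - p[0]) * (p[3] - p[1])
                     + (d[2] - d[0]) * (d[3] - d[1]) - inter)
            ious[i, j] = inter / (union + 1e-9)
    return ious

def associate(pred_boxes, det_boxes, iou_thresh=0.3):
    """Hungarian matching on IoU; pairs below the threshold are rejected
    and treated as unmatched by the tracker."""
    if len(pred_boxes) == 0 or len(det_boxes) == 0:
        return []
    ious = iou_matrix(pred_boxes, det_boxes)
    rows, cols = linear_sum_assignment(-ious)  # maximize total IoU
    return [(r, c) for r, c in zip(rows, cols) if ious[r, c] >= iou_thresh]
```

Note that nothing in this sketch is modality-specific, which is exactly the sensor-agnostic property described above.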
However, strictly relying on motion association has an important drawback: it does not utilize the unique characteristics of any one sensor. For example, pixel proximity and distance information are valuable for conducting box association with LiDAR or 3D camera data. In our work, we show that utilizing the thermal identity of objects in the two-stage trackers' tracklet association step can lead to more accurate box association. Instead of strictly relying on motion association, we devise a box association algorithm that utilizes both the motion information and the thermal identity of objects.
The MOT challenges [28, 37, 46, 11, 12, 10, 39] provide very popular benchmarks for MOT with RGB images. Their datasets offer a wide variety of scenes, including many busy pedestrian sequences. However, there is a lack of large public datasets for the MOT task with thermal images, severely limiting research on this task.
1.2 Contributions
The main contributions of this paper are:
- The development, annotation, and forthcoming public release of a unique dataset that, to the best of our knowledge, is comparable in size and urban-environment setting to the MOT17 benchmark. Uniquely, our dataset integrates matching RGB and thermal images across five different pedestrian crossing locations, offering a comprehensive resource for both detection and multi-object tracking tasks.
- We introduce a novel box association method for use with two-stage MOT models running in the thermal spectrum that utilizes the unique characteristics of thermal data and combines them with motion data for robust MOT. Although this method focuses on the unique characteristics of thermal imagery, we believe our work will encourage the research and development of algorithms that leverage the unique attributes of any sensor when conducting MOT; our approach is thereby generalizable to other sensor modalities as well.
- We provide an initial benchmark of the performance of two state-of-the-art two-stage MOT models on both modalities of our dataset (RGB and thermal).
2 Literature Review
Our work builds upon the existing literature of object detection, two-stage multiple object tracking, the use of thermal sensors for computer vision tasks, and existing MOT datasets with thermal images.
2.1 Object Detection
The task of object detection consists of detecting, in a scene, each object that belongs to a predefined list of categories. Approaches can be divided into two main categories depending on whether the detection pipeline generates object proposals to be refined by a second stage (two-stage object detection) [18, 17, 41, 5, 19, 13] or not (one-stage) [40, 34, 31, 27]. Models in the latter category include SSD [34] and YOLO [40], and generally offer a simpler and faster architecture, while still achieving results competitive with two-stage models.
The Task-aligned One-stage Object Detection (TOOD) network, designed by Feng et al. [16], iterates upon preceding one-stage object detection networks. Its architecture includes the Task-aligned Head (T-Head) which improves feature sharing for the classification and localization sub-tasks, as opposed to using separate network heads for each one. The Task-aligned Predictor (TAP) part of the T-head then improves the alignment of the classification and localization predictions to better combine them. The training of TOOD is also modified through Task Alignment Learning (TAL) which improves default anchor proposals.
2.2 Two-Stage Multiple Object Trackers
Two-stage multi-object tracking divides the MOT solution into two sub-tasks, one per stage: (1) per-frame object detection and (2) tracking over a sequence. This tracking-by-detection approach enables the easy and direct use of existing state-of-the-art object detection networks as a high-accuracy first stage and places the design focus on the tracking stage [2, 49, 48, 52, 51, 6]. It also allows the training of the first stage on object detection datasets that do not necessarily have tracking annotations, decoupled from the MOT task. The main downside is the second stage's limited ability to recover from mistakes in the detection stage, whether false positives or false negatives.
The Simple Online and Realtime Tracking (SORT) approach was introduced by Bewley et al. [2] and combines any of a variety of CNN object detectors with a straightforward motion estimation tracking approach. After detection in a frame, a Kalman filter [22] is used to approximate the velocity and the future location of each tracklet. For a new frame, the IoU between the approximated location of the tracklets and the detector’s predictions is used to assign either an existing tracklet identification or a new ID altogether. Wojke et al.’s DeepSORT [49] improved upon the SORT approach by introducing a CNN-based motion and appearance association metric that strongly improves performance.
Association methods can rely strongly on a detected object's confidence score to decide which boxes to use [48, 52, 36, 44]. Going against this approach, Zhang et al. [51] introduced the ByteTrack tracker using their BYTE association method, which uses detections whether they have high or low confidence. Specifically, low-confidence detections are not discarded but are instead given lower matching priority with existing tracklets.
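A schematic sketch of this two-pass idea, reusing the `associate` helper from the earlier sketch (the confidence threshold is illustrative, not ByteTrack's exact configuration):

```python
def byte_style_association(track_boxes, det_boxes, det_scores, high_thresh=0.6):
    """First pass: high-confidence detections against all tracks.
    Second pass: low-confidence detections against the leftover tracks.
    Returned index pairs refer to the sub-lists used in each pass."""
    high = [d for d, s in zip(det_boxes, det_scores) if s >= high_thresh]
    low = [d for d, s in zip(det_boxes, det_scores) if s < high_thresh]

    first = associate(track_boxes, high)
    matched = {r for r, _ in first}
    leftover = [t for i, t in enumerate(track_boxes) if i not in matched]

    # Low-score detections get lower priority: they only compete for
    # tracks that no high-score detection claimed.
    second = associate(leftover, low)
    return first, second
```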
OCSort, designed by Cao et al. [6], improves directly on shortcomings of the SORT approach in tracking occluded objects, especially when their motion is non-linear. To combat the issue of error accumulation when tracking an object that has been temporarily lost, the authors propose the Observation-centric Re-Update (ORU) strategy. ORU leverages observations from a virtual trajectory to recalibrate the parameters of the Kalman filter. Furthermore, the authors introduce an Observation-Centric Momentum (OCM) term within the association cost function, prioritizing the use of observations over estimations to improve the accuracy of motion direction estimation. An Observation-Centric Recovery (OCR) technique is also introduced, which aids in recovering temporarily lost object tracks, such as those occluded or momentarily stationary. Together, these methodological innovations constitute the OC-SORT algorithm, which is developed as an enhancement to ByteTrack [51].
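As a loose illustration of the direction-consistency idea behind OCM, the following is our own simplified sketch operating on box centers; it is not OC-SORT's exact formulation:

```python
import numpy as np

def direction_cost(prev_center, last_center, det_center, eps=1e-9):
    """Angle (in radians) between the track's observed motion direction
    and the direction implied by a candidate detection; smaller values
    mean more consistent motion."""
    v_track = np.asarray(last_center) - np.asarray(prev_center)
    v_cand = np.asarray(det_center) - np.asarray(last_center)
    denom = np.linalg.norm(v_track) * np.linalg.norm(v_cand) + eps
    cos = np.dot(v_track, v_cand) / denom
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))
```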
2.3 Thermal Sensors
Thermal sensors are being used to great effect for a variety of computer vision tasks either on their own [29, 26, 38, 23, 14, 24, 43, 42] or through the use of sensor fusion with RGB cameras [9, 4, 3, 1].
Lee et al. [29] and Lahmyed et al. [25] both use an RGB camera and a thermal camera for the detection of pedestrians. In the former, motion is detected independently in each sensor, and is then used to predict pedestrian location in both sensors.
Lahouli et al. [26] devise a low computational cost method of tracking pedestrians using compressed thermal images taken by Unmanned Aerial Vehicles (UAVs). The suggested approach follows a tracking-by-detection framework where Regions of Interest (ROIs) are proposed and refined using saliency maps and contrast enhancement techniques. These are then tracked over consecutive images using the MPEG compression algorithm’s motion vectors.
Nowruzi et al. [38] propose a method to detect the number of passengers in a vehicle. A low-cost CNN is used to detect individuals while meeting the requirements for use in an embedded system. Thermal images are used for their privacy-preserving quality, as it is much harder to identify individuals from the features present in a thermal image than an RGB image.
Ahmar et al. [14, 1] collected a dataset of matching RGB and thermal frames with object detection and MOT annotations. The authors used this dataset to compare detection and tracking performance from RGB and thermal data. In addition, they studied the use of multi-modal sensor fusion from the two modalities and proposed a new fusion method which noticeably increases object detection performance over existing fusion approaches.
2.4 Datasets
Most existing tracking datasets that include thermal images are annotated for single-object tracking [32, 33, 30]. For the task of MOT, the City-Scene dataset [14, 1] contains 15 sequences for a total of 1,997 annotated frames for both a FLIR thermal camera and an RGB camera. However, none of the existing thermal MOT datasets offer a sufficient volume of data for multi-object pedestrian tracking. With limited sequences and annotated frames, these datasets are inadequate for training and evaluating MOT algorithms in complex real-world scenarios. Recognizing this limitation, we undertook the collection and annotation of a new, large-scale dataset containing both RGB and corresponding thermal data. This initiative addresses the need for a more extensive and diverse dataset collected in real-world scenarios, allowing for the development and assessment of MOT algorithms in challenging, real-world conditions in both the thermal and color spectrums.
3 Data Collection
We utilize a FLIR ADK thermal sensor and a JAI GO-5100C RGB sensor for the collection of the new dataset. The FLIR ADK has a 75° horizontal field of view, operates in the 8-14 micron (LWIR) spectral band, has a thermal sensitivity of less than 50 mK, consumes an average of 4 W of power, and offers an image resolution of 640x512. The JAI GO-5100C features a global shutter, consumes 4.35 W of power, and can achieve frame rates of up to 74 frames per second. A plastic enclosure was built to fix both sensors next to each other, and was mounted on a tripod for data collection.
The dataset of 30 sequences (9,000 frames per modality in total) was collected at five different intersections in public spaces around an urban campus. RGB and thermal samples from the dataset are shown in Figure 1. The dataset was then annotated for multiple object detection and tracking. We refer to this dataset as the RGB-Thermal MOT dataset.
To the best of our knowledge, this is the world's first large-scale dataset of RGB and corresponding thermal images annotated for MOT. We believe that this dataset will prove to be a valuable resource for research and development in both thermal and RGB MOT. Additional statistics related to the dataset are given in Table 1.
In the collection of our RGB-Thermal pedestrian dataset, we diligently followed local regulations and ethical guidelines to ensure the respectful and responsible use of data. We sought advice from local authorities to align with privacy and data protection standards, emphasizing the responsible use of urban imagery. The dataset, aimed for pedestrian detection and multi-object tracking research, was collected under conditions that respect public space and individual privacy. This process reflects our commitment to ethical research practices, contributing to the field’s advancement while upholding high ethical standards.
Figure 1: RGB and thermal samples from the RGB-Thermal MOT dataset.
Table 1: Statistics of the RGB-Thermal MOT dataset.

| | Thermal | RGB |
|---|---|---|
| Total number of annotated sequences | 30 (1 minute each) | 30 (1 minute each) |
| Capture frame rate (fps) | 5 | 5 |
| Total frames per sequence | 300 | 300 |
| Total number of annotations | 58,590 | 50,400 |
| Average annotations per image | 6.51 | 5.6 |
| Sequence train/test split | Train: 24, Test: 6 | Train: 24, Test: 6 |
| Total unique tracks (train/test) | Train: 313, Test: 126 | Train: 284, Test: 116 |
4 Experiments
Our study focuses specifically on enhancing the box association step for Multiple Object Tracking (MOT) in thermal imagery. We leverage two state-of-the-art two-stage trackers, ByteTrack [51] and OCSORT [6], as the basis for our experiments. This focused approach allows us to isolate and evaluate the impact of our novel thermal box association method.
Our work is aimed at advancing the precision of box association in thermal MOT scenarios, a critical area that benefits from the integration of thermal characteristics into tracking algorithms. Because the detection framework is kept fixed, observed enhancements in tracking accuracy can be attributed directly to our proposed method. This ensures that the contribution of incorporating thermal information into the box association step is rigorously assessed, highlighting its significance in the advancement of MOT technologies.
In our work, we fine-tune the TOOD [16] detector with a ResNet-50 [21] backbone on images from the newly collected RGB-Thermal MOT dataset. We use the OpenMMLab MMDetection framework [7] for training the object detector. The outcome of this training is two models (a sketch of such a fine-tuning configuration follows the list):
- A TOOD model initialized with COCO weights and fine-tuned on the RGB images of the RGB-Thermal MOT dataset.
- A TOOD model initialized with COCO weights and fine-tuned on the thermal images of the RGB-Thermal MOT dataset.
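A hedged sketch of what such a fine-tuning configuration might look like in the MMDetection 2.x config style; the dataset paths, annotation file names, and checkpoint location are our assumptions, not the actual training setup:

```python
# Hypothetical MMDetection 2.x config sketch for fine-tuning TOOD.
_base_ = 'configs/tood/tood_r50_fpn_1x_coco.py'  # TOOD + ResNet-50 baseline

classes = ('pedestrian',)
data_root = 'data/rgb_thermal_mot/'  # assumed dataset location

# Single pedestrian class instead of the 80 COCO classes.
model = dict(bbox_head=dict(num_classes=1))

data = dict(
    train=dict(classes=classes,
               ann_file=data_root + 'annotations/thermal_train.json',
               img_prefix=data_root + 'thermal/train/'),
    val=dict(classes=classes,
             ann_file=data_root + 'annotations/thermal_val.json',
             img_prefix=data_root + 'thermal/val/'),
    test=dict(classes=classes,
              ann_file=data_root + 'annotations/thermal_test.json',
              img_prefix=data_root + 'thermal/test/'))

# Initialize from COCO-pretrained TOOD weights and fine-tune.
load_from = 'checkpoints/tood_r50_fpn_1x_coco.pth'
```

The RGB variant would differ only in its annotation files and image prefixes.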
4.1 Benchmarking SOTA MOT Models on the RGB-Thermal MOT Dataset
We use the OpenMMLab MMTracking framework [8] to benchmark two SOTA MOT models, ByteTrack [51] and OCSORT [6], on both the RGB and thermal sequences of the RGB-Thermal MOT dataset. The corresponding RGB/thermal TOOD detectors described above are used as the detection stage for the trackers.
We optimize the hyperparameters of both ByteTrack and OCSORT to maximize their MOT metrics (specifically MOTA and IDF1) under their standard implementations. This step is crucial to ensure a fair comparison of our box association approach against the highest achievable performance of each of the two trackers.
Furthermore, we conduct a performance evaluation of ByteTrack and OCSORT on the RGB sequences within the RGB-Thermal MOT dataset. This assessment serves as a valuable reference for future users of this recently acquired dataset.
4.2 Development of New MOT Box Association Method Utilizing Thermal Information
To enhance Multiple Object Tracking (MOT) capabilities in thermal imagery, we introduce a novel method that leverages the motion and thermal identity characteristics of detected objects to improve the quality of box association. This approach aims to capitalize on the unique, yet sparse, thermal signatures captured in thermal imaging, a dimension that standard MOT models typically overlook.
The core of our proposed algorithm is the integration of thermal and motion data to establish a robust tracking framework, as outlined below:
Thermal and Motion Bounding Boxes:
Let $T$ and $D$ denote the sets of tracking boxes predicted using a Kalman filter and of detection bounding boxes, respectively, within a given thermal image, denoted as $I$.
Initialization:
The conversion of bounding box coordinates into NumPy arrays is a fundamental step for efficient manipulation and computation within our algorithm:

$$T \leftarrow \mathrm{array}(T), \qquad D \leftarrow \mathrm{array}(D) \qquad (1)$$

This step lays the groundwork for the algorithm's operations.
Similarity Matrix Construction:
We define $S_t$, an $|T| \times |D|$ matrix, to quantify the thermal similarity between each pair of $t_i \in T$ and $d_j \in D$. Initially,

$$S_t[i, j] = 0 \quad \forall\, i, j \qquad (2)$$

This initialization is a preparatory step for accumulating similarity scores, rather than indicating an absence of similarity.
Pairwise Histogram Comparison:
For each $t_i$ in $T$ and $d_j$ in $D$, their respective Regions of Interest (ROIs) are extracted from $I$, denoted as $R_{t_i}$ and $R_{d_j}$. The histograms $H_{t_i}$ and $H_{d_j}$ are then calculated and normalized to $\hat{H}_{t_i}$ and $\hat{H}_{d_j}$, using appropriate bin sizes and ranges to capture the thermal characteristics. The normalization process ensures histograms are on a uniform scale, essential for accurate comparison. The similarity is computed using the Bhattacharyya coefficient, a robust measure for histogram comparison, to populate $S_t$ with meaningful values.
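For $B$-bin normalized histograms, the Bhattacharyya coefficient takes its standard form (the bin count $B$ is an implementation choice not fixed by the description above):

$$S_t[i, j] = \mathrm{BC}\big(\hat{H}_{t_i}, \hat{H}_{d_j}\big) = \sum_{k=1}^{B} \sqrt{\hat{H}_{t_i}(k)\, \hat{H}_{d_j}(k)}$$

A coefficient of 1 indicates identical thermal distributions, while 0 indicates no overlap.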
Integration with Motion Similarity:
The motion-based similarity matrix, $S_m$, obtained through standard MOT methodologies, is integrated with $S_t$ to form the comprehensive similarity matrix, $S$:

$$S = \alpha\, S_m + (1 - \alpha)\, S_t \qquad (3)$$
Here, $\alpha$ represents a carefully selected weighting factor that balances the contributions of motion and thermal similarities, optimized through experimental validation to ensure effective tracking performance. For ByteTrack, the best-performing value of $\alpha$ is 0.3, and for OCSORT it is 0.8 (Figure 2); in each case, $\alpha$ is selected to give the best trade-off between MOTA and IDF1.
The pseudo-code of the proposed association function is given in Algorithm 1.
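As a concrete reference, the following is a minimal NumPy sketch of steps (1)-(3) above; the bin count, the 8-bit value range, and the function names are our own assumptions rather than the paper's exact implementation:

```python
import numpy as np

def thermal_similarity(track_boxes, det_boxes, thermal_img, bins=32):
    """Builds the thermal similarity matrix S_t (Eqs. 1-2) from ROI
    histograms compared with the Bhattacharyya coefficient."""
    track_boxes = np.asarray(track_boxes, dtype=int)  # Eq. (1): array form
    det_boxes = np.asarray(det_boxes, dtype=int)
    S_t = np.zeros((len(track_boxes), len(det_boxes)))  # Eq. (2): init

    def normalized_hist(box):
        x1, y1, x2, y2 = box
        roi = thermal_img[y1:y2, x1:x2]  # ROI from the thermal image I
        h, _ = np.histogram(roi, bins=bins, range=(0, 255))  # 8-bit assumed
        h = h.astype(float)
        return h / max(h.sum(), 1e-9)  # normalize to a distribution

    t_hists = [normalized_hist(b) for b in track_boxes]
    d_hists = [normalized_hist(b) for b in det_boxes]
    for i, ht in enumerate(t_hists):
        for j, hd in enumerate(d_hists):
            # Bhattacharyya coefficient: 1 = identical distributions.
            S_t[i, j] = np.sum(np.sqrt(ht * hd))
    return S_t

def fused_similarity(S_m, S_t, alpha):
    """Eq. (3): weighted combination of motion and thermal similarity."""
    return alpha * S_m + (1.0 - alpha) * S_t
```

The fused matrix can then be handed to the tracker's existing assignment step in place of the purely motion-based one.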
5 Results
5.1 Benchmarking Standard MOT Models Running on RGB
The results of the standard implementations of ByteTrack and OCSORT on the RGB sequences of our newly collected dataset are given in Table 2.
Table 2 (top): ByteTrack (standard implementation) on the RGB validation sequences.

Val RGB Sequence | IDF1 | IDP | IDR | Rcll | Prcn | MOTA | MOTP |
---|---|---|---|---|---|---|---|
2 | 56.8% | 67.1% | 49.3% | 68.6% | 93.4% | 62.1% | 0.177 |
17 | 64.6% | 71.8% | 58.8% | 78.1% | 95.3% | 68.9% | 0.150 |
22 | 87.2% | 93.8% | 81.4% | 82.2% | 94.8% | 77.0% | 0.115 |
47 | 58.7% | 72.6% | 49.3% | 64.6% | 95.1% | 59.6% | 0.163 |
54 | 67.8% | 69.3% | 66.4% | 75.4% | 78.7% | 53.1% | 0.133 |
66 | 71.1% | 72.5% | 69.8% | 92.6% | 96.2% | 88.0% | 0.129 |
OVERALL | 63.6% | 72.5% | 56.7% | 72.4% | 92.7% | 64.7% | 0.152 |
Table 2 (bottom): OCSORT (standard implementation) on the RGB validation sequences.

Val RGB Sequence | IDF1 | IDP | IDR | Rcll | Prcn | MOTA | MOTP |
---|---|---|---|---|---|---|---|
2 | 46.6% | 49.7% | 43.9% | 70.3% | 79.6% | 48.9% | 0.179 |
17 | 55.2% | 55.1% | 55.3% | 84.5% | 84.1% | 58.3% | 0.161 |
22 | 82.5% | 82.8% | 82.2% | 86.8% | 87.4% | 70.4% | 0.125 |
47 | 53.6% | 60.1% | 48.3% | 69.6% | 86.7% | 53.9% | 0.176 |
54 | 58.3% | 55.7% | 61.0% | 78.0% | 71.2% | 42.9% | 0.140 |
66 | 67.0% | 59.6% | 76.5% | 93.5% | 72.9% | 57.4% | 0.131 |
OVERALL | 56.8% | 58.6% | 55.1% | 76.3% | 81.0% | 53.7% | 0.160 |
5.2 Weighted-Average Alpha Value Selection
In order to accurately find the best alpha value (the weight $\alpha$ of the motion distance matrix, with $1-\alpha$ the corresponding weight of the thermal distance matrix used to calculate the comprehensive distance matrix), we calculate the MOTA and IDF1 generated through our approach on the validation sequences of the RGB-Thermal MOT dataset, sweeping $\alpha$ as sketched below. The results are given in Figure 2.
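Schematically, the sweep can be expressed as follows, where `run_tracker` and `evaluate_mot` are hypothetical stand-ins for the actual tracking and evaluation pipeline:

```python
import numpy as np

# Hypothetical sweep of the weighting factor alpha on the validation set.
for alpha in np.linspace(0.0, 1.0, 11):  # alpha = 0.0, 0.1, ..., 1.0
    tracks = run_tracker(val_sequences, alpha=alpha)  # S = a*S_m + (1-a)*S_t
    mota, idf1 = evaluate_mot(tracks, val_annotations)
    print(f"alpha={alpha:.1f}  MOTA={mota:.1%}  IDF1={idf1:.1%}")
```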
The selected alpha value for each model should ideally reflect the best trade-off between MOTA and IDF1. Analyzing Figure 2, we can deduce the following:
- The overall performance of ByteTrack is better than that of OCSORT: the maximum MOTA achieved with ByteTrack reaches 66.4%, while the maximum MOTA achieved with OCSORT is 56.4%.
- A similar trend is observable in the IDF1 values of both models: the maximum IDF1 achieved with ByteTrack is 64.2%, while the maximum IDF1 achieved with OCSORT is 58.6%.
- For ByteTrack, the alpha value with the best trade-off between MOTA and IDF1 is 0.3, meaning that motion association contributes 30% of the comprehensive distance matrix while the thermal distance matrix contributes 70%. This shows that ByteTrack benefits significantly from the thermal similarity matrix.
- For OCSORT, the alpha value with the best trade-off between MOTA and IDF1 is 0.8, meaning that motion association contributes 80% of the comprehensive distance matrix while the thermal distance matrix contributes 20%. OCSORT thus relies primarily on motion association and benefits from the thermal similarity matrix to a smaller, though still measurable, extent.
- For an alpha value of 0, meaning that only the thermal distance matrix is used for box association, ByteTrack achieves impressive results with 63.4% MOTA and 55.1% IDF1, surpassing the best performance of OCSORT. This shows the clear benefit of using thermal similarity for box association, as it performs reasonably well even without utilizing motion association at all.
- The difference in optimal alpha values between ByteTrack and OCSORT can be attributed to the distinct ways each algorithm integrates motion similarity and additional cues within its tracking framework. Although both trackers emphasize motion, their sensitivity to the incorporation of thermal similarity, via the alpha parameter, differs due to variations in their internal mechanisms and in how they balance motion with other information sources. The observed discrepancy reflects the nuanced impact of thermal similarity on tracker performance and underscores the need to tailor the alpha value to the specific architecture and processing strategy of each tracker.
5.3 MOT Metrics Comparison
To evaluate the feasibility of our suggested box association method, we consider two state-of-the-art MOT models: ByteTrack [51] and OCSORT [6]. Both utilize motion association for the box association step. The detailed results can be found in Tables 3 and 4, while summarized metrics are given in Figure 4.
Analyzing the tables, the following conclusions can be made:
- ByteTrack benefits from utilizing thermal information when conducting box association: the overall MOTA and IDF1 for the original implementation of ByteTrack are 65.5% and 62.6% respectively, increasing to 66.4% and 63.8% respectively when using our proposed approach.
- A similar trend can be seen with OCSORT, with the overall MOTA and IDF1 values increasing from 54.4% and 57.8% respectively in the original implementation to 56.4% and 58.6% respectively when using our proposed box association method.
The results validate the effectiveness of our proposed method in enhancing tracking performance: utilizing thermal similarity proves to be beneficial for box association when combined with motion similarity.
The proposed approach combines thermal and motion similarity scores through a weighted average. This fusion method effectively integrates two sources of information, allowing the system to make decisions based on both thermal and motion aspects.
By considering both thermal and motion aspects, the tracking system becomes more robust and adaptable. When thermal data indicates a strong match between objects with similar thermal signatures, the system can prioritize thermal information. Conversely, when motion cues are reliable, they can take precedence. This adaptability makes the tracking system more resistant to false positives and negatives. In addition, thermal data can assist in handling occlusions, a common challenge in MOT. When one object obscures another, thermal signatures may still be distinguishable, allowing the system to maintain the identity of both objects.
Table 3 (top): ByteTrack with its original box association on the thermal validation sequences.

Val Thermal Sequence | IDF1 | IDP | IDR | Rcll | Prcn | MOTA | MOTP |
---|---|---|---|---|---|---|---|
2 | 51.2% | 69.3% | 40.6% | 56.7% | 96.7% | 52.5% | 0.184 |
17 | 71.5% | 79.1% | 65.2% | 75.8% | 92.0% | 65.7% | 0.173 |
22 | 75.4% | 84.9% | 67.9% | 74.4% | 93.0% | 67.3% | 0.163 |
47 | 59.1% | 62.2% | 56.4% | 81.5% | 89.9% | 70.1% | 0.170 |
54 | 63.8% | 65.7% | 62.0% | 77.8% | 82.4% | 59.2% | 0.176 |
66 | 67.9% | 75.5% | 61.7% | 78.5% | 95.9% | 74.4% | 0.154 |
OVERALL | 62.6% | 69.4% | 57.0% | 74.9% | 91.1% | 65.5% | 0.170 |
Table 3 (bottom): ByteTrack with our proposed box association on the thermal validation sequences.

Val Thermal Sequence | IDF1 | IDP | IDR | Rcll | Prcn | MOTA | MOTP |
---|---|---|---|---|---|---|---|
2 | 49.3% | 62.4% | 40.8% | 62.0% | 94.9% | 56.3% | 0.194 |
17 | 71.3% | 77.0% | 66.4% | 77.5% | 89.8% | 64.9% | 0.177 |
22 | 75.9% | 83.4% | 69.7% | 77.1% | 92.3% | 69.3% | 0.166 |
47 | 60.7% | 62.1% | 59.4% | 83.7% | 87.6% | 69.6% | 0.173 |
54 | 71.4% | 71.5% | 71.3% | 81.2% | 81.4% | 60.9% | 0.183 |
66 | 68.0% | 74.3% | 62.6% | 80.2% | 95.1% | 75.3% | 0.157 |
OVERALL | 63.8% | 68.6% | 59.6% | 77.7% | 89.4% | 66.4% | 0.175 |
Table 4 (top): OCSORT with its original box association on the thermal validation sequences.

Val Thermal Sequence | IDF1 | IDP | IDR | Rcll | Prcn | MOTA | MOTP |
---|---|---|---|---|---|---|---|
2 | 51.7% | 60.4% | 45.2% | 66.3% | 88.8% | 51.3% | 0.197 |
17 | 60.0% | 59.3% | 60.7% | 82.2% | 80.3% | 55.3% | 0.185 |
22 | 64.8% | 64.7% | 65.0% | 81.2% | 80.9% | 57.5% | 0.174 |
47 | 53.8% | 50.0% | 58.2% | 87.1% | 74.8% | 53.2% | 0.178 |
54 | 52.6% | 47.4% | 59.1% | 84.3% | 67.7% | 40.1% | 0.187 |
66 | 74.3% | 77.4% | 71.5% | 82.8% | 89.6% | 70.8% | 0.162 |
OVERALL | 57.8% | 56.8% | 58.7% | 81.3% | 78.7% | 54.4% | 0.180 |
Table 4 (bottom): OCSORT with our proposed box association on the thermal validation sequences.

Val Thermal Sequence | IDF1 | IDP | IDR | Rcll | Prcn | MOTA | MOTP |
---|---|---|---|---|---|---|---|
2 | 42.4% | 49.6% | 37.1% | 66.4% | 88.9% | 54.1% | 0.198 |
17 | 63.2% | 62.5% | 64.0% | 82.4% | 80.4% | 58.1% | 0.186 |
22 | 71.1% | 71.0% | 71.2% | 81.2% | 80.9% | 60.3% | 0.175 |
47 | 56.0% | 52.1% | 60.6% | 87.1% | 74.8% | 54.5% | 0.178 |
54 | 55.2% | 49.8% | 61.9% | 84.3% | 67.7% | 42.4% | 0.187 |
66 | 76.0% | 79.1% | 73.1% | 82.8% | 89.6% | 71.7% | 0.162 |
OVERALL | 58.6% | 57.7% | 59.6% | 81.4% | 78.7% | 56.4% | 0.180 |
5.4 Limitations
Our approach introduces an innovative use of thermal sensors for MOT, enhancing detection and tracking capabilities. However, it is pertinent to note the requirement for specialized thermal imaging equipment, which may not be universally accessible. Furthermore, the current application and validation of our methodology are confined to urban settings. The efficacy of our method in non-urban environments remains to be explored and would benefit from further diversification of the dataset to ensure broad applicability and robustness across varying scenarios.
6 Conclusion
In this paper, we have focused on enhancing the performance of MOT models operating in the thermal spectrum. Our key contribution lies in the introduction of a novel box association mechanism that harnesses both motion similarity and thermal object identity. This approach enhances tracking accuracy and robustness by considering not just how objects move but also their distinct thermal signatures. The thermal and motion aspects are aggregated through a weighted average, resulting in a comprehensive similarity matrix that combines the strengths of both modalities. A key contribution of this work is that this novel box association method can be integrated with any two-stage MOT approach operating in the thermal spectrum, and it encourages the exploration of unique spectrum characteristics when conducting box association. Given that two-stage MOT approaches are more robust and versatile than single-stage MOT models [52], we believe this work could inspire more innovative research in this field.
In addition, we introduced the world's largest (to the best of our knowledge) dataset comprising both RGB and corresponding thermal images, annotated for pedestrian MOT. We anticipate that this RGB-Thermal MOT dataset will be an invaluable resource for researchers in the fields of MOT and thermal vision perception. We fine-tuned state-of-the-art object detection models on this dataset, both for RGB and thermal images. Subsequently, we benchmarked leading MOT models on the dataset with and without our proposed box association method.
The results are compelling. Notably, ByteTrack and OCSort, two state-of-the-art MOT models, exhibited improved performance when our proposed box association method was employed. The fusion of thermal and motion-based similarity scores proved advantageous, making the tracking system more adaptable, robust to occlusions, and resistant to false positives and negatives.
References
- [1] Ahmar, W.E., Massoud, Y., Kolhatkar, D., AlGhamdi, H., Alja’Afreh, M., Laganiere, R., Hammoud, R.: Enhanced thermal-RGB fusion for robust object detection. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). IEEE (jun 2023). https://doi.org/10.1109/CVPRW59228.2023.00042
- [2] Bewley, A., Ge, Z., Ott, L., Ramos, F., Upcroft, B.: Simple online and realtime tracking. In: 2016 IEEE International Conference on Image Processing (ICIP). IEEE (sep 2016). https://doi.org/10.1109/icip.2016.7533003
- [3] Brenner, M., Reyes, N.H., Susnjak, T., Barczak, A.L.C.: RGB-d and thermal sensor fusion: A systematic literature review. IEEE Access 11, 82410–82442 (2023). https://doi.org/10.1109/access.2023.3301119
- [4] Broyles, D., Hayner, C.R., Leung, K.: WiSARD: A labeled visual and thermal image dataset for wilderness search and rescue. In: 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE (oct 2022). https://doi.org/10.1109/iros47612.2022.9981298
- [5] Cai, Z., Vasconcelos, N.: Cascade r-cnn: Delving into high quality object detection. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE (jun 2017). https://doi.org/10.1109/cvpr.2018.00644
- [6] Cao, J., Pang, J., Weng, X., Khirodkar, R., Kitani, K.: Observation-centric SORT: Rethinking SORT for robust multi-object tracking. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE (jun 2023). https://doi.org/10.1109/CVPR52729.2023.00934
- [7] Chen, K., Wang, J., Pang, J., Cao, Y., Xiong, Y., Li, X., Sun, S., Feng, W., Liu, Z., Xu, J., Zhang, Z., Cheng, D., Zhu, C., Cheng, T., Zhao, Q., Li, B., Lu, X., Zhu, R., Wu, Y., Dai, J., Wang, J., Shi, J., Ouyang, W., Loy, C.C., Lin, D.: MMDetection: Open mmlab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155 (2019)
- [8] MMTracking Contributors: MMTracking: OpenMMLab video perception toolbox and benchmark. https://github.com/open-mmlab/mmtracking (2020)
- [9] Dai, X., Yuan, X., Wei, X.: TIRNet: Object detection in thermal infrared images for autonomous driving. Applied Intelligence 51(3), 1244–1261 (sep 2020). https://doi.org/10.1007/s10489-020-01882-2
- [10] Dave, A., Khurana, T., Tokmakov, P., Schmid, C., Ramanan, D.: TAO: A large-scale benchmark for tracking any object. In: Computer Vision – ECCV 2020, pp. 436–454. Springer International Publishing (2020). https://doi.org/10.1007/978-3-030-58558-7_26
- [11] Dendorfer, P., Rezatofighi, H., Milan, A., Shi, J., Cremers, D., Reid, I., Roth, S., Schindler, K., Leal-Taixe, L.: Cvpr19 tracking and detection challenge: How crowded can it get? (2019). https://doi.org/10.48550/ARXIV.1906.04567
- [12] Dendorfer, P., Rezatofighi, H., Milan, A., Shi, J., Cremers, D., Reid, I., Roth, S., Schindler, K., Leal-Taixé, L.: Mot20: A benchmark for multi object tracking in crowded scenes (2020). https://doi.org/10.48550/ARXIV.2003.09003
- [13] Duan, K., Bai, S., Xie, L., Qi, H., Huang, Q., Tian, Q.: Centernet: Keypoint triplets for object detection. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV). IEEE (oct 2019). https://doi.org/10.1109/iccv.2019.00667
- [14] El Ahmar, W.A., Kolhatkar, D., Nowruzi, F.E., AlGhamdi, H., Hou, J., Laganiere, R.: Multiple object detection and tracking in the thermal spectrum. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 277–285 (2022)
- [15] Feichtenhofer, C., Pinz, A., Zisserman, A.: Detect to track and track to detect. In: 2017 IEEE International Conference on Computer Vision (ICCV). IEEE (oct 2017). https://doi.org/10.1109/iccv.2017.330
- [16] Feng, C., Zhong, Y., Gao, Y., Scott, M.R., Huang, W.: Tood: Task-aligned one-stage object detection. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV). pp. 3490–3499. IEEE Computer Society (2021)
- [17] Girshick, R.: Fast r-cnn. In: 2015 IEEE International Conference on Computer Vision (ICCV). IEEE (dec 2015). https://doi.org/10.1109/iccv.2015.169
- [18] Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: 2014 IEEE Conference on Computer Vision and Pattern Recognition. IEEE (Nov 2014). https://doi.org/10.1109/cvpr.2014.81
- [19] He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask r-CNN. In: 2017 IEEE International Conference on Computer Vision (ICCV). IEEE (oct 2017). https://doi.org/10.1109/iccv.2017.322
- [20] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (Dec 2015). https://doi.org/10.1109/cvpr.2016.90
- [21] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 770–778 (2016)
- [22] Kalman, R.E.: A new approach to linear filtering and prediction problems. Journal of Basic Engineering 82(1), 35–45 (mar 1960). https://doi.org/10.1115/1.3662552
- [23] Kristo, M., Ivasic-Kos, M., Pobar, M.: Thermal object detection in difficult weather conditions using YOLO. IEEE Access 8, 125459–125476 (2020). https://doi.org/10.1109/access.2020.3007481
- [24] Kutuk, Z., Algan, G.: Semantic segmentation for thermal images: A comparative survey. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). IEEE (jun 2022). https://doi.org/10.1109/cvprw56347.2022.00043
- [25] Lahmyed, R., Ansari, M.E., Ellahyani, A.: A new thermal infrared and visible spectrum images-based pedestrian detection system. Multimedia Tools and Applications 78(12), 15861–15885 (dec 2018). https://doi.org/10.1007/s11042-018-6974-5
- [26] Lahouli, I., Haelterman, R., Chtourou, Z., Cubber, G.D., Attia, R.: Pedestrian detection and tracking in thermal images from aerial MPEG videos. In: Proceedings of the 13th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications. SCITEPRESS - Science and Technology Publications (2018). https://doi.org/10.5220/0006723704870495
- [27] Law, H., Deng, J.: CornerNet: Detecting objects as paired keypoints. International Journal of Computer Vision 128(3), 642–656 (aug 2018). https://doi.org/10.1007/s11263-019-01204-1
- [28] Leal-Taixé, L., Milan, A., Reid, I., Roth, S., Schindler, K.: Motchallenge 2015: Towards a benchmark for multi-target tracking (2015). https://doi.org/10.48550/ARXIV.1504.01942
- [29] Lee, J., Choi, J.S., Jeon, E., Kim, Y., Le, T., Shin, K., Lee, H., Park, K.: Robust pedestrian detection by combining visible and thermal infrared cameras. Sensors 15(5), 10580–10615 (may 2015). https://doi.org/10.3390/s150510580
- [30] Li, C., Xue, W., Jia, Y., Qu, Z., Luo, B., Tang, J., Sun, D.: LasHeR: A large-scale high-diversity benchmark for RGBT tracking. IEEE Transactions on Image Processing 31, 392–404 (2022). https://doi.org/10.1109/tip.2021.3130533
- [31] Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: 2017 IEEE International Conference on Computer Vision (ICCV). IEEE (oct 2017). https://doi.org/10.1109/iccv.2017.324
- [32] Liu, Q., He, Z., Li, X., Zheng, Y.: PTB-TIR: A thermal infrared pedestrian tracking benchmark. IEEE Transactions on Multimedia 22(3), 666–675 (mar 2020). https://doi.org/10.1109/tmm.2019.2932615
- [33] Liu, Q., Li, X., He, Z., Li, C., Li, J., Zhou, Z., Yuan, D., Li, J., Yang, K., Fan, N., Zheng, F.: LSOTB-TIR: A large-scale high-diversity thermal infrared object tracking benchmark. In: Proceedings of the 28th ACM International Conference on Multimedia. ACM (oct 2020). https://doi.org/10.1145/3394171.3413922
- [34] Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg, A.C.: SSD: Single shot MultiBox detector. Computer Vision – ECCV 2016 pp. 21–37 (Dec 2015). https://doi.org/10.1007/978-3-319-46448-0_2
- [35] Liu, Z., Mao, H., Wu, C.Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (Jan 2022). https://doi.org/10.1109/cvpr52688.2022.01167
- [36] Lu, Z., Rathod, V., Votel, R., Huang, J.: RetinaTrack: Online single stage joint detection and tracking. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE (jun 2020). https://doi.org/10.1109/cvpr42600.2020.01468
- [37] Milan, A., Leal-Taixe, L., Reid, I., Roth, S., Schindler, K.: Mot16: A benchmark for multi-object tracking (2016). https://doi.org/10.48550/ARXIV.1603.00831
- [38] Nowruzi, F.E., Ahmar, W.A.E., Laganiere, R., Ghods, A.H.: In-vehicle occupancy detection with convolutional networks on thermal images. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). IEEE (jun 2019). https://doi.org/10.1109/cvprw.2019.00124
- [39] Pedersen, M., Haurum, J.B., Bengtson, S.H., Moeslund, T.B.: 3d-ZeF: A 3d zebrafish tracking benchmark dataset. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE (jun 2020). https://doi.org/10.1109/cvpr42600.2020.00250
- [40] Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: Unified, real-time object detection. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE (jun 2015). https://doi.org/10.1109/cvpr.2016.91
- [41] Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 39(6), 1137–1149 (Jun 2015). https://doi.org/10.1109/tpami.2016.2577031
- [42] Rivadeneira, R.E., Sappa, A.D., Vintimilla, B.X., Wang, C., Jiang, J., Liu, X., Zhong, Z., Bin, D., Ruodi, L., Shengye, L.: Thermal image super-resolution challenge results - PBVS 2023. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). IEEE (jun 2023). https://doi.org/10.1109/cvprw59228.2023.00053
- [43] Shin, U., Park, J., Kweon, I.S.: Deep depth estimation from thermal image. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE (jun 2023). https://doi.org/10.1109/cvpr52729.2023.00107
- [44] Sun, P., Cao, J., Jiang, Y., Zhang, R., Xie, E., Yuan, Z., Wang, C., Luo, P.: Transtrack: Multiple object tracking with transformer (2020). https://doi.org/10.48550/ARXIV.2012.15460
- [45] Tan, M., Le, Q.V.: Efficientnet: Rethinking model scaling for convolutional neural networks. International Conference on Machine Learning, 2019 (May 2019)
- [46] Voigtlaender, P., Krause, M., Osep, A., Luiten, J., Sekar, B.B.G., Geiger, A., Leibe, B.: MOTS: Multi-object tracking and segmentation. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE (jun 2019). https://doi.org/10.1109/cvpr.2019.00813
- [47] Wang, C.Y., Bochkovskiy, A., Liao, H.Y.M.: Scaled-YOLOv4: Scaling cross stage partial network. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (jun 2021). https://doi.org/10.1109/cvpr46437.2021.01283
- [48] Wang, Z., Zheng, L., Liu, Y., Li, Y., Wang, S.: Towards real-time multi-object tracking. In: Computer Vision – ECCV 2020, pp. 107–122. Springer International Publishing (Sep 2019). https://doi.org/10.1007/978-3-030-58621-8_7
- [49] Wojke, N., Bewley, A., Paulus, D.: Simple online and realtime tracking with a deep association metric. In: 2017 IEEE International Conference on Image Processing (ICIP). IEEE (Mar 2017). https://doi.org/10.1109/icip.2017.8296962
- [50] Xie, S., Girshick, R., Dollár, P., Tu, Z., He, K.: Aggregated residual transformations for deep neural networks. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE (Nov 2016). https://doi.org/10.1109/CVPR.2017.634
- [51] Zhang, Y., Sun, P., Jiang, Y., Yu, D., Weng, F., Yuan, Z., Luo, P., Liu, W., Wang, X.: Bytetrack: Multi-object tracking by associating every detection box. In: Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXII. pp. 1–21. Springer (2022)
- [52] Zhang, Y., Wang, C., Wang, X., Zeng, W., Liu, W.: FairMOT: On the fairness of detection and re-identification in multiple object tracking. International Journal of Computer Vision 129(11), 3069–3087 (Apr 2020). https://doi.org/10.1007/s11263-021-01513-4
- [53] Zhou, X., Koltun, V., Krähenbühl, P.: Tracking objects as points. In: Computer Vision – ECCV 2020, pp. 474–490. Springer International Publishing (2020). https://doi.org/10.1007/978-3-030-58548-8_28