
LWIRPOSE: A Novel LWIR Thermal Image Dataset and Benchmark

Abstract

Human pose estimation faces hurdles in real-world applications due to factors like lighting changes, occlusions, and cluttered environments. We introduce a unique RGB-Thermal Nearly Paired and Annotated 2D Pose Dataset, comprising over 2,400 high-quality LWIR (thermal) images. Each image is meticulously annotated with 2D human poses, offering a valuable resource for researchers and practitioners. Captured from seven actors performing diverse everyday activities such as sitting, eating, and walking, the dataset facilitates pose estimation under occlusion and in other challenging scenarios. We benchmark state-of-the-art pose estimation methods on the dataset to showcase its potential, establishing a strong baseline for future research. Our results demonstrate the dataset's effectiveness in promoting advancements in pose estimation for various applications, including surveillance, healthcare, and sports analytics. The dataset and code are available at https://github.com/avinres/LWIRPOSE

Index Terms—  Human Pose Estimation, Thermal Imagery, Thermal 2D pose estimation

1 Introduction

Thermal imaging technology captures temperature differences in the environment, allowing humans to be detected even in low-light conditions or through smoke, fog, or other visual obstructions. This makes thermal imaging particularly useful for military and defense applications where situational awareness is critical. For instance, 2D human pose estimation on thermal images can be used to track the movements of soldiers and other objects, enabling more accurate targeting and improved battlefield management. Additionally, thermal imaging can be employed in search and rescue missions, allowing rescuers to locate survivors in burning buildings or under debris. Beyond military and defense applications, 2D human pose estimation on thermal images has numerous civilian uses. For example, it can be utilized in healthcare to monitor patients' vital signs and detect abnormalities in their body posture in darkness [1]. In sports analytics, it can be applied to assess athletes' performance and optimize their training regimens. Furthermore, in smart home automation, it can enable intelligent systems to recognize and respond to users' gestures and movements.

Despite the potential benefits of 2D human pose estimation on thermal images, there is a conspicuous absence of research in this area. Most existing works [2, 3, 4, 5] focus on visible light images, leaving thermal imaging largely unexplored. As a result, there is limited understanding of the challenges specific to 2D human pose estimation on thermal images and of how to address them effectively.

The primary obstacle lies in the scarcity of large and diverse IR pose estimation datasets. Unlike the abundance of datasets such as COCO [6] and MPII Human Pose [7] for RGB images, IR datasets remain limited in size and scope. This hinders the training of robust and generalizable models, as they lack sufficient data to capture the full spectrum of human poses and variations. Additionally, IR images inherently differ in appearance and illumination from RGB images. They lack the rich textures and color information crucial for traditional pose estimation algorithms, forcing models to rely instead on thermal signatures that vary strongly with body temperature, clothing, and the ambient environment. This variability makes it difficult for models to extract reliable features for accurate pose estimation. Compared to the distinct edges and textures defining body contours in RGB, IR images often exhibit blurry outlines and low contrast between body parts and the background. This ambiguity in visual features makes it difficult for models to accurately identify keypoints and differentiate between limbs in complex poses. Finally, human bodies naturally self-occlude during movement, posing a challenge even in RGB images. However, self-occlusion becomes even more problematic in IR due to the lack of clear visual cues: thermal signatures can overlap, making it difficult to distinguish between limbs in contact and to accurately infer the underlying pose structure.

Fig. 1: Sample thermal images from the LWIRPose dataset. The samples belong to one subject performing 12 different activities. The images illustrate that the data contains complexities such as occlusion, self-occlusion, and noise.

To address these challenges, the presented dataset introduces deliberate complexities that reflect real-world scenarios, as shown in Figures 1 and 2. It encompasses a diverse range of:

  • Pose variations: The dataset includes a variety of poses, from simple standing positions to complex actions like walking and greeting, ensuring the model can handle diverse scenarios.

  • Body shapes: By including subjects with different body types, the dataset encourages the model to learn pose estimation independent of body shape variations.

  • Clothing: The presence of diverse clothing styles helps the model learn pose estimation without relying solely on skin texture, which can be obscured by clothing in IR images.

  • Self-occlusion: The inclusion of poses with self-occlusion challenges the model to infer pose even when parts of the body are hidden from view.

  • Different activities: Incorporating various activities like sitting, walking, and exercise exposes the model to a broader range of motion patterns, improving its generalizability.

By incorporating these complexities, the dataset aims to train models that are robust to the challenges inherent in IR pose estimation and can perform accurately in real-world applications. While the effectiveness of the dataset depends on its size, quality, and diversity, it represents a significant step towards overcoming the unique challenges of IR pose estimation and paving the way for more accurate and robust models in this domain.

The major contributions of this paper are:

  • A Unique and Diverse Dataset: We introduce a novel RGB-Thermal Nearly Paired and Annotated 2D Pose Dataset, boasting over 2,400 high-quality LWIR (thermal) images alongside corresponding near-paired RGB images. This extensive dataset captures seven individuals performing various daily activities like walking, sitting, eating, and more, encompassing a wide range of scenarios and motions. Each image is meticulously annotated with accurate 2D human poses in the MPII format, providing valuable ground truth for researchers and practitioners.

  • Benchmarking RGB-Based Methods: To demonstrate the dataset’s utility, we conduct a comprehensive evaluation of state-of-the-art RGB-based pose estimation methods. These methods, originally designed for RGB data, are fine-tuned on our novel dataset, establishing a robust baseline for future research endeavours. This evaluation highlights the effectiveness of our dataset in pushing the boundaries of pose estimation performance.

Fig. 2: Samples from the dataset. The nearly paired RGB-IR images were captured using the camera, and the annotated pose points are overlaid on the LWIR images. Four different subjects perform different activities.

2 Related Works

This section discusses related work on thermal pose estimation datasets, along with state-of-the-art pose estimation methods for RGB and thermal images.

2.1 Thermal Pose Estimation Dataset

Human pose estimation using RGB images has received extensive attention in recent years [2, 3, 4, 5]. Several datasets [6, 7] in the RGB domain are publicly available and have been widely used to develop 2D human pose estimation (2DHPE) algorithms. However, research involving thermal data remains scarce due to the limited availability of publicly accessible datasets. This work addresses this gap by introducing a novel dataset specifically designed for thermal pose estimation.

While a few thermal pose datasets exist, they have limitations. Chen et al. [8] proposed a thermal and visible image pose estimation dataset in indoor environments. They collected nearly 24,000 images and annotated the thermal images using the pose points obtained by running state-of-the-art (SotA) models on the RGB images; the same pose points, after scaling, serve as ground truth for the thermal images. However, the resolution of the thermal images is very low, and the pose annotation is not properly supervised. Smith et al. [9] proposed the UCH-ThermalPose dataset, containing both indoor and outdoor thermal images with annotated keypoints. However, its size is relatively small (nearly 600 images collected from different thermal image datasets and manually annotated by the authors), and it only targets static poses. Our dataset offers significant advancements over these efforts. It features:

  • Larger size: Over 2,400 high-resolution RGB and IR images.

  • Diverse activities: Actors performing various daily actions (walking, sitting, eating, etc.).

  • Near-paired RGB counterparts: Facilitates cross-modal comparison and analysis.

2.2 2D Human Pose Estimation on RGB and Thermal Images

Due to the availability of high-quality annotated 2D human pose estimation data in the RGB domain, several algorithms providing state-of-the-art results have been developed. Simple Baseline [10] utilizes a ResNet architecture to extract features and deconvolutional layers to estimate poses. HRNet [11] generates heatmaps using parallel multi-resolution subnetworks and fuses features at multiple scales; HRNet and its variants have been widely used in 2D human pose estimation tasks. OpenPose [12, 4] is another state-of-the-art model that uses part affinity fields (PAFs) to group heatmap-based keypoint coordinates. With the advent of transformer-based networks, various vision transformer [13] models were developed for pose estimation. ViTPose [3, 2] utilizes a vision transformer [13] to extract feature maps of the person and a CNN-based decoder to estimate the 2D pose points.

There has been a very limited number of works on thermal human pose estimation. Smith et al. [9] proposed a dataset and established a baseline using existing RGB 2D human pose estimation models. Chen et al. [8] proposed a CNN-based feature extraction and PAF-based decoding network for thermal pose estimation.

In this work, we also provide a baseline for our dataset by running different existing RGB-based 2D pose estimation networks.

3 Methodology

Table 1: LWIR imager specifications
Parameter Specification
Spectral Range 8 µm – 14 µm
Resolution 640 x 480 for RGB & IR
FoV 57°
Temperature Range -40 °C to 330 °C
Frame Rate < 9 Hz
Focus Fixed
Type Uncooled

3.1 Dataset Preparation

We leverage a Seek thermal camera with a long-wave infrared sensor operating in the 7–14 µm band to capture thermal signatures of human poses at a resolution of 640x480 pixels, ensuring a detailed representation of each subject. With the camera's advanced features, we were able to capture both RGB images and their corresponding IR images. Each image within the dataset is meticulously annotated with 17 keypoints, precisely pinpointing major body joints such as the head, shoulders, elbows, wrists, hips, knees, and ankles. This comprehensive set of pose points enables accurate estimation of various postures and movements, laying the groundwork for robust model training and evaluation. Moreover, to enhance the dataset's diversity and real-world applicability, we captured images across 12 distinct activity classes such as "Discussion," "Smoking," and "Walking." Random pose variations were captured within each class, resulting in a comprehensive representation of human movement.
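To make the annotation layout concrete, below is a minimal, hypothetical record for one frame; the field names, file path, and dummy coordinates are our own illustrative assumptions, not the dataset's documented schema.

```python
# Hypothetical annotation record for one LWIR frame. All field names and
# values are illustrative assumptions, not the dataset's actual schema.
sample = {
    "image": "S1/Walking/frame_0042.png",  # hypothetical relative path
    "subject": "S1",                       # one of the seven subjects S1-S7
    "activity": "Walking",                 # one of the 12 activity classes
    "keypoints": [[312.0, 87.5]] * 17,     # 17 (x, y) pixel coords (dummy values)
}
```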

While our dataset includes both RGB and corresponding infrared (IR) images, directly applying the HRNet model [11] to the IR images yielded unsatisfactory pose estimation results. To overcome this challenge, we adopted a two-step approach. First, we passed the RGB images through the HRNet model [11], obtaining accurate keypoint predictions. We then transferred these predicted keypoints to the corresponding IR images. The transferred keypoints were often inaccurate because of registration mismatch between the RGB-thermal image pairs; furthermore, many RGB-IR image pairs were not captured simultaneously, necessitating manual annotation of the entire dataset. We developed a custom annotation tool to streamline this process, thereby achieving robust pose ground truth in the thermal domain.
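As a rough illustration of the transfer step, the sketch below maps RGB-frame keypoints into the IR frame by per-axis rescaling; the helper name and the assumption of no rotation or translation between sensors are ours, and, as noted above, residual registration mismatch still required manual correction.

```python
# A minimal sketch of the RGB-to-IR keypoint transfer described above,
# assuming a pure per-axis rescaling between the two sensors (no rotation
# or translation). The function name is illustrative, not from the paper.
import numpy as np

def transfer_keypoints(kpts_rgb: np.ndarray,
                       rgb_size: tuple,
                       ir_size: tuple) -> np.ndarray:
    """Map (n_joints, 2) keypoints from RGB-frame to IR-frame coordinates."""
    sx = ir_size[0] / rgb_size[0]  # width scale factor
    sy = ir_size[1] / rgb_size[1]  # height scale factor
    return kpts_rgb * np.array([sx, sy])

# Both streams are 640x480 here (Table 1), so the scale is identity and any
# remaining error stems from sensor misalignment, which was fixed by hand.
```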

Table 2: The number of images and annotated 2D poses in the LWIRPose dataset, split into training and testing sets. We use six subjects (1 male and 5 females) with 2,106 images for training and one subject (male) with 355 images for testing.
Activity Training (S2+S3+S4+S5+S6+S7) Testing (S1) Total
Direction 155 25 180
Discussion 160 25 185
Eating 160 25 185
Greeting 154 25 179
Phone Talk 153 25 178
Posing 132 25 157
Purchases 151 25 176
Sitting 280 55 335
Smoking 154 25 179
Taking photo 180 25 205
Waiting 150 25 175
Walking 277 50 327
Total 2106 355 2461

3.2 Experimental Settings

We captured all data in a controlled indoor environment in which the camera remained fixed at a distance of 7 meters from every subject, ensuring consistent thermal signature capture across the dataset. This approach of leveraging RGB information, combined with custom annotation and controlled data acquisition, enabled us to create a high-quality IR pose estimation dataset, paving the way for further research in this domain. Table 2 reports the number of images for each activity.

3.3 Data Processing

The collected and annotated data was processed to train existing RGB-based 2D HPE models. We segregated the training and testing sets by subject: the training set includes images from subjects S2, S3, S4, S5, S6, and S7 performing various activities, while images of subject S1 are reserved for testing. Since the dataset is small, we omitted a separate validation set and instead used random samples from the training set to validate the models. All results reported in the paper were obtained on the testing set. The split between training and testing data, and the number of images in each set, is presented in Table 2.
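A minimal sketch of this subject-held-out split follows; the per-record "subject" field is an assumed structure for illustration only.

```python
# Subject-held-out split described above: S2-S7 for training, S1 for testing.
# The per-record "subject" field is an assumption, not a documented schema.
TRAIN_SUBJECTS = {"S2", "S3", "S4", "S5", "S6", "S7"}
TEST_SUBJECTS = {"S1"}

def split_by_subject(samples):
    """Partition annotation records into train/test lists by subject ID."""
    train = [s for s in samples if s["subject"] in TRAIN_SUBJECTS]
    test = [s for s in samples if s["subject"] in TEST_SUBJECTS]
    return train, test
```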

3.4 Evaluation Metric

The models are trained and evaluated using the Mean Per Joint Position Error (MPJPE) in pixels, widely used in pose estimation tasks. MPJPE is the mean Euclidean distance between the ground-truth 2D pose $p_{k,i}$ and the estimated 2D pose $\hat{p}_{k,i}$. To pre-train the encoder and decoder of the ResNet, the loss was calculated as:

L_{res}=\frac{1}{n}\sum_{i=1}^{n}\left\|p_{k,i}-\hat{p}_{k,i}\right\|_{2}

where $n$ is the number of joints in a pose, $i$ indexes the joints, and $k$ indexes the sample. A lower MPJPE indicates better pose point predictions.
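For concreteness, a minimal NumPy sketch of the MPJPE computation is given below; the (n_joints, 2) array convention is our assumption.

```python
# A minimal sketch of MPJPE as defined above, assuming 2D keypoints stored
# as (n_joints, 2) arrays in pixel coordinates.
import numpy as np

def mpjpe(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean Per Joint Position Error (pixels): the mean Euclidean distance
    between predicted and ground-truth joints."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())
```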

Further, the Percentage of Correct Keypoints (PCKh) metric was used to assess the localization accuracy of different keypoints within a given threshold. Specifically, we report [email protected], whose threshold is 50% of the head bone link length. A higher PCKh value indicates better model performance.
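The sketch below illustrates [email protected] under the same array conventions; how the head bone link length is obtained per image is an assumption here, taken to come from the ground-truth annotations.

```python
# A minimal sketch of [email protected] as described above. The head_size argument
# (head bone link length in pixels) is assumed to be available from the
# ground-truth annotations.
import numpy as np

def pckh(pred: np.ndarray, gt: np.ndarray,
         head_size: float, alpha: float = 0.5) -> float:
    """Fraction of joints predicted within alpha * head_size of ground truth."""
    dists = np.linalg.norm(pred - gt, axis=-1)
    return float((dists <= alpha * head_size).mean())
```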

4 Results and Discussion

Fig. 3: Visual results of different deep learning models on thermal images from the dataset. Blue points are ground truth, and red points are predicted pose points. ViTPose performs noticeably better than the other models; however, the performance of almost all existing RGB-based models deteriorates on thermal images.
Fig. 4: Failure cases of ViTPose. For complex poses, ViTPose fails to extract and decode the features properly, illustrating the complexity involved with LWIR images.

We trained three different models, ViTPose [3], HRNet [11], and Simple Baseline (ResNet-based pose estimation) [10], to evaluate their performance on our dataset.

Table 3: This table presents the MPJPE errors for four different pose estimation models: the ResNet-50 baseline [10], the ResNet-152 baseline [10], HRNet [11], and ViTPose [3]. MPJPE values are provided for the 12 activity categories for each model, along with the average MPJPE across all categories.
Methods Direction Discussion Eating Greeting Phone Posing Purchases Sitting Smoking Photo Waiting Walking Total
Resolution: 640x480
Baseline (ResNet-50 [10]) 22.1 24.8 23.7 24.1 24.8 21.8 22.6 30.5 23.1 29.6 21.7 26.1 24.6
Baseline (ResNet-152 [10]) 19.3 21.6 22.1 23.2 22.4 20.3 18.7 28.4 21.7 28.6 20.8 23.6 22.6
HRNet-W48 [11] 17.2 18.6 21.2 19.2 18.9 19.3 18.7 25.1 20.7 24.4 18.6 19.4 20.1
ViTPose [3] 14.1 15.1 18.6 15.7 15.9 16.8 15.5 22.7 18.1 22.3 14.9 16.1 17.2
Table 4: This table displays the average [email protected] values for four different pose estimation models: the ResNet-50 baseline, the ResNet-152 baseline, HRNet, and ViTPose. Each model's [email protected] value is computed across all keypoint categories and activities in the dataset.
Methods [email protected]
Resolution: 640x480
Baseline (ResNet-50) 72.1
Baseline (ResNet-152) 74.6
HRNet-W48 78.2
ViTPose 83.1

4.1 Performance Evaluation of Pose Estimation Models

The evaluation of various pose estimation models was conducted on our proposed dataset using two standard metrics: Mean Per Joint Position Error (MPJPE) and Percentage of Correct Keypoints (PCKh). Table 3 summarizes the MPJPE errors for each activity category, along with the average MPJPE across all activities. Notably, ViTPose [3] exhibited a considerable reduction in MPJPE compared to the baseline ResNet-50 model [10], indicating superior accuracy in pose estimation. Table 4 shows the [email protected] values for the different models; the noticeable improvement with ViTPose [3] indicates better keypoint localization with attention-based networks.

However, it was observed that poses characterized by self-occlusion, such as sitting and talking, yielded larger MPJPE values compared to poses with minimal occlusion. This underscores the inherent complexity of our dataset, particularly with respect to self-occlusion and occlusion by external objects. Consequently, there exists substantial room for the development of models specifically tailored to address such challenging poses present in our dataset.

4.2 Visual Analysis of Pose Estimation Results

Figure 3 presents the visual results obtained from different pose estimation models applied to our proposed dataset. While the models were able to extract relevant features and formulate the pose structure, their performance fell short of what they achieve on RGB images. Additionally, Figure 4 showcases failure cases of ViTPose [3] on images characterized by high occlusion levels.

The challenges encountered in pose estimation on Long Wave Infrared (LWIR) images can be attributed to the inherent differences in texture and contrast compared to RGB images. LWIR images depict thermal intensity rather than visual appearance, rendering occluded poses particularly challenging to identify. Consequently, accurate pose estimation in LWIR images is hindered by the distinct characteristics of thermal imagery.

The evaluation results shed light on the performance and limitations of various pose estimation models when applied to LWIR images in our dataset. These insights guide future research endeavours to improve pose estimation accuracy in challenging conditions characterized by self-occlusion and occlusion in LWIR imagery.

5 Conclusion

This paper introduced a first-of-its-kind, fully annotated thermal image dataset for 2D human pose estimation. Featuring over 2,400 high-resolution images, it fosters research in this under-explored domain. We evaluated state-of-the-art 2D pose models, revealing their limitations on thermal data but demonstrating significant performance boosts after fine-tuning on our dataset. This establishes a strong baseline for future research. The extensive dataset opens exciting avenues for developing domain-specific models, exploring data fusion, and venturing into tasks beyond pose estimation. By unlocking the potential of thermal data, this work sets the stage for advancements in diverse applications demanding robust pose estimation under challenging conditions.

References

  • [1] Shuangjun Liu and Sarah Ostadabbas, “Seeing under the cover: A physics guided learning approach for in-bed pose estimation,” 2019.
  • [2] Yufei Xu, Jing Zhang, Qiming Zhang, and Dacheng Tao, “Vitpose+: Vision transformer foundation model for generic body pose estimation,” arXiv preprint arXiv:2212.04246, 2022.
  • [3] Yufei Xu, Jing Zhang, Qiming Zhang, and Dacheng Tao, “ViTPose: Simple vision transformer baselines for human pose estimation,” in Advances in Neural Information Processing Systems, 2022.
  • [4] Z. Cao, G. Hidalgo Martinez, T. Simon, S. Wei, and Y. A. Sheikh, “Openpose: Realtime multi-person 2d pose estimation using part affinity fields,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
  • [5] Bowen Cheng, Bin Xiao, Jingdong Wang, Honghui Shi, Thomas S. Huang, and Lei Zhang, “Higherhrnet: Scale-aware representation learning for bottom-up human pose estimation,” 2020.
  • [6] Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár, “Microsoft coco: Common objects in context,” 2015.
  • [7] Mykhaylo Andriluka, Leonid Pishchulin, Peter Gehler, and Bernt Schiele, “2d human pose estimation: New benchmark and state of the art analysis,” in 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 3686–3693.
  • [8] I-Chien Chen, Chang-Jen Wang, Chao-Kai Wen, and Shiow-Jyu Tzou, “Multi-person pose estimation using thermal images,” IEEE Access, vol. 8, pp. 174964–174971, 2020.
  • [9] Javier Smith, Patricio Loncomilla, and Javier Ruiz del Solar, “Human pose estimation using thermal images,” IEEE Access, 2023.
  • [10] Bin Xiao, Haiping Wu, and Yichen Wei, “Simple baselines for human pose estimation and tracking,” pp. 472–487, Springer International Publishing, 2018.
  • [11] Ke Sun, Bin Xiao, Dong Liu, and Jingdong Wang, “Deep high-resolution representation learning for human pose estimation,” 2019.
  • [12] Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh, “Realtime multi-person 2d pose estimation using part affinity fields,” in CVPR, 2017.
  • [13] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” 2021.