Human Pose Estimation from Ambiguous Pressure Recordings with Spatio-temporal Masked Transformers
Abstract
Despite the impressive performance of vision-based pose estimators, they generally fail under adverse vision conditions and often do not satisfy users' privacy demands. As a result, researchers have begun to study tactile sensing systems as an alternative. However, these systems suffer from noisy and ambiguous recordings. To tackle this problem, we propose a novel solution for pose estimation from ambiguous pressure data. Our method comprises a spatio-temporal vision transformer with an encoder-decoder architecture. Detailed experiments on two popular public datasets reveal that our model outperforms existing solutions in the area. Moreover, we observe that increasing the number of temporal crops in the early stages of the network improves performance, and that pre-training the network in a self-supervised setting with a masked auto-encoder approach further improves the results.
Index Terms— Pressure-based Pose Estimation, Tactile Sensing, Self-supervised Learning, Human Pose Estimation
1 Introduction
Human pose estimation is an important task in computer vision research with applications in healthcare [1], robotics [2], autonomous driving [3], and action recognition [4]. With recent advances in deep learning, researchers have developed optical and depth-based models that estimate 3D human body models [5] or body joint locations with sub-pixel accuracy [6]. However, despite the impressive performance of vision-based models, privacy concerns and challenges such as severe occlusion during daily activities have given rise to non-vision pose estimation techniques. Since most human activities involve contact with the surrounding environment, smart textiles have been proposed as a solution in numerous applications such as healthcare [1] and identification [7].
Pressure-recording carpets and bed mattresses are now commercially available. These systems are designed to capture the pressure profiles and postures of subjects. Researchers have used such systems to reliably identify sleeping posture [8] and even 3D joint coordinates [1] with deep learning models. However, unlike vision-based approaches, these models often require complex architectures to account for the noisy and ambiguous nature of the data. For example, systems using smart carpets for pose estimation must rely on the limited information in foot pressure distributions to estimate joint locations in 3D space [9]. Similarly, in-bed pressure systems record artifacts caused by blankets, movement, or stretching of the pressure sensors [10]. Furthermore, pressure-based deep learning models can only be trained on datasets with limited diversity, or alternatively on synthetic data [5], unlike vision-based models that frequently rely on large-scale, in-the-wild datasets. As a result, estimating the full 3D posture from pressure data remains a challenging problem, with models exhibiting up to ten times the error of vision-based solutions [9].
In this paper, we propose a self-supervised strategy for learning human pose from ambiguous pressure data using a temporal adaptation of ViTPose [11], a vision transformer model originally developed for pose estimation from single-frame images. Our solution achieves state-of-the-art results on a large-scale in-bed dataset and a large-scale smart-carpet dataset. More specifically, we first pre-train the encoder of our network on a masked image reconstruction task with a warm-up strategy to leverage the benefits of pre-trained ViT models. We then train the network on pressure data using a combination of objective functions for pose estimation. We show the effectiveness of the self-supervised pre-training and network design by comparing our model's performance to previous works and state-of-the-art vision-based models on both datasets.
Our contributions can be summarized as follows: (1) We propose a temporal variation of the ViT [12] for pose estimation, setting a new state of the art in temporal pose prediction from ambiguous pressure data. (2) We show that pre-training the ViT in a masked auto-encoder framework positively impacts performance on both datasets used in this study.
2 Related Works
Human Pose Estimation. Due to recent advances in computer vision and the collection of large-scale datasets, great improvements have been achieved in 2D human pose estimation [13, 11, 14]. However, these models generally fail to predict 3D poses directly due to the inherent ambiguity of recovering 3D pose from a single frame. As a result, most recent models rely on temporal information [15] or multi-camera setups [6]. More recently, self-supervised learning strategies have emerged as a viable way to reduce pose ambiguity in monocular images [16]. In a recent work, a 2D-to-3D pose uplifting solution was developed, utilizing a self-supervised training strategy that reconstructs masked joints or frames in a sequence of poses [17]. In another direction, larger models have been suggested to learn temporal information directly alongside spatial information, at the cost of additional computational resources [15]. Despite the impressive performance of these works, most approaches still rely on an accurate initial prediction of the 2D joints or on textural information from RGB images, which is not available when using pressure recordings from smart textiles. To address the ambiguity of the input pressure data, we exploit temporal information in our network by implementing a temporal variation of ViTPose [11]. Moreover, we address the lack of diverse, large-scale data by pre-training our network with a self-supervised training strategy.
Pressure-based pose estimation. The development of cost-efficient and ubiquitous tactile sensing systems has allowed researchers to model human interactions and gestures from pressure recordings. For instance, patterns of hand grasp captured by pressure-sensing textiles have been explored in [2], while other works have developed multi-modal solutions for hand gesture detection using a combination of cameras and e-textiles [18]. Similarly, several works have explored virtual-reality and healthcare applications via human gait recognition [19] and posture detection [7] using arrays of piezoresistive pressure sensors embedded in a carpet. A recent study [9] addressed the challenging task of 3D human pose estimation from foot pressure distributions alone with a deep network based on 3D convolutions. Given the limited information available in foot pressure and the ambiguity of its mapping to a 3D pose, they achieved an error of about 20, almost ten times higher than vision-based approaches [11]. In another work, to disambiguate in-bed pressure maps, [8] proposed generating human-like figures from the pressure recording matrix via a pre-processing block before pose estimation. In the same line of work, [5] presented a synthetic in-bed pressure dataset to address the limited diversity of available datasets, allowing network pre-training for better performance in real-life applications.
3 Method
Problem setup. Let $X = \{x_1, \ldots, x_T\}$ be a sequence of $T$ pressure distribution frames and $Y = \{y_1, \ldots, y_T\}$ be the corresponding sequence of 3D ground-truth joint locations. Our goal is to train a pose estimation encoder as illustrated in Figure 1. We do so by learning the mapping $f_{\theta,\phi}: X \rightarrow Y$, where $\theta$ and $\phi$ are the trainable parameters of the encoder and the regression head, respectively. We pre-train the encoder via self-supervision prior to pose estimation. The details of our network architecture, training steps, and implementation are described in the following sections.
Network Architecture. Similar to video-based vision transformers, we first tokenize the input frames into space-time cubes by applying a spatio-temporal convolution [20]. Next, we embed the space-time cube tokens via a patch embedding layer and, after a linear projection layer, pass them to an Encoder, which is a ViT [12] pre-trained on ImageNet [21] followed by MS-COCO [22]. The patch embedding produces $N_t \times N_s$ tokens, where $N_t$ and $N_s$ are the number of temporal and spatial crops, respectively. The ViT is a sequence of transformer blocks, each consisting of multi-head self-attention (MHSA) and feed-forward network (FFN) layers. In our setting, the encoder operates on all of the input patches without masked tokens and outputs a latent representation $Z$. As illustrated in Figure 1 (b), we adopt a simple Regression head composed of two deconvolution layers and one convolution layer applied to the reshaped $Z$, following the common setting of previous works [23]. Specifically, each deconvolution block up-samples the reshaped feature maps by a factor of two, and the convolution layer predicts a joint-location heatmap for each keypoint in each frame. The last module in our network is a decoder used only during the self-supervised pre-training of our encoder to reconstruct the input patches from the encoder's output latent representations. We adopt a shallow transformer network to process the masked latent representation and connect it to a linear projection layer to match the shape of the input patches.
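For concreteness, the following is a minimal PyTorch sketch of the space-time tokenization and regression head described above. All layer sizes, class names (e.g., `SpaceTimePatchEmbed`), and the patch/crop sizes are illustrative assumptions rather than the exact configuration of our network.

```python
import torch
import torch.nn as nn

class SpaceTimePatchEmbed(nn.Module):
    """Tokenize a pressure sequence into space-time cubes with a 3D convolution."""
    def __init__(self, in_ch=1, embed_dim=768, t_patch=2, s_patch=16):
        super().__init__()
        # A single Conv3d implements both the space-time cropping and the
        # linear projection of each cube into an embedding vector.
        self.proj = nn.Conv3d(in_ch, embed_dim,
                              kernel_size=(t_patch, s_patch, s_patch),
                              stride=(t_patch, s_patch, s_patch))

    def forward(self, x):                       # x: (B, 1, T, H, W) pressure frames
        z = self.proj(x)                        # (B, C, Nt, Hs, Ws)
        B, C, Nt, Hs, Ws = z.shape
        tokens = z.flatten(2).transpose(1, 2)   # (B, Nt*Hs*Ws, C) token sequence
        return tokens, (Nt, Hs, Ws)

class RegressionHead(nn.Module):
    """Two deconvolution blocks (2x upsampling each) followed by a 1x1 convolution
    that predicts one heatmap per keypoint per temporal token."""
    def __init__(self, embed_dim=768, num_joints=17):
        super().__init__()
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(embed_dim, 256, 4, stride=2, padding=1),
            nn.BatchNorm2d(256), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(256, 256, 4, stride=2, padding=1),
            nn.BatchNorm2d(256), nn.ReLU(inplace=True),
        )
        self.final = nn.Conv2d(256, num_joints, kernel_size=1)

    def forward(self, tokens, grid):            # tokens: (B, Nt*Hs*Ws, C)
        Nt, Hs, Ws = grid
        B, _, C = tokens.shape
        feat = tokens.transpose(1, 2).reshape(B, C, Nt, Hs, Ws)
        feat = feat.permute(0, 2, 1, 3, 4).reshape(B * Nt, C, Hs, Ws)  # fold time into batch
        return self.final(self.deconv(feat))    # (B*Nt, num_joints, 4*Hs, 4*Ws)
```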
Self-supervised pre-training. Masked auto-encoders (MAE) for pre-training transformer networks have shown strong results in a variety of applications such as NLP [24], pose estimation [11], and image classification [23]. Inspired by this, we implement an MAE by masking the outputs of our encoder, $Z$, and then passing the masked encoded features, along with the mask tokens, to our decoder. For this purpose, we adopt an asymmetric design for masked image reconstruction, where the encoder operates on fully observed data (without masked tokens) and the decoder is applied to the masked encoder output. Our encoder-decoder network reconstructs the masked patches by learning to predict the raw pixel values of each patch. We use an MSE loss on the masked patches only, excluding the unmasked patches, to train the network efficiently [24].
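The masking of the encoder output and the reconstruction loss restricted to masked patches can be sketched as follows; the function names and shape conventions are illustrative assumptions, while the masking of $Z$ and the masked-patch MSE follow the description above.

```python
import torch

def mask_latent(z, mask_ratio, mask_token):
    """Randomly replace a fraction of encoded tokens with a learnable mask token.

    z: (B, N, C) encoder output computed on the fully observed input.
    mask_token: learnable tensor of shape (1, 1, C).
    Returns the masked latent and a boolean mask marking the replaced positions.
    """
    B, N, C = z.shape
    num_mask = int(mask_ratio * N)
    noise = torch.rand(B, N, device=z.device)
    ids = noise.argsort(dim=1)                     # random permutation per sample
    mask = torch.zeros(B, N, dtype=torch.bool, device=z.device)
    mask.scatter_(1, ids[:, :num_mask], True)      # mark positions to mask
    z_masked = torch.where(mask.unsqueeze(-1), mask_token.expand(B, N, C), z)
    return z_masked, mask

def mae_reconstruction_loss(pred_patches, target_patches, mask):
    """MSE computed only on the masked patches, excluding unmasked ones."""
    loss = (pred_patches - target_patches) ** 2    # (B, N, patch_dim)
    loss = loss.mean(dim=-1)                       # per-patch error
    return (loss * mask).sum() / mask.sum().clamp(min=1)
```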

Fine-tuning for Pose Estimation. In the fine-tuning step, we first freeze the decoder and warm up the regression head and the patching block of the encoder for a number of warm-up epochs. Then, we train the entire network by minimizing the MSE loss between the predicted and ground-truth heatmaps, as well as a limb-length loss between the ground-truth and predicted keypoint coordinates (obtained via a SoftMax layer over each heatmap):

$$\mathcal{L} = \mathcal{L}_{\mathrm{heatmap}} + \mathcal{L}_{\mathrm{limb}}, \tag{1}$$

where $\mathcal{L}_{\mathrm{heatmap}}$ is the MSE between the predicted and ground-truth heatmaps, $\mathcal{L}_{\mathrm{limb}}$ compares the limb lengths computed from the predicted and ground-truth keypoints, and $b^{\mathrm{low}}$ and $b^{\mathrm{high}}$ represent the lower and upper percentiles of each of the limb lengths in the given dataset.
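A minimal sketch of this objective is given below, assuming the limb-length term is implemented as a penalty on predicted limb lengths that fall outside the dataset percentile bounds and that 2D keypoint coordinates are extracted from the heatmaps with a spatial SoftMax; the weighting factor `lam` and all function names are illustrative, not the exact formulation of Eq. (1).

```python
import torch
import torch.nn.functional as F

def soft_argmax_2d(heatmaps):
    """Differentiable keypoint coordinates from heatmaps via a spatial SoftMax."""
    B, K, H, W = heatmaps.shape
    probs = F.softmax(heatmaps.reshape(B, K, -1), dim=-1).reshape(B, K, H, W)
    ys = torch.linspace(0, H - 1, H, device=heatmaps.device)
    xs = torch.linspace(0, W - 1, W, device=heatmaps.device)
    y = (probs.sum(dim=3) * ys).sum(dim=2)           # expected row index
    x = (probs.sum(dim=2) * xs).sum(dim=2)           # expected column index
    return torch.stack([x, y], dim=-1)               # (B, K, 2)

def limb_length_loss(coords, limbs, b_low, b_high):
    """Penalize predicted limb lengths outside the dataset percentile bounds.

    limbs: list of (i, j) joint-index pairs; b_low/b_high: (L,) bound tensors.
    """
    lengths = torch.stack(
        [(coords[:, i] - coords[:, j]).norm(dim=-1) for i, j in limbs], dim=1)
    return (F.relu(lengths - b_high) + F.relu(b_low - lengths)).mean()

def pose_loss(pred_hm, gt_hm, limbs, b_low, b_high, lam=0.1):
    """Heatmap MSE plus weighted limb-length penalty."""
    coords = soft_argmax_2d(pred_hm)
    return F.mse_loss(pred_hm, gt_hm) + lam * limb_length_loss(coords, limbs, b_low, b_high)
```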
Implementation details. We use the same architecture proposed in ViTPose [11] for our encoder and only modify the patching layer, replacing the standard convolution operator with a spatio-temporal convolution block. In all cases, we start training from the pre-trained weights provided for ViTPose-B [11] in eligible layers. More specifically, we only use random initialization for the fully connected layers whose sizes differ from the vanilla ViTPose due to the increased number of patches. Our decoder consists of 4 transformer blocks with hidden dimensions of . We initialize the Temporal ViTPose weights from the pre-trained MAE model [23] and perform masked region reconstruction with a masking rate of . We train the encoder-decoder pipeline with a learning rate of using the AdamW optimizer [25] for epochs with a weight decay rate of . In the next stage, we warm up the network by training the regression head, the newly initialized layers, and the patching layer of the ViT on the supervised task with a learning rate of for epochs. Finally, we use a learning rate of to train the network on the supervised pose estimation task for a total of epochs.
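The staged schedule (freeze the decoder, warm up the regression head and patching layer, then fine-tune everything) can be expressed as in the sketch below; the attribute names `patch_embed`, `head`, and `decoder` are assumed sub-module names, not the exact identifiers in our implementation, and the learning rates are left as arguments.

```python
import torch

def set_trainable(module: torch.nn.Module, flag: bool) -> None:
    """Enable or disable gradient updates for every parameter in a module."""
    for p in module.parameters():
        p.requires_grad = flag

def configure_warmup(model, lr):
    """Stage 2: freeze everything, then train only the regression head and the
    patch-embedding (patching) block of the ViT."""
    set_trainable(model, False)
    set_trainable(model.patch_embed, True)
    set_trainable(model.head, True)
    return torch.optim.AdamW(
        (p for p in model.parameters() if p.requires_grad), lr=lr)

def configure_finetune(model, lr):
    """Stage 3: unfreeze the full network except the reconstruction decoder and
    train on the supervised pose-estimation objective."""
    set_trainable(model, True)
    set_trainable(model.decoder, False)
    return torch.optim.AdamW(
        (p for p in model.parameters() if p.requires_grad), lr=lr)
```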
4 Experiments and Results
4.1 Datasets
Intelligent Carpet. This dataset contains actions performed by subjects in two hour-long recording sessions [9]. The 3D pose is obtained by triangulating AI-generated 2D joints from two webcams. Moreover, synchronized recordings of pressure maps from sensor-embedded carpets are provided for pose estimation, resulting in over 1,800,000 frames. Following prior work, we train our model on subjects and evaluate on the remaining ones.
SLP. The Simultaneously-collected multi-modal Lying Pose (SLP) dataset [1] is a collection of multi-modal data, namely RGB, LWIR, depth, and pressure maps, recorded from 102 subjects in a home setting and 7 subjects in a hospital setting. We only use the no-cover condition and leave out the thin- and thick-cover conditions. As the dataset does not provide direct 3D annotations, we use the available 2D annotations and the corresponding joint depths of the recorded postures to train our model. Following the standard evaluation scheme for the in-home set, we train our models on the first 90 subjects and use the remaining 12 for testing.
4.2 Performance Metrics
MPJPE: For the Intelligent Carpet dataset, we report mean-per-joint-position-error (MPJPE) as the performance metric in line with previous research [9].
PCKh: To compare our method with previous works on the SLP dataset [1], we report the percentage of correct keypoints at a threshold of 50% of the head segment length (PCKh@0.5) [22], using only the 2D predictions.
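For reference, both metrics can be computed as in the sketch below; the tensor shapes and function names are assumptions for illustration.

```python
import torch

def mpjpe(pred, gt):
    """Mean per-joint position error: average Euclidean distance over joints and frames.

    pred, gt: (..., K, 3) joint coordinates in the same metric units.
    """
    return (pred - gt).norm(dim=-1).mean()

def pckh_05(pred, gt, head_len):
    """PCKh@0.5: fraction of 2D keypoints within half the head segment length.

    pred, gt: (B, K, 2) keypoint coordinates; head_len: (B,) head segment lengths.
    """
    dist = (pred - gt).norm(dim=-1)            # (B, K) per-joint errors
    thresh = 0.5 * head_len.unsqueeze(-1)      # (B, 1) per-sample threshold
    return (dist <= thresh).float().mean()
```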
4.3 Benchmarks
To evaluate our proposed model on the Intelligent Carpet dataset, we adapt and modify commonly used pose estimators, namely ResNet [14], UNet [26], HRNet [13], and ViTPose [11]. Specifically, we concatenate the input frames along the channel dimension and change the first convolution layer accordingly, following the same strategy as previous work on the Intelligent Carpet dataset [9]. On the SLP dataset, we modify prior works to handle temporal data and compare them with our approach. To train the benchmarks, we initialize the models using their pre-trained weights. Finally, for a fair comparison, we fine-tune them with empirically tuned learning rates and the same number of epochs as our proposed model.
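As an illustration of this adaptation, the sketch below inflates the first convolution of a torchvision ResNet-18 to accept a T-frame pressure window stacked along the channel dimension; the coordinate-regression output head and the weight-inflation scheme are simplifications for illustration, as the actual benchmarks keep their original pose-estimation heads.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

def make_temporal_resnet(num_frames: int, num_joints: int = 17):
    """Adapt a 2D ResNet to a T-frame pressure window by stacking frames as channels."""
    net = resnet18(weights="IMAGENET1K_V1")
    old = net.conv1
    # Replace the first convolution so it accepts `num_frames` input channels.
    net.conv1 = nn.Conv2d(num_frames, old.out_channels,
                          kernel_size=old.kernel_size, stride=old.stride,
                          padding=old.padding, bias=False)
    with torch.no_grad():
        # Initialize by replicating the averaged pre-trained RGB filters
        # across the frame channels.
        net.conv1.weight.copy_(
            old.weight.mean(dim=1, keepdim=True).repeat(1, num_frames, 1, 1))
    net.fc = nn.Linear(net.fc.in_features, num_joints * 3)   # regress 3D joints
    return net

# Usage: a (B, T, H, W) pressure window is passed directly as a T-channel image.
model = make_temporal_resnet(num_frames=20)
out = model(torch.randn(2, 20, 96, 96))   # (2, 17*3)
```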
Table 1: MPJPE of our method and the benchmarks on the Intelligent Carpet dataset for different numbers of input frames.

| Method | 1 frame | 4 frames | 8 frames | 12 frames | 20 frames | 32 frames |
| --- | --- | --- | --- | --- | --- | --- |
| ResNet18 | 41.4 | 40.9 | 37.5 | 33.4 | 28.8 | 29.8 |
| ResNet50 | 32.5 | 31.1 | 31.2 | 30.7 | 29.8 | 29.4 |
| ResNet101 | 38.3 | 42.1 | 37.8 | 32.6 | 31.7 | 30.1 |
| UNet | 34.9 | 33.8 | 32.4 | 32.1 | 30.4 | 28.7 |
| HRNet-W32 | 25.4 | 25.6 | 25.4 | 25.2 | 24.8 | 24.6 |
| 3DCNN [9] | 33.5 | 28.7 | 24.6 | 23.1 | 19.8 | 22.4 |
| ViTPose | 24.3 | 24.6 | 23.7 | 22.3 | 21.7 | 21.2 |
| T-ViTPose (Ours, temporal crops = 4) | N/A | 28.9 | 28.4 | 22.4 | 19.4 | 16.9 |
| T-ViTPose+MAE (Ours, temporal crops = 4) | N/A | 28.4 | 27.5 | 21.6 | 17.8 | 16.5 |
4.4 Results
In Table 1, we compare our method against the benchmarks for different numbers of temporal frames on the Intelligent Carpet dataset. We observe that adapting the architecture of existing pose estimators as proposed in [9] does not benefit ResNet50, HRNet, and UNet as much as the 3DCNN and our proposed model. Furthermore, we observe that although ResNet101 has more parameters than the other architectures, it performs worse, hinting at over-fitting and at the limited diversity of foot pressure distributions caused by data ambiguity. We also show that simply changing the cropping strategy and the first convolution layers of ViTPose significantly improves the performance of the pose estimator and achieves the lowest error among all methods. Finally, we show that our self-supervised pre-training strategy consistently improves performance, with a reduction in error observed across all frame counts. We observe a similar effect in Table 2, where we compare the performance of our proposed solution to our implementations of previous research: our approach consistently achieves the best results when longer frame sequences are available, and self-supervised pre-training of our Temporal ViTPose consistently improves performance.
Generally, pose estimation approaches perform significantly better when given temporal context [27]. In our case, this is particularly evident on the carpet data, where high levels of data ambiguity exist. We therefore design our network to utilize temporal crops throughout all stages, improving performance and outperforming the 3DCNN solution [9], which uses temporal tiling and 3D convolutions only in its later stages. Consequently, as shown in Table 1, our model outperforms the other approaches mostly on long time windows. For instance, T-ViTPose only achieves third and fourth place when using 4 and 8 frames on the Intelligent Carpet dataset, respectively. Similarly, in Table 2, on the SLP dataset, T-ViTPose achieves the second-best performance when using 4 frames but outperforms all benchmarks on longer time windows.
Next, we conduct a parameter study in Table 3 to investigate the effect of temporal cropping on the performance of T-ViTPose. We show that increasing the number of temporal patches consistently improves performance. For instance, using 4 temporal crops instead of 1 reduces the error when 4 and 32 temporal frames are available. Furthermore, we show that self-supervised pre-training reduces the error on average across all conditions. Finally, we illustrate qualitative examples of our method in Figure 2 and compare our results against previous works, observing more accurate pose estimation by our model.
Table 3: Effect of self-supervised (SSL) pre-training and the number of temporal crops on T-ViTPose, for 4- and 32-frame inputs.

| SSL Pre-training | Temporal Crops | 4 frames | 32 frames |
| --- | --- | --- | --- |
| ✓ | 1 | 29.8 | 23.4 |
| ✓ | 2 | 29.3 | 20.4 |
| ✓ | 4 | 28.4 | 16.5 |
| ✗ | 1 | 30.2 | 23.1 |
| ✗ | 2 | 30.1 | 21.4 |
| ✗ | 4 | 28.9 | 16.9 |
Table 4: Computational comparison for a 20-frame input window.

| Method | FLOPS (G) | Parameters (M) | Inference Time (ms) |
| --- | --- | --- | --- |
| ResNet18 | 0.72 | 15.52 | 79.23 |
| ResNet50 | 1.2 | 34.15 | 182.93 |
| ResNet101 | 1.88 | 53.14 | 357.11 |
| UNet | 3.88 | 16.84 | 51.67 |
| HRNet | 0.69 | 9.32 | 366.36 |
| 3DCNN | 57.31 | 68.86 | 67.02 |
| ViTPose | 3.52 | 93.31 | 383.53 |
| Ours T=1 | 3.57 | 93.3 | 410.45 |
| Ours T=2 | 6.73 | 94.51 | 413.00 |
| Ours T=4 | 13.07 | 99.88 | 401.24 |

Finally, Table 4 provides the number of parameters, inference time, and FLOPS for a 20-frame input window. Our model has the same order of parameters as the other top-performing approaches, is only about 30 ms slower than HRNet or ViTPose, and requires 78% fewer FLOPS than the previous state-of-the-art pressure-based pose estimation approach. Additionally, the training time of all models was about 30 minutes per epoch with a batch size of 128, except for the 3DCNN, where training took 2 hours per epoch with a batch size of 64 due to the slow back-propagation caused by the large number of parameters introduced by the tiling operation and the 3D convolution layers. These measurements were averaged over 1,000 forward passes on a 96×96 input array using an Nvidia 1080 Xp GPU.
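The inference-time measurement can be reproduced with a simple timing loop such as the sketch below; the input shape and warm-up count are assumptions, while the averaging over 1,000 forward passes follows the setup described above.

```python
import time
import torch

@torch.no_grad()
def average_inference_time(model, input_shape=(1, 1, 20, 96, 96),
                           runs: int = 1000, warmup: int = 50):
    """Average wall-clock forward-pass time in milliseconds over `runs` repetitions."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device).eval()
    x = torch.randn(*input_shape, device=device)
    for _ in range(warmup):                    # warm-up passes (kernels, caches)
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()               # wait for queued GPU work
    start = time.perf_counter()
    for _ in range(runs):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / runs * 1000.0
```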
5 Conclusion
In this paper, we presented a three-stage solution for accurate 3D pose estimation from a temporal window of ambiguous pressure data. Specifically, we proposed a temporal variation of the ViT that uses spatio-temporal convolutions in the initial patching block, and pre-trained the network using a self-supervised masked auto-encoder strategy. After a few warm-up epochs for the modified modules, we trained our model using conventional 3D pose estimation objectives. Our experiments show that each element of our design reduces the prediction error relative to prior works, setting a new state of the art on two large-scale pressure-mapping datasets.
Acknowledgement. This project was funded in part by Natural Sciences and Engineering Research Council of Canada.
References
- [1] Shuangjun Liu, Xiaofei Huang, Nihang Fu, Cheng Li, Zhongnan Su, and Sarah Ostadabbas, “Simultaneously-collected multimodal lying pose dataset: Towards in-bed human pose monitoring under adverse vision conditions,” arXiv preprint arXiv:2008.08735, 2020.
- [2] Yu Meng Zhou, Diana Wagner, Kristin Nuckols, Roman Heimgartner, Carolina Correia, et al., “Soft robotic glove with integrated sensing for intuitive grasping assistance post spinal cord injury,” in IEEE International conference on robotics and automation, 2019, pp. 9059–9065.
- [3] Renshu Gu, Gaoang Wang, and Jenq-Neng Hwang, “Efficient multi-person hierarchical 3d pose estimation for autonomous driving,” in IEEE Conference on Multimedia Information Processing and Retrieval, 2019, pp. 163–168.
- [4] Haodong Duan, Yue Zhao, Kai Chen, Dahua Lin, and Bo Dai, “Revisiting skeleton-based action recognition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 2969–2978.
- [5] Henry M Clever, Zackory Erickson, Ariel Kapusta, Greg Turk, Karen Liu, and Charles C Kemp, “Bodies at rest: 3d human pose and shape estimation from a pressure image using synthetic data,” in IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 6215–6224.
- [6] Karim Iskakov, Egor Burkov, Victor Lempitsky, and Yury Malkov, “Learnable triangulation of human pose,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 7718–7727.
- [7] Vandad Davoodnia and Ali Etemad, “Identity and posture recognition in smart beds with deep multitask learning,” in IEEE International Conference on Systems, Man and Cybernetics, 2019, pp. 3054–3059.
- [8] Vandad Davoodnia, Saeed Ghorbani, and Ali Etemad, “Estimating pose from pressure data for smart beds with deep image-based pose estimators,” Applied Intelligence, vol. 52, no. 2, pp. 2119–2133, 2022.
- [9] Yiyue Luo, Yunzhu Li, Michael Foshey, Wan Shou, et al., “Intelligent carpet: Inferring 3d human pose from tactile signals,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 11255–11265.
- [10] Shuangjun Liu and Sarah Ostadabbas, “Pressure eye: In-bed contact pressure estimation via contact-less imaging,” arXiv preprint arXiv:2201.11828, 2022.
- [11] Yufei Xu, Jing Zhang, Qiming Zhang, and Dacheng Tao, “Vitpose: Simple vision transformer baselines for human pose estimation,” arXiv preprint arXiv:2204.12484, 2022.
- [12] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
- [13] Ke Sun, Bin Xiao, Dong Liu, and Jingdong Wang, “Deep high-resolution representation learning for human pose estimation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 5693–5703.
- [14] Bin Xiao, Haiping Wu, and Yichen Wei, “Simple baselines for human pose estimation and tracking,” in Proceedings of the European conference on computer vision, 2018, pp. 466–481.
- [15] N Dinesh Reddy, Laurent Guigues, Leonid Pishchulin, Jayan Eledath, and Srinivasa G Narasimhan, “Tessetrack: End-to-end learnable multi-person articulated 3d pose tracking,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 15190–15200.
- [16] Jogendra Nath Kundu, Siddharth Seth, Pradyumna YM, Varun Jampani, et al., “Uncertainty-aware adaptation for self-supervised 3d human pose estimation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 20448–20459.
- [17] Wenkang Shan, Zhenhua Liu, Xinfeng Zhang, Shanshe Wang, Siwei Ma, and Wen Gao, “P-stmo: Pre-trained spatial temporal many-to-one model for 3d human pose estimation,” arXiv preprint arXiv:2203.07628, 2022.
- [18] Yunzhu Li, Jun-Yan Zhu, Russ Tedrake, and Antonio Torralba, “Connecting touch and vision via cross-modal prediction,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 10609–10618.
- [19] Wenchao Li, Wenqian Lu, Xiaopeng Sha, Hualin Xing, Jiazhi Lou, Hui Sun, and Yuliang Zhao, “Wearable gait recognition systems based on mems pressure and inertial sensors: A review,” IEEE Sensors Journal, 2021.
- [20] Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri, “A closer look at spatiotemporal convolutions for action recognition,” in Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 2018, pp. 6450–6459.
- [21] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in IEEE conference on computer vision and pattern recognition, 2009, pp. 248–255.
- [22] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, et al., “Microsoft coco: Common objects in context,” in European conference on computer vision. Springer, 2014, pp. 740–755.
- [23] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, et al., “Masked autoencoders are scalable vision learners,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16000–16009.
- [24] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
- [25] Sashank J Reddi, Satyen Kale, and Sanjiv Kumar, “On the convergence of adam and beyond,” arXiv preprint arXiv:1904.09237, 2019.
- [26] Olaf Ronneberger, Philipp Fischer, and Thomas Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention. Springer, 2015, pp. 234–241.
- [27] Wenhao Li, Hong Liu, Runwei Ding, Mengyuan Liu, Pichao Wang, and Wenming Yang, “Exploiting temporal contexts with strided transformer for 3d human pose estimation,” IEEE Transactions on Multimedia, 2022.