AutoHR: A Strong End-to-end Baseline for Remote Heart Rate Measurement with Neural Searching
Abstract
Remote photoplethysmography (rPPG), which aims to measure heart activity without any contact, has great potential in many applications (e.g., remote healthcare). Existing end-to-end rPPG and heart rate (HR) measurement methods from facial videos are vulnerable to less-constrained scenarios (e.g., head movement and poor illumination). In this letter, we explore why existing end-to-end networks perform poorly in challenging conditions and establish a strong end-to-end baseline (AutoHR) for remote HR measurement with neural architecture search (NAS). The proposed method includes three parts: 1) a powerful searched backbone with a novel Temporal Difference Convolution (TDC), intended to capture intrinsic rPPG-aware clues between frames; 2) a hybrid loss function considering constraints from both the time and frequency domains; and 3) spatio-temporal data augmentation strategies for better representation learning. Comprehensive experiments on three benchmark datasets show our superior performance in both intra- and cross-dataset testing.
Index Terms:
rPPG, remote heart rate measurement, neural architecture search, convolution.

I Introduction
Heart rate (HR) is an important vital sign that needs to be measured in many circumstances, especially for healthcare or medical purposes. Traditionally, electrocardiography (ECG) and photoplethysmography (PPG) [1] are the two most common ways to measure heart activity and the corresponding average HR. However, both ECG and PPG sensors need to be attached to body parts, which may cause discomfort and is inconvenient for long-term monitoring. To address this issue, remote photoplethysmography (rPPG) [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19] methods, which aim to measure heart activity remotely without any contact, have developed rapidly in recent years.
In earlier studies of remote HR measurement from facial videos, most traditional methods [2, 3, 4, 5, 6, 7, 8] can be seen as a two-stage pipeline, which first extracts the rPPG signal from the detected/tracked face regions and then estimates the corresponding average HR via frequency analysis. On one hand, most methods analyze subtle color changes in facial regions of interest (ROI). Verkruysse et al. [2] first found that rPPG could be recovered from facial skin regions using ambient light, and that the green channel features the strongest rPPG clues. Poh et al. [3, 4] utilized independent component analysis for noise removal, which is robust to the environment even with a low-cost webcam. Li et al. [5] proposed to track a well-defined ROI for coarse rPPG signal recovery and then refine the signal via illumination rectification and non-rigid motion elimination. Tulyakov et al. [6] proposed self-adaptive matrix completion for HR estimation, which captures consistent clues among ROIs while reducing noise. On the other hand, there are a few color-subspace transformation methods that utilize all skin pixels for rPPG measurement, e.g., chrominance-based rPPG (CHROM) [7] and the projection plane orthogonal to the skin tone (POS) [8].
Based on prior knowledge (e.g., ROI definition and signal processing) from traditional methods, several learning based approaches [11, 12, 13, 14] are designed in a non-end-to-end fashion. After extracting the rPPG signal via traditional CHROM [7], Hsu et al. [11] transformed the signal to a time-frequency representation, which is cascaded with VGG15 [20] for HR regression. Qiu et al. [12] extracted features via spatial decomposition and temporal filtering in particular face ROIs, and then a CNN was utilized to map the hand-crafted features to an HR value. Niu et al. [13, 14] generated a spatio-temporal map representation by aggregating the information within multiple small face ROIs, which is cascaded with ResNet18 [21] for HR prediction. However, these methods require strict preprocessing procedures and neglect the global clues outside the pre-defined ROIs.
Meanwhile, a few end-to-end deep learning based rPPG methods [15, 16, 17, 18] have been developed, which take face frames as input and predict rPPG signals or HR values directly. Špetlík et al. [15] proposed a two-stage method (HR-CNN), which first measures the rPPG signal with a 2D CNN under frequency constraints and then regresses the HR value via another 1D CNN. Chen and McDuff proposed convolutional attention networks for physiological measurement (DeepPhys [16]), which uses normalized difference frames as input and predicts the rPPG signal derivative. Yu et al. designed spatio-temporal networks (PhysNet [17] and rPPGNet [18]) for rPPG signal recovery, which are supervised by a negative Pearson loss in the time domain. However, pure end-to-end methods are easily influenced by complex scenarios (e.g., head movement and varying illumination conditions; see Fig. 1(a) for examples). As shown in Fig. 1(b), the end-to-end deep learning based rPPG methods (I3D [22], DeepPhys [16] and PhysNet [17]) fall behind the state-of-the-art non-end-to-end method (RhythmNet [14]) by a large margin.

In this letter, we aim to explore two questions: 1) Why do the existing end-to-end deep learning based rPPG methods perform poorly in less-constrained scenarios? 2) Can end-to-end networks generalize well even in less-constrained scenarios? Our contributions include:
• We identify three key factors (i.e., network architecture, loss function and data augmentation strategy) that heavily influence the robustness and generalization ability of end-to-end rPPG networks.
• We propose AutoHR, which consists of a powerful searched backbone with the novel Temporal Difference Convolution (TDC). To the best of our knowledge, this is the first attempt to discover a well-suited backbone via Neural Architecture Search (NAS) for remote HR measurement.
• We propose a hybrid loss function with both time- and frequency-domain constraints, forcing AutoHR to learn rPPG-aware features. Besides, we present spatio-temporal data augmentation strategies for better representation learning.
• We conduct intra- and cross-dataset tests and show that AutoHR achieves performance superior or on par with the state of the art, and can be treated as a strong end-to-end baseline for the rPPG research community.
II Methodology
II-A Temporal Difference Convolution
Temporally normalized frame difference [19] has been proved robust for rPPG recovery under motion and bad illumination. In DeepPhys [16], a new type of normalized frame difference is designed as the network input. However, some important information is lost due to the normalized clipping, which limits the capability of representation learning. Here we adopt the original RGB frames as input and design a novel Temporal Difference Convolution (TDC) to describe temporal differences at the feature level. For simplicity, we describe TDC with a 3×3×3 kernel and channel number 1; the case with larger kernel size and channel number is analogous. $w(p_n)$ denotes the learnable weight at local position $p_n$. As illustrated in Fig. 2, besides weighting the spatial information in the local region $\mathcal{R}_t$ at the current time $t$, TDC also aggregates the temporal difference clues within the local temporal regions $\mathcal{R}_{t-1}$ and $\mathcal{R}_{t+1}$, which can be formulated as

$$y(p_0) = \underbrace{\sum_{p_n \in \mathcal{R}} w(p_n) \cdot x(p_0 + p_n)}_{\text{vanilla 3D convolution}} + \theta \cdot \underbrace{\sum_{p_n \in \mathcal{R}_{t-1} \cup \mathcal{R}_{t+1}} w(p_n) \cdot \big(x(p_0 + p_n) - x(p_0)\big)}_{\text{temporal difference term}}, \quad (1)$$

where $p_0$ denotes the current location at time $t$ on both the input feature map $x$ and the output feature map $y$, while $p_n$ enumerates the locations in $\mathcal{R} = \mathcal{R}_{t-1} \cup \mathcal{R}_t \cup \mathcal{R}_{t+1}$. For instance, for a 3×3×3 kernel the local receptive field region at the current frame is $\mathcal{R}_t = \{(0,-1,-1), (0,-1,0), \dots, (0,1,1)\}$, and $\mathcal{R}_{t-1}$, $\mathcal{R}_{t+1}$ are the corresponding regions at temporal offsets $-1$ and $+1$. The hyperparameter $\theta \in [0,1]$ trades off the contribution of the temporal difference: a higher value of $\theta$ gives more importance to temporal difference information. Specially, TDC degrades to vanilla 3D convolution when $\theta = 0$.
Overall, the advantages of introducing temporal difference clues into vanilla 3D convolution are twofold: 1) cascaded with deep normalization operators, TDC is able to mimic the temporally normalized frame difference [19] at the feature level, and 2) temporal central difference information provides fine-grained temporal context, which might be helpful for tracking the local ROIs for robust rPPG recovery.
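For concreteness, below is a minimal PyTorch sketch of TDC under the formulation of Eq. (1) above. The decomposition into a masked 3D convolution plus a 1×1×1 correction is our implementation choice, not the authors' reference code, and the class name is illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalDifferenceConv(nn.Module):
    """3x3x3 convolution plus a theta-weighted temporal-difference term, per Eq. (1)."""

    def __init__(self, in_ch, out_ch, theta=0.5):
        super().__init__()
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1, bias=False)
        self.theta = theta  # theta = 0 recovers vanilla 3D convolution

    def forward(self, x):  # x: (B, C, T, H, W)
        out = self.conv(x)  # vanilla 3D convolution term of Eq. (1)
        if self.theta == 0.0:
            return out
        w = self.conv.weight  # (out_ch, in_ch, 3, 3, 3)
        # Keep only the adjacent-frame kernel slices, i.e., R_{t-1} and R_{t+1}.
        mask = torch.zeros_like(w)
        mask[:, :, 0] = 1.0
        mask[:, :, 2] = 1.0
        w_adj = w * mask
        # Difference term: sum_{p_n in R'} w(p_n) * (x(p_0 + p_n) - x(p_0)).
        diff = F.conv3d(x, w_adj, padding=1)
        # Subtracting x(p_0) * sum w(p_n) is equivalent to a 1x1x1 convolution
        # whose kernel is the spatio-temporal sum of the adjacent-frame weights.
        w_sum = w_adj.sum(dim=(2, 3, 4), keepdim=True)  # (out_ch, in_ch, 1, 1, 1)
        diff = diff - F.conv3d(x, w_sum)
        return out + self.theta * diff
```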
II-B Backbone Search for Remote HR Measurement
All existing end-to-end rPPG networks [15, 16, 17, 18] are designed manually, which might be sub-optimal for feature representation. In this letter, we introduce NAS, for the first time in this task, to automatically discover the best-suited backbone for remote HR measurement. Our search algorithm is based on two gradient-based NAS methods [23, 24]; we refer readers to the original papers for more technical details.
As illustrated in Fig. 3(a), our goal is to search for cells to form a network backbone for the rPPG recovery task. As for the cell-level structure, Fig. 3(b) shows that each cell is represented as a directed acyclic graph (DAG) of $N$ nodes $\{x_i\}_{i=0}^{N-1}$, where each node represents a network layer. We denote the operation space as $\mathcal{O}$, and Fig. 3(c) shows the nine designed candidate operations. Each edge $(i,j)$ of the DAG represents the information flow from node $x_i$ to node $x_j$ and consists of the candidate operations weighted by the architecture parameter $\alpha^{(i,j)}$. Specially, each edge can be formulated by a function $\tilde{o}^{(i,j)}(x_i) = \sum_{o \in \mathcal{O}} \eta_o^{(i,j)} \cdot o(x_i)$, where the softmax function is utilized to relax the architecture parameter $\alpha^{(i,j)}$ into the operation weight $\eta_o^{(i,j)} = \exp(\alpha_o^{(i,j)}) / \sum_{o' \in \mathcal{O}} \exp(\alpha_{o'}^{(i,j)})$. The intermediate node can be denoted as $x_j = \sum_{i<j} \tilde{o}^{(i,j)}(x_i)$, and the output node is the depth-wise concatenation of all the intermediate nodes excluding the input nodes.
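The following is a minimal sketch of one such softmax-relaxed edge, assuming the DARTS-style continuous relaxation described above; `candidate_ops` stands in for the nine operations of Fig. 3(c) and is a placeholder here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedEdge(nn.Module):
    """One DAG edge: a softmax-weighted mixture of candidate operations."""

    def __init__(self, candidate_ops):
        super().__init__()
        self.ops = nn.ModuleList(candidate_ops)
        # One architecture parameter alpha_o per candidate operation.
        self.alpha = nn.Parameter(1e-3 * torch.randn(len(candidate_ops)))

    def forward(self, x):
        # Softmax relaxes alpha into the operation weights eta_o.
        eta = F.softmax(self.alpha, dim=0)
        return sum(w * op(x) for w, op in zip(eta, self.ops))
```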
In general, the architecture of the cells within the four blocks is shared for robust searching. We also consider a more flexible and complex setting in which all four searched blocks are varied. We name these two configurations 'Shared' and 'Varied', respectively. In addition, we also compare the operation space with and without TDC in Section III-B.

In the searching stage, $\mathcal{L}_{train}$ and $\mathcal{L}_{val}$ denote the training and validation losses respectively, both based on the overall loss described in Section II-C. The network parameters $w$ and the architecture parameters $\alpha$ are learned with the following bi-level optimization problem:

$$\min_{\alpha} \; \mathcal{L}_{val}\big(w^*(\alpha), \alpha\big) \quad \text{s.t.} \quad w^*(\alpha) = \arg\min_{w} \; \mathcal{L}_{train}(w, \alpha). \quad (2)$$

After convergence, the final discrete architecture is derived by: 1) setting $o^{(i,j)} = \arg\max_{o \in \mathcal{O}} \alpha_o^{(i,j)}$, and 2) for each intermediate node, choosing the two incoming edges with the two largest values of $\max_{o \in \mathcal{O}} \alpha_o^{(i,j)}$.
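A schematic sketch of one epoch of the alternating (first-order) update for Eq. (2) is given below, assuming separate optimizers over the network weights $w$ and the architecture parameters $\alpha$; the loader and criterion names are placeholders, not the authors' code.

```python
import torch

def search_epoch(model, train_loader, val_loader, w_opt, alpha_opt, criterion):
    """One epoch of first-order bi-level search: w on L_train, alpha on L_val."""
    for (x_tr, y_tr), (x_val, y_val) in zip(train_loader, val_loader):
        # Lower level: update network weights w on the training loss.
        w_opt.zero_grad()
        criterion(model(x_tr), y_tr).backward()
        w_opt.step()
        # Upper level: update architecture parameters alpha on the validation
        # loss. alpha_opt holds only the alpha parameters, so stray gradients
        # on w are cleared by w_opt.zero_grad() at the next iteration.
        alpha_opt.zero_grad()
        criterion(model(x_val), y_val).backward()
        alpha_opt.step()
```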
II-C Supervision in the Time and Frequency Domain
Besides designing the network architecture, we also need an appropriate loss function to guide the network. However, the existing negative Pearson (NegPearson) [17, 18] and signal-to-noise ratio (SNR) [15] losses constrain only the time or the frequency domain, respectively. Combining strong spectral-distribution supervision in the frequency domain with fine-grained signal-rhythm guidance in the time domain might help the network learn more intrinsic rPPG features. The loss in the time domain can be formulated as
$$\mathcal{L}_{time} = 1 - \frac{T\sum_{i=1}^{T} y_i\, g_i - \sum_{i=1}^{T} y_i \sum_{i=1}^{T} g_i}{\sqrt{\big(T\sum_{i=1}^{T} y_i^2 - (\sum_{i=1}^{T} y_i)^2\big)\big(T\sum_{i=1}^{T} g_i^2 - (\sum_{i=1}^{T} g_i)^2\big)}}, \quad (3)$$

where $y$ and $g$ indicate the predicted rPPG and the ground-truth PPG signals, respectively, and $T$ is the length of the signals.
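A minimal sketch of Eq. (3) for batched signals of shape (B, T) follows; the batch-mean reduction and the numerical epsilon are our assumptions.

```python
import torch

def neg_pearson_loss(pred, gt):
    """1 - Pearson correlation between predicted rPPG and ground-truth PPG."""
    pred = pred - pred.mean(dim=1, keepdim=True)  # center each signal
    gt = gt - gt.mean(dim=1, keepdim=True)
    r = (pred * gt).sum(dim=1) / (pred.norm(dim=1) * gt.norm(dim=1) + 1e-8)
    return (1.0 - r).mean()  # average over the batch
```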
Inspired by the SNR loss, we also treat HR estimation as a classification task via frequency transformation, which is formulated as

$$\mathcal{L}_{fre} = \mathrm{CE}\big(\mathrm{PSD}(y), \mathrm{HR}_{gt}\big), \quad (4)$$

where $\mathrm{PSD}(y)$ is the power spectral density of the predicted rPPG signal $y$, $\mathrm{HR}_{gt}$ denotes the ground-truth HR value, and $\mathrm{CE}$ is the classical cross-entropy loss. Finally, the overall loss in the time and frequency domains is $\mathcal{L}_{overall} = \mathcal{L}_{time} + \lambda \cdot \mathcal{L}_{fre}$, where $\lambda$ is a balancing parameter.
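A minimal sketch of Eq. (4) is shown below, assuming the PSD values inside a plausible HR band are used directly as class logits and the ground-truth HR is discretized to the nearest frequency bin; the 40-180 bpm band and the fps default are our assumptions.

```python
import torch
import torch.nn.functional as F

def freq_ce_loss(pred, hr_gt_bpm, fps=30.0, hr_min=40.0, hr_max=180.0):
    """Cross-entropy between the PSD of pred (B, T) and the ground-truth HR bin."""
    B, T = pred.shape
    psd = torch.abs(torch.fft.rfft(pred, dim=1)) ** 2       # (B, T//2 + 1)
    freqs_bpm = torch.fft.rfftfreq(T, d=1.0 / fps) * 60.0   # bin centers in bpm
    band = (freqs_bpm >= hr_min) & (freqs_bpm <= hr_max)    # plausible HR band
    logits = psd[:, band]
    # Target class: the frequency bin closest to the true HR (hr_gt_bpm: (B,)).
    target = torch.argmin((freqs_bpm[band][None, :] - hr_gt_bpm[:, None]).abs(), dim=1)
    return F.cross_entropy(logits, target)
```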
II-D Spatio-Temporal Data Augmentation
Two problems hindering the remote HR measurement task caught our attention, and we propose two data augmentation strategies accordingly (see the sketch below). On one hand, head movements can cause ROI occlusion. We propose the data augmentation strategy 'DA1', which randomly erases (cuts out) a partial spatio-temporal tube within a random time clip (less than 20% of the spatial size and 20% of the temporal length), mimicking ROI occlusion. On the other hand, the HR distribution is severely unbalanced, with a reversed-V shape. We propose the data augmentation strategy 'DA2', which temporally upsamples and downsamples the videos to generate extra training samples with extremely small or large HR values. To be specific, videos with HR values larger than 90 bpm are temporally interpolated by a factor of two, while those with HR smaller than 70 bpm are downsampled with sampling rate 2, to simulate halved and doubled heart rates, respectively.
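A minimal sketch of the two strategies, assuming clips are stored as (C, T, H, W) tensors; the exact cutout sampling and label handling are our assumptions.

```python
import random
import torch
import torch.nn.functional as F

def da1_spatiotemporal_cutout(clip):
    """DA1: erase a random spatio-temporal tube (<= 20% in each extent)."""
    C, T, H, W = clip.shape
    t_len = random.randint(1, max(1, int(0.2 * T)))
    h_len = random.randint(1, max(1, int(0.2 * H)))
    w_len = random.randint(1, max(1, int(0.2 * W)))
    t0 = random.randint(0, T - t_len)
    h0 = random.randint(0, H - h_len)
    w0 = random.randint(0, W - w_len)
    clip = clip.clone()
    clip[:, t0:t0 + t_len, h0:h0 + h_len, w0:w0 + w_len] = 0.0
    return clip

def da2_temporal_resample(clip, hr_bpm):
    """DA2: 2x temporal interpolation halves the apparent HR of fast hearts;
    2x downsampling doubles the apparent HR of slow hearts."""
    if hr_bpm > 90:
        clip = F.interpolate(clip.unsqueeze(0), scale_factor=(2, 1, 1),
                             mode='trilinear', align_corners=False).squeeze(0)
        return clip, hr_bpm / 2.0
    if hr_bpm < 70:
        return clip[:, ::2], hr_bpm * 2.0
    return clip, hr_bpm
```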

III Experiments
III-A Datasets and Metrics
Three public datasets are employed in our experiments. The VIPL-HR [14] and MAHNOB-HCI [25] datasets are utilized for intra-dataset testing with subject-independent 5-fold cross-validation, while MMSE-HR [6] is used for cross-dataset testing. Performance metrics for evaluating the average HR include the standard deviation (SD), the mean absolute error (MAE), the root mean square error (RMSE), and Pearson's correlation coefficient ($r$).
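For reference, a minimal sketch of the four metrics is given below, assuming (as is common in the rPPG literature) that SD denotes the standard deviation of the estimation error; that interpretation is our assumption.

```python
import numpy as np

def hr_metrics(hr_pred, hr_gt):
    """Return (SD, MAE, RMSE, r) for predicted vs. ground-truth HRs in bpm."""
    hr_pred, hr_gt = np.asarray(hr_pred), np.asarray(hr_gt)
    err = hr_pred - hr_gt
    sd = err.std()                       # std of the estimation error
    mae = np.abs(err).mean()
    rmse = np.sqrt((err ** 2).mean())
    r = np.corrcoef(hr_pred, hr_gt)[0, 1]  # Pearson's correlation coefficient
    return sd, mae, rmse, r
```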
III-B Ablation Study
All ablation studies are conducted on Fold-1 of the VIPL-HR [14] dataset. PhysNet [17] is adopted as the backbone for exploring the impacts of loss functions and TDC. The NAS searched backbone is used for exploring the effectiveness of data augmentation strategies.
Impact of Loss Functions. Fig. 4(a) shows that $\mathcal{L}_{fre}$ plays a vital role in accurate HR prediction, reducing RMSE by 4.3 bpm compared with $\mathcal{L}_{time}$ alone. The lowest RMSE (9.9 bpm) is obtained when supervising with $\mathcal{L}_{overall}$, indicating that elaborate supervision in both the time and frequency domains enables the model to learn robust rPPG features.
Impact of $\theta$ in TDC. According to Eq. (1), $\theta$ controls the contribution of the temporal difference information. As illustrated in Fig. 4(b), compared with the vanilla 3D convolution (i.e., $\theta = 0$), TDC achieves lower RMSE in most cases, indicating that temporal difference clues are helpful for HR measurement. Specially, we keep the two best-performing values of $\theta$ in our search operation space (RMSE = 9.07 and 9.10 bpm, respectively).
Impact of NAS Configuration. As shown in Fig. 4(c), the searched backbones always perform better than the manually designed one without NAS. Interestingly, NAS with shared cells is more likely to find excellent architectures than NAS with varied cells, which might be caused by the inefficiency of the search algorithm and the limited amount of data. Moreover, better-suited networks can be searched with TDC operators.
Impact of Data Augmentation. Fig. 4(d) illustrates the evaluation results of various data augmentation strategies. Surprisingly, applying 'DA1' alone (without 'DA2') slightly harms the performance (RMSE increases by 0.5 bpm). The reason might be that the model learns rPPG-unrelated features from the random spatio-temporal cutouts, especially under an unbalanced data distribution. However, with the help of 'DA2' to enrich the samples with extreme HR values, the combined strategy 'DA1+DA2' improves the performance by 5%.
III-C Intra-dataset Testing
Results on VIPL-HR. As shown in Table I, all three traditional methods (Tulyakov2016 [6], POS [8] and CHROM [7]) perform poorly because of the complex scenarios (e.g., large head movements and varying illumination) in the VIPL-HR dataset. Similarly, the existing end-to-end learning based methods (e.g., PhysNet [17] and DeepPhys [16]) predict unreliable HR values, as their Pearson's correlation coefficients $r$ are quite low. In contrast, our proposed AutoHR achieves performance comparable to the state-of-the-art non-end-to-end learning based method RhythmNet [14] and can be regarded as a strong end-to-end baseline for the challenging VIPL-HR dataset. Note that RhythmNet requires a strict and heavy preprocessing procedure to eliminate external disturbances, while our AutoHR learns intrinsic rPPG-aware features automatically without any preprocessing.
Results on MAHNOB-HCI. We also evaluate our method on the MAHNOB-HCI dataset, which is widely used in HR measurement. The video samples are challenging because of the high compression rate and spontaneous motions, caused by facial expressions for example. As shown in Table II, the proposed AutoHR achieves the lowest MAE (3.78 bpm) among the traditional and end-to-end learning methods, which indicates the robustness of the learned rPPG features. Our performance is on par with the latest non-end-to-end learning based method RhythmNet [14]. It implies that with excellent architecture and sufficient supervision, the end-to-end learning fashion is possible for robust HR measurement.
TABLE II: Intra-dataset HR estimation results on MAHNOB-HCI.

Method | SD (bpm) | MAE (bpm) | RMSE (bpm) | r
---|---|---|---|---
Poh2011 [4] | 13.5 | - | 13.6 | 0.36 |
CHROM [7] | - | 13.49 | 22.36 | 0.21 |
Li2014 [5] | 6.88 | - | 7.62 | 0.81 |
Tulyakov2016 [6] | 5.81 | 4.96 | 6.23 | 0.83 |
SynRhythm [13] | 10.88 | - | 11.08 | - |
RhythmNet [14] | 3.99 | - | 3.99 | 0.87 |
HR-CNN [15] | - | 7.25 | 9.24 | 0.51 |
rPPGNet [18] | 7.82 | 5.51 | 7.82 | 0.78 |
DeepPhys [16] | - | 4.57 | - | - |
AutoHR (Ours) | 4.73 | 3.78 | 5.10 | 0.86 |
III-D Cross-dataset Testing
In this experiment, the VIPL-HR database is used for training and all videos in the MMSE-HR database are directly used for testing. The proposed AutoHR clearly also generalizes well to the unseen domain. It is worth noting that AutoHR achieves the highest $r$ (0.89) among the traditional, non-end-to-end and end-to-end learning based methods, which signifies that 1) our predicted HRs are highly correlated with the ground-truth HRs, and 2) our model learns domain-invariant, intrinsic rPPG-aware features.
IV Conclusion
In this letter, we explore three main factors (i.e., network architecture, loss function and data augmentation) that influence the performance of 3DCNN based end-to-end frameworks (e.g., PhysNet [17]) for remote HR measurement. The proposed AutoHR generalizes well even in less-constrained scenarios and is a promising strong end-to-end baseline for the rPPG research community.
References
- [1] N. K. L. Murthy, P. C. Madhusudana, P. Suresha, V. Periyasamy, and P. K. Ghosh, “Multiple spectral peak tracking for heart rate monitoring from photoplethysmography signal during intensive physical exercise,” IEEE Signal Processing Letters, vol. 22, no. 12, pp. 2391–2395, 2015.
- [2] W. Verkruysse, L. O. Svaasand, and J. S. Nelson, “Remote plethysmographic imaging using ambient light.” Optics express, vol. 16, no. 26, pp. 21 434–21 445, 2008.
- [3] M.-Z. Poh, D. J. McDuff, and R. W. Picard, “Non-contact, automated cardiac pulse measurements using video imaging and blind source separation.” Optics express, vol. 18, no. 10, pp. 10 762–10 774, 2010.
- [4] M.-Z. Poh, D. McDuff, and R. Picard, “Advancements in noncontact, multiparameter physiological measurements using a webcam,” IEEE Transactions on Biomedical Engineering, vol. 58, no. 1, pp. 7–11, 2011.
- [5] X. Li, J. Chen, G. Zhao, and M. Pietikainen, “Remote heart rate measurement from face videos under realistic situations,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 4264–4271.
- [6] S. Tulyakov, X. Alameda-Pineda, E. Ricci, L. Yin, J. F. Cohn, and N. Sebe, “Self-adaptive matrix completion for heart rate estimation from face videos under realistic conditions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2396–2404.
- [7] G. De Haan and V. Jeanne, “Robust pulse rate from chrominance-based rppg,” IEEE Transactions on Biomedical Engineering, vol. 60, no. 10, pp. 2878–2886, 2013.
- [8] W. Wang, A. C. den Brinker, S. Stuijk, and G. de Haan, “Algorithmic principles of remote ppg,” IEEE Transactions on Biomedical Engineering, vol. 64, no. 7, pp. 1479–1491, 2017.
- [9] S. B. Park, G. Kim, H. J. Baek, J. H. Han, and J. H. Kim, “Remote pulse rate measurement from near-infrared videos,” IEEE Signal Processing Letters, vol. 25, no. 8, pp. 1271–1275, 2018.
- [10] J. Shi, I. Alikhani, X. Li, Z. Yu, T. Seppänen, and G. Zhao, “Atrial fibrillation detection from face videos by fusing subtle variations,” IEEE Transactions on Circuits and Systems for Video Technology, 2019.
- [11] G.-S. Hsu, A. Ambikapathi, and M.-S. Chen, “Deep learning with time-frequency representation for pulse estimation from facial videos,” in 2017 IEEE International Joint Conference on Biometrics (IJCB). IEEE, 2017, pp. 383–389.
- [12] Y. Qiu, Y. Liu, J. Arteaga-Falconi, H. Dong, and A. El Saddik, “Evm-cnn: Real-time contactless heart rate estimation from facial video,” IEEE Transactions on Multimedia, vol. 21, no. 7, pp. 1778–1787, 2018.
- [13] X. Niu, H. Han, S. Shan, and X. Chen, “Synrhythm: Learning a deep heart rate estimator from general to specific,” in 2018 24th International Conference on Pattern Recognition (ICPR). IEEE, 2018, pp. 3580–3585.
- [14] X. Niu, S. Shan, H. Han, and X. Chen, “Rhythmnet: End-to-end heart rate estimation from face via spatial-temporal representation,” IEEE Transactions on Image Processing, 2019.
- [15] R. Špetlík, V. Franc, and J. Matas, “Visual heart rate estimation with convolutional neural network,” in Proceedings of the British Machine Vision Conference, Newcastle, UK, 2018, pp. 3–6.
- [16] W. Chen and D. McDuff, “Deepphys: Video-based physiological measurement using convolutional attention networks,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 349–365.
- [17] Z. Yu, X. Li, and G. Zhao, “Remote photoplethysmograph signal measurement from facial videos using spatio-temporal networks,” in Proc. British Machine Vision Conference (BMVC), 2019, pp. 1–12.
- [18] Z. Yu, W. Peng, X. Li, X. Hong, and G. Zhao, “Remote heart rate measurement from highly compressed facial videos: an end-to-end deep learning solution with video enhancement,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 151–160.
- [19] W. Wang, S. Stuijk, and G. De Haan, “Exploiting spatial redundancy of image sensor for motion robust rppg,” IEEE Transactions on Biomedical Engineering, vol. 62, no. 2, pp. 415–425, 2015.
- [20] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
- [21] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
- [22] J. Carreira and A. Zisserman, “Quo vadis, action recognition? a new model and the kinetics dataset,” in proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6299–6308.
- [23] H. Liu, K. Simonyan, and Y. Yang, “Darts: Differentiable architecture search,” International Conference on Learning Representations (ICLR), 2019.
- [24] Y. Xu, L. Xie, X. Zhang, X. Chen, G.-J. Qi, Q. Tian, and H. Xiong, “Pc-darts: Partial channel connections for memory-efficient architecture search,” in International Conference on Learning Representations, 2019.
- [25] M. Soleymani, J. Lichtenauer, T. Pun, and M. Pantic, “A multimodal database for affect recognition and implicit tagging,” IEEE transactions on affective computing, vol. 3, no. 1, pp. 42–55, 2011.
- [26] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, “Joint face detection and alignment using multitask cascaded convolutional networks,” IEEE Signal Processing Letters, vol. 23, no. 10, pp. 1499–1503, 2016.
Appendix A: Dataset Details
The VIPL-HR [14] dataset has 3130 videos recorded from 107 subjects. The ground-truth HR and PPG signals are extracted from finger BVP sensors. The MAHNOB-HCI [25] dataset is one of the most widely used benchmarks for remote HR evaluation. It includes 527 videos from 27 subjects. We use the EXG2 signals as the ground-truth ECG signal in evaluation. Following the same routine as previous work [5], we use a 30-second clip (frames 306 to 2135) of each video. The MMSE-HR [6] dataset consists of 102 videos from 40 subjects, and the ground-truth HR values, computed from ECG signals, are provided with the dataset.
Appendix B: Implementation Details
Our proposed method is implemented in PyTorch. For each video clip, we use the MTCNN face detector [26] to crop the enlarged face area in the first frame and fix the region across the following frames. The balancing parameter $\lambda$ (Section II-C) is used for the loss trade-off.
Training and Testing Setting. In the training stage, we randomly sample face clips of size 3×160×112×112 (Channel×Time×Height×Width) as the network inputs. The models are trained with the Adam optimizer with an initial learning rate (lr) of 1e-4 and a weight decay (wd) of 5e-5. We train the models for a maximum of 15 epochs with a batch size of 4 on two P100 GPUs. In the testing stage, similar to [14], we uniformly split each 30-second video into three 10-second clips, and the video-level HR is calculated by averaging the HRs of the three clips.
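A schematic sketch of this testing protocol follows, assuming the model outputs an rPPG signal and that the clip HR is read off the dominant PSD peak within a plausible band; `estimate_hr` and the 40-180 bpm band are hypothetical helpers, not the authors' code.

```python
import torch

def estimate_hr(rppg, fps=30.0):
    """Hypothetical helper: HR (bpm) from the dominant PSD peak in 40-180 bpm."""
    psd = torch.abs(torch.fft.rfft(rppg, dim=-1)) ** 2
    freqs_bpm = torch.fft.rfftfreq(rppg.shape[-1], d=1.0 / fps) * 60.0
    band = (freqs_bpm >= 40.0) & (freqs_bpm <= 180.0)
    return freqs_bpm[band][psd[..., band].argmax(dim=-1)]

def video_hr(model, video, fps=30):
    """Split a 30-s video (C, T, H, W) into three 10-s clips and average the HRs."""
    clip_len = 10 * fps
    clips = [video[:, i * clip_len:(i + 1) * clip_len] for i in range(3)]
    hrs = [estimate_hr(model(c.unsqueeze(0)).squeeze(0), fps) for c in clips]
    return sum(hrs) / len(hrs)
```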
Searching Setting. Similar to [24], partial channel connection and edge normalization are adopted. In the training stage, we randomly sample face clips of size 3×128×112×112. The initial channel number is 8, which doubles after searching. The Adam optimizer with lr=1e-4 and wd=5e-5 is utilized for training the model weights, while the architecture parameters are trained with the Adam optimizer with lr=6e-4 and wd=1e-3. We search for 12 epochs on Fold-1 of the VIPL-HR dataset with batch size 2, and the architecture parameters are not updated during the first five epochs. The whole searching process takes ten days on a P100 GPU.

Appendix C: The searched architecture
The architecture of the searched cell of the proposed AutoHR is illustrated in Fig. 5. The former nodes, i.e., intermediate nodes 0 and 1, seem to prefer temporal convolutions such as 'conv_3x1x1' and 'conv_5x1x1', while the latter node 2 favors spatial convolutions such as 'conv_1x5x5'. This might inspire further task-aware network design in the rPPG community.
Appendix D: Visualization
Here we show some visualizations of the feature activations and predicted rPPG signals from hard samples. As shown in Fig. 6(a), in terms of the low-level features (after the first 'Block'), both PhysNet [17] and AutoHR focus more on the forehead regions, which is in accordance with the prior knowledge mentioned in [2]. However, under head movement, the high-level features (after the third 'Block') from PhysNet are not stable, while those from AutoHR remain robust in tracking the particular regions. It is worth noting that features from both PhysNet and AutoHR are chaotic when the head bows (see the last column).
Fig. 6(b) shows the predicted rPPG signals in a scenario with head movement (slow movement at the beginning and fast movement in the later stage). The rPPG signals predicted by AutoHR are clearly highly correlated with the ground-truth PPG signals. However, during the fast head movement (between 10 and 13 seconds along the time axis), the predicted signals are quite noisy, which indicates that end-to-end learning based methods remain vulnerable to such quick and large pose changes.
