Unsupervised heart abnormality detection based on phonocardiogram analysis with Beta Variational Auto-Encoders
Abstract
Heart sound (also known as phonocardiogram, PCG) analysis is a popular way to detect cardiovascular diseases (CVDs). Most PCG analysis is performed in a supervised way, which demands both normal and abnormal samples. This paper proposes an unsupervised PCG analysis method that uses a beta variational auto-encoder (β-VAE) to model normal PCG signals. The best-performing model reaches an AUC (Area Under Curve) value of 0.91 in the ROC (Receiver Operating Characteristic) test for PCG signals collected from the same source. Unlike the majority of β-VAEs, which are used as generative models, the best-performing β-VAE has a β value smaller than 1. Further experiments find that introducing a lightly weighted KL divergence between the latent-space distribution and the normal distribution improves anomalous PCG detection based on anomaly scores derived from the reconstruction loss. This suggests that anomaly scores based on reconstruction loss may be better than anomaly scores based on the latent vectors of samples.
Index Terms— Phonocardiogram Analysis, Variational Auto-Encoder, Anomaly Detection, Outlier Detection, Unsupervised Learning
1 Introduction
With the rapid development of smart wearable and home-care devices, phonocardiogram (PCG) analysis is increasingly used to identify the risk of cardiovascular diseases (CVDs), thanks to more portable devices and easier collection of samples. Most existing methods require a collection of both normal and abnormal samples to build a supervised machine learning system. Collecting abnormal samples can be a tricky task, as the distribution of CVD types leads to possible bias in the data, which potentially harms system performance. Moreover, labelling PCG signals requires expensive professional labour. This paper proposes an unsupervised system trained on normal PCG signals only, which detects potential risks of CVDs with less labelling effort and an easier dataset design.
The proposed system attempts to analyse normal PCG signals by extracting their most distinguishable features. As no abnormal samples are used in the training process, the main task of the proposed system is to find a representation of PCG signals such that normal signals are represented by similar feature vectors. To this end, a Variational Auto-Encoder (VAE) is proposed to encode the normal PCG signals, where the latent vectors in the latent space of the VAE follow a Gaussian distribution.
VAEs are used as generative models in most cases, balancing reconstruction accuracy against the distribution of the latent space [1]. The β-VAE is a variation of the VAE that introduces a variable β to find the right balance between the latent-space distribution and the reconstruction loss. Serving as generative models, β-VAEs usually use a β value larger than 1. In this paper, however, β-VAEs are used as outlier detectors, so it is still worth discussing the β value of the best-performing system. With the PhysioNet/CinC dataset [2], β-VAE architectures with different values of β are tested.
Most outlier detection systems use either the reconstruction loss or variables related to the latent vectors to calculate anomaly scores. This paper further explores how the KL divergence of the latent vectors and the reconstruction loss relate to each other. Based on the results, this paper investigates the preferred variable for anomaly score calculation in PCG analysis.
The rest of the paper is organised as follows. Related works are reviewed first. Then the basic architecture of the proposed system is introduced. The experiment setup is explained next, followed by the results. The results are then discussed, followed by a conclusion.
2 Literature Review
Phonocardiograms (PCGs) are of great value as they can be used to detect heart disease. Related attempts date back to 1995 [3]. In 2016, PhysioNet [4] and CinC (Computing in Cardiology Challenge) organised a data challenge on detecting anomalous PCG signals. Since then, the release of the challenge dataset [2] has promoted research on related topics.
Traditional methods, such as support vector machines [5], i-vectors [6] and hidden Markov models [7], have been used to detect anomalous PCG signals in a supervised way. Deep learning methods based on the Variational Auto-Encoder (VAE) [8], deep convolutional neural networks [9, 10] and recurrent neural networks [11] have also been used recently. Ensembling both traditional and deep learning methods, the best-performing system [12] of the challenge achieves an accuracy of 0.91.
Most existing methods use a supervised setup, for which collecting data at scale is more difficult and the labelling process requires expensive professional labour. As a result, there have been some attempts at unsupervised systems. Unnikrishnan et al. [13] use an auto-encoder to distinguish normal and abnormal PCG signals with an AUC of 0.828, which needs further improvement compared with state-of-the-art systems.
As a result, this paper proposes a VAE-based method. A VAE system does not only pursue an accurate reconstruction of the original signals but also requires the distribution of the latent space to obey a specific probability distribution (usually the normal distribution). Existing work shows that VAEs may outperform the more traditional AE-based systems for anomaly detection [14]. However, as VAEs are used as generative models in most cases, this paper proposes to use the β-VAE [15], which enables control over the balance between the reconstruction loss and the latent-space distribution.
Moreover, most existing works treat the dataset as a single-domain problem, i.e. no samples collected in a different way are used in the evaluation process. To extend the applicability of the proposed system, this paper considers PCG anomaly detection as a multiple-domain problem: the data used for training and the data used for testing are collected by different devices. In particular, unlike Banerjee and Ghose [8], this paper uses different subsets of the PhysioNet/CinC dataset, which is more challenging in terms of data inconsistency.
3 Methods
In this section, the system architecture is introduced, which has three stages: pre-processing, VAE and post-processing. The pre-processing stage normalises the Mel spectrogram along the frequency bins. The VAE system encodes and decodes the normalised Mel spectrogram splits to produce a raw anomaly score from the reconstruction loss, which is used to calculate the final anomaly score of a piece of audio at the post-processing stage.
3.1 Dataset
The dataset used in the proposed method is the PhysioNet/CinC heart sound dataset [2], in which six subsets of data are collected from different data sources. The number of samples and the positive rate are listed in Table 1. Each subset is collected at a different location with different recording devices [2], hence each can be considered a different data domain.
Table 1. Number of samples and positive (abnormal) rate of each subset.

| Subset | a | b | c | d | e | f |
|---|---|---|---|---|---|---|
| Sample # | 409 | 490 | 31 | 55 | 2141 | 114 |
| Positive rate | 0.71 | 0.21 | 0.78 | 0.51 | 0.09 | 0.30 |
3.2 Pre-Processing
The PhysioNet heart sound dataset contains samples of variable lengths, ranging from 5 seconds to 120 seconds. All samples are pre-processed to last 8 seconds, which contains 6 to 13 complete cardiac cycles. Shorter samples are padded in a recurrent manner and longer samples are simply truncated.
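A minimal sketch of this length normalisation, assuming the 2 kHz sampling rate stated below; the function name and the tiling-based recurrent padding are our illustration, as the paper does not publish code:

```python
import numpy as np

FS = 2000            # sampling rate used in the paper (2 kHz)
TARGET_LEN = 8 * FS  # every recording is cut or padded to 8 seconds

def fix_length(signal: np.ndarray) -> np.ndarray:
    """Truncate long recordings; recurrently pad short ones by repetition."""
    if len(signal) >= TARGET_LEN:
        return signal[:TARGET_LEN]
    # recurrent padding: tile the recording until it covers 8 s, then cut
    reps = int(np.ceil(TARGET_LEN / len(signal)))
    return np.tile(signal, reps)[:TARGET_LEN]
```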
Then the Mel spectrogram is calculated by the formulae specified by Davis and Mermelstein [16] with a window size of 1024, a hop size of 512 and 14 Mel filters applied. Given the sampling rate of 2 kHz, each frame lasts about 0.51 seconds.
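A hedged sketch of this step using librosa; the power-to-dB scaling is our assumption, as the paper does not state the amplitude scaling:

```python
import librosa
import numpy as np

def mel_spectrogram(signal: np.ndarray, fs: int = 2000) -> np.ndarray:
    """14-band Mel spectrogram with a 1024-sample window and 512-sample hop,
    so each frame covers 1024 / 2000 Hz, i.e. about 0.51 s of audio."""
    mel = librosa.feature.melspectrogram(
        y=signal.astype(np.float32), sr=fs,
        n_fft=1024, hop_length=512, n_mels=14)
    return librosa.power_to_db(mel)  # log scaling is our assumption
```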
Suppose the Mel spectrogram is represented by $M \in \mathbb{R}^{F \times T}$, where $F$ is the number of frequency bins and $T$ is the number of frames. The Mel spectrogram is then normalised with respect to each frequency bin. Using $m_{i,j}$ to represent the element at the $i$-th row and $j$-th column of the spectrogram, the normalised $i$-th frequency bin can be calculated as

$$\hat{m}_{i,j} = \frac{m_{i,j} - \mu_i}{\sigma_i}, \quad \mu_i = \frac{1}{T}\sum_{j=1}^{T} m_{i,j}, \quad \sigma_i = \sqrt{\frac{1}{T}\sum_{j=1}^{T}\left(m_{i,j} - \mu_i\right)^2} \quad (1)$$
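A short sketch of this normalisation, under our assumption that Eq. (1) is a per-bin standardisation across frames:

```python
import numpy as np

def normalise(mel: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Zero-mean, unit-variance normalisation of each frequency bin
    (each row of the F x T spectrogram), following Eq. (1)."""
    mu = mel.mean(axis=1, keepdims=True)
    sigma = mel.std(axis=1, keepdims=True)
    return (mel - mu) / (sigma + eps)  # eps guards against silent bins
```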
The resulting normalised Mel spectrogram is then divided into super-frames. Each super-frame is formed by five consecutive frames and hence lasts 3.07 seconds, which should contain a few heart beats in most cases. The start frames of the super-frames have a hop size of a single frame. Namely, for a piece of audio that has $T$ frames, there are $T - 4$ super-frames.
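A sketch of the super-frame construction; flattening each five-frame slice into a vector for the fully-connected encoder of the next section is our assumption:

```python
import numpy as np

def super_frames(mel_norm: np.ndarray, width: int = 5) -> np.ndarray:
    """Slide a 5-frame window over the T frames with a hop of one frame,
    yielding T - 4 flattened super-frames of shape (5 * F,)."""
    F, T = mel_norm.shape
    return np.stack([mel_norm[:, j:j + width].reshape(-1)
                     for j in range(T - width + 1)])
```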
3.3 VAE Architecture
The input of the VAE system is the super-frames of the normalised Mel spectrogram. In the training process of the proposed system, each batch contains 640 super-frames (i.e. 640 groups of five consecutive frames) with batch normalisation applied.

The encoder of the proposed VAE system has a simple structure of four fully-connected hidden layers with 32, 32, 16 and 16 neurons respectively. The resulting latent representations have 16 dimensions, which are regarded as the output layer of the encoder and the input layer of the decoder. The decoder has the same structure but with the number of neurons per layer in reverse order and the input and output layers swapped. Figure 1 shows the proposed VAE architecture.
Fig. 1. The proposed VAE architecture.
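A minimal PyTorch sketch of this architecture; the 70-dimensional input (14 Mel bins × 5 frames), the ReLU activations and the separate mean/log-variance heads on the 16-dimensional latent space are our assumptions:

```python
import torch
import torch.nn as nn

class BetaVAE(nn.Module):
    """Fully-connected beta-VAE: encoder with 32-32-16-16 hidden layers,
    a 16-d latent space, and a mirrored decoder."""
    def __init__(self, in_dim: int = 70, z_dim: int = 16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, 32), nn.ReLU(),
            nn.Linear(32, 32), nn.ReLU(),
            nn.Linear(32, 16), nn.ReLU(),
            nn.Linear(16, 16), nn.ReLU())
        self.fc_mu = nn.Linear(16, z_dim)      # mean of q(z|x)
        self.fc_logvar = nn.Linear(16, z_dim)  # log-variance of q(z|x)
        self.decoder = nn.Sequential(
            nn.Linear(z_dim, 16), nn.ReLU(),
            nn.Linear(16, 16), nn.ReLU(),
            nn.Linear(16, 32), nn.ReLU(),
            nn.Linear(32, 32), nn.ReLU(),
            nn.Linear(32, in_dim))

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        # reparameterisation trick: sample z ~ N(mu, sigma^2)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return self.decoder(z), mu, logvar
```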
Following the standard VAE formulation, the loss function of the whole β-VAE is the sum of the reconstruction loss and β times the KL divergence between the resulting latent-variable distribution and the normal distribution. In the proposed system, the reconstruction loss is also used as the anomaly scoring variable, hence entropy-based measurements such as the Evidence Lower Bound (ELBO) cannot be used as the reconstruction loss. Instead, the proposed system uses the Mean Squared Error (MSE) as the reconstruction loss function.
Suppose $x$ is a super-frame of the normalised Mel spectrogram and $\hat{x}$ is its reconstruction, and suppose there are $d$ dimensions in the latent space, where $\mu_k$ and $\sigma_k$ denote the mean and standard deviation of the $k$-th latent dimension. The loss function used in the proposed system can be represented as

$$\mathcal{L} = \mathrm{MSE}(x, \hat{x}) + \beta \cdot \frac{1}{2}\sum_{k=1}^{d}\left(\mu_k^2 + \sigma_k^2 - \log\sigma_k^2 - 1\right) \quad (2)$$
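A sketch of Eq. (2) in PyTorch, using the closed-form KL divergence between a diagonal Gaussian and $\mathcal{N}(0, I)$; averaging the KL term over the batch is our choice:

```python
import torch

def beta_vae_loss(x, x_hat, mu, logvar, beta: float):
    """MSE reconstruction loss plus beta-weighted KL divergence to N(0, I)."""
    mse = torch.mean((x - x_hat) ** 2)
    kl = 0.5 * torch.sum(mu ** 2 + logvar.exp() - logvar - 1, dim=1).mean()
    return mse + beta * kl
```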
3.4 Post-Processing
The anomaly scores of the super-frames in a piece of audio are averaged to form the final anomaly score of that audio. As the super-frames have a length of 5 raw frames, there are $T - 4$ super-frames in a piece of audio with $T$ frames. Using $s_i$ to represent the anomaly score of the $i$-th super-frame, the overall anomaly score $s$ can then be calculated as

$$s = \frac{1}{T - 4}\sum_{i=1}^{T - 4} s_i \quad (3)$$
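A short sketch of this post-processing step, reusing the hypothetical `BetaVAE` model sketched in Section 3.3:

```python
import numpy as np
import torch

def audio_anomaly_score(model, frames: np.ndarray) -> float:
    """Average the per-super-frame reconstruction MSE over the T - 4
    super-frames of one recording, as in Eq. (3)."""
    x = torch.as_tensor(frames, dtype=torch.float32)
    with torch.no_grad():
        x_hat, _, _ = model(x)
    scores = ((x - x_hat) ** 2).mean(dim=1)  # one MSE per super-frame
    return scores.mean().item()
```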
4 Results
4.1 Single Subset System
First, the system performance with a single subset of data is investigated. As it contains very few normal samples, subset ‘c’ is not used. For each of the other subsets (‘a’, ‘b’, ‘d’, ‘e’, ‘f’), 90% of the normal data is used to train the VAE model. The remaining normal data and all abnormal data are used to evaluate the performance of the resulting system. There are six proposed settings for the value of β: 0, 0.01, 0.1, 1, 10, 100. When β = 1, the system is essentially a common VAE system, but when β = 0, the system is still not a plain auto-encoder system, as an extra sampling process is introduced in the latent space.
To evaluate the proposed systems, Receiver Operating Characteristic (ROC) analysis is performed. The Area Under Curve (AUC) is calculated as the area under the curve formed by the true positive rate and the false positive rate, as demonstrated in Figure 2, where the result of subset ‘e’ is used as an example. From the diagram, the β-VAE with a small β value is slightly better than the auto-encoder, and both systems are better than the classical VAE system (where β = 1) according to the AUC values.
Fig. 2. ROC curves of the proposed systems on subset ‘e’.
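A sketch of this evaluation using scikit-learn, assuming label 1 marks abnormal recordings and that higher anomaly scores indicate abnormality:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def evaluate(scores: np.ndarray, labels: np.ndarray):
    """labels: 1 = abnormal, 0 = held-out normal; scores: outputs of Eq. (3)."""
    fpr, tpr, _ = roc_curve(labels, scores)  # points of the ROC curve
    auc = roc_auc_score(labels, scores)      # area under that curve
    return auc, fpr, tpr
```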
In Table 2, the proposed system is tested with the six settings of β besides a reference auto-encoder. In most cases (subsets ‘a’, ‘d’, ‘e’), β-VAEs with a small value of β such as 0, 0.01 or 0.1 outperform the other systems. For subset ‘b’, the auto-encoder outperforms all β-VAEs, with a preference for smaller β values among the latter. For subset ‘f’, the system prefers a larger value of β. Overall, the performance of the candidate systems shows that β-VAEs with a smaller β value may perform better.
Table 2. AUC values of the single-subset systems with different β values.

| β | a | b | d | e | f |
|---|---|---|---|---|---|
| AE | 0.816 | 0.583 | 0.667 | 0.922 | 0.835 |
| 0 | 0.823 | 0.579 | 0.750 | 0.924 | 0.827 |
| 0.01 | 0.825 | 0.561 | 0.607 | 0.923 | 0.823 |
| 0.1 | 0.821 | 0.551 | 0.583 | 0.918 | 0.824 |
| 1 | 0.798 | 0.559 | 0.607 | 0.881 | 0.824 |
| 10 | 0.800 | 0.557 | 0.631 | 0.881 | 0.805 |
| 100 | 0.803 | 0.551 | 0.691 | 0.881 | 0.846 |
Usually, a larger β value in a β-VAE gives a higher priority to the distribution of the latent space at the cost of reconstruction loss. In the extreme case where an auto-encoder is used, the system does not consider the distribution of the latent space at all. This experiment demonstrates that introducing a loss on the latent-space distribution helps anomaly detection on unseen data. Such improvements cost only a little accuracy of anomaly detection on the in-domain data.
4.2 Multiple Subsets System
Next, the proposed experiments are repeated with multiple subsets. The multiple-domain tests introduce a higher complexity of data distribution. We perform the experiments with three subset combinations: ‘ae’, ‘ef’ and all subsets (‘abcdef’). Similar to the single-domain case, in all cases 90% of the normal data is used for training, and the remaining 10% of the normal data and all abnormal data are used for testing.
Table 3. AUC values of the multiple-subset systems with different β values.

| β | ‘ae’ | ‘ef’ | all |
|---|---|---|---|
| AE | 0.820 | 0.847 | 0.783 |
| 0 | 0.823 | 0.890 | 0.782 |
| 0.01 | 0.822 | 0.899 | 0.786 |
| 0.1 | 0.810 | 0.891 | 0.750 |
| 1 | 0.766 | 0.838 | 0.678 |
| 10 | 0.765 | 0.836 | 0.678 |
| 100 | 0.765 | 0.836 | 0.644 |
In Table 3, the best-performing system for subsets ‘ae’ has a β value of 0, and the best-performing system has a β value of 0.01 in the other cases. These results again confirm the finding in the single-subset case: introducing a loss for the latent-space distribution improves the system performance, but the weight of the latent distribution loss should remain fairly small.
5 Discussion
Although the proposed method uses the reconstruction loss as the anomaly score, parameters related to the latent vectors, such as the model likelihood, can also be used as the anomaly score. We therefore examine the relationship between the reconstruction loss and the KL divergence between the latent vectors and the normal distribution in the latent space. Pearson’s correlation coefficient between the KL divergence and the reconstruction loss is calculated. Suppose $k_i$ represents the KL divergence between the $i$-th sample in the latent space and the normal distribution, and $r_i$ represents the reconstruction loss of the $i$-th sample in terms of MSE; Pearson’s correlation coefficient can then be calculated as

$$\rho = \frac{\sum_{i}\left(k_i - \bar{k}\right)\left(r_i - \bar{r}\right)}{\sqrt{\sum_{i}\left(k_i - \bar{k}\right)^2}\sqrt{\sum_{i}\left(r_i - \bar{r}\right)^2}} \quad (4)$$

where $\bar{k}$ and $\bar{r}$ are the mean values of $k_i$ and $r_i$ over all samples.
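A sketch of this measurement; the symbols $k_i$ and $r_i$ follow our reconstruction of Eq. (4), with the coefficient computed by scipy:

```python
import numpy as np
from scipy.stats import pearsonr

def kl_mse_correlation(mu, logvar, x, x_hat) -> float:
    """Pearson correlation (Eq. 4) between each sample's KL divergence
    to N(0, I) and its reconstruction MSE; inputs are numpy arrays."""
    k = 0.5 * np.sum(mu**2 + np.exp(logvar) - logvar - 1, axis=1)  # per-sample KL
    r = np.mean((x - x_hat) ** 2, axis=1)                          # per-sample MSE
    rho, _ = pearsonr(k, r)
    return rho
```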
Table 4. Pearson’s correlation coefficient between the KL divergence and the reconstruction loss for different β values.

| β | 0 | 0.01 | 0.1 | 1 | 10 | 100 |
|---|---|---|---|---|---|---|
| ρ | 0.44 | 0.41 | 0.32 | 0.79 | 0.79 | 0.79 |
Using the same set of β values, the correlation between the KL divergence and the reconstruction loss is shown in Table 4. The correlation between the reconstruction loss and the KL divergence between the samples and the normal distribution in the latent space becomes high with a large β value. As a result, in the systems with a larger β value, the reconstruction loss serving as the anomaly score should perform similarly to the cases where parameters related to the latent vectors are used as the anomaly score. Therefore, the better performance of systems with smaller β values compared with systems with larger β values suggests that reconstruction-based anomaly scores may outperform latent-vector-based anomaly scores in this case of unsupervised PCG anomaly detection.
6 Conclusion
This paper proposes an unsupervised PCG analysis system that detects anomalous PCG signals by learning from normal PCG signals only. The system is based on the β-VAE, and the β value of the best-performing system is even smaller than 1 in most cases. Further investigation suggests that, for this PCG analysis case, the anomaly score calculation has a slight preference for reconstruction-based parameters.
References
- [1] Christopher P. Burgess, Irina Higgins, Arka Pal, Loic Matthey, Nick Watters, Guillaume Desjardins, and Alexander Lerchner, “Understanding disentangling in β-VAE,” arXiv:1804.03599 [cs, stat], Apr. 2018.
- [2] Chengyu Liu, David Springer, Qiao Li, Benjamin Moody, Ricardo Abad Juan, Francisco J Chorro, Francisco Castells, José Millet Roig, Ikaro Silva, Alistair E W Johnson, Zeeshan Syed, Samuel E Schmidt, Chrysa D Papadaniil, Leontios Hadjileontiadis, Hosein Naseri, Ali Moukadem, Alain Dieterlen, Christian Brandt, Hong Tang, Maryam Samieinasab, Mohammad Reza Samieinasab, Reza Sameni, Roger G Mark, and Gari D Clifford, “An open access database for the evaluation of heart sound algorithms,” Physiological Measurement, vol. 37, no. 12, pp. 2181–2213, Dec. 2016.
- [3] D Barschdorff, U Femmer, and E Trowitzsch, “Automatic phonocardiogram signal analysis in infants based on wavelet transforms and artificial neural networks,” in Computers in Cardiology 1995. IEEE, 1995, pp. 753–756.
- [4] A. Goldberger, L. Amaral, L. Glass, J. Hausdorff, P.C. Ivanov, R. Mark, J.E. Mietus, G.B. Moody, C.K. Peng, and H.E. Stanley, “PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals,” Circulation [Online], pp. e215–e220, 2000.
- [5] Morteza Zabihi, Ali Bahrami Rad, Serkan Kiranyaz, Moncef Gabbouj, and Aggelos K. Katsaggelos, “Heart sound anomaly and quality detection using ensemble of neural networks without segmentation,” in 2016 Computing in Cardiology Conference (CinC), Sept. 2016.
- [6] Mohammad Adiban, Bagher BabaAli, and Saeedreza Shehnepoor, “Statistical feature embedding for heart sound classification,” Journal of Electrical Engineering, vol. 70, no. 4, pp. 259–272, 2019.
- [7] I. Grzegorczyk, M. Solinski, M. Lepek, A. Perka, J. Rosinski, J. Rymko, K. Stepien, and J. Gieraltowski, “PCG classification using a neural network approach,” in 2016 Computing in Cardiology Conference (CinC), 2016, pp. 1129–1132.
- [8] Rohan Banerjee and Avik Ghose, “A semi-supervised approach for identifying abnormal heart sounds using variational autoencoder,” in 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 1249–1253.
- [9] Tomoya Koike, Kun Qian, Qiuqiang Kong, Mark D. Plumbley, Björn W. Schuller, and Yoshiharu Yamamoto, “Audio for audio is better? An investigation on transfer learning models for heart sound classification,” in The 42nd International Engineering in Medicine and Biology Conference, Montréal, Canada, 2020, pp. 74–77.
- [10] Jonathan Rubin, Rui Abreu, Anurag Ganguli, Saigopal Nelaturi, Ion Matei, and Kumar Sricharan, “Recognizing Abnormal Heart Sounds Using Deep Learning,” arXiv:1707.04642 [cs], Oct. 2017, arXiv: 1707.04642.
- [11] Kun Qian, Zhao Ren, Fengquan Dong, Wen-Hsing Lai, Björn W. Schuller, and Yoshiharu Yamamoto, “Deep wavelets for heart sound classification,” in Proceedings of the 28th International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS), Taipei, Taiwan, 2019, pp. 1–2.
- [12] Cristhian Potes, Saman Parvaneh, Asif Rahman, and Bryan Conroy, “Ensemble of feature based and deep learning based classifiers for detection of abnormal heart sounds,” in 2016 Computing in Cardiology Conference (CinC), Sept. 2016.
- [13] Balagopal Unnikrishnan, Pranshu Ranjan Singh, Xulei Yang, and Matthew Chin Heng Chua, “Semi-supervised and Unsupervised Methods for Heart Sounds Classification in Restricted Data Environments,” arXiv:2006.02610 [cs], June 2020, arXiv: 2006.02610.
- [14] Rong Yao, Chongdang Liu, Linxuan Zhang, and Peng Peng, “Unsupervised Anomaly Detection Using Variational Auto-Encoder based Feature Extraction,” in 2019 IEEE International Conference on Prognostics and Health Management (ICPHM), San Francisco, CA, USA, June 2019, pp. 1–7, IEEE.
- [15] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner, “β-VAE: Learning basic visual concepts with a constrained variational framework,” in Proceedings of the 5th International Conference on Learning Representations, 2017, p. 22.
- [16] S. Davis and P. Mermelstein, “Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 28, no. 4, pp. 357–366, 1980.