Flow-based self-supervised density estimation for anomalous sound detection
Abstract
To develop a machine sound monitoring system, a method for detecting anomalous sound is proposed. Exact likelihood estimation using Normalizing Flows is a promising technique for unsupervised anomaly detection, but it can fail at out-of-distribution detection since the likelihood is affected by the smoothness of the data. To improve the detection performance, we train the model to assign higher likelihood to target machine sounds and lower likelihood to sounds from other machines of the same machine type. We demonstrate that this enables the model to incorporate a self-supervised classification-based approach. Experiments conducted using the DCASE 2020 Challenge Task2 dataset showed that the proposed method improves the AUC by 4.6% on average when using Masked Autoregressive Flow (MAF) and by 5.8% when using Glow, which is a significant improvement over the previous method.
Index Terms— Machine health monitoring, Anomaly detection, Normalizing Flows
1 Introduction
As a number of companies worldwide are facing a shortage of skilled maintenance workers, the demand for an automatic sound-monitoring system has been increasing. Unsupervised anomaly detection methods are often adopted for this system since anomalous data can rarely be obtained [1, 2].
For unsupervised anomaly detection, generative models such as the variational autoencoder (VAE), which use the approximate likelihood and the reconstruction error to calculate anomaly scores, have been utilized. Normalizing Flows (NF) [3, 4] is another promising generative model thanks to its ability to perform exact likelihood estimation. However, NF models can fail at out-of-distribution detection because they assign higher likelihood to data with smoother structure [5, 6, 7].
The self-supervised classification-based approach is another way to detect anomalies when sound data from multiple machines of the same machine type are available [8]. In self-supervised learning, a model is trained on a main task together with another task, called an auxiliary task, to improve the performance on the main task. In the self-supervised classification-based approach, the auxiliary task is to train a classifier that predicts the machine ID assigned to each machine. If the classifier misclassifies the machine ID of a sound clip, the clip is regarded as anomalous. Although this approach improves the detection performance on average, it shows significantly low scores on some machines.
In this paper, we propose a self-supervised density estimation method using NF. Our method uses sound data from the machine ID on which anomalies are to be detected (target data) and sound data from other machines of the same machine type (outlier data), and the model is trained to assign higher likelihood to the target data and lower likelihood to the outlier data. This method is a self-supervised approach because it improves the detection performance on one machine ID by introducing an auxiliary task in which the model discriminates the sound data of that machine ID from the sound data of other machine IDs of the same machine type. Because the method also increases the likelihood of the target data, it retains the character of the unsupervised approach.
We evaluated the detection performance of our method using six types of machine sound data. Glow [9] and MAF [10] were utilized as NF models. Experimental results showed that our method outperformed unsupervised approaches while showing more stable detection performances compared to the self-supervised classification-based approach.
2 Problem Statement
Anomalous sound detection is the task of identifying whether a machine is normal or anomalous based on the anomaly score that a trained model calculates from its sound. An input sound is judged anomalous if its anomaly score exceeds a threshold value. We consider unsupervised anomalous sound detection, where only normal sound is available for training. We assume that sound data of multiple machines of the same machine type are available. This is a realistic assumption since multiple machines of the same type are often installed in factories. This problem setting is the same as that in DCASE 2020 Challenge Task2 [11].
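As a minimal sketch of the decision rule described above, the snippet below flags a clip as anomalous when its anomaly score exceeds a threshold. The function name `is_anomalous`, the example scores, and the threshold value are illustrative assumptions and are not taken from the experiments.

```python
import numpy as np

def is_anomalous(scores, threshold):
    """Return a boolean array that is True where the anomaly score exceeds the threshold."""
    scores = np.asarray(scores, dtype=float)
    return scores > threshold

# Hypothetical anomaly scores for four test clips and an arbitrary threshold.
example_scores = [0.2, 1.7, 0.9, 3.1]
decisions = is_anomalous(example_scores, threshold=1.0)
print(decisions.tolist())  # [False, True, False, True]
```

In practice the threshold would have to be chosen from normal data only, since anomalous sounds are not available for training in this setting.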
3 Relation to prior work
3.1 Improving the detection performance of NF
Researchers have proposed various methods to modify the likelihood assigned by NF models. Serra et al. [7] used input complexity to modify the likelihood. However, a compression algorithm has to be chosen to calculate the input complexity. Ren et al. [12] used the likelihood ratio between the likelihood of the input data and the background component, but parameters have to be tuned for modeling the background. Hendrycks et al. [13] used auxiliary datasets of outliers to improve the detection performance. However, the proposed loss function can destabilize the detection performance. Kirichenko et al. [14] proposed a loss function to distinguish the in-distribution data from out-of-distribution data by means of a supervised approach. The authors argued that this method cannot be used to detect out-of-distribution data not included in the training dataset. We show that this method can be used for detecting anomalies if sound data of the same machine type as the target data is used for the outlier data in the training dataset.
3.2 A self-supervised approach for anomaly detection
In DCASE 2020 Challenge Task2, top rankers trained classifiers to predict a machine ID of the data [8, 15]. This approach assumes that the classifier can output a false machine ID if the data is anomalous. Giri et al. [8] named this approach a self-supervised classification-based approach. However, we found that this approach can fail on some machine IDs, and the detection performance degrades significantly. To stabilize the detection performance, our proposed method incorporates the unsupervised approach by using the NLL to distinguish the target data from the outlier data.
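A deliberately simplified reading of the classification-based decision is sketched below: a clip is treated as anomalous when the classifier predicts a machine ID other than the known one. The function name and the logits are hypothetical, and how [8, 15] turn classifier outputs into a continuous anomaly score is not specified here, so this binary rule is an assumption made for illustration.

```python
import numpy as np

def classification_anomaly(logits, true_machine_id):
    """Flag a clip as anomalous when the predicted machine ID differs from its known ID."""
    predicted_id = int(np.argmax(logits))
    return predicted_id != true_machine_id

# Hypothetical classifier outputs over four machine IDs.
normal_logits = np.array([0.1, 2.3, 0.4, -1.0])    # argmax is ID 1
anomalous_logits = np.array([1.9, 0.2, 0.3, 0.0])  # argmax is ID 0
print(classification_anomaly(normal_logits, true_machine_id=1))     # False
print(classification_anomaly(anomalous_logits, true_machine_id=1))  # True
```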
4 Conventional Approach
Normalizing Flows (NF) is a series of invertible transformations between an input data distribution and a known distribution. The anomaly score can be calculated as the negative log-likelihood (NLL) of the input data [16, 17, 18, 19]. However, this score depends on the smoothness of the input data and can fail at out-of-distribution detection.
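To make the NLL-based anomaly score concrete, the sketch below evaluates the exact NLL of a toy flow through the change-of-variables formula. The elementwise affine map, the standard normal base distribution, and all names and values are assumptions chosen for illustration; they stand in for the much deeper Glow and MAF models used in the experiments.

```python
import numpy as np

def affine_flow_nll(x, shift, log_scale):
    """NLL of x under the flow z = (x - shift) * exp(-log_scale) with a standard normal base."""
    x = np.atleast_2d(np.asarray(x, dtype=float))
    z = (x - shift) * np.exp(-log_scale)
    # log N(z; 0, I), summed over the feature dimensions
    log_base = -0.5 * np.sum(z ** 2 + np.log(2.0 * np.pi), axis=1)
    # log |det dz/dx| equals -sum(log_scale) for this elementwise map
    log_det = -np.sum(log_scale) * np.ones(x.shape[0])
    return -(log_base + log_det)

shift = np.array([1.0, -2.0])
log_scale = np.log(np.array([0.5, 2.0]))
typical = affine_flow_nll([[1.0, -2.0]], shift, log_scale)   # at the mode of the density
atypical = affine_flow_nll([[4.0, 6.0]], shift, log_scale)   # far from the mode
print(typical[0] < atypical[0])  # True
```

The point far from the mode receives the larger NLL, which is the behavior an NLL-based anomaly score relies on.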
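Kirichenko et al. [14] proposed a loss function that distinguishes in-distribution data from out-of-distribution data by using a supervised approach:
$$\mathcal{L} = \frac{1}{N_{D}} \sum_{x \in D} \mathrm{NLL}(x) - \frac{1}{N_{\mathrm{OOD}}} \sum_{x \in \mathrm{OOD}} \mathrm{NLL}(x) \cdot I\left[\mathrm{NLL}(x) < c\right] \tag{1}$$
where $D$ and $\mathrm{OOD}$ denote the in-distribution and out-of-distribution data in each batch, $N_{D}$ is the number of in-distribution data in each batch, $N_{\mathrm{OOD}}$ is the number of out-of-distribution data in each batch that satisfy the condition in the indicator function $I[\cdot]$, and $c$ is a threshold value.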
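The loss described above can be sketched on per-sample NLL values as follows: the mean NLL of the in-distribution samples, minus the mean NLL of those out-of-distribution samples whose NLL is still below the threshold. The function name, the handling of an empty penalty set, and the example values are assumptions; a training implementation would compute the same quantity on differentiable tensors.

```python
import numpy as np

def supervised_ood_loss(nll_in, nll_ood, threshold):
    """Mean in-distribution NLL minus the mean NLL of OOD samples below `threshold`."""
    nll_in = np.asarray(nll_in, dtype=float)
    nll_ood = np.asarray(nll_ood, dtype=float)
    active = nll_ood[nll_ood < threshold]  # samples kept by the indicator function
    # Assumption: with no active samples the penalty term is simply dropped.
    penalty = active.mean() if active.size > 0 else 0.0
    return nll_in.mean() - penalty

# Hypothetical per-sample NLLs for one batch.
loss = supervised_ood_loss(nll_in=[1.0, 3.0], nll_ood=[2.0, 4.0, 10.0], threshold=5.0)
print(loss)  # -1.0 (mean 2.0 minus mean of [2.0, 4.0])
```

The sample with NLL 10.0 is already above the threshold, so it no longer contributes to the loss.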
5 Proposed Approach
To overcome the problems of NF models and the self-supervised classification-based approach, our method attempts to assign higher likelihood to the target data and lower likelihood to the outlier data using NF.
We train a model for each machine ID, where the data with that ID are the target data $x_t$ and the sounds of other machines of the same machine type are the outlier data $x_o$. Assume that $x_t$ and $x_o$ consist of components specific to their machine IDs ($x_t^{\mathrm{ID}}$, $x_o^{\mathrm{ID}}$) and components shared across the same machine type ($x^{\mathrm{type}}$). The likelihood of $x_t$ and $x_o$ can be written as
$$p(x_t) = p(x_t^{\mathrm{ID}}) \, p(x^{\mathrm{type}}) \tag{2}$$
$$p(x_o) = p(x_o^{\mathrm{ID}}) \, p(x^{\mathrm{type}}) \tag{3}$$
If the model is trained to assign higher likelihood to $x_t$ and lower likelihood to $x_o$, only the components specific to the machine ID ($x_t^{\mathrm{ID}}$, $x_o^{\mathrm{ID}}$) will affect the likelihood. Therefore, we can improve the detection performance of an NF model by introducing an auxiliary task in which we train the NF model to discriminate the target data from the outlier data by the likelihood. When testing the model, the NLL is used as the anomaly score. The anomaly score will be higher when the data structure is different from the target-specific component $x_t^{\mathrm{ID}}$ or close to the outlier-specific component $x_o^{\mathrm{ID}}$. The former case corresponds to anomaly detection by the unsupervised approach and the latter case to the self-supervised approach. Therefore, this idea can benefit from both the unsupervised and the self-supervised classification-based approaches. As illustrated in Fig. 1, the model is trained to decrease the NLL until the NLL of the outlier data reaches a threshold $k$. Then, the model is penalized so that the NLL of the outlier data becomes higher than that of the target data.
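The per-machine-ID data assignment described above can be sketched as follows: for a chosen machine ID, clips with that ID become the target data and clips from the other IDs of the same machine type become the outlier data. The record layout, field names, and file names are hypothetical placeholders.

```python
def split_target_and_outlier(clips, machine_type, target_id):
    """Split clips of one machine type into target data (the chosen machine ID)
    and outlier data (all other machine IDs of the same type)."""
    same_type = [c for c in clips if c["machine_type"] == machine_type]
    target = [c for c in same_type if c["machine_id"] == target_id]
    outlier = [c for c in same_type if c["machine_id"] != target_id]
    return target, outlier

# Hypothetical metadata records; the file names are placeholders.
clips = [
    {"machine_type": "pump", "machine_id": 0, "file": "pump_id00_a.wav"},
    {"machine_type": "pump", "machine_id": 2, "file": "pump_id02_a.wav"},
    {"machine_type": "pump", "machine_id": 4, "file": "pump_id04_a.wav"},
    {"machine_type": "fan", "machine_id": 0, "file": "fan_id00_a.wav"},
]
target, outlier = split_target_and_outlier(clips, "pump", target_id=0)
print(len(target), len(outlier))  # 1 2
```

Clips from other machine types (the fan clip here) are excluded from both sets, matching the restriction of the outlier data to the same machine type.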

This idea can be realized by using the loss function in (1), as
$$\mathcal{L} = \frac{1}{N_{t}} \sum_{x \in D_{t}} \mathrm{NLL}(x) - \frac{1}{N_{o}} \sum_{x \in D_{o}} \mathrm{NLL}(x) \cdot I\left[\mathrm{NLL}(x) < k\right] \tag{4}$$
where $D_t$ and $D_o$ denote the target data and the outlier data in each batch, $N_t$ is the number of target data in each batch, and $N_o$ is the number of outlier data in each batch that satisfy the condition in the indicator function $I[\cdot]$. The threshold $k$ is decided so that the NLL of the outlier data does not diverge to infinity. The difference between (1) and (4) is that the loss function in (1) uses out-of-distribution data that have completely different data structures from the in-distribution data, while the loss function in (4) uses outlier data of the same machine type as the target data. In (4), the NLL of the target data and the NLL of the outlier data that satisfy the condition in $I[\cdot]$ have the same impact on the loss. In unsupervised anomaly detection tasks, a lower NLL of the target data can lead to better detection performance. Therefore, we modified (4) so that the model can prioritize decreasing the NLL of the target data, as
(5) |
where a hyperparameter enhances the decrease of the NLL of the target data. The second condition in the indicator function removes the penalty when the NLL of an outlier sample is larger than the maximum NLL of the target data in the batch. In that case, the target data and the outlier data are completely separated by the NLL, and the model should focus only on decreasing the NLL of the target data.
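The two losses can be sketched on per-sample NLL values as below. The first function follows the description of (4): mean target NLL minus the mean NLL of outliers below the threshold. The second is only one possible reading of (5): it adds the second indicator condition described above and, as an explicit assumption, applies the hyperparameter as a weight on the outlier term. The names `loss_eq4`, `loss_eq5_sketch`, `outlier_weight`, and all numeric values are hypothetical.

```python
import numpy as np

def loss_eq4(nll_target, nll_outlier, k):
    """Mean target NLL minus the mean NLL of outliers whose NLL is below threshold k."""
    nll_target = np.asarray(nll_target, dtype=float)
    nll_outlier = np.asarray(nll_outlier, dtype=float)
    active = nll_outlier[nll_outlier < k]
    penalty = active.mean() if active.size > 0 else 0.0
    return nll_target.mean() - penalty

def loss_eq5_sketch(nll_target, nll_outlier, k, outlier_weight):
    """One possible reading of the modified loss.
    ASSUMPTION: the hyperparameter scales the outlier term; the exact form may differ."""
    nll_target = np.asarray(nll_target, dtype=float)
    nll_outlier = np.asarray(nll_outlier, dtype=float)
    # Second condition: drop outliers already above the largest target NLL in the batch.
    mask = (nll_outlier < k) & (nll_outlier < nll_target.max())
    active = nll_outlier[mask]
    penalty = active.mean() if active.size > 0 else 0.0
    return nll_target.mean() - outlier_weight * penalty

nll_t = [1.0, 3.0]
nll_o = [2.0, 4.0, 10.0]
print(loss_eq4(nll_t, nll_o, k=5.0))                             # -1.0
print(loss_eq5_sketch(nll_t, nll_o, k=5.0, outlier_weight=0.5))  # 1.0
```

In the second call, the outlier with NLL 4.0 is already above the largest target NLL (3.0), so it is no longer penalized and the loss concentrates on lowering the target NLL.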
6 Experiments
6.1 Experimental conditions
We conducted experiments using the DCASE 2020 Challenge Task2 development dataset. Each recording is a 10-second single-channel 16-kHz clip from one of six machine types (ToyCar, ToyConveyor, fan, pump, slider, valve). Each recording has a machine ID that distinguishes the three to four machines in each machine type.
To evaluate the anomaly detection performance of our proposed method, we used Glow and MAF as NF models. Glow and MAF are often chosen as NF models in out-of-distribution detection and anomaly detection tasks [20]. VAE and VIDNN [1] were used as conventional unsupervised approaches. MobileNetV2 [21] was used as the classifier in the self-supervised classification-based approach.
For the input, log-Mel spectrograms were computed with a frame length of 1024, a hop size of 512, and 128 Mel bins. At least 313 frames were generated for each recording, and several successive frames were concatenated to form each input. The input details and the model architectures are as follows.
VAE and VIDNN. Five frames were used for each input, with four overlapping frames. We trained a model for each machine ID. The model had ten linear layers with 128 dimensions, except for the fifth layer, which had eight dimensions. The model was trained for 100 epochs using the Adam optimizer [22] with a learning rate of .
MobileNetV2. 64 frames were used for each input, with 48 overlapping frames. We trained a model to classify machine IDs for each machine type. The width multiplier parameter was set to 0.5. The model was trained for 20 epochs with the Adam optimizer and a learning rate of .
Glow. 32 frames were used for each input, with 28 overlapping frames. We trained a model for each machine ID. The model had three blocks with 12 flow steps in each block, and each flow step had three CNN layers with 128 hidden layers. The model was trained for 100 epochs using the AdaMax optimizer [22] with a learning rate of and a batch size of 64. The hyperparameter in (5) was set to the same value for all machines so that the first term in (5) could be lower while the second term was not ignored. For the threshold in (5), we first trained the model on all data from a machine type for several epochs with the NLL as the loss, and then decided the threshold from the NLL in the last epoch. The detection performance was not severely affected by the number of epochs in this first step. This operation was performed for each machine type, and different values were set as shown in Table 1.
MAF. Four frames were used for each input, with no overlap. We trained a model for each machine ID. The model had four MADE blocks with 512 units in each block. The model was trained for 100 epochs using the Adam optimizer with a learning rate of and a batch size of 64. We used the same value of the hyperparameter in (5) for all machines and set the threshold to the values in Table 1.
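The input settings above can be sketched as follows. The frame count assumes a centered (padded) short-time transform, which gives 313 frames for a 10-second, 16-kHz clip with a hop size of 512; that framing convention, the helper names, and the zero-filled placeholder spectrogram are assumptions. Each model's inputs are then cut as windows of successive frames with the stated overlap.

```python
import numpy as np

def num_frames(n_samples, hop):
    """Frame count of a centered (padded) short-time transform. ASSUMPTION: centered framing."""
    return 1 + n_samples // hop

def make_inputs(spec, width, overlap):
    """Group `width` successive frames into one input, sliding by (width - overlap) frames."""
    stride = width - overlap
    starts = range(0, spec.shape[0] - width + 1, stride)
    return np.stack([spec[s:s + width] for s in starts])

n = num_frames(n_samples=10 * 16000, hop=512)
print(n)  # 313
spec = np.zeros((n, 128))  # placeholder for a (frames, Mel bins) log-Mel spectrogram
settings = [("VAE/VIDNN", 5, 4), ("MobileNetV2", 64, 48), ("Glow", 32, 28), ("MAF", 4, 0)]
shapes = {name: make_inputs(spec, width, overlap).shape for name, width, overlap in settings}
for name, shape in shapes.items():
    print(name, shape)
```

Under these assumptions a single clip yields 309 inputs for VAE/VIDNN, 16 for MobileNetV2, 71 for Glow, and 78 for MAF; whether each window is flattened or kept two-dimensional depends on the model and is left open here.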
6.2 Results and Discussions
We evaluated the detection performance using the area under the receiver operating characteristic curve (AUC) and the partial AUC (pAUC), which restricts the AUC to a low false-positive-rate range. For NF models, we first used the conventional approach, where the NLL was the loss, with additive and affine coupling layers and transformations (Glow add., Glow aff., MAF add., MAF aff.). We then used our proposed methods in (4) and (5). We trained a model for each machine ID, where the data from that machine ID are the target data and the data from other machine IDs of the same machine type are the outlier data.
Table 2 lists the average AUC and pAUC of each machine type for each model. Our proposed methods outperformed all the conventional methods except for ToyConveyor. This result indicates that our method improves the detection performance by using outlier data that can have similar data structures to the target data. For ToyConveyor, we found that the NLL clearly distinguished the different machine IDs. For example, in ToyConveyor ID 0, the NLL using Glow with affine coupling layers and the loss in (5) converged to clearly different values for the target data and the outlier data. We consider that the different machine IDs in ToyConveyor had completely different data structures, and that this made our proposed methods ineffective for ToyConveyor. (In [11], we argued that the self-supervised classification-based approach shows low AUCs on ToyConveyor because machine IDs in ToyConveyor have very similar data structures, and the classifier cannot classify these IDs. However, the results in this paper show that the AUCs are low although these IDs can be distinguished easily. These results indicate that the cause of the low scores of the self-supervised classification-based approach is not that the machine IDs in ToyConveyor have very similar data structures.)
The loss function in (5) shows almost the same performance as that in (4). However, the NLL of the target data became lower with (5) than with (4); for example, in ToyCar machine ID 0, the NLL of the target data using MAF converged to a lower value with (5) than with (4). As described in [10], a lower NLL means the model can obtain accurate densities of the data, which can improve the detection performance of the unsupervised approach. Therefore, we can expect the loss function in (5) to show better results than (4) for other datasets.
We also found that MAF with affine transformations could output infinities with the proposed loss functions in (4) and (5). As previously pointed out [23], affine transformations have a numerical stability problem. The affine transformations slightly improved the detection performance (Table 2), while they simultaneously destabilized the model and required a higher computational cost.
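As a sketch of the AUC metric used above, the function below computes the AUC as the fraction of (anomalous, normal) pairs in which the anomalous clip receives the higher anomaly score, counting ties as one half. The function name and the example scores are hypothetical; the pAUC, which restricts the evaluation to a low false-positive-rate range, is not covered by this sketch.

```python
import numpy as np

def auc(scores_normal, scores_anomalous):
    """AUC as the probability that an anomalous clip scores higher than a normal one
    (ties count as one half)."""
    normal = np.asarray(scores_normal, dtype=float)[None, :]
    anomalous = np.asarray(scores_anomalous, dtype=float)[:, None]
    wins = (anomalous > normal).mean()
    ties = (anomalous == normal).mean()
    return wins + 0.5 * ties

# Hypothetical NLL-based anomaly scores for three normal and two anomalous clips.
value = auc(scores_normal=[1.0, 2.0, 3.0], scores_anomalous=[2.5, 4.0])
print(value)
```

Here five of the six pairs are ordered correctly, so the AUC is 5/6.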
Table 1: Threshold values set for each machine type.
Machine type | Glow | MAF |
---|---|---|
ToyCar | 5.75 | -750 |
ToyConveyor | 5.70 | -750 |
fan | 5.69 | -750 |
pump | 5.70 | -750 |
slider | 5.63 | -750 |
valve | 5.53 | -800 |
Table 2: Average AUC (%) and pAUC (%) of each machine type for each model. The first part lists the conventional approaches and the second part the proposed losses in (4) and (5).
Machine type | VAE AUC | VAE pAUC | VIDNN AUC | VIDNN pAUC | MAF add. AUC | MAF add. pAUC | MAF aff. AUC | MAF aff. pAUC | Glow add. AUC | Glow add. pAUC | Glow aff. AUC | Glow aff. pAUC |
---|---|---|---|---|---|---|---|---|---|---|---|---|
ToyCar | 83.3 | 71.3 | 81.2 | 73.1 | 79.0 | 63.2 | 79.5 | 63.9 | 75.0 | 63.6 | 75.9 | 65.1 |
ToyConveyor | 74.1 | 60.8 | 74.7 | 58.9 | 72.3 | 58.6 | 73.3 | 60.2 | 71.6 | 59.0 | 70.7 | 58.7 |
fan | 72.1 | 59.0 | 72.8 | 58.9 | 68.9 | 58.5 | 69.5 | 57.9 | 70.8 | 59.7 | 71.5 | 59.5 |
pump | 76.2 | 65.6 | 76.6 | 66.9 | 76.9 | 66.2 | 78.3 | 68.1 | 75.3 | 65.8 | 76.3 | 65.6 |
slider | 84.2 | 63.2 | 84.7 | 61.7 | 90.1 | 71.3 | 90.3 | 72.4 | 94.2 | 83.0 | 93.9 | 81.5 |
valve | 70.2 | 51.4 | 83.0 | 60.5 | 74.6 | 54.0 | 77.4 | 55.3 | 84.9 | 63.8 | 86.1 | 67.8 |
total | 76.8 | 61.9 | 79.0 | 63.5 | 77.2 | 62.1 | 78.2 | 63.1 | 78.9 | 66.1 | 79.4 | 66.7 |

Machine type | Ours (4) MAF add. AUC | Ours (4) MAF add. pAUC | Ours (4) Glow add. AUC | Ours (4) Glow add. pAUC | Ours (4) Glow aff. AUC | Ours (4) Glow aff. pAUC | Ours (5) MAF add. AUC | Ours (5) MAF add. pAUC | Ours (5) Glow add. AUC | Ours (5) Glow add. pAUC | Ours (5) Glow aff. AUC | Ours (5) Glow aff. pAUC |
---|---|---|---|---|---|---|---|---|---|---|---|---|
ToyCar | 91.1 | 80.9 | 91.4 | 83.1 | 92.2 | 84.1 | 91.0 | 81.5 | 91.1 | 81.5 | 92.3 | 85.0 |
ToyConveyor | 72.3 | 59.8 | 72.3 | 59.4 | 71.5 | 59.0 | 72.5 | 59.1 | 71.7 | 59.2 | 71.2 | 58.8 |
fan | 74.2 | 65.9 | 74.6 | 65.9 | 74.9 | 65.3 | 74.7 | 66.4 | 74.7 | 65.9 | 74.8 | 65.1 |
pump | 82.4 | 72.0 | 82.2 | 71.3 | 83.4 | 73.8 | 82.1 | 71.7 | 82.5 | 72.3 | 83.1 | 72.8 |
slider | 91.0 | 74.2 | 94.8 | 83.5 | 94.6 | 82.8 | 90.3 | 75.1 | 95.4 | 83.9 | 94.8 | 82.9 |
valve | 83.1 | 62.8 | 90.1 | 70.9 | 91.4 | 75.0 | 81.3 | 61.1 | 89.7 | 70.4 | 90.5 | 74.1 |
total | 82.8 | 69.7 | 84.8 | 72.9 | 85.2 | 73.9 | 82.4 | 69.6 | 84.7 | 72.7 | 85.0 | 73.7 |
Table 3: AUC (%) of the machine ID with the minimum AUC for each machine type.
Machine type | MobileNetV2 | Glow aff. | Glow aff. (5) |
---|---|---|---|
ToyCar | 55.7 | 64.2 | 80.1 |
ToyConveyor | 48.7 | 60.9 | 61.0 |
fan | 50.4 | 49.6 | 49.6 |
pump | 52.9 | 63.2 | 65.7 |
slider | 82.8 | 84.5 | 87.8 |
valve | 67.9 | 67.2 | 77.7 |
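Table 4: AUC (%) for pump machine ID 0 as the target data, with the outlier data taken from each machine type.
Method | Outlier data | AUC |
---|---|---|
Glow aff. (5) using the same machine type for the outlier data | pump (except ID 0) | 74.2 |
Glow aff. (5) using a different machine type for the outlier data | ToyCar | 70.0 |
Glow aff. (5) using a different machine type for the outlier data | ToyConveyor | 70.5 |
Glow aff. (5) using a different machine type for the outlier data | fan | 71.1 |
Glow aff. (5) using a different machine type for the outlier data | slider | 69.4 |
Glow aff. (5) using a different machine type for the outlier data | valve | 69.7 |
Minimizing NLL | - | 69.6 |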
Table 3 presents the results for the machine ID with the minimum AUC in each machine type. The minimum AUC of the self-supervised classification-based approach was significantly lower than that of the unsupervised approach for three of the six machine types, while that of our method was equal to or significantly higher than that of the unsupervised approach. These results indicate that our proposed method not only outperforms the unsupervised approach but also shows more stable detection performance than the self-supervised classification-based approach, and can therefore be suitable for practical applications.
We also show experimentally that our method does not improve the detection performance if the outlier data has different data structures from the target data. We used pump machine ID 0 as the target data and made six models in which the outlier data was from each machine type. The model was trained with the loss in (5). As shown in Table 4, the detection performance did not improve if the outlier data was from a machine type other than pump. These results indicate that our method does not improve the detection performance if the outlier data has different structures from the target data.
7 Conclusion
We proposed flow-based methods for anomalous sound detection that outperform unsupervised approaches while maintaining greater stability than the self-supervised classification-based approach. Experimental results demonstrated that our methods improve the detection performance by using the outlier data from the same machine type as the target data. Our future work will include the development of more efficient methods to decide the hyperparameters.
References
- [1] K. Suefusa, T. Nishida, H. Purohit, R. Tanabe, T. Endo, and Y. Kawaguchi, “Anomalous sound detection based on interpolation deep neural network,” in ICASSP, 2020, pp. 271–275.
- [2] Y. Koizumi, S. Saito, H. Uematsu, and N. Harada, “Optimizing acoustic feature extractor for anomalous sound detection based on Neyman-Pearson lemma,” in EUSIPCO, 2017, pp. 698–702.
- [3] E. Tabak and C. Turner, “A family of nonparametric density estimation algorithms,” Communications on Pure and Applied Mathematics, vol. 66, pp. 145–164, 2013.
- [4] L. Dinh, D. Krueger, and Y. Bengio, “NICE: Non-linear independent components estimation,” arXiv:1410.8516, 2015.
- [5] E. Nalisnick, A. Matsukawa, Y. W. Teh, D. Gorur, and B. Lakshminarayanan, “Do deep generative models know what they don’t know?,” in ICLR, 2019.
- [6] H. Choi, E. Jang, and A. A. Alemi, “WAIC, but why? Generative ensembles for robust anomaly detection,” arXiv:1810.01392, 2019.
- [7] J. Serrà, D. Álvarez, V. Gómez, O. Slizovskaia, J. F. Núñez, and J. Luque, “Input complexity and out-of-distribution detection with likelihood-based generative models,” in ICLR, 2020.
- [8] R. Giri, S. V. Tenneti, K. Helwani, F. Cheng, U. Isik, and A. Krishnaswamy, “Unsupervised anomalous sound detection using self-supervised classification and group masked autoencoder for density estimation,” Tech. Rep., DCASE2020 Challenge, 2020.
- [9] D. P. Kingma and P. Dhariwal, “Glow: Generative flow with invertible 1x1 convolutions,” in NeurIPS, 2018, pp. 10215–10224.
- [10] G. Papamakarios, T. Pavlakou, and I. Murray, “Masked autoregressive flow for density estimation,” in NeurIPS, 2017, pp. 2338–2347.
- [11] Y. Koizumi, Y. Kawaguchi, K. Imoto, T. Nakamura, Y. Nikaido, R. Tanabe, H. Purohit, K. Suefusa, T. Endo, M. Yasuda, and N. Harada, “Description and discussion on DCASE2020 Challenge task2: Unsupervised anomalous sound detection for machine condition monitoring,” in DCASE Workshop, 2020.
- [12] J. Ren, P. J. Liu, E. Fertig, J. Snoek, R. Poplin, M. Depristo, J. Dillon, and B. Lakshminarayanan, “Likelihood ratios for out-of-distribution detection,” in NeurIPS, 2019, pp. 14707–14718.
- [13] D. Hendrycks, M. Mazeika, and T. Dietterich, “Deep anomaly detection with outlier exposure,” in ICLR, 2019.
- [14] P. Kirichenko, P. Izmailov, and A. Wilson, “Why normalizing flows fail to detect out-of-distribution data,” in ICML workshop on Invertible Neural Networks and Normalizing Flows, 2020.
- [15] P. Primus, “Reframing unsupervised machine condition monitoring as a supervised classification task with outlier-exposed classifiers,” Tech. Rep., DCASE2020 Challenge, 2020.
- [16] M. Schmidt and M. Simic, “Normalizing flows for novelty detection in industrial time series data,” in ICML workshop on Invertible Neural Networks and Normalizing Flows, 2019.
- [17] M. Yamaguchi, Y. Koizumi, and N. Harada, “Adaflow: Domain-adaptive density estimator with application to anomaly detection and unpaired cross-domain translation,” in ICASSP, 2019, pp. 3647–3651.
- [18] A. Ryzhikov, M. Borisyak, A. Ustyuzhanin, and D. Derkach, “Normalizing flows for deep anomaly detection,” arXiv:1912.09323, 2019.
- [19] M. L. D. Dias, C. L. C. Mattos, T. L. C. D. Silva, J. Macêdo, and W. C. P. Silva, “Anomaly detection in trajectory data with normalizing flows,” in IJCNN, 2020.
- [20] V. Haunschmid and P. Praher, “Anomalous sound detection with masked autoregressive flows and machine type dependent postprocessing,” Tech. Rep., DCASE2020 Challenge, 2020.
- [21] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “MobileNetV2: Inverted residuals and linear bottlenecks,” in CVPR, 2018, pp. 4510–4520.
- [22] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in ICLR, 2015.
- [23] J. Behrmann, P. Vicol, K.-C. Wang, R. B. Grosse, and J.-H. Jacobsen, “Understanding and mitigating exploding inverses in invertible neural networks,” arXiv:2006.09347, 2020.