
Audio Spoofing Verification using Deep Convolutional Neural Networks by Transfer Learning

Abstract

Automatic speaker verification systems are gaining popularity, and spoofing attacks are of prime concern as they make these systems vulnerable. Some spoofing attacks, such as replay attacks, are easy to mount yet very hard to detect, creating the need for suitable countermeasures. In this paper, we propose a speech classifier based on a deep convolutional neural network to detect spoofing attacks. Our proposed methodology uses an acoustic time-frequency representation of power spectral densities on the Mel frequency scale (Mel-spectrogram), classified via deep residual learning (an adaptation of the ResNet-34 architecture). Using a single-model system, we achieved an equal error rate (EER) of 0.9056% on the development set and 5.32% on the evaluation set of the logical access scenario, and an EER of 5.87% on the development set and 5.74% on the evaluation set of the physical access scenario of ASVspoof 2019.

Index Terms: Spoofing detection, Mel-Spectrogram, Deep-Convolutional Neural Network

1 Introduction

Biometric technologies play a critical role in regulating access to information resources in today's world. Speaker verification is one reliable approach to attaining a suitably secure system.

Automatic speaker verification (ASV) has undergone rapid improvements in the recent decade but continues to show high vulnerability to different spoofing attacks. Spoofing methods are categorized as speech synthesis (SS), voice conversion (VC), impersonation and replay attacks [1]. Among these, replay attacks are arguably the simplest yet most indistinguishable ASV spoofing technique, as they do not require the attackers to have any specialized knowledge and can be mounted with relative ease using consumer devices.

In the first automatic speaker verification spoofing and countermeasures challenge (ASVspoof 2015 [2]), even though the best results showed an overall average detection EER below 1.5%, the EER for unknown attacks was five times higher than that for known attacks. In addition, while some attacks were easily and consistently detected, others provoked extremely high error rates nearing 50%.

Systems submitted to the ASVspoof 2017 challenge [3] inspected distinct front-end features and various classifiers to detect replay attacks under different conditions. Among them, the best-performing system [4] had an equal error rate (EER) of 6.73% and used a light convolutional neural network (LCNN) to extract high-level features from the log power spectrum, together with a Gaussian mixture model (GMM) as classifier. Variable-length Teager energy operator-energy separation algorithm-instantaneous frequency cosine coefficients (VESA-IFCCs) [5] were proposed as a single system with the aim of capturing the spectral changes due to the transmission and channel characteristics of replay devices. A single frequency filtering feature with high spectro-temporal resolution was proposed [6] to capture the channel information embedded in the low signal-to-noise-ratio region.

The ASVspoof 2019 challenge [7] was organized to emphasize the development of reliable countermeasures that can efficiently segregate bona fide and spoofed speech. The initiative specifically aims to encourage the design of generalized countermeasures, i.e. countermeasures that perform well against spoofing attacks of an unpredictable nature. The data used for ASVspoof 2015 [2] contained spoofed speech samples generated using the text-to-speech (TTS) and voice conversion (VC) systems of the time. The ASVspoof 2017 challenge [3] emphasized the design of countermeasures aimed at detecting replay spoofing attacks, which can be implemented easily by anyone using conventional devices. The ASVspoof 2019 edition [7] was the first to focus on countermeasures for all three major spoofing attack types, namely those originating from TTS, VC and replay attacks, and focuses on developing next-generation countermeasures for the automatic detection of spoofed or fake audio. The challenge comprises two separate sub-challenges, logical and physical access control, and provides a common database of the most advanced spoofing attacks to date. The aim was to study the limits of spoofing countermeasures with respect to automatic speaker verification and spoofed audio detection.

In this paper, we propose the use of Mel-spectrograms [8], obtained from the audio frames as a time-frequency representation of power spectral density, for the training of deep convolutional neural networks for audio spoofing attack detection. Mel-spectrograms provide a time-frequency representation with sufficient spectral resolution, which is required for robust replay attack detection. Our framework is based on adapting the ResNet-34 architecture [9].

2 Methodology

Our proposed framework (Figure 1) employs transfer learning [10] of a pre-trained convolutional neural network (CNN) for fast adaptation to the Mel-spectrograms extracted from the speech inputs. Transfer learning is a machine learning technique in which knowledge gained from solving one problem is reused to solve a different but related problem. The extracted spectrograms are fed into the deep convolutional neural network, which classifies the speech signals as bona fide or spoofed. In the following subsections, we describe the three major components of our system: transfer learning [10], Mel-spectrograms [8] and ResNet [9], followed by a functional overview of the proposed framework.

2.1 Transfer Learning

Transfer learning is a machine learning technique that solves a problem based on knowledge gained from solving another, related problem. In this approach, the features learned while training a base neural network are reused or adjusted in a target neural network during its training. Transfer learning is usually realized through pre-trained models, i.e. models trained on a large benchmark dataset to solve a similar problem. Because the computational cost of training such models is high, it is common practice to reuse models from the published literature (e.g. VGG, Inception, MobileNet, ResNet). Pre-trained models used in transfer learning are usually large convolutional neural networks (CNNs); high performance and ease of training are two of the main factors that make CNNs popular for such applications.
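As an illustrative sketch only (the exact training setup is described in Section 3.3), the adaptation step of transfer learning can be expressed with PyTorch and torchvision, assumed here for illustration, by loading a pre-trained ResNet-34 and replacing its final classification layer with a two-class (bona fide vs. spoofed) head:

# Hedged sketch: adapt an ImageNet pre-trained ResNet-34 to a two-class problem.
import torch.nn as nn
from torchvision import models

model = models.resnet34(pretrained=True)      # reuse weights learned on ImageNet

# Optionally freeze the pre-trained layers so only the new head is trained at first.
for param in model.parameters():
    param.requires_grad = False

# Replace the 1000-class ImageNet head with a 2-class head (bona fide vs. spoofed).
model.fc = nn.Linear(model.fc.in_features, 2)

The frozen layers can later be unfrozen and fine-tuned at a lower learning rate once the new head has converged.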

Figure 1: Overview of the proposed framework for speech spoofing detection.

2.2 Mel-spectrograms

Figure 2: Mel-spectrogram of an audio.

A Mel-spectrogram is an acoustic time-frequency representation of a sound, i.e. the power spectral density $P(f, t)$. It is sampled into a number of points at equal intervals of time $t_i$ and frequencies $f_j$ (on a Mel frequency scale).

The Mel-frequency scale is defined as:

$Mel = 2595\,\log_{10}\!\left(1 + \dfrac{Hertz}{700}\right)$ (1)

And its inverse is:

$Hertz = 700\left(10^{\frac{Mel}{2595}} - 1\right)$ (2)
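The two conversions in Eq. (1) and Eq. (2) translate directly into code; the short Python sketch below (function names are ours, for illustration only) implements the forward and inverse mappings:

# Direct transcription of Eq. (1) and Eq. (2): Hertz <-> Mel conversion.
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

print(hz_to_mel(1000.0))               # roughly 1000 mel
print(mel_to_hz(hz_to_mel(1000.0)))    # round trip recovers 1000.0 Hz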

The STFT (short-time Fourier transform) and the Mel-spectrogram have been the most popular input representations for audio classification [8]. Mel-spectrograms provide an efficient and perceptually relevant representation compared to the STFT [11] and have been shown to perform well in various tasks [12]-[16]. However, the STFT is closer to the original signal, and a neural network may be able to learn from it a representation that is more optimal for the given task than the Mel-spectrogram; this, however, requires a large amount of training data, and, as reported in [17], Mel-spectrograms outperformed the STFT on a smaller dataset.

The Mel-spectrogram transforms the raw input sequence into a 2-D feature map in which one dimension represents time, the other represents frequency, and the values represent amplitude. Figure 2 shows the Mel-spectrogram of an audio signal. A 2-D Mel-spectrogram model improves on a 1-D raw-waveform model, and the average of the two outperforms each individual model significantly [18]. The Mel-spectrogram is selected because it is perceptually relevant and computationally efficient: it produces a Mel-scaled frequency representation, which is a closer approximation of human auditory perception [11], and typically involves compressing the frequency axis of the short-time Fourier transform representation. Finally, the magnitude of the Mel-spectrogram is mapped to the decibel scale.

2.3 Deep Residual Net (ResNet)

The deeper a neural network, the more difficult it is to train, even though depth is a determining factor of network performance. Very deep networks are widely used in computer vision, but training them is hampered by the notorious vanishing-gradient problem: when the network is too deep, the gradients from the loss function shrink towards zero after repeated applications of the chain rule, so the weights are never updated and no learning takes place. This was one of the drawbacks of VGG nets [19], which start losing generalization capability beyond a certain depth, and it lies behind the counter-intuitive experimental finding that simply adding more layers leads to higher training error, even though, in theory, a deeper network has greater modeling capacity and should produce no higher training error. To address this problem, residual neural networks (ResNets) [9] were introduced. ResNet provides a framework that eases the training of networks substantially deeper than those used previously: gradients can propagate through many layers via shortcut (identity) connections, which greatly improves training efficiency [9] and allows deeper networks to be trained, usually resulting in better-performing models.

Deep residual learning [9] enables the training of CNNs with substantially deeper architectures. It introduces skip connections that allow gradients to flow across a large number of layers, thus relieving the problem of vanishing gradients in deep CNNs. These skip connections enable the outputs to learn a residual mapping. The residual block forms the basic building block of a ResNet (Figure 3). If the desired mapping to be learned is $\mathcal{H}(x)$, the stacked residual layers learn the residual mapping $\mathcal{F}(x) = \mathcal{H}(x) - x$, so that the original mapping becomes $\mathcal{F}(x) + x$. ResNet uses the rectified linear unit (ReLU) activation function.

Figure 3: Residual learning: a building block [9].
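For illustration, a minimal PyTorch sketch of the building block in Figure 3 is given below; it is a generic residual block with an identity shortcut, not the exact ResNet-34 implementation used in our experiments:

# Minimal residual block: the stacked layers learn F(x) and the block outputs F(x) + x.
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)   # identity skip connection: F(x) + x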

3 Experiments

3.1 Dataset

Table 1: Number of non-overlapping target speakers and number of utterances in the training and development sets of the ASVspoof 2019 database [20].

                  Speakers        Logical Access          Physical Access
Subset          Male  Female   Bona fide    Spoofed    Bona fide    Spoofed
Training          8     12       2,580      22,800       5,400      48,600
Development       8     12       2,548      22,296       5,400      24,300

3.1.1 Data Conditions: Logical Access

Logical access includes spoofing attacks generated with text-to-speech (TTS) and voice conversion (VC) techniques. Genuine speech is collected from speakers with no significant channel or background noise effects. Spoofed speech is generated from the genuine data using a number of possible spoofing algorithms. The full dataset is split into three subsets: training, development and evaluation. The number of speakers in the former two subsets is given in Table 1; there is no overlap across the three subsets with respect to the target speakers used in the voice conversion or TTS techniques. The voice conversion systems are based on (1) neural-network-based and (2) transfer-function-based methods. The speech synthesis systems are based on (1) waveform concatenation, (2) neural-network-based parametric speech synthesis using source-filter vocoders and (3) neural-network-based parametric speech synthesis using WaveNet.

3.1.2 Data Conditions: Physical Access

Spoofing attacks performed at the sensor level are considered the physical access scenario. Both bona fide and spoofed signals propagate through a physical space prior to acquisition. Spoofing attacks in this scenario are therefore referred to as replay attacks, whereby a recording of a bona fide access attempt is first captured, presumably surreptitiously, before being replayed to the ASV microphone.

Table 1 shows the number of speakers and the number of bona fide and replay spoofing access attempts (utterances) in the training and development sets. The bona fide and replay spoofing access attempts in both partitions were generated with respect to the same set of randomly selected acoustic and replay configurations.

In a similar fashion to the logical access scenario, the evaluation set is disjoint in terms of speakers. Evaluation data are generated in the same manner as the training and development data, albeit with different, randomly selected acoustic and replay configurations. Specifically, the room sizes, levels of reverberation, speaker-to-ASV microphone distances, attacker-to-speaker recording distances, and loudspeaker qualities, while drawn from the same categories, are different. Full details of the dataset and the performance measures of ASVspoof 2019 are presented in the evaluation plan [20].

3.2 Performance measures

Improvements in countermeasure (CM) technology do not necessarily imply an improved complete system involving the co-operation of the CM and the ASV. The main focus is therefore on the assessment of tandem systems, in which the CM acts as a 'mediator' that determines whether a given speech input originates from a bona fide (genuine) user before passing it on to the main biometric verifier (the ASV system). The primary performance metric in this scenario is the recently proposed tandem detection cost function (t-DCF) [21]; the equal error rate (EER) of each submitted countermeasure is used as a secondary metric [20].

3.2.1 Tandem-Detection Cost Function (t-DCF)

The t-DCF [21] is rooted in statistical detection theory and involves detailed specification of the envisioned application. A key feature of the t-DCF is that it assesses the tandem system while keeping the two subsystems (CM and ASV) isolated from each other, i.e. they can be developed independently of each other.

The t-DCF metric has 6 parameters: (i) false alarm and miss costs for both systems, and (ii) prior probabilities of target and spoof trials (with an implied third, non-target prior).

$t\text{-}DCF(s) = C_{1}\,P_{miss}^{cm}(s) + C_{2}\,P_{fa}^{cm}(s)$ (3)

where $P_{miss}^{cm}(s)$ and $P_{fa}^{cm}(s)$ are, respectively, the miss rate and the false alarm rate of the CM system at threshold $s$. The constants $C_{1}$ and $C_{2}$ are dictated by the t-DCF costs, the priors, and the ASV system detection errors.
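A simplified, illustrative sketch of how Eq. (3) can be evaluated over a range of thresholds is given below; it assumes the constants C1 and C2 have already been computed from the costs, priors and ASV error rates as specified in [21] (the organizers additionally report a normalized, minimum value of this cost):

# Hedged sketch of Eq. (3): t-DCF of the CM over a sweep of thresholds s,
# assuming C1 and C2 are given and higher scores mean "more bona fide".
import numpy as np

def t_dcf_curve(bona_scores, spoof_scores, C1, C2):
    thresholds = np.sort(np.concatenate([bona_scores, spoof_scores]))
    p_miss = np.array([(bona_scores < s).mean() for s in thresholds])    # bona fide rejected
    p_fa = np.array([(spoof_scores >= s).mean() for s in thresholds])    # spoof accepted
    return thresholds, C1 * p_miss + C2 * p_fa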

3.2.2 Equal Error Rate (EER)

In score-based spoofing detection tasks, the EER is used as a secondary metric to analyze the performance of different countermeasure methods. Let FRR($\theta$) and FAR($\theta$) denote the false rejection rate and false acceptance rate at threshold $\theta$.

$FAR(\theta) = \dfrac{\#\{\text{spoof trials with score} > \theta\}}{\#\{\text{spoof trials}\}}$ (4)
$FRR(\theta) = \dfrac{\#\{\text{human trials with score} < \theta\}}{\#\{\text{human trials}\}}$ (5)

FAR($\theta$) and FRR($\theta$) are monotonically decreasing and increasing functions of $\theta$, respectively. The EER corresponds to the threshold $\theta_{EER}$ at which the two detection error rates are equal.
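The EER can therefore be estimated from the countermeasure scores by sweeping the threshold and locating the point where the two error rates cross; a small illustrative sketch (assuming NumPy arrays of scores, with higher scores meaning "more human") is:

# Sketch of Eqs. (4)-(5): sweep thresholds and return the rate at the FAR/FRR crossing.
import numpy as np

def compute_eer(human_scores, spoof_scores):
    thresholds = np.sort(np.concatenate([human_scores, spoof_scores]))
    far = np.array([(spoof_scores > t).mean() for t in thresholds])   # spoof accepted
    frr = np.array([(human_scores < t).mean() for t in thresholds])   # human rejected
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0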

3.3 Model Setup

Our model was trained only on the training set and validated on the development set of the ASVspoof 2019 dataset. Training was carried out in Google Colab with the granted GPU (Tesla K80), and audio preprocessing was performed on the Google Cloud Platform. The CNN was trained using Python's fastai library, a high-level library built on PyTorch that supports vision, text, tabular and collaborative filtering models and enables easy prototyping. Mel-spectrograms were extracted from the audio files using an FFT window of size 2048 with 512 samples between successive frames, and were resized to 224x224 before being fed into the CNN. ResNet-34 training ran on Google Colab with the Adam optimizer, a learning rate of the order of 0.000001, 8 epochs and a batch size of 64.
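As a hedged sketch of this setup (the paper's exact extraction code is not reproduced here; librosa is assumed for illustration, and the fastai calls follow a v1-style API), the feature extraction and training steps could look as follows:

# Mel-spectrogram extraction with a 2048-point FFT window and a 512-sample hop,
# mapped to the decibel scale, as described above.
import numpy as np
import librosa

def mel_spectrogram_db(wav_path):
    y, sr = librosa.load(wav_path, sr=None)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048, hop_length=512)
    return librosa.power_to_db(mel, ref=np.max)

# The 2-D arrays are saved as images, resized to 224x224 and used to fine-tune a
# pre-trained ResNet-34 through fastai (illustrative, v1-style API assumed):
# data  = ImageDataBunch.from_folder('melspecs/', size=224, bs=64)
# learn = cnn_learner(data, models.resnet34, metrics=accuracy)
# learn.fit_one_cycle(8, max_lr=1e-6)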

4 Results and Discussion

Table 2: Experimental results obtained by our system on ASVspoof 2019.

                          EER (%)                    t-DCF
Dataset            Development  Evaluation    Development  Evaluation
Logical Access        0.9056       5.32            -          0.1514
Physical Access       5.87         5.74            -          0.1351

Table 3: Results of the baseline systems in ASVspoof 2019.

                           Logical Access        Physical Access
Baseline Implementation   EER (%)   t-DCF        EER (%)   t-DCF
CQCC - GMM                 9.57     0.2366        11.04    0.2454
LFCC - GMM                 8.09     0.2116        13.54    0.3017

Table 2 shows the results of our model in the ASVspoof 2019 challenge based on the two performance metrics (t-DCF and EER) on the development and evaluation datasets. Our single-model system outperforms the baseline models provided by the ASVspoof 2019 organizers, given in Table 3. Compared to other teams, however, the result that we obtained in the testing phase was lower. We investigated the reason behind this and found that the spoofed data in the test dataset are generated with diverse, unseen spoofing algorithms, although these are variations of the spoofing algorithms used to generate the development dataset.

Among the systems submitted to the ASVspoof 2019 challenge, the best-performing system under the logical access scenario had a min t-DCF of 0.2093 and an equal error rate (EER) of 11.40% when evaluated as a single system. Compared with this, our single system obtained better results on both performance metrics. Furthermore, when analyzing the results and the spoofing attack algorithms provided by the ASVspoof 2019 committee, our model performed well in detecting spoofing attacks based on voice conversion by waveform filtering (A17). In both the logical and physical access scenarios, we are among the top three teams when the primary and single systems are treated as the same. We used the same methodology for both the logical and physical access scenarios, rather than designing a different countermeasure for each. The t-DCF and EER showed correlations of 0.99686 and 0.96886 in the physical and logical access scenarios, respectively.

5 Conclusion

In this paper, we explored the applicability of transfer learning to the problem of spoofing detection. Our first finding is that, instead of feeding raw audio directly to the model, using the Mel-spectrogram is an effective approach to audio classification: it transforms the raw input sequence into a 2-D feature map in which one dimension represents time, the other represents frequency, and the values represent amplitude. A second finding is that residual neural networks help us build deeper models that outperform conventional deep neural networks.

Finally, we demonstrated that good system performance can be obtained even with a single model; by using different features and models, system performance can be improved further. In future work, we would like to explore the effectiveness of generative adversarial networks (GANs) for the spoofing detection task. Denoising methods such as adaptive filtering and autoencoders could also be investigated.

6 Acknowledgements

The authors would like to thank ASVspoof 2019 organizers for providing the dataset and detailed analysis of our system. We also acknowledge the technical support from Sachin Dev S, working at Tricodia Inc.

References

  • [1] Z. Wu, N. Evans, T. Kinnunen, J. Yamagishi, F. Alegre, and H. Li, ”Spoofing and countermeasures for speaker verification: A survey,” in speech communication, 66, 2015, pp. 130–153.
  • [2] Z. Wu, T. Kinnunen, N. Evans, J. Yamagishi, C. Hanilçi, M. Sahidullah, and A. Sizov, ”ASVspoof 2015: the first automatic speaker verification spoofing and countermeasures challenge,” in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
  • [3] T. Kinnunen, M. Sahidullah, H. Delgado, M. Todisco, N. Evans, J. Yamagishi, and K.A. Lee, ”The ASVspoof 2017 challenge: Assessing the limits of replay spoofing attack detection,” 2017.
  • [4] G. Lavrentyeva, S. Novoselov, E. Malykh, A. Kozlov, O. Kudashev, and V. Shchemelinin, ”Audio Replay Attack Detection with Deep Learning Frameworks,” in Interspeech, August 2017, pp. 82–86.
  • [5] H.A. Patil, H.A. Kamble, T.B. Patel, and M.H. Soni, ”Novel Variable Length Teager Energy Separation Based Instantaneous Frequency Features for Replay Detection,” in Interspeech, August 2017, pp. 12–16.
  • [6] K.R. Alluri, S. Achanta, S.R. Kadiri, S.V. Gangashetty, and A.K. Vuppala, ”SFF Anti-Spoofer: IIIT-H Submission for Automatic Speaker Verification Spoofing and Countermeasures Challenge 2017,” in Interspeech, August, 2017, pp. 107–111.
  • [7] ASVspoof 2019: the automatic speaker verification spoofing and countermeasures challenge. [Online]. Available: www.asvspoof.org
  • [8] K. Choi, G. Fazekas, K. Cho, and M. Sandler, ”A tutorial on deep learning for music information retrieval,” arXiv preprint arXiv:1709.04396, 2017.
  • [9] K. He, X. Zhang, S. Ren, and J. Sun, ”Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • [10] S.J. Pan and Q. Yang, ”A survey on transfer learning,” in IEEE Transactions on Knowledge and Data Engineering, 22(10), 2010, pp. 1345–1359.
  • [11] B.C. Moore, ”An introduction to the psychology of hearing,” Brill, 2012.
  • [12] A. Van den Oord, S. Dieleman, and B. Schrauwen, “Deep content-based music recommendation,” in Advances in Neural Information Processing Systems, 2013, pp. 2643-–2651.
  • [13] S. Dieleman, and B. Schrauwen, “End-to-end learning for music audio,” in Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, 2014, pp. 6964–-6968.
  • [14] J. Schlüter, and S. Bock, “Improved musical onset detection with convolutional neural networks,” in Acoustics,Speech and Signal Processing, IEEE International Conference on. IEEE, 2014.
  • [15] K. Ullrich, J. Schlüter, and T. Grill, “Boundary detection in music structure analysis using convolutional neural networks,” in Proceedings of the 15th International Society for Music Information Retrieval Conference (ISMIR 2014), Taipei, Taiwan, 2014.
  • [16] K. Choi, G. Fazekas, and M. Sandler, “Automatic tagging using deep convolutional neural networks,” in The 17th International Society of Music Information Retrieval Conference, New York, USA. ISMIR, 2016.
  • [17] K. Choi, G. Fazekas, and M. Sandler, ”Automatic tagging using deep convolutional neural networks,” arXiv preprint arXiv:1606.00298, 2016.
  • [18] https://github.com/CVxTz/audio_classification
  • [19] K. Simonyan and A. Zisserman, ”Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
  • [20] ASVspoof 2019: the automatic speaker verification spoofing and countermeasures challenge evaluation plan. [Online]. Available: http://www.asvspoof.org/asvspoof2019/asvspoof2019evaluation-plan.pdf
  • [21] T. Kinnunen, K. Lee, H. Delgado, N. Evans, M. Todisco, M. Sahidullah, J. Yamagishi, and D.A. Reynolds, ”t-DCF: a detection cost function for the tandem assessment of spoofing countermeasures and automatic speaker verification,” in Proc. Odyssey, Les Sables d'Olonne, France, June 2018.