Adaptive re-calibration of channel-wise features for Adversarial Audio Classification
Abstract
DeepFake audio, unlike DeepFake images and videos, has received comparatively little study, and the existing solutions for synthetic speech classification either rely on complex networks or do not generalize to different kinds of synthetic speech. Through this work, we perform a comparative analysis of different proposed models for synthetic speech detection, including End-to-End and ResNet-based models, against synthetic speech generated using text-to-speech and vocoder systems such as WaveNet [8], WaveRNN [4], Tacotron, and WaveGlow. We also experimented with Squeeze Excitation (SE) blocks in our ResNet models and found that the combination improved performance. In addition to the analysis, we propose a combination of Linear frequency cepstral coefficients (LFCC) and Mel frequency cepstral coefficients (MFCC) using the attentional feature fusion technique to create richer input features that help even simpler models generalize well on synthetic speech classification tasks. Our best models (ResNet based, using feature fusion) were trained on the Fake or Real (FoR) dataset and achieved 95% test accuracy on the FoR data, and an average of 90% accuracy on samples we generated using different generative models.

I Introduction
Speech synthesis and spoofing attacks have become prevalent in recent years because of the development of generative models that can synthesize speech of such quality that humans cannot distinguish it from real speech. Spoofing attacks include either replay attacks, where the speaker’s voice is recorded and used in a different context, or generated speech attacks, where a text-to-speech or voice conversion system generates new voice samples. In this work, we focus on generated speech attacks from both text-to-speech and voice conversion systems because this area is changing rapidly with newer neural network-based generative models. We compare the performance of different existing and proposed models against a variety of generated speech samples created using WaveNet [8], WaveRNN [4], Tacotron & WaveGlow [10], and FastSpeech [7].
The speaker verification community has been able to come up with innovative models to tackle the problem described above. The majority of models in the synthetic speech detection domain fall under two categories: traditional models and End-to-End systems [Fig 1]. Traditional systems tackle the problem in two phases: feature extraction followed by building a classifier on the extracted features. End-to-End systems skip the feature extraction phase and build models that take raw audio samples as input and output a classification result. Traditional models with carefully curated features have been shown to produce promising results in this domain, but the newer End-to-End systems are not far behind and are able to achieve similar performance without much focus on features. End-to-End systems are a recent solution to the problem and had issues with noisy audio during our initial analysis. Another major issue with existing systems is their ability to generalize to speech synthesized by newer generative models. We therefore adopted a traditional architecture using ResNets and explored the impact of feature fusion on features extracted from speech.
For audio classification tasks, features like the Mel-spectrogram, MFCC, LFCC, CQT, and CQCC have been shown to produce good results. However, LFCC and MFCC are the features we focused on for building the classifier. MFCC has been used in a variety of speech recognition applications and is effective at capturing the lower-frequency coefficients of human speech. LFCC, on the other hand, was added to introduce a measure of the higher-frequency coefficients that may be prevalent in synthesized speech. We focused on fusing LFCC and MFCC and settled on the Attentional Feature Fusion (AFF) technique [1]. AFF has shown promising results in fusing and scaling features of inconsistent semantics. We built a total of 8 models using 3 combinations of features: only LFCC, only MFCC, and a combination of LFCC and MFCC. Across these 8 models, we experimented with 2 different ResNet structures, ResNet34 and ResNet50, and compared the results with pre-trained versions of ResNets. We also introduced Squeeze Excitation blocks into our model architecture and found that they improved the performance of our models by enhancing the inter-channel dependencies for our binary classification task.
For the problem, we used the Fake or Real normalized (FoR-norm) dataset, which has an even distribution of samples between genders (male and female) and classes (fake and natural). This dataset was primarily used for training our models. We evaluated the trained models on the FoR dataset’s test samples and also against all of the generated audio samples. To augment the test set, we also verified the performance of our models against the ASVSpoof 2019 dataset [9].
Our primary contribution through this work is the introduction of the attentional feature fusion block, which has been effective in the image domain, into the speech domain to leverage the combination of different extracted features. The paper is organized as follows: Section 2 discusses the background of feature engineering, feature fusion, and existing model architectures. Section 3 focuses on our experimental setup, baselines, and proposed models. In Section 4, we discuss the results from our testing and compare the different models we experimented with. Section 5 concludes the paper.
The code for this work can be found at https://github-dev.cs.illinois.edu/athimma2/deepfake-audio-classifier


II Background
II-A Feature Engineering
Feature engineering is an essential component of learning algorithms, and the performance of ML models depends heavily on how we represent the feature vector. As a result, significant time and effort is spent designing preprocessing pipelines and data transformations. In the audio domain, audio usually exists in the form of digital files with .wav, .mp3, .wma, .aac, .flac, etc. as the common formats. The major audio features extracted from them are either timbral texture features or rhythmic content features. In this work our focus has been only on timbral texture features, specifically MFCC and LFCC. A common practice when modelling deep learning frameworks in the audio domain is to convert the audio into a spectrogram, a concise snapshot of an audio wave that has undergone a Fourier transform. Mel-frequency cepstral coefficients (MFCC) are a cepstral representation of the audio that has been widely used in automatic speaker recognition and vocoder systems. Introduced in the 1980s, they have remained the state of the art ever since, as they have proven robust for training deep learning algorithms on high-level audio tasks. Linear-frequency cepstral coefficients (LFCC) are an alternative to MFCC that has also been used as a go-to feature for training models. The difference between LFCC and MFCC lies in the filter banks used for transforming the audio: MFCC uses a Mel filter bank, whereas LFCC uses a linear filter bank. Several studies have shown the two features to be comparable, although MFCC is still dominantly used in both speaker and speech recognition.
II-B Feature Fusion
Often, a single feature representation obtained from the data is insufficient to convey the necessary details of the underlying distribution of the natural process; thus a common approach in feature engineering is to fuse together features obtained from a single data source through different methods. This combination can be a simple concatenation or an informed mathematical combination function. In the image domain, fusing features obtained from different layers is not a novel idea; image pyramids are one such example. However, when combining such cross-layer features, a common problem is scale inconsistency. In our framework we decided not to rely on simple concatenation or summation of features, but rather to include a scale-aware fusion mechanism recently introduced in the image domain. Attentional feature fusion (AFF) [1] [Fig 3] is a trainable mechanism that can fuse features obtained across long and short skip connections without running into issues of scale inconsistency. The attention module MS-CAM [Fig 2] in AFF aggregates contextual information along the channel dimension from different receptive fields, which makes the fusion process more dynamic and informed.
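For reference, the sketch below shows one way MS-CAM and AFF can be implemented in PyTorch, following the structure described in [1]; the channel reduction ratio r and layer widths are illustrative choices rather than values taken from our experiments.

```python
# A minimal PyTorch sketch of MS-CAM and AFF, following Dai et al. [1].
import torch
import torch.nn as nn

class MSCAM(nn.Module):
    """Multi-scale channel attention: a global (pooled) branch plus a local branch."""
    def __init__(self, channels, r=4):
        super().__init__()
        hidden = max(channels // r, 1)
        self.local_att = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1),
            nn.BatchNorm2d(hidden),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, kernel_size=1),
            nn.BatchNorm2d(channels),
        )
        self.global_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),            # squeeze spatial dimensions
            nn.Conv2d(channels, hidden, kernel_size=1),
            nn.BatchNorm2d(hidden),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, kernel_size=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        w = self.local_att(x) + self.global_att(x)  # aggregate local and global context
        return torch.sigmoid(w)                     # per-channel attention weights

class AFF(nn.Module):
    """Attentional feature fusion of two same-sized feature maps x and y."""
    def __init__(self, channels, r=4):
        super().__init__()
        self.mscam = MSCAM(channels, r)

    def forward(self, x, y):
        w = self.mscam(x + y)          # attention computed on the initial fusion
        return w * x + (1.0 - w) * y   # soft selection between the two inputs
```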
II-C Squeeze Excitation
A common theme across deep learning research has been obtaining powerful representations from the data that capture the properties most useful for the task at hand. The Squeeze Excitation (SE) block [2] [Fig 4] is an architectural unit introduced specifically for this purpose. Instead of combining spatial correlations, the SE block focuses on modelling the interdependencies across the channels of the features obtained from CNNs. As a result of this modelling and re-calibration, the more useful features are emphasized and the less useful ones are suppressed. From Fig 4 we can see the structure of the block, which can be understood as two separate operations: the squeeze operation and the excitation operation. The squeeze operation, as the name suggests, collects and squeezes together all the feature maps to generate a descriptor. The excitation operation then takes this descriptor embedding and generates a weight for each channel. Once we obtain the weights, they are applied to the input feature map to generate the output of the block.
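As a concrete illustration, a minimal PyTorch sketch of an SE block is shown below; the reduction ratio r = 16 follows the original paper [2] but is a tunable choice here.

```python
# A minimal sketch of a Squeeze-and-Excitation block [2].
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, r=16):
        super().__init__()
        hidden = max(channels // r, 1)
        self.squeeze = nn.AdaptiveAvgPool2d(1)      # squeeze: one descriptor per channel
        self.excite = nn.Sequential(                # excitation: per-channel weights
            nn.Linear(channels, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.squeeze(x).view(b, c)              # (B, C)
        w = self.excite(w).view(b, c, 1, 1)         # (B, C, 1, 1)
        return x * w                                # re-calibrate channel responses
```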

II-D Existing Approaches
From an extensive literature study on identifying spoofed audio and classifying deep fakes, we concluded that most existing approaches in this area either rely on classical ML algorithms that depend on probabilistic modeling or on exploiting the temporal information of the audio, or they use a wide range of complex deep learning models with millions of parameters and gating operations that are significantly bulkier. Moreover, due to the rapid advances in generative algorithms that can create indiscernible synthetic audio, the performance of these methods needs to be tested against these newer synthesis algorithms.
One major baseline for our work was the End2End model [3] [Fig 5], specifically the Time-domain Synthetic Speech Detection Net (TSSDNet) based on an Inception-style CNN. The underlying theme of this work is to show that no hand-crafted features are needed for detecting synthetic speech; rather, a model can be trained directly on raw audio inputs to separate natural samples from synthetic ones. The Inception-style TSSDNet relies solely on identifying the forensic artifacts formed during the synthesizing process. Since such artifacts have a greater chance of being observed in the features of the upper layers, the model is kept shallow. This work tested its framework on the ASVspoof dataset [9] and demonstrated that TSSDNet significantly reduces the cross-dataset Equal Error Rate. Another thrust at this problem came with the Res2Net architecture [5], which utilizes residual connections across different channel groups split from the input feature map. The underlying idea of this approach is to create feature fields at multiple scales. This work experimented with several features derived from the ASV dataset [9] and found CQT to be the most informative feature set.
III Experimental Setup
III-A DataSets
In this section, we describe the two datasets used in this work: the FoR dataset [6], used for the training and testing phases, and the ASVSpoof2019 Logical Access (LA) dataset [9], used to evaluate the effectiveness and generalization power of the fake speech detection models. We later introduce the metrics used to evaluate performance.
III-A1 FoR-norm Dataset
The Fake or Real (FoR) dataset [6] is proposed for studies in speech synthesis and synthetic speech detection. The FoR database contains speech clips from state-of-the-art speech synthesis algorithms, i.e. speech clips with naturalness similar to real human speech. The fake speeches are synthesized by the latest methodologies, both open source and commercial, such as: DeepVoice, Amazon AWS Polly, Baidu TTS, Google TTS, and Microsoft Azure TTS. In addition, the FoR dataset contains a large number of speech clips that are enough to train complex models without overfitting.
To eliminate bias for machine learning experiments, the FoR database is processed into four different versions: FoR-original, FoR-norm, FoR-2seconds, and FoR-rerecorded. The FoR-norm dataset is processed by balancing the data to achieve an even distribution between genders (male and female) and classes (fake and natural). Due to the balancing process, the FoR-norm version of FoR dataset [6] contains a total of 69,400 speech clips. To eliminate strange and unusual speeches, we use the FoR-norm version of the FoR database. The entire FoR-norm dataset is partitioned into training and testing sets. More details about the speeches in the FoR-norm dataset are shown in [Table-1] .
The main reason we used the FoR dataset [6] is that the systems used to generate its fake speech samples are different from the ones we used to collect the evaluation data (described in the next section). So, if our models can also correctly predict the true labels of these collected samples, we can say that our models generalize and can distinguish samples from most speech synthesizers.
Table-1: FoR-norm subsets used for training and testing (number of speech clips).

Subset | Real Speech | Fake Speech
---|---|---
Training | 16000 | 16000
Testing | 4000 | 4000
III-A2 Synthetic Speech Dataset Collection
To evaluate the generalization power of our models, we generated fake speech samples using different text-to-speech and vocoder systems. We used four such systems: 1. WaveNet [8], 2. WaveRNN [4], 3. FastSpeech [7], and 4. WaveGlow & Tacotron [10]; the generated samples are used in the evaluation phase of the project.
III-A3 Cross-Dataset Evaluation
Apart from generating fake speech samples from different sources, we also collected samples from another dataset. Given two labeled datasets that target fake speech detection, cross-dataset evaluation aims to detect fake speech generated by completely unseen synthesis technologies. To explore how well a detection model trained on one dataset generalizes to another, we trained the models on the FoR dataset [6] and evaluated their performance on the ASVSpoof2019 LA dataset [9].
To summarize, the training and testing stages use only the FoR-norm dataset. The evaluation stage is performed on 80 samples collected from the LA dataset and from the different speech synthesizers.
III-B Front-end Feature Engineering
The proposed models in our project follow the traditional architecture of front-end feature engineering and back-end classification [Fig 5]. The front-end acoustic features are critical for efficiently detecting fake speech. Two simple and commonly used acoustic features, 1. Linear frequency cepstral coefficients (LFCC) and 2. Mel frequency cepstral coefficients (MFCC), are evaluated in our experiments. All raw audio files are converted into multiple 1-second samples, and for each second we generate MFCC frames (20 x 44) with hop length = 512 and FFT size = 2048 using the librosa library, and LFCC frames (136 x 13) with window hop = 0.01 and FFT size = 512 using the spafe library.
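A sketch of this per-second extraction step is shown below. The librosa call follows the parameters above; the spafe call and its argument names match older releases of that library and may differ between versions, and the exact frame counts reported above depend on the sampling rate and framing settings.

```python
# Sketch of the per-second MFCC/LFCC extraction described above.
# Assumes 16 kHz mono input; frame counts vary with sampling rate and framing.
import librosa
import numpy as np
from spafe.features.lfcc import lfcc   # argument names per older spafe releases

def extract_features(path, sr=16000):
    y, _ = librosa.load(path, sr=sr, mono=True)
    # split the recording into non-overlapping 1-second chunks
    chunks = [y[i:i + sr] for i in range(0, len(y) - sr + 1, sr)]
    mfccs, lfccs = [], []
    for chunk in chunks:
        m = librosa.feature.mfcc(y=chunk, sr=sr, n_mfcc=20,
                                 n_fft=2048, hop_length=512)    # (20, n_frames)
        l = lfcc(sig=chunk, fs=sr, num_ceps=13,
                 win_hop=0.01, nfft=512)                        # (n_frames, 13)
        mfccs.append(m)
        lfccs.append(np.asarray(l))
    return mfccs, lfccs
```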
LFCC and MFCC are largely identical in terms of the coefficient extraction scheme. The only difference is in the filter bank: the LFCC filter bank covers all speech frequency ranges equally and considers them of equal importance, while the MFCC filter bank emphasizes the low-frequency ranges. The advantage of using LFCC is therefore that it improves the distinguishability of speaker-specific characteristics available in the higher frequency zone. How we use this advantage in training our features is described in the next section.
III-C Feature Fusion Strategy
The acoustic features generated above are fused together and passed through our classifiers for prediction. For fusion, we used the Attentional Feature Fusion (AFF) technique [1]. To pass through the AFF block, the features must be of the same size, but the MFCC and LFCC samples obtained above are of different sizes: MFCC (20 x 44) and LFCC (136 x 13). So, we padded both sample types with zeros on both ends to create a common sample size of (136 x 44). The padded MFCC and LFCC data are each passed through a single 2D convolutional layer with a kernel of size 2x2 and stride 1 to extract features from them. After obtaining the convolved features of both MFCC and LFCC, we fuse them by passing them through the AFF block [Fig 3] described in the background section.
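The sketch below illustrates this fusion front end, reusing the AFF module sketched in Section II-B. The number of convolution channels (16 here) is an illustrative assumption; the text above does not fix it.

```python
# A sketch of the fusion front end: zero-pad both feature maps to (136 x 44),
# apply one 2x2 convolution each, and fuse with the AFF block sketched earlier.
import torch.nn as nn
import torch.nn.functional as F

class FusionFrontEnd(nn.Module):
    def __init__(self, channels=16):
        super().__init__()
        self.conv_mfcc = nn.Conv2d(1, channels, kernel_size=2, stride=1)
        self.conv_lfcc = nn.Conv2d(1, channels, kernel_size=2, stride=1)
        self.aff = AFF(channels)                 # AFF module from the Section II-B sketch

    @staticmethod
    def _pad_to(x, height=136, width=44):
        """Zero-pad a (B, 1, H, W) tensor symmetrically to (B, 1, height, width)."""
        dh, dw = height - x.size(2), width - x.size(3)
        return F.pad(x, (dw // 2, dw - dw // 2, dh // 2, dh - dh // 2))

    def forward(self, mfcc, lfcc):
        # mfcc: (B, 1, 20, 44), lfcc: (B, 1, 136, 13)
        m = self.conv_mfcc(self._pad_to(mfcc))   # (B, C, 135, 43)
        l = self.conv_lfcc(self._pad_to(lfcc))   # (B, C, 135, 43)
        return self.aff(m, l)                    # fused (B, C, 135, 43) feature map
```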

III-D Back-end Classification
Baseline Models
As discussed in the background, there are different architectures for fake speech detection: traditional approaches that follow front-end feature engineering and back-end classification, and the more recent End-to-End neural network classification. To define a starting accuracy that we aim to improve upon, we used the classical machine learning algorithms Random Forest and Multi-layer Perceptron as our baseline models. We also compared our models’ performance with the existing End-to-End model [3] to see whether we achieved better performance than this Inception-style network.
Random Forest:
A random forest is a meta-estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve predictive accuracy and control over-fitting. We used 100 decision trees with a bootstrapping strategy to draw the subsets of samples.
Multi-layer Perceptron:
A Multi-layer Perceptron (MLP) is a supervised learning algorithm that learns a non-linear function $f(\cdot): \mathbb{R}^m \rightarrow \mathbb{R}^o$ by training on a dataset, where m is the number of input dimensions and o is the number of output dimensions. We used a stochastic gradient descent optimizer to minimize the log-loss function. Our model has 2 hidden layers, with 5 and 2 hidden units respectively.
The above two baseline classifiers are trained and tested using individual MFCC and LFCC features that are obtained from the same subsets of the dataset defined in Table-1.
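A minimal scikit-learn sketch of these two baselines under the settings stated above is given below; the feature matrices are assumed to hold one flattened MFCC or LFCC sample per row, and max_iter is an illustrative choice.

```python
# Sketch of the RF and MLP baselines: 100 bootstrapped trees, and an MLP with
# hidden layers of 5 and 2 units trained with SGD on the log-loss.
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

def run_baselines(X_train, y_train, X_test, y_test):
    rf = RandomForestClassifier(n_estimators=100, bootstrap=True)
    mlp = MLPClassifier(hidden_layer_sizes=(5, 2), solver="sgd", max_iter=500)
    results = {}
    for name, clf in [("RF", rf), ("MLP", mlp)]:
        clf.fit(X_train, y_train)
        scores = clf.predict_proba(X_test)[:, 1]   # probability of the fake class
        results[name] = {
            "acc": accuracy_score(y_test, clf.predict(X_test)),
            "auc": roc_auc_score(y_test, scores),
        }
    return results
```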
End-to-End model:
The end-to-end model defined in [3] is named the Time-domain Synthetic Speech Detection Net (TSSDNet) since it operates on raw audio waveforms in the time domain. This TSSDNet uses an advanced CNN structure with Inception-style parallel convolutions and is therefore called Inc-TSSDNet. We obtained a pre-trained TSSDNet from [add reference] and tested it on the evaluation subset from Table-1. We did not achieve good results with this model since it was trained on the ASVSpoof2019 LA dataset [9]. Therefore, the end-to-end model was not able to generalize and failed to predict the samples from different synthesizers.
The comprehensive results of all the baseline models are detailed in the Results section.
Proposed Models
We formed a total of 8 models using combinations of Squeeze Excitation blocks, ResNet blocks, and front-end acoustic features.
Training Strategy:
We first experimented with the following 4 models and then developed 4 more models to fix the issues we observed with the first 4.
• SEResNet50 + MFCC
• SEResNet50 + LFCC
• SEResNet50 + AFF
• SEResNet34 + AFF
The first two models use the individual extracted features as input to the SEResNet50 network. The third model takes as input the MFCC and LFCC features fused with Attentional Feature Fusion and feeds them to the same SEResNet50 model. While training this model, we observed that it was overfitting and failed to perform on the test data, meaning the model was not able to generalize across the features of different synthesizers. This was also the case with the SEResNet34 model. To resolve this issue, we transformed the dataset by masking samples [Fig 8].
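For concreteness, the sketch below shows one way an SEResNet34 + AFF model could be assembled from the fusion front end (Section III-C sketch) and SE blocks (Section II-C sketch) around a torchvision ResNet-34 backbone with a 2-way classifier head. The widened first convolution and the placement of SE blocks after each residual stage are illustrative assumptions, not an exact reproduction of the trained models.

```python
# High-level sketch of an SEResNet34 + AFF classifier under the stated assumptions.
import torch.nn as nn
from torchvision.models import resnet34

class SEResNet34AFF(nn.Module):
    def __init__(self, channels=16, num_classes=2):
        super().__init__()
        self.front_end = FusionFrontEnd(channels)        # Section III-C sketch
        backbone = resnet34(weights=None)
        # accept the fused feature map instead of a 3-channel image
        backbone.conv1 = nn.Conv2d(channels, 64, kernel_size=7,
                                   stride=2, padding=3, bias=False)
        backbone.fc = nn.Linear(backbone.fc.in_features, num_classes)
        # channel re-calibration after each residual stage (assumed wiring)
        self.se = nn.ModuleList([SEBlock(c) for c in (64, 128, 256, 512)])
        self.backbone = backbone

    def forward(self, mfcc, lfcc):
        x = self.front_end(mfcc, lfcc)
        b = self.backbone
        x = b.maxpool(b.relu(b.bn1(b.conv1(x))))
        for layer, se in zip([b.layer1, b.layer2, b.layer3, b.layer4], self.se):
            x = se(layer(x))
        x = b.avgpool(x).flatten(1)
        return b.fc(x)                                   # 2-way logits (real vs fake)
```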



Data Transformation:
The data samples are transformed into masked data using time masking and frequency masking techniques [Fig 8]. These techniques select a randomly chosen range of frequencies and a randomly chosen range of time frames, and replace the selected bins with their respective mean values. This range is set to 7% in our models. By this process, we hide a few characteristics of the features and thereby generalize our original data. Using this data transformation, we created 4 more models, which improved the accuracy (a sketch of the masking transform is given after the list below).
• SEResNet50 + Masked-MFCC
• SEResNet50 + Masked-LFCC
• SEResNet50 + Masked-AFF
• SEResNet34 + Masked-AFF
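The following is a minimal sketch of the masking transform described above, assuming it operates on a 2-D feature matrix (frequency bins x time frames) and that one contiguous band covering roughly 7% of each axis is masked; the exact choice of band positions and of the mean value used as fill is our reading of the procedure.

```python
# Sketch of time/frequency masking with mean-value fill, under the assumptions above.
import numpy as np

def mask_features(feat, ratio=0.07, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    out = feat.copy()
    n_freq, n_time = out.shape
    f_width = max(int(ratio * n_freq), 1)
    t_width = max(int(ratio * n_time), 1)
    f0 = rng.integers(0, n_freq - f_width + 1)   # start of the frequency band
    t0 = rng.integers(0, n_time - t_width + 1)   # start of the time band
    out[f0:f0 + f_width, :] = out.mean()         # frequency masking
    out[:, t0:t0 + t_width] = out.mean()         # time masking
    return out
```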
Apart from the aforementioned 8 models, we also used a pure ResNet model (without SE blocks) pre-trained on an image dataset to see its performance on the audio data. The evaluation metrics and the results of these models are described in the following sections.
Table-2: Performance of the baseline models (RF and MLP with MFCC and LFCC features) and of the pre-trained end-to-end TSSDNet.

Model | Test Set | MFCC Acc (%) | MFCC AUC | MFCC EER | LFCC Acc (%) | LFCC AUC | LFCC EER
---|---|---|---|---|---|---|---
RF | FoR (real + fake) | 69.05 | 0.785 | 0.309 | 61.175 | 0.682 | 0.374
MLP | FoR (real + fake) | 50.00 | 0.5 | 0.5 | 57.8 | 0.596 | 0.428
RF | LA (real + fake) | 45.83 | 0.466 | 0.563 | 52.5 | 0.547 | 0.481
MLP | LA (real + fake) | 46.67 | 0.5 | 0.5 | 55.833 | 0.564 | 0.461

End-to-End Model | Test Set | Acc (%) | AUC | EER
---|---|---|---|---
TSSDNet | FoR (real + fake) | 64.5 | 0.588 | 0.375
TSSDNet | LA (real + fake) | 47.5 | 0.399 | 0.607
IV Evaluation Metrics
To evaluate the performance of the different proposed models on the datasets, we compute the two most commonly used metrics for the evaluation of fake speech detection systems.
• Equal error rate (EER) is the first metric used for evaluating the fake speech detection methods. The EER is determined by the point at which the false acceptance rate and the false rejection rate are equal.
• To view the performance of the detectors from another perspective, we also use the receiver operating characteristic (ROC) curve as a secondary metric. The ROC curve illustrates the relationship between the false-positive rate and the true-positive rate; we draw the ROC curve and compute the area under it (AUC) for the detection results. A fake speech detection system performs close to chance, and is therefore highly vulnerable to fake speech, when its ROC curve approaches the diagonal line, and performs better the further the curve moves away from it. A short sketch of how both metrics are computed from model scores follows this list.
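Below is a minimal sketch of computing EER and AUC from detection scores with scikit-learn; `scores` are assumed to be the predicted probabilities of the fake class and `labels` the ground truth (1 = fake).

```python
# EER read off the ROC curve where FPR ~= FNR, plus ROC-AUC.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

def eer_and_auc(labels, scores):
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))        # threshold where FPR and FNR cross
    eer = (fpr[idx] + fnr[idx]) / 2.0
    return eer, roc_auc_score(labels, scores)
```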
Table-3: Performance of the proposed models.

ResNet Model | Acc (%) | EER | AUC
---|---|---|---
MFCC + SEResNet50 | 61.5 | 0.385 | 0.615
LFCC + SEResNet50 | 50.06 | 0.499 | 0.500
Masked MFCC + SEResNet50 | 66.36 | 0.337 | 0.663
Masked LFCC + SEResNet50 | 50.05 | 0.5 | 0.5
AFF + SEResNet50 | 71.05 | 0.186 | 0.814
AFF + SEResNet34 | 95.01 | 0.049 | 0.950
Masked AFF + SEResNet50 | 79.02 | 0.189 | 0.811
Masked AFF + SEResNet34 | 90.09 | 0.098 | 0.902
Pretrained Image ResNet50 | 49 | - | -
V Results
The baseline models RF and MLP were trained on the FoR training subset and tested on the FoR test subset as well as on the LA dataset to assess cross-dataset performance. The pre-trained Inc-TSSDNet model was trained on the ASVSpoof2019 LA dataset [9], so it was tested using only the FoR test subset. The accuracies and ROC-AUC values are given in Table-2. The performance results of the proposed models in Table-3 show that the best model is AFF + SEResNet34 with 95% accuracy.
The simple machine learning classifiers RF and MLP did not perform well on either the FoR or LA test sets, so they certainly cannot detect samples from synthetic speech generative models. When we carried out experiments with the end-to-end TSSDNet on these synthesizers, it performed comparatively well on some generative models. So, the comparison of generalizing capacity is made between the end-to-end model and the best of the proposed models. They are evaluated using the samples obtained from different synthesizers such as WaveNet [8], WaveRNN [4], etc., and the results are shown in Table-4. We can observe from Table-4 that SEResNet50 + Masked AFF shows better generalizability compared to the other models.
Table-4: Accuracy (%) on samples from different speech synthesizers.

Speech Synthesizer | Inc-TSSDNet (end-to-end) | SEResNet34 + AFF | SEResNet50 + Masked AFF
---|---|---|---
WaveNet | 100 | 100 | 100
WaveRNN | 100 | 100 | 100
FastSpeech | 100 | 80 | 92.31
Tacotron & WaveGlow | 88.8 | 78.05 | 94.54
LA (fake) | 35 | 87.5 | 98.44
VI Conclusion
Through this work we experimented with feature fusion and the application of squeeze excitation in the synthetic speech classification domain. Our results indicate that the models that utilized Attentional Feature Fusion performed well compared to models trained on a single feature. The EER is also significantly reduced compared to the baseline models, including more complex networks such as the end-to-end models. Our experiments on the generalizability of these models revealed that ResNets + AFF can also generalize to unseen datasets.
References
- [1] Dai, Y., Gieseke, F., Oehmcke, S., Wu, Y., and Barnard, K. Attentional feature fusion. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (2021), pp. 3560–3569.
- [2] Hu, J., Shen, L., and Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (2018), pp. 7132–7141.
- [3] Hua, G., Teoh, A. B. J., and Zhang, H. Towards end-to-end synthetic speech detection. IEEE Signal Processing Letters (2021).
- [4] Kalchbrenner, N., Elsen, E., Simonyan, K., Noury, S., Casagrande, N., Lockhart, E., Stimberg, F., van den Oord, A., Dieleman, S., and Kavukcuoglu, K. Efficient neural audio synthesis. CoRR abs/1802.08435 (2018).
- [5] Li, X., Li, N., Weng, C., Liu, X., Su, D., Yu, D., and Meng, H. Replay and synthetic speech detection with res2net architecture, 2021.
- [6] Reimao, R., and Tzerpos, V. For: A dataset for synthetic speech detection. In 2019 International Conference on Speech Technology and Human-Computer Dialogue (SpeD) (2019), IEEE, pp. 1–10.
- [7] Ren, Y., Ruan, Y., Tan, X., Qin, T., Zhao, S., Zhao, Z., and Liu, T. Fastspeech: Fast, robust and controllable text to speech. CoRR abs/1905.09263 (2019).
- [8] van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A. W., and Kavukcuoglu, K. Wavenet: A generative model for raw audio. CoRR abs/1609.03499 (2016).
- [9] Wang, X., Yamagishi, J., Todisco, M., Delgado, H., Nautsch, A., Evans, N., Sahidullah, M., Vestman, V., Kinnunen, T., Lee, K. A., et al. Asvspoof 2019: A large-scale public database of synthesized, converted and replayed speech. Computer Speech & Language 64 (2020), 101114.
- [10] Wang, Y., Skerry-Ryan, R. J., Stanton, D., Wu, Y., Weiss, R. J., Jaitly, N., Yang, Z., Xiao, Y., Chen, Z., Bengio, S., Le, Q. V., Agiomyrgiannakis, Y., Clark, R., and Saurous, R. A. Tacotron: A fully end-to-end text-to-speech synthesis model. CoRR abs/1703.10135 (2017).